refactor: chunk PDF rendering for OCR and extraction by PastelStorm · Pull Request #4321 · Unstructured-IO/unstructured

PastelStorm · 2026-04-05T21:49:31Z

Prepare PDF image rendering for chunked and process-isolated execution while preserving public helper compatibility and adding focused OCR/image extraction regressions.

Made-with: Cursor

Note

Medium Risk
Touches PDF OCR and image-extraction rendering paths, changing pagination/chunking behavior and adding stricter validation, which could surface new edge cases in page counting or file-like inputs.

Overview
Prepares PDF rendering to be process-isolation friendly by resolving the PDFium renderer at call time (avoiding stale import-time aliases) and keeps convert_pdf_to_image compatible while adding first_page/last_page support.
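To illustrate why call-time resolution matters, here is a minimal sketch of the difference between an import-time alias and a lookup made on each call. The names `renderer_module`, `render_stale` and `render_fresh` are hypothetical and are not taken from the PR; the real code resolves the PDFium renderer, which is only simulated here.

```python
import types

# Hypothetical stand-in for the module that exposes the renderer.
renderer_module = types.SimpleNamespace(render=lambda path: f"original:{path}")

# Import-time alias: binds the function object that exists at this moment.
_render_alias = renderer_module.render


def render_stale(path):
    # Keeps calling the object captured above, even if the module
    # attribute is later replaced (for example by a patch or a worker setup).
    return _render_alias(path)


def render_fresh(path):
    # Looks the renderer up on every call, so replacements are honoured.
    return renderer_module.render(path)


# Simulate the renderer being swapped after import.
renderer_module.render = lambda path: f"patched:{path}"
```

After the swap, `render_stale` still reaches the original function while `render_fresh` reaches the replacement, which is the stale-alias problem the overview refers to.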

Updates OCR and image/table extraction to render PDFs in configurable page chunks (PDFIUM_CHUNK_SIZE, default 8) and to rasterize only the page ranges actually needed; adds guardrails for invalid chunk-size values (warn + fallback) and for page-count/layout mismatches.
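The chunking and fallback behavior described above could look roughly like the following sketch. It assumes `PDFIUM_CHUNK_SIZE` is read from an environment variable of the same name and that page ranges are 1-based and inclusive, matching the `first_page`/`last_page` naming; `resolve_chunk_size` and `iter_page_chunks` are hypothetical helpers, not functions from the PR.

```python
import logging
import os

logger = logging.getLogger(__name__)

DEFAULT_CHUNK_SIZE = 8  # default stated in the PR overview


def resolve_chunk_size(env=os.environ):
    """Read PDFIUM_CHUNK_SIZE; warn and fall back to the default if invalid."""
    raw = env.get("PDFIUM_CHUNK_SIZE")
    if raw is None:
        return DEFAULT_CHUNK_SIZE
    try:
        value = int(raw)
    except ValueError:
        value = 0  # treated as invalid below
    if value < 1:
        logger.warning(
            "Invalid PDFIUM_CHUNK_SIZE %r; falling back to %d",
            raw,
            DEFAULT_CHUNK_SIZE,
        )
        return DEFAULT_CHUNK_SIZE
    return value


def iter_page_chunks(page_count, chunk_size):
    """Yield 1-based inclusive (first_page, last_page) ranges."""
    for first in range(1, page_count + 1, chunk_size):
        yield first, min(first + chunk_size - 1, page_count)


# A 20-page PDF with the default chunk size:
# list(iter_page_chunks(20, 8)) -> [(1, 8), (9, 16), (17, 20)]
```

Each yielded pair would then be passed as `first_page` and `last_page` to the rendering call, so only one chunk of page images is held at a time.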

Improves robustness for file-like inputs by rewinding/restoring stream positions across OCR/extraction stages, with new tests and updated expected ingest outputs reflecting minor OCR text diffs.
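One common way to rewind and restore a stream position is a small context manager, sketched below. The helper name `rewound` is hypothetical and the PR may implement this differently; the sketch assumes the input is seekable.

```python
import io
from contextlib import contextmanager


@contextmanager
def rewound(stream):
    """Rewind a seekable stream for the block, then restore its position."""
    original = stream.tell()
    stream.seek(0)
    try:
        yield stream
    finally:
        # Restore even if the stage raised, so later stages see the
        # position the caller left the stream in.
        stream.seek(original)


# Usage: a caller has already consumed part of the stream.
buf = io.BytesIO(b"%PDF-1.7 data")
buf.seek(5)
with rewound(buf) as f:
    data = f.read()  # reads from the start of the stream
```

After the `with` block, `buf.tell()` is back at 5, so an OCR stage and a later extraction stage can each read the whole file without disturbing one another.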

Reviewed by Cursor Bugbot for commit 2498e9c. Bugbot is set up for automated code reviews on this repo.


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 312fcbb.

cursor · 2026-04-06T00:33:15Z

+                        raise ValueError("OCR received an empty layout for a PDF.")
+                    raise ValueError(
+                        "OCR received an empty layout for a PDF."
+                    )


Probe result ignored; both branches raise identical error

Medium Severity

When out_layout.pages is empty, the code renders a probe page to check if the PDF has renderable content, but then raises the exact same ValueError with the identical message regardless of the probe result. The if image_paths: branch raises "OCR received an empty layout for a PDF.", and the unconditional raise on the next line raises the same error for the empty-probe case. This makes the probe render wasted work and likely mishandles the case where both the PDF and layout are genuinely empty (zero-page PDF). Based on the test name test_process_file_with_ocr_raises_when_layout_is_empty_but_pdf_renders, the intent appears to be to only error when the PDF has pages but the layout doesn't; the fallthrough branch likely was meant to return an empty DocumentLayout instead of raising.
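The behavior the review suggests was probably intended can be sketched as follows. This is an illustration of the reviewer's reading, not the PR's code: `DocumentLayout` here is a hypothetical stand-in dataclass, and `check_empty_layout` and `probe_image_paths` are names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentLayout:
    # Hypothetical stand-in for the library's layout class.
    pages: list = field(default_factory=list)


def check_empty_layout(out_layout, probe_image_paths):
    """Raise only when the PDF renders pages but the layout has none."""
    if out_layout.pages:
        return out_layout
    if probe_image_paths:
        # The probe rendered at least one page, so an empty layout
        # means pages were lost somewhere upstream.
        raise ValueError("OCR received an empty layout for a PDF.")
    # Nothing rendered either: treat as a genuinely empty (zero-page) PDF.
    return DocumentLayout()
```

With this shape the probe result changes the outcome: an empty layout for a renderable PDF raises, while a zero-page PDF yields an empty layout.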


@PastelStorm is this still intentional? It looks like the new empty layout probe still raises in both branches, which changes the zero-page PDF path from an empty partition result to a ValueError:

Before:
zero-page PDF -> empty OCR layout -> empty partition result

After:
zero-page PDF -> empty OCR layout -> ValueError

PastelStorm added 3 commits April 5, 2026 14:43

refactor: chunk PDF rendering for OCR and extraction

5e7d868

Prepare PDF image rendering for chunked and process-isolated execution while preserving public helper compatibility and adding focused OCR/image extraction regressions. Made-with: Cursor

lint

0259d55

fixes

312fcbb

cursor bot reviewed Apr 6, 2026


lint

2498e9c

PastelStorm wants to merge 4 commits into main from evoss/pdf-rendering-refactor


CyMule Apr 6, 2026


2 participants
