refactor: chunk PDF rendering for OCR and extraction #4321

Open
PastelStorm wants to merge 4 commits into main from evoss/pdf-rendering-refactor

PastelStorm commented Apr 5, 2026

Prepare PDF image rendering for chunked and process-isolated execution while preserving public helper compatibility and adding focused OCR/image extraction regressions.

Made-with: Cursor


Note

Medium Risk
Touches PDF OCR and image-extraction rendering paths, changing pagination/chunking behavior and adding stricter validation, which could surface new edge cases in page counting or file-like inputs.

Overview
Prepares PDF rendering to be process-isolation friendly by resolving the PDFium renderer at call time (avoiding stale import-time aliases) and keeps convert_pdf_to_image compatible while adding first_page/last_page support.
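To illustrate why call-time resolution matters, here is a minimal sketch. It does not use the PR's actual code: `backend`, `render_with_alias`, and `render_at_call_time` are hypothetical names, and the "renderer" is a plain lambda standing in for the PDFium-backed function.

```python
import types

# Hypothetical stand-in for a module that exposes a renderer function.
backend = types.SimpleNamespace(render=lambda path: f"v1:{path}")

# Import-time alias: the function object is captured once and never refreshed.
_render_alias = backend.render


def render_with_alias(path):
    # Always calls whatever backend.render was when this module was loaded.
    return _render_alias(path)


def render_at_call_time(path):
    # Looks the renderer up on every call, so a replaced or re-initialised
    # backend (for example inside a worker process, or under a test patch)
    # is picked up.
    return backend.render(path)


# Simulate the backend being swapped after import.
backend.render = lambda path: f"v2:{path}"

print(render_with_alias("a.pdf"))    # v1:a.pdf (stale alias)
print(render_at_call_time("a.pdf"))  # v2:a.pdf
```

The same reasoning applies to monkeypatching in tests: a patch applied to the backend attribute is only visible to callers that resolve the attribute when they run.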

Updates OCR and image/table extraction to render PDFs in configurable page chunks (PDFIUM_CHUNK_SIZE, default 8) and to rasterize only the page ranges actually needed; adds guardrails for invalid chunk-size values (warn + fallback) and for page-count/layout mismatches.
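A rough sketch of the chunking and the warn-and-fallback guardrail described above. It assumes `PDFIUM_CHUNK_SIZE` is read from the environment and that page ranges are 1-based and inclusive; neither is confirmed by this description, and `resolve_chunk_size` and `iter_page_chunks` are hypothetical helper names.

```python
import logging
import os

logger = logging.getLogger(__name__)

DEFAULT_CHUNK_SIZE = 8  # default stated in the PR description


def resolve_chunk_size(env=None):
    """Return a usable chunk size, warning and falling back on bad values."""
    env = os.environ if env is None else env
    raw = env.get("PDFIUM_CHUNK_SIZE")  # assumption: configured via env var
    if raw is None:
        return DEFAULT_CHUNK_SIZE
    try:
        value = int(raw)
    except ValueError:
        value = 0  # treat non-numeric input like any other invalid value
    if value < 1:
        logger.warning(
            "Invalid PDFIUM_CHUNK_SIZE=%r; falling back to %d",
            raw,
            DEFAULT_CHUNK_SIZE,
        )
        return DEFAULT_CHUNK_SIZE
    return value


def iter_page_chunks(page_count, chunk_size):
    """Yield inclusive, 1-based (first_page, last_page) ranges."""
    for first in range(1, page_count + 1, chunk_size):
        yield first, min(first + chunk_size - 1, page_count)


print(list(iter_page_chunks(20, 8)))  # [(1, 8), (9, 16), (17, 20)]
```

Each `(first_page, last_page)` pair could then be passed to a renderer that accepts those bounds, so that only one chunk of page images is held in memory at a time.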

Improves robustness for file-like inputs by rewinding/restoring stream positions across OCR/extraction stages, with new tests and updated expected ingest outputs reflecting minor OCR text diffs.
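One way to rewind and restore a stream around a processing stage is a small context manager. This is an illustrative sketch only; `rewound` is a hypothetical name and the PR may implement the behaviour differently.

```python
import io
from contextlib import contextmanager


@contextmanager
def rewound(stream):
    """Seek to the start for the duration of the block, then restore."""
    original = stream.tell()
    stream.seek(0)
    try:
        yield stream
    finally:
        # Restore the caller's position even if the stage raised.
        stream.seek(original)


buf = io.BytesIO(b"%PDF-1.7 example bytes")
buf.seek(5)  # pretend an earlier stage left the cursor mid-stream

with rewound(buf) as s:
    head = s.read(4)  # the stage sees the stream from the beginning

print(head)        # b'%PDF'
print(buf.tell())  # 5
```

Restoring the position in `finally` means a later stage sees the stream where the caller left it, regardless of how much the intermediate stage consumed.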

Reviewed by Cursor Bugbot for commit 2498e9c.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 312fcbb.

Comment thread unstructured/partition/pdf_image/ocr.py Outdated
raise ValueError("OCR received an empty layout for a PDF.")
raise ValueError(
"OCR received an empty layout for a PDF."
)
Probe result ignored; both branches raise identical error

Medium Severity

When out_layout.pages is empty, the code renders a probe page to check if the PDF has renderable content, but then raises the exact same ValueError with the identical message regardless of the probe result. The if image_paths: branch raises "OCR received an empty layout for a PDF.", and the unconditional raise on the next line raises the same error for the empty-probe case. This makes the probe render wasted work and likely mishandles the case where both the PDF and layout are genuinely empty (zero-page PDF). Based on the test name test_process_file_with_ocr_raises_when_layout_is_empty_but_pdf_renders, the intent appears to be to only error when the PDF has pages but the layout doesn't; the fallthrough branch likely was meant to return an empty DocumentLayout instead of raising.
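A sketch of the branch structure this comment suggests, under the assumption that the zero-page case should yield an empty layout. `DocumentLayout` here is a minimal hypothetical stand-in, not the real class, and `handle_empty_layout` is an invented helper name; the actual code in `ocr.py` is not shown in full in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentLayout:
    # Hypothetical minimal stand-in for the real layout type.
    pages: list = field(default_factory=list)


def handle_empty_layout(probe_image_paths):
    """Decide what to do when the incoming layout has no pages.

    probe_image_paths: paths produced by rendering a probe page of the PDF
    (an empty list means the PDF itself rendered nothing).
    """
    if probe_image_paths:
        # The PDF has renderable pages but the layout is empty: a mismatch.
        raise ValueError("OCR received an empty layout for a PDF.")
    # Both the PDF and the layout are empty: nothing to OCR, no error.
    return DocumentLayout()


print(handle_empty_layout([]).pages)  # []
```

With this shape the probe result actually changes the outcome: only the mismatch case raises, and a genuinely empty PDF flows through as an empty result.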

@PastelStorm is this still intentional? It looks like the new empty layout probe still raises in both branches, which changes the zero-page PDF path from an empty partition result to a ValueError:

Before:
zero-page PDF -> empty OCR layout -> empty partition result

After:
zero-page PDF -> empty OCR layout -> ValueError
