refactor: chunk PDF rendering for OCR and extraction #4321
PastelStorm wants to merge 4 commits into main
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 312fcbb.
    raise ValueError("OCR received an empty layout for a PDF.")
    raise ValueError(
        "OCR received an empty layout for a PDF."
    )
Probe result ignored; both branches raise identical error
Medium Severity
When out_layout.pages is empty, the code renders a probe page to check if the PDF has renderable content, but then raises the exact same ValueError with the identical message regardless of the probe result. The if image_paths: branch raises "OCR received an empty layout for a PDF.", and the unconditional raise on the next line raises the same error for the empty-probe case. This makes the probe render wasted work and likely mishandles the case where both the PDF and layout are genuinely empty (zero-page PDF). Based on the test name test_process_file_with_ocr_raises_when_layout_is_empty_but_pdf_renders, the intent appears to be to only error when the PDF has pages but the layout doesn't; the fallthrough branch likely was meant to return an empty DocumentLayout instead of raising.
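The behavior this comment guesses at can be sketched as follows. This is only an illustration of the suggested intent, not the PR's actual code: `resolve_empty_layout` and `probe_image_paths` are hypothetical names, and `DocumentLayout` here is a minimal stand-in for the real class.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentLayout:
    """Stand-in for the real layout type; only `pages` matters here."""

    pages: list = field(default_factory=list)


def resolve_empty_layout(
    out_layout: DocumentLayout, probe_image_paths: list
) -> DocumentLayout:
    """Use the probe result to tell a broken layout from an empty PDF.

    `probe_image_paths` is a hypothetical name for the paths produced by
    rendering a single probe page of the PDF.
    """
    if out_layout.pages:
        # Layout is populated, so no probe decision is needed.
        return out_layout
    if probe_image_paths:
        # The PDF renders at least one page but the layout has none:
        # this is the mismatch worth reporting.
        raise ValueError("OCR received an empty layout for a PDF.")
    # Nothing rendered either: treat it as a genuinely empty (zero-page) PDF.
    return DocumentLayout()
```

With this shape the probe result changes the outcome: a zero-page PDF yields an empty layout, while a renderable PDF with an empty layout still raises.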
@PastelStorm is this still intentional? It looks like the new empty layout probe still raises in both branches, which changes the zero-page PDF path from an empty partition result to a ValueError:
Before:
zero-page PDF -> empty OCR layout -> empty partition result
After:
zero-page PDF -> empty OCR layout -> ValueError


Prepare PDF image rendering for chunked and process-isolated execution while preserving public helper compatibility and adding focused OCR/image extraction regressions.
Made-with: Cursor
Note
Medium Risk
Touches PDF OCR and image-extraction rendering paths, changing pagination/chunking behavior and adding stricter validation, which could surface new edge cases in page counting or file-like inputs.
Overview
Prepares PDF rendering to be process-isolation friendly by resolving the PDFium renderer at call time (avoiding stale import-time aliases) and keeps convert_pdf_to_image compatible while adding first_page/last_page support.
Updates OCR and image/table extraction to render PDFs in configurable page chunks (PDFIUM_CHUNK_SIZE, default 8) and to rasterize only the page ranges actually needed; adds guardrails for invalid chunk-size values (warn + fallback) and for page-count/layout mismatches.
Improves robustness for file-like inputs by rewinding/restoring stream positions across OCR/extraction stages, with new tests and updated expected ingest outputs reflecting minor OCR text diffs.
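As a rough illustration of the chunking and "warn + fallback" guardrail described above, a minimal sketch might look like this. The helper names `get_chunk_size` and `iter_page_chunks` are hypothetical, reading `PDFIUM_CHUNK_SIZE` from the environment is an assumption, and so is the 1-based inclusive page numbering; the PR's real implementation may differ on all three.

```python
import logging
import os
from typing import Iterator, Optional, Tuple

logger = logging.getLogger(__name__)

DEFAULT_CHUNK_SIZE = 8  # default stated in the overview above


def get_chunk_size(raw: Optional[str]) -> int:
    """Parse a chunk size, warning and falling back on invalid values."""
    if raw is None:
        return DEFAULT_CHUNK_SIZE
    try:
        value = int(raw)
    except ValueError:
        value = 0  # non-numeric input is handled like any other invalid value
    if value < 1:
        logger.warning(
            "Invalid PDFIUM_CHUNK_SIZE %r; using %d", raw, DEFAULT_CHUNK_SIZE
        )
        return DEFAULT_CHUNK_SIZE
    return value


def iter_page_chunks(
    first_page: int, last_page: int, chunk_size: int
) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (first_page, last_page) ranges (assumed 1-based)."""
    start = first_page
    while start <= last_page:
        end = min(start + chunk_size - 1, last_page)
        yield start, end
        start = end + 1


if __name__ == "__main__":
    # Assumption: the setting is supplied as an environment variable.
    size = get_chunk_size(os.environ.get("PDFIUM_CHUNK_SIZE"))
    for first, last in iter_page_chunks(1, 20, size):
        print(first, last)
```

Each yielded pair could then be passed as first_page/last_page to the renderer, so only one chunk of page images is held at a time.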
Reviewed by Cursor Bugbot for commit 2498e9c.