perf(ingest): bounded-concurrency figure OCR + per-doc spend cap (E3) #57
Merged
Audit H-7: figure OCR ran serially (one blocking gpt-5-mini call per
region), making large-PDF ingestion very slow with no cost bound.
Restructure create_figure_nodes into three phases that keep PyMuPDF
single-threaded (it is NOT thread-safe) while parallelizing only the I/O:
A) render all region images in document order (fitz, single-threaded),
applying a deterministic per-doc OCR spend cap;
B) OCR the rendered regions with a bounded ThreadPoolExecutor
(ocr_region is a stateless, thread-safe HTTP call);
C) build Node objects single-threaded in document order from the OCR map.
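
The three phases above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: RenderedRegion, fake_ocr_region, safe_ocr, and ocr_in_phases are hypothetical names, the "render" step is a placeholder for the fitz work, and the OCR call is a local stand-in for the real HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class RenderedRegion:
    # Hypothetical container: one figure region rendered to image bytes in phase A.
    index: int
    image: bytes


def fake_ocr_region(image: bytes) -> str:
    # Stand-in for the real ocr_region HTTP call; assumed stateless and thread-safe.
    if not image:
        raise ValueError("empty image")
    return image.decode("ascii").upper()


def safe_ocr(region: RenderedRegion) -> tuple[int, str]:
    # Per-figure failures stay non-fatal: fall back to empty text.
    try:
        return region.index, fake_ocr_region(region.image)
    except Exception:
        return region.index, ""


def ocr_in_phases(images: list[bytes], max_concurrency: int = 4) -> list[str]:
    # Phase A: single-threaded "render" in document order (fitz would live here).
    regions = [RenderedRegion(i, img) for i, img in enumerate(images)]

    # Phase B: only the I/O-bound OCR calls run in the bounded pool.
    with ThreadPoolExecutor(max_workers=max(1, max_concurrency)) as pool:
        ocr_map = dict(pool.map(safe_ocr, regions))

    # Phase C: single-threaded assembly in document order from the OCR map.
    return [ocr_map[r.index] for r in regions]
```

Keying the OCR results by region index means phase C never depends on which worker finished first, so the output order is the document order at any concurrency.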
Behavior-preserving: at concurrency 1 the nodes (ids, text, order, has_ocr)
are identical to before; per-figure OCR failures stay non-fatal (empty
text); beyond the spend cap figures get empty OCR + a logged warning.
New config: ocr_max_concurrency=4, ocr_max_calls_per_doc=60 (0=unlimited).
_create_single_figure_node -> _build_figure_node (pure, fitz-free).
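
One way a deterministic per-doc cap could work is sketched below, assuming the cap simply admits the first N regions in document order. The function name select_for_ocr and the exact warning text are hypothetical; only the default of 60 and the "0 = unlimited" convention come from this PR.

```python
import logging

logger = logging.getLogger(__name__)


def select_for_ocr(region_count: int, max_calls_per_doc: int = 60) -> list[bool]:
    # Assumed semantics: True means the region gets an OCR call.
    # A cap of 0 is treated as unlimited, as described in the PR.
    limit = region_count if max_calls_per_doc == 0 else max_calls_per_doc

    # Decide purely by document-order index, so the outcome never
    # depends on thread scheduling or on OCR completion order.
    decisions = [i < limit for i in range(region_count)]

    skipped = decisions.count(False)
    if skipped:
        logger.warning(
            "OCR spend cap reached: %d figure(s) will get empty OCR text", skipped
        )
    return decisions
```

Because the decision is taken before any OCR call is dispatched, the same document yields the same set of OCR'd figures on every run.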
Tests: tests/test_figure_ocr_batching.py (composition, order+parallel,
deterministic cap, failure-non-fatal, skip_ocr). Full suite: 309 passed.
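
As an illustration of what an "order+parallel" check might look like (this is not the actual test in tests/test_figure_ocr_batching.py), the sketch below uses a hypothetical slow_ocr stub whose delays force later regions to finish first, then asserts that the result order is unchanged between concurrency 1 and 4.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_ocr(item: tuple[int, float]) -> tuple[int, str]:
    # Hypothetical OCR stub: sleeps for the given delay, then returns
    # text tagged with the region's document-order index.
    index, delay = item
    time.sleep(delay)
    return index, f"text-{index}"


def run_ocr(delays: list[float], max_concurrency: int) -> list[str]:
    items = list(enumerate(delays))
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        ocr_map = dict(pool.map(slow_ocr, items))
    # Rebuild the list in document order from the index-keyed map.
    return [ocr_map[i] for i, _ in items]


def test_order_is_independent_of_completion_order() -> None:
    delays = [0.05, 0.0, 0.02]  # the first region finishes last
    expected = ["text-0", "text-1", "text-2"]
    assert run_ocr(delays, 1) == expected
    assert run_ocr(delays, 4) == expected
```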
Closes #22
E3: batch/cap figure OCR (audit H-7)

Part of the remediation (#25). Figure OCR ran serially (one blocking gpt-5-mini call per region), so a ~270-page PDF took many minutes to ingest with no cost bound. This was the "stuck processing" experience.

Approach (thread-safe by construction)

PyMuPDF is not thread-safe, so create_figure_nodes is restructured into three phases:
A) render all region images in document order (fitz, single-threaded), applying a deterministic per-doc OCR spend cap;
B) OCR the rendered regions with a bounded ThreadPoolExecutor (ocr_region is a stateless, thread-safe HTTP call via the OpenAI SDK);
C) build Nodes single-threaded in document order from the OCR map.

Behavior-preserving

At concurrency 1 the nodes (ids, text, order, has_ocr) are identical to before. Per-figure OCR failures stay non-fatal (empty text). Beyond the cap, figures get empty OCR + a logged warning. New config: ocr_max_concurrency=4, ocr_max_calls_per_doc=60 (0 = unlimited). _create_single_figure_node → pure _build_figure_node.

Validation

tests/test_figure_ocr_batching.py: composition, order+parallel, deterministic cap, failure-non-fatal, skip-ocr. Full default suite: 309 passed.

Closes #22