
perf(ingest): bounded-concurrency figure OCR + per-doc spend cap (E3)#57

Merged
CognitiveCodeAI merged 1 commit into main from feat/e3-ocr-batching
Jun 8, 2026


@CognitiveCodeAI

Owner

E3 — batch/cap figure OCR (audit H-7)

Part of the remediation (#25). Figure OCR ran serially (one blocking gpt-5-mini call per region), so a ~270-page PDF took many minutes to ingest with no cost bound — the "stuck processing" experience.

Approach (thread-safe by construction)

PyMuPDF is not thread-safe, so create_figure_nodes is restructured into three phases:

  • A — render all region images in document order (fitz, single-threaded), applying a deterministic per-doc OCR spend cap;
  • B — OCR the rendered regions with a bounded ThreadPoolExecutor (ocr_region is a stateless, thread-safe HTTP call via the OpenAI SDK);
  • C — build Nodes single-threaded in document order from the OCR map.
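The three phases above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `create_figure_nodes_sketch`, its `render` callback (standing in for the fitz rendering step), the dict-shaped nodes, and the `figure-{index}` ids are all hypothetical, and it assumes the cap simply keeps the first N regions in document order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def create_figure_nodes_sketch(
    regions: List[str],
    render: Callable[[str], bytes],      # hypothetical stand-in for fitz rendering
    ocr_region: Callable[[bytes], str],  # assumed stateless and thread-safe
    max_concurrency: int = 4,
    max_calls_per_doc: int = 60,         # 0 = unlimited
) -> List[dict]:
    # Phase A: render every region on the calling thread, in document order.
    rendered: List[Tuple[int, bytes]] = [
        (index, render(region)) for index, region in enumerate(regions)
    ]
    # Assumed cap rule: only the first N regions in document order get OCR.
    to_ocr = rendered[:max_calls_per_doc] if max_calls_per_doc > 0 else rendered

    def safe_ocr(image: bytes) -> str:
        # A failed OCR call is non-fatal: the figure just gets empty text.
        try:
            return ocr_region(image)
        except Exception:
            return ""

    # Phase B: only the I/O-bound OCR calls run on the bounded pool.
    with ThreadPoolExecutor(max_workers=max(1, max_concurrency)) as pool:
        futures = {index: pool.submit(safe_ocr, image) for index, image in to_ocr}
        ocr_map: Dict[int, str] = {i: f.result() for i, f in futures.items()}

    # Phase C: build nodes on the calling thread, in document order.
    nodes = []
    for index, _image in rendered:
        text = ocr_map.get(index, "")
        nodes.append({"id": f"figure-{index}", "text": text, "has_ocr": bool(text)})
    return nodes
```

Because results are keyed by region index and the nodes are assembled in a single ordered pass, the output does not depend on the order in which the OCR calls finish.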

Behavior-preserving

At concurrency 1, the nodes (ids, text, order, has_ocr) are identical to before. Per-figure OCR failures stay non-fatal (empty text). Beyond the cap, figures get empty OCR text and a logged warning. New config: ocr_max_concurrency=4, ocr_max_calls_per_doc=60 (0 = unlimited). _create_single_figure_node is replaced by _build_figure_node, which is pure and fitz-free.
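One way to read the two new settings and the cap semantics is sketched below. Only the names `ocr_max_concurrency` and `ocr_max_calls_per_doc` and their defaults come from this PR; the `OcrSettings` dataclass, the `select_for_ocr` helper, the logger name, and the "first N regions in document order" rule are assumptions made for illustration.

```python
import logging
from dataclasses import dataclass
from typing import List

logger = logging.getLogger("figure_ocr_sketch")  # hypothetical logger name


@dataclass(frozen=True)
class OcrSettings:  # hypothetical container for the two new settings
    ocr_max_concurrency: int = 4
    ocr_max_calls_per_doc: int = 60  # 0 = unlimited


def select_for_ocr(num_regions: int, settings: OcrSettings) -> List[int]:
    """Return the region indices that may be sent to OCR, in document order."""
    cap = settings.ocr_max_calls_per_doc
    if cap == 0 or num_regions <= cap:
        return list(range(num_regions))
    # Assumed rule: keep the first `cap` regions, so the same document
    # always yields the same selection regardless of thread timing.
    logger.warning(
        "OCR spend cap reached: %d of %d figures will have empty OCR text",
        num_regions - cap,
        num_regions,
    )
    return list(range(cap))
```

Selecting by position before any call is made, rather than counting completed calls, is what would make such a cap deterministic under concurrency.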

Validation

tests/test_figure_ocr_batching.py: composition, order+parallel, deterministic cap, failure-non-fatal, skip-ocr. Full default suite: 309 passed.
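For a sense of what an order-plus-parallel check could look like, here is a small self-contained sketch. It is not taken from tests/test_figure_ocr_batching.py: `fake_ocr`, `ocr_in_order`, and the test name are hypothetical, and `ThreadPoolExecutor.map` is used here only because it returns results in input order.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def fake_ocr(job: Tuple[int, float]) -> str:
    # Hypothetical OCR stub: earlier regions take longer, so with several
    # workers the calls complete in roughly reverse document order.
    index, delay = job
    time.sleep(delay)
    return f"text-{index}"


def ocr_in_order(jobs: List[Tuple[int, float]], max_workers: int) -> List[str]:
    # Executor.map yields results in input order, not completion order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_ocr, jobs))


def test_order_is_preserved_under_parallelism() -> None:
    jobs = [(0, 0.03), (1, 0.02), (2, 0.01), (3, 0.0)]
    serial = ocr_in_order(jobs, max_workers=1)
    parallel = ocr_in_order(jobs, max_workers=4)
    assert serial == parallel == ["text-0", "text-1", "text-2", "text-3"]
```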

Closes #22

Audit H-7: figure OCR ran serially (one blocking gpt-5-mini call per
region), making large-PDF ingestion very slow with no cost bound.

Restructure create_figure_nodes into three phases that keep PyMuPDF
single-threaded (it is NOT thread-safe) while parallelizing only the I/O:
  A) render all region images in document order (fitz, single-threaded),
     applying a deterministic per-doc OCR spend cap;
  B) OCR the rendered regions with a bounded ThreadPoolExecutor
     (ocr_region is a stateless, thread-safe HTTP call);
  C) build Node objects single-threaded in document order from the OCR map.

Behavior-preserving: at concurrency 1 the nodes (ids, text, order, has_ocr)
are identical to before; per-figure OCR failures stay non-fatal (empty
text); beyond the spend cap figures get empty OCR + a logged warning.

New config: ocr_max_concurrency=4, ocr_max_calls_per_doc=60 (0=unlimited).
_create_single_figure_node -> _build_figure_node (pure, fitz-free).

Tests: tests/test_figure_ocr_batching.py (composition, order+parallel,
deterministic cap, failure-non-fatal, skip_ocr). Full suite: 309 passed.

Closes #22
CognitiveCodeAI added this to the Phase 3 – Modernize milestone Jun 8, 2026
CognitiveCodeAI added the WS-E (Workstream E: Modernize architecture) and reliability (Reliability, resilience, or operational correctness) labels Jun 8, 2026
CognitiveCodeAI merged commit 7d5e095 into main Jun 8, 2026
4 checks passed
CognitiveCodeAI deleted the feat/e3-ocr-batching branch June 8, 2026 15:24


Development

Successfully merging this pull request may close these issues.

[E3] Batch and cache figure OCR with bounded concurrency and per-doc spend cap
