perf(ingest): bounded-concurrency figure OCR + per-doc spend cap (E3) by CognitiveCodeAI · Pull Request #57 · CognitiveCodeAI/rag-main-2

CognitiveCodeAI · 2026-06-08T15:21:28Z

E3 — batch/cap figure OCR (audit H-7)

Part of the remediation (#25). Figure OCR ran serially (one blocking gpt-5-mini call per region), so a ~270-page PDF took many minutes to ingest with no cost bound — the "stuck processing" experience.

Approach (thread-safe by construction)

PyMuPDF is not thread-safe, so create_figure_nodes is restructured into three phases:

A — render all region images in document order (fitz, single-threaded), applying a deterministic per-doc OCR spend cap;
B — OCR the rendered regions with a bounded ThreadPoolExecutor (ocr_region is a stateless, thread-safe HTTP call via the OpenAI SDK);
C — build Nodes single-threaded in document order from the OCR map.
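The three phases above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `RenderedRegion`, `render_regions`, `fake_ocr_region`, `ocr_all`, `build_nodes` and `create_figure_nodes_sketch` are hypothetical names, the fitz rendering and the OpenAI call are replaced by stand-ins, and the dict-shaped node with `has_ocr` derived from non-empty text is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class RenderedRegion:
    index: int          # position in document order
    image_bytes: bytes  # image bytes produced in phase A


def render_regions(raw_regions):
    # Phase A stand-in: the real code would call fitz (PyMuPDF) here,
    # so this step stays on the calling thread.
    return [RenderedRegion(i, data) for i, data in enumerate(raw_regions)]


def fake_ocr_region(region):
    # Phase B stand-in for the stateless HTTP OCR call.
    return f"text-{region.index}"


def ocr_all(regions, ocr_fn, max_concurrency=4):
    # Phase B: only the I/O-bound OCR calls run on worker threads.
    # Executor.map returns results in input order, whatever order
    # the calls happen to complete in.
    if not regions:
        return {}
    workers = max(1, min(max_concurrency, len(regions)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(ocr_fn, regions))
    return {region.index: text for region, text in zip(regions, texts)}


def build_nodes(regions, ocr_map):
    # Phase C: single-threaded, in document order.
    nodes = []
    for region in regions:
        text = ocr_map.get(region.index, "")
        # Assumption: has_ocr simply reflects whether any text came back.
        nodes.append({"id": f"figure-{region.index}", "text": text, "has_ocr": bool(text)})
    return nodes


def create_figure_nodes_sketch(raw_regions, ocr_fn=fake_ocr_region, max_concurrency=4):
    regions = render_regions(raw_regions)
    ocr_map = ocr_all(regions, ocr_fn, max_concurrency)
    return build_nodes(regions, ocr_map)
```

Because the OCR results are keyed by region index and nodes are built in a plain loop afterwards, the output order does not depend on thread scheduling, which is what makes the concurrency-1 and concurrency-4 results comparable.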

Behavior-preserving

At concurrency 1 the nodes (ids, text, order, has_ocr) are identical to before. Per-figure OCR failures stay non-fatal (empty text). Beyond the cap, figures get empty OCR text and a logged warning. New config: ocr_max_concurrency=4, ocr_max_calls_per_doc=60 (0 = unlimited). _create_single_figure_node is replaced by _build_figure_node, a pure function that does not touch fitz.
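One way to read the cap and failure semantics described here is sketched below. It is a sketch under assumptions rather than the merged implementation: `select_for_ocr` and `safe_ocr` are hypothetical helpers, the "first N regions in document order" selection rule is one plausible interpretation of "deterministic", and treating negative values like 0 is an added assumption.

```python
import logging

logger = logging.getLogger("figure_ocr_sketch")


def select_for_ocr(region_indices, max_calls_per_doc=60):
    """Split region indices (already in document order) into (allowed, skipped)."""
    indices = list(region_indices)
    # 0 means unlimited; negative values are treated the same way (assumption).
    if max_calls_per_doc <= 0:
        return indices, []
    allowed = indices[:max_calls_per_doc]
    skipped = indices[max_calls_per_doc:]
    if skipped:
        logger.warning(
            "OCR cap of %d reached; %d figure(s) will get empty OCR text",
            max_calls_per_doc,
            len(skipped),
        )
    return allowed, skipped


def safe_ocr(ocr_fn, region):
    """Run one OCR call; a failure yields empty text instead of raising."""
    try:
        return ocr_fn(region)
    except Exception:
        logger.warning("OCR failed for region %r; using empty text", region, exc_info=True)
        return ""
```

Selecting by position before any thread starts keeps the choice of which figures are OCR'd independent of timing, so repeated ingests of the same document spend the budget on the same regions.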

Validation

tests/test_figure_ocr_batching.py: composition, order+parallel, deterministic cap, failure-non-fatal, skip-ocr. Full default suite: 309 passed.
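For illustration, an order-plus-parallelism check could look something like the sketch below. It is not the content of tests/test_figure_ocr_batching.py: `ConcurrencyProbe`, `run_probe` and `test_order_and_bound` are hypothetical, and the probe stands in for the real OCR call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyProbe:
    """Fake OCR callable that records the peak number of overlapping calls."""

    def __init__(self, delay=0.02):
        self._lock = threading.Lock()
        self._in_flight = 0
        self._delay = delay
        self.peak = 0

    def __call__(self, region_index):
        with self._lock:
            self._in_flight += 1
            self.peak = max(self.peak, self._in_flight)
        try:
            # Stagger delays so calls complete out of submission order.
            time.sleep(self._delay * (3 - region_index % 3))
            return f"text-{region_index}"
        finally:
            with self._lock:
                self._in_flight -= 1


def run_probe(n_regions, max_concurrency):
    probe = ConcurrencyProbe()
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() yields results in input order even when calls finish out of order.
        texts = list(pool.map(probe, range(n_regions)))
    return texts, probe.peak


def test_order_and_bound():
    texts, peak = run_probe(n_regions=8, max_concurrency=4)
    assert texts == [f"text-{i}" for i in range(8)]  # document order preserved
    assert 1 <= peak <= 4                            # never more than 4 in flight
```

The peak counter only asserts an upper bound, since the exact overlap depends on scheduling; asserting an exact peak would make the test flaky.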

Closes #22

Audit H-7: figure OCR ran serially (one blocking gpt-5-mini call per region), making large-PDF ingestion very slow with no cost bound.

Restructure create_figure_nodes into three phases that keep PyMuPDF single-threaded (it is NOT thread-safe) while parallelizing only the I/O:

A) render all region images in document order (fitz, single-threaded), applying a deterministic per-doc OCR spend cap;
B) OCR the rendered regions with a bounded ThreadPoolExecutor (ocr_region is a stateless, thread-safe HTTP call);
C) build Node objects single-threaded in document order from the OCR map.

Behavior-preserving: at concurrency 1 the nodes (ids, text, order, has_ocr) are identical to before; per-figure OCR failures stay non-fatal (empty text); beyond the spend cap figures get empty OCR + a logged warning.

New config: ocr_max_concurrency=4, ocr_max_calls_per_doc=60 (0=unlimited).

_create_single_figure_node -> _build_figure_node (pure, fitz-free).

Tests: tests/test_figure_ocr_batching.py (composition, order+parallel, deterministic cap, failure-non-fatal, skip_ocr). Full suite: 309 passed.

Closes #22

CognitiveCodeAI added this to the Phase 3 – Modernize milestone Jun 8, 2026

CognitiveCodeAI added the WS-E (Workstream E: Modernize architecture) and reliability (Reliability, resilience, or operational correctness) labels Jun 8, 2026

CognitiveCodeAI merged commit 7d5e095 into main Jun 8, 2026
4 checks passed

CognitiveCodeAI deleted the feat/e3-ocr-batching branch June 8, 2026 15:24

CognitiveCodeAI merged 1 commit into main from feat/e3-ocr-batching
