
Add benchmark harness for evaluating OpenContracts against external RAG datasets#1239

Open
JSv4 wants to merge 20 commits into main from claude/legal-rag-corpus-generation-SWmBV

Conversation


JSv4 (Collaborator) commented Apr 13, 2026

Summary

This PR introduces a comprehensive benchmarking framework for evaluating OpenContracts' extraction and retrieval capabilities against external RAG datasets. The initial implementation includes full support for LegalBench-RAG (Pipitone & Alami, 2024), with an extensible adapter pattern for adding future benchmarks.

Key Changes

  • New opencontractserver/benchmarks/ app with modular components:

    • adapters/base.py: Abstract adapter interface for normalizing benchmark datasets
    • adapters/legalbench_rag.py: Concrete adapter for LegalBench-RAG (contractnli, cuad, maud, privacy_qa subsets)
    • loader.py: Materializes benchmark data into live Corpus + Extract + Datacells with document ingestion
    • runner.py: End-to-end orchestration that loads, extracts, evaluates, and reports
    • metrics.py: SQuAD-style answer metrics (exact match, token F1) and span-based retrieval metrics (recall@k, precision@k, char IoU)
    • retrieval.py: Independent probe of CoreAnnotationVectorStore for retrieval evaluation
    • report.py: Serializable per-task and aggregate results (JSON/CSV output)
  • Management command (run_benchmark): CLI interface for running benchmarks with configurable:

    • Model selection (any pydantic-ai compatible identifier)
    • Top-k retrieval parameter
    • Subset filtering and task limits
    • Custom corpus/extract naming and output directories
  • Test suite (test_benchmarks.py):

    • Pure unit tests for metrics (no DB/Celery dependencies)
    • Adapter tests against micro fixture
    • End-to-end integration test with mocked LLM agent
  • Micro fixture (fixtures/benchmarks/legalbench_rag_micro/): Minimal synthetic LegalBench-RAG dataset for testing and CI

  • Documentation (docs/extract_and_retrieval/benchmarking.md): Usage guide and architecture overview

  • Integration with extraction pipeline: Modified doc_extract_query_task to accept model_override parameter for benchmark-specific model selection

Implementation Details

  • Separation of concerns: Answer extraction and retrieval are evaluated independently so each dimension can fail separately and be diagnosed clearly
  • Eager celery mode: Loader wraps document ingestion in force_celery_eager() context to ensure sentence-level annotations exist before extraction runs
  • Adapter pattern: Minimal interface (two iterators + describe method) makes it trivial to add new benchmarks without touching core logic
  • Macro-averaging: Aggregates weight every task equally regardless of document count, matching LegalBench-RAG's reporting convention
  • Comprehensive metrics: Captures both answer quality (SQuAD-style) and retrieval quality (span overlap) with per-task and aggregate reporting

https://claude.ai/code/session_01MCwUHaGd6EApdz7rehtPXH

Adds a new opencontractserver/benchmarks/ app that generates an
OpenContracts corpus from an external RAG benchmark, runs the production
extract-grid pipeline against it with a configurable LLM, probes the
retrieval layer independently, and reports standard answer and retrieval
metrics.

Answer metrics use SQuAD-style normalized exact match and token F1 over
the extracted Datacell value. Retrieval metrics use character-span
recall@k, precision@k, and IoU computed against the benchmark's gold
spans; retrieval is probed via CoreAnnotationVectorStore independently
of the structured-response extraction path so the two dimensions fail
separately.
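The answer metrics follow the standard SQuAD recipe (normalize, then compare exactly or by token overlap). A self-contained sketch of that recipe, not the PR's exact code:

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    # SQuAD-style normalization: lowercase, drop punctuation and the
    # articles a/an/the, collapse whitespace.
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # SQuAD convention: both empty -> 1.0, exactly one empty -> 0.0.
        return float(pred_tokens == gold_tokens)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```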

The first supported benchmark is LegalBench-RAG (Pipitone & Alami,
2024; arXiv:2408.10343). New benchmarks are added by subclassing
BaseBenchmarkAdapter and registering it in the management command's
registry; the loader, runner, evaluator, metrics, and reporter are all
benchmark-agnostic.
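A hypothetical new adapter might look like the following. The `BenchmarkDocument`/`BenchmarkTask` field names here are assumed for illustration (the PR's actual record types may differ), and a real adapter would subclass `BaseBenchmarkAdapter` rather than stand alone:

```python
from dataclasses import dataclass
from typing import Any, Iterator


@dataclass
class BenchmarkDocument:
    doc_id: str
    text: str


@dataclass
class BenchmarkTask:
    doc_id: str
    query: str
    gold_answer: str
    gold_spans: list[tuple[int, int]]  # character offsets into the document


class MyBenchmarkAdapter:
    """Sketch of the two-iterators-plus-describe adapter contract."""

    def iter_documents(self) -> Iterator[BenchmarkDocument]:
        yield BenchmarkDocument("doc-1", "Some contract text ...")

    def iter_tasks(self) -> Iterator[BenchmarkTask]:
        yield BenchmarkTask("doc-1", "What is the term?", "Some contract", [(0, 13)])

    def describe(self) -> dict[str, Any]:
        return {"name": "my-benchmark", "subsets": []}
```

Because the loader, runner, and metrics only consume these iterators, registering the class in the management command is the only other change needed.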

Changes:

- New opencontractserver/benchmarks/ app (adapters, loader, runner,
  retrieval probe, metrics, report, management command). Registered
  in LOCAL_APPS.
- LegalBenchRAGAdapter reads the authoritative ZeroEntropy schema
  (corpus/<subset>/*.txt + benchmarks/<subset>.json with snippets of
  shape {"file_path", "span": [start, end]}).
- doc_extract_query_task gains an optional model_override kwarg.
  Backward compatible: when omitted the task still uses
  openai:gpt-4o-mini as before.
- Micro fixture under fixtures/benchmarks/legalbench_rag_micro/ so the
  test suite can exercise the full flow without downloading the
  upstream dataset.
- Tests: opencontractserver/tests/test_benchmarks.py covers metrics
  (pure unit tests), adapter parsing, loader materialization, and an
  end-to-end run with a mocked structured-response agent so CI does
  not hit a real LLM.
- User-facing docs at docs/extract_and_retrieval/benchmarking.md
  with CLI and Python API examples.
- CHANGELOG entry.

Usage:

    docker compose -f local.yml run django python manage.py run_benchmark \
      --benchmark legalbench-rag --path /data/legalbench-rag \
      --user admin --model openai:gpt-4o-mini --top-k 10

Results (report.json / report.csv / config.json / gold.json) are
written under a timestamped run directory.


claude bot commented Apr 13, 2026

Code Review

This is a well-structured PR. The separation of concerns between adapter, loader, runner, metrics, and report is clean, and the adapter pattern makes adding future benchmarks straightforward. The micro fixture for testing is a nice touch. Here are issues worth addressing before merge:


Critical

_null_context defined twice — DRY violation

loader.py defines _null_context as a @contextmanager function; runner.py defines a separate _null_context class. Per CLAUDE.md, no dead code and no duplication. The runner should import the one from the loader, or a shared utility module should own it.

Management command registry pattern is incomplete

BENCHMARK_REGISTRY is defined but then bypassed by an if benchmark_name == "legalbench-rag": ... else: raise block. Adding a second benchmark now requires edits in two places (the registry and the if-else), which defeats the purpose of the registry. Move the adapter construction into a factory callable so the registry is the only change point:

BENCHMARK_REGISTRY = {
    "legalbench-rag": lambda opts: LegalBenchRAGAdapter(
        root=opts["path"],
        subsets=opts.get("subsets") or None,
        limit=opts.get("limit"),
    ),
}

Then the handle method becomes just adapter = BENCHMARK_REGISTRY[benchmark_name](options).

char_iou builds a set of every individual character index

_union_of_spans calls covered.update(range(start, end)) for every span. A span of (0, 50000) allocates 50k integers in a Python set. For LegalBench-RAG the spans are small, but this is a ticking time-bomb for future benchmarks with larger documents. Use interval arithmetic instead (merge sorted intervals, then compute overlap length):

def _union_length(spans: Iterable[Span]) -> int:
    merged = []
    for s, e in sorted(spans):
        if merged and s <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append([s, e])
    return sum(e - s for s, e in merged)
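Extending the same interval representation to a full IoU only needs an intersection pass over the merged intervals. A sketch consistent with the suggestion above (not necessarily the code that landed in the PR):

```python
def _merge(spans):
    # Sort, then coalesce overlapping intervals into a disjoint list.
    merged = []
    for s, e in sorted(spans):
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return merged


def char_iou(pred_spans, gold_spans):
    pred = _merge(pred_spans)
    gold = _merge(gold_spans)
    # Two-pointer walk over pre-merged intervals sums the overlap length.
    inter = 0
    i = j = 0
    while i < len(pred) and j < len(gold):
        lo = max(pred[i][0], gold[j][0])
        hi = min(pred[i][1], gold[j][1])
        if lo < hi:
            inter += hi - lo
        if pred[i][1] < gold[j][1]:
            i += 1
        else:
            j += 1
    union_len = sum(e - s for s, e in pred) + sum(e - s for s, e in gold) - inter
    return inter / union_len if union_len else 0.0
```

This stays O(n log n) in the number of spans and allocates nothing proportional to document length.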

Moderate

Magic strings / numbers not in constants files

CLAUDE.md: "No magic numbers — we have constants files in opencontractserver/constants/."

DEFAULT_MODEL = "openai:gpt-4o-mini" (runner.py) and _COLUMN_NAME_MAX_LEN = 128 (loader.py) should live in the constants module so they're findable from one place and the hardcoded model string isn't scattered across files.

aggregates dict stores counts as float

"task_count": float(total),
"extraction_success_count": float(len(ok_results)),

Task counts are integers. The test already unwraps them with int(report.aggregates["task_count"]). Store them as int and only use float for rates/means to avoid surprising callers.

_extract_prediction_string relies on undocumented data schema

payload = data.get("data") if isinstance(data, dict) else None

There's no comment explaining that Datacell.data is expected to have shape {"data": ...}. If that schema ever changes, this silently returns "" for every task and the benchmark reports 0.0 F1 without any warning. Add a one-line comment documenting the expected shape.


Minor

RST double-colon in a Markdown file

benchmarking.md contains:

LegalBench-RAG ships as::

The trailing :: is reStructuredText literal-block syntax; CommonMark has no such construct, so it renders as a literal :: at the end of the sentence.

config["finished_at"] mutates report.config by reference

report = BenchmarkReport(..., config=config, ...)
report.compute_aggregates()
config["finished_at"] = timezone.now().isoformat()   # also modifies report.config

Assigning to config after BenchmarkReport is constructed silently updates report.config too, since they point at the same dict. Either do the assignment before constructing the report, or use report.config["finished_at"] = ... explicitly so the intent is clear.

Path-traversal safeguard raises rather than skips

In _load_document, the ValueError from a traversal attempt propagates all the way up and aborts the entire load. That's fine for malicious input, but if a benchmark JSON file references a nonexistent or out-of-bounds path due to data quality, users get a hard crash instead of a skipped-with-warning task. Consider logging a warning and continue-ing in _ensure_loaded (similar to how other malformed snippet errors are handled).


Nit

The dead else: raise CommandError branch (guarded by # pragma: no cover) is unreachable because argparse already validates choices=. Delete it once the factory pattern above is in place.


Overall the architecture is solid and the test coverage is good. The critical items above are all small fixes; the biggest win would be the registry factory refactor since it's a footgun for anyone adding the next adapter.

JSv4 added 2 commits April 12, 2026 22:06
The benchmarks __init__.py eagerly imported run_benchmark, which
transitively pulled in Django models before the app registry was
ready. Remove all re-exports from __init__.py and document that
callers should import from submodules directly. Update the
programmatic API example in docs accordingly. Also fix RST
double-colon syntax in markdown docs.
- Consolidate duplicate _null_context into benchmarks/utils.py
- Refactor char_iou to use interval merging (O(n log n)) instead of
  materializing a set of every character index (O(n) memory)
- Move magic strings/numbers (DEFAULT_MODEL, DEFAULT_TOP_K,
  _COLUMN_NAME_MAX_LEN) to opencontractserver/constants/benchmarks.py
- Fix aggregates dict storing counts as float instead of int
- Add clarifying comment about Datacell.data {"data": ...} schema
- Fix config aliasing bug: pass dict(config) to BenchmarkReport so
  later mutation of config["finished_at"] does not affect the report
- Change path-traversal safeguard to skip with warning instead of
  raising and aborting the entire run
- Remove dead else branch in management command (unreachable due to
  argparse choices= validation) and use registry directly

from opencontractserver.constants.annotations import * # noqa: F401, F403
from opencontractserver.constants.auth import * # noqa: F401, F403
from opencontractserver.constants.benchmarks import * # noqa: F401, F403

claude bot commented Apr 13, 2026

Code Review: Benchmark Harness for External RAG Datasets

Overall this is a well-structured addition. The adapter pattern is clean, the separation of concerns across adapter/loader/runner/metrics/report is solid, and the path traversal check in _load_document is implemented correctly. A few things worth addressing before merge:


Issues

1. null_context re-implements contextlib.nullcontext (stdlib since Python 3.7)

opencontractserver/benchmarks/utils.py defines its own no-op context manager. Python already ships this:

# Before
from opencontractserver.benchmarks.utils import null_context

# After
from contextlib import nullcontext as null_context

The utils.py file and its two import sites in loader.py and runner.py can be removed entirely. This also avoids the cross-module dependency on a trivial helper.


2. Default model is duplicated across two modules

opencontractserver/tasks/data_extract_tasks.py has:

model=model_override or "openai:gpt-4o-mini"

Meanwhile, opencontractserver/constants/benchmarks.py defines BENCHMARK_DEFAULT_MODEL = "openai:gpt-4o-mini" for the same value. These are intentionally separate (benchmark default vs. production default), but the production task doesn't import the constant, so they can silently diverge. Either:

  • Import the constant in the task (from opencontractserver.constants.benchmarks import BENCHMARK_DEFAULT_MODEL), or
  • Extract a shared EXTRACT_DEFAULT_MODEL constant from an existing constants file and reference it in both places.

3. Flawed test assertion — structural_annotation_set is always non-None

In test_benchmarks.py (test_loader_materializes_corpus_fieldset_extract_and_datacells):

self.assertIsNotNone(document.structural_annotation_set)

A Django reverse relation manager is never None — this always passes and doesn't verify ingestion actually produced annotations. It should be:

self.assertTrue(document.structural_annotation_set.exists())

4. Aggregation asymmetry between answer and retrieval metrics is undocumented

In report.py, compute_aggregates computes answer metrics only over ok_results (successful extractions) but computes retrieval metrics over all results:

ok_results = [r for r in self.task_results if r.extraction_ok]
...
"answer_exact_match": mean(r.answer_exact_match for r in ok_results),  # only ok
"retrieval_recall_at_k": mean(r.retrieval_recall_at_k for r in self.task_results),  # all

The design rationale is sound (retrieval is independent of extraction success), but it's a subtle asymmetry that will confuse anyone comparing the two numbers. A one-line comment would help: e.g. # Retrieval is independent of extraction success; average over all tasks.


5. force_celery_eager() mutates global Celery config — scope caveat is missing

The docstring correctly notes this is safe for the CLI use case, but the same function is imported and used in the runner, which is callable from notebooks and programmatic code. If a caller ever invokes run_benchmark inside a running Celery worker or web request, the global task_always_eager mutation will affect every concurrent task on that process. A short caveat in the docstring ("do not call from a shared worker process") would prevent misuse.


6. Minor: describe() in LegalBenchRAGAdapter re-scans the filesystem on every call

def describe(self) -> dict[str, Any]:
    return {
        ...
        "subsets": list(self._discover_subset_files().keys()),
        ...
    }

_discover_subset_files() does a glob on every call to describe(). Not a problem in practice since describe() is called once per run, but caching the result after _ensure_loaded() would be tidier.


Positives

  • Path traversal in _load_document is correctly handled (resolve() + relative_to() pattern).
  • The adapter interface is minimal and the boundary between adapter/loader/runner is well-drawn.
  • Constants file correctly warns against Django imports at the top.
  • TransactionTestCase is the right choice for the integration test given signals + eager Celery.
  • write_report=False guard in the runner keeps tests filesystem-clean.
  • The micro fixture makes CI self-contained without downloading the full dataset.
  • The model_override change to doc_extract_query_task is properly backward-compatible.

JSv4 added 2 commits April 12, 2026 23:39
- Fix mock target for agents API: use patch.object(AgentAPI, ...) with
  staticmethod wrapper instead of patching module-level attribute that
  doesn't exist due to local import in data_extract_tasks.py
- Remove flawed structural_annotation_set assertion that depends on
  TxtParser succeeding in CI; replace with document existence check
- Use force_celery_eager for both ingestion and extraction in tests
  instead of relying on override_settings with transaction.on_commit
- Replace custom null_context with stdlib contextlib.nullcontext
- Delete opencontractserver/benchmarks/utils.py (only contained
  null_context reimplementation)
- Add clarifying comments for aggregate metric asymmetry in report.py
- Add safety warning to force_celery_eager docstring

claude bot commented Apr 13, 2026

Code Review: Benchmark Harness for External RAG Datasets

Overall this is a well-structured addition. Clean adapter pattern, good separation of concerns, proper use of the constants file, path-traversal protection in _load_document, and solid test coverage (pure metric unit tests + adapter + integration with a mocked LLM). A few issues worth addressing before merge:


Bugs

1. _wait_for_document_ready does not handle deleted documents (opencontractserver/benchmarks/loader.py)

When .first() returns None (document deleted mid-poll), None in (COMPLETED, FAILED) is False, so the loop runs silently until TimeoutError. Add a None guard:

if status is None or status in (DocumentProcessingStatus.COMPLETED, DocumentProcessingStatus.FAILED):
    return

2. use_eager_extraction=False evaluates before extraction completes (opencontractserver/benchmarks/runner.py)

_run_extraction always calls doc_extract_query_task.apply(...) which is synchronous in the current process. But when use_eager_extraction=False, child tasks dispatched inside doc_extract_query_task go to the real broker and have not finished when _evaluate is called immediately after. You would see extraction_ok=False and empty predictions for every task — silently wrong results, not an error. Either document that use_eager_extraction=False is unsupported for production runs, or add a post-extraction poll mirroring _wait_for_document_ready.


Design

3. adapters/__init__.py re-exports contradict the top-level __init__ warning

opencontractserver/benchmarks/__init__.py warns against package-level re-exports because submodules transitively depend on Django models. But adapters/__init__.py does re-export LegalBenchRAGAdapter. This is safe today (adapters are Django-free) but is a trap for future adapter authors. Add a comment in adapters/__init__.py stating "adapters must remain Django-free" or drop the re-exports to match the top-level pattern.

4. BenchmarkReport.compute_aggregates() must be called manually

If a caller constructs BenchmarkReport(...) and forgets to call compute_aggregates(), report.aggregates is {} — silently empty rather than a clear error. Consider a __post_init__ that calls it, or a lazy @property. The runner calls it correctly, but notebook/one-off API users could easily miss this.
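One way to make that invariant self-enforcing is to compute aggregates on access instead of requiring a manual call. A simplified sketch (illustrative field names, not the PR's actual dataclass):

```python
from dataclasses import dataclass, field
from statistics import fmean


@dataclass
class Report:
    scores: list[float] = field(default_factory=list)

    @property
    def aggregates(self) -> dict[str, float]:
        # Computed on every access, so it can never be silently empty or
        # stale after task_results is mutated post-construction.
        return {"mean_score": fmean(self.scores) if self.scores else 0.0}
```

The trade-off is recomputation per access, which is negligible for per-run reports but worth caching if aggregates are read in a tight loop.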

5. force_celery_eager has no runtime guard against web-worker calls

The docstring is clear about the danger but there is no code to back it up. A simple check (e.g. inspect sys.argv or a gunicorn/uwsgi env var) would prevent someone from accidentally calling run_benchmark from a view or task and globally mutating the celery config for the entire worker process.
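A guard along those lines could be as small as the following sketch; the process names checked are an assumption about the deployment, not something the PR specifies:

```python
import os
import sys


def assert_safe_for_eager() -> None:
    # Hypothetical guard: refuse to flip task_always_eager inside a shared
    # web or worker process, where the global mutation would leak into
    # unrelated concurrent tasks.
    argv0 = os.path.basename(sys.argv[0]) if sys.argv else ""
    if argv0 in {"gunicorn", "uwsgi", "celery"}:
        raise RuntimeError(
            "force_celery_eager() must not run inside a shared worker process"
        )
```

Calling this at the top of `force_celery_eager()` turns the docstring warning into an enforced precondition.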


Minor

6. Asymmetric averaging in compute_aggregates() should warn loudly

Answer metrics average only ok_results; retrieval metrics average all tasks. If extraction fails broadly, answer metrics look artificially high while retrieval correctly penalizes failures. A logger.warning when extraction_success_rate < 1.0 would make this asymmetry visible in CLI output.

7. describe() re-scans the filesystem after _ensure_loaded() (adapters/legalbench_rag.py)

describe() calls _discover_subset_files() again even when self._loaded is True. Cache the discovered subset names during _ensure_loaded() to avoid the redundant filesystem scan.

8. subsets parameter silently rejects the micro fixture name — document this

subsets=["micro"] raises ValueError because "micro" is not in LEGALBENCH_RAG_SUBSETS. The subsets parameter docstring should note that the micro fixture must be run without subsets= filtering.

9. No cleanup documentation for accumulated corpora

Each run_benchmark call creates a Corpus + Fieldset + Extract + Datacells with no cleanup mechanism. Model-sweep runs accumulate objects in the database. A note in benchmarking.md would help operators.


Positives

  • Path traversal protection in _load_document via .relative_to(self.corpus_dir) is exactly right.
  • TransactionTestCase for the integration test is correct per project guidelines (signals).
  • BENCHMARK_COLUMN_NAME_MAX_LEN = 128 is safely under the Column.name DB limit of 256.
  • Backward-compatible model_override addition to doc_extract_query_task is clean.
  • Macro-averaging in compute_aggregates() correctly matches the LegalBench-RAG reporting convention.
  • write_report=False in run_benchmark is good for test isolation.

@codecov

codecov bot commented Apr 13, 2026

github-actions bot and others added 3 commits April 13, 2026 06:16
…raction

- Fix _wait_for_document_ready to handle deleted documents (None guard)
  so the loop exits immediately instead of spinning until TimeoutError
- Raise NotImplementedError when use_eager_extraction=False since child
  tasks go to the real broker and may not complete before evaluation
- Add __post_init__ to BenchmarkReport to auto-compute aggregates
- Document Django-free constraint in adapters/__init__.py docstring

claude bot commented Apr 13, 2026

Code Review — PR #1239: Benchmark Harness for External RAG Datasets

Overall this is a well-architected addition. The adapter pattern, separation of concerns across loader/runner/metrics/report, and the macro-averaging convention are all solid. Below are specific issues worth addressing before merge.


Bugs / Correctness

1. force_celery_eager() is not thread-safe

loader.py mutates the global Celery app config with celery.conf.update(...). In any environment with concurrent workers (Django dev server with threading, gunicorn with multiple threads), this will corrupt unrelated task executions for the duration of the benchmark. Even with try/finally, the mutation window is unbounded and non-atomic.

Recommended fix: Use unittest.mock.patch scoped to the context, or better yet restrict the benchmark runner to a standalone subprocess / management command invocation with an explicit guard:

if not getattr(settings, "CELERY_TASK_ALWAYS_EAGER", False):
    os.environ["CELERY_TASK_ALWAYS_EAGER"] = "1"  # subprocess isolation

At minimum, add a hard assert that the current process is not a Celery worker before mutating.

2. Column name collisions when query prefix truncates identically

_make_column_name() in loader.py truncates to BENCHMARK_COLUMN_NAME_MAX_LEN (128 chars). Two tasks whose queries share the same first 128 characters will produce duplicate Column names inside the same Fieldset. The duplicate Column creation will silently succeed and both tasks will point to the same column, corrupting the extract grid.

Suggested fix: append a short uniqueness suffix when truncating:

if len(name) > BENCHMARK_COLUMN_NAME_MAX_LEN:
    suffix = f"…{task_id[-6:]}"
    name = name[:BENCHMARK_COLUMN_NAME_MAX_LEN - len(suffix)] + suffix

3. model_override default is duplicated / inconsistent with constants

doc_extract_query_task has a hardcoded inline default model=model_override or "openai:gpt-4o-mini" while constants/benchmarks.py defines BENCHMARK_DEFAULT_MODEL = "openai:gpt-4o-mini". These two must stay in sync manually. The task should use the constant directly:

from opencontractserver.constants.benchmarks import BENCHMARK_DEFAULT_MODEL
...
model=model_override or BENCHMARK_DEFAULT_MODEL,

Dead Code (CLAUDE.md: "No dead code")

4. use_eager_extraction parameter in runner.py raises NotImplementedError unconditionally on False

if not use_eager_extraction:
    raise NotImplementedError("Non-eager extraction not yet supported …")

This means the parameter can never be anything but True. Per CLAUDE.md, don't keep dead code. Either remove the parameter entirely (always force eager) or implement the non-eager path. A parameter that always raises is confusing to callers.


Test Coverage Gaps

5. probe_retrieval() has zero test coverage in the integration test

BenchmarkRunnerIntegrationTestCase mocks out AgentAPI.get_structured_response_from_document and verifies answer metrics, but the test does not assert anything about retrieval metrics (retrieval_recall_at_k, retrieval_precision_at_k, retrieval_char_iou) in the TaskResult. The _probe_retrieval_safely() path swallows errors, so a broken retrieval probe would silently produce zeros/Nones and the test would still pass.

Consider at minimum asserting that retrieval metrics are non-None (or are 0.0 with a known-zero corpus) in the integration test.

6. _wait_for_document_ready() timeout is untested

The 300-second polling loop has no test for the timeout path (TimeoutError case). If the document never becomes ready (e.g., parser crashes), this would hang indefinitely in production use. At minimum, add a unit test that covers timeout behavior with a mock that never returns a ready status.


Design / Minor Issues

7. gold_answer is the concatenation of raw snippet slices joined with \n

In legalbench_rag.py, the gold answer is built as "\n".join(answer_parts) where each part is the raw text slice. LegalBench-RAG tasks often have multiple gold snippets from different sections. An LLM that summarizes or paraphrases (rather than verbatim-quotes) the concatenation will score 0.0 exact match. This is fine and well-established in SQuAD methodology, but it means the benchmark's answer_exact_match scores will likely be near-zero for most models unless the model is tuned to reproduce verbatim text. Worth noting this explicitly in the docs caveats section so users aren't surprised.

8. BenchmarkReport computes aggregates in __post_init__ — unclear if the dataclass is frozen

If BenchmarkReport is not a frozen dataclass, task_results can be mutated after construction and the aggregates will be stale. If it is frozen, compute_aggregates() can't set attributes in __post_init__. Worth clarifying which invariant is intended (the description says it calls compute_aggregates() in __post_init__, which implies it's mutable).

9. Runner calls doc_extract_query_task.apply() directly, bypassing run_extract chord

This is mentioned in the PR but worth flagging: any pre/post-processing hooks wired to the normal run_extract task (signal handlers, result aggregation, status updates) will be skipped. If those hooks have side effects that are expected to run (e.g., updating Extract status to COMPLETE), the benchmark corpus will be left in an unexpected state after a run. Recommend at minimum verifying the Extract status is set correctly after the benchmark.


Positives Worth Calling Out

  • The adapter pattern is clean and the two-iterator interface (iter_documents / iter_tasks) is minimal and easy to implement for new benchmarks.
  • Path-traversal protection in _load_document() is appreciated.
  • Macro-averaging (equal task weight regardless of document count) matches LegalBench-RAG convention — good decision.
  • char_iou() correctly merges overlapping spans before computing intersection/union rather than double-counting.
  • The micro fixture makes CI feasible without downloading the full dataset.
  • _probe_retrieval_safely() correctly swallows retrieval errors to avoid aborting the entire run on a single bad probe.
  • Constants are properly isolated in opencontractserver/constants/benchmarks.py with an import guard against Django startup issues.

JSv4 added 3 commits April 13, 2026 20:49
Set processing_started on test documents to prevent the document
processing signal from triggering ingest_doc during test setUp.
Under certain pytest-xdist worker conditions, the session-scoped
signal disconnection fixture can be ineffective for TransactionTestCase
classes, causing Document.objects.create() to fire the processing
pipeline which fails on documents without actual files.
…ching

- Prevent column name collisions when truncating long names by appending
  a task_id suffix instead of silently truncating identically for tasks
  with shared query prefixes.
- Cache discovered subset names during _ensure_loaded() so describe()
  does not re-scan the filesystem after loading.

claude bot commented Apr 14, 2026

PR Review: Benchmark Harness for External RAG Datasets

This is a well-structured addition with a clean adapter pattern and good separation of concerns. A few things worth addressing before merge.


PRE-EXISTING TEST MODIFICATIONS (policy concern)

File: opencontractserver/tests/test_structural_annotations_graphql_backwards_compat.py

Three Document.objects.create() calls had processing_started=timezone.now(), backend_lock=False added. Per CLAUDE.md, pre-existing tests should not be touched without documented justification. The in-code comment explains what was added but not why this PR required the change. Was there a signal interaction triggered by the new benchmarks app being registered, or is this unrelated pre-existing flakiness? Please clarify.


BUG: empty aggregates when task_results=[]

File: opencontractserver/benchmarks/report.py, BenchmarkReport.__post_init__

If a report is created with an empty task list, aggregates stays as {} because __post_init__ checks "if self.task_results:" (empty list is falsy). The runner then raises KeyError on its own log line: int(report.aggregates["task_count"]).

Fix: always call compute_aggregates() — mean() already returns 0.0 for empty sequences, so the empty-list case is already handled in that method.
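That fix relies on the report's mean helper tolerating empty input, which the standard library's statistics.mean does not (it raises). A sketch of the assumed contract:

```python
def mean(values) -> float:
    # Empty-safe mean: returns 0.0 for an empty sequence instead of raising,
    # so aggregates over zero tasks degrade gracefully.
    vals = list(values)
    return sum(vals) / len(vals) if vals else 0.0
```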


METRIC EDGE-CASE ASYMMETRY

File: opencontractserver/benchmarks/metrics.py

token_f1("", "") returns 1.0 but char_iou([], []) returns 0.0. A task with no gold content and no predicted content scores perfectly on answer metrics but zero on retrieval metrics. This asymmetry should at minimum be documented so results are not misinterpreted when tasks with empty gold spans exist in the dataset.


N+1 DB QUERIES IN _evaluate

File: opencontractserver/benchmarks/runner.py, _evaluate()

cell.refresh_from_db() is called per cell in a loop — one SELECT per cell. For larger benchmarks (CUAD has ~13k questions) this is thousands of individual queries. Prefer a single bulk fetch:

cell_ids = [c.id for c in loaded.datacells]
cells_by_id = {c.id: c for c in Datacell.objects.filter(id__in=cell_ids)}

OPAQUE **kwargs ON _probe_retrieval_safely

File: opencontractserver/benchmarks/runner.py

def _probe_retrieval_safely(**kwargs) -> RetrievalResult:
    return probe_retrieval(**kwargs)

A typo in a kwarg name will silently pass through and only surface inside probe_retrieval. Mirror the explicit signature of probe_retrieval instead.


_run_extraction BYPASSES Extract LIFECYCLE TRACKING

File: opencontractserver/benchmarks/runner.py, _run_extraction()

Calling doc_extract_query_task.apply(...) directly bypasses the Extract.started/Extract.finished timestamp tracking that run_extract normally handles. The Extract record will have both fields as None after a benchmark run, which could confuse the UI if someone navigates to the corpus post-benchmark. Consider setting those fields manually in the runner, or threading model_override through run_extract.


PRODUCTION LIMITATION: use_eager_extraction=False raises NotImplementedError

The guard is clear and the PR description calls it out. Noted here as a reminder to file a follow-up issue: benchmarks cannot currently run against a real Celery deployment (e.g. staging), so all benchmark results come from in-process eager mode which may not reflect real production latency or retry behaviour.


MINOR: management command has no test coverage

No tests exercise run_benchmark.py. A test calling call_command("run_benchmark", ...) with a missing user or invalid path and asserting the expected CommandError would cheaply cover those error-handling paths.


POSITIVE NOTES

  • Path traversal correctly blocked in _load_document via .resolve().relative_to().
  • No new Django models — reusing Corpus/Extract/Datacell is the right call.
  • write_report=False flag is a thoughtful test affordance that avoids filesystem side effects in CI.
  • The two-pointer _intersection_length is correct (inputs are always pre-merged by char_iou) and O(n).
  • The micro fixture is well-sized: small enough for fast CI, rich enough to catch real adapter bugs.
  • Pure unit tests for metrics (no DB, no Celery) are exactly the right design for those pure functions.

JSv4 and others added 2 commits April 14, 2026 00:37
…us-generation-SWmBV

# Conflicts:
#	opencontractserver/tests/test_structural_annotations_graphql_backwards_compat.py
…s safety

- Fix BenchmarkReport.__post_init__ to always call compute_aggregates(),
  preventing KeyError when task_results is an empty list
- Replace per-cell refresh_from_db() loop in _evaluate() with a single
  bulk Datacell.objects.filter() query to avoid N+1 SELECTs
- Replace opaque **kwargs on _probe_retrieval_safely with explicit typed
  parameters matching probe_retrieval's signature
- Document intentional metric asymmetry between token_f1 (1.0 on empty)
  and char_iou (0.0 on empty) per upstream SQuAD/LegalBench conventions
@claude

claude bot commented Apr 14, 2026

Code Review — Benchmark Harness for External RAG Datasets

Overall this is a well-structured addition. The adapter/loader/runner separation is clean, the metrics are correctly implemented, and the end-to-end test using a micro fixture is a good approach for CI. The path-traversal guard in _load_document is a nice touch. Below are the issues I found, roughly ordered by severity.


Bugs / Correctness

1. normalize_answer type annotation lies

opencontractserver/benchmarks/metrics.py

def normalize_answer(text: str) -> str:   # ← says str
    if text is None:                       # ← handles None
        return ""

The annotation claims str but the function guards against None. Change the signature to text: str | None (or add assert text is not None and document the contract). As-is, mypy / pyright will flag callsites that correctly pass None.


2. Magic string not replaced by constant

opencontractserver/tasks/data_extract_tasks.py:290

model=model_override or "openai:gpt-4o-mini",

The PR introduces BENCHMARK_DEFAULT_MODEL = "openai:gpt-4o-mini" in opencontractserver/constants/benchmarks.py but the production fallback in doc_extract_query_task is still a bare string. Per the project's "No magic numbers" rule, this should reference a constant — either the existing BENCHMARK_DEFAULT_MODEL (if the intent is to keep them in sync) or a new DEFAULT_EXTRACT_MODEL constant in a more appropriate constants module.


Design / Architecture

3. use_eager_extraction=False is dead code

opencontractserver/benchmarks/runner.py:262–268

if not use_eager_extraction:
    raise NotImplementedError(
        "Non-eager extraction is not yet supported ..."
    )

The parameter exists in the signature but immediately raises. Every caller has to pass True or accept the exception. Either implement the async path, or remove the parameter entirely until it's needed (per CLAUDE.md's "No dead code" and "Don't design for hypothetical future requirements" rules).

4. No cleanup on failed runs

If load_benchmark_into_corpus or _run_extraction raises midway through, the partially-created Corpus, Extract, Fieldset, Document, and Datacell objects are left in the database with no rollback or teardown mechanism. At minimum the docstring for run_benchmark should call this out. A light mitigation would be to wrap the loader in a try/except that deletes the corpus on failure.

5. Subset validation is asymmetric with fixture use

LegalBenchRAGAdapter.__init__ validates requested subset names against LEGALBENCH_RAG_SUBSETS (the official four), so passing subsets=["micro"] raises ValueError even though the micro fixture uses that name. This is intentional, but it means the validation message is slightly confusing when users try to filter the micro fixture. A single clarifying note in the subsets docstring would help future contributors.


Minor / Nits

6. _intersection_length tie-break advances only j

opencontractserver/benchmarks/metrics.py:1821–1835

if a[i][1] < b[j][1]:
    i += 1
else:
    j += 1

When a[i][1] == b[j][1], only j advances. The algorithm is still correct (the next iteration correctly rejects any remaining overlap of the exhausted a[i] against b[j+1]) but it wastes one iteration per tie. Since spans are disjoint after merging, advancing both pointers on a tie would be more efficient:

if a[i][1] < b[j][1]:
    i += 1
elif b[j][1] < a[i][1]:
    j += 1
else:
    i += 1
    j += 1
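A runnable version of the two-pointer intersection with the tie-advance applied; both inputs are assumed sorted, disjoint, half-open [start, end) spans, as produced after merging:

```python
def intersection_length(a, b) -> int:
    """Total overlap between two sorted lists of disjoint [start, end) spans."""
    total = i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        if a[i][1] < b[j][1]:
            i += 1
        elif b[j][1] < a[i][1]:
            j += 1
        else:  # equal ends: both spans are exhausted, advance both pointers
            i += 1
            j += 1
    return total
```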

7. BENCHMARK_REGISTRY is hard to extend from outside the command

run_benchmark.py defines BENCHMARK_REGISTRY as a module-level dict. The docstring says "Adding a new benchmark means writing an adapter and registering it here", which is fine for now, but it locks extensibility to that one file. Consider elevating it to opencontractserver/benchmarks/registry.py so third-party or downstream adapters can import and extend it without touching the management command.

8. Integration test doesn't verify the mock was called

test_run_benchmark_produces_report_with_perfect_answer_metrics asserts answer metrics are 1.0, but doesn't assert _fake was actually called (e.g. via Mock(side_effect=fake_fn) + assert_called()). If the mock wiring breaks silently (e.g. the target path changes), the test would still pass with zeroes everywhere.
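A minimal sketch of the suggestion using only `unittest.mock`; the fake function and its argument are illustrative, not the PR's actual names:

```python
from unittest.mock import Mock


def _fake_extraction(query: str) -> str:
    """Stand-in for the patched extraction function."""
    return "gold answer"


tracked = Mock(side_effect=_fake_extraction)

# In the real test this would be installed via mock.patch(...) before
# run_benchmark executes; here we simulate one call through the mock.
answer = tracked("What is the governing law?")

tracked.assert_called()  # fails loudly if the patch wiring silently broke
```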


Summary table

| # | File | Severity | Category |
|---|------|----------|----------|
| 1 | benchmarks/metrics.py | Medium | Type-safety / correctness |
| 2 | tasks/data_extract_tasks.py | Low | Style (magic string) |
| 3 | benchmarks/runner.py | Low | Dead code |
| 4 | benchmarks/runner.py | Low | Robustness |
| 5 | benchmarks/adapters/legalbench_rag.py | Low | Docs/UX |
| 6 | benchmarks/metrics.py | Very low | Performance nit |
| 7 | benchmarks/management/commands/run_benchmark.py | Very low | Extensibility |
| 8 | tests/test_benchmarks.py | Very low | Test fidelity |

Items 1–2 are the most worth fixing before merge; the rest are take-or-leave.

github-actions bot and others added 3 commits April 14, 2026 06:28
…_EXTRACT_MODEL constant

- Fix normalize_answer(text: str) -> normalize_answer(text: str | None) to
  match actual behavior that guards against None input.
- Extract DEFAULT_EXTRACT_MODEL = "openai:gpt-4o-mini" to
  opencontractserver/constants/extraction.py as the single source of truth.
- Update doc_extract_query_task to use the constant instead of a bare string.
- Update BENCHMARK_DEFAULT_MODEL to re-export the shared constant so
  benchmarks and production tasks cannot silently diverge.
@claude

claude bot commented Apr 15, 2026

Code Review: Add Benchmark Harness for External RAG Datasets

Overall: This is a high-quality, well-architected PR. The separation of concerns is clean (adapters are Django-free, the loader owns DB mutations, the runner orchestrates, metrics are pure functions), the docstrings are thorough, and the micro-fixture approach for CI is smart. The changes to existing production code are minimal and backward-compatible. Comments below are mostly minor.


Correctness Issues

report.config is missing "finished_at" (runner.py)

report = BenchmarkReport(
    ...
    config=dict(config),   # snapshot taken HERE — no "finished_at" yet
    ...
)
config["finished_at"] = timezone.now().isoformat()  # mutates the local, not the snapshot

report.config (and therefore report.json) will not contain finished_at; only config.json will (because _write_run_config receives the mutable config dict after the key is added). Either set the key before snapshotting or pass the mutable dict to both places. Simplest fix:

config["finished_at"] = timezone.now().isoformat()
report = BenchmarkReport(..., config=dict(config), ...)

N+1 query risk in _evaluate (runner.py:262)

query = gold_entry.get("query", cell.column.query or "")

The cell.column access fires a SELECT for each cell without a prefetch. Since gold_by_datacell_id should always have the "query" key, the fallback should effectively never trigger — but if it ever does (e.g. in tests that don't pre-populate gold data), you get N+1 queries. Safer to add select_related("column") to the bulk-fetch:

refreshed = {c.id: c for c in Datacell.objects.filter(id__in=cell_ids).select_related("column")}

Design / Architecture Observations

force_celery_eager mutates global state — warn more prominently

The docstring is correct and the danger is real. In parallel test runs (e.g. pytest -n 4), two workers mutating current_app.conf simultaneously is a data race. Since CELERY_TASK_ALWAYS_EAGER=True is already set via @override_settings on BenchmarkRunnerIntegrationTestCase, use_eager_ingestion=True inside that test class causes force_celery_eager() to run redundantly. Consider adding an assertion or early-return when current_app.conf.task_always_eager is already True so the context manager becomes a no-op in that case, removing the race risk.
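One way to implement the suggested early-return, with Celery's `current_app` stood in by a plain namespace so the sketch is self-contained:

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Stand-in for celery.current_app; in the real loader this is the app config.
current_app = SimpleNamespace(conf=SimpleNamespace(task_always_eager=False))


@contextmanager
def force_celery_eager():
    if current_app.conf.task_always_eager:
        # Already eager (e.g. CELERY_TASK_ALWAYS_EAGER set via test settings):
        # become a no-op instead of redundantly mutating global config.
        yield
        return
    current_app.conf.task_always_eager = True
    try:
        yield
    finally:
        current_app.conf.task_always_eager = False
```

Nesting the context manager under `@override_settings` then never flips the flag back off early, which removes the race described above.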

NotImplementedError for use_eager_extraction=False is a footgun

Exposing a parameter and immediately raising on one of its valid values leaks an incomplete abstraction to callers. Either remove the parameter entirely (hard-code eager) and add a TODO comment, or keep the flag but rename it to something that signals "future work" in its docstring. As written, calling run_benchmark(..., use_eager_extraction=False) will always fail at runtime with no static warning.

BENCHMARK_COLUMN_NAME_MAX_LEN = 128 — verify against Column.name field

The constant is well-named, but if Column.name has a max_length smaller than 128 at the DB/model layer, truncation will still raise DataError. Worth a one-line comment pointing to the Column model field, or an assertion in _make_column_name guarding against that.


Minor Issues

adapters/__init__.py breaks the "Django-free" guarantee transitively

The docstring says adapters must stay Django-free, but the __init__.py eagerly imports LegalBenchRAGAdapter. If a future adapter pulls in a Django model and someone puts it in __init__.py, the module-level import will blow up at startup with AppRegistryNotReady. Given the benchmarks/__init__.py deliberately avoids re-exporting symbols for exactly this reason, doing the opposite in adapters/__init__.py is inconsistent. Consider removing the re-exports and requiring callers to import directly from the submodule (as the root package docstring already recommends).

describe() does disk I/O on an unloaded adapter

def describe(self) -> dict[str, Any]:
    subsets = (
        list(self._subset_names) if self._loaded
        else list(self._discover_subset_files().keys())   # filesystem scan
    )

describe() is called from the runner to populate config.json — which happens after load_benchmark_into_corpus, so _loaded will always be True at that point. The disk-scan branch is dead in production. If it's kept for standalone use, it should be documented; otherwise removing it simplifies the code.

Ingestion inside force_celery_eager not isolated per document

If one document's ingestion raises (e.g. import_content fails), the exception propagates out of the with ingestion_ctx: block and leaves partially-created Corpus/Document rows without a compensating cleanup. For a CLI benchmark tool this is acceptable, but adding a try/except with a logged warning (not a hard abort) would match the defensive pattern used elsewhere in the loader.


Test Coverage

The test suite is solid. A few gaps worth noting for future PRs (not blockers):

  • Management command is not exercised at all. A simple call_command("run_benchmark", ...) test with the micro fixture would catch CLI argument wiring bugs.
  • probe_retrieval is only implicitly tested via run_benchmark. A direct unit test checking what happens with empty/malformed annotation JSON would be valuable given the reliance on TxtParser's exact payload shape.
  • Report serialization round-trip: test_run_benchmark_writes_report_files_when_requested checks file existence and top-level keys but doesn't re-parse and validate span format (list-of-lists vs. list-of-tuples is the common footgun here after dataclasses.asdict).

Nits

  • _INGEST_POLL_INTERVAL_S = 0.2 and _DEFAULT_INGEST_TIMEOUT_S = 300 are module-level constants (good) but live in loader.py rather than constants/benchmarks.py. If callers ever need to configure them, they'd need to import from an unexpected place.
  • BenchmarkReport.__post_init__ calls compute_aggregates() unconditionally, which means aggregates are computed at construction time. The comment # compute_aggregates() is auto-called by __post_init__ when task_results is non-empty in runner.py is misleading — it's called regardless of whether task_results is empty.
  • The _rmtree / _make_tmp_run_dir helpers in BenchmarkRunnerIntegrationTestCase duplicate tempfile.mkdtemp + shutil.rmtree logic that could use tempfile.TemporaryDirectory as a context manager instead (cleaner with addCleanup).

Existing Code Change

The one-line change to doc_extract_query_task is clean and backward-compatible:

- model="openai:gpt-4o-mini",  # Fast and reliable
+ model=model_override or DEFAULT_EXTRACT_MODEL,

DEFAULT_EXTRACT_MODEL = "openai:gpt-4o-mini" so no production behavior changes when model_override is None. The model_override parameter is correctly typed as str | None = None. ✓


Summary: The "finished_at" snapshot bug and the N+1 column access in _evaluate are the two issues I'd fix before merging. Everything else is either a design note for follow-up or a nit.

@claude

claude bot commented Apr 15, 2026

Code Review

Overview

This PR introduces a well-structured benchmark harness for evaluating OpenContracts' RAG pipeline against external datasets, starting with LegalBench-RAG. The architecture is clean, the separation of concerns is solid, and the documentation is thorough. Below are specific observations.


Strengths

  • Adapter pattern is well-designed. The BaseBenchmarkAdapter ABC with iter_documents() / iter_tasks() / describe() is minimal and will be easy to extend.
  • Django-free adapters. Keeping adapters free of Django dependencies is the right call to avoid AppRegistryNotReady issues and simplify unit testing.
  • Industry-standard metrics. SQuAD-style exact match / token F1 and character-span IoU are exactly what you want here. The normalize_answer() implementation matches the reference SQuAD script closely.
  • Frozen dataclasses. BenchmarkDocument, BenchmarkTask, TaskResult being frozen prevents accidental mutations in long pipeline runs.
  • Backward-compatible model_override. Adding model_override: str | None = None to doc_extract_query_task is clean — existing callers are unaffected.
  • Micro fixture. Synthetic fixture means the test suite can run in CI without downloading the full LegalBench-RAG dataset.

Issues & Concerns

Functional / Correctness

1. Asymmetric empty-span handling between char_iou and token_f1

token_f1 returns 1.0 when both prediction and gold are empty; char_iou returns 0.0 for the same case. This asymmetry will silently produce different aggregate metrics depending on which function gets called on an "empty" task. Either align both to return 1.0 (vacuously correct) or 0.0 (conservatively penalizing), and document the choice explicitly in the docstring.

2. force_celery_eager() context manager is not thread-safe

# loader.py
@contextmanager
def force_celery_eager():
    old = current_app.conf.task_always_eager
    current_app.conf.task_always_eager = True
    try:
        yield
    finally:
        current_app.conf.task_always_eager = old

Mutating current_app.conf globally will affect any concurrent Celery task submission on the same worker process. If this is ever run in a shared environment or parallel test run, it could cause non-benchmark tasks to execute eagerly. Consider documenting this limitation prominently, or using Celery's override_settings pattern instead.

3. Column name collision is silently lossy

# loader.py – collision avoidance by truncating + counter suffix

When two task queries truncate to the same 128-char prefix, a counter suffix is appended — but this means the column name no longer corresponds to the query text in any readable way. Users reading report.csv won't be able to tell which column maps to which task without consulting gold.json. Consider storing the full query in TaskResult metadata to preserve traceability.

4. use_eager_extraction=False raises an unhandled NotImplementedError in runner

The runner path for async extraction raises NotImplementedError, which surfaces as a raw traceback in a management command invocation rather than a clean CLI error message. The management command should catch this and print a user-friendly message.

5. Unicode / character offset alignment

The loader reads document text as Python str but character spans are computed against the raw file bytes in some paths. If a document contains multi-byte UTF-8 characters, the Python str character offsets (Unicode code points) and the byte offsets produced by the LegalBench-RAG span annotation tooling may diverge. This should be explicitly validated or documented as a known limitation.
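The divergence is easy to demonstrate: byte offsets and code-point offsets disagree as soon as any multi-byte character precedes the span. The sample text below is illustrative:

```python
text = "§ 2. Indemnification"  # '§' is 2 bytes in UTF-8, 1 code point
needle = "Indemnification"

char_start = text.index(needle)  # offset in Unicode code points
byte_start = text.encode("utf-8").index(needle.encode("utf-8"))  # byte offset

# Converting between the two requires encoding the prefix, not arithmetic:
converted = len(text[:char_start].encode("utf-8"))
```

If the annotation tooling emitted byte offsets, applying them directly to a Python `str` would shift every span after the first non-ASCII character.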


Performance

6. Polling loop in loader is unbounded

# loader.py – polling for doc processing
while not doc.processed:
    time.sleep(POLL_INTERVAL)
    doc.refresh_from_db()

There is no timeout. If document processing fails silently, this will spin forever. Add a max_retries / deadline guard.
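A deadline-bounded version of the poll, reusing the interval/timeout constant names mentioned elsewhere in this review (the function name is a hypothetical extraction of the loop):

```python
import time

_INGEST_POLL_INTERVAL_S = 0.2
_DEFAULT_INGEST_TIMEOUT_S = 300


def wait_for_processing(doc, timeout_s: float = _DEFAULT_INGEST_TIMEOUT_S) -> None:
    """Poll until doc.processed is True, or raise after timeout_s seconds."""
    deadline = time.monotonic() + timeout_s
    while not doc.processed:
        if time.monotonic() > deadline:
            raise TimeoutError(
                f"Document {doc.id} not processed within {timeout_s}s"
            )
        time.sleep(_INGEST_POLL_INTERVAL_S)
        doc.refresh_from_db()
```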

7. _evaluate() bulk-fetches datacells but still issues per-task retrieval queries

The N+1 on ORM reads is addressed, but probe_retrieval() is called once per task inside a loop, each issuing a separate vector store query. For large benchmarks (1000+ tasks) this will be very slow. Consider batching or parallelizing retrieval probes.


Security

8. Path traversal protection is incomplete

# adapters/legalbench_rag.py
resolved = (base_dir / file_path).resolve()
if not str(resolved).startswith(str(base_dir.resolve())):
    ...

The string prefix check is fragile — a path like /tmp/benchmarks_extra/… would pass the check if base_dir is /tmp/benchmarks. Use resolved.is_relative_to(base_dir.resolve()) (Python 3.9+) which is semantically correct.
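The sibling-directory case can be shown without touching the real filesystem by using `PurePosixPath` (paths below are the example from this comment):

```python
from pathlib import PurePosixPath

base = PurePosixPath("/tmp/benchmarks")
evil = PurePosixPath("/tmp/benchmarks_extra/corpus.json")

prefix_ok = str(evil).startswith(str(base))  # True: fragile check passes
semantic_ok = evil.is_relative_to(base)      # False: correctly rejected
```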

9. DEFAULT_EXTRACT_MODEL is hardcoded to "openai:gpt-4o-mini"

# opencontractserver/constants/extraction.py
DEFAULT_EXTRACT_MODEL = "openai:gpt-4o-mini"

This is a new constant that affects the production extraction pipeline, not just benchmarks. Previously, the model was likely configured elsewhere. Verify this doesn't silently change the default model for all production users who haven't explicitly configured a model override.


Code Quality / Style

10. from opencontractserver.constants.benchmarks import * in __init__.py

Wildcard re-exports in constants/__init__.py make it impossible to statically determine what names are available without reading every constants file. Prefer explicit imports: from opencontractserver.constants.benchmarks import BENCHMARK_DEFAULT_MODEL, BENCHMARK_DEFAULT_TOP_K, BENCHMARK_COLUMN_NAME_MAX_LEN.

11. BENCHMARK_REGISTRY is a module-level dict in the management command

BENCHMARK_REGISTRY = {"legalbench_rag": LegalBenchRAGAdapter}

This is fine for now but will diverge from the adapter pattern as new benchmarks are added. Consider moving this to a central registry in benchmarks/__init__.py or benchmarks/adapters/__init__.py so the management command, any future API endpoint, and tests can all share the same source of truth.

12. describe() return type is unspecified

BaseBenchmarkAdapter.describe() has no return type annotation and returns a free-form string. If you later want to surface benchmark metadata in a UI or JSON output, you'll need a structured type. A BenchmarkInfo dataclass now would avoid a future breaking change.


Test Coverage

  • Pure metric tests are thorough. Good.
  • Adapter tests cover the happy path and subset/limit filtering.
  • The integration test mocks the LLM correctly and validates report output.
  • Missing: A test that exercises the model_override kwarg with a real (mocked) task dispatch — the current integration test uses perfect gold answers but doesn't verify the override actually reaches doc_extract_query_task.
  • Missing: A test for the column name collision/truncation logic.
  • Missing: A test for the polling timeout guard (once added).

Minor / Nits

  • mean() in metrics.py should handle the empty-list case explicitly rather than letting statistics.mean raise StatisticsError.
  • The run_benchmark management command prints a formatted table but doesn't return a non-zero exit code on partial failures (some tasks errored). Consider self.stderr.write(...) + raise CommandError when report.error_count > 0.
  • Fixture JSON file paths use forward slashes on all platforms — good. Confirm the adapter's path normalization also handles Windows paths if portability matters.
  • The CHANGELOG.md entry is detailed but the "Technical Details" section duplicates some of the "Added" content.
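The first nit above (guarding `mean()` against an empty list) is a one-liner; a possible sketch:

```python
from statistics import mean as _mean


def safe_mean(values: list[float]) -> float:
    """Mean of values, or 0.0 for an empty list instead of StatisticsError."""
    return _mean(values) if values else 0.0
```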

Summary

The architecture is sound and the code is well-written. The three issues I'd flag as blocking before merge are:

  1. DEFAULT_EXTRACT_MODEL potentially changing production behavior — verify this is intentional and won't break existing deployments.
  2. Unbounded polling loop — add a timeout to prevent infinite hangs in CI.
  3. Path traversal string prefix check — replace with is_relative_to().

Everything else is non-blocking but worth addressing before this ships to production users.

…us-generation-SWmBV

# Conflicts:
#	CHANGELOG.md
@claude

claude bot commented Apr 16, 2026

Code Review: Benchmark Harness for External RAG Datasets

Overview

This is a well-structured addition: a clean adapter pattern, independent extraction/retrieval evaluation axes, SQuAD-compatible metrics, and a full test suite with a micro fixture. The separation of concerns is thoughtful, and the security-conscious path traversal check in the adapter is a good touch. A few issues need attention before merging.


Critical: CHANGELOG Merge Conflict Resolution Drops Two Existing Entries

File: CHANGELOG.md

The diff resolves a pre-existing conflict by removing both sides and inserting only the benchmark entry. This silently drops:

  1. The Frontend E2E integration tests entry (from the left side of the conflict)
  2. The Bounded fullDatacellList payload for extract grid embeds entry (from the right side)

Both entries need to be preserved alongside the new benchmark bullet. The current state of the CHANGELOG will be missing those two features entirely.


Bug: Misleading Comment About __post_init__

File: opencontractserver/benchmarks/runner.py — near the BenchmarkReport instantiation

The comment says compute_aggregates() is auto-called when task_results is non-empty, but it is called unconditionally in __post_init__ — including when task_results == []. Remove the "when task_results is non-empty" clause.


N+1 Query: Missing select_related in _evaluate

File: opencontractserver/benchmarks/runner.py

The bulk-fetch correctly avoids N+1 on the Datacell rows, but the following line inside the loop:

query = gold_entry.get("query", cell.column.query or "")

triggers one extra lazy SELECT per cell to load the related Column. Add .select_related("column") to the queryset.


Magic Numbers in _make_column_name

File: opencontractserver/benchmarks/loader.py

The values 64 and 61 used for query-preview truncation are undocumented inline numbers. Per the project convention, they should be named constants in opencontractserver/constants/benchmarks.py.


use_eager_extraction=False Is Accepted But Always Raises

File: opencontractserver/benchmarks/runner.py

The parameter is in the public signature but immediately raises NotImplementedError when set to False. This looks like a supported option that silently doesn't work. Either remove it until the feature is ready, or make the limitation obvious in the signature docstring.


Minor: Inconsistent Ellipsis Characters in _make_column_name

File: opencontractserver/benchmarks/loader.py

The query-truncation uses ASCII "..." (3 chars) while the column-name truncation uses Unicode "…" (1 char, U+2026). Pick one and be consistent within the same function.


Missing Type Annotations on user Parameters

Several public functions (load_benchmark_into_corpus, run_benchmark, _ingest_benchmark_document) take a user argument with no type annotation while the rest of the signature is fully typed. Using AbstractBaseUser or a TYPE_CHECKING guard would improve IDE support and make the API easier to use correctly.


Test Coverage Gap: Retrieval Probe Not Directly Tested

opencontractserver/benchmarks/retrieval.py is only exercised indirectly through the integration test. The _annotation_char_span helper has non-trivial branching (None payload, non-dict, end < start, missing keys). Targeted unit tests would be cheap to add and guard against regressions.
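A hypothetical sketch of the branch coverage suggested above; the real _annotation_char_span signature is not shown in this excerpt, so assume it maps an annotation's JSON payload to an optional (start, end) tuple:

```python
def annotation_char_span(payload):
    """Extract a (start, end) char span from a payload dict, or None."""
    if not isinstance(payload, dict):  # covers None and non-dict payloads
        return None
    start = payload.get("start")
    end = payload.get("end")
    if start is None or end is None:   # missing keys
        return None
    if end < start:                    # inverted span
        return None
    return (start, end)
```

Each `return None` branch above corresponds to one of the cases listed in this comment, so a table-driven unit test covers all of them cheaply.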


Positive Notes

  • Path traversal protection in _load_document (resolve() then relative_to()) is correct and a good pattern.
  • Bulk-fetching datacells before the evaluation loop shows good instinct — just needs the select_related fix above.
  • The micro fixture is excellent: no external data download for CI and it doubles as a living schema example.
  • force_celery_eager() is properly scoped as a context manager with a clear safety warning in the docstring.
  • @override_settings in the test class is the right approach for isolation rather than relying on the runtime context manager.
  • _intersection_length uses the correct two-pointer O(n+m) algorithm on sorted merged intervals.
  • describe() returning serializable adapter configuration into config.json is great for run reproducibility.
