docs: add first nightly research report (2026-03-06)

sjarmak · sjarmak · commit 300163fc7720 · 2026-03-06T23:06:42.000-05:00
Covers task registry metadata drift, missing CI test gate,
duplicated constants across 8+ scripts, and testing gaps.

Recommends task registry reconciliation + CI test workflow
as highest-impact next feature.

Also adds .gitignore exception for reports/nightly/.
diff --git a/.gitignore b/.gitignore
@@ -57,6 +57,8 @@ scripts/plot_csb_mcp_blog_figures.py
 ralph/
 ralph-*/
 reports/
+!reports/nightly/
+!reports/nightly/**
 eval_reports/
 tmp/
 *.log
diff --git a/reports/nightly/2026-03-06-review.md b/reports/nightly/2026-03-06-review.md
@@ -0,0 +1,243 @@
+# Nightly Research Report - 2026-03-06
+
+First nightly review of the CodeContextBench repository. Covers code quality,
+configuration integrity, testing infrastructure, and architectural debt.
+
+---
+
+## 1. Code & Architecture Review
+
+### 1.1 CRITICAL: Task Registry Metadata Drift
+
+`configs/selected_benchmark_tasks.json` has multiple internal inconsistencies:
+
+| Issue | Details |
+|-------|---------|
+| **total_selected vs target_total** | `total_selected: 404` but `target_total: 370`. The file contains 404 task entries but only 20 have `status: "active"`; the remaining 384 have no status field at all. |
+| **per_suite counts wrong** | 14 of 20 suites have `per_suite` metadata that doesn't match the actual number of tasks in the file. Example: `csb_sdlc_fix` says 26 but file has 34 entries. |
+| **Missing suite in mcp_unique_suites** | `csb_org_crossrepo` (14 tasks) appears in `per_suite` but is absent from the `mcp_unique_suites` array, which claims 11 suites but lists only 10. |
+
+**Impact**: Any script that relies on these metadata counts for DOE power calculations, progress
+reporting, or suite-level statistics will produce incorrect results. `sync_task_metadata.py`
+exists, but it apparently has not been run since the drift was introduced.
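+
+A consistency check along these lines could surface the drift automatically. This is a
+minimal sketch, not the behaviour of `sync_task_metadata.py`: it assumes the registry keeps
+its entries in a top-level `tasks` list and that each entry names its suite in a `suite`
+field, both of which are hypothetical. `total_selected` and `per_suite` are the metadata
+keys described above.
+
+```python
+from collections import Counter
+
+
+def find_registry_drift(registry: dict) -> list[str]:
+    """Return human-readable mismatches between metadata and task entries."""
+    # Assumption: entries live under "tasks" and each carries a "suite" key.
+    tasks = registry.get("tasks", [])
+    problems = []
+
+    declared_total = registry.get("total_selected")
+    if declared_total != len(tasks):
+        problems.append(f"total_selected={declared_total} but {len(tasks)} entries")
+
+    actual = Counter(task.get("suite", "<missing>") for task in tasks)
+    declared = registry.get("per_suite", {})
+    for suite in sorted(set(actual) | set(declared)):
+        want, have = declared.get(suite, 0), actual.get(suite, 0)
+        if want != have:
+            problems.append(f"{suite}: per_suite={want} but {have} entries")
+    return problems
+
+
+# Tiny in-memory registry, so the sketch runs without the real file.
+sample = {
+    "total_selected": 3,
+    "per_suite": {"csb_sdlc_fix": 1, "csb_org_crossrepo": 1},
+    "tasks": [
+        {"suite": "csb_sdlc_fix"},
+        {"suite": "csb_sdlc_fix"},
+        {"suite": "csb_org_crossrepo"},
+    ],
+}
+for problem in find_registry_drift(sample):
+    print(problem)
+```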
+
+### 1.2 Repo Health Gate Failing
+
+`python3 scripts/repo_health.py --quick` currently fails:
+
+```
+docs_consistency: FAILED
+  - missing_ref_all_docs:docs/DAYTONA.md:scripts/daytona_snapshot_cleanup.py
+```
+
+`docs/DAYTONA.md` line 270 references `scripts/daytona_snapshot_cleanup.py`, which does not
+exist (it's listed in `.gitignore` but was never committed). This blocks CI on push to main.
+
+### 1.3 Failing Test
+
+`tests/test_abc_audit.py::TestR2NoContamination::test_fail_contaminated` fails (206 pass, 1 fail).
+
+**Root cause**: The contamination regex in `scripts/abc_audit.py` (lines 569-574) was narrowed
+to require a suffix after "sourcegraph" (e.g., "sourcegraph mcp", "sourcegraph tools"), but the
+test fixture at `tests/test_abc_audit.py` line 81 still uses bare "sourcegraph" which no longer
+matches. Either the regex or the test fixture needs updating.
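+
+The mismatch is easy to reproduce in isolation. The patterns and sample strings below are
+illustrative stand-ins, not the actual expressions in `scripts/abc_audit.py` or the real
+fixture text; they only show how requiring a suffix after "sourcegraph" stops a bare
+mention from matching.
+
+```python
+import re
+
+# Hypothetical patterns for illustration; the real regex in abc_audit.py may differ.
+BROAD = re.compile(r"\bsourcegraph\b", re.IGNORECASE)
+NARROWED = re.compile(r"\bsourcegraph\s+(?:mcp|tools)\b", re.IGNORECASE)
+
+# Invented sample strings: one bare mention, one with a suffix.
+bare_text = "Use sourcegraph to find the definition."
+suffixed_text = "Use the Sourcegraph MCP server to search."
+
+for name, pattern in (("broad", BROAD), ("narrowed", NARROWED)):
+    # Print whether each pattern flags the bare and the suffixed mention.
+    print(name, bool(pattern.search(bare_text)), bool(pattern.search(suffixed_text)))
+```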
+
+### 1.4 Duplicated Constants Across 8+ Scripts
+
+Two critical constants are copy-pasted with **inconsistent values** across scripts:
+
+**SKIP_PATTERNS** (6+ variations found):
+
+| Script | Missing from its list |
+|--------|-----------------------|
+| `aggregate_status.py` | `__archived` |
+| `generate_manifest.py` | `__archived`, `__aborted` |
+| `audit_traces.py` | `__archived`, `__aborted`, `__v1_hinted` |
+| `normalize_retrieval_events.py` | None (superset of the other lists) |
+
+**DIR_PREFIX_TO_SUITE** mapping (~70 lines): Identical block duplicated in `aggregate_status.py`,
+`generate_manifest.py`, `audit_traces.py`, `cost_breakdown_analysis.py`, `ds_audit.py`,
+`fix_h3_tokens.py`, `ir_analysis.py`, `mcp_audit.py`. Any new suite prefix must be added to
+all 8 files.
+
+**Recommendation**: Extract both into `scripts/config_utils.py` (which already exists but
+doesn't contain these).
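+
+One possible shape for the shared definitions is sketched below. The three `SKIP_PATTERNS`
+entries are the ones named in the table above, and the full list may be longer. The
+`DIR_PREFIX_TO_SUITE` entries and both helper functions are hypothetical placeholders,
+since the real mapping (~70 lines) is not reproduced in this report.
+
+```python
+from typing import Optional
+
+# Patterns named in this report; the canonical list may contain more.
+SKIP_PATTERNS = ("__archived", "__aborted", "__v1_hinted")
+
+# Placeholder entries only: the prefixes on the left are invented for illustration.
+DIR_PREFIX_TO_SUITE = {
+    "sdlc_fix_": "csb_sdlc_fix",
+    "org_crossrepo_": "csb_org_crossrepo",
+}
+
+
+def should_skip(dir_name: str) -> bool:
+    """Return True if a run directory name contains any skip pattern."""
+    return any(pattern in dir_name for pattern in SKIP_PATTERNS)
+
+
+def suite_for_dir(dir_name: str) -> Optional[str]:
+    """Map a run directory name to its suite, or None if no prefix matches."""
+    for prefix, suite in DIR_PREFIX_TO_SUITE.items():
+        if dir_name.startswith(prefix):
+            return suite
+    return None
+```
+
+Each script would then drop its local copy and import the names from `config_utils`,
+assuming `scripts/` is on the import path when the script runs.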
+
+### 1.5 No Shared Utility Library
+
+166 Python scripts with no shared library for common operations:
+- JSON file loading with error handling (5+ local `load_json()` definitions)
+- Run directory discovery and filtering
+- Suite name resolution from legacy prefixes
+- Logging setup
+
+The `scripts/csb_metrics/` package handles metrics extraction well, but operational scripts
+outside it re-implement the same patterns independently.
+
+### 1.6 Script Size Violations
+
+Several scripts significantly exceed the 800-line guideline:
+
+| Script | Lines | Purpose |
+|--------|------:|---------|
+| `context_retrieval_agent.py` | 3,299 | IR retrieval agent |
+| `export_official_results.py` | 2,866 | Results export + HTML browser |
+| `ir_analysis.py` | 2,152 | IR metrics analysis |
+| `daytona_curator_runner.py` | 1,628 | Oracle curation runner |
+| `comprehensive_analysis.py` | 1,596 | Full analysis pipeline |
+| `validate_on_contextbench.py` | 1,435 | Submission validation |
+| `audit_official_scores.py` | 1,365 | Score auditing |
+
+### 1.7 Stale Code and References
+
+| Item | Location | Status |
+|------|----------|--------|
+| `validators.py` duplication gotcha | `CLAUDE.md` line 84 | **Stale** - no validators.py files exist in repo |
+| `ccb_build` benchmark reference | `scripts/daytona_poc_runner.py` line 36 | **Broken** - benchmark was renamed/removed |
+| Wave integration scripts | `scripts/integrate_answer_json_wave{1,2,3}.py` | ~90% code overlap, should be parameterized |
+| 3 overlapping Daytona runners | `daytona_poc_runner.py`, `daytona_runner.py`, `daytona_curator_runner.py` | High duplication in SDK init, error handling, result parsing |
+
+### 1.8 Unused JSON Schemas
+
+9 of 13 schemas in `schemas/` are never referenced by any script:
+
+- `economic_metrics_schema.json`, `failure_analysis_schema.json`, `governance_report_schema.json`
+- `icp_profile_schema.json`, `mcp_task_spec.schema.json`, `reliability_metrics_schema.json`
+- `run_triage.schema.json`, `use_case_registry.schema.json`, `workflow_metrics_schema.json`
+
+Only 4 scripts use `jsonschema` validation at all: `generate_enterprise_report.py`,
+`validate_submission.py`, `package_submission.py`, `ingest_judge_results.py`.
+
+---
+
+## 2. Feature & UX Improvements
+
+### 2.1 Testing Infrastructure Is Nearly Absent
+
+| Metric | Current | Target |
+|--------|---------|--------|
+| Scripts with tests | 7 of 166 (4.2%) | Critical-path scripts at minimum |
+| Test configuration | None (no pytest.ini, pyproject.toml, or .coveragerc) | pyproject.toml with pytest + coverage |
+| CI test workflow | None | GitHub Actions workflow running pytest |
+| Coverage reporting | None | pytest-cov with threshold gate |
+
+**Critical untested scripts**: `repo_health.py` (master commit gate), `daytona_runner.py`
+(task execution engine), `validate_task_run.py` (post-run validation), `aggregate_status.py`
+(results processing).
+
+### 2.2 Missing MCP Instructions in Org Tasks
+
+`instruction_mcp.md` is missing from significant portions of Org suites:
+
+| Suite | Missing | Total | Gap |
+|-------|--------:|------:|----:|
+| `csb_org_onboarding` | 11 | 28 | 39% |
+| `csb_org_security` | 12 | 26 | 46% |
+| `csb_org_migration` | 9 | 28 | 32% |
+| `csb_org_compliance` | 8 | 20 | 40% |
+| `csb_org_platform` | 6 | 20 | 30% |
+
+All SDLC suites have 100% coverage. The gap means MCP-enabled runs on these Org tasks
+may not get task-specific MCP guidance, potentially affecting MCP vs baseline comparisons.
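+
+A table like the one above can be regenerated with a short directory walk. The sketch
+assumes a layout of `<root>/<suite>/<task>/instruction_mcp.md`, which is a guess at the
+repository structure rather than something this report confirms; the function name is
+also hypothetical.
+
+```python
+from pathlib import Path
+
+
+def mcp_instruction_gaps(benchmarks_root: Path) -> dict[str, tuple[int, int]]:
+    """Return {suite: (missing, total)} counting task dirs without instruction_mcp.md."""
+    gaps = {}
+    # Assumption: each immediate subdirectory of the root is a suite,
+    # and each subdirectory of a suite is one task.
+    for suite_dir in sorted(p for p in benchmarks_root.iterdir() if p.is_dir()):
+        task_dirs = [p for p in suite_dir.iterdir() if p.is_dir()]
+        missing = sum(
+            1 for task in task_dirs if not (task / "instruction_mcp.md").is_file()
+        )
+        gaps[suite_dir.name] = (missing, len(task_dirs))
+    return gaps
+```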
+
+### 2.3 Inconsistent CLI Patterns
+
+~16 scripts use raw `sys.argv` instead of `argparse`. These produce no help text and
+fail silently with wrong arguments. Standardizing on argparse would improve operability.
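+
+As a rough illustration of the change, a script that currently indexes `sys.argv` directly
+could move to a parser like the one below. The script purpose, argument names, and
+defaults are invented for the example.
+
+```python
+import argparse
+
+
+def build_parser() -> argparse.ArgumentParser:
+    # Hypothetical arguments; real scripts would declare their own.
+    parser = argparse.ArgumentParser(description="Summarize a run directory.")
+    parser.add_argument("run_dir", help="path to the run directory to process")
+    parser.add_argument("--suite", default=None, help="restrict output to one suite")
+    parser.add_argument("--verbose", action="store_true", help="print per-task details")
+    return parser
+
+
+# Parse an explicit argument list so the sketch runs without a command line.
+args = build_parser().parse_args(["runs/example", "--suite", "csb_sdlc_fix"])
+print(args.run_dir, args.suite, args.verbose)
+```
+
+With a parser in place, `--help` prints usage text, and a missing or unrecognized argument
+exits with an error message and a non-zero status instead of failing silently.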
+
+### 2.4 Dashboard Removed
+
+The `dashboard/` directory referenced in MEMORY.md (`dashboard/app.py`, Streamlit) no longer
+exists. The `data/codecontextbench.db` reference is also stale. The only interactive
+visualization is the `export_official_results.py --serve` HTML browser.
+
+### 2.5 Four Linux Debug Tasks Have Invalid GT Schemas
+
+These tasks in `csb_sdlc_debug` use `{buggy_files[], buggy_functions[]}` instead of the
+standard GT format, making them incompatible with the checklist scorer:
+
+- `linux-acpi-backlight-fault-001`
+- `linux-hda-intel-suspend-fault-001`
+- `linux-iwlwifi-subdevice-fault-001`
+- `linux-nfs-inode-revalidate-fault-001`
+
+---
+
+## 3. Research Recommendations
+
+### 3.1 Add `pyproject.toml` with Test Infrastructure
+
+A single `pyproject.toml` would provide:
+- pytest configuration (test paths, markers, fixture discovery)
+- Coverage settings (minimum threshold, branch coverage)
+- Package metadata for `scripts/csb_metrics/`
+- Dependency pinning for development tools
+
+### 3.2 CI Test Workflow
+
+Add `.github/workflows/test.yml` to run `pytest tests/ --cov` on push/PR. The existing
+`repo_health.yml` and `docs-consistency.yml` handle config validation but nothing runs
+the 207 existing tests. This means regressions (like the current test_abc_audit failure)
+go undetected.
+
+### 3.3 Consolidate Shared Constants into `config_utils.py`
+
+`scripts/config_utils.py` already exists but doesn't contain the most-duplicated patterns.
+Adding `SKIP_PATTERNS`, `DIR_PREFIX_TO_SUITE`, and common JSON/path utilities here would
+eliminate 8+ duplication sites and make suite renames a single-file change.
+
+### 3.4 Validate Schemas at Runtime
+
+The 13 JSON schemas in `schemas/` represent a significant investment. Using them for runtime
+validation in `generate_manifest.py`, `aggregate_status.py`, and task scaffolding scripts
+would catch data format regressions early. Consider `jsonschema` validation in the
+`csb_metrics` package.
+
+### 3.5 Script Consolidation Candidates
+
+| Current | Proposed |
+|---------|----------|
+| `integrate_answer_json_wave{1,2,3}.py` (3 files, ~1150 lines) | Single parameterized `integrate_answer_json.py --wave N` |
+| `daytona_poc_runner.py` + `daytona_runner.py` | Merge PoC into main runner with `--quick` flag |
+| `scaffold_feature_tasks.py` + `scaffold_refactor_tasks.py` (2039 lines combined) | Extract shared scaffolding base |
+
+---
+
+## 4. Recommended Next Feature
+
+### Task Registry Reconciliation and CI Test Gate
+
+**The single most impactful improvement** is fixing the data integrity layer and adding
+the missing CI test gate. Specifically:
+
+**Part A - Fix `selected_benchmark_tasks.json` metadata** (1-2 hours)
+1. Run `python3 scripts/sync_task_metadata.py --fix` to reconcile task entries
+2. Update `total_selected` to match actual count
+3. Fix `per_suite` counts to match actual task entries per suite
+4. Add `csb_org_crossrepo` to `mcp_unique_suites` array
+5. Resolve the count mismatch: either mark all 404 tasks with an explicit status, or remove
+   the 34 extra tasks that exceed the 370 target
+
+**Part B - Add CI test workflow** (30 minutes)
+1. Create `pyproject.toml` with `[tool.pytest.ini_options]` and `[tool.coverage.run]`/`[tool.coverage.report]` sections
+2. Fix the failing `test_fail_contaminated` test
+3. Add `.github/workflows/test.yml` that runs `pytest tests/` on push/PR
+4. Fix the `docs/DAYTONA.md` broken reference to unblock repo_health
+
+**Part C - Extract shared constants** (1-2 hours)
+1. Add `SKIP_PATTERNS` and `DIR_PREFIX_TO_SUITE` to `scripts/config_utils.py`
+2. Update the 8+ scripts that duplicate these constants to import from config_utils
+3. Add a repo_health check that greps for duplicate definitions
+
+**Why this is highest impact**: Every other improvement (new analysis scripts, better
+reporting, additional benchmarks) depends on the task registry being accurate and changes
+being caught by CI. Currently, metadata drift accumulates silently and test regressions
+go undetected. Fixing this foundation makes all subsequent work more reliable.
+
+**PRD-ready description**: "Implement a task registry integrity system consisting of: (1) a
+reconciliation pass that makes `selected_benchmark_tasks.json` metadata exactly match its
+contents and the filesystem, (2) a `pyproject.toml` with pytest/coverage configuration,
+(3) a GitHub Actions test workflow that runs all 207 tests on push/PR, (4) extraction of
+duplicated SKIP_PATTERNS and DIR_PREFIX_TO_SUITE constants into `config_utils.py` with
+imports replacing all 8+ duplication sites. Success criteria: repo_health passes, all tests
+pass in CI, and `sync_task_metadata.py --check` validates registry consistency."