|
| 1 | +# Nightly Research Report - 2026-03-06 |
| 2 | + |
| 3 | +First nightly review of the CodeContextBench repository. Covers code quality, |
| 4 | +configuration integrity, testing infrastructure, and architectural debt. |
| 5 | + |
| 6 | +--- |
| 7 | + |
| 8 | +## 1. Code & Architecture Review |
| 9 | + |
| 10 | +### 1.1 CRITICAL: Task Registry Metadata Drift |
| 11 | + |
| 12 | +`configs/selected_benchmark_tasks.json` has multiple internal inconsistencies: |
| 13 | + |
| 14 | +| Issue | Details | |
| 15 | +|-------|---------| |
| 16 | +| **total_selected vs target_total** | `total_selected: 404` but `target_total: 370`. The file contains 404 task entries but only 20 have `status: "active"`; the remaining 384 have no status field at all. | |
| 17 | +| **per_suite counts wrong** | 14 of 20 suites have `per_suite` metadata that doesn't match the actual number of tasks in the file. Example: `csb_sdlc_fix` says 26 but file has 34 entries. | |
| 18 | +| **Missing suite in mcp_unique_suites** | `csb_org_crossrepo` (14 tasks) appears in `per_suite` but is absent from the `mcp_unique_suites` array, which claims 11 suites but lists only 10. | |
| 19 | + |
| 20 | +**Impact**: Any script relying on metadata counts for DOE power calculations, progress |
| 21 | +reporting, or suite-level statistics will produce incorrect results. The `sync_task_metadata.py` |
| 22 | +script exists but apparently hasn't been run recently enough to catch this drift. |
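The count mismatch described above can be detected mechanically. The following is a minimal sketch, not the behavior of `sync_task_metadata.py`: it assumes (hypothetically) that the registry holds a top-level `tasks` list whose entries carry a `suite` key, next to the `per_suite` mapping. The function name `find_per_suite_drift` is illustrative.

```python
from collections import Counter


def find_per_suite_drift(registry: dict) -> dict:
    """Return {suite: (declared, actual)} for every suite whose counts disagree.

    Assumption: registry["tasks"] is a list of dicts with a "suite" key and
    registry["per_suite"] maps suite name -> declared count. The real file
    layout may differ.
    """
    actual = Counter(task["suite"] for task in registry.get("tasks", []))
    declared = registry.get("per_suite", {})
    drift = {}
    for suite in sorted(set(actual) | set(declared)):
        if declared.get(suite, 0) != actual.get(suite, 0):
            drift[suite] = (declared.get(suite, 0), actual.get(suite, 0))
    return drift


# Tiny in-memory example; real data would come from json.load().
example = {
    "per_suite": {"csb_sdlc_fix": 2, "csb_sdlc_debug": 1},
    "tasks": [
        {"id": "a", "suite": "csb_sdlc_fix"},
        {"id": "b", "suite": "csb_sdlc_fix"},
        {"id": "c", "suite": "csb_sdlc_fix"},
        {"id": "d", "suite": "csb_sdlc_debug"},
    ],
}
print(find_per_suite_drift(example))  # {'csb_sdlc_fix': (2, 3)}
```

Returning the declared and actual counts side by side keeps the output usable both for a failing check and for an automatic fix.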
| 23 | + |
| 24 | +### 1.2 Repo Health Gate Failing |
| 25 | + |
| 26 | +`python3 scripts/repo_health.py --quick` currently fails: |
| 27 | + |
| 28 | +``` |
| 29 | +docs_consistency: FAILED |
| 30 | + - missing_ref_all_docs:docs/DAYTONA.md:scripts/daytona_snapshot_cleanup.py |
| 31 | +``` |
| 32 | + |
| 33 | +`docs/DAYTONA.md` line 270 references `scripts/daytona_snapshot_cleanup.py`, which does not |
| 34 | +exist (it's listed in `.gitignore` but was never committed). This blocks CI on push to main. |
| 35 | + |
| 36 | +### 1.3 Failing Test |
| 37 | + |
| 38 | +`tests/test_abc_audit.py::TestR2NoContamination::test_fail_contaminated` fails (206 pass, 1 fail). |
| 39 | + |
| 40 | +**Root cause**: The contamination regex in `scripts/abc_audit.py` (lines 569-574) was narrowed |
| 41 | +to require a suffix after "sourcegraph" (e.g., "sourcegraph mcp", "sourcegraph tools"), but the |
| 42 | +test fixture at `tests/test_abc_audit.py` line 81 still uses bare "sourcegraph", which no longer |
| 43 | +matches. Either the regex or the test fixture needs updating. |
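The mismatch can be reproduced in isolation. The patterns below are hypothetical stand-ins, since the actual expressions in `scripts/abc_audit.py` are not shown here; they only illustrate how requiring a suffix stops a bare mention from matching.

```python
import re

# Hypothetical patterns; the real regex in scripts/abc_audit.py may differ.
BARE = re.compile(r"\bsourcegraph\b", re.IGNORECASE)
NARROWED = re.compile(r"\bsourcegraph\s+(?:mcp|tools)\b", re.IGNORECASE)


def matches(pattern: re.Pattern, text: str) -> bool:
    """Return True when the pattern occurs anywhere in the text."""
    return pattern.search(text) is not None


# Illustrative fixture text with a bare mention and no suffix.
fixture = "Use sourcegraph to find the definition."

print(matches(BARE, fixture))      # True
print(matches(NARROWED, fixture))  # False
print(matches(NARROWED, "Use Sourcegraph MCP to search."))  # True
```

If the narrowing was intentional, updating the fixture to a suffixed phrase is the smaller change; if bare mentions should still count as contamination, the regex needs widening instead.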
| 44 | + |
| 45 | +### 1.4 Duplicated Constants Across 8+ Scripts |
| 46 | + |
| 47 | +Two critical constants are copy-pasted with **inconsistent values** across scripts: |
| 48 | + |
| 49 | +**SKIP_PATTERNS** (6+ variations found): |
| 50 | + |
| 51 | +| Script | Missing from its list | |
| 52 | +|--------|-----------------------| |
| 53 | +| `aggregate_status.py` | `__archived` | |
| 54 | +| `generate_manifest.py` | `__archived`, `__aborted` | |
| 55 | +| `audit_traces.py` | `__archived`, `__aborted`, `__v1_hinted` | |
| 56 | +| `normalize_retrieval_events.py` | (superset - has everything) | |
| 57 | + |
| 58 | +**DIR_PREFIX_TO_SUITE** mapping (~70 lines): Identical block duplicated in `aggregate_status.py`, |
| 59 | +`generate_manifest.py`, `audit_traces.py`, `cost_breakdown_analysis.py`, `ds_audit.py`, |
| 60 | +`fix_h3_tokens.py`, `ir_analysis.py`, `mcp_audit.py`. Any new suite prefix must be added to |
| 61 | +all 8 files. |
| 62 | + |
| 63 | +**Recommendation**: Extract both into `scripts/config_utils.py` (which already exists but |
| 64 | +doesn't contain these). |
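A sketch of what the shared definitions might look like follows. It assumes the skip list contains at least the three names from the table above (the actual superset may be larger), and the `DIR_PREFIX_TO_SUITE` keys are placeholders, because the real ~70-line mapping is not reproduced in this report. The helper names `should_skip` and `suite_for_dir` are hypothetical.

```python
# Hypothetical additions to scripts/config_utils.py.

# The three markers named in the table above; the real superset may hold more.
SKIP_PATTERNS = ("__archived", "__aborted", "__v1_hinted")

# Placeholder keys for illustration only; the real mapping is ~70 lines.
DIR_PREFIX_TO_SUITE = {
    "example_fix_": "csb_sdlc_fix",
    "example_debug_": "csb_sdlc_debug",
}


def should_skip(dir_name: str) -> bool:
    """Return True when a run directory name contains any skip marker."""
    return any(marker in dir_name for marker in SKIP_PATTERNS)


def suite_for_dir(dir_name: str):
    """Resolve a directory name to its suite, or None if no prefix matches."""
    # Check longer prefixes first so a more specific prefix wins.
    for prefix in sorted(DIR_PREFIX_TO_SUITE, key=len, reverse=True):
        if dir_name.startswith(prefix):
            return DIR_PREFIX_TO_SUITE[prefix]
    return None
```

Each consuming script would then replace its local copy with `from config_utils import SKIP_PATTERNS, DIR_PREFIX_TO_SUITE`.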
| 65 | + |
| 66 | +### 1.5 No Shared Utility Library |
| 67 | + |
| 68 | +166 Python scripts with no shared library for common operations: |
| 69 | +- JSON file loading with error handling (5+ local `load_json()` definitions) |
| 70 | +- Run directory discovery and filtering |
| 71 | +- Suite name resolution from legacy prefixes |
| 72 | +- Logging setup |
| 73 | + |
| 74 | +The `scripts/csb_metrics/` package handles metrics extraction well, but operational scripts |
| 75 | +outside it re-implement the same patterns independently. |
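As one example of a helper worth sharing, the sketch below shows a single `load_json()` with consistent error handling. The signature and the choice to return a default instead of raising are assumptions, not a description of any of the existing local definitions.

```python
import json
import logging
import tempfile
from pathlib import Path

logger = logging.getLogger(__name__)


def load_json(path, default=None):
    """Load a JSON file; return `default` if it is missing or malformed."""
    try:
        with Path(path).open(encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        logger.warning("JSON file not found: %s", path)
    except json.JSONDecodeError as exc:
        logger.warning("Invalid JSON in %s: %s", path, exc)
    return default


# Demonstration against temporary files.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.json"
    good.write_text('{"ok": true}', encoding="utf-8")
    bad = Path(tmp) / "bad.json"
    bad.write_text("{not json", encoding="utf-8")

    loaded_good = load_json(good)
    loaded_bad = load_json(bad, default={})
    loaded_missing = load_json(Path(tmp) / "missing.json", default={})
```

Scripts that must stop on bad input can check for the default and exit explicitly, which keeps the failure policy visible at the call site.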
| 76 | + |
| 77 | +### 1.6 Script Size Violations |
| 78 | + |
| 79 | +Several scripts significantly exceed the 800-line guideline: |
| 80 | + |
| 81 | +| Script | Lines | Purpose | |
| 82 | +|--------|------:|---------| |
| 83 | +| `context_retrieval_agent.py` | 3,299 | IR retrieval agent | |
| 84 | +| `export_official_results.py` | 2,866 | Results export + HTML browser | |
| 85 | +| `ir_analysis.py` | 2,152 | IR metrics analysis | |
| 86 | +| `daytona_curator_runner.py` | 1,628 | Oracle curation runner | |
| 87 | +| `comprehensive_analysis.py` | 1,596 | Full analysis pipeline | |
| 88 | +| `validate_on_contextbench.py` | 1,435 | Submission validation | |
| 89 | +| `audit_official_scores.py` | 1,365 | Score auditing | |
| 90 | + |
| 91 | +### 1.7 Stale Code and References |
| 92 | + |
| 93 | +| Item | Location | Status | |
| 94 | +|------|----------|--------| |
| 95 | +| `validators.py` duplication gotcha | `CLAUDE.md` line 84 | **Stale** - no validators.py files exist in repo | |
| 96 | +| `ccb_build` benchmark reference | `scripts/daytona_poc_runner.py` line 36 | **Broken** - benchmark was renamed/removed | |
| 97 | +| Wave integration scripts | `scripts/integrate_answer_json_wave{1,2,3}.py` | ~90% code overlap, should be parameterized | |
| 98 | +| 3 overlapping Daytona runners | `daytona_poc_runner.py`, `daytona_runner.py`, `daytona_curator_runner.py` | High duplication in SDK init, error handling, result parsing | |
| 99 | + |
| 100 | +### 1.8 Unused JSON Schemas |
| 101 | + |
| 102 | +9 of 13 schemas in `schemas/` are never referenced by any script: |
| 103 | + |
| 104 | +- `economic_metrics_schema.json`, `failure_analysis_schema.json`, `governance_report_schema.json` |
| 105 | +- `icp_profile_schema.json`, `mcp_task_spec.schema.json`, `reliability_metrics_schema.json` |
| 106 | +- `run_triage.schema.json`, `use_case_registry.schema.json`, `workflow_metrics_schema.json` |
| 107 | + |
| 108 | +Only 4 scripts use `jsonschema` validation at all: `generate_enterprise_report.py`, |
| 109 | +`validate_submission.py`, `package_submission.py`, `ingest_judge_results.py`. |
| 110 | + |
| 111 | +--- |
| 112 | + |
| 113 | +## 2. Feature & UX Improvements |
| 114 | + |
| 115 | +### 2.1 Testing Infrastructure Is Nearly Absent |
| 116 | + |
| 117 | +| Metric | Current | Target | |
| 118 | +|--------|---------|--------| |
| 119 | +| Scripts with tests | 7 of 166 (4.2%) | Critical-path scripts at minimum | |
| 120 | +| Test configuration | None (no pytest.ini, pyproject.toml, or .coveragerc) | pyproject.toml with pytest + coverage | |
| 121 | +| CI test workflow | None | GitHub Actions workflow running pytest | |
| 122 | +| Coverage reporting | None | pytest-cov with threshold gate | |
| 123 | + |
| 124 | +**Critical untested scripts**: `repo_health.py` (master commit gate), `daytona_runner.py` |
| 125 | +(task execution engine), `validate_task_run.py` (post-run validation), `aggregate_status.py` |
| 126 | +(results processing). |
| 127 | + |
| 128 | +### 2.2 Missing MCP Instructions in Org Tasks |
| 129 | + |
| 130 | +`instruction_mcp.md` is missing from significant portions of Org suites: |
| 131 | + |
| 132 | +| Suite | Missing | Total | Gap | |
| 133 | +|-------|--------:|------:|----:| |
| 134 | +| `csb_org_onboarding` | 11 | 28 | 39% | |
| 135 | +| `csb_org_security` | 12 | 26 | 46% | |
| 136 | +| `csb_org_migration` | 9 | 28 | 32% | |
| 137 | +| `csb_org_compliance` | 8 | 20 | 40% | |
| 138 | +| `csb_org_platform` | 6 | 20 | 30% | |
| 139 | + |
| 140 | +All SDLC suites have 100% coverage. The gap means MCP-enabled runs on these Org tasks |
| 141 | +may not get task-specific MCP guidance, potentially affecting MCP vs baseline comparisons. |
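A check like the following could keep this gap visible over time. It is a sketch under one assumption about layout: every immediate subdirectory of a suite directory is a task, and `instruction_mcp.md` sits at the task root. The function name is hypothetical.

```python
import tempfile
from pathlib import Path


def missing_mcp_instructions(suite_dir: Path) -> list[str]:
    """Return the task directory names under suite_dir without instruction_mcp.md."""
    return sorted(
        task.name
        for task in suite_dir.iterdir()
        if task.is_dir() and not (task / "instruction_mcp.md").is_file()
    )


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp) / "csb_org_security"
    for name, has_file in [("task-a", True), ("task-b", False)]:
        (suite / name).mkdir(parents=True)
        if has_file:
            (suite / name / "instruction_mcp.md").write_text("guidance", encoding="utf-8")

    missing = missing_mcp_instructions(suite)

print(missing)  # ['task-b']
```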
| 142 | + |
| 143 | +### 2.3 Inconsistent CLI Patterns |
| 144 | + |
| 145 | +~16 scripts use raw `sys.argv` instead of `argparse`. These scripts produce no help text and |
| 146 | +can fail silently when given wrong arguments. Standardizing on `argparse` would improve operability. |
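For reference, a minimal `argparse` skeleton is sketched below. The script purpose, the argument names (`run_dir`, `--suite`, `--verbose`), and the example values are hypothetical and not taken from any script in the repository.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the CLI parser for a hypothetical run-summary script."""
    parser = argparse.ArgumentParser(
        description="Summarize one run directory (illustrative example)."
    )
    parser.add_argument("run_dir", help="path to the run directory")
    parser.add_argument("--suite", default=None, help="restrict output to one suite")
    parser.add_argument("--verbose", action="store_true", help="print extra detail")
    return parser


# Passing an explicit list keeps the example independent of sys.argv.
args = build_parser().parse_args(["runs/example", "--suite", "csb_sdlc_fix"])
print(args.run_dir, args.suite, args.verbose)  # runs/example csb_sdlc_fix False
```

With this structure, `--help` is generated automatically and a missing or unknown argument exits with a usage message and a non-zero status.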
| 147 | + |
| 148 | +### 2.4 Dashboard Removed |
| 149 | + |
| 150 | +The `dashboard/` directory referenced in MEMORY.md (`dashboard/app.py`, Streamlit) no longer |
| 151 | +exists. The `data/codecontextbench.db` reference is also stale. The only interactive |
| 152 | +visualization is the `export_official_results.py --serve` HTML browser. |
| 153 | + |
| 154 | +### 2.5 Four Linux Debug Tasks Have Invalid GT Schemas |
| 155 | + |
| 156 | +These tasks in `csb_sdlc_debug` use `{buggy_files[], buggy_functions[]}` instead of the |
| 157 | +standard GT format, making them incompatible with the checklist scorer: |
| 158 | + |
| 159 | +- `linux-acpi-backlight-fault-001` |
| 160 | +- `linux-hda-intel-suspend-fault-001` |
| 161 | +- `linux-iwlwifi-subdevice-fault-001` |
| 162 | +- `linux-nfs-inode-revalidate-fault-001` |
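Tasks with this shape can be flagged before scoring. The sketch below only tests for the two keys named above; it makes no assumption about what the standard GT format contains, and the function name is hypothetical.

```python
# The two keys reported for the four Linux debug tasks.
FAULT_LOCALIZATION_KEYS = {"buggy_files", "buggy_functions"}


def uses_fault_localization_shape(ground_truth: dict) -> bool:
    """Return True when the GT object carries both buggy_files and buggy_functions."""
    return FAULT_LOCALIZATION_KEYS.issubset(ground_truth)


# Illustrative objects; the file and function names are made up.
flagged = uses_fault_localization_shape(
    {"buggy_files": ["drivers/example.c"], "buggy_functions": ["example_probe"]}
)
not_flagged = uses_fault_localization_shape({"some_other_key": []})
print(flagged, not_flagged)  # True False
```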
| 163 | + |
| 164 | +--- |
| 165 | + |
| 166 | +## 3. Research Recommendations |
| 167 | + |
| 168 | +### 3.1 Add `pyproject.toml` with Test Infrastructure |
| 169 | + |
| 170 | +A single `pyproject.toml` would provide: |
| 171 | +- pytest configuration (test paths, markers, fixture discovery) |
| 172 | +- Coverage settings (minimum threshold, branch coverage) |
| 173 | +- Package metadata for `scripts/csb_metrics/` |
| 174 | +- Dependency pinning for development tools |
| 175 | + |
| 176 | +### 3.2 CI Test Workflow |
| 177 | + |
| 178 | +Add `.github/workflows/test.yml` to run `pytest tests/ --cov` on push/PR. The existing |
| 179 | +`repo_health.yml` and `docs-consistency.yml` handle config validation but nothing runs |
| 180 | +the 207 existing tests. This means regressions (like the current test_abc_audit failure) |
| 181 | +go undetected. |
| 182 | + |
| 183 | +### 3.3 Consolidate Shared Constants into `config_utils.py` |
| 184 | + |
| 185 | +`scripts/config_utils.py` already exists but doesn't contain the most-duplicated patterns. |
| 186 | +Adding `SKIP_PATTERNS`, `DIR_PREFIX_TO_SUITE`, and common JSON/path utilities here would |
| 187 | +eliminate 8+ duplication sites and make suite renames a single-file change. |
| 188 | + |
| 189 | +### 3.4 Validate Schemas at Runtime |
| 190 | + |
| 191 | +The 13 JSON schemas in `schemas/` represent a significant investment. Using them for runtime |
| 192 | +validation in `generate_manifest.py`, `aggregate_status.py`, and task scaffolding scripts |
| 193 | +would catch data format regressions early. Consider `jsonschema` validation in the |
| 194 | +`csb_metrics` package. |
| 195 | + |
| 196 | +### 3.5 Script Consolidation Candidates |
| 197 | + |
| 198 | +| Current | Proposed | |
| 199 | +|---------|----------| |
| 200 | +| `integrate_answer_json_wave{1,2,3}.py` (3 files, ~1150 lines) | Single parameterized `integrate_answer_json.py --wave N` | |
| 201 | +| `daytona_poc_runner.py` + `daytona_runner.py` | Merge PoC into main runner with `--quick` flag | |
| 202 | +| `scaffold_feature_tasks.py` + `scaffold_refactor_tasks.py` (2039 lines combined) | Extract shared scaffolding base | |
| 203 | + |
| 204 | +--- |
| 205 | + |
| 206 | +## 4. Recommended Next Feature |
| 207 | + |
| 208 | +### Task Registry Reconciliation and CI Test Gate |
| 209 | + |
| 210 | +**The single most impactful improvement** is fixing the data integrity layer and adding |
| 211 | +the missing CI test gate. Specifically: |
| 212 | + |
| 213 | +**Part A - Fix `selected_benchmark_tasks.json` metadata** (1-2 hours) |
| 214 | +1. Run `python3 scripts/sync_task_metadata.py --fix` to reconcile task entries |
| 215 | +2. Update `total_selected` to match actual count |
| 216 | +3. Fix `per_suite` counts to match actual task entries per suite |
| 217 | +4. Add `csb_org_crossrepo` to `mcp_unique_suites` array |
| 218 | +5. Decide on status field: either mark all 404 tasks with an explicit status, or remove |
| 219 | +   the 34 extra tasks that exceed the 370 target |
| 220 | + |
| 221 | +**Part B - Add CI test workflow** (30 minutes) |
| 222 | +1. Create `pyproject.toml` with `[tool.pytest.ini_options]` and `[tool.coverage.*]` sections (e.g. `[tool.coverage.run]`, `[tool.coverage.report]`) |
| 223 | +2. Fix the failing `test_fail_contaminated` test |
| 224 | +3. Add `.github/workflows/test.yml` that runs `pytest tests/` on push/PR |
| 225 | +4. Fix the `docs/DAYTONA.md` broken reference to unblock repo_health |
| 226 | + |
| 227 | +**Part C - Extract shared constants** (1-2 hours) |
| 228 | +1. Add `SKIP_PATTERNS` and `DIR_PREFIX_TO_SUITE` to `scripts/config_utils.py` |
| 229 | +2. Update the 8+ scripts that duplicate these constants to import from config_utils |
| 230 | +3. Add a repo_health check that greps for duplicate definitions |
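The duplicate-definition check in step 3 could be sketched as follows. This is an assumption about how such a check might work, not existing `repo_health.py` behavior: it scans top-level assignments with a regex and treats `config_utils.py` as the only allowed definition site.

```python
import re
import tempfile
from pathlib import Path

CANONICAL_FILE = "config_utils.py"

# Matches a line that starts with an assignment to one of the shared names.
ASSIGNMENT = re.compile(r"^(SKIP_PATTERNS|DIR_PREFIX_TO_SUITE)\s*=(?!=)", re.MULTILINE)


def find_duplicate_definitions(scripts_dir: Path) -> list[tuple[str, str]]:
    """Return (file name, constant) pairs defined outside the canonical file."""
    hits = []
    for path in sorted(scripts_dir.glob("*.py")):
        if path.name == CANONICAL_FILE:
            continue
        text = path.read_text(encoding="utf-8")
        for match in ASSIGNMENT.finditer(text):
            hits.append((path.name, match.group(1)))
    return hits


# Demonstration: one script imports the constant, another redefines it.
with tempfile.TemporaryDirectory() as tmp:
    scripts = Path(tmp)
    (scripts / "config_utils.py").write_text("SKIP_PATTERNS = ()\n", encoding="utf-8")
    (scripts / "a.py").write_text("from config_utils import SKIP_PATTERNS\n", encoding="utf-8")
    (scripts / "b.py").write_text("import os\nSKIP_PATTERNS = ['__archived']\n", encoding="utf-8")

    duplicates = find_duplicate_definitions(scripts)

print(duplicates)  # [('b.py', 'SKIP_PATTERNS')]
```

A regex scan is cheap but only sees unindented assignments; parsing with `ast` would be more robust if definitions can appear inside functions or classes.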
| 231 | + |
| 232 | +**Why this is highest impact**: Every other improvement (new analysis scripts, better |
| 233 | +reporting, additional benchmarks) depends on the task registry being accurate and changes |
| 234 | +being caught by CI. Currently, metadata drift accumulates silently and test regressions |
| 235 | +go undetected. Fixing this foundation makes all subsequent work more reliable. |
| 236 | + |
| 237 | +**PRD-ready description**: "Implement a task registry integrity system consisting of: (1) a |
| 238 | +reconciliation pass that makes `selected_benchmark_tasks.json` metadata exactly match its |
| 239 | +contents and the filesystem, (2) a `pyproject.toml` with pytest/coverage configuration, |
| 240 | +(3) a GitHub Actions test workflow that runs all 207 tests on push/PR, (4) extraction of |
| 241 | +duplicated SKIP_PATTERNS and DIR_PREFIX_TO_SUITE constants into `config_utils.py` with |
| 242 | +imports replacing all 8+ duplication sites. Success criteria: repo_health passes, all tests |
| 243 | +pass in CI, and `sync_task_metadata.py --check` validates registry consistency." |