Commit 300163f

docs: add first nightly research report (2026-03-06)
Covers task registry metadata drift, missing CI test gate, duplicated constants across 8+ scripts, and testing gaps. Recommends task registry reconciliation + CI test workflow as highest-impact next feature. Also adds .gitignore exception for reports/nightly/.
1 parent 6886d66 commit 300163f

2 files changed: +245 -0 lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -57,6 +57,8 @@ scripts/plot_csb_mcp_blog_figures.py
 ralph/
 ralph-*/
 reports/
+!reports/nightly/
+!reports/nightly/**
 eval_reports/
 tmp/
 *.log
Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@
# Nightly Research Report - 2026-03-06

First nightly review of the CodeContextBench repository. Covers code quality,
configuration integrity, testing infrastructure, and architectural debt.

---

## 1. Code & Architecture Review

### 1.1 CRITICAL: Task Registry Metadata Drift

`configs/selected_benchmark_tasks.json` has multiple internal inconsistencies:

| Issue | Details |
|-------|---------|
| **total_selected vs target_total** | `total_selected: 404` but `target_total: 370`. The file contains 404 task entries but only 20 have `status: "active"`; the remaining 384 have no status field at all. |
| **per_suite counts wrong** | 14 of 20 suites have `per_suite` metadata that doesn't match the actual number of tasks in the file. Example: `csb_sdlc_fix` says 26 but file has 34 entries. |
| **Missing suite in mcp_unique_suites** | `csb_org_crossrepo` (14 tasks) appears in `per_suite` but is absent from the `mcp_unique_suites` array, which claims 11 suites but lists only 10. |

**Impact**: Any script relying on metadata counts for DOE power calculations, progress
reporting, or suite-level statistics will produce incorrect results. The `sync_task_metadata.py`
script exists but apparently hasn't been run recently enough to catch this drift.
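
The kind of drift described above could be detected with a small consistency check. The following is a minimal sketch, not the logic of `sync_task_metadata.py`: `total_selected` and `per_suite` are the metadata fields named in this report, while the `tasks` list key and the per-entry `suite` key are assumptions about the file layout made purely for illustration.

```python
from collections import Counter


def find_registry_drift(registry: dict) -> list[str]:
    """Return human-readable mismatches between metadata and task entries.

    Assumes (hypothetically) that entries live under a "tasks" list and that
    each entry names its suite under a "suite" key.
    """
    tasks = registry.get("tasks", [])
    actual = Counter(task.get("suite", "<missing>") for task in tasks)
    problems = []

    # Compare the declared total against the number of entries present.
    if registry.get("total_selected") != len(tasks):
        problems.append(
            f"total_selected={registry.get('total_selected')} but {len(tasks)} entries"
        )

    # Compare each declared per-suite count against the counted entries.
    declared = registry.get("per_suite", {})
    for suite in sorted(set(declared) | set(actual)):
        if declared.get(suite, 0) != actual.get(suite, 0):
            problems.append(
                f"per_suite[{suite}]={declared.get(suite, 0)} "
                f"but {actual.get(suite, 0)} entries"
            )
    return problems


# Tiny in-memory registry standing in for the real file.
sample = {
    "total_selected": 3,
    "per_suite": {"csb_sdlc_fix": 1, "csb_org_crossrepo": 1},
    "tasks": [
        {"suite": "csb_sdlc_fix"},
        {"suite": "csb_sdlc_fix"},
        {"suite": "csb_org_crossrepo"},
    ],
}
drift = find_registry_drift(sample)
```

In the sample, the declared total matches but `csb_sdlc_fix` is declared as 1 while two entries exist, so exactly one mismatch is reported.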

### 1.2 Repo Health Gate Failing

`python3 scripts/repo_health.py --quick` currently fails:

```
docs_consistency: FAILED
- missing_ref_all_docs:docs/DAYTONA.md:scripts/daytona_snapshot_cleanup.py
```

`docs/DAYTONA.md` line 270 references `scripts/daytona_snapshot_cleanup.py`, which does not
exist (it's listed in `.gitignore` but was never committed). This blocks CI on push to main.

### 1.3 Failing Test

`tests/test_abc_audit.py::TestR2NoContamination::test_fail_contaminated` fails (206 pass, 1 fail).

**Root cause**: The contamination regex in `scripts/abc_audit.py` (lines 569-574) was narrowed
to require a suffix after "sourcegraph" (e.g., "sourcegraph mcp", "sourcegraph tools"), but the
test fixture at `tests/test_abc_audit.py` line 81 still uses bare "sourcegraph" which no longer
matches. Either the regex or the test fixture needs updating.
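
The mismatch can be illustrated with two stand-in patterns. This is only a sketch: the actual expression in `scripts/abc_audit.py` is not reproduced in this report, so `BROAD`, `NARROW`, and both fixture strings below are hypothetical.

```python
import re

# Hypothetical stand-ins; the real pattern lives in scripts/abc_audit.py.
BROAD = re.compile(r"\bsourcegraph\b", re.IGNORECASE)
NARROW = re.compile(r"\bsourcegraph\s+(?:mcp|tools)\b", re.IGNORECASE)

# A bare mention (like the old fixture) and a mention with a required suffix.
fixture = "Use sourcegraph to find the definition."
updated = "Use Sourcegraph MCP to find the definition."

bare_hits_broad = bool(BROAD.search(fixture))       # bare mention, broad pattern
bare_hits_narrow = bool(NARROW.search(fixture))     # bare mention, narrowed pattern
updated_hits_narrow = bool(NARROW.search(updated))  # suffixed mention, narrowed pattern
```

Under these assumptions the bare fixture matches only the broad pattern, which is why a test written against the old behavior would start failing once the pattern is narrowed.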

### 1.4 Duplicated Constants Across 8+ Scripts

Two critical constants are copy-pasted with **inconsistent values** across scripts:

**SKIP_PATTERNS** (6+ variations found):

| Script | Missing from its list |
|--------|-----------------------|
| `aggregate_status.py` | `__archived` |
| `generate_manifest.py` | `__archived`, `__aborted` |
| `audit_traces.py` | `__archived`, `__aborted`, `__v1_hinted` |
| `normalize_retrieval_events.py` | (superset - has everything) |

**DIR_PREFIX_TO_SUITE** mapping (~70 lines): Identical block duplicated in `aggregate_status.py`,
`generate_manifest.py`, `audit_traces.py`, `cost_breakdown_analysis.py`, `ds_audit.py`,
`fix_h3_tokens.py`, `ir_analysis.py`, `mcp_audit.py`. Any new suite prefix must be added to
all 8 files.

**Recommendation**: Extract both into `scripts/config_utils.py` (which already exists but
doesn't contain these).
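
A minimal sketch of what the extracted constants might look like follows. The three skip patterns are the ones named in the table above (the full superset may contain more); the `sdlc_fix_` prefix, the helper names `should_skip` and `suite_for_dir`, and the substring/prefix matching rules are assumptions for illustration, not the existing behavior of any script.

```python
# Hypothetical additions to scripts/config_utils.py.

# Patterns named in this report; the real superset may be longer.
SKIP_PATTERNS = ("__archived", "__aborted", "__v1_hinted")

# One illustrative entry only; the real mapping is ~70 lines.
# The "sdlc_fix_" prefix is a made-up example.
DIR_PREFIX_TO_SUITE = {"sdlc_fix_": "csb_sdlc_fix"}


def should_skip(dir_name: str) -> bool:
    """Return True if the directory name contains any skip pattern."""
    return any(pattern in dir_name for pattern in SKIP_PATTERNS)


def suite_for_dir(dir_name: str):
    """Return the suite for a run directory name, or None if no prefix matches."""
    for prefix, suite in DIR_PREFIX_TO_SUITE.items():
        if dir_name.startswith(prefix):
            return suite
    return None
```

Each consuming script would then replace its local copy with `from config_utils import SKIP_PATTERNS, DIR_PREFIX_TO_SUITE`, so a new suite prefix becomes a one-line change.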

### 1.5 No Shared Utility Library

The repository has 166 Python scripts but no shared library for common operations:
- JSON file loading with error handling (5+ local `load_json()` definitions)
- Run directory discovery and filtering
- Suite name resolution from legacy prefixes
- Logging setup

The `scripts/csb_metrics/` package handles metrics extraction well, but operational scripts
outside it re-implement the same patterns independently.

### 1.6 Script Size Violations

Several scripts significantly exceed the 800-line guideline:

| Script | Lines | Purpose |
|--------|------:|---------|
| `context_retrieval_agent.py` | 3,299 | IR retrieval agent |
| `export_official_results.py` | 2,866 | Results export + HTML browser |
| `ir_analysis.py` | 2,152 | IR metrics analysis |
| `daytona_curator_runner.py` | 1,628 | Oracle curation runner |
| `comprehensive_analysis.py` | 1,596 | Full analysis pipeline |
| `validate_on_contextbench.py` | 1,435 | Submission validation |
| `audit_official_scores.py` | 1,365 | Score auditing |

### 1.7 Stale Code and References

| Item | Location | Status |
|------|----------|--------|
| `validators.py` duplication gotcha | `CLAUDE.md` line 84 | **Stale** - no validators.py files exist in repo |
| `ccb_build` benchmark reference | `scripts/daytona_poc_runner.py` line 36 | **Broken** - benchmark was renamed/removed |
| Wave integration scripts | `scripts/integrate_answer_json_wave{1,2,3}.py` | ~90% code overlap, should be parameterized |
| 3 overlapping Daytona runners | `daytona_poc_runner.py`, `daytona_runner.py`, `daytona_curator_runner.py` | High duplication in SDK init, error handling, result parsing |

### 1.8 Unused JSON Schemas

9 of 13 schemas in `schemas/` are never referenced by any script:

- `economic_metrics_schema.json`, `failure_analysis_schema.json`, `governance_report_schema.json`
- `icp_profile_schema.json`, `mcp_task_spec.schema.json`, `reliability_metrics_schema.json`
- `run_triage.schema.json`, `use_case_registry.schema.json`, `workflow_metrics_schema.json`

Only 4 scripts use `jsonschema` validation at all: `generate_enterprise_report.py`,
`validate_submission.py`, `package_submission.py`, `ingest_judge_results.py`.

---

## 2. Feature & UX Improvements

### 2.1 Testing Infrastructure Is Nearly Absent

| Metric | Current | Target |
|--------|---------|--------|
| Scripts with tests | 7 of 166 (4.2%) | Critical-path scripts at minimum |
| Test configuration | None (no pytest.ini, pyproject.toml, or .coveragerc) | pyproject.toml with pytest + coverage |
| CI test workflow | None | GitHub Actions workflow running pytest |
| Coverage reporting | None | pytest-cov with threshold gate |

**Critical untested scripts**: `repo_health.py` (master commit gate), `daytona_runner.py`
(task execution engine), `validate_task_run.py` (post-run validation), `aggregate_status.py`
(results processing).

### 2.2 Missing MCP Instructions in Org Tasks

`instruction_mcp.md` is missing from significant portions of Org suites:

| Suite | Missing | Total | Gap |
|-------|--------:|------:|----:|
| `csb_org_onboarding` | 11 | 28 | 39% |
| `csb_org_security` | 12 | 26 | 46% |
| `csb_org_migration` | 9 | 28 | 32% |
| `csb_org_compliance` | 8 | 20 | 40% |
| `csb_org_platform` | 6 | 20 | 30% |

All SDLC suites have 100% coverage. The gap means MCP-enabled runs on these Org tasks
may not get task-specific MCP guidance, potentially affecting MCP vs baseline comparisons.
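
A gap table like the one above could be regenerated with a short directory scan. The sketch below assumes a layout of one directory per task directly under each suite directory, with `instruction_mcp.md` at the task root; that layout and the function name `missing_mcp_instructions` are assumptions, and the demo uses a throwaway directory rather than the real benchmark tree.

```python
import tempfile
from pathlib import Path


def missing_mcp_instructions(suite_dir: Path) -> list[str]:
    """List task directories under suite_dir that lack instruction_mcp.md."""
    return sorted(
        task.name
        for task in suite_dir.iterdir()
        if task.is_dir() and not (task / "instruction_mcp.md").is_file()
    )


# Build a throwaway suite with two tasks, only one of which has the file.
with tempfile.TemporaryDirectory() as tmp:
    suite = Path(tmp) / "csb_org_security"
    (suite / "task-a").mkdir(parents=True)
    (suite / "task-b").mkdir()
    (suite / "task-a" / "instruction_mcp.md").write_text("guidance\n")
    missing = missing_mcp_instructions(suite)
```

Running such a check per suite would make the coverage gap something a health check can track instead of a one-off finding.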

### 2.3 Inconsistent CLI Patterns

~16 scripts use raw `sys.argv` instead of `argparse`. These produce no help text and
fail silently with wrong arguments. Standardizing on argparse would improve operability.
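
As a rough illustration of the target pattern, a script that currently reads `sys.argv[1]` directly might be converted along these lines. The argument names and the `runs/official` path are hypothetical and not taken from any existing script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that documents its arguments and rejects bad input."""
    parser = argparse.ArgumentParser(
        description="Hypothetical example: summarize runs under a directory."
    )
    parser.add_argument("runs_dir", help="directory containing run outputs")
    parser.add_argument("--suite", default=None, help="restrict to one suite")
    parser.add_argument(
        "--verbose", action="store_true", help="print per-task detail"
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["runs/official", "--suite", "csb_sdlc_fix"])
```

With this in place, `--help` prints usage text, and a missing or unknown argument exits with an error message instead of failing silently.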

### 2.4 Dashboard Removed

The `dashboard/` directory referenced in MEMORY.md (`dashboard/app.py`, Streamlit) no longer
exists. The `data/codecontextbench.db` reference is also stale. The only interactive
visualization is the `export_official_results.py --serve` HTML browser.

### 2.5 Four Linux Debug Tasks Have Invalid GT Schemas

These tasks in `csb_sdlc_debug` use `{buggy_files[], buggy_functions[]}` instead of the
standard ground-truth (GT) format, making them incompatible with the checklist scorer:

- `linux-acpi-backlight-fault-001`
- `linux-hda-intel-suspend-fault-001`
- `linux-iwlwifi-subdevice-fault-001`
- `linux-nfs-inode-revalidate-fault-001`

---

## 3. Research Recommendations

### 3.1 Add `pyproject.toml` with Test Infrastructure

A single `pyproject.toml` would provide:
- pytest configuration (test paths, markers, fixture discovery)
- Coverage settings (minimum threshold, branch coverage)
- Package metadata for `scripts/csb_metrics/`
- Dependency pinning for development tools

### 3.2 CI Test Workflow

Add `.github/workflows/test.yml` to run `pytest tests/ --cov` on push/PR. The existing
`repo_health.yml` and `docs-consistency.yml` handle config validation but nothing runs
the 207 existing tests. This means regressions (like the current test_abc_audit failure)
go undetected.

### 3.3 Consolidate Shared Constants into `config_utils.py`

`scripts/config_utils.py` already exists but doesn't contain the most-duplicated patterns.
Adding `SKIP_PATTERNS`, `DIR_PREFIX_TO_SUITE`, and common JSON/path utilities here would
eliminate 8+ duplication sites and make suite renames a single-file change.

### 3.4 Validate Schemas at Runtime

The 13 JSON schemas in `schemas/` represent a significant investment. Using them for runtime
validation in `generate_manifest.py`, `aggregate_status.py`, and task scaffolding scripts
would catch data format regressions early. Consider `jsonschema` validation in the
`csb_metrics` package.

### 3.5 Script Consolidation Candidates

| Current | Proposed |
|---------|----------|
| `integrate_answer_json_wave{1,2,3}.py` (3 files, ~1150 lines) | Single parameterized `integrate_answer_json.py --wave N` |
| `daytona_poc_runner.py` + `daytona_runner.py` | Merge PoC into main runner with `--quick` flag |
| `scaffold_feature_tasks.py` + `scaffold_refactor_tasks.py` (2039 lines combined) | Extract shared scaffolding base |

---

## 4. Recommended Next Feature

### Task Registry Reconciliation and CI Test Gate

**The single most impactful improvement** is fixing the data integrity layer and adding
the missing CI test gate. Specifically:

**Part A - Fix `selected_benchmark_tasks.json` metadata** (1-2 hours)
1. Run `python3 scripts/sync_task_metadata.py --fix` to reconcile task entries
2. Update `total_selected` to match actual count
3. Fix `per_suite` counts to match actual task entries per suite
4. Add `csb_org_crossrepo` to `mcp_unique_suites` array
5. Decide on status field: either mark all 404 tasks with explicit status, or remove
   the 34 extra tasks that exceed the 370 target

**Part B - Add CI test workflow** (30 minutes)
1. Create `pyproject.toml` with `[tool.pytest.ini_options]` and `[tool.coverage.*]` sections
2. Fix the failing `test_fail_contaminated` test
3. Add `.github/workflows/test.yml` that runs `pytest tests/` on push/PR
4. Fix the `docs/DAYTONA.md` broken reference to unblock repo_health

**Part C - Extract shared constants** (1-2 hours)
1. Add `SKIP_PATTERNS` and `DIR_PREFIX_TO_SUITE` to `scripts/config_utils.py`
2. Update the 8+ scripts that duplicate these constants to import from config_utils
3. Add a repo_health check that greps for duplicate definitions
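
The duplicate-definition check in step 3 of Part C could be sketched as follows. This is an assumption about how such a check might work, not an existing `repo_health.py` feature: the function name `find_duplicate_definitions` is hypothetical, and it only looks for top-level assignments in `*.py` files directly under the given directory.

```python
import re
import tempfile
from pathlib import Path

SHARED_NAMES = ("SKIP_PATTERNS", "DIR_PREFIX_TO_SUITE")

# Matches a top-level assignment such as `SKIP_PATTERNS = [...]`, optionally
# with a type annotation. Imports and `==` comparisons do not match.
ASSIGNMENT = re.compile(
    r"^(" + "|".join(SHARED_NAMES) + r")\s*(?::[^=]+)?=(?!=)", re.MULTILINE
)


def find_duplicate_definitions(scripts_dir: Path, canonical: str = "config_utils.py"):
    """Return (file name, constant) pairs defined outside the canonical module."""
    hits = []
    for path in sorted(scripts_dir.glob("*.py")):
        if path.name == canonical:
            continue  # the canonical module is allowed to define the constants
        for match in ASSIGNMENT.finditer(path.read_text(encoding="utf-8")):
            hits.append((path.name, match.group(1)))
    return hits


# Demo on a throwaway directory: one canonical definition, one duplicate,
# and one script that correctly imports the constant.
with tempfile.TemporaryDirectory() as tmp:
    scripts = Path(tmp)
    (scripts / "config_utils.py").write_text("SKIP_PATTERNS = ['__archived']\n")
    (scripts / "aggregate_status.py").write_text("SKIP_PATTERNS = ['__aborted']\n")
    (scripts / "ir_analysis.py").write_text("from config_utils import SKIP_PATTERNS\n")
    duplicates = find_duplicate_definitions(scripts)
```

A health check built this way would fail whenever the returned list is non-empty, which keeps new copies of the constants from creeping back in after the extraction.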

**Why this is highest impact**: Every other improvement (new analysis scripts, better
reporting, additional benchmarks) depends on the task registry being accurate and changes
being caught by CI. Currently, metadata drift accumulates silently and test regressions
go undetected. Fixing this foundation makes all subsequent work more reliable.

**PRD-ready description**: "Implement a task registry integrity system consisting of: (1) a
reconciliation pass that makes `selected_benchmark_tasks.json` metadata exactly match its
contents and the filesystem, (2) a `pyproject.toml` with pytest/coverage configuration,
(3) a GitHub Actions test workflow that runs all 207 tests on push/PR, (4) extraction of
duplicated SKIP_PATTERNS and DIR_PREFIX_TO_SUITE constants into `config_utils.py` with
imports replacing all 8+ duplication sites. Success criteria: repo_health passes, all tests
pass in CI, and `sync_task_metadata.py --check` validates registry consistency."
