AGENTS.md
Lines changed: 25 additions & 25 deletions
@@ -56,12 +56,11 @@ full operations manual.
## Common Gotchas (from session history)

### Documentation Generation
-
- **NEVER edit root `CLAUDE.md` or `AGENTS.md` directly.** Edit canonical sources under `docs/ops/` and regenerate. Direct edits cause `agent_guides_drift` failures in `repo_health.py`.
+
- **NEVER edit root `CLAUDE.md`/`AGENTS.md` directly.** Edit sources in `docs/ops/` and regenerate. Direct edits cause `agent_guides_drift` failures.
- After removing directories from the repo, also clean references from `scripts/sync_agent_guides.py` (`LOCAL_SOURCES`) and `scripts/docs_consistency_check.py` (`LOCAL_AGENT_TARGET_DIRS`).

### Daytona / Harbor
-
- Daytona builds from Dockerfiles at sandbox creation. Fixes on `main` take effect next run (pre-built GHCR images need separate rebuild).
-
- Harbor+Daytona (`harbor run --environment-type daytona`) is recommended. `scripts/daytona_runner.py` is for quick validation only.
+
- Daytona builds from Dockerfiles at creation; fixes on `main` take effect next run (GHCR images need separate rebuild). Harbor+Daytona preferred; `daytona_runner.py` for quick validation only.
- Timing fields (`started_at`, `finished_at`) at **top level** of `result.json`, not nested under `timing`.
-
- `trajectory.json` generated by Harbor's `_convert_events_to_trajectory()`, not by Claude Code CLI.
+
- Timing fields at **top level** of `result.json` (not under `timing`). `trajectory.json` from Harbor's `_convert_events_to_trajectory()`, not CLI.
- SWE-bench `test.sh` redirects stdout to temp file; Harbor never sees `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers.
- Token usage in `trajectory.json`; transcript parsers don't see it. Contract: write `/logs/verifier/reward.txt`.
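
The note about timing fields living at the top level of `result.json` can be illustrated with a small reader. This is only a sketch: it assumes the two fields hold ISO 8601 timestamp strings, which the source does not state, and `run_duration_seconds` is a hypothetical helper name.

```python
from datetime import datetime


def run_duration_seconds(result: dict) -> float:
    """Return the run duration from a parsed result.json payload.

    Reads `started_at` / `finished_at` from the TOP LEVEL of the dict,
    not from a nested `timing` object. Assumes ISO 8601 strings
    (an assumption, not confirmed by the source).
    """
    started = datetime.fromisoformat(result["started_at"])
    finished = datetime.fromisoformat(result["finished_at"])
    return (finished - started).total_seconds()


# Hypothetical payload shaped like the layout described above.
example = {
    "started_at": "2024-01-01T00:00:00",
    "finished_at": "2024-01-01T00:00:01.500000",
}
duration = run_duration_seconds(example)
```

A reader that looks under `result["timing"]` instead would raise `KeyError` on this layout, which is the failure the note warns about.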
@@ -91,49 +89,50 @@ full operations manual.
- `sanitize_secrets.py` IS integrated into `export_official_results.py` (line 32), but allowlist bypass (`_FAKE_INDICATORS` substring matching too broad) undermines it. Use exact-match `FAKE_KEY_ALLOWLIST`.

### Harness-Agnostic Verifiers
-
- **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for auto-committing agents (e.g., OpenHands).
-
- Verifier fallbacks: `${TASK_WORKDIR:-/workspace}` for workdir, `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root.
-
- Set `GOWORK=off` in test.sh when sg_only verifier restores full repo (go.work may need newer Go).
-
- **122 active tasks** hardcode `ANSWER_PATH="/workspace/answer.json"`. Check `ANSWER_JSON` in verifier lib. Bulk fix feasible; zero scores on non-Harbor.
+
- **no_changes_guard**: use `git diff origin/main HEAD` (not `HEAD`) for auto-committing agents.
- `GOWORK=off` in test.sh when sg_only verifier restores full repo.
+
- **122 active tasks** hardcode `ANSWER_PATH="/workspace/answer.json"`. Check `ANSWER_JSON` in verifier lib. Zero scores on non-Harbor.
+
- **Verifier lib duplication**: 401 copies of `answer_json_verifier_lib.sh` (13 suites; task copies diverged with extra funcs). 275 copies of `dual_score_lib.sh` (csb/ only). `benchmarks/_shared/` missing; every fix requires touching 401+ files.
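
The verifier fallbacks listed above can be mirrored outside the shell. The following is a minimal sketch, not the project's verifier code: `env_or` and `resolve_paths` are hypothetical helpers, and treating `ANSWER_JSON` as an environment override for the answer file is an assumption about how the verifier lib uses that name.

```python
import os
import posixpath


def env_or(name, default, env=None):
    """Mimic shell `${NAME:-default}`: fall back when unset OR empty."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    return value if value else default


def resolve_paths(env=None):
    """Resolve verifier paths without hardcoding /workspace."""
    workdir = env_or("TASK_WORKDIR", "/workspace", env)
    # Nested fallback: TASK_REPO_ROOT, then VERIFY_REPO, then /workspace.
    repo_root = env_or(
        "TASK_REPO_ROOT", env_or("VERIFY_REPO", "/workspace", env), env
    )
    # Assumption: ANSWER_JSON, when set, overrides the default answer location.
    answer_path = env_or(
        "ANSWER_JSON", posixpath.join(workdir, "answer.json"), env
    )
    return {"workdir": workdir, "repo_root": repo_root, "answer_path": answer_path}
```

Passing an explicit mapping, for example `resolve_paths({"TASK_WORKDIR": "/w"})`, makes the fallback order easy to check without touching the real environment.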

### Scripts / Code Quality
-
- **abc_audit.py duplicate functions**: `check_oa_equivalent_solutions`, `check_ob_negated_solutions`, `check_og_determinism`, `check_t10_shared_state` each defined twice. Python uses last definition silently.
+
- **abc_audit.py**: 4 functions defined twice (`check_oa_*`, `check_ob_*`, `check_og_*`, `check_t10_*`); Python silently uses last definition.
- **`verification_modes` + `use_case_category` missing from all 274 tasks**: Breaks auto-detection (always defaults to artifact-only) and `--use-case-category` filter (silently filters everything).
+
- **`--skip-completed`**: requires both `result.json`+`task_metrics.json`. Fix: check only `result.json`.
+
- **Task registry header stale**: claims 436, actual 274. `sync_task_metadata.py --fix` doesn't update it.
+
- **`verification_modes`/`use_case_category` missing from all 274 tasks**: breaks auto-detection + `--use-case-category` filter (silently filters all).
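
Because Python rebinds a name silently when a function is defined twice, duplicates like the ones noted for `abc_audit.py` are easy to miss by eye. One way to surface them is a small `ast` scan. This is an illustrative sketch: `duplicate_functions` is a hypothetical helper and the sample source is made up for the demonstration.

```python
import ast
from collections import Counter


def duplicate_functions(source: str) -> list:
    """Return sorted names of top-level functions defined more than once."""
    tree = ast.parse(source)
    counts = Counter(
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up module text; the second definition silently wins at import time.
SAMPLE = '''
def check_og_determinism():
    return "first"

def check_og_determinism():
    return "second"

def helper():
    pass
'''

duplicates = duplicate_functions(SAMPLE)
```

Running the same function over the text of a real script (for example via `pathlib.Path(...).read_text()`) would list any top-level names that are shadowed.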

### Validation / Scoring
- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (`sha256sum`).
-
- Agent <2s = never ran. `no_changes_guard`: write `reward.txt` in Python, not bash after it.
-
- `timeout 600` on test runners. `--forceExit` for Jest. Jest+TS: `memory_mb = 8192`.
+
- Agent <2s = never ran. `no_changes_guard`: write `reward.txt` in Python, not bash. `timeout 600` on runners; `--forceExit` for Jest; Jest+TS: `memory_mb = 8192`.
- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env.
+
- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env. `generate_eval_report.py:147` `mcp_mode or config_name` falls through on empty string.
+
- `models.py` `from_dict()` mutates caller's dict via `.pop()`.
- 3 schemas use deprecated `ccb_mcp_*` enums; actual names are `csb_org_*`. 8 schema files have zero consumers.
+
- 3 schemas use deprecated `ccb_mcp_*` enums; 8 have zero consumers. Examples embed legacy names (`ccb_crossrepo`); should be `csb_org_*`/`csb_sdlc_*`.
- **16 copies of `DIR_PREFIX_TO_SUITE`** across 30+ scripts with divergent definitions. Centralize in `csb_metrics/suite_registry.py`.
-
- Schema examples embed legacy suite names (`ccb_crossrepo`, `ccb_locobench`); should be `csb_org_*`/`csb_sdlc_*`.
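
The falsy-value and dict-mutation pitfalls above share a pattern: truthiness checks and in-place `.pop()` hide intent. The sketch below shows one possible shape for the fixes. All three functions are hypothetical stand-ins, not the code in `generate_eval_report.py` or `models.py`, and it assumes that an empty `mcp_mode` is a deliberate value and that a `max_score` of 0 should yield a ratio of 0.0.

```python
from typing import Optional


def pick_mode(mcp_mode: Optional[str], config_name: str) -> str:
    """Fall back to config_name only when mcp_mode is actually missing.

    `mcp_mode or config_name` would also discard an explicit empty string.
    """
    return config_name if mcp_mode is None else mcp_mode


def score_ratio(score: float, max_score: Optional[float]) -> Optional[float]:
    """Distinguish 'no max score recorded' (None) from a max score of 0."""
    if max_score is None:
        return None
    if max_score == 0:
        return 0.0  # assumption: a zero-point task contributes a zero ratio
    return score / max_score


def from_dict(data: dict) -> dict:
    """Build a record without mutating the caller's dict."""
    data = dict(data)  # shallow copy, so .pop() below is local
    name = data.pop("name")
    return {"name": name, "extra": data}
```

Comparing against `None` explicitly keeps `0`, `0.0`, and `""` as valid inputs, and copying before `.pop()` lets callers reuse their dict after the call.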

### Skills / Automation
- **54 stale paths**: 25 skill files hardcode `~/CodeScaleBench` (actual `~/CodeContextBench`). Use `$(git rev-parse --show-toplevel)`.
-
- **21 stale config refs**: `sourcegraph_full` in 14 skill files + 5 schemas. `BASELINE_MCP_TYPE=sourcegraph_full` is invalid (accepts `none`/`sourcegraph`/`deepsearch`).
+
- **21 stale `sourcegraph_full` refs**: 14 skill files + 5 schemas. Invalid `BASELINE_MCP_TYPE` value (accepts `none`/`sourcegraph`/`deepsearch`).
- **3 deprecated model IDs**: `claude-opus-4-5-20251101` → `claude-opus-4-6` in skills.
- Secret-detection false-positives: use `--no-verify` when flagged code is detection logic. Classes `TestPlan`/`TestCase`/`TestResult` auto-collected by pytest; rename.
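
The stale references listed above lend themselves to a mechanical check. The following sketch validates a `BASELINE_MCP_TYPE` value against the three accepted names from the list and scans text for the two stale tokens. The helper names and the scanner are hypothetical; only the token strings and accepted values come from the notes above.

```python
import re

# Values the notes above say BASELINE_MCP_TYPE accepts.
VALID_MCP_TYPES = ("none", "sourcegraph", "deepsearch")

# Stale tokens named above; the labels are illustrative.
STALE_PATTERNS = {
    "stale repo path": re.compile(r"~/CodeScaleBench\b"),
    "invalid MCP type": re.compile(r"\bsourcegraph_full\b"),
}


def validate_mcp_type(value: str) -> str:
    """Return value unchanged if accepted, otherwise raise ValueError."""
    if value not in VALID_MCP_TYPES:
        raise ValueError(
            f"invalid BASELINE_MCP_TYPE {value!r}; expected one of {VALID_MCP_TYPES}"
        )
    return value


def find_stale_refs(text: str) -> list:
    """Return (line_number, label) pairs for each stale token found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in STALE_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits
```

Feeding each skill file's text through `find_stale_refs` would give a per-line report that could gate a pre-commit or CI step.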