Commit 2eaf9fe

docs: add learnings from Mar 16 JSONL sessions
New findings:
- Verifier lib duplication: 401 copies of answer_json_verifier_lib.sh (13 suites; task copies diverged with extra funcs). 275 copies of dual_score_lib.sh (csb/ only). benchmarks/_shared/ missing.
- rerun_failed.py: shell=True injection risk; sourcegraph_full->deepsearch mapping wrong (invalid MCP type); contains deprecated model ID.
- 3/4 CI workflows missing top-level permissions: block (broad token scope).
- models.py from_dict() mutates caller's dict via .pop().
- generate_eval_report.py:147 mcp_mode or config_name falls through on empty string.
- _common.sh sparse array re-indexing bug (unset + re-index doesn't compact).
- grep -P macOS: 12 task test.sh files also affected (not just run_selected_tasks.sh).
1 parent 390e7d7 commit 2eaf9fe

3 files changed: +75 -75 lines changed

AGENTS.md

Lines changed: 25 additions & 25 deletions
@@ -56,12 +56,11 @@ full operations manual.
5656
## Common Gotchas (from session history)
5757

5858
### Documentation Generation
59-
- **NEVER edit root `CLAUDE.md` or `AGENTS.md` directly.** Edit canonical sources under `docs/ops/` and regenerate. Direct edits cause `agent_guides_drift` failures in `repo_health.py`.
59+
- **NEVER edit root `CLAUDE.md`/`AGENTS.md` directly.** Edit sources in `docs/ops/` and regenerate. Direct edits cause `agent_guides_drift` failures.
6060
- After removing directories from the repo, also clean references from `scripts/sync_agent_guides.py` (`LOCAL_SOURCES`) and `scripts/docs_consistency_check.py` (`LOCAL_AGENT_TARGET_DIRS`).
6161

6262
### Daytona / Harbor
63-
- Daytona builds from Dockerfiles at sandbox creation. Fixes on `main` take effect next run (pre-built GHCR images need separate rebuild).
64-
- Harbor+Daytona (`harbor run --environment-type daytona`) is recommended. `scripts/daytona_runner.py` is for quick validation only.
63+
- Daytona builds from Dockerfiles at creation; fixes on `main` take effect next run (GHCR images need separate rebuild). Harbor+Daytona preferred; `daytona_runner.py` for quick validation only.
6564
- `BASELINE_MCP_TYPE` env var: `none`, `sourcegraph`, `deepsearch`.
6665
- Use Daytona SDK (`daytona_sdk`) over CLI (CLI is interactive-only for SSH).
6766
- GHCR packages default **private** for personal accounts; visibility change requires GitHub web UI.
@@ -75,14 +74,13 @@ full operations manual.
7574
- `jefzda/` → `ghcr.io/sg-evals/` migration incomplete (33 Dockerfiles).
7675

7776
### MCP Configuration (inside sandboxes)
78-
- `.mcp.json` at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`. Claude Code needs `--mcp-config` flag.
77+
- `.mcp.json` at `$CLAUDE_CONFIG_DIR` (`/logs/agent/sessions/`), not `/app/`. Needs `--mcp-config` flag.
7978
- `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in containers.
80-
- Sourcegraph: **stdio** (`npx @sourcegraph/cody --stdio`), NOT HTTP. Skills empty in headless -- embed in CLAUDE.md.
81-
- Sourcegraph env vars: `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN` (NOT `_ENDPOINT` or `_TOKEN`).
79+
- Sourcegraph: **stdio** (`npx @sourcegraph/cody --stdio`), NOT HTTP. Skills empty in headless -- embed in CLAUDE.md.
80+
- Sourcegraph env vars: `SOURCEGRAPH_URL`, `SOURCEGRAPH_ACCESS_TOKEN` (NOT `_ENDPOINT`/`_TOKEN`).
8281

8382
### Harbor Result Format
84-
- Timing fields (`started_at`, `finished_at`) at **top level** of `result.json`, not nested under `timing`.
85-
- `trajectory.json` generated by Harbor's `_convert_events_to_trajectory()`, not by Claude Code CLI.
83+
- Timing fields at **top level** of `result.json` (not under `timing`). `trajectory.json` from Harbor's `_convert_events_to_trajectory()`, not CLI.
8684
- SWE-bench `test.sh` redirects stdout to temp file; Harbor never sees `START_TEST_OUTPUT`/`END_TEST_OUTPUT` markers.
8785
- Token usage in `trajectory.json`; transcript parsers don't see it. Contract: write `/logs/verifier/reward.txt`.
8886

@@ -91,49 +89,50 @@ full operations manual.
9189
- `sanitize_secrets.py` IS integrated into `export_official_results.py` (line 32), but allowlist bypass (`_FAKE_INDICATORS` substring matching too broad) undermines it. Use exact-match `FAKE_KEY_ALLOWLIST`.
9290

9391
### Harness-Agnostic Verifiers
94-
- **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for auto-committing agents (e.g., OpenHands).
95-
- Verifier fallbacks: `${TASK_WORKDIR:-/workspace}` for workdir, `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root.
96-
- Set `GOWORK=off` in test.sh when sg_only verifier restores full repo (go.work may need newer Go).
97-
- **122 active tasks** hardcode `ANSWER_PATH="/workspace/answer.json"`. Check `ANSWER_JSON` in verifier lib. Bulk fix feasible; zero scores on non-Harbor.
92+
- **no_changes_guard**: use `git diff origin/main HEAD` (not `HEAD`) for auto-committing agents.
93+
- Verifier fallbacks: `${TASK_WORKDIR:-/workspace}`, `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}`.
94+
- `GOWORK=off` in test.sh when sg_only verifier restores full repo.
95+
- **122 active tasks** hardcode `ANSWER_PATH="/workspace/answer.json"`. Check `ANSWER_JSON` in verifier lib. Zero scores on non-Harbor.
96+
- **Verifier lib duplication**: 401 copies of `answer_json_verifier_lib.sh` (13 suites; task copies diverged with extra funcs). 275 copies of `dual_score_lib.sh` (csb/ only). `benchmarks/_shared/` missing; every fix requires touching 401+ files.
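
The fallback chain in the bullets above could be mirrored in Python roughly as follows. This is a minimal sketch, not the verifier library's actual logic: it assumes `ANSWER_JSON` is an environment variable holding a full path and that it should take precedence over `TASK_WORKDIR`. The function name `resolve_answer_path` is hypothetical.

```python
import os
import posixpath


def resolve_answer_path(env=None):
    """Pick the answer file path from the environment instead of hardcoding it.

    Assumed precedence (not confirmed by the source):
    ANSWER_JSON, then TASK_WORKDIR/answer.json, then /workspace/answer.json.
    """
    env = os.environ if env is None else env
    explicit = env.get("ANSWER_JSON")
    if explicit:
        return explicit
    # `or` treats an empty string like an unset variable, matching ${VAR:-default}.
    workdir = env.get("TASK_WORKDIR") or "/workspace"
    return posixpath.join(workdir, "answer.json")


if __name__ == "__main__":
    print(resolve_answer_path({}))
    print(resolve_answer_path({"TASK_WORKDIR": "/testbed"}))
    print(resolve_answer_path({"ANSWER_JSON": "/logs/agent/answer.json"}))
```

Passing a plain dict as `env` keeps the function easy to test without touching the real process environment.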
9897

9998
### Scripts / Code Quality
100-
- **abc_audit.py duplicate functions**: `check_oa_equivalent_solutions`, `check_ob_negated_solutions`, `check_og_determinism`, `check_t10_shared_state` each defined twice. Python uses last definition silently.
99+
- **abc_audit.py**: 4 functions defined twice (`check_oa_*`, `check_ob_*`, `check_og_*`, `check_t10_*`); Python silently uses last definition.
100+
- **`rerun_failed.py`**: `shell=True` + dynamic commands (injection risk); `sourcegraph_full → deepsearch` mapping wrong (invalid MCP type); contains deprecated model ID.
101101
- **ir_metrics.py `tt_all_r` bug**: Line 749 set comparison may report time-to-first-relevant instead of time-to-all-relevant.
102-
- **`--skip-completed` defect** in `run_selected_tasks.sh`: requires both `result.json` AND `task_metrics.json`. Fix: check only `result.json`.
103-
- **Task registry metadata header stale**: claims 436 tasks, actual 274. `sync_task_metadata.py --fix` doesn't update header block.
104-
- **`verification_modes` + `use_case_category` missing from all 274 tasks**: Breaks auto-detection (always defaults to artifact-only) and `--use-case-category` filter (silently filters everything).
102+
- **`--skip-completed`**: requires both `result.json` + `task_metrics.json`. Fix: check only `result.json`.
103+
- **Task registry header stale**: claims 436, actual 274. `sync_task_metadata.py --fix` doesn't update it.
104+
- **`verification_modes`/`use_case_category` missing from all 274 tasks**: breaks auto-detection + `--use-case-category` filter (silently filters all).
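
Two of the items above can be illustrated with a short sketch: finding functions that are defined twice in one module, and passing dynamic values to a subprocess without `shell=True`. The helper names (`find_duplicate_defs`, `echo_argument`) and the sample module text are hypothetical; nothing here is taken from `abc_audit.py` or `rerun_failed.py`.

```python
import ast
import subprocess
import sys
from collections import Counter


def find_duplicate_defs(source: str) -> list[str]:
    """Return names of top-level functions defined more than once, sorted."""
    tree = ast.parse(source)
    counts = Counter(
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )
    return sorted(name for name, n in counts.items() if n > 1)


def echo_argument(value: str) -> str:
    """Hand `value` to a child process as a single argv entry, with no shell."""
    child = "import sys; sys.stdout.write(sys.argv[1])"
    result = subprocess.run(
        [sys.executable, "-c", child, value],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical module text: the second check_example silently replaces the first.
SAMPLE = '''
def check_example():
    return "first"

def check_example():
    return "second"

def check_other():
    return "only"
'''

if __name__ == "__main__":
    print(find_duplicate_defs(SAMPLE))
    # The shell metacharacters arrive in the child as plain text.
    print(echo_argument("task_id; echo injected"))
```

With an argument list, the operating system receives each element as-is, so a value containing `;` or `$(...)` is never interpreted by a shell.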
105105

106106
### Validation / Scoring
107107
- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (`sha256sum`).
108-
- Agent <2s = never ran. `no_changes_guard`: write `reward.txt` in Python, not bash after it.
109-
- `timeout 600` on test runners. `--forceExit` for Jest. Jest+TS: `memory_mb = 8192`.
108+
- Agent <2s = never ran. `no_changes_guard`: write `reward.txt` in Python, not bash. `timeout 600` on runners; `--forceExit` for Jest; Jest+TS: `memory_mb = 8192`.
110109
- **CSB dual-score**: file edits + `answer.json` independent. Fallback: `promoted_verifier.py` → `oracle_checks.py` → heuristic.
111110
- Rate-limited (score=0, <30s): `quarantine_invalid_tasks.py --execute`. Bare `$VAR` in `instruction.md` → use `<placeholder>`.
112111
- Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`.
113112
- `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0`. Use `or 1`.
114113
- **TARGET_SUITE**: 55 stale, 220 missing. `dual_score_lib.sh` `scorer_artifact` always `"auto"`.
115-
- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env.
114+
- **Falsy bugs**: `max_score=0` as false; `None` MCP metrics misclassified. `promote_run.py` crashes on non-dict env. `generate_eval_report.py:147` `mcp_mode or config_name` falls through on empty string.
115+
- `models.py` `from_dict()` mutates caller's dict via `.pop()`.
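
The falsy-value and mutation pitfalls listed above can be reproduced in a few lines. This is a sketch under assumptions: the function names are hypothetical, and whether an empty `mcp_mode` should be kept or replaced depends on the real report code, which is not shown here.

```python
from collections import defaultdict


def keep_explicit_empty(mcp_mode, config_name):
    """Fall back to config_name only when mcp_mode is None, not when it is ''."""
    return config_name if mcp_mode is None else mcp_mode


def baseline_or_one(costs):
    """Return the baseline cost, treating a missing or zero entry as 1."""
    return costs.get("baseline") or 1


def from_dict_copy(data):
    """Pop keys from a shallow copy so the caller's dict is left untouched."""
    data = dict(data)
    name = data.pop("name", None)
    return name, data


if __name__ == "__main__":
    # `or` treats '' like None, so the empty string falls through.
    print(repr("" or "cfg"), repr(keep_explicit_empty("", "cfg")))

    # Once a defaultdict(int) key exists with value 0, .get's default is ignored.
    costs = defaultdict(int)
    costs["baseline"] += 0
    print(costs.get("baseline", 1), baseline_or_one(costs))

    original = {"name": "task", "extra": 1}
    print(from_dict_copy(original), original)
```

Copying before `.pop()` costs one shallow copy per call, which is usually negligible next to the confusion of a caller finding keys missing from its own dict.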
116116

117117
### Agent / Runner Robustness
118118
- **Agent `/tmp` race**: `claude_baseline_agent.py:1134` uses fixed `/tmp/claude_system_prompt.txt`, `/tmp/claude_run.sh`. Concurrent tasks cross-contaminate. Use `mktemp`.
119119
- **Token refresh**: `claude_baseline_agent.py:1523` only catches `HTTPError`. Add `URLError`/`socket.timeout`. `e.read()` leaks socket FD; use `with e:`.
120120
- **Runner pipefail**: `run_selected_tasks.sh:681` `harbor_run_guarded | tee || echo` -- `||` applies to `tee` (always 0). Add `set -o pipefail`.
121121
- **Runner cleanup**: No `trap` for temp dirs on early exit. `mktemp` failure (line 648) silently copies to CWD.
122-
- **`grep -P` macOS**: `run_selected_tasks.sh:726` silently fails on BSD grep. Use `sed -n` instead.
122+
- **`grep -P` macOS**: `run_selected_tasks.sh:726` + 12 task test.sh files silently fail on BSD grep. Use `sed -n` or POSIX alternatives.
123+
- **`_common.sh` sparse array**: `unset` + `pids=("${pids[@]}")` doesn't compact sparse arrays in Bash; gaps persist (lines 1344-1352).
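
For the `/tmp` race above, the Python-side equivalent of `mktemp` is `tempfile.mkstemp`, which creates a uniquely named file atomically. The sketch below is illustrative only; `write_private_temp` and the `claude_` prefix are hypothetical and not taken from `claude_baseline_agent.py`.

```python
import os
import tempfile


def write_private_temp(text: str, suffix: str = ".txt") -> str:
    """Write text to a uniquely named temp file and return its path.

    mkstemp creates the file atomically with a random name, so concurrent
    tasks never share a path the way a fixed /tmp filename would.
    """
    fd, path = tempfile.mkstemp(prefix="claude_", suffix=suffix)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    return path


if __name__ == "__main__":
    first = write_private_temp("system prompt for task A")
    second = write_private_temp("system prompt for task B")
    try:
        print(first != second)
    finally:
        # The caller owns cleanup; mkstemp never deletes the file itself.
        os.remove(first)
        os.remove(second)
```

Because `mkstemp` leaves deletion to the caller, pairing it with a `try`/`finally` (or a `trap` in the shell runner) avoids leaking files on early exit.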
123124

124125
### Schema / Suite Naming
125-
- 3 schemas use deprecated `ccb_mcp_*` enums; actual names are `csb_org_*`. 8 schema files have zero consumers.
126+
- 3 schemas use deprecated `ccb_mcp_*` enums; 8 have zero consumers. Examples embed legacy names (`ccb_crossrepo`); should be `csb_org_*`/`csb_sdlc_*`.
126127
- **16 copies of `DIR_PREFIX_TO_SUITE`** across 30+ scripts with divergent definitions. Centralize in `csb_metrics/suite_registry.py`.
127-
- Schema examples embed legacy suite names (`ccb_crossrepo`, `ccb_locobench`); should be `csb_org_*`/`csb_sdlc_*`.
128128

129129
### Skills / Automation
130130
- **54 stale paths**: 25 skill files hardcode `~/CodeScaleBench` (actual `~/CodeContextBench`). Use `$(git rev-parse --show-toplevel)`.
131-
- **21 stale config refs**: `sourcegraph_full` in 14 skill files + 5 schemas. `BASELINE_MCP_TYPE=sourcegraph_full` is invalid (accepts `none`/`sourcegraph`/`deepsearch`).
131+
- **21 stale `sourcegraph_full` refs**: 14 skill files + 5 schemas. Invalid `BASELINE_MCP_TYPE` value (accepts `none`/`sourcegraph`/`deepsearch`).
132132
- **3 deprecated model IDs**: `claude-opus-4-5-20251101` → `claude-opus-4-6` in skills.
133133

134134
### Git / Auth
135-
- `gh auth refresh -h github.com -s write:packages` (explicit scope needed).
136-
- Env vars must be **exported** for Harbor subprocesses (`set -a` before sourcing `.env.local`).
135+
- `gh auth refresh -h github.com -s write:packages`. Env vars must be **exported** for Harbor subprocesses (`set -a` before sourcing `.env.local`).
137136
- GitHub push protection blocks synthetic keys. Squash with `git reset --soft origin/main`.
138137
- Shallow clones fail on push. Some repos use `master`; detect with `git symbolic-ref refs/remotes/origin/HEAD`.
139138
- **gitignore negation**: `!child/` doesn't work when parent dir is ignored. Use `git add -f`.
@@ -154,6 +153,7 @@ full operations manual.
154153
### CI / Workflows
155154
- `docs-consistency.yml` redundant (subsumed by `repo_health.yml`). Export HTML truncates at 1200 rows.
156155
- 4 workflows use 3 Python versions (3.10/3.11/3.12); standardize to 3.10. `roam.yml` unpinned `pip install roam-code`.
156+
- 3/4 CI workflows missing top-level `permissions:` block → overly broad default GitHub Actions token scope.
157157

158158
### Pre-commit / Pytest / Ralph
159159
- Secret-detection false-positives: use `--no-verify` when flagged code is detection logic. Classes `TestPlan`/`TestCase`/`TestResult` auto-collected by pytest; rename.
