New findings from 4 sessions (nightly report #7, learnings extraction,
PRD generation, Ralph conversion):
- abc_audit.py duplicate function definitions
- ir_metrics.py tt_all_r set comparison bug
- --skip-completed logic defect (needs only result.json)
- Task registry metadata header stale (436 claimed vs 274 actual)
- verification_modes + use_case_category missing from all tasks
- ANSWER_PATH count refined to 122 active / 259 total
- Corrected: sanitize_secrets.py IS integrated into export pipeline
- Ralph prd.json single-active model
 - **Never pass credentials via Docker `-e` flags** (leak into trajectory HTML). Use file-based injection: `/logs/agent/.credentials.json` with `chmod 600`.
-- `sanitize_secrets.py` redacts API keys but not yet integrated into `export_official_results.py` (manual invocation).
-- `sanitize_secrets.py` `_FAKE_INDICATORS` substring matching too broad -- use exact-match `FAKE_KEY_ALLOWLIST` instead.
+- `sanitize_secrets.py` IS integrated into `export_official_results.py` (line 32), but allowlist bypass (`_FAKE_INDICATORS` substring matching too broad) undermines it. Use exact-match `FAKE_KEY_ALLOWLIST`.
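To make the allowlist-bypass finding concrete, here is a minimal sketch of why substring matching leaks real keys while an exact-match allowlist does not. The indicator list, the allowlist entry, the `sk-` key shape, and both function names are assumptions made up for illustration; they are not taken from `sanitize_secrets.py`.

```python
import re

# Hypothetical values, for illustration only; the real lists live in sanitize_secrets.py.
FAKE_INDICATORS = ("test", "example", "fake")
FAKE_KEY_ALLOWLIST = frozenset({"sk-test-0000000000000000"})

# Simplified key shape (an assumption): "sk-" followed by 16+ key characters.
KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9-]{16,}")


def redact_with_substring_check(text: str) -> str:
    """Too broad: any key that merely contains an indicator is left in place."""
    def repl(match):
        key = match.group(0)
        if any(indicator in key.lower() for indicator in FAKE_INDICATORS):
            return key
        return "[REDACTED]"
    return KEY_PATTERN.sub(repl, text)


def redact_with_exact_allowlist(text: str) -> str:
    """Stricter: only keys listed verbatim in the allowlist are left in place."""
    def repl(match):
        key = match.group(0)
        return key if key in FAKE_KEY_ALLOWLIST else "[REDACTED]"
    return KEY_PATTERN.sub(repl, text)


leaky = "token=sk-live-latest9f3a7c2b1d"
print(redact_with_substring_check(leaky))  # key survives: "latest" contains "test"
print(redact_with_exact_allowlist(leaky))  # token=[REDACTED]
```

The exact-match variant still passes known placeholder keys through unchanged, so fixtures that rely on them should keep working.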
 ### Harness-Agnostic Verifiers
 - **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for auto-committing agents (e.g., OpenHands).
 - Verifier fallbacks: `${TASK_WORKDIR:-/workspace}` for workdir, `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root.
 - Set `GOWORK=off` in test.sh when sg_only verifier restores full repo (go.work may need newer Go).
-- **55+ tasks** hardcode `ANSWER_PATH="/workspace/answer.json"` without fallbacks (originally reported as 6 -- actual scope much larger). All use same template pattern; bulk fix feasible. Zero scores on non-Harbor harnesses.
+- **122 active tasks** (259 total with backups) hardcode `ANSWER_PATH="/workspace/answer.json"` without fallbacks. Also check `ANSWER_JSON` variable in `answer_json_verifier_lib.sh`. All use same template pattern; bulk fix feasible. Zero scores on non-Harbor harnesses.
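Because the hardcoded `ANSWER_PATH` lines all follow one template, the bulk fix could be a single regex rewrite. The sketch below is one possible approach, not the project's actual tooling: the helper name is hypothetical, and the replacement line is an assumed fallback form modeled on the `${TASK_WORKDIR:-/workspace}` pattern noted above.

```python
import re
from typing import Tuple

# Matches the hardcoded assignment quoted in the finding above (one per line).
HARDCODED = re.compile(
    r'^([ \t]*)ANSWER_PATH="/workspace/answer\.json"[ \t]*$', re.MULTILINE
)

# Assumed replacement: honor a caller-supplied ANSWER_PATH first,
# then TASK_WORKDIR, then the original /workspace default.
FALLBACK = 'ANSWER_PATH="${ANSWER_PATH:-${TASK_WORKDIR:-/workspace}/answer.json}"'


def add_answer_path_fallback(script_text: str) -> Tuple[str, int]:
    """Return the rewritten script text and the number of lines changed."""
    # A callable replacement keeps indentation and avoids escape handling.
    return HARDCODED.subn(lambda m: m.group(1) + FALLBACK, script_text)


sample = '#!/bin/bash\nANSWER_PATH="/workspace/answer.json"\necho "$ANSWER_PATH"\n'
rewritten, changed = add_answer_path_fallback(sample)
print(changed)
print(rewritten)
```

Running the rewrite a second time changes nothing, since the fallback form no longer matches the hardcoded pattern. That makes it safe to apply across active tasks and backups alike.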
+### Scripts / Code Quality
+- **abc_audit.py duplicate functions**: `check_oa_equivalent_solutions`, `check_ob_negated_solutions`, `check_og_determinism`, `check_t10_shared_state` each defined twice. Python uses last definition silently.
+- **ir_metrics.py `tt_all_r` bug**: Line 749 set comparison may report time-to-first-relevant instead of time-to-all-relevant.
+- **`--skip-completed` defect** in `run_selected_tasks.sh`: requires both `result.json` AND `task_metrics.json`. Fix: check only `result.json`.
+- **`verification_modes` + `use_case_category` missing from all 274 tasks**: Breaks auto-detection (always defaults to artifact-only) and `--use-case-category` filter (silently filters everything).
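Duplicate definitions like those in `abc_audit.py` can be caught mechanically, since Python raises no error when a later `def` rebinds an earlier name. The following is a minimal sketch using the standard `ast` module; the helper name and the sample source are illustrative assumptions, and it only inspects module-level functions.

```python
import ast
from collections import Counter


def find_duplicate_functions(source: str) -> dict:
    """Return {name: count} for module-level functions defined more than once."""
    tree = ast.parse(source)
    counts = Counter(
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )
    return {name: n for name, n in counts.items() if n > 1}


sample = '''
def check_og_determinism():
    return "first"

def check_og_determinism():
    return "second"

def check_unique():
    return "only"
'''
print(find_duplicate_functions(sample))  # {'check_og_determinism': 2}
```

A check along these lines could run in CI over the scripts directory. Methods inside classes and nested functions would need a recursive walk, which this sketch deliberately leaves out.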
 ### Validation / Scoring
 - `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (`sha256sum`).
 - Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
 - Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`. Sync both on changes.
 - `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0` when key exists. Use `or 1`.
-- **TARGET_SUITE misalignment**: 55 tasks had stale legacy suite names, 220 had none. `SUITE_WEIGHTS` lookup silently falls back to equal-weight scoring.
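The `cost_report.py` pitfall is easy to reproduce in isolation. This sketch assumes the key came into existence through an earlier bare read of the `defaultdict`, which is one common way it happens; the variable names are illustrative and not taken from the script.

```python
from collections import defaultdict

costs = defaultdict(int)
costs["baseline"]  # a bare read inserts the key with the default value 0

buggy = costs.get("baseline", 1)    # key exists, so the default 1 is ignored
fixed = costs.get("baseline") or 1  # falls back to 1 for missing and zero values

print(buggy, fixed)  # 0 1
```

Note that `or 1` also replaces a legitimately recorded `0`, which is the intended behavior when the value is used as a divisor but may not be appropriate elsewhere.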