Skip to content

Commit 3135f18

Browse files
committed
docs: add learnings from Mar 13 JSONL sessions
New findings from 4 sessions (nightly report #7, learnings extraction, PRD generation, Ralph conversion): - abc_audit.py duplicate function definitions - ir_metrics.py tt_all_r set comparison bug - --skip-completed logic defect (needs only result.json) - Task registry metadata header stale (436 claimed vs 274 actual) - verification_modes + use_case_category missing from all tasks - ANSWER_PATH count refined to 122 active / 259 total - Corrected: sanitize_secrets.py IS integrated into export pipeline - Ralph prd.json single-active model
1 parent 5a51fa6 commit 3135f18

File tree

3 files changed

+66
-69
lines changed

3 files changed

+66
-69
lines changed

AGENTS.md

Lines changed: 22 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -78,16 +78,13 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
7878
- `uv tool install` segfaults on ARM64/QEMU. Use `pip install` or Daytona (native x86_64).
7979
- Build-push-clean pattern for limited disk (~45GB): build, push, clean before next image.
8080
- Colons in agent names break Docker volume mounts. Replace `:` with `__`.
81-
- Add `|| git init` fallback to all `git clone` commands in Dockerfiles for network resilience. Applied to 269 Dockerfiles.
82-
- Add `chown claude:claude /logs` and `adduser claude` to Dockerfiles for cross-harness (OH) permission compatibility.
83-
- `jefzda/``ghcr.io/sg-evals/` migration incomplete: 33 active Dockerfiles in `csb/debug/` and `csb/fix/` still reference `jefzda/`.
81+
- Add `|| git init` fallback to `git clone` in Dockerfiles. Add `chown claude:claude /logs` + `adduser claude` for OH compat.
82+
- `jefzda/``ghcr.io/sg-evals/` migration incomplete: 33 Dockerfiles in `csb/debug/` and `csb/fix/`.
8483

8584
### MCP Configuration (inside sandboxes)
86-
- `.mcp.json` at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`.
87-
- Claude Code needs `--mcp-config` flag; it does not auto-detect. Inject MCP usage instructions into the task prompt.
85+
- `.mcp.json` at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`. Claude Code needs `--mcp-config` flag.
8886
- `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in containers.
89-
- Sourcegraph: **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP. HTTP 405 = wrong protocol.
90-
- Sourcegraph skills show empty in headless mode. Embed prompt content in CLAUDE.md.
87+
- Sourcegraph: **stdio** (`npx @sourcegraph/cody --stdio`), NOT HTTP. Skills empty in headless -- embed in CLAUDE.md.
9188
- Sourcegraph env vars: `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN` (NOT `_ENDPOINT` or `_TOKEN`).
9289

9390
### Harbor Result Format
@@ -98,14 +95,20 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
9895

9996
### Security / Credentials
10097
- **Never pass credentials via Docker `-e` flags** (leak into trajectory HTML). Use file-based injection: `/logs/agent/.credentials.json` with `chmod 600`.
101-
- `sanitize_secrets.py` redacts API keys but not yet integrated into `export_official_results.py` (manual invocation).
102-
- `sanitize_secrets.py` `_FAKE_INDICATORS` substring matching too broad -- use exact-match `FAKE_KEY_ALLOWLIST` instead.
98+
- `sanitize_secrets.py` IS integrated into `export_official_results.py` (line 32), but allowlist bypass (`_FAKE_INDICATORS` substring matching too broad) undermines it. Use exact-match `FAKE_KEY_ALLOWLIST`.
10399

104100
### Harness-Agnostic Verifiers
105101
- **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for auto-committing agents (e.g., OpenHands).
106102
- Verifier fallbacks: `${TASK_WORKDIR:-/workspace}` for workdir, `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root.
107103
- Set `GOWORK=off` in test.sh when sg_only verifier restores full repo (go.work may need newer Go).
108-
- **55+ tasks** hardcode `ANSWER_PATH="/workspace/answer.json"` without fallbacks (originally reported as 6 -- actual scope much larger). All use same template pattern; bulk fix feasible. Zero scores on non-Harbor harnesses.
104+
- **122 active tasks** (259 total with backups) hardcode `ANSWER_PATH="/workspace/answer.json"` without fallbacks. Also check `ANSWER_JSON` variable in `answer_json_verifier_lib.sh`. All use same template pattern; bulk fix feasible. Zero scores on non-Harbor harnesses.
105+
106+
### Scripts / Code Quality
107+
- **abc_audit.py duplicate functions**: `check_oa_equivalent_solutions`, `check_ob_negated_solutions`, `check_og_determinism`, `check_t10_shared_state` each defined twice. Python uses last definition silently.
108+
- **ir_metrics.py `tt_all_r` bug**: Line 749 set comparison may report time-to-first-relevant instead of time-to-all-relevant.
109+
- **`--skip-completed` defect** in `run_selected_tasks.sh`: requires both `result.json` AND `task_metrics.json`. Fix: check only `result.json`.
110+
- **Task registry metadata header stale**: claims 436 tasks, actual 274. `sync_task_metadata.py --fix` doesn't update header block.
111+
- **`verification_modes` + `use_case_category` missing from all 274 tasks**: Breaks auto-detection (always defaults to artifact-only) and `--use-case-category` filter (silently filters everything).
109112

110113
### Validation / Scoring
111114
- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (`sha256sum`).
@@ -117,15 +120,14 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
117120
- Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
118121
- Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`. Sync both on changes.
119122
- `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0` when key exists. Use `or 1`.
120-
- **TARGET_SUITE misalignment**: 55 tasks had stale legacy suite names, 220 had none. `SUITE_WEIGHTS` lookup silently falls back to equal-weight scoring.
121-
- **dual_score_lib.sh**: `scorer_artifact` always `"auto"` due to `.setdefault()` overwrite. Scoring audit trail broken.
122-
- **Falsy value bugs**: `max_score=0` treated as false (inflates scores); `None` MCP metrics misclassified as "rate-limited". Always use explicit `is None` / `== 0` checks.
123-
- **promote_run.py**: Crashes on non-dict environment config. Validate types before `.get()`.
123+
- **TARGET_SUITE misalignment**: 55 stale suite names, 220 missing. `SUITE_WEIGHTS` falls back to equal-weight.
124+
- **dual_score_lib.sh**: `scorer_artifact` always `"auto"` (`.setdefault()` overwrite). Audit trail broken.
125+
- **Falsy value bugs**: `max_score=0` treated as false; `None` MCP metrics misclassified. Use `is None` / `== 0`.
126+
- **promote_run.py**: Crashes on non-dict env config. Validate types before `.get()`.
124127

125128
### Git / Auth
126-
- `gh auth refresh` needs explicit `-s <scope>`: `gh auth refresh -h github.com -s write:packages`.
127-
- Env vars must be **exported** for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
128-
- Account readiness: `runs/state/account_health.json`. Launchers source `configs/_common.sh`.
129+
- `gh auth refresh -h github.com -s write:packages` (explicit scope needed).
130+
- Env vars must be **exported** for Harbor subprocesses (`set -a` before sourcing `.env.local`).
129131
- GitHub push protection blocks synthetic keys. Squash with `git reset --soft origin/main`.
130132
- Shallow clones fail on push. Some repos use `master`; detect with `git symbolic-ref refs/remotes/origin/HEAD`.
131133
- **gitignore negation**: `!child/` doesn't work when parent dir is ignored. Use `git add -f`.
@@ -142,13 +144,9 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
142144
- Tool categorization: check MCP prefix (`mcp__`) before substring checks to avoid miscategorization.
143145

144146
### OpenHands
145-
- Strip ALL `sandbox_plugins` (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout).
146-
- `shlex.quote()` breaks on shell metacharacters. Base64-encode instructions on host, decode inside container.
147-
- Background daemons hang Daytona poll. Wrap with `pkill` cleanup; guard with `shutil.which('pkill')`.
148-
- Alpine lacks `apt-get` (OH requirement). Use `bookworm` variants.
147+
- Strip ALL `sandbox_plugins` (`= []`). Base64-encode instructions (not `shlex.quote()`). Alpine lacks `apt-get` -- use `bookworm`.
149148
- OH MCP client ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
150-
- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py` directly.
151-
- Set `PYTHONSAFEPATH=1` to prevent repo-local packages shadowing installed deps.
149+
- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py`. Set `PYTHONSAFEPATH=1`.
152150

153151
### CI / Workflows
154152
- `docs-consistency.yml` is redundant -- subsumed by `repo_health.yml`. Doubles CI minutes.
@@ -158,6 +156,7 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
158156
- Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
159157
- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.
160158
- Ralph sessions write learnings to `progress.txt` on feature branches, not main. Compound back after merge.
159+
- **Ralph `prd.json` is single-active**: Overwriting loses tracking state. Archive old `prd.json` before replacement.
161160

162161
## Maintenance
163162
- Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.

CLAUDE.md

Lines changed: 22 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -78,16 +78,13 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
7878
- `uv tool install` segfaults on ARM64/QEMU. Use `pip install` or Daytona (native x86_64).
7979
- Build-push-clean pattern for limited disk (~45GB): build, push, clean before next image.
8080
- Colons in agent names break Docker volume mounts. Replace `:` with `__`.
81-
- Add `|| git init` fallback to all `git clone` commands in Dockerfiles for network resilience. Applied to 269 Dockerfiles.
82-
- Add `chown claude:claude /logs` and `adduser claude` to Dockerfiles for cross-harness (OH) permission compatibility.
83-
- `jefzda/``ghcr.io/sg-evals/` migration incomplete: 33 active Dockerfiles in `csb/debug/` and `csb/fix/` still reference `jefzda/`.
81+
- Add `|| git init` fallback to `git clone` in Dockerfiles. Add `chown claude:claude /logs` + `adduser claude` for OH compat.
82+
- `jefzda/``ghcr.io/sg-evals/` migration incomplete: 33 Dockerfiles in `csb/debug/` and `csb/fix/`.
8483

8584
### MCP Configuration (inside sandboxes)
86-
- `.mcp.json` at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`.
87-
- Claude Code needs `--mcp-config` flag; it does not auto-detect. Inject MCP usage instructions into the task prompt.
85+
- `.mcp.json` at `$CLAUDE_CONFIG_DIR` (typically `/logs/agent/sessions/`), not `/app/` or `/root/`. Claude Code needs `--mcp-config` flag.
8886
- `NODE_TLS_REJECT_UNAUTHORIZED=0` for Node.js SSL in containers.
89-
- Sourcegraph: **stdio transport** (`npx @sourcegraph/cody --stdio`), NOT HTTP. HTTP 405 = wrong protocol.
90-
- Sourcegraph skills show empty in headless mode. Embed prompt content in CLAUDE.md.
87+
- Sourcegraph: **stdio** (`npx @sourcegraph/cody --stdio`), NOT HTTP. Skills empty in headless -- embed in CLAUDE.md.
9188
- Sourcegraph env vars: `SOURCEGRAPH_URL` and `SOURCEGRAPH_ACCESS_TOKEN` (NOT `_ENDPOINT` or `_TOKEN`).
9289

9390
### Harbor Result Format
@@ -98,14 +95,20 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
9895

9996
### Security / Credentials
10097
- **Never pass credentials via Docker `-e` flags** (leak into trajectory HTML). Use file-based injection: `/logs/agent/.credentials.json` with `chmod 600`.
101-
- `sanitize_secrets.py` redacts API keys but not yet integrated into `export_official_results.py` (manual invocation).
102-
- `sanitize_secrets.py` `_FAKE_INDICATORS` substring matching too broad -- use exact-match `FAKE_KEY_ALLOWLIST` instead.
98+
- `sanitize_secrets.py` IS integrated into `export_official_results.py` (line 32), but allowlist bypass (`_FAKE_INDICATORS` substring matching too broad) undermines it. Use exact-match `FAKE_KEY_ALLOWLIST`.
10399

104100
### Harness-Agnostic Verifiers
105101
- **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for auto-committing agents (e.g., OpenHands).
106102
- Verifier fallbacks: `${TASK_WORKDIR:-/workspace}` for workdir, `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root.
107103
- Set `GOWORK=off` in test.sh when sg_only verifier restores full repo (go.work may need newer Go).
108-
- **55+ tasks** hardcode `ANSWER_PATH="/workspace/answer.json"` without fallbacks (originally reported as 6 -- actual scope much larger). All use same template pattern; bulk fix feasible. Zero scores on non-Harbor harnesses.
104+
- **122 active tasks** (259 total with backups) hardcode `ANSWER_PATH="/workspace/answer.json"` without fallbacks. Also check `ANSWER_JSON` variable in `answer_json_verifier_lib.sh`. All use same template pattern; bulk fix feasible. Zero scores on non-Harbor harnesses.
105+
106+
### Scripts / Code Quality
107+
- **abc_audit.py duplicate functions**: `check_oa_equivalent_solutions`, `check_ob_negated_solutions`, `check_og_determinism`, `check_t10_shared_state` each defined twice. Python uses last definition silently.
108+
- **ir_metrics.py `tt_all_r` bug**: Line 749 set comparison may report time-to-first-relevant instead of time-to-all-relevant.
109+
- **`--skip-completed` defect** in `run_selected_tasks.sh`: requires both `result.json` AND `task_metrics.json`. Fix: check only `result.json`.
110+
- **Task registry metadata header stale**: claims 436 tasks, actual 274. `sync_task_metadata.py --fix` doesn't update header block.
111+
- **`verification_modes` + `use_case_category` missing from all 274 tasks**: Breaks auto-detection (always defaults to artifact-only) and `--use-case-category` filter (silently filters everything).
109112

110113
### Validation / Scoring
111114
- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (`sha256sum`).
@@ -117,15 +120,14 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
117120
- Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
118121
- Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`. Sync both on changes.
119122
- `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0` when key exists. Use `or 1`.
120-
- **TARGET_SUITE misalignment**: 55 tasks had stale legacy suite names, 220 had none. `SUITE_WEIGHTS` lookup silently falls back to equal-weight scoring.
121-
- **dual_score_lib.sh**: `scorer_artifact` always `"auto"` due to `.setdefault()` overwrite. Scoring audit trail broken.
122-
- **Falsy value bugs**: `max_score=0` treated as false (inflates scores); `None` MCP metrics misclassified as "rate-limited". Always use explicit `is None` / `== 0` checks.
123-
- **promote_run.py**: Crashes on non-dict environment config. Validate types before `.get()`.
123+
- **TARGET_SUITE misalignment**: 55 stale suite names, 220 missing. `SUITE_WEIGHTS` falls back to equal-weight.
124+
- **dual_score_lib.sh**: `scorer_artifact` always `"auto"` (`.setdefault()` overwrite). Audit trail broken.
125+
- **Falsy value bugs**: `max_score=0` treated as false; `None` MCP metrics misclassified. Use `is None` / `== 0`.
126+
- **promote_run.py**: Crashes on non-dict env config. Validate types before `.get()`.
124127

125128
### Git / Auth
126-
- `gh auth refresh` needs explicit `-s <scope>`: `gh auth refresh -h github.com -s write:packages`.
127-
- Env vars must be **exported** for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
128-
- Account readiness: `runs/state/account_health.json`. Launchers source `configs/_common.sh`.
129+
- `gh auth refresh -h github.com -s write:packages` (explicit scope needed).
130+
- Env vars must be **exported** for Harbor subprocesses (`set -a` before sourcing `.env.local`).
129131
- GitHub push protection blocks synthetic keys. Squash with `git reset --soft origin/main`.
130132
- Shallow clones fail on push. Some repos use `master`; detect with `git symbolic-ref refs/remotes/origin/HEAD`.
131133
- **gitignore negation**: `!child/` doesn't work when parent dir is ignored. Use `git add -f`.
@@ -142,13 +144,9 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
142144
- Tool categorization: check MCP prefix (`mcp__`) before substring checks to avoid miscategorization.
143145

144146
### OpenHands
145-
- Strip ALL `sandbox_plugins` (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout).
146-
- `shlex.quote()` breaks on shell metacharacters. Base64-encode instructions on host, decode inside container.
147-
- Background daemons hang Daytona poll. Wrap with `pkill` cleanup; guard with `shutil.which('pkill')`.
148-
- Alpine lacks `apt-get` (OH requirement). Use `bookworm` variants.
147+
- Strip ALL `sandbox_plugins` (`= []`). Base64-encode instructions (not `shlex.quote()`). Alpine lacks `apt-get` -- use `bookworm`.
149148
- OH MCP client ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
150-
- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py` directly.
151-
- Set `PYTHONSAFEPATH=1` to prevent repo-local packages shadowing installed deps.
149+
- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py`. Set `PYTHONSAFEPATH=1`.
152150

153151
### CI / Workflows
154152
- `docs-consistency.yml` is redundant -- subsumed by `repo_health.yml`. Doubles CI minutes.
@@ -158,6 +156,7 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
158156
- Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
159157
- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.
160158
- Ralph sessions write learnings to `progress.txt` on feature branches, not main. Compound back after merge.
159+
- **Ralph `prd.json` is single-active**: Overwriting loses tracking state. Archive old `prd.json` before replacement.
161160

162161
## Maintenance
163162
- Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.

0 commit comments

Comments
 (0)