docs: add learnings from Mar 10-12 JSONL sessions

sjarmak · sjarmak · commit 42e41e4602ad · 2026-03-13T22:37:22.000-04:00
New gotchas: TARGET_SUITE misalignment (silent weight fallback),
dual_score_lib.sh audit trail broken, falsy value bugs in judge/quarantine,
promote_run.py type crash, 55+ hardcoded ANSWER_PATH tasks (scope expansion),
gitignore negation broken for ignored parent dirs.
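
The falsy-value gotcha is easy to reproduce in isolation. Below is a minimal sketch, not the actual judge/quarantine code (which this commit does not show): `normalize_buggy` and `normalize_explicit` are hypothetical helpers, and returning `0.0` when `max_score` is zero is an assumed policy chosen for illustration.

```python
from typing import Optional


def normalize_buggy(score: float, max_score: Optional[float]) -> float:
    # Bug: `not max_score` is true for both None and 0, so a task whose
    # max_score is 0 silently falls back to a denominator of 1.
    if not max_score:
        max_score = 1.0
    return score / max_score


def normalize_explicit(score: float, max_score: Optional[float]) -> Optional[float]:
    # Missing value: report "unknown" instead of guessing a denominator.
    if max_score is None:
        return None
    # A real zero (assumed policy): nothing could be earned, so return 0.
    if max_score == 0:
        return 0.0
    return score / max_score


print(normalize_buggy(0.5, 0))        # 0.5 (inflated)
print(normalize_explicit(0.5, 0))     # 0.0
print(normalize_explicit(0.5, None))  # None
```

The same split applies to the `None` MCP metrics case: a missing value and a real zero are distinguished before any classification happens.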
diff --git a/AGENTS.md b/AGENTS.md
@@ -98,36 +98,37 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 
 ### Security / Credentials
 - **Never pass credentials via Docker `-e` flags** (leak into trajectory HTML). Use file-based injection: `/logs/agent/.credentials.json` with `chmod 600`.
-- `scripts/sanitize_secrets.py` redacts real API keys at result generation time. Not yet integrated into `export_official_results.py` (manual invocation required).
-- `sanitize_secrets.py` `_FAKE_INDICATORS` substring matching is too broad -- `"example"`, `"test_key"`, `"dummy"` can bypass redaction of real secrets. Use exact-match `FAKE_KEY_ALLOWLIST` instead.
+- `sanitize_secrets.py` redacts API keys but is not yet integrated into `export_official_results.py` (manual invocation required).
+- `sanitize_secrets.py` `_FAKE_INDICATORS` substring matching is too broad -- use exact-match `FAKE_KEY_ALLOWLIST` instead.
 
 ### Harness-Agnostic Verifiers
 - **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for auto-committing agents (e.g., OpenHands).
 - Verifier fallbacks: `${TASK_WORKDIR:-/workspace}` for workdir, `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root.
 - Set `GOWORK=off` in test.sh when sg_only verifier restores full repo (go.work may need newer Go).
-- 6 tasks still hardcode `/workspace` without fallbacks: 3 in `csb_sdlc_understand` (document search), 3 in `csb_org_onboarding` (`answer.json`). Zero scores on non-Harbor harnesses.
+- **55+ tasks** hardcode `ANSWER_PATH="/workspace/answer.json"` without fallbacks (originally reported as 6 -- the actual scope is much larger). All use the same template pattern, so a bulk fix is feasible. Zero scores on non-Harbor harnesses.
 
 ### Validation / Scoring
-- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (verify with `sha256sum`).
-- Install scripts printing "INSTALL_SUCCESS" regardless of outcome are common. Verify binary exists.
-- Agent completing in **<2s** = never installed/ran. Trial dir names truncated with hash; real name in `config.json` at `task.path`.
-- LoCoBench task IDs have multi-word fields. Use 3-digit task number as positional anchor.
+- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (`sha256sum`).
+- Agent completing in **<2s** = never installed/ran. Real (untruncated) task name is in `config.json` at `task.path`.
 - **no_changes_guard**: write `reward.txt` inside Python block, not in bash after it.
 - `timeout 600` on all test runners. `--forceExit` for Jest. Jest+TS needs `memory_mb = 8192`.
 - **CSB dual-score**: file edits + `answer.json` scored independently. Fallback: `promoted_verifier.py` -> `oracle_checks.py` -> heuristic.
 - Rate-limited results (score=0, <30s): `scripts/quarantine_invalid_tasks.py --execute`.
 - Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
 - Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`. Sync both on changes.
-- `repo_health.py` fallback dict missing `prompt_hygiene` and `launch_policy` checks (3 of 5 only).
 - `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0` when key exists. Use `or 1`.
+- **TARGET_SUITE misalignment**: 55 tasks had stale legacy suite names, 220 had none. `SUITE_WEIGHTS` lookup silently falls back to equal-weight scoring.
+- **dual_score_lib.sh**: `scorer_artifact` always `"auto"` due to `.setdefault()` overwrite. Scoring audit trail broken.
+- **Falsy value bugs**: `max_score=0` treated as false (inflates scores); `None` MCP metrics misclassified as "rate-limited". Always use explicit `is None` / `== 0` checks.
+- **promote_run.py**: Crashes on non-dict environment config. Validate types before `.get()`.
 
 ### Git / Auth
 - `gh auth refresh` needs explicit `-s <scope>`: `gh auth refresh -h github.com -s write:packages`.
 - Env vars must be **exported** for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
 - Account readiness: `runs/state/account_health.json`. Launchers source `configs/_common.sh`.
 - GitHub push protection blocks synthetic keys. Squash with `git reset --soft origin/main`.
 - Shallow clones fail on push. Some repos use `master`; detect with `git symbolic-ref refs/remotes/origin/HEAD`.
-- GitHub secret scanning: unblock via `/security/secret-scanning/unblock-secret/` URL.
+- **gitignore negation**: `!child/` doesn't work when parent dir is ignored. Use `git add -f`.
 
 ### Python / Subprocess
 - `dict.get(key, default)` does NOT protect against `None` values. Use `data.get("key") or default_value`.
@@ -136,18 +137,18 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - macOS Bash 3.2 lacks `declare -A`. Use pipe-delimited strings with `IFS='|' read -r`.
 
 ### LLM Judge
-- Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
-- Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
-- Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
+- Always include "Respond with valid JSON only" in judge prompts. Unescaped quotes break parsing.
+- Judge should use task-type-aware evaluation: different rubrics per task type.
+- Tool categorization: check MCP prefix (`mcp__`) before substring checks to avoid miscategorization.
 
 ### OpenHands
-- Strip ALL `sandbox_plugins` (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout). TOML config has no effect in v1.4.0.
-- `shlex.quote()` breaks on shell metacharacters (0% execution). Base64-encode instructions on host, decode inside container.
+- Strip ALL `sandbox_plugins` (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout).
+- `shlex.quote()` breaks on shell metacharacters. Base64-encode instructions on host, decode inside container.
 - Background daemons hang Daytona poll. Wrap with `pkill` cleanup; guard with `shutil.which('pkill')`.
-- Alpine lacks `apt-get` (OH installer requirement). Use `bookworm` variants.
-- OH MCP client has ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
-- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py` source directly.
-- Set `PYTHONSAFEPATH=1` to prevent repo-local packages from shadowing installed deps.
+- Alpine lacks `apt-get` (OH requirement). Use `bookworm` variants.
+- OH MCP client has a ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
+- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py` directly.
+- Set `PYTHONSAFEPATH=1` to prevent repo-local packages from shadowing installed deps.
 
 ### CI / Workflows
 - `docs-consistency.yml` is redundant -- subsumed by `repo_health.yml`. Doubles CI minutes.
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -98,36 +98,37 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 
 ### Security / Credentials
 - **Never pass credentials via Docker `-e` flags** (leak into trajectory HTML). Use file-based injection: `/logs/agent/.credentials.json` with `chmod 600`.
-- `scripts/sanitize_secrets.py` redacts real API keys at result generation time. Not yet integrated into `export_official_results.py` (manual invocation required).
-- `sanitize_secrets.py` `_FAKE_INDICATORS` substring matching is too broad -- `"example"`, `"test_key"`, `"dummy"` can bypass redaction of real secrets. Use exact-match `FAKE_KEY_ALLOWLIST` instead.
+- `sanitize_secrets.py` redacts API keys but is not yet integrated into `export_official_results.py` (manual invocation required).
+- `sanitize_secrets.py` `_FAKE_INDICATORS` substring matching is too broad -- use exact-match `FAKE_KEY_ALLOWLIST` instead.
 
 ### Harness-Agnostic Verifiers
 - **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for auto-committing agents (e.g., OpenHands).
 - Verifier fallbacks: `${TASK_WORKDIR:-/workspace}` for workdir, `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root.
 - Set `GOWORK=off` in test.sh when sg_only verifier restores full repo (go.work may need newer Go).
-- 6 tasks still hardcode `/workspace` without fallbacks: 3 in `csb_sdlc_understand` (document search), 3 in `csb_org_onboarding` (`answer.json`). Zero scores on non-Harbor harnesses.
+- **55+ tasks** hardcode `ANSWER_PATH="/workspace/answer.json"` without fallbacks (originally reported as 6 -- the actual scope is much larger). All use the same template pattern, so a bulk fix is feasible. Zero scores on non-Harbor harnesses.
 
 ### Validation / Scoring
-- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (verify with `sha256sum`).
-- Install scripts printing "INSTALL_SUCCESS" regardless of outcome are common. Verify binary exists.
-- Agent completing in **<2s** = never installed/ran. Trial dir names truncated with hash; real name in `config.json` at `task.path`.
-- LoCoBench task IDs have multi-word fields. Use 3-digit task number as positional anchor.
+- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (`sha256sum`).
+- Agent completing in **<2s** = never installed/ran. Real (untruncated) task name is in `config.json` at `task.path`.
 - **no_changes_guard**: write `reward.txt` inside Python block, not in bash after it.
 - `timeout 600` on all test runners. `--forceExit` for Jest. Jest+TS needs `memory_mb = 8192`.
 - **CSB dual-score**: file edits + `answer.json` scored independently. Fallback: `promoted_verifier.py` -> `oracle_checks.py` -> heuristic.
 - Rate-limited results (score=0, <30s): `scripts/quarantine_invalid_tasks.py --execute`.
 - Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
 - Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`. Sync both on changes.
-- `repo_health.py` fallback dict missing `prompt_hygiene` and `launch_policy` checks (3 of 5 only).
 - `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0` when key exists. Use `or 1`.
+- **TARGET_SUITE misalignment**: 55 tasks had stale legacy suite names, 220 had none. `SUITE_WEIGHTS` lookup silently falls back to equal-weight scoring.
+- **dual_score_lib.sh**: `scorer_artifact` always `"auto"` due to `.setdefault()` overwrite. Scoring audit trail broken.
+- **Falsy value bugs**: `max_score=0` treated as false (inflates scores); `None` MCP metrics misclassified as "rate-limited". Always use explicit `is None` / `== 0` checks.
+- **promote_run.py**: Crashes on non-dict environment config. Validate types before `.get()`.
 
 ### Git / Auth
 - `gh auth refresh` needs explicit `-s <scope>`: `gh auth refresh -h github.com -s write:packages`.
 - Env vars must be **exported** for Harbor subprocesses. Use `set -a` before sourcing `.env.local`.
 - Account readiness: `runs/state/account_health.json`. Launchers source `configs/_common.sh`.
 - GitHub push protection blocks synthetic keys. Squash with `git reset --soft origin/main`.
 - Shallow clones fail on push. Some repos use `master`; detect with `git symbolic-ref refs/remotes/origin/HEAD`.
-- GitHub secret scanning: unblock via `/security/secret-scanning/unblock-secret/` URL.
+- **gitignore negation**: `!child/` doesn't work when parent dir is ignored. Use `git add -f`.
 
 ### Python / Subprocess
 - `dict.get(key, default)` does NOT protect against `None` values. Use `data.get("key") or default_value`.
@@ -136,18 +137,18 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - macOS Bash 3.2 lacks `declare -A`. Use pipe-delimited strings with `IFS='|' read -r`.
 
 ### LLM Judge
-- Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
-- Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
-- Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
+- Always include "Respond with valid JSON only" in judge prompts. Unescaped quotes break parsing.
+- Judge should use task-type-aware evaluation: different rubrics per task type.
+- Tool categorization: check MCP prefix (`mcp__`) before substring checks to avoid miscategorization.
 
 ### OpenHands
-- Strip ALL `sandbox_plugins` (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout). TOML config has no effect in v1.4.0.
-- `shlex.quote()` breaks on shell metacharacters (0% execution). Base64-encode instructions on host, decode inside container.
+- Strip ALL `sandbox_plugins` (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout).
+- `shlex.quote()` breaks on shell metacharacters. Base64-encode instructions on host, decode inside container.
 - Background daemons hang Daytona poll. Wrap with `pkill` cleanup; guard with `shutil.which('pkill')`.
-- Alpine lacks `apt-get` (OH installer requirement). Use `bookworm` variants.
-- OH MCP client has ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
-- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py` source directly.
-- Set `PYTHONSAFEPATH=1` to prevent repo-local packages from shadowing installed deps.
+- Alpine lacks `apt-get` (OH requirement). Use `bookworm` variants.
+- OH MCP client has a ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
+- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py` directly.
+- Set `PYTHONSAFEPATH=1` to prevent repo-local packages from shadowing installed deps.
 
 ### CI / Workflows
 - `docs-consistency.yml` is redundant -- subsumed by `repo_health.yml`. Doubles CI minutes.
diff --git a/docs/ops/ROOT_AGENT_GUIDE.md b/docs/ops/ROOT_AGENT_GUIDE.md