- **Never pass credentials via Docker `-e` flags** (leak into trajectory HTML). Use file-based injection: `/logs/agent/.credentials.json` with `chmod 600`.
-- `scripts/sanitize_secrets.py` redacts real API keys at result generation time. Not yet integrated into `export_official_results.py` (manual invocation required).
-- `sanitize_secrets.py` `_FAKE_INDICATORS` substring matching is too broad -- `"example"`, `"test_key"`, `"dummy"` can bypass redaction of real secrets. Use exact-match `FAKE_KEY_ALLOWLIST` instead.
+- `sanitize_secrets.py` redacts API keys but not yet integrated into `export_official_results.py` (manual invocation).
+- `sanitize_secrets.py` `_FAKE_INDICATORS` substring matching too broad -- use exact-match `FAKE_KEY_ALLOWLIST` instead.
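
The substring-versus-exact-match point above can be illustrated with a minimal sketch. The contents of `_FAKE_INDICATORS` and `FAKE_KEY_ALLOWLIST`, the helper names, and the sample key are assumptions made for illustration; the real `sanitize_secrets.py` is not shown in this diff.

```python
# Hypothetical contents: the actual lists in sanitize_secrets.py may differ.
_FAKE_INDICATORS = ("example", "test_key", "dummy")
FAKE_KEY_ALLOWLIST = frozenset({"sk-example", "test_key", "dummy"})


def is_fake_substring(value: str) -> bool:
    """Too broad: any value that merely contains an indicator counts as fake."""
    lowered = value.lower()
    return any(indicator in lowered for indicator in _FAKE_INDICATORS)


def is_fake_exact(value: str) -> bool:
    """Stricter: only values listed verbatim in the allowlist count as fake."""
    return value in FAKE_KEY_ALLOWLIST


def redact(value: str, is_fake) -> str:
    """Leave known-fake values alone and redact everything else."""
    return value if is_fake(value) else "[REDACTED]"


# A made-up key that happens to contain the word "example".
real_looking = "sk-live-9f8e7d-example-team"
leaked = redact(real_looking, is_fake_substring)   # passes through unredacted
redacted = redact(real_looking, is_fake_exact)     # redacted as intended
```

With substring matching the made-up key passes through untouched, while the exact-match check redacts it and still leaves the allowlisted placeholder `test_key` alone.
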
### Harness-Agnostic Verifiers
- **no_changes_guard** must use `git diff origin/main HEAD` (not `git diff HEAD`) for auto-committing agents (e.g., OpenHands).
- Verifier fallbacks: `${TASK_WORKDIR:-/workspace}` for workdir, `${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}` for repo root.
- Set `GOWORK=off` in test.sh when sg_only verifier restores full repo (go.work may need newer Go).
-- 6 tasks still hardcode `/workspace` without fallbacks: 3 in `csb_sdlc_understand` (document search), 3 in `csb_org_onboarding` (`answer.json`). Zero scores on non-Harbor harnesses.
+- **55+ tasks** hardcode `ANSWER_PATH="/workspace/answer.json"` without fallbacks (originally reported as 6 -- actual scope much larger). All use same template pattern; bulk fix feasible. Zero scores on non-Harbor harnesses.
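
For verifiers written in Python rather than shell, the same fallback chain can be sketched as follows. The function names are hypothetical, and deriving the answer path from the workdir is an assumption about how the hardcoded `ANSWER_PATH` could be replaced, not the repository's actual fix. Like `${VAR:-default}`, the `or` chain treats an empty variable the same as an unset one.

```python
import os
import posixpath
from typing import Mapping, Tuple


def resolve_paths(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Mirror ${TASK_WORKDIR:-/workspace} and
    ${TASK_REPO_ROOT:-${VERIFY_REPO:-/workspace}}."""
    workdir = env.get("TASK_WORKDIR") or "/workspace"
    repo_root = env.get("TASK_REPO_ROOT") or env.get("VERIFY_REPO") or "/workspace"
    return workdir, repo_root


def answer_path(env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical replacement for a hardcoded /workspace/answer.json."""
    workdir, _ = resolve_paths(env)
    return posixpath.join(workdir, "answer.json")
```

On a harness that sets none of these variables, both helpers still resolve to the `/workspace` defaults, so existing behaviour is unchanged there.
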
### Validation / Scoring
-- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (verify with `sha256sum`).
-- Install scripts printing "INSTALL_SUCCESS" regardless of outcome are common. Verify binary exists.
-- Agent completing in **<2s** = never installed/ran. Trial dir names truncated with hash; real name in `config.json` at `task.path`.
-- LoCoBench task IDs have multi-word fields. Use 3-digit task number as positional anchor.
+- `validators.py` duplicated across `ccb_build` tasks. Changes must hit **all copies** (`sha256sum`).
+- Agent completing in **<2s** = never installed/ran. Real name in `config.json` at `task.path`.
- **no_changes_guard**: write `reward.txt` inside Python block, not in bash after it.
- `timeout 600` on all test runners. `--forceExit` for Jest. Jest+TS needs `memory_mb = 8192`.
- Bare `$VAR` in `instruction.md` gets expanded. Use `<placeholder>` syntax.
- Pass rate logic duplicated in `generate_eval_report.py` and `csb_metrics/models.py`. Sync both on changes.
-- `repo_health.py` fallback dict missing `prompt_hygiene` and `launch_policy` checks (3 of 5 only).
- `cost_report.py`: `defaultdict(int)` + `.get("baseline", 1)` returns `0` when key exists. Use `or 1`.
+- **TARGET_SUITE misalignment**: 55 tasks had stale legacy suite names, 220 had none. `SUITE_WEIGHTS` lookup silently falls back to equal-weight scoring.
+- **dual_score_lib.sh**: `scorer_artifact` always `"auto"` due to `.setdefault()` overwrite. Scoring audit trail broken.
+- **Falsy value bugs**: `max_score=0` treated as false (inflates scores); `None` MCP metrics misclassified as "rate-limited". Always use explicit `is None` / `== 0` checks.
+- **promote_run.py**: Crashes on non-dict environment config. Validate types before `.get()`.
- macOS Bash 3.2 lacks `declare -A`. Use pipe-delimited strings with `IFS='|' read -r`.
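
The `defaultdict` and falsy-value items above are easy to reproduce in isolation. The sketch below is illustrative only: `score_ratio`, `classify_mcp_metric`, and their return labels are hypothetical names, not code from `cost_report.py` or the scoring scripts.

```python
from collections import defaultdict

# Reading a missing key on defaultdict(int) inserts it with value 0 ...
costs = defaultdict(int)
_ = costs["baseline"]

# ... so the .get() default is never used once the key exists.
buggy_divisor = costs.get("baseline", 1)       # 0, not 1
safe_divisor = costs.get("baseline", 1) or 1   # 1


def score_ratio(score, max_score):
    """Handle None and 0 explicitly instead of relying on truthiness."""
    if max_score is None:
        return None          # no maximum recorded
    if max_score == 0:
        return 0.0           # a real zero maximum, not "missing"
    return score / max_score


def classify_mcp_metric(call_count):
    """Keep 'no data' (None) separate from a measured zero."""
    if call_count is None:
        return "missing"
    if call_count == 0:
        return "no-calls"
    return "used"
```

The `or 1` guard is only appropriate where a zero baseline is never meaningful; where zero is a valid measurement, the explicit `is None` / `== 0` branches are the safer pattern.
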
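
The `sha256sum` check on duplicated `validators.py` copies could also be scripted. This is a minimal sketch that assumes all copies live somewhere under a single root directory; the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def group_by_digest(root: Path, filename: str = "validators.py") -> Dict[str, List[Path]]:
    """Group every copy of `filename` under `root` by its SHA-256 digest."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob(filename)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path)
    return dict(groups)


def copies_in_sync(root: Path, filename: str = "validators.py") -> bool:
    """True when all copies share one digest (or no copy exists at all)."""
    return len(group_by_digest(root, filename)) <= 1
```

When `copies_in_sync` returns `False`, the groups returned by `group_by_digest` show which copies diverged from the rest.
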
### LLM Judge
-- Always include "Respond with valid JSON only (escape all quotes and special characters)" in judge prompts. Unescaped quotes in LLM-generated JSON break parsing.
-- Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
-- Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
+- Always include "Respond with valid JSON only" in judge prompts. Unescaped quotes break parsing.
+- Judge should use task-type-aware evaluation: different rubrics per task type.
+- Tool categorization: check MCP prefix (`mcp__`) before substring checks to avoid miscategorization.
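
The ordering issue in tool categorization can be shown with a small sketch. The category labels (`"mcp"`, `"local_search"`, `"other"`) and both function names are assumptions for illustration; only the `mcp__` prefix and the `deep_search` substring come from the notes above.

```python
def categorize_tool(name: str) -> str:
    """Check the MCP prefix first, then fall back to substring checks."""
    if name.startswith("mcp__"):
        return "mcp"
    if "deep_search" in name:
        return "local_search"   # hypothetical label
    return "other"


def categorize_tool_wrong_order(name: str) -> str:
    """Same checks in the wrong order: the substring test shadows the prefix."""
    if "deep_search" in name:
        return "local_search"
    if name.startswith("mcp__"):
        return "mcp"
    return "other"
```

With the wrong order, `mcp__deep_search` is labelled as a local search tool, which is the miscategorization the notes warn about.
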
### OpenHands
-- Strip ALL `sandbox_plugins` (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout). TOML config has no effect in v1.4.0.
-- `shlex.quote()` breaks on shell metacharacters (0% execution). Base64-encode instructions on host, decode inside container.
+- Strip ALL `sandbox_plugins` (`= []`) -- `agent_skills` indexes `/workspace` at startup (120s timeout).
+- `shlex.quote()` breaks on shell metacharacters. Base64-encode instructions on host, decode inside container.
- Background daemons hang Daytona poll. Wrap with `pkill` cleanup; guard with `shutil.which('pkill')`.
-- Alpine lacks `apt-get` (OH installer requirement). Use `bookworm` variants.
-- OH MCP client has ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
-- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py` source directly.
-- Set `PYTHONSAFEPATH=1` to prevent repo-local packages from shadowing installed deps.
+- Alpine lacks `apt-get` (OH requirement). Use `bookworm` variants.
+- OH MCP client ~30s timeout. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
+- `chown -R /workspace` blocks >120s on large repos. Edit installed `runtime_init.py` directly.
+- Set `PYTHONSAFEPATH=1` to prevent repo-local packages shadowing installed deps.
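
The Base64 hand-off for instructions might look roughly like the sketch below. The function names and the destination path are hypothetical, and the container-side command assumes a `base64` binary that accepts `-d` (as GNU coreutils does). The payload uses only `A-Z`, `a-z`, `0-9`, `+`, `/`, and `=`, so it needs no shell quoting.

```python
import base64


def encode_instruction(instruction: str) -> str:
    """Host side: turn arbitrary instruction text into a shell-safe token."""
    return base64.b64encode(instruction.encode("utf-8")).decode("ascii")


def decode_instruction(payload: str) -> str:
    """Container side, if decoding is done in Python instead of `base64 -d`."""
    return base64.b64decode(payload).decode("utf-8")


def container_decode_command(payload: str, dest: str = "/tmp/instruction.md") -> str:
    """Build the command to run inside the container (dest is a fixed, trusted path)."""
    return f"printf %s {payload} | base64 -d > {dest}"
```

Because the instruction text never appears on the command line in raw form, quotes, backticks, and `$` in the instruction cannot be reinterpreted by the shell.
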
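
The guarded `pkill` cleanup for background daemons can be sketched as below. The function name, the use of `pkill -f`, and the injectable `which`/`run` parameters are assumptions for illustration; the source only states that the cleanup should be guarded with `shutil.which('pkill')`.

```python
import shutil
import subprocess


def cleanup_daemons(pattern: str, which=shutil.which, run=subprocess.run) -> bool:
    """Kill processes whose command line matches `pattern`, if pkill is available.

    Returns False when pkill is not installed, so callers can pick another strategy.
    """
    pkill = which("pkill")
    if pkill is None:
        return False
    # pkill exits non-zero when nothing matched; that is not an error here.
    run([pkill, "-f", pattern], check=False)
    return True
```

Choose the pattern carefully: `pkill -f` matches full command lines, so a pattern that also appears in the caller's own command line would terminate the caller.
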
### CI / Workflows
- `docs-consistency.yml` is redundant -- subsumed by `repo_health.yml`. Doubles CI minutes.