Commit 1e50565

docs: add OpenHands, pre-commit, pytest, Ralph gotchas from 74 session reviews
Reviewed all remaining 74 single-page sessions from claude-archive (106 total now fully processed). New gotchas added:
- OpenHands: jupyter monkey-patch, shlex.quote, background daemons, Alpine compat, MCP timeout
- Pre-commit hook false positives on secret-detection code
- Pytest class naming conflicts (TestPlan auto-collected)
- Ralph workflow: learnings stay on feature branches

Also fixes pre-existing repo_health failure: removed dangling reference to scripts/daytona_snapshot_cleanup.py from docs/DAYTONA.md.
1 parent 163cdc1 commit 1e50565

6 files changed, +40 -130 lines changed

AGENTS.md

Lines changed: 12 additions & 0 deletions
@@ -124,6 +124,18 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
 - Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
 
+### OpenHands
+- Disable Jupyter: monkey-patch `CodeActAgent.sandbox_plugins` (list, not property) to filter out `JupyterRequirement`. TOML `[sandbox] plugins` and `[core] enable_jupyter` have no effect in v1.4.0.
+- `shlex.quote()` breaks on shell metacharacters in task instructions (0% execution). Fix: base64-encode on host, decode inside container.
+- Background daemons (tmux, jupyter, ipykernel) outlive the main process and hang Daytona poll. Fix: wrap with `pkill` cleanup.
+- Alpine images lack `apt-get` (required by OH installer). Use `bookworm` variants. Images without `python3` break MCP auth proxy silently.
+- OH MCP client has ~30s timeout that kills deepsearch. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
+
+### Pre-commit / Pytest / Ralph
+- Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
+- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.
+- Ralph sessions write learnings to `progress.txt` on feature branches, not main. Compound back after merge.
+
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
 - `docs/START_HERE_BY_TASK.md` is generated from `docs/ops/task_routes.json`.
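
The `shlex.quote()` gotcha in the hunk above suggests base64-encoding task instructions on the host and decoding them inside the container. The following is a minimal sketch of that idea, not the repository's actual code: the function names and the `/tmp/instruction.txt` destination are hypothetical, and it assumes the container provides a `base64` binary that accepts `-d`.

```python
import base64


def encode_instruction(text: str) -> str:
    """Encode on the host; the output uses only A-Z, a-z, 0-9, '+', '/', '='."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def build_decode_command(encoded: str, dest: str = "/tmp/instruction.txt") -> str:
    """Build the shell command to run inside the container (hypothetical path).

    The base64 alphabet contains no quote characters, so wrapping the payload
    in single quotes is safe without further escaping.
    """
    return f"printf %s '{encoded}' | base64 -d > {dest}"


def decode_instruction(encoded: str) -> str:
    """Pure-Python equivalent of the in-container decode step."""
    return base64.b64decode(encoded).decode("utf-8")


# An instruction full of shell metacharacters survives the round trip.
instruction = "Fix `foo`; run $(make test) && echo 'done' | tee out.log"
encoded = encode_instruction(instruction)
command = build_decode_command(encoded)
```

The point of the design is that the shell only ever sees an inert alphabet, so no quoting rule has to be correct for arbitrary task text.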

CLAUDE.md

Lines changed: 12 additions & 0 deletions
@@ -124,6 +124,18 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
 - Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
 
+### OpenHands
+- Disable Jupyter: monkey-patch `CodeActAgent.sandbox_plugins` (list, not property) to filter out `JupyterRequirement`. TOML `[sandbox] plugins` and `[core] enable_jupyter` have no effect in v1.4.0.
+- `shlex.quote()` breaks on shell metacharacters in task instructions (0% execution). Fix: base64-encode on host, decode inside container.
+- Background daemons (tmux, jupyter, ipykernel) outlive the main process and hang Daytona poll. Fix: wrap with `pkill` cleanup.
+- Alpine images lack `apt-get` (required by OH installer). Use `bookworm` variants. Images without `python3` break MCP auth proxy silently.
+- OH MCP client has ~30s timeout that kills deepsearch. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
+
+### Pre-commit / Pytest / Ralph
+- Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
+- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.
+- Ralph sessions write learnings to `progress.txt` on feature branches, not main. Compound back after merge.
+
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
 - `docs/START_HERE_BY_TASK.md` is generated from `docs/ops/task_routes.json`.
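
The pytest naming gotcha in the hunk above can be illustrated with a short sketch. The class bodies and field names here are hypothetical. The rename is the fix the commit describes; the `__test__ = False` marker is an alternative pytest supports, shown here as an option rather than something this commit uses.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationPlan:
    """Renamed from a hypothetical `TestPlan`, so pytest's default
    `Test*` class pattern no longer matches it."""

    task_ids: list[str] = field(default_factory=list)


@dataclass
class TestResult:
    """Keeps its name; `__test__ = False` asks pytest not to collect it."""

    # Un-annotated, so this is a plain class attribute, not a dataclass field.
    __test__ = False

    passed: bool = False
```

Renaming avoids the problem entirely, while the marker is useful when a public class name cannot change.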

docs/DAYTONA.md

Lines changed: 1 addition & 1 deletion
@@ -317,7 +317,7 @@ Each **sandbox** is deleted after the trial (Harbor + Daytona and the standalone
 
 - **Dashboard**: [Daytona Snapshots](https://app.daytona.io/dashboard/snapshots) — list and delete snapshots you no longer need. Snapshots auto-deactivate after 2 weeks of no use; deleting them frees storage.
 - **CLI**: `daytona snapshot list` and `daytona snapshot delete <name>` (see [Daytona CLI](https://daytona.io/docs/en/tools/cli)).
-- **Script**: `python3 scripts/daytona_snapshot_cleanup.py --list` to list snapshots; `--delete-all` or `--delete-prefix PREFIX` to remove them (requires confirmation). Use this to prune one-off or old snapshots after big runs.
+- **API**: Use the Daytona SDK (`daytona_sdk`) to list and delete snapshots programmatically if you need to prune one-off or old snapshots after big runs.
 
 ### Orphan sandboxes
 

docs/ops/ROOT_AGENT_GUIDE.md

Lines changed: 12 additions & 0 deletions
@@ -124,6 +124,18 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
 - Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
 
+### OpenHands
+- Disable Jupyter: monkey-patch `CodeActAgent.sandbox_plugins` (list, not property) to filter out `JupyterRequirement`. TOML `[sandbox] plugins` and `[core] enable_jupyter` have no effect in v1.4.0.
+- `shlex.quote()` breaks on shell metacharacters in task instructions (0% execution). Fix: base64-encode on host, decode inside container.
+- Background daemons (tmux, jupyter, ipykernel) outlive the main process and hang Daytona poll. Fix: wrap with `pkill` cleanup.
+- Alpine images lack `apt-get` (required by OH installer). Use `bookworm` variants. Images without `python3` break MCP auth proxy silently.
+- OH MCP client has ~30s timeout that kills deepsearch. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
+
+### Pre-commit / Pytest / Ralph
+- Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
+- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.
+- Ralph sessions write learnings to `progress.txt` on feature branches, not main. Compound back after merge.
+
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
 - `docs/START_HERE_BY_TASK.md` is generated from `docs/ops/task_routes.json`.
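
The context line about tool categorization order in the hunk above can be sketched as follows. The function name and the category labels (`"mcp"`, `"search"`, `"other"`) are hypothetical; only the `mcp__` prefix and the `deep_search` substring come from the guide text.

```python
def categorize_tool(name: str) -> str:
    """Classify a tool name, checking the MCP prefix before any substring."""
    # `mcp__deep_search` also contains `deep_search`, so the prefix check
    # must run first or the tool would be filed under the wrong category.
    if name.startswith("mcp__"):
        return "mcp"
    if "deep_search" in name:
        return "search"
    return "other"
```

Swapping the two checks would silently send `mcp__deep_search` to the substring branch, which is the miscategorization the guide warns about.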

docs/ops/SCRIPT_INDEX.md

Lines changed: 0 additions & 14 deletions
@@ -32,13 +32,9 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 
 ## Analysis & Comparison
 
-- `scripts/analyze_harness_design.py` - Analysis/comparison script for analyze harness design.
 - `scripts/analyze_mcp_unique_haiku.py` - Analysis/comparison script for analyze mcp unique haiku.
-- `scripts/analyze_minimum_subset.py` - Analysis/comparison script for analyze minimum subset.
 - `scripts/analyze_paired_cost_official_raw.py` - Analysis/comparison script for analyze paired cost official raw.
-- `scripts/analyze_rq_power.py` - Analysis/comparison script for analyze rq power.
 - `scripts/analyze_run_coverage.py` - Analysis/comparison script for analyze run coverage.
-- `scripts/analyze_size_effects.py` - Analysis/comparison script for analyze size effects.
 - `scripts/audit_traces.py` - Analysis/comparison script for audit traces.
 - `scripts/compare_configs.py` - Compares benchmark outcomes across configs on matched task sets.
 - `scripts/comprehensive_analysis.py` - Analysis/comparison script for comprehensive analysis.
@@ -115,7 +111,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 
 ## Infra & Mirrors
 
-- `scripts/build_conversation_db.py` - Infrastructure or mirror management script for build conversation db.
 - `scripts/build_core_manifest.py` - Infrastructure or mirror management script for build core manifest.
 - `scripts/build_daytona_registry.py` - Infrastructure or mirror management script for build daytona registry.
 - `scripts/build_linux_base_images.sh` - Infrastructure or mirror management script for build linux base images.
@@ -191,8 +186,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/check_harness_readiness.py` - Utility script for check harness readiness.
 - `scripts/collect_repo_cloc.py` - Utility script for collect repo cloc.
 - `scripts/compare_contextbench_results.py` - Utility script for compare contextbench results.
-- `scripts/compare_old_new_ground_truth.py` - Utility script for compare old new ground truth.
-- `scripts/compute_analysis_ir_metrics.py` - Utility script for compute analysis ir metrics.
 - `scripts/compute_bootstrap_cis.py` - Utility script for compute bootstrap cis.
 - `scripts/context_retrieval_agent.py` - Utility script for context retrieval agent.
 - `scripts/control_plane.py` - Utility script for control plane.
@@ -203,7 +196,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/daytona_curator_runner.py` - Utility script for daytona curator runner.
 - `scripts/daytona_poc_runner.py` - Utility script for daytona poc runner.
 - `scripts/daytona_runner.py` - Utility script for daytona runner.
-- `scripts/daytona_snapshot_cleanup.py` - Utility script for daytona snapshot cleanup.
 - `scripts/dependeval_eval_dr.py` - Utility script for dependeval eval dr.
 - `scripts/dependeval_eval_me.py` - Utility script for dependeval eval me.
 - `scripts/derive_n_repos.py` - Utility script for derive n repos.
@@ -212,8 +204,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/doe_select_tasks.py` - Utility script for doe select tasks.
 - `scripts/ds_hybrid_retrieval.py` - Utility script for ds hybrid retrieval.
 - `scripts/ds_wrapper.sh` - Utility script for ds wrapper.
-- `scripts/export_conversation_blog_assets.py` - Utility script for export conversation blog assets.
-- `scripts/export_engineering_diary_assets.py` - Utility script for export engineering diary assets.
 - `scripts/export_official_results.py` - Utility script for export official results.
 - `scripts/extract_analysis_metrics.py` - Utility script for extract analysis metrics.
 - `scripts/extract_build_diary.py` - Utility script for extract build diary.
@@ -238,8 +228,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/plot_build_diary.py` - Utility script for plot build diary.
 - `scripts/plot_build_diary_supplementary.py` - Utility script for plot build diary supplementary.
 - `scripts/plot_build_narrative.py` - Utility script for plot build narrative.
-- `scripts/plot_conversation_blog_svgs.py` - Utility script for plot conversation blog svgs.
-- `scripts/plot_csb_mcp_blog_figures.py` - Utility script for plot csb mcp blog figures.
 - `scripts/prepare_analysis_runs.py` - Utility script for prepare analysis runs.
 - `scripts/promote_agent_oracles.py` - Utility script for promote agent oracles.
 - `scripts/promote_blocked.py` - Utility script for promote blocked.
@@ -261,8 +249,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/run_judge.py` - Utility script for run judge.
 - `scripts/run_missing_oracles.sh` - Utility script for run missing oracles.
 - `scripts/run_scaling_gap_oracles.sh` - Utility script for run scaling gap oracles.
-- `scripts/run_sg_local.sh` - Utility script for run sg local.
-- `scripts/run_sg_validation.py` - Utility script for run sg validation.
 - `scripts/scaffold_contextbench_tasks.py` - Utility script for scaffold contextbench tasks.
 - `scripts/scaffold_feature_tasks.py` - Utility script for scaffold feature tasks.
 - `scripts/scaffold_refactor_tasks.py` - Utility script for scaffold refactor tasks.

scripts/registry.json

Lines changed: 3 additions & 115 deletions
@@ -50,14 +50,6 @@
 "language": "python",
 "summary": "Scans run directories, classifies task status, and supports watch mode for active runs."
 },
-{
-"name": "analyze_harness_design.py",
-"path": "scripts/analyze_harness_design.py",
-"category": "analysis_comparison",
-"status": "maintained",
-"language": "python",
-"summary": "Analysis/comparison script for analyze harness design."
-},
 {
 "name": "analyze_mcp_unique_haiku.py",
 "path": "scripts/analyze_mcp_unique_haiku.py",
@@ -66,14 +58,6 @@
 "language": "python",
 "summary": "Analysis/comparison script for analyze mcp unique haiku."
 },
-{
-"name": "analyze_minimum_subset.py",
-"path": "scripts/analyze_minimum_subset.py",
-"category": "analysis_comparison",
-"status": "maintained",
-"language": "python",
-"summary": "Analysis/comparison script for analyze minimum subset."
-},
 {
 "name": "analyze_paired_cost_official_raw.py",
 "path": "scripts/analyze_paired_cost_official_raw.py",
@@ -82,14 +66,6 @@
 "language": "python",
 "summary": "Analysis/comparison script for analyze paired cost official raw."
 },
-{
-"name": "analyze_rq_power.py",
-"path": "scripts/analyze_rq_power.py",
-"category": "analysis_comparison",
-"status": "maintained",
-"language": "python",
-"summary": "Analysis/comparison script for analyze rq power."
-},
 {
 "name": "analyze_run_coverage.py",
 "path": "scripts/analyze_run_coverage.py",
@@ -98,14 +74,6 @@
 "language": "python",
 "summary": "Analysis/comparison script for analyze run coverage."
 },
-{
-"name": "analyze_size_effects.py",
-"path": "scripts/analyze_size_effects.py",
-"category": "analysis_comparison",
-"status": "maintained",
-"language": "python",
-"summary": "Analysis/comparison script for analyze size effects."
-},
 {
 "name": "answer_json_verifier_lib.sh",
 "path": "scripts/answer_json_verifier_lib.sh",
@@ -218,14 +186,6 @@
 "language": "python",
 "summary": "Historical one-off script: backfill triage from manifest."
 },
-{
-"name": "build_conversation_db.py",
-"path": "scripts/build_conversation_db.py",
-"category": "infra_mirrors",
-"status": "maintained",
-"language": "python",
-"summary": "Infrastructure or mirror management script for build conversation db."
-},
 {
 "name": "build_core_manifest.py",
 "path": "scripts/build_core_manifest.py",
@@ -298,14 +258,6 @@
 "language": "python",
 "summary": "Utility script for compare contextbench results."
 },
-{
-"name": "compare_old_new_ground_truth.py",
-"path": "scripts/compare_old_new_ground_truth.py",
-"category": "misc",
-"status": "maintained",
-"language": "python",
-"summary": "Utility script for compare old new ground truth."
-},
 {
 "name": "comprehensive_analysis.py",
 "path": "scripts/comprehensive_analysis.py",
@@ -314,14 +266,6 @@
 "language": "python",
 "summary": "Analysis/comparison script for comprehensive analysis."
 },
-{
-"name": "compute_analysis_ir_metrics.py",
-"path": "scripts/compute_analysis_ir_metrics.py",
-"category": "misc",
-"status": "maintained",
-"language": "python",
-"summary": "Utility script for compute analysis ir metrics."
-},
 {
 "name": "compute_bootstrap_cis.py",
 "path": "scripts/compute_bootstrap_cis.py",
@@ -506,14 +450,6 @@
 "language": "python",
 "summary": "Utility script for daytona runner."
 },
-{
-"name": "daytona_snapshot_cleanup.py",
-"path": "scripts/daytona_snapshot_cleanup.py",
-"category": "misc",
-"status": "maintained",
-"language": "python",
-"summary": "Utility script for daytona snapshot cleanup."
-},
 {
 "name": "dependeval_eval_dr.py",
 "path": "scripts/dependeval_eval_dr.py",
@@ -618,22 +554,6 @@
 "language": "python",
 "summary": "Helper library/wrapper used by other scripts (eval matrix)."
 },
-{
-"name": "export_conversation_blog_assets.py",
-"path": "scripts/export_conversation_blog_assets.py",
-"category": "misc",
-"status": "maintained",
-"language": "python",
-"summary": "Utility script for export conversation blog assets."
-},
-{
-"name": "export_engineering_diary_assets.py",
-"path": "scripts/export_engineering_diary_assets.py",
-"category": "misc",
-"status": "maintained",
-"language": "python",
-"summary": "Utility script for export engineering diary assets."
-},
 {
 "name": "export_official_results.py",
 "path": "scripts/export_official_results.py",
@@ -1194,22 +1114,6 @@
 "language": "python",
 "summary": "Utility script for plot build narrative."
 },
-{
-"name": "plot_conversation_blog_svgs.py",
-"path": "scripts/plot_conversation_blog_svgs.py",
-"category": "misc",
-"status": "maintained",
-"language": "python",
-"summary": "Utility script for plot conversation blog svgs."
-},
-{
-"name": "plot_csb_mcp_blog_figures.py",
-"path": "scripts/plot_csb_mcp_blog_figures.py",
-"category": "misc",
-"status": "maintained",
-"language": "python",
-"summary": "Utility script for plot csb mcp blog figures."
-},
 {
 "name": "prebuild_images.sh",
 "path": "scripts/prebuild_images.sh",
@@ -1482,22 +1386,6 @@
 "language": "shell",
 "summary": "Utility script for run scaling gap oracles."
 },
-{
-"name": "run_sg_local.sh",
-"path": "scripts/run_sg_local.sh",
-"category": "misc",
-"status": "maintained",
-"language": "shell",
-"summary": "Utility script for run sg local."
-},
-{
-"name": "run_sg_validation.py",
-"path": "scripts/run_sg_validation.py",
-"category": "misc",
-"status": "maintained",
-"language": "python",
-"summary": "Utility script for run sg validation."
-},
 {
 "name": "scaffold_contextbench_tasks.py",
 "path": "scripts/scaffold_contextbench_tasks.py",
@@ -1820,14 +1708,14 @@
 }
 ],
 "category_counts": {
-"analysis_comparison": 28,
+"analysis_comparison": 24,
 "core_operations": 13,
 "data_management": 10,
 "generation": 9,
-"infra_mirrors": 23,
+"infra_mirrors": 22,
 "library_helpers": 7,
 "migration": 5,
-"misc": 99,
+"misc": 90,
 "qa_quality": 10,
 "submission_reporting": 7,
 "task_creation_selection": 13,
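
The `category_counts` changes in the last hunk can be cross-checked against the 14 entries removed earlier in the same file. The sketch below is an illustrative consistency check, not the repository's generator: `REMOVED`, `OLD_COUNTS`, and `counts_after_removal` are hypothetical names, while the script names, categories, and old counts are copied from the diff.

```python
from collections import Counter

# (script name, category) pairs for the 14 registry entries removed in this diff.
REMOVED = [
    ("analyze_harness_design.py", "analysis_comparison"),
    ("analyze_minimum_subset.py", "analysis_comparison"),
    ("analyze_rq_power.py", "analysis_comparison"),
    ("analyze_size_effects.py", "analysis_comparison"),
    ("build_conversation_db.py", "infra_mirrors"),
    ("compare_old_new_ground_truth.py", "misc"),
    ("compute_analysis_ir_metrics.py", "misc"),
    ("daytona_snapshot_cleanup.py", "misc"),
    ("export_conversation_blog_assets.py", "misc"),
    ("export_engineering_diary_assets.py", "misc"),
    ("plot_conversation_blog_svgs.py", "misc"),
    ("plot_csb_mcp_blog_figures.py", "misc"),
    ("run_sg_local.sh", "misc"),
    ("run_sg_validation.py", "misc"),
]

# Pre-commit values of the three counts that this diff changes.
OLD_COUNTS = {"analysis_comparison": 28, "infra_mirrors": 23, "misc": 99}


def counts_after_removal(old: dict[str, int], removed: list[tuple[str, str]]) -> dict[str, int]:
    """Subtract the number of removed entries per category from the old counts."""
    removed_per_category = Counter(category for _, category in removed)
    return {cat: n - removed_per_category[cat] for cat, n in old.items()}


new_counts = counts_after_removal(OLD_COUNTS, REMOVED)
```

Under these inputs the recomputed values match the added lines in the hunk (24, 22, and 90), and the 14 removed entries agree with the 14 deletions reported for `docs/ops/SCRIPT_INDEX.md`.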
