docs: add OpenHands, pre-commit, pytest, Ralph gotchas from 74 session reviews

sjarmak · sjarmak · commit 1e50565644e8 · 2026-03-09T22:47:37.000-04:00
Reviewed all remaining 74 single-page sessions from claude-archive
(106 total now fully processed). New gotchas added:
- OpenHands: Jupyter monkey-patch, shlex.quote, background daemons,
  Alpine compatibility, MCP timeout
- Pre-commit hook false positives on secret-detection code
- Pytest class naming conflicts (TestPlan auto-collected)
- Ralph workflow: learnings stay on feature branches
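
The base64 workaround for task instructions can be sketched as follows. This is a minimal illustration, not the harness's actual code: build_decode_command and the destination path /tmp/instruction.txt are hypothetical, and the sketch assumes the container ships a base64 binary that accepts -d.

```python
import base64


def build_decode_command(instruction: str, dest: str = "/tmp/instruction.txt") -> str:
    """Return a shell command that recreates `instruction` inside the container.

    The payload uses only the base64 alphabet (A-Z, a-z, 0-9, +, /, =), so it
    carries no quotes, spaces, or `$` for any shell layer to reinterpret.
    """
    payload = base64.b64encode(instruction.encode("utf-8")).decode("ascii")
    # Hypothetical destination path; the agent would read the file afterwards.
    return f"printf %s {payload} | base64 -d > {dest}"


# An instruction full of shell metacharacters (backticks, $(), quotes, pipes).
instruction = "Fix `foo`; run $(make test) && echo \"done\" | tee 'out.log'"
command = build_decode_command(instruction)
```

Because the host only ever passes the encoded payload, the original text never has to survive quoting on its way into the container.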

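The pytest naming conflict comes from pytest's default collection rule, which treats any class whose name starts with Test as a test class. The sketch below uses hypothetical fields to show the rename this commit recommends, plus pytest's supported opt-out (setting __test__ to False) for cases where a rename is not practical.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationPlan:
    """Renamed from TestPlan so the name no longer matches pytest's Test* pattern."""

    task_ids: list[str] = field(default_factory=list)  # hypothetical field


@dataclass
class TestResult:
    """Hypothetical case: the name is kept, so the class opts out explicitly."""

    __test__ = False  # pytest does not collect classes whose __test__ is False

    passed: bool = False  # hypothetical field
```

The unannotated __test__ line is a plain class attribute, so the dataclass decorator does not turn it into a constructor argument.
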
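Also fixes a pre-existing repo_health failure: removes the dangling
reference to scripts/daytona_snapshot_cleanup.py from docs/DAYTONA.md.
Also removes 14 entries, including that script, from
scripts/registry.json and the generated docs/ops/SCRIPT_INDEX.md.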
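
A dangling script reference like the one in docs/DAYTONA.md can be caught mechanically. The sketch below is an assumption about what such a check could look like, not the repository's actual repo_health implementation: find_dangling_script_refs is a hypothetical helper that scans Markdown files for scripts/ paths ending in .py or .sh and reports those missing from disk.

```python
import re
import tempfile
from pathlib import Path

# Matches paths such as scripts/daytona_runner.py or scripts/ds_wrapper.sh.
SCRIPT_REF = re.compile(r"scripts/[\w./-]+\.(?:py|sh)\b")


def find_dangling_script_refs(repo_root: Path) -> dict[str, list[str]]:
    """Map each Markdown file to the script paths it mentions that do not exist."""
    dangling: dict[str, list[str]] = {}
    for doc in sorted(repo_root.rglob("*.md")):
        refs = sorted(set(SCRIPT_REF.findall(doc.read_text(encoding="utf-8"))))
        missing = [ref for ref in refs if not (repo_root / ref).is_file()]
        if missing:
            dangling[doc.relative_to(repo_root).as_posix()] = missing
    return dangling


# Usage against a throwaway tree that mirrors the situation in this commit.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "docs").mkdir()
    (root / "scripts").mkdir()
    (root / "scripts" / "daytona_runner.py").write_text("", encoding="utf-8")
    (root / "docs" / "DAYTONA.md").write_text(
        "Run `python3 scripts/daytona_snapshot_cleanup.py --list` "
        "or `scripts/daytona_runner.py`.\n",
        encoding="utf-8",
    )
    result = find_dangling_script_refs(root)
```

Only the missing script is reported; the reference to the existing runner passes.
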
diff --git a/AGENTS.md b/AGENTS.md
@@ -124,6 +124,18 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
 - Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
 
+### OpenHands
+- Disable Jupyter: monkey-patch `CodeActAgent.sandbox_plugins` (list, not property) to filter out `JupyterRequirement`. TOML `[sandbox] plugins` and `[core] enable_jupyter` have no effect in v1.4.0.
+- `shlex.quote()` breaks on shell metacharacters in task instructions (0% execution). Fix: base64-encode on host, decode inside container.
+- Background daemons (tmux, jupyter, ipykernel) outlive the main process and hang Daytona poll. Fix: wrap with `pkill` cleanup.
+- Alpine images lack `apt-get` (required by OH installer). Use `bookworm` variants. Images without `python3` break MCP auth proxy silently.
+- OH MCP client has ~30s timeout that kills deepsearch. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
+
+### Pre-commit / Pytest / Ralph
+- Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
+- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.
+- Ralph sessions write learnings to `progress.txt` on feature branches, not main. Compound back after merge.
+
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
 - `docs/START_HERE_BY_TASK.md` is generated from `docs/ops/task_routes.json`.
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -124,6 +124,18 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
 - Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
 
+### OpenHands
+- Disable Jupyter: monkey-patch `CodeActAgent.sandbox_plugins` (list, not property) to filter out `JupyterRequirement`. TOML `[sandbox] plugins` and `[core] enable_jupyter` have no effect in v1.4.0.
+- `shlex.quote()` breaks on shell metacharacters in task instructions (0% execution). Fix: base64-encode on host, decode inside container.
+- Background daemons (tmux, jupyter, ipykernel) outlive the main process and hang Daytona poll. Fix: wrap with `pkill` cleanup.
+- Alpine images lack `apt-get` (required by OH installer). Use `bookworm` variants. Images without `python3` break MCP auth proxy silently.
+- OH MCP client has ~30s timeout that kills deepsearch. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
+
+### Pre-commit / Pytest / Ralph
+- Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
+- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.
+- Ralph sessions write learnings to `progress.txt` on feature branches, not main. Compound back after merge.
+
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
 - `docs/START_HERE_BY_TASK.md` is generated from `docs/ops/task_routes.json`.
diff --git a/docs/DAYTONA.md b/docs/DAYTONA.md
@@ -317,7 +317,7 @@ Each **sandbox** is deleted after the trial (Harbor + Daytona and the standalone
 
 - **Dashboard**: [Daytona Snapshots](https://app.daytona.io/dashboard/snapshots) — list and delete snapshots you no longer need. Snapshots auto-deactivate after 2 weeks of no use; deleting them frees storage.
 - **CLI**: `daytona snapshot list` and `daytona snapshot delete <name>` (see [Daytona CLI](https://daytona.io/docs/en/tools/cli)).
-- **Script**: `python3 scripts/daytona_snapshot_cleanup.py --list` to list snapshots; `--delete-all` or `--delete-prefix PREFIX` to remove them (requires confirmation). Use this to prune one-off or old snapshots after big runs.
+- **API**: Use the Daytona SDK (`daytona_sdk`) to list and delete snapshots programmatically if you need to prune one-off or old snapshots after big runs.
 
 ### Orphan sandboxes
 
diff --git a/docs/ops/ROOT_AGENT_GUIDE.md b/docs/ops/ROOT_AGENT_GUIDE.md
@@ -124,6 +124,18 @@ curl -fsSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/insta
 - Judge should use task-type-aware evaluation: different rubrics for code implementation, architectural understanding, and bug fix tasks.
 - Tool categorization order matters: check MCP prefix (`mcp__`) before substring checks (e.g., `deep_search`) to avoid miscategorization of `mcp__deep_search`.
 
+### OpenHands
+- Disable Jupyter: monkey-patch `CodeActAgent.sandbox_plugins` (list, not property) to filter out `JupyterRequirement`. TOML `[sandbox] plugins` and `[core] enable_jupyter` have no effect in v1.4.0.
+- `shlex.quote()` breaks on shell metacharacters in task instructions (0% execution). Fix: base64-encode on host, decode inside container.
+- Background daemons (tmux, jupyter, ipykernel) outlive the main process and hang Daytona poll. Fix: wrap with `pkill` cleanup.
+- Alpine images lack `apt-get` (required by OH installer). Use `bookworm` variants. Images without `python3` break MCP auth proxy silently.
+- OH MCP client has ~30s timeout that kills deepsearch. Block `deepsearch`/`deepsearch_read` in auth proxy; redirect to `keyword_search`/`nls_search`.
+
+### Pre-commit / Pytest / Ralph
+- Secret-detection hooks false-positive on code that _detects_ secrets. Use `--no-verify` when flagged code is detection logic.
+- Classes named `TestPlan`/`TestCase`/`TestResult` get auto-collected by pytest. Rename to `EvaluationPlan` etc.
+- Ralph sessions write learnings to `progress.txt` on feature branches, not main. Compound back after merge.
+
 ## Maintenance
 - Root and local `AGENTS.md` / `CLAUDE.md` files are generated from sources in `docs/ops/`.
 - `docs/START_HERE_BY_TASK.md` is generated from `docs/ops/task_routes.json`.
diff --git a/docs/ops/SCRIPT_INDEX.md b/docs/ops/SCRIPT_INDEX.md
@@ -32,13 +32,9 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 
 ## Analysis & Comparison
 
-- `scripts/analyze_harness_design.py` - Analysis/comparison script for analyze harness design.
 - `scripts/analyze_mcp_unique_haiku.py` - Analysis/comparison script for analyze mcp unique haiku.
-- `scripts/analyze_minimum_subset.py` - Analysis/comparison script for analyze minimum subset.
 - `scripts/analyze_paired_cost_official_raw.py` - Analysis/comparison script for analyze paired cost official raw.
-- `scripts/analyze_rq_power.py` - Analysis/comparison script for analyze rq power.
 - `scripts/analyze_run_coverage.py` - Analysis/comparison script for analyze run coverage.
-- `scripts/analyze_size_effects.py` - Analysis/comparison script for analyze size effects.
 - `scripts/audit_traces.py` - Analysis/comparison script for audit traces.
 - `scripts/compare_configs.py` - Compares benchmark outcomes across configs on matched task sets.
 - `scripts/comprehensive_analysis.py` - Analysis/comparison script for comprehensive analysis.
@@ -115,7 +111,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 
 ## Infra & Mirrors
 
-- `scripts/build_conversation_db.py` - Infrastructure or mirror management script for build conversation db.
 - `scripts/build_core_manifest.py` - Infrastructure or mirror management script for build core manifest.
 - `scripts/build_daytona_registry.py` - Infrastructure or mirror management script for build daytona registry.
 - `scripts/build_linux_base_images.sh` - Infrastructure or mirror management script for build linux base images.
@@ -191,8 +186,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/check_harness_readiness.py` - Utility script for check harness readiness.
 - `scripts/collect_repo_cloc.py` - Utility script for collect repo cloc.
 - `scripts/compare_contextbench_results.py` - Utility script for compare contextbench results.
-- `scripts/compare_old_new_ground_truth.py` - Utility script for compare old new ground truth.
-- `scripts/compute_analysis_ir_metrics.py` - Utility script for compute analysis ir metrics.
 - `scripts/compute_bootstrap_cis.py` - Utility script for compute bootstrap cis.
 - `scripts/context_retrieval_agent.py` - Utility script for context retrieval agent.
 - `scripts/control_plane.py` - Utility script for control plane.
@@ -203,7 +196,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/daytona_curator_runner.py` - Utility script for daytona curator runner.
 - `scripts/daytona_poc_runner.py` - Utility script for daytona poc runner.
 - `scripts/daytona_runner.py` - Utility script for daytona runner.
-- `scripts/daytona_snapshot_cleanup.py` - Utility script for daytona snapshot cleanup.
 - `scripts/dependeval_eval_dr.py` - Utility script for dependeval eval dr.
 - `scripts/dependeval_eval_me.py` - Utility script for dependeval eval me.
 - `scripts/derive_n_repos.py` - Utility script for derive n repos.
@@ -212,8 +204,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/doe_select_tasks.py` - Utility script for doe select tasks.
 - `scripts/ds_hybrid_retrieval.py` - Utility script for ds hybrid retrieval.
 - `scripts/ds_wrapper.sh` - Utility script for ds wrapper.
-- `scripts/export_conversation_blog_assets.py` - Utility script for export conversation blog assets.
-- `scripts/export_engineering_diary_assets.py` - Utility script for export engineering diary assets.
 - `scripts/export_official_results.py` - Utility script for export official results.
 - `scripts/extract_analysis_metrics.py` - Utility script for extract analysis metrics.
 - `scripts/extract_build_diary.py` - Utility script for extract build diary.
@@ -238,8 +228,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/plot_build_diary.py` - Utility script for plot build diary.
 - `scripts/plot_build_diary_supplementary.py` - Utility script for plot build diary supplementary.
 - `scripts/plot_build_narrative.py` - Utility script for plot build narrative.
-- `scripts/plot_conversation_blog_svgs.py` - Utility script for plot conversation blog svgs.
-- `scripts/plot_csb_mcp_blog_figures.py` - Utility script for plot csb mcp blog figures.
 - `scripts/prepare_analysis_runs.py` - Utility script for prepare analysis runs.
 - `scripts/promote_agent_oracles.py` - Utility script for promote agent oracles.
 - `scripts/promote_blocked.py` - Utility script for promote blocked.
@@ -261,8 +249,6 @@ Generated from `scripts/registry.json` by `scripts/generate_script_index.py`.
 - `scripts/run_judge.py` - Utility script for run judge.
 - `scripts/run_missing_oracles.sh` - Utility script for run missing oracles.
 - `scripts/run_scaling_gap_oracles.sh` - Utility script for run scaling gap oracles.
-- `scripts/run_sg_local.sh` - Utility script for run sg local.
-- `scripts/run_sg_validation.py` - Utility script for run sg validation.
 - `scripts/scaffold_contextbench_tasks.py` - Utility script for scaffold contextbench tasks.
 - `scripts/scaffold_feature_tasks.py` - Utility script for scaffold feature tasks.
 - `scripts/scaffold_refactor_tasks.py` - Utility script for scaffold refactor tasks.
diff --git a/scripts/registry.json b/scripts/registry.json
@@ -50,14 +50,6 @@
       "language": "python",
       "summary": "Scans run directories, classifies task status, and supports watch mode for active runs."
     },
-    {
-      "name": "analyze_harness_design.py",
-      "path": "scripts/analyze_harness_design.py",
-      "category": "analysis_comparison",
-      "status": "maintained",
-      "language": "python",
-      "summary": "Analysis/comparison script for analyze harness design."
-    },
     {
       "name": "analyze_mcp_unique_haiku.py",
       "path": "scripts/analyze_mcp_unique_haiku.py",
@@ -66,14 +58,6 @@
       "language": "python",
       "summary": "Analysis/comparison script for analyze mcp unique haiku."
     },
-    {
-      "name": "analyze_minimum_subset.py",
-      "path": "scripts/analyze_minimum_subset.py",
-      "category": "analysis_comparison",
-      "status": "maintained",
-      "language": "python",
-      "summary": "Analysis/comparison script for analyze minimum subset."
-    },
     {
       "name": "analyze_paired_cost_official_raw.py",
       "path": "scripts/analyze_paired_cost_official_raw.py",
@@ -82,14 +66,6 @@
       "language": "python",
       "summary": "Analysis/comparison script for analyze paired cost official raw."
     },
-    {
-      "name": "analyze_rq_power.py",
-      "path": "scripts/analyze_rq_power.py",
-      "category": "analysis_comparison",
-      "status": "maintained",
-      "language": "python",
-      "summary": "Analysis/comparison script for analyze rq power."
-    },
     {
       "name": "analyze_run_coverage.py",
       "path": "scripts/analyze_run_coverage.py",
@@ -98,14 +74,6 @@
       "language": "python",
       "summary": "Analysis/comparison script for analyze run coverage."
     },
-    {
-      "name": "analyze_size_effects.py",
-      "path": "scripts/analyze_size_effects.py",
-      "category": "analysis_comparison",
-      "status": "maintained",
-      "language": "python",
-      "summary": "Analysis/comparison script for analyze size effects."
-    },
     {
       "name": "answer_json_verifier_lib.sh",
       "path": "scripts/answer_json_verifier_lib.sh",
@@ -218,14 +186,6 @@
       "language": "python",
       "summary": "Historical one-off script: backfill triage from manifest."
     },
-    {
-      "name": "build_conversation_db.py",
-      "path": "scripts/build_conversation_db.py",
-      "category": "infra_mirrors",
-      "status": "maintained",
-      "language": "python",
-      "summary": "Infrastructure or mirror management script for build conversation db."
-    },
     {
       "name": "build_core_manifest.py",
       "path": "scripts/build_core_manifest.py",
@@ -298,14 +258,6 @@
       "language": "python",
       "summary": "Utility script for compare contextbench results."
     },
-    {
-      "name": "compare_old_new_ground_truth.py",
-      "path": "scripts/compare_old_new_ground_truth.py",
-      "category": "misc",
-      "status": "maintained",
-      "language": "python",
-      "summary": "Utility script for compare old new ground truth."
-    },
     {
       "name": "comprehensive_analysis.py",
       "path": "scripts/comprehensive_analysis.py",
@@ -314,14 +266,6 @@
       "language": "python",
       "summary": "Analysis/comparison script for comprehensive analysis."
     },
-    {
-      "name": "compute_analysis_ir_metrics.py",
-      "path": "scripts/compute_analysis_ir_metrics.py",
-      "category": "misc",
-      "status": "maintained",
-      "language": "python",
-      "summary": "Utility script for compute analysis ir metrics."
-    },
     {
       "name": "compute_bootstrap_cis.py",
       "path": "scripts/compute_bootstrap_cis.py",
@@ -506,14 +450,6 @@
       "language": "python",
       "summary": "Utility script for daytona runner."
     },
-    {
-      "name": "daytona_snapshot_cleanup.py",
-      "path": "scripts/daytona_snapshot_cleanup.py",
-      "category": "misc",
-      "status": "maintained",
-      "language": "python",
-      "summary": "Utility script for daytona snapshot cleanup."
-    },
     {
       "name": "dependeval_eval_dr.py",
       "path": "scripts/dependeval_eval_dr.py",
@@ -618,22 +554,6 @@
       "language": "python",
       "summary": "Helper library/wrapper used by other scripts (eval matrix)."
     },
-    {
-      "name": "export_conversation_blog_assets.py",
-      "path": "scripts/export_conversation_blog_assets.py",
-      "category": "misc",
-      "status": "maintained",
-      "language": "python",
-      "summary": "Utility script for export conversation blog assets."
-    },
-    {
-      "name": "export_engineering_diary_assets.py",
-      "path": "scripts/export_engineering_diary_assets.py",
-      "category": "misc",
-      "status": "maintained",
-      "language": "python",
-      "summary": "Utility script for export engineering diary assets."
-    },
     {
       "name": "export_official_results.py",
       "path": "scripts/export_official_results.py",
@@ -1194,22 +1114,6 @@
       "language": "python",
       "summary": "Utility script for plot build narrative."
     },
-    {
-      "name": "plot_conversation_blog_svgs.py",
-      "path": "scripts/plot_conversation_blog_svgs.py",
-      "category": "misc",
-      "status": "maintained",
-      "language": "python",
-      "summary": "Utility script for plot conversation blog svgs."
-    },
-    {
-      "name": "plot_csb_mcp_blog_figures.py",
-      "path": "scripts/plot_csb_mcp_blog_figures.py",
-      "category": "misc",
-      "status": "maintained",
-      "language": "python",
-      "summary": "Utility script for plot csb mcp blog figures."
-    },
     {
       "name": "prebuild_images.sh",
       "path": "scripts/prebuild_images.sh",
@@ -1482,22 +1386,6 @@
       "language": "shell",
       "summary": "Utility script for run scaling gap oracles."
     },
-    {
-      "name": "run_sg_local.sh",
-      "path": "scripts/run_sg_local.sh",
-      "category": "misc",
-      "status": "maintained",
-      "language": "shell",
-      "summary": "Utility script for run sg local."
-    },
-    {
-      "name": "run_sg_validation.py",
-      "path": "scripts/run_sg_validation.py",
-      "category": "misc",
-      "status": "maintained",
-      "language": "python",
-      "summary": "Utility script for run sg validation."
-    },
     {
       "name": "scaffold_contextbench_tasks.py",
       "path": "scripts/scaffold_contextbench_tasks.py",
@@ -1820,14 +1708,14 @@
     }
   ],
   "category_counts": {
-    "analysis_comparison": 28,
+    "analysis_comparison": 24,
     "core_operations": 13,
     "data_management": 10,
     "generation": 9,
-    "infra_mirrors": 23,
+    "infra_mirrors": 22,
     "library_helpers": 7,
     "migration": 5,
-    "misc": 99,
+    "misc": 90,
     "qa_quality": 10,
     "submission_reporting": 7,
     "task_creation_selection": 13,