# Nightly Research Report - 2026-03-07
Second nightly review. Covers security, evaluation pipeline integrity, operational
tooling gaps, and statistical methodology. All findings are new -- nothing from the
2026-03-06 report is repeated.

---

## 1. Code & Architecture Review

### 1.1 CRITICAL: `rerun_failed.py` Is Completely Inoperative

`scripts/rerun_failed.py` lines 36-47 define `SUITE_TO_BENCHMARK_DIR` using the old
`ccb_*` naming convention. All current benchmark directories use `csb_sdlc_*` and
`csb_org_*`. When `aggregate_status.py` returns tasks with `suite="csb_sdlc_fix"`,
`SUITE_TO_BENCHMARK_DIR.get("csb_sdlc_fix")` returns `None`, the task prints
`"SKIP (no benchmark path)"`, and zero rerun commands are generated. **This script
silently produces no output for any current benchmark task.**
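
The shape of the fix is a renamed mapping plus a coverage self-check so the next rename cannot fail silently again. A minimal sketch -- the suite names beyond `csb_sdlc_fix` and the `benchmarks/` layout are illustrative assumptions, not the real selection:

```python
# Sketch of the corrected mapping; directory layout is an assumption.
SUITE_TO_BENCHMARK_DIR = {
    "csb_sdlc_fix": "benchmarks/csb_sdlc_fix",
    "csb_org_incident": "benchmarks/csb_org_incident",
    # ... one entry per current csb_sdlc_* / csb_org_* suite
}

def missing_suites(selected_suites):
    """Return suites from the task selection that the mapping does not cover."""
    return sorted(s for s in selected_suites if s not in SUITE_TO_BENCHMARK_DIR)
```

An integration test asserting `missing_suites(...) == []` against `selected_benchmark_tasks.json` would have caught the `ccb_*` staleness immediately.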
### 1.2 CRITICAL: `daytona_cost_guard.py` Does Not Exist

`scripts/daytona_cost_guard.py` is referenced in 12+ files:

- `configs/run_selected_tasks.sh` lines 422, 436
- `configs/openhands_2config.sh`, `codex_2config.sh`, `cursor_2config.sh`,
  `copilot_2config.sh`, `gemini_2config.sh`, `sdlc_suite_2config.sh`,
  `validate_one_per_benchmark.sh`, `multi_harness_compare.sh`
- `docs/DAYTONA.md` lines 67, 164, 169, 181, 203
- `docs/ops/WORKFLOWS.md` lines 13, 26
- `docs/ops/SCRIPT_INDEX.md` line 199

The script does not exist. Any Daytona launch via these configs will fail at the
cost-guard preflight check.

### 1.3 HIGH: Sourcegraph Token Written to Disk Without Permission Restrictions

`agents/claude_baseline_agent.py` embeds the `SOURCEGRAPH_ACCESS_TOKEN` in an
MCP config dict and writes it to `self.logs_dir / ".mcp.json"` at four locations
(lines ~1532, ~1682, ~1770, ~1912). The file is written with default permissions
(world-readable). The Harbor logs directory is typically archived in run output
directories, meaning tokens persist in archived run artifacts.

**Fix**: Add `os.chmod(mcp_config_path, 0o600)` after each write, and add
`.mcp.json` to the archive exclusion list.
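
A helper applied at all four write sites keeps the fix in one place. A minimal sketch, assuming only that the config is a JSON-serializable dict:

```python
import json
import os
from pathlib import Path

def write_mcp_config(logs_dir: Path, config: dict) -> Path:
    """Write the MCP config, then restrict it to owner read/write only."""
    mcp_config_path = logs_dir / ".mcp.json"
    mcp_config_path.write_text(json.dumps(config, indent=2))
    # 0o600: readable/writable by the owner, nothing for group/other.
    os.chmod(mcp_config_path, 0o600)
    return mcp_config_path
```

Chmod-after-write still leaves a brief world-readable window; passing `mode=0o600` to `os.open` before the first byte lands would close it, at the cost of a slightly longer helper.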
### 1.4 HIGH: `subprocess.run(shell=True)` With Task-Derived Commands

`scripts/csb_metrics/oracle_checks.py` line ~498 passes `test_command` (sourced
from task metadata files) to `subprocess.run(shell=True)`. If task metadata is
malformed or tampered with, this is a shell injection vector.

**Fix**: Use `shlex.split()` and `shell=False`.
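
The fix in sketch form -- with `shell=False`, shell metacharacters in the metadata (`;`, `&&`, backticks) become literal argument text rather than executable syntax:

```python
import shlex
import subprocess

def run_test_command(test_command: str) -> subprocess.CompletedProcess:
    """Run a metadata-supplied command without invoking a shell.

    shlex.split() tokenizes the string using POSIX shell rules; the
    resulting argv is exec'd directly, so injection payloads embedded
    in task metadata are passed as inert arguments.
    """
    argv = shlex.split(test_command)
    return subprocess.run(argv, shell=False, capture_output=True, text=True)
```

The trade-off: commands that legitimately rely on shell features (pipes, redirects) would need to be rewritten or explicitly whitelisted.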
### 1.5 HIGH: No Python Dependency Manifest

The repository has no `requirements.txt`, `pyproject.toml`, or `setup.py`. The
`scripts/csb_metrics/` package and agent code import third-party packages
(`anthropic`, `openai`) but there is no pinned dependency specification.
Different machines get different package versions.

### 1.6 HIGH: Three Conflicting Pricing Constants

Cache-read pricing is defined inconsistently across three scripts:

| Script | Cache-read rate | Source |
|--------|----------------:|--------|
| `scripts/cost_report.py:33` | $1.875/MTok | Incorrect |
| `scripts/csb_metrics/ir_metrics.py:25` | $1.50/MTok | Correct for Sonnet |
| `scripts/cost_breakdown_analysis.py:40` | $3.75/MTok (cache_create) | Different metric entirely |

Cost reports are producing incorrect totals. A single `scripts/pricing.py`
constants file with versioned pricing tables would eliminate this.

### 1.7 MEDIUM: Hardcoded Developer Path

`agents/claude_baseline_agent.py` line 31 and `agents/harnesses/base.py` line 17:

```python
LOCOBENCH_CLAUDE_MD_TEMPLATE = Path("/home/stephanie_jarmak/CodeScaleBench/...")
```

On any other machine the path does not resolve and the code silently falls back
with only a warning. No environment variable override exists.

### 1.8 MEDIUM: `_common.sh` Disk Space Check Silently Fails on macOS

`configs/_common.sh` line 272 uses `df -BG --output=avail`, which is GNU-specific.
On macOS (the development platform), `2>/dev/null` swallows the error and
`_disk_free` becomes empty, so the disk space gate **does nothing**. The check
reports OK regardless of available space.
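
One portable alternative is to do the check in Python via `shutil.disk_usage`, which behaves identically on GNU/Linux and macOS. A minimal sketch -- the 50 GB threshold is an illustrative assumption, not the repo's actual gate:

```python
import shutil

def disk_free_gb(path: str = ".") -> int:
    """Portable free-space check; avoids the GNU-only `df -BG --output=avail`."""
    return shutil.disk_usage(path).free // (1024 ** 3)

def check_disk_space(path: str = ".", min_free_gb: int = 50) -> bool:
    """Return True when at least min_free_gb is available at path."""
    return disk_free_gb(path) >= min_free_gb
```

`_common.sh` could call this via `python3 -c`, or the bash check could branch on `uname` and use BSD `df -g` on Darwin.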
### 1.9 MEDIUM: `--skip-completed` Resume Logic Has a Dependency Bug

`configs/run_selected_tasks.sh` line 551 checks for both `result.json` AND
`task_metrics.json` to consider a task completed. `task_metrics.json` is generated
by post-processing (`extract_all_metrics`), which runs after all tasks complete.
If a run crashes mid-batch, successfully completed tasks still lack
`task_metrics.json`, so `--skip-completed` re-runs them unnecessarily.

**Fix**: Check for `result.json` only, or generate `task_metrics.json` inline
after each task completes.
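
The corrected completion predicate, sketched in Python (the per-task directory layout is assumed from the file names above):

```python
from pathlib import Path

def task_is_completed(task_dir: Path) -> bool:
    """A task counts as completed once its result.json exists.

    task_metrics.json is deliberately NOT required: it is produced by
    post-run processing, so requiring it makes every crash-interrupted
    run look incomplete and defeats --skip-completed.
    """
    return (task_dir / "result.json").is_file()
```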
### 1.10 MEDIUM: SDK Client Instantiated on Every Judge API Call

`scripts/csb_metrics/judge/backends.py` lines ~110 and ~234 create a new
`anthropic.Anthropic()` or `openai.OpenAI()` client on every call. With 3-round
voting, this creates 3 separate HTTP client pools per task. The client should
be created once in `__init__` and reused.
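
The reuse pattern in sketch form. `client_factory` stands in for `anthropic.Anthropic` or `openai.OpenAI`; the backend class name and structure here are assumptions, not the file's real layout:

```python
class JudgeBackend:
    """Build the SDK client once, lazily, and reuse it for every call."""

    def __init__(self, client_factory):
        self._client_factory = client_factory
        self._client = None

    @property
    def client(self):
        if self._client is None:          # first access constructs the client
            self._client = self._client_factory()
        return self._client               # later calls reuse the same HTTP pool
```

With this shape, three voting rounds share one connection pool instead of opening three.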
### 1.11 MEDIUM: Duplicate Function Definitions in `abc_audit.py`

Four check functions are defined twice in `scripts/abc_audit.py`:

- `check_t10_shared_state`: lines 424-480 AND 1564-1619 (with different semantics)
- `check_oa_equivalent_solutions`: lines 1118-1164 AND 1622-1668
- `check_ob_negated_solutions`: lines 1167-1224 AND 1671-1728
- `check_og_determinism`: lines 1226-1308 AND 1731-1789

Python uses the last definition, so the first copies are dead code. The two
versions of `check_t10_shared_state` behave differently: the first flags all
`/tmp` paths; the second flags them only when port binds are also present.

### 1.12 MEDIUM: `NODE_TLS_REJECT_UNAUTHORIZED=0` Set Globally in Container

`agents/claude_baseline_agent.py` lines ~1484 and ~1638 use
`environment.exec('export NODE_TLS_REJECT_UNAUTHORIZED=0')`, which disables TLS
certificate validation for ALL Node.js processes in the container, not just the
MCP subprocess. This is inconsistent with `create_run_agent_commands`
(line ~1142), which scopes it to a single process via `env_with_autonomous`.

### 1.13 LOW: 150 Lines of Dead Code (`V4_PREAMBLE_TEMPLATE`)

`agents/claude_baseline_agent.py` lines 175-326 define `V4_PREAMBLE_TEMPLATE`,
marked `DEPRECATED -- kept for reference`. It is never referenced anywhere
in the codebase. Remove it.

### 1.14 LOW: Private API Imports Across Package Boundaries

External scripts import `_`-prefixed functions from `csb_metrics` internals:

- `scripts/verify_oracle_fail2pass.py:27` imports `_normalize`
- `scripts/cross_validate_gt.py:25` imports `_normalize`
- `scripts/judge_demo.py:39` imports `_select_prompt`, `_render_prompt`
- `scripts/csb_metrics/judge/oracle.py:28` imports `_resolve_task_dir`

These imports can break without warning whenever the internal implementation
changes. Promote the functions to the public API or refactor the callers.

---
## 2. Feature & UX Improvements

### 2.1 Hybrid Scoring Is Fully Implemented but 100% Inert

`docs/EVALUATION_PIPELINE.md` lines 158-160 document hybrid scoring:
`composite = 0.6 * verifier_reward + 0.4 * rubric_score`, enabled via the
`--hybrid` flag on `run_judge.py`. The flag exists and the scoring logic works.
However, `criteria.json` (the file it depends on) does not exist in any task
directory across all 275 tasks. The feature is dead code.

**Improvement**: Either author `criteria.json` for Org tasks (where it makes
the most sense) or remove the feature and its documentation.

### 2.2 Nine `direct_verifier.sh` Files Are Unimplemented Placeholders

All nine are identical:

```bash
echo 'ERROR: direct_verifier.sh is a placeholder -- needs manual curation'
echo '0.0' > /logs/verifier/reward.txt
exit 1
```

Affected tasks span `csb_org_crossrepo_tracing`, `csb_org_incident`,
`csb_org_migration`, and `csb_org_org`. Any run in direct mode on these tasks
produces `reward=0.0` with no useful diagnostics.

### 2.3 `export_official_results.py` Suite Thresholds Are Stale

`SDLC_MIN_VALID_TASKS` at lines 76-85 requires 20 tasks per SDLC suite for
official qualification. Current `selected_benchmark_tasks.json` has:

- `csb_sdlc_understand`: 12 selected (threshold: 20)
- `csb_sdlc_design`: 15 selected (threshold: 20)
- `csb_sdlc_document`: 15 selected (threshold: 20)
- `csb_sdlc_secure`: 15 selected (threshold: 20)
- `csb_sdlc_debug`: 19 selected (threshold: 20)

**Six of nine SDLC suites would be flagged as "below minimum valid tasks"** in
any official export against the current task selection. The thresholds were never
updated when the task selection was rebalanced.
### 2.4 No Multi-Run Cross-Directory Comparison

`scripts/compare_configs.py` deduplicates by "latest `started_at` wins" per
`(suite, task, config)`. There is no way to lock a comparison to a specific pair
of run directories. If baseline and MCP runs happened weeks apart in different
directories, only the most recent result per task is used.

**Improvement**: Add `--baseline-run-dir` and `--mcp-run-dir` flags to allow
explicit run pairing.

### 2.5 Submission System Has No Destination

`docs/SUBMISSION.md` and `docs/LEADERBOARD.md` describe packaging, validation,
and scoring rubrics in detail. Neither document mentions where to actually
submit the `.tar.gz` archive. There is no submission URL, email, GitHub issue
template, or API endpoint. The leaderboard does not exist as an accessible
service.

### 2.6 `LEADERBOARD.md` Per-Suite Task Counts Are Stale

The per-suite completeness table (set 2026-03-02) has task counts that don't
match `selected_benchmark_tasks.json`, `SDLC_MIN_VALID_TASKS` in
`export_official_results.py`, or the actual benchmark directories. A submitter
gets three different answers from three supposedly authoritative sources.

### 2.7 `continue` in `_launch_task_pair` Silently Skips MCP Runs

`configs/run_selected_tasks.sh` lines 653-656: when `Dockerfile.artifact_baseline`
is missing, `continue` inside the function body propagates to the calling
`while` loop, skipping both the baseline AND MCP configs. The MCP run might be
entirely valid but is silently dropped. This should be `return 1` with separate
handling for each config.
### 2.8 Org Task `csb_org_onboarding` Structural Inconsistencies

8 of 11 tasks in `csb_org_onboarding` are missing `use_case_id` in `task.toml`;
all other org suites have 100% coverage. Additionally, 67 `.org_backup` files
and 362 `.bak` files in the benchmark tree are development artifacts that were
committed to the repository.

---
## 3. Research Recommendations

### 3.1 Centralize Pricing Constants (Immediate)

Create `scripts/pricing.py`:

```python
PRICING = {
    "claude-opus-4-6": {
        "input_per_mtok": 15.00,
        "output_per_mtok": 75.00,
        "cache_write_per_mtok": 18.75,
        "cache_read_per_mtok": 1.50,
    },
    # ... other models
}
```

Import it from all cost-computing scripts. This eliminates the three conflicting
constants and makes model pricing updates a single-file change.
### 3.2 Fix the Statistical Methodology

Three specific issues in the statistics code under `scripts/csb_metrics/`:

1. **Tied-rank Spearman**: `ir_analysis.py` line 476 uses the simplified
   formula `r = 1 - 6 * Σd_i² / (n(n² - 1))`, which is incorrect when ties
   exist (common with many `reward=0.0` tasks). The tie-aware version already
   exists in `statistics.py` line 366 -- just import it.

2. **Multiple comparisons**: Per-suite p-values in
   `retrieval_outcome_correlation` (lines 494-512) have no Bonferroni or FDR
   correction. With 19+ suites, false discoveries are likely. Add
   Benjamini-Hochberg FDR correction.

3. **Small-sample t-test**: `welchs_t_test` (line 91) uses a normal
   approximation regardless of degrees of freedom. For suites with 11-15 tasks
   per config, df may be 10-20, where the approximation overestimates
   statistical power. Implement a proper t-distribution CDF or emit a warning
   when `df < 30`.
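
Item 2's Benjamini-Hochberg correction is small enough to implement without SciPy. A minimal sketch of the step-up procedure over raw per-suite p-values:

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a reject/keep flag per p-value under BH FDR control.

    Step-up rule: sort p-values ascending, find the largest rank k with
    p_(k) <= (k/n) * alpha, and reject every hypothesis ranked <= k.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / n * alpha:
            max_k = rank
    rejected = [False] * n
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            rejected[i] = True
    return rejected
```

Applying this to the 19+ per-suite p-values before reporting "significant" suites would control the expected false-discovery rate at `alpha`.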
### 3.3 Improve LLM Judge Robustness

Four specific improvements to `scripts/csb_metrics/judge/`:

1. **Propagate oracle confidence**: `engine.py` lines 244/313 use a binary
   presence check (`"high" if oracle_ground_truth else "low"`). The
   `OracleBundle.confidence` field exists but is ignored. Pass it through.

2. **Fix voting at temperature=0**: `evaluate_with_voting` (lines 271-275)
   sends identical prompts at `temperature=0.0`, producing identical outputs
   across all rounds, so multi-round voting is a no-op. Either default to
   `temperature > 0` for voting mode, or vary the prompt structure per round.

3. **Align prompt scale with parser**: Prompts instruct a 3-point scale
   (0.0, 0.5, 1.0) but `_parse_dimension_scores` accepts continuous floats.
   Decide: continuous scoring (more information) or constrained scoring
   (more reliable). Document the decision.

4. **Add task-type-aware system prompts**: The system prompt is
   `"You are a precise code evaluator."` for all task types. A documentation
   task, a bug fix, and a security audit should have different evaluation
   priorities in the system prompt.

### 3.4 Add `ruff` Linting to CI

The duplicate function definitions in `abc_audit.py` (section 1.11) would
be caught automatically by `ruff check --select F811` (redefined-while-unused).
Add a `pyproject.toml` with:

```toml
[tool.ruff.lint]
select = ["E", "F", "W"]
ignore = ["E501"]  # line length handled separately
```

And a `.github/workflows/lint.yml` that runs `ruff check scripts/`.

### 3.5 Replace `parse_task_toml_simple` With `tomllib`

Three scripts (`abc_audit.py`, `abc_score_task.py`, `validate_tasks_preflight.py`)
contain a hand-rolled TOML parser with a known bug: `'"""' in line: break`
silently truncates multi-line strings. Python 3.11+ includes `tomllib` in the
standard library. This is a zero-dependency upgrade that also eliminates the
truncation bug.
### 3.6 Add Tool-Sequence Pattern Analysis

Current trace analysis counts tool call totals but cannot answer:

- Does the agent recover from failed MCP calls with local fallback searches?
- What is the reading:searching ratio across configs?
- Do successful agents start broad and narrow down, or vice versa?

A `scripts/analyze_tool_sequences.py` that extracts ordered tool sequences
from trajectories and computes transition matrices would enable behavioral
comparison between baseline and MCP configs.
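
The core of such a script is counting adjacent tool-call pairs. A minimal sketch (the tool names are illustrative):

```python
from collections import Counter

def transition_counts(tool_sequence):
    """Count adjacent tool-call pairs -- the raw transition matrix."""
    return Counter(zip(tool_sequence, tool_sequence[1:]))

def transition_probabilities(tool_sequence):
    """Normalize counts into per-source-tool transition probabilities."""
    counts = transition_counts(tool_sequence)
    totals = Counter()
    for (src, _dst), n in counts.items():
        totals[src] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}
```

Comparing these matrices between baseline and MCP runs (e.g. the probability of a search being followed by a read) would make behavioral differences quantifiable.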
### 3.7 Add Cost-Efficiency Metric

`scripts/cost_report.py` computes `avg_cost_per_task` but not
`cost_per_passing_task` (total cost / passed count). For baseline vs MCP
comparison, cost-efficiency on solved tasks is the decision-relevant metric.
Add it to the report output, along with the `cache_write_tokens` and
`cache_read_tokens` columns that are currently computed but not displayed.
---

## 4. Recommended Next Feature

### Operational Recovery and Run Integrity System

**The single most impactful feature** is fixing the broken operational pipeline --
the tools operators use to launch, resume, and rerun benchmark tasks. Three
independently broken components make it impossible to reliably manage benchmark
runs:

**Part A -- Fix `rerun_failed.py`** (30 minutes)
1. Update the `SUITE_TO_BENCHMARK_DIR` mapping from `ccb_*` to `csb_sdlc_*` / `csb_org_*`
2. Add an integration test that verifies the mapping covers all suites in
   `selected_benchmark_tasks.json`
3. Verify generated commands work against a sample failed run directory

**Part B -- Create or stub `daytona_cost_guard.py`** (1 hour)
1. Either implement the cost guard (check the Daytona credit balance, abort if
   below threshold) or create a pass-through stub that logs a warning
2. This unblocks all 12+ config scripts that gate on the missing file
3. Add a `repo_health.py` check that verifies all scripts referenced by configs exist

**Part C -- Fix `--skip-completed` resume logic** (30 minutes)
1. Change `run_selected_tasks.sh` line 551 to check only for `result.json`
   (not `task_metrics.json`) when determining task completion
2. Fix the `continue` vs `return 1` bug in `_launch_task_pair` (line 653) so
   missing baseline Dockerfiles don't silently skip MCP runs
3. Add a `--resume-from <run-dir>` flag that automatically finds completed tasks

**Part D -- Align export thresholds with task selection** (15 minutes)
1. Update `SDLC_MIN_VALID_TASKS` in `export_official_results.py` to match
   actual per-suite counts in `selected_benchmark_tasks.json`
2. Or better: read the thresholds from `selected_benchmark_tasks.json` dynamically
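
The dynamic option from Part D could look like the sketch below. It assumes `selected_benchmark_tasks.json` contains a list of task records each carrying a `"suite"` key; the real schema may differ:

```python
import json
from collections import Counter
from pathlib import Path

def load_suite_thresholds(selection_path: Path) -> dict[str, int]:
    """Derive per-suite minimum-valid-task counts from the selection file.

    Because the thresholds come from the same file that defines the
    selection, they can never drift out of sync after a rebalance.
    """
    tasks = json.loads(selection_path.read_text())
    return dict(Counter(task["suite"] for task in tasks))
```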
**Why this is highest impact**: The previous report's recommendation (task
registry reconciliation and a CI test gate) addresses data integrity -- important
but not blocking. This recommendation addresses the fact that **operators
currently cannot rerun failed tasks, resume crashed runs, or launch Daytona
runs at all**. Every benchmark execution session requires manual workarounds
for these three broken tools. Fixing them directly unblocks the next round of
benchmark runs.

**PRD-ready description**: "Implement an operational recovery system for
benchmark runs consisting of: (1) an updated `rerun_failed.py` with current
`csb_*` suite mappings and integration tests, (2) a `daytona_cost_guard.py`
script (or stub) that unblocks all Daytona-mode config scripts, (3) fixed
`--skip-completed` resume logic in `run_selected_tasks.sh` that checks only
`result.json` for completion and correctly handles missing baseline Dockerfiles
without skipping MCP runs, and (4) `SDLC_MIN_VALID_TASKS` thresholds in
`export_official_results.py` aligned with the current task selection. Success
criteria: `rerun_failed.py` generates valid commands for all 20 suites,
`configs/run_selected_tasks.sh` launches without cost-guard errors in Daytona
mode, `--skip-completed` correctly identifies and skips already-completed tasks
after a mid-run crash, and `export_official_results.py` does not flag any
suite as below minimum when all selected tasks have been run."
