# Nightly Research Report - 2026-03-07

Second nightly review. Covers security, evaluation pipeline integrity, operational
tooling gaps, and statistical methodology. All findings are new -- nothing from the
2026-03-06 report is repeated.

---

## 1. Code & Architecture Review

### 1.1 CRITICAL: `rerun_failed.py` Is Completely Inoperative

`scripts/rerun_failed.py` lines 36-47 define `SUITE_TO_BENCHMARK_DIR` using the old
`ccb_*` naming convention. All current benchmark directories use `csb_sdlc_*` and
`csb_org_*`. When `aggregate_status.py` returns tasks with `suite="csb_sdlc_fix"`,
`SUITE_TO_BENCHMARK_DIR.get("csb_sdlc_fix")` returns `None`, the task prints
`"SKIP (no benchmark path)"`, and zero rerun commands are generated. **This script
silently produces no output for any current benchmark task.**

### 1.2 CRITICAL: `daytona_cost_guard.py` Does Not Exist

`scripts/daytona_cost_guard.py` is referenced in 12+ files:
- `configs/run_selected_tasks.sh` lines 422, 436
- `configs/openhands_2config.sh`, `codex_2config.sh`, `cursor_2config.sh`,
  `copilot_2config.sh`, `gemini_2config.sh`, `sdlc_suite_2config.sh`,
  `validate_one_per_benchmark.sh`, `multi_harness_compare.sh`
- `docs/DAYTONA.md` lines 67, 164, 169, 181, 203
- `docs/ops/WORKFLOWS.md` lines 13, 26
- `docs/ops/SCRIPT_INDEX.md` line 199

The script does not exist. Any Daytona launch via these configs will fail at the
cost-guard preflight check.

### 1.3 HIGH: Sourcegraph Token Written to Disk Without Permission Restrictions

`agents/claude_baseline_agent.py` embeds the `SOURCEGRAPH_ACCESS_TOKEN` in an
MCP config dict and writes it to `self.logs_dir / ".mcp.json"` at four locations
(lines ~1532, ~1682, ~1770, ~1912). The file is written with default permissions
(world-readable). The Harbor logs directory is typically archived in run output
directories, meaning tokens persist in archived run artifacts.

**Fix**: Add `os.chmod(mcp_config_path, 0o600)` after each write, and add
`.mcp.json` to the archive exclusion list.
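
The fix can be sketched as follows; this is a minimal illustration, assuming the config is serialized with `json.dumps` (the helper name `write_mcp_config` is hypothetical, not the agent's actual API):

```python
import json
import os
from pathlib import Path

def write_mcp_config(logs_dir: Path, config: dict) -> Path:
    """Write the MCP config, then restrict it to owner read/write before archiving."""
    path = logs_dir / ".mcp.json"
    path.write_text(json.dumps(config, indent=2))
    os.chmod(path, 0o600)  # drop group/world read so the token never lands world-readable
    return path
```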

### 1.4 HIGH: `subprocess.run(shell=True)` With Task-Derived Commands

`scripts/csb_metrics/oracle_checks.py` line ~498 passes `test_command` (sourced
from task metadata files) to `subprocess.run(shell=True)`. If task metadata is
malformed or tampered with, this is a shell injection vector.

**Fix**: Use `shlex.split()` and `shell=False`.
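
A sketch of the fix (the wrapper name is illustrative); note that if any task's `test_command` legitimately relies on shell features such as pipes or `&&`, those commands would need explicit handling rather than `shlex.split`:

```python
import shlex
import subprocess

def run_test_command(test_command: str, cwd: str = ".", timeout: int = 300):
    """Run a task-supplied command without a shell, so metacharacters are inert."""
    # shlex.split tokenizes quoted arguments; `;`, `|`, `$()` become literal argv entries
    argv = shlex.split(test_command)
    return subprocess.run(argv, cwd=cwd, capture_output=True, text=True, timeout=timeout)
```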

### 1.5 HIGH: No Python Dependency Manifest

The repository has no `requirements.txt`, `pyproject.toml`, or `setup.py`. The
`scripts/csb_metrics/` package and agent code import third-party packages
(`anthropic`, `openai`), but there is no pinned dependency specification.
Different machines get different package versions.

### 1.6 HIGH: Three Conflicting Pricing Constants

Cache-read pricing is defined inconsistently across three scripts:

| Script | Cache-read rate | Assessment |
|--------|---------------:|------------|
| `scripts/cost_report.py:33` | $1.875/MTok | Incorrect |
| `scripts/csb_metrics/ir_metrics.py:25` | $1.50/MTok | Correct for Sonnet |
| `scripts/cost_breakdown_analysis.py:40` | $3.75/MTok (cache_create) | Different metric entirely |

Cost reports are producing incorrect totals. A single `scripts/pricing.py`
constants file with versioned pricing tables would eliminate this.

### 1.7 MEDIUM: Hardcoded Developer Path

`agents/claude_baseline_agent.py` line 31 and `agents/harnesses/base.py` line 17:
```python
LOCOBENCH_CLAUDE_MD_TEMPLATE = Path("/home/stephanie_jarmak/CodeScaleBench/...")
```
On any other machine this silently falls back to a warning. No environment variable
override exists.

### 1.8 MEDIUM: `_common.sh` Disk Space Check Silently Fails on macOS

`configs/_common.sh` line 272 uses `df -BG --output=avail`, which is GNU-specific.
On macOS (the development platform), `2>/dev/null` swallows the error and
`_disk_free` becomes empty, so the disk space gate **does nothing**. The check
reports OK regardless of available space.
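
A portable sketch of the check (the function name `disk_free_gb` and the fail-loud behavior are suggestions, not the current `_common.sh` implementation); `df -Pk` is POSIX and prints 1024-byte blocks on both GNU and BSD `df`:

```shell
disk_free_gb() {
  # POSIX-portable free-space check; works with both GNU and BSD df.
  local kb
  kb=$(df -Pk "${1:-.}" | awk 'NR==2 {print $4}')
  if [ -z "$kb" ]; then
    echo "WARN: could not determine free disk space for ${1:-.}" >&2
    return 1                     # fail loudly instead of silently passing the gate
  fi
  echo $(( kb / 1024 / 1024 ))   # available space in whole GB
}
```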

### 1.9 MEDIUM: `--skip-completed` Resume Logic Has a Dependency Bug

`configs/run_selected_tasks.sh` line 551 checks for both `result.json` AND
`task_metrics.json` to consider a task completed. `task_metrics.json` is generated
by post-processing (`extract_all_metrics`) which runs after all tasks complete.
If a run crashes mid-batch, successfully completed tasks still lack
`task_metrics.json`, so `--skip-completed` re-runs them unnecessarily.

**Fix**: Check for `result.json` only, or generate `task_metrics.json` inline
after each task completes.

### 1.10 MEDIUM: SDK Client Instantiated on Every Judge API Call

`scripts/csb_metrics/judge/backends.py` lines ~110 and ~234 create a new
`anthropic.Anthropic()` or `openai.OpenAI()` client on every call. With 3-round
voting, this creates 3 separate HTTP client pools per task. The client should
be created once in `__init__` and reused.
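
The reuse pattern, sketched with an injected factory so it stays SDK-agnostic (the class and method names here are illustrative, not the actual `backends.py` API):

```python
class JudgeBackend:
    """Create the SDK client once and reuse it for every judge call."""

    def __init__(self, client_factory):
        # In the real backend this would be e.g. anthropic.Anthropic or openai.OpenAI;
        # constructing it once means one HTTP connection pool for all voting rounds.
        self._client = client_factory()

    def call(self, prompt):
        # Every voting round reuses self._client instead of building a fresh client.
        return self._client.send(prompt)
```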

### 1.11 MEDIUM: Duplicate Function Definitions in `abc_audit.py`

Four check functions are defined twice in `scripts/abc_audit.py`:
- `check_t10_shared_state`: lines 424-480 AND 1564-1619 (with different semantics)
- `check_oa_equivalent_solutions`: lines 1118-1164 AND 1622-1668
- `check_ob_negated_solutions`: lines 1167-1224 AND 1671-1728
- `check_og_determinism`: lines 1226-1308 AND 1731-1789

Python uses the last definition. The first copies are dead code. The two versions
of `check_t10_shared_state` have different behavior (the first flags all `/tmp`
paths; the second only flags when port binds are also present).

### 1.12 MEDIUM: `NODE_TLS_REJECT_UNAUTHORIZED=0` Set Globally in Container

`agents/claude_baseline_agent.py` lines ~1484 and ~1638 use
`environment.exec('export NODE_TLS_REJECT_UNAUTHORIZED=0')`, which disables TLS
certificate verification for ALL Node.js processes in the container, not just the
MCP subprocess. This is inconsistent with `create_run_agent_commands` (line ~1142),
which scopes it to a single process via `env_with_autonomous`.

### 1.13 LOW: 150 Lines of Dead Code (`V4_PREAMBLE_TEMPLATE`)

`agents/claude_baseline_agent.py` lines 175-326 define `V4_PREAMBLE_TEMPLATE`,
marked `DEPRECATED -- kept for reference`. It is never referenced anywhere
in the codebase. Remove it.

### 1.14 LOW: Private API Imports Across Package Boundaries

External scripts import `_`-prefixed functions from `csb_metrics` internals:
- `scripts/verify_oracle_fail2pass.py:27` imports `_normalize`
- `scripts/cross_validate_gt.py:25` imports `_normalize`
- `scripts/judge_demo.py:39` imports `_select_prompt`, `_render_prompt`
- `scripts/csb_metrics/judge/oracle.py:28` imports `_resolve_task_dir`

These will break silently if the internal implementation changes. Promote
them to the public API or refactor the callers.

---

## 2. Feature & UX Improvements

### 2.1 Hybrid Scoring Is Fully Implemented but 100% Inert

`docs/EVALUATION_PIPELINE.md` lines 158-160 document hybrid scoring:
`composite = 0.6 * verifier_reward + 0.4 * rubric_score`, enabled via the
`--hybrid` flag on `run_judge.py`. The flag exists and the scoring logic works.
However, `criteria.json` (the file it depends on) does not exist in any task
directory across all 275 tasks. The feature is dead code.

**Improvement**: Either author `criteria.json` for Org tasks (where it makes
the most sense) or remove the feature and its documentation.

### 2.2 Nine `direct_verifier.sh` Files Are Unimplemented Placeholders

All nine are identical:
```bash
echo 'ERROR: direct_verifier.sh is a placeholder -- needs manual curation'
echo '0.0' > /logs/verifier/reward.txt
exit 1
```
Affected tasks span `csb_org_crossrepo_tracing`, `csb_org_incident`,
`csb_org_migration`, and `csb_org_org`. Any run in direct mode on these tasks
produces `reward=0.0` with no useful diagnostics.

### 2.3 `export_official_results.py` Suite Thresholds Are Stale

`SDLC_MIN_VALID_TASKS` at lines 76-85 requires 20 tasks per SDLC suite for
official qualification. The current `selected_benchmark_tasks.json` has:
- `csb_sdlc_understand`: 12 selected (threshold: 20)
- `csb_sdlc_design`: 15 selected (threshold: 20)
- `csb_sdlc_document`: 15 selected (threshold: 20)
- `csb_sdlc_secure`: 15 selected (threshold: 20)
- `csb_sdlc_debug`: 19 selected (threshold: 20)

**Six of nine SDLC suites would be flagged as "below minimum valid tasks"** in
any official export against the current task selection. The thresholds were never
updated when the task selection was rebalanced.

### 2.4 No Multi-Run Cross-Directory Comparison

`scripts/compare_configs.py` deduplicates by "latest `started_at` wins" per
`(suite, task, config)`. There is no way to lock a comparison to a specific pair
of run directories. If baseline and MCP runs happened weeks apart in different
directories, only the most recent per task is used.

**Improvement**: Add `--baseline-run-dir` and `--mcp-run-dir` flags to allow
explicit run pairing.

### 2.5 Submission System Has No Destination

`docs/SUBMISSION.md` and `docs/LEADERBOARD.md` describe packaging, validation,
and scoring rubrics in detail. Neither document mentions where to actually
submit the `.tar.gz` archive. There is no submission URL, email, GitHub issue
template, or API endpoint. The leaderboard does not exist as an accessible
service.

### 2.6 `LEADERBOARD.md` Per-Suite Task Counts Are Stale

The per-suite completeness table (set 2026-03-02) has task counts that don't
match `selected_benchmark_tasks.json`, `SDLC_MIN_VALID_TASKS` in
`export_official_results.py`, or the actual benchmark directories. A submitter
gets three different answers from three authoritative sources.

### 2.7 `continue` in `_launch_task_pair` Silently Skips MCP Runs

`configs/run_selected_tasks.sh` lines 653-656: when `Dockerfile.artifact_baseline`
is missing, a `continue` inside the function body propagates to the calling
`while` loop, skipping both the baseline AND MCP configs. The MCP run might be
entirely valid but is silently dropped. This should be `return 1` with separate
handling per config.

### 2.8 Org Task `csb_org_onboarding` Structural Inconsistencies

8 of 11 tasks in `csb_org_onboarding` are missing `use_case_id` in `task.toml`.
All other org suites have 100% coverage. Additionally, 67 `.org_backup` files
and 362 `.bak` files in the benchmark tree are development artifacts that were
committed to the repository.

---

## 3. Research Recommendations

### 3.1 Centralize Pricing Constants (Immediate)

Create `scripts/pricing.py`:
```python
PRICING = {
    "claude-opus-4-6": {
        "input_per_mtok": 15.00,
        "output_per_mtok": 75.00,
        "cache_write_per_mtok": 18.75,
        "cache_read_per_mtok": 1.50,
    },
    # ... other models
}
```
Import it from all cost-computing scripts. This eliminates the three conflicting
constants and makes model pricing updates a single-file change.

### 3.2 Fix the Statistical Methodology

Three specific issues in the statistics stack (`scripts/csb_metrics/statistics.py`
and `ir_analysis.py`):
1. **Tied-rank Spearman**: `ir_analysis.py` line 476 uses the simplified
   `r_s = 1 - 6*sum(d_i^2) / (n(n^2 - 1))` formula, which is incorrect when ties
   exist (common with many `reward=0.0` tasks). The tie-aware version already
   exists in `statistics.py` line 366 -- just import it.

2. **Multiple comparisons**: Per-suite p-values in
   `retrieval_outcome_correlation` (lines 494-512) have no Bonferroni or FDR
   correction. With 19+ suites, false discoveries are likely. Add
   Benjamini-Hochberg FDR correction.

3. **Small-sample t-test**: `welchs_t_test` (line 91) uses a normal
   approximation regardless of df. For suites with 11-15 tasks per config,
   df may be 10-20, where this overestimates statistical power. Implement
   a proper t-distribution CDF or emit a warning when `df < 30`.
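
For point 2, Benjamini-Hochberg is small enough to implement inline if adding a dependency such as `statsmodels` is undesirable; a minimal sketch (the function name is a suggestion):

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a parallel list of booleans: True where the hypothesis is rejected
    while controlling the false discovery rate at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k with p_(k) <= (k/m) * alpha ...
    threshold_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            threshold_rank = rank
    # ... and reject every hypothesis at or below that rank.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= threshold_rank:
            rejected[idx] = True
    return rejected
```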

### 3.3 Improve LLM Judge Robustness

Four specific improvements to `scripts/csb_metrics/judge/`:

1. **Propagate oracle confidence**: `engine.py` lines 244/313 use a binary
   presence check (`"high" if oracle_ground_truth else "low"`). The
   `OracleBundle.confidence` field exists but is ignored. Pass it through.

2. **Fix voting at temperature=0**: `evaluate_with_voting` (lines 271-275)
   sends identical prompts at `temperature=0.0`, producing identical outputs
   across all rounds. Multi-round voting is a no-op. Either default to
   `temperature > 0` for voting mode, or vary prompt structure per round.

3. **Align prompt scale with parser**: Prompts instruct a 3-point scale
   (0.0, 0.5, 1.0) but `_parse_dimension_scores` accepts continuous floats.
   Decide: continuous scoring (more information) or constrained scoring
   (more reliable). Document the decision.

4. **Add task-type-aware system prompts**: The system prompt is
   `"You are a precise code evaluator."` for all task types. A documentation
   task, a bug fix, and a security audit should have different evaluation
   priorities in the system prompt.

### 3.4 Add `ruff` Linting to CI

The duplicate function definitions in `abc_audit.py` (section 1.11) would
be caught automatically by `ruff check --select F811` (redefined-while-unused).
Add a `pyproject.toml` with:
```toml
[tool.ruff.lint]
select = ["E", "F", "W"]
ignore = ["E501"]  # line length handled separately
```
And a `.github/workflows/lint.yml` that runs `ruff check scripts/`.

### 3.5 Replace `parse_task_toml_simple` With `tomllib`

Three scripts (`abc_audit.py`, `abc_score_task.py`, `validate_tasks_preflight.py`)
contain a hand-rolled TOML parser with a known bug: its `'"""' in line: break`
check silently truncates multi-line strings. Python 3.11+ includes `tomllib` in the
standard library. This is a zero-dependency upgrade that also eliminates the
truncation bug.

### 3.6 Add Tool-Sequence Pattern Analysis

Current trace analysis counts tool call totals but cannot answer:
- Does the agent recover from failed MCP calls with local fallback searches?
- What is the reading:searching ratio across configs?
- Do successful agents start broad and narrow, or vice versa?

A `scripts/analyze_tool_sequences.py` that extracts ordered tool sequences
from trajectories and computes transition matrices would enable behavioral
comparison between baseline and MCP configs.
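
The core of such a script is a first-order transition matrix over tool names; a sketch, assuming trajectories have already been reduced to ordered lists of tool names:

```python
from collections import Counter, defaultdict

def transition_matrix(tool_sequences):
    """Count tool-to-tool transitions across trajectories.
    Returns {src_tool: {dst_tool: probability}} with rows normalized to sum to 1."""
    counts = defaultdict(Counter)
    for seq in tool_sequences:
        for src, dst in zip(seq, seq[1:]):  # consecutive pairs in the trajectory
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }
```

Comparing the baseline and MCP matrices row by row (e.g. what follows a failed search) gives the behavioral answers listed above.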

### 3.7 Add Cost-Efficiency Metric

`scripts/cost_report.py` computes `avg_cost_per_task` but not
`cost_per_passing_task` (total cost / passed count). For baseline vs MCP
comparison, cost-efficiency on solved tasks is the decision-relevant metric.
Add it to the report output, along with the `cache_write_tokens` and
`cache_read_tokens` columns that are currently computed but not displayed.
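
The metric itself is one division, with the edge case that it is undefined when nothing passed; a sketch (input shape is an assumption, not the actual `cost_report.py` data model):

```python
def cost_per_passing_task(results):
    """results: iterable of (cost_usd, passed) pairs.
    Returns total cost divided by the number of passing tasks, or None if none passed."""
    results = list(results)
    total = sum(cost for cost, _ in results)
    passed = sum(1 for _, ok in results if ok)
    return total / passed if passed else None  # None: metric undefined, report as "n/a"
```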

---

## 4. Recommended Next Feature

### Operational Recovery and Run Integrity System

**The single most impactful feature** is fixing the broken operational pipeline --
the tools operators use to launch, resume, and rerun benchmark tasks. Three
independently broken components make it impossible to reliably manage benchmark
runs:

**Part A -- Fix `rerun_failed.py`** (30 minutes)
1. Update the `SUITE_TO_BENCHMARK_DIR` mapping from `ccb_*` to `csb_sdlc_*` / `csb_org_*`
2. Add an integration test that verifies the mapping covers all suites in
   `selected_benchmark_tasks.json`
3. Verify generated commands work against a sample failed run directory

**Part B -- Create or stub `daytona_cost_guard.py`** (1 hour)
1. Either implement the cost guard (check the Daytona credit balance, abort if below
   threshold) or create a pass-through stub that logs a warning
2. This unblocks all 12+ config scripts that gate on this missing file
3. Add a `repo_health.py` check that verifies all scripts referenced by configs exist

**Part C -- Fix `--skip-completed` resume logic** (30 minutes)
1. Change `run_selected_tasks.sh` line 551 to check only for `result.json`
   (not `task_metrics.json`) when determining task completion
2. Fix the `continue` vs `return 1` bug in `_launch_task_pair` (line 653) so
   missing baseline Dockerfiles don't silently skip MCP runs
3. Add a `--resume-from <run-dir>` flag that automatically finds completed tasks

**Part D -- Align export thresholds with task selection** (15 minutes)
1. Update `SDLC_MIN_VALID_TASKS` in `export_official_results.py` to match the
   actual per-suite counts in `selected_benchmark_tasks.json`
2. Or better: read the thresholds from `selected_benchmark_tasks.json` dynamically

**Why this is highest impact**: The previous report's recommendation (task
registry reconciliation and a CI test gate) addresses data integrity -- important
but not blocking. This recommendation addresses the fact that **operators
currently cannot rerun failed tasks, resume crashed runs, or launch Daytona
runs at all**. Every benchmark execution session requires manual workarounds
for these three broken tools. Fixing them directly unblocks the next round of
benchmark runs.

**PRD-ready description**: "Implement an operational recovery system for
benchmark runs consisting of: (1) an updated `rerun_failed.py` with current
`csb_*` suite mappings and integration tests, (2) a `daytona_cost_guard.py`
script (or stub) that unblocks all Daytona-mode config scripts, (3) fixed
`--skip-completed` resume logic in `run_selected_tasks.sh` that checks only
`result.json` for completion and correctly handles missing baseline Dockerfiles
without skipping MCP runs, and (4) `SDLC_MIN_VALID_TASKS` thresholds in
`export_official_results.py` aligned with the current task selection. Success
criteria: `rerun_failed.py` generates valid commands for all 20 suites,
`configs/run_selected_tasks.sh` launches without cost-guard errors in Daytona
mode, `--skip-completed` correctly identifies and skips already-completed tasks
after a mid-run crash, and `export_official_results.py` does not flag any
suite as below minimum when all selected tasks have been run."