# Nightly Research Report - 2026-03-07
Second nightly review. Covers security, evaluation pipeline integrity, operational
tooling gaps, and statistical methodology. All findings are new -- nothing from the
2026-03-06 report is repeated.

---

## 1. Code & Architecture Review

### 1.1 CRITICAL: `rerun_failed.py` Is Completely Inoperative

`scripts/rerun_failed.py` lines 36-47 define `SUITE_TO_BENCHMARK_DIR` using the old
`ccb_*` naming convention. All current benchmark directories use `csb_sdlc_*` and
`csb_org_*`. When `aggregate_status.py` returns tasks with `suite="csb_sdlc_fix"`,
`SUITE_TO_BENCHMARK_DIR.get("csb_sdlc_fix")` returns `None`, the task prints
`"SKIP (no benchmark path)"`, and zero rerun commands are generated. **This script
silently produces no output for any current benchmark task.**
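
The shape of the fix is a renamed mapping plus a coverage self-check so the next rename cannot fail silently again. A minimal sketch -- the suite names beyond `csb_sdlc_fix` and the `benchmarks/` layout are illustrative assumptions, not the real selection:

```python
# Sketch of the corrected mapping; directory layout is an assumption.
SUITE_TO_BENCHMARK_DIR = {
    "csb_sdlc_fix": "benchmarks/csb_sdlc_fix",
    "csb_org_incident": "benchmarks/csb_org_incident",
    # ... one entry per current csb_sdlc_* / csb_org_* suite
}

def missing_suites(selected_suites):
    """Return suites from the task selection that the mapping does not cover."""
    return sorted(s for s in selected_suites if s not in SUITE_TO_BENCHMARK_DIR)
```

An integration test asserting `missing_suites(...) == []` against `selected_benchmark_tasks.json` would have caught the `ccb_*` staleness immediately.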
### 1.2 CRITICAL: `daytona_cost_guard.py` Does Not Exist

`scripts/daytona_cost_guard.py` is referenced in 12+ files:

- `configs/run_selected_tasks.sh` lines 422, 436
- `configs/openhands_2config.sh`, `codex_2config.sh`, `cursor_2config.sh`,
  `copilot_2config.sh`, `gemini_2config.sh`, `sdlc_suite_2config.sh`,
  `validate_one_per_benchmark.sh`, `multi_harness_compare.sh`
- `docs/DAYTONA.md` lines 67, 164, 169, 181, 203
- `docs/ops/WORKFLOWS.md` lines 13, 26
- `docs/ops/SCRIPT_INDEX.md` line 199

The script does not exist. Any Daytona launch via these configs will fail at the
cost-guard preflight check.

### 1.3 HIGH: Sourcegraph Token Written to Disk Without Permission Restrictions

`agents/claude_baseline_agent.py` embeds the `SOURCEGRAPH_ACCESS_TOKEN` in an
MCP config dict and writes it to `self.logs_dir / ".mcp.json"` at four locations
(lines ~1532, ~1682, ~1770, ~1912). The file is written with default permissions
(world-readable). The Harbor logs directory is typically archived in run output
directories, meaning tokens persist in archived run artifacts.

**Fix**: Add `os.chmod(mcp_config_path, 0o600)` after each write, and add
`.mcp.json` to the archive exclusion list.
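
A helper applied at all four write sites keeps the fix in one place. A minimal sketch, assuming only that the config is a JSON-serializable dict:

```python
import json
import os
from pathlib import Path

def write_mcp_config(logs_dir: Path, config: dict) -> Path:
    """Write the MCP config, then restrict it to owner read/write only."""
    mcp_config_path = logs_dir / ".mcp.json"
    mcp_config_path.write_text(json.dumps(config, indent=2))
    # 0o600: readable/writable by the owner, nothing for group/other.
    os.chmod(mcp_config_path, 0o600)
    return mcp_config_path
```

Chmod-after-write still leaves a brief world-readable window; passing `mode=0o600` to `os.open` before the first byte lands would close it, at the cost of a slightly longer helper.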
### 1.4 HIGH: `subprocess.run(shell=True)` With Task-Derived Commands

`scripts/csb_metrics/oracle_checks.py` line ~498 passes `test_command` (sourced
from task metadata files) to `subprocess.run(shell=True)`. If task metadata is
malformed or tampered with, this is a shell injection vector.

**Fix**: Use `shlex.split()` and `shell=False`.
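
The fix in sketch form -- with `shell=False`, shell metacharacters in the metadata (`;`, `&&`, backticks) become literal argument text rather than executable syntax:

```python
import shlex
import subprocess

def run_test_command(test_command: str) -> subprocess.CompletedProcess:
    """Run a metadata-supplied command without invoking a shell.

    shlex.split() tokenizes the string using POSIX shell rules; the
    resulting argv is exec'd directly, so injection payloads embedded
    in task metadata are passed as inert arguments.
    """
    argv = shlex.split(test_command)
    return subprocess.run(argv, shell=False, capture_output=True, text=True)
```

The trade-off: commands that legitimately rely on shell features (pipes, redirects) would need to be rewritten or explicitly whitelisted.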
### 1.5 HIGH: No Python Dependency Manifest

The repository has no `requirements.txt`, `pyproject.toml`, or `setup.py`. The
`scripts/csb_metrics/` package and agent code import third-party packages
(`anthropic`, `openai`) but there is no pinned dependency specification.
Different machines get different package versions.

### 1.6 HIGH: Three Conflicting Pricing Constants

Cache-read pricing is defined inconsistently across three scripts:

| Script | Cache-read rate | Source |
|--------|----------------:|--------|
| `scripts/cost_report.py:33` | $1.875/MTok | Incorrect |
| `scripts/csb_metrics/ir_metrics.py:25` | $1.50/MTok | Correct for Sonnet |
| `scripts/cost_breakdown_analysis.py:40` | $3.75/MTok (cache_create) | Different metric entirely |

Cost reports are producing incorrect totals. A single `scripts/pricing.py`
constants file with versioned pricing tables would eliminate this.

### 1.7 MEDIUM: Hardcoded Developer Path

`agents/claude_baseline_agent.py` line 31 and `agents/harnesses/base.py` line 17:

```python
LOCOBENCH_CLAUDE_MD_TEMPLATE = Path("/home/stephanie_jarmak/CodeScaleBench/...")
```

On any other machine the path does not resolve and the code silently falls back
with only a warning. No environment variable override exists.

### 1.8 MEDIUM: `_common.sh` Disk Space Check Silently Fails on macOS

`configs/_common.sh` line 272 uses `df -BG --output=avail`, which is GNU-specific.
On macOS (the development platform), `2>/dev/null` swallows the error and
`_disk_free` becomes empty, so the disk space gate **does nothing**. The check
reports OK regardless of available space.
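
One portable alternative is to do the check in Python via `shutil.disk_usage`, which behaves identically on GNU/Linux and macOS. A minimal sketch -- the 50 GB threshold is an illustrative assumption, not the repo's actual gate:

```python
import shutil

def disk_free_gb(path: str = ".") -> int:
    """Portable free-space check; avoids the GNU-only `df -BG --output=avail`."""
    return shutil.disk_usage(path).free // (1024 ** 3)

def check_disk_space(path: str = ".", min_free_gb: int = 50) -> bool:
    """Return True when at least min_free_gb is available at path."""
    return disk_free_gb(path) >= min_free_gb
```

`_common.sh` could call this via `python3 -c`, or the bash check could branch on `uname` and use BSD `df -g` on Darwin.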
### 1.9 MEDIUM: `--skip-completed` Resume Logic Has a Dependency Bug

`configs/run_selected_tasks.sh` line 551 checks for both `result.json` AND
`task_metrics.json` to consider a task completed. `task_metrics.json` is generated
by post-processing (`extract_all_metrics`), which runs after all tasks complete.
If a run crashes mid-batch, successfully completed tasks still lack
`task_metrics.json`, so `--skip-completed` re-runs them unnecessarily.

**Fix**: Check for `result.json` only, or generate `task_metrics.json` inline
after each task completes.
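
The corrected completion predicate, sketched in Python (the per-task directory layout is assumed from the file names above):

```python
from pathlib import Path

def task_is_completed(task_dir: Path) -> bool:
    """A task counts as completed once its result.json exists.

    task_metrics.json is deliberately NOT required: it is produced by
    post-run processing, so requiring it makes every crash-interrupted
    run look incomplete and defeats --skip-completed.
    """
    return (task_dir / "result.json").is_file()
```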
### 1.10 MEDIUM: SDK Client Instantiated on Every Judge API Call

`scripts/csb_metrics/judge/backends.py` lines ~110 and ~234 create a new
`anthropic.Anthropic()` or `openai.OpenAI()` client on every call. With 3-round
voting, this creates 3 separate HTTP client pools per task. The client should
be created once in `__init__` and reused.
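
The reuse pattern in sketch form. `client_factory` stands in for `anthropic.Anthropic` or `openai.OpenAI`; the backend class name and structure here are assumptions, not the file's real layout:

```python
class JudgeBackend:
    """Build the SDK client once, lazily, and reuse it for every call."""

    def __init__(self, client_factory):
        self._client_factory = client_factory
        self._client = None

    @property
    def client(self):
        if self._client is None:          # first access constructs the client
            self._client = self._client_factory()
        return self._client               # later calls reuse the same HTTP pool
```

With this shape, three voting rounds share one connection pool instead of opening three.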
### 1.11 MEDIUM: Duplicate Function Definitions in `abc_audit.py`

Four check functions are defined twice in `scripts/abc_audit.py`:

- `check_t10_shared_state`: lines 424-480 AND 1564-1619 (with different semantics)
- `check_oa_equivalent_solutions`: lines 1118-1164 AND 1622-1668
- `check_ob_negated_solutions`: lines 1167-1224 AND 1671-1728
- `check_og_determinism`: lines 1226-1308 AND 1731-1789

Python uses the last definition, so the first copies are dead code. The two
versions of `check_t10_shared_state` behave differently: the first flags all
`/tmp` paths; the second flags them only when port binds are also present.

### 1.12 MEDIUM: `NODE_TLS_REJECT_UNAUTHORIZED=0` Set Globally in Container

`agents/claude_baseline_agent.py` lines ~1484 and ~1638 use
`environment.exec('export NODE_TLS_REJECT_UNAUTHORIZED=0')`, which disables TLS
certificate validation for ALL Node.js processes in the container, not just the
MCP subprocess. This is inconsistent with `create_run_agent_commands`
(line ~1142), which scopes it to a single process via `env_with_autonomous`.

### 1.13 LOW: 150 Lines of Dead Code (`V4_PREAMBLE_TEMPLATE`)

`agents/claude_baseline_agent.py` lines 175-326 define `V4_PREAMBLE_TEMPLATE`,
marked `DEPRECATED -- kept for reference`. It is never referenced anywhere
in the codebase. Remove it.

### 1.14 LOW: Private API Imports Across Package Boundaries

External scripts import `_`-prefixed functions from `csb_metrics` internals:

- `scripts/verify_oracle_fail2pass.py:27` imports `_normalize`
- `scripts/cross_validate_gt.py:25` imports `_normalize`
- `scripts/judge_demo.py:39` imports `_select_prompt`, `_render_prompt`
- `scripts/csb_metrics/judge/oracle.py:28` imports `_resolve_task_dir`

These imports can break without warning whenever the internal implementation
changes. Promote the functions to the public API or refactor the callers.

---
## 2. Feature & UX Improvements

### 2.1 Hybrid Scoring Is Fully Implemented but 100% Inert

`docs/EVALUATION_PIPELINE.md` lines 158-160 document hybrid scoring:
`composite = 0.6 * verifier_reward + 0.4 * rubric_score`, enabled via the
`--hybrid` flag on `run_judge.py`. The flag exists and the scoring logic works.
However, `criteria.json` (the file it depends on) does not exist in any task
directory across all 275 tasks. The feature is dead code.

**Improvement**: Either author `criteria.json` for Org tasks (where it makes
the most sense) or remove the feature and its documentation.

### 2.2 Nine `direct_verifier.sh` Files Are Unimplemented Placeholders

All nine are identical:

```bash
echo 'ERROR: direct_verifier.sh is a placeholder -- needs manual curation'
echo '0.0' > /logs/verifier/reward.txt
exit 1
```

Affected tasks span `csb_org_crossrepo_tracing`, `csb_org_incident`,
`csb_org_migration`, and `csb_org_org`. Any run in direct mode on these tasks
produces `reward=0.0` with no useful diagnostics.

### 2.3 `export_official_results.py` Suite Thresholds Are Stale

`SDLC_MIN_VALID_TASKS` at lines 76-85 requires 20 tasks per SDLC suite for
official qualification. Current `selected_benchmark_tasks.json` has:

- `csb_sdlc_understand`: 12 selected (threshold: 20)
- `csb_sdlc_design`: 15 selected (threshold: 20)
- `csb_sdlc_document`: 15 selected (threshold: 20)
- `csb_sdlc_secure`: 15 selected (threshold: 20)
- `csb_sdlc_debug`: 19 selected (threshold: 20)

**Six of nine SDLC suites would be flagged as "below minimum valid tasks"** in
any official export against the current task selection. The thresholds were never
updated when the task selection was rebalanced.
### 2.4 No Multi-Run Cross-Directory Comparison

`scripts/compare_configs.py` deduplicates by "latest `started_at` wins" per
`(suite, task, config)`. There is no way to lock a comparison to a specific pair
of run directories. If baseline and MCP runs happened weeks apart in different
directories, only the most recent result per task is used.

**Improvement**: Add `--baseline-run-dir` and `--mcp-run-dir` flags to allow
explicit run pairing.

### 2.5 Submission System Has No Destination

`docs/SUBMISSION.md` and `docs/LEADERBOARD.md` describe packaging, validation,
and scoring rubrics in detail. Neither document mentions where to actually
submit the `.tar.gz` archive. There is no submission URL, email, GitHub issue
template, or API endpoint. The leaderboard does not exist as an accessible
service.

### 2.6 `LEADERBOARD.md` Per-Suite Task Counts Are Stale

The per-suite completeness table (set 2026-03-02) has task counts that don't
match `selected_benchmark_tasks.json`, `SDLC_MIN_VALID_TASKS` in
`export_official_results.py`, or the actual benchmark directories. A submitter
gets three different answers from three supposedly authoritative sources.

### 2.7 `continue` in `_launch_task_pair` Silently Skips MCP Runs

`configs/run_selected_tasks.sh` lines 653-656: when `Dockerfile.artifact_baseline`
is missing, `continue` inside the function body propagates to the calling
`while` loop, skipping both the baseline AND MCP configs. The MCP run might be
entirely valid but is silently dropped. This should be `return 1` with separate
handling for each config.
### 2.8 Org Task `csb_org_onboarding` Structural Inconsistencies

8 of 11 tasks in `csb_org_onboarding` are missing `use_case_id` in `task.toml`;
all other org suites have 100% coverage. Additionally, 67 `.org_backup` files
and 362 `.bak` files in the benchmark tree are development artifacts that were
committed to the repository.

---
## 3. Research Recommendations

### 3.1 Centralize Pricing Constants (Immediate)

Create `scripts/pricing.py`:

```python
PRICING = {
    "claude-opus-4-6": {
        "input_per_mtok": 15.00,
        "output_per_mtok": 75.00,
        "cache_write_per_mtok": 18.75,
        "cache_read_per_mtok": 1.50,
    },
    # ... other models
}
```

Import it from all cost-computing scripts. This eliminates the three conflicting
constants and makes model pricing updates a single-file change.
### 3.2 Fix the Statistical Methodology

Three specific issues in the statistics code under `scripts/csb_metrics/`:

1. **Tied-rank Spearman**: `ir_analysis.py` line 476 uses the simplified
   formula `r = 1 - 6 * Σd_i² / (n(n² - 1))`, which is incorrect when ties
   exist (common with many `reward=0.0` tasks). The tie-aware version already
   exists in `statistics.py` line 366 -- just import it.

2. **Multiple comparisons**: Per-suite p-values in
   `retrieval_outcome_correlation` (lines 494-512) have no Bonferroni or FDR
   correction. With 19+ suites, false discoveries are likely. Add
   Benjamini-Hochberg FDR correction.

3. **Small-sample t-test**: `welchs_t_test` (line 91) uses a normal
   approximation regardless of degrees of freedom. For suites with 11-15 tasks
   per config, df may be 10-20, where the approximation overestimates
   statistical power. Implement a proper t-distribution CDF or emit a warning
   when `df < 30`.
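
Item 2's Benjamini-Hochberg correction is small enough to implement without SciPy. A minimal sketch of the step-up procedure over raw per-suite p-values:

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a reject/keep flag per p-value under BH FDR control.

    Step-up rule: sort p-values ascending, find the largest rank k with
    p_(k) <= (k/n) * alpha, and reject every hypothesis ranked <= k.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    max_k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / n * alpha:
            max_k = rank
    rejected = [False] * n
    for rank, i in enumerate(order, start=1):
        if rank <= max_k:
            rejected[i] = True
    return rejected
```

Applying this to the 19+ per-suite p-values before reporting "significant" suites would control the expected false-discovery rate at `alpha`.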
### 3.3 Improve LLM Judge Robustness

Four specific improvements to `scripts/csb_metrics/judge/`:

1. **Propagate oracle confidence**: `engine.py` lines 244/313 use a binary
   presence check (`"high" if oracle_ground_truth else "low"`). The
   `OracleBundle.confidence` field exists but is ignored. Pass it through.

2. **Fix voting at temperature=0**: `evaluate_with_voting` (lines 271-275)
   sends identical prompts at `temperature=0.0`, producing identical outputs
   across all rounds, so multi-round voting is a no-op. Either default to
   `temperature > 0` for voting mode, or vary the prompt structure per round.

3. **Align prompt scale with parser**: Prompts instruct a 3-point scale
   (0.0, 0.5, 1.0) but `_parse_dimension_scores` accepts continuous floats.
   Decide: continuous scoring (more information) or constrained scoring
   (more reliable). Document the decision.

4. **Add task-type-aware system prompts**: The system prompt is
   `"You are a precise code evaluator."` for all task types. A documentation
   task, a bug fix, and a security audit should have different evaluation
   priorities in the system prompt.

### 3.4 Add `ruff` Linting to CI

The duplicate function definitions in `abc_audit.py` (section 1.11) would
be caught automatically by `ruff check --select F811` (redefined-while-unused).
Add a `pyproject.toml` with:

```toml
[tool.ruff.lint]
select = ["E", "F", "W"]
ignore = ["E501"]  # line length handled separately
```

And a `.github/workflows/lint.yml` that runs `ruff check scripts/`.

### 3.5 Replace `parse_task_toml_simple` With `tomllib`

Three scripts (`abc_audit.py`, `abc_score_task.py`, `validate_tasks_preflight.py`)
contain a hand-rolled TOML parser with a known bug: `'"""' in line: break`
silently truncates multi-line strings. Python 3.11+ includes `tomllib` in the
standard library. This is a zero-dependency upgrade that also eliminates the
truncation bug.
### 3.6 Add Tool-Sequence Pattern Analysis

Current trace analysis counts tool call totals but cannot answer:

- Does the agent recover from failed MCP calls with local fallback searches?
- What is the reading:searching ratio across configs?
- Do successful agents start broad and narrow down, or vice versa?

A `scripts/analyze_tool_sequences.py` that extracts ordered tool sequences
from trajectories and computes transition matrices would enable behavioral
comparison between baseline and MCP configs.
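
The core of such a script is counting adjacent tool-call pairs. A minimal sketch (the tool names are illustrative):

```python
from collections import Counter

def transition_counts(tool_sequence):
    """Count adjacent tool-call pairs -- the raw transition matrix."""
    return Counter(zip(tool_sequence, tool_sequence[1:]))

def transition_probabilities(tool_sequence):
    """Normalize counts into per-source-tool transition probabilities."""
    counts = transition_counts(tool_sequence)
    totals = Counter()
    for (src, _dst), n in counts.items():
        totals[src] += n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}
```

Comparing these matrices between baseline and MCP runs (e.g. the probability of a search being followed by a read) would make behavioral differences quantifiable.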
### 3.7 Add Cost-Efficiency Metric

`scripts/cost_report.py` computes `avg_cost_per_task` but not
`cost_per_passing_task` (total cost / passed count). For baseline vs MCP
comparison, cost-efficiency on solved tasks is the decision-relevant metric.
Add it to the report output, along with the `cache_write_tokens` and
`cache_read_tokens` columns that are currently computed but not displayed.
---

## 4. Recommended Next Feature

### Operational Recovery and Run Integrity System

**The single most impactful feature** is fixing the broken operational pipeline --
the tools operators use to launch, resume, and rerun benchmark tasks. Three
independently broken components make it impossible to reliably manage benchmark
runs:

**Part A -- Fix `rerun_failed.py`** (30 minutes)
1. Update the `SUITE_TO_BENCHMARK_DIR` mapping from `ccb_*` to `csb_sdlc_*` / `csb_org_*`
2. Add an integration test that verifies the mapping covers all suites in
   `selected_benchmark_tasks.json`
3. Verify generated commands work against a sample failed run directory

**Part B -- Create or stub `daytona_cost_guard.py`** (1 hour)
1. Either implement the cost guard (check the Daytona credit balance, abort if
   below threshold) or create a pass-through stub that logs a warning
2. This unblocks all 12+ config scripts that gate on the missing file
3. Add a `repo_health.py` check that verifies all scripts referenced by configs exist

**Part C -- Fix `--skip-completed` resume logic** (30 minutes)
1. Change `run_selected_tasks.sh` line 551 to check only for `result.json`
   (not `task_metrics.json`) when determining task completion
2. Fix the `continue` vs `return 1` bug in `_launch_task_pair` (line 653) so
   missing baseline Dockerfiles don't silently skip MCP runs
3. Add a `--resume-from <run-dir>` flag that automatically finds completed tasks

**Part D -- Align export thresholds with task selection** (15 minutes)
1. Update `SDLC_MIN_VALID_TASKS` in `export_official_results.py` to match
   actual per-suite counts in `selected_benchmark_tasks.json`
2. Or better: read the thresholds from `selected_benchmark_tasks.json` dynamically
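
The dynamic option from Part D could look like the sketch below. It assumes `selected_benchmark_tasks.json` contains a list of task records each carrying a `"suite"` key; the real schema may differ:

```python
import json
from collections import Counter
from pathlib import Path

def load_suite_thresholds(selection_path: Path) -> dict[str, int]:
    """Derive per-suite minimum-valid-task counts from the selection file.

    Because the thresholds come from the same file that defines the
    selection, they can never drift out of sync after a rebalance.
    """
    tasks = json.loads(selection_path.read_text())
    return dict(Counter(task["suite"] for task in tasks))
```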
**Why this is highest impact**: The previous report's recommendation (task
registry reconciliation and a CI test gate) addresses data integrity -- important
but not blocking. This recommendation addresses the fact that **operators
currently cannot rerun failed tasks, resume crashed runs, or launch Daytona
runs at all**. Every benchmark execution session requires manual workarounds
for these three broken tools. Fixing them directly unblocks the next round of
benchmark runs.

**PRD-ready description**: "Implement an operational recovery system for
benchmark runs consisting of: (1) an updated `rerun_failed.py` with current
`csb_*` suite mappings and integration tests, (2) a `daytona_cost_guard.py`
script (or stub) that unblocks all Daytona-mode config scripts, (3) fixed
`--skip-completed` resume logic in `run_selected_tasks.sh` that checks only
`result.json` for completion and correctly handles missing baseline Dockerfiles
without skipping MCP runs, and (4) `SDLC_MIN_VALID_TASKS` thresholds in
`export_official_results.py` aligned with the current task selection. Success
criteria: `rerun_failed.py` generates valid commands for all 20 suites,
`configs/run_selected_tasks.sh` launches without cost-guard errors in Daytona
mode, `--skip-completed` correctly identifies and skips already-completed tasks
after a mid-run crash, and `export_official_results.py` does not flag any
suite as below minimum when all selected tasks have been run."
