Commit d88e1e3

sjarmak and claude committed
feat: promote ccb_refactor pass 1 + document Daytona storage override
- Promote refactor_haiku_20260301_010758 to official (20 BL + 20 MCP results)
- Document DAYTONA_OVERRIDE_STORAGE=10240 workaround in docs/DAYTONA.md (39 tasks set storage="20G" which exceeds Daytona's 10GB sandbox limit)
- Add configs/run_all_sdlc_variance.sh batch launcher for 8-suite x 3-pass runs
- Regenerate MANIFEST.json and official_results export

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 4d522d3 commit d88e1e3

989 files changed (+864767, -31507 lines changed)

configs/run_all_sdlc_variance.sh

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
+#!/bin/bash
+# Launch all SDLC suites on local Docker with 3 variance passes.
+# Usage: bash configs/run_all_sdlc_variance.sh [--passes N] [--suites "suite1 suite2 ..."] [--skip-first-pass-for "suite1 ..."]
+#
+# Defaults: 3 passes, all 8 SDLC suites (excluding ccb_fix which already has 3 runs).
+# Runs sequentially: each suite finishes before the next starts.
+
+set -eo pipefail  # pipefail: the "| tee" pipeline below must report the suite's exit status, not tee's
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+cd "$SCRIPT_DIR/.."
+
+# Load credentials
+if [ -f .env.local ]; then
+    set -a; source .env.local; set +a
+fi
+
+PASSES=${PASSES:-3}
+SUITES="${SUITES:-debug design document feature refactor secure test understand}"
+
+# Parse args
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --passes) PASSES="$2"; shift 2 ;;
+        --suites) SUITES="$2"; shift 2 ;;
+        --skip-first-pass-for)
+            # Skip specific suites in pass 1 (already have a run in progress)
+            SKIP_PASS1="$2"; shift 2 ;;
+        *) echo "Unknown option: $1"; exit 1 ;;
+    esac
+done
+
+echo "=============================================="
+echo "LOCAL DOCKER BATCH LAUNCHER"
+echo "=============================================="
+echo "Suites: $SUITES"
+echo "Passes: $PASSES"
+echo "Skip P1: ${SKIP_PASS1:-none}"
+echo ""
+
+TOTAL_SUITES=$(echo $SUITES | wc -w)
+TOTAL_RUNS=$((TOTAL_SUITES * PASSES))
+echo "Total suite runs: $TOTAL_RUNS ($TOTAL_SUITES suites x $PASSES passes)"
+echo "Started at: $(date)"
+echo ""
+read -r -p "Press Enter to start, Ctrl+C to abort... " _
+
+FAILED=()
+COMPLETED=()
+LOGDIR="runs/staging/_batch_logs_$(date +%Y%m%d_%H%M%S)"
+mkdir -p "$LOGDIR"
+
+for pass in $(seq 1 "$PASSES"); do
+    echo ""
+    echo "========================================"
+    echo "VARIANCE PASS $pass / $PASSES ($(date))"
+    echo "========================================"
+
+    for suite in $SUITES; do
+        # Skip suites in pass 1 if requested
+        if [ "$pass" -eq 1 ] && echo "${SKIP_PASS1:-}" | grep -qw "$suite"; then
+            echo "--- Skipping ${suite} pass $pass (already running) ---"
+            continue
+        fi
+
+        echo ""
+        echo "--- Pass $pass: ${suite} ($(date)) ---"
+        start_time=$(date +%s)
+        logfile="${LOGDIR}/${suite}_pass${pass}.log"
+
+        # Pipe empty line for confirm_launch gate, skip prebuild
+        if echo '' | bash "$SCRIPT_DIR/${suite}_2config.sh" --no-prebuild 2>&1 | tee "$logfile"; then
+            elapsed=$(( $(date +%s) - start_time ))
+            echo "COMPLETED: ${suite} pass $pass (${elapsed}s)"
+            COMPLETED+=("${suite}_pass${pass}")
+        else
+            elapsed=$(( $(date +%s) - start_time ))
+            echo "FAILED: ${suite} pass $pass (${elapsed}s)"
+            FAILED+=("${suite}_pass${pass}")
+        fi
+
+        echo "--- Cooling down 10s between suites ---"
+        sleep 10
+    done
+done
+
+echo ""
+echo "=============================================="
+echo "BATCH COMPLETE ($(date))"
+echo "=============================================="
+echo "Completed: ${#COMPLETED[@]} / $TOTAL_RUNS"
+if [ ${#FAILED[@]} -gt 0 ]; then
+    echo "Failed: ${FAILED[*]}"
+fi
+echo ""
+echo "Logs: $LOGDIR"
+echo "Runs saved to: runs/staging/"
+ls -dt runs/staging/*_haiku_2026030* 2>/dev/null | head -n "$((TOTAL_RUNS + 5))" || true
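
Review note: the launcher sorts each run into COMPLETED or FAILED from the exit status of a pipeline that ends in `tee`. In bash, a pipeline reports the status of its last command unless `pipefail` is enabled, so the setting decides whether a failing suite is noticed at all. The sketch below shows that difference in isolation. `failing_suite` and `classify_run` are hypothetical names used only for this illustration; they are not part of the repository.

```bash
#!/bin/bash
# Hypothetical stand-in for a suite script that prints output and then fails.
failing_suite() {
    echo "suite output"
    return 3
}

# classify_run on|off
# Runs the failing suite through "| tee" in a subshell (note the parentheses),
# so the pipefail setting chosen here does not leak into the caller.
classify_run() (
    if [ "$1" = "on" ]; then
        set -o pipefail
    else
        set +o pipefail
    fi
    # Same shape as the launcher's check: command 2>&1 | tee <log>
    if failing_suite 2>&1 | tee /dev/null >/dev/null; then
        echo "COMPLETED"
    else
        echo "FAILED"
    fi
)

echo "pipefail off: $(classify_run off)"
echo "pipefail on:  $(classify_run on)"
```

Without `pipefail` the branch taken depends only on `tee`, which almost always succeeds. An alternative that avoids the shell option is to inspect `${PIPESTATUS[0]}` immediately after the pipeline.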

docs/DAYTONA.md

Lines changed: 3 additions & 0 deletions
@@ -271,6 +271,7 @@ python3 scripts/build_daytona_registry.py
 | `SRC_ACCESS_TOKEN` | For MCP configs | Sourcegraph access token |
 | `DAYTONA_API_URL` | No | Override API endpoint (default: `https://app.daytona.io/api`) |
 | `DAYTONA_TARGET` | No | Override target region (default: `us`) |
+| `DAYTONA_OVERRIDE_STORAGE` | No | Override per-sandbox storage (MB). Set to `10240` to cap at Daytona's 10GB limit when tasks specify larger values in task.toml |
 
 ## Troubleshooting
 
@@ -285,3 +286,5 @@ python3 scripts/build_daytona_registry.py
 **MCP config errors**: Verify `SRC_ACCESS_TOKEN` is valid: `curl -H "Authorization: token $SRC_ACCESS_TOKEN" https://sourcegraph.com/.api/graphql`.
 
 **Harbor + Daytona: sandbox not found**: Ensure `daytona-sdk` is installed in the same Python environment as Harbor. The `DaytonaEnvironment` imports from `daytona`.
+
+**Sandbox creation fails for tasks with `storage = "20G"` in task.toml**: Daytona has a hard 10GB per-sandbox storage limit. 39 tasks specify `storage = "20G"` and 1 specifies `"15G"`, exceeding this limit. Set `export DAYTONA_OVERRIDE_STORAGE=10240` before launching runs. This passes `--override-storage-mb 10240` to all `harbor run` commands, capping storage at 10GB. The actual Docker images are 1.5-5GB so 10GB is sufficient.
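
Review note: the arithmetic behind the documented workaround can be sketched as follows. A `storage = "20G"` request is 20480 MB, which exceeds the 10240 MB value the docs recommend for `DAYTONA_OVERRIDE_STORAGE`. The helpers `to_mb` and `capped_storage_mb` are hypothetical and written only for this illustration. They are not part of Harbor or the Daytona SDK, and this commit does not show how `--override-storage-mb` is applied internally.

```bash
#!/bin/bash
# Hypothetical helper: convert a task.toml-style storage string ("20G", "512M")
# to whole megabytes. Only integer values with a G or M suffix are handled.
to_mb() {
    local value="$1"
    case "$value" in
        *G) echo $(( ${value%G} * 1024 )) ;;
        *M) echo "${value%M}" ;;
        *)  echo "unsupported storage value: $value" >&2; return 1 ;;
    esac
}

# Hypothetical helper: print the storage (MB) a sandbox would get, assuming the
# override acts as an upper bound whenever DAYTONA_OVERRIDE_STORAGE is set.
capped_storage_mb() {
    local requested limit
    requested=$(to_mb "$1") || return 1
    limit="${DAYTONA_OVERRIDE_STORAGE:-}"
    if [ -n "$limit" ] && [ "$requested" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$requested"
    fi
}

export DAYTONA_OVERRIDE_STORAGE=10240
for request in 20G 15G 5G; do
    echo "$request -> $(capped_storage_mb "$request") MB"
done
```

Under that assumption, the `20G` and `15G` requests mentioned in the docs both come out at 10240 MB, while a smaller request such as `5G` is left unchanged.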

docs/official_results/README.md

Lines changed: 5 additions & 1 deletion
@@ -2,7 +2,7 @@
 
 This bundle is generated from `runs/official/` and includes only valid scored tasks (`passed`/`failed` with numeric reward).
 
-Generated: `2026-03-01T00:38:37.640286+00:00`
+Generated: `2026-03-01T02:13:03.266350+00:00`
 
 ## Local Browse
 
@@ -61,6 +61,8 @@ Historical reruns/backfills remain available in `data/official_results.json` und
 | [ccb_mcp_security](suites/ccb_mcp_security.md) | `baseline-local-direct` | 4 | 16 | 0.603 | 1.000 | FLAG: below minimum |
 | [ccb_mcp_security](suites/ccb_mcp_security.md) | `mcp-remote-artifact` | 6 | 16 | 0.792 | 1.000 | FLAG: below minimum |
 | [ccb_mcp_security](suites/ccb_mcp_security.md) | `mcp-remote-direct` | 16 | 16 | 0.705 | 1.000 | ok |
+| [ccb_refactor](suites/ccb_refactor.md) | `baseline-local-direct` | 20 | 20 | 0.791 | 0.950 | ok |
+| [ccb_refactor](suites/ccb_refactor.md) | `mcp-remote-direct` | 20 | 20 | 0.737 | 0.950 | ok |
 | [ccb_secure](suites/ccb_secure.md) | `baseline-local-direct` | 20 | 20 | 0.669 | 0.950 | ok |
 | [ccb_secure](suites/ccb_secure.md) | `mcp-remote-direct` | 22 | 20 | 0.645 | 0.909 | ok |
 | [ccb_test](suites/ccb_test.md) | `baseline-local-direct` | 20 | 20 | 0.480 | 0.750 | ok |
@@ -237,6 +239,8 @@ Historical reruns/backfills remain available in `data/official_results.json` und
 | [fix_haiku_20260226_024454](runs/fix_haiku_20260226_024454.md) | `ccb_fix` | `mcp-remote-direct` | 3 | 0.000 | 0.000 |
 | [fix_haiku_20260226_new3tasks](runs/fix_haiku_20260226_new3tasks.md) | `ccb_fix` | `baseline-local-direct` | 3 | 0.727 | 1.000 |
 | [fix_haiku_20260226_new3tasks](runs/fix_haiku_20260226_new3tasks.md) | `ccb_fix` | `mcp-remote-direct` | 3 | 0.801 | 1.000 |
+| [refactor_haiku_20260301_010758](runs/refactor_haiku_20260301_010758.md) | `ccb_refactor` | `baseline-local-direct` | 20 | 0.791 | 0.950 |
+| [refactor_haiku_20260301_010758](runs/refactor_haiku_20260301_010758.md) | `ccb_refactor` | `mcp-remote-direct` | 20 | 0.737 | 0.950 |
 | [secure_haiku_20260223_232545](runs/secure_haiku_20260223_232545.md) | `ccb_secure` | `baseline-local-direct` | 20 | 0.669 | 0.950 |
 | [secure_haiku_20260223_232545](runs/secure_haiku_20260223_232545.md) | `ccb_secure` | `mcp-remote-direct` | 18 | 0.705 | 1.000 |
 | [secure_haiku_20260224_011825](runs/secure_haiku_20260224_011825.md) | `ccb_secure` | `mcp-remote-direct` | 2 | 0.500 | 0.500 |
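
Review note: the two promoted `ccb_refactor` rows can be compared directly from the suite table. The sketch below parses them with `awk` and prints the difference between the `mcp-remote-direct` and `baseline-local-direct` values in the fifth data column. The table header is not visible in this diff, so treating that column as the mean reward is an assumption, and `reward_delta` is a hypothetical helper, not a script in the repository.

```bash
#!/bin/bash
# The two ccb_refactor suite rows added by this commit, copied verbatim.
rows='| [ccb_refactor](suites/ccb_refactor.md) | `baseline-local-direct` | 20 | 20 | 0.791 | 0.950 | ok |
| [ccb_refactor](suites/ccb_refactor.md) | `mcp-remote-direct` | 20 | 20 | 0.737 | 0.950 | ok |'

# Hypothetical helper: read markdown table rows on stdin and print
# (mcp-remote-direct value) - (baseline-local-direct value) for the fifth data
# column. Splitting on "|" makes the leading empty cell $1, so the config name
# is $3 and the fifth data column is $6.
reward_delta() {
    awk -F'|' '
        $3 ~ /baseline-local-direct/ { base = $6 + 0 }
        $3 ~ /mcp-remote-direct/     { mcp  = $6 + 0 }
        END { printf "%+.3f\n", mcp - base }
    '
}

printf '%s\n' "$rows" | reward_delta
```

A negative result means the MCP configuration scored lower than the baseline on that column for this single pass. One pass is not enough to judge variance, which is what the batch launcher's repeated passes are meant to address.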
