
Commit 4bb433d

sjarmak and claude committed
feat: complete MCP-unique variance coverage to 220/220 at 3+ paired runs
Targeted reruns for the last 5 gap tasks (2 passes):
- Pass 1: 5 tasks (onboard-138, vuln-remed-161/164/166/169)
- Pass 2: 4 remaining tasks (onboard-044, vuln-remed-161/164/169)

Final coverage: 220/220 MCP-unique tasks at 3+ paired runs (100%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 10b2a5e commit 4bb433d

9 files changed: +4513 -540 lines changed

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+{
+  "metadata": {
+    "title": "Variance rerun: 5 remaining MCP-unique gap tasks (pass 1)",
+    "description": "Targeted rerun to bring the last 5 tasks to 3+ paired runs. All 5 tasks get 1 paired run.",
+    "generated_date": "2026-03-02",
+    "total_tasks": 5
+  },
+  "statistics": {
+    "total_tasks": 5,
+    "per_suite": {
+      "csb_org_onboarding": 1,
+      "csb_org_security": 4
+    }
+  },
+  "tasks": [
+    {
+      "task_id": "ccx-onboard-138",
+      "benchmark": "csb_org_onboarding",
+      "task_dir": "csb_org_onboarding/ccx-onboard-138"
+    },
+    {
+      "task_id": "ccx-vuln-remed-161",
+      "benchmark": "csb_org_security",
+      "task_dir": "csb_org_security/ccx-vuln-remed-161"
+    },
+    {
+      "task_id": "ccx-vuln-remed-164",
+      "benchmark": "csb_org_security",
+      "task_dir": "csb_org_security/ccx-vuln-remed-164"
+    },
+    {
+      "task_id": "ccx-vuln-remed-166",
+      "benchmark": "csb_org_security",
+      "task_dir": "csb_org_security/ccx-vuln-remed-166"
+    },
+    {
+      "task_id": "ccx-vuln-remed-169",
+      "benchmark": "csb_org_security",
+      "task_dir": "csb_org_security/ccx-vuln-remed-169"
+    }
+  ]
+}
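
The manifest above repeats its task count in three places (`metadata.total_tasks`, `statistics.total_tasks`, and the `per_suite` breakdown), so a hand-edited rerun file can drift out of sync. A minimal sketch of a consistency check follows. The function name `check_manifest` is hypothetical and not part of this repository, and the rule that `task_dir` equals `<benchmark>/<task_id>` is an assumption inferred from the entries shown here.

```python
from collections import Counter


def check_manifest(manifest):
    """Return a list of inconsistencies; the list is empty when the manifest agrees with itself."""
    problems = []
    tasks = manifest["tasks"]

    # Both total_tasks fields should equal the number of task entries.
    for section in ("metadata", "statistics"):
        declared = manifest[section]["total_tasks"]
        if declared != len(tasks):
            problems.append(
                f"{section}.total_tasks is {declared}, but {len(tasks)} tasks are listed"
            )

    # per_suite should match a recount of each task's "benchmark" field.
    recount = dict(Counter(task["benchmark"] for task in tasks))
    declared_suites = manifest["statistics"]["per_suite"]
    if recount != declared_suites:
        problems.append(f"per_suite is {declared_suites}, but a recount gives {recount}")

    # Assumed convention (inferred from this diff): task_dir is "<benchmark>/<task_id>".
    for task in tasks:
        expected_dir = f"{task['benchmark']}/{task['task_id']}"
        if task["task_dir"] != expected_dir:
            problems.append(f"{task['task_id']}: task_dir is {task['task_dir']}")

    return problems


# The pass 1 manifest from this commit, with the descriptive metadata fields omitted.
PASS_1_TASKS = [
    ("csb_org_onboarding", "ccx-onboard-138"),
    ("csb_org_security", "ccx-vuln-remed-161"),
    ("csb_org_security", "ccx-vuln-remed-164"),
    ("csb_org_security", "ccx-vuln-remed-166"),
    ("csb_org_security", "ccx-vuln-remed-169"),
]
pass_1 = {
    "metadata": {"total_tasks": 5},
    "statistics": {
        "total_tasks": 5,
        "per_suite": {"csb_org_onboarding": 1, "csb_org_security": 4},
    },
    "tasks": [
        {"task_id": task_id, "benchmark": suite, "task_dir": f"{suite}/{task_id}"}
        for suite, task_id in PASS_1_TASKS
    ],
}

print(check_manifest(pass_1))  # prints []
```

Returning a list of messages rather than raising on the first mismatch lets a caller report every problem in one pass.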
Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+{
+  "metadata": {
+    "title": "Variance rerun: 4 remaining gap tasks needing baseline runs",
+    "description": "Targeted rerun for 4 tasks still under 3 paired runs. All need 1+ more baseline.",
+    "generated_date": "2026-03-02",
+    "total_tasks": 4
+  },
+  "statistics": {
+    "total_tasks": 4,
+    "per_suite": {
+      "csb_org_onboarding": 1,
+      "csb_org_security": 3
+    }
+  },
+  "tasks": [
+    {
+      "task_id": "ccx-onboard-044",
+      "benchmark": "csb_org_onboarding",
+      "task_dir": "csb_org_onboarding/ccx-onboard-044"
+    },
+    {
+      "task_id": "ccx-vuln-remed-161",
+      "benchmark": "csb_org_security",
+      "task_dir": "csb_org_security/ccx-vuln-remed-161"
+    },
+    {
+      "task_id": "ccx-vuln-remed-164",
+      "benchmark": "csb_org_security",
+      "task_dir": "csb_org_security/ccx-vuln-remed-164"
+    },
+    {
+      "task_id": "ccx-vuln-remed-169",
+      "benchmark": "csb_org_security",
+      "task_dir": "csb_org_security/ccx-vuln-remed-169"
+    }
+  ]
+}
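
The two manifests overlap only partly, which is easy to miss when reading them separately. The sketch below compares the task IDs listed in each pass with plain set operations. The helper name `compare_passes` is hypothetical, and the labels only describe membership in the two files; they do not by themselves say why a task was added or dropped.

```python
# Task IDs copied from the two rerun manifests in this commit.
PASS_1 = {
    "ccx-onboard-138",
    "ccx-vuln-remed-161",
    "ccx-vuln-remed-164",
    "ccx-vuln-remed-166",
    "ccx-vuln-remed-169",
}
PASS_2 = {
    "ccx-onboard-044",
    "ccx-vuln-remed-161",
    "ccx-vuln-remed-164",
    "ccx-vuln-remed-169",
}


def compare_passes(first, second):
    """Classify task IDs by which of the two passes lists them."""
    return {
        "repeated": sorted(first & second),       # listed in both passes
        "only_in_first": sorted(first - second),  # not rerun in the second pass
        "only_in_second": sorted(second - first), # first appears in the second pass
    }


for label, task_ids in compare_passes(PASS_1, PASS_2).items():
    print(f"{label}: {', '.join(task_ids)}")
```

Three security tasks appear in both passes, `ccx-onboard-138` and `ccx-vuln-remed-166` appear only in pass 1, and `ccx-onboard-044` appears only in pass 2.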

docs/official_results/README.md

Lines changed: 25 additions & 9 deletions
@@ -2,7 +2,7 @@
 
 This bundle is generated from `runs/official/` and includes only valid scored tasks (`passed`/`failed` with numeric reward).
 
-Generated: `2026-03-02T20:07:49.222149+00:00`
+Generated: `2026-03-02T21:30:49.001942+00:00`
 
 ## Local Browse
 
@@ -41,20 +41,20 @@ Historical reruns/backfills remain available in `data/official_results.json` und
 | [csb_org_incident](suites/csb_org_incident.md) | `mcp-remote-direct` | 85 | 85 | 0.613 | 0.953 | ok |
 | [csb_org_migration](suites/csb_org_migration.md) | `baseline-local-direct` | 26 | 85 | 0.325 | 0.846 | FLAG: below minimum |
 | [csb_org_migration](suites/csb_org_migration.md) | `mcp-remote-direct` | 85 | 85 | 0.452 | 0.835 | ok |
-| [csb_org_onboarding](suites/csb_org_onboarding.md) | `baseline-local-artifact` | 5 | 151 | 0.200 | 0.200 | FLAG: below minimum |
-| [csb_org_onboarding](suites/csb_org_onboarding.md) | `baseline-local-direct` | 28 | 151 | 0.673 | 0.893 | FLAG: below minimum |
-| [csb_org_onboarding](suites/csb_org_onboarding.md) | `mcp-remote-artifact` | 5 | 151 | 0.875 | 1.000 | FLAG: below minimum |
-| [csb_org_onboarding](suites/csb_org_onboarding.md) | `mcp-remote-direct` | 151 | 151 | 0.807 | 0.974 | ok |
+| [csb_org_onboarding](suites/csb_org_onboarding.md) | `baseline-local-artifact` | 5 | 155 | 0.200 | 0.200 | FLAG: below minimum |
+| [csb_org_onboarding](suites/csb_org_onboarding.md) | `baseline-local-direct` | 28 | 155 | 0.631 | 0.821 | FLAG: below minimum |
+| [csb_org_onboarding](suites/csb_org_onboarding.md) | `mcp-remote-artifact` | 5 | 155 | 0.875 | 1.000 | FLAG: below minimum |
+| [csb_org_onboarding](suites/csb_org_onboarding.md) | `mcp-remote-direct` | 155 | 155 | 0.801 | 0.974 | ok |
 | [csb_org_org](suites/csb_org_org.md) | `baseline-local-artifact` | 2 | 70 | 0.500 | 1.000 | FLAG: below minimum |
 | [csb_org_org](suites/csb_org_org.md) | `baseline-local-direct` | 20 | 70 | 0.343 | 0.950 | FLAG: below minimum |
 | [csb_org_org](suites/csb_org_org.md) | `mcp-remote-artifact` | 2 | 70 | 0.705 | 1.000 | FLAG: below minimum |
 | [csb_org_org](suites/csb_org_org.md) | `mcp-remote-direct` | 70 | 70 | 0.356 | 0.800 | ok |
 | [csb_org_platform](suites/csb_org_platform.md) | `baseline-local-direct` | 21 | 83 | 0.283 | 0.810 | FLAG: below minimum |
 | [csb_org_platform](suites/csb_org_platform.md) | `mcp-remote-direct` | 83 | 83 | 0.300 | 0.952 | ok |
-| [csb_org_security](suites/csb_org_security.md) | `baseline-local-artifact` | 25 | 78 | 0.283 | 0.720 | FLAG: below minimum |
-| [csb_org_security](suites/csb_org_security.md) | `baseline-local-direct` | 23 | 78 | 0.508 | 0.957 | FLAG: below minimum |
-| [csb_org_security](suites/csb_org_security.md) | `mcp-remote-artifact` | 26 | 78 | 0.563 | 1.000 | FLAG: below minimum |
-| [csb_org_security](suites/csb_org_security.md) | `mcp-remote-direct` | 78 | 78 | 0.636 | 0.987 | ok |
+| [csb_org_security](suites/csb_org_security.md) | `baseline-local-artifact` | 25 | 93 | 0.283 | 0.720 | FLAG: below minimum |
+| [csb_org_security](suites/csb_org_security.md) | `baseline-local-direct` | 24 | 93 | 0.486 | 0.875 | FLAG: below minimum |
+| [csb_org_security](suites/csb_org_security.md) | `mcp-remote-artifact` | 26 | 93 | 0.563 | 1.000 | FLAG: below minimum |
+| [csb_org_security](suites/csb_org_security.md) | `mcp-remote-direct` | 93 | 93 | 0.560 | 0.914 | ok |
 | [csb_sdlc_build](suites/csb_sdlc_build.md) | `baseline-local-direct` | 23 | 23 | 0.601 | 0.826 | ok |
 | [csb_sdlc_build](suites/csb_sdlc_build.md) | `mcp-remote-direct` | 20 | 23 | 0.592 | 0.800 | FLAG: below minimum |
 | [csb_sdlc_debug](suites/csb_sdlc_debug.md) | `baseline-local-direct` | 20 | 20 | 0.688 | 1.000 | ok |
@@ -326,6 +326,14 @@ Historical reruns/backfills remain available in `data/official_results.json` und
 | [csb_org_onboarding_haiku_20260302_183602](runs/csb_org_onboarding_haiku_20260302_183602.md) | `csb_org_onboarding` | `mcp-remote-direct` | 18 | 0.896 | 1.000 |
 | [csb_org_onboarding_haiku_20260302_183608](runs/csb_org_onboarding_haiku_20260302_183608.md) | `csb_org_onboarding` | `baseline-local-direct` | 18 | 0.792 | 0.889 |
 | [csb_org_onboarding_haiku_20260302_183608](runs/csb_org_onboarding_haiku_20260302_183608.md) | `csb_org_onboarding` | `mcp-remote-direct` | 18 | 0.917 | 1.000 |
+| [csb_org_onboarding_haiku_20260302_210829](runs/csb_org_onboarding_haiku_20260302_210829.md) | `csb_org_onboarding` | `baseline-local-direct` | 1 | 0.000 | 0.000 |
+| [csb_org_onboarding_haiku_20260302_210829](runs/csb_org_onboarding_haiku_20260302_210829.md) | `csb_org_onboarding` | `mcp-remote-direct` | 1 | 0.750 | 1.000 |
+| [csb_org_onboarding_haiku_20260302_210835](runs/csb_org_onboarding_haiku_20260302_210835.md) | `csb_org_onboarding` | `baseline-local-direct` | 1 | 0.000 | 0.000 |
+| [csb_org_onboarding_haiku_20260302_210835](runs/csb_org_onboarding_haiku_20260302_210835.md) | `csb_org_onboarding` | `mcp-remote-direct` | 1 | 0.500 | 1.000 |
+| [csb_org_onboarding_haiku_20260302_210842](runs/csb_org_onboarding_haiku_20260302_210842.md) | `csb_org_onboarding` | `baseline-local-direct` | 1 | 0.000 | 0.000 |
+| [csb_org_onboarding_haiku_20260302_210842](runs/csb_org_onboarding_haiku_20260302_210842.md) | `csb_org_onboarding` | `mcp-remote-direct` | 1 | 0.500 | 1.000 |
+| [csb_org_onboarding_haiku_20260302_212645](runs/csb_org_onboarding_haiku_20260302_212645.md) | `csb_org_onboarding` | `baseline-local-direct` | 1 | 0.000 | 0.000 |
+| [csb_org_onboarding_haiku_20260302_212645](runs/csb_org_onboarding_haiku_20260302_212645.md) | `csb_org_onboarding` | `mcp-remote-direct` | 1 | 0.432 | 1.000 |
 | [csb_org_org_haiku_20260224_181919](runs/csb_org_org_haiku_20260224_181919.md) | `csb_org_org` | `mcp-remote-artifact` | 2 | 0.705 | 1.000 |
 | [csb_org_org_haiku_20260225_011700](runs/csb_org_org_haiku_20260225_011700.md) | `csb_org_org` | `baseline-local-artifact` | 2 | 0.500 | 1.000 |
 | [csb_org_org_haiku_20260226_035617](runs/csb_org_org_haiku_20260226_035617.md) | `csb_org_org` | `mcp-remote-direct` | 3 | 0.503 | 1.000 |
@@ -414,6 +422,14 @@ Historical reruns/backfills remain available in `data/official_results.json` und
 | [csb_org_security_haiku_20260302_183602](runs/csb_org_security_haiku_20260302_183602.md) | `csb_org_security` | `mcp-remote-direct` | 6 | 0.697 | 0.833 |
 | [csb_org_security_haiku_20260302_183608](runs/csb_org_security_haiku_20260302_183608.md) | `csb_org_security` | `baseline-local-direct` | 6 | 0.588 | 0.833 |
 | [csb_org_security_haiku_20260302_183608](runs/csb_org_security_haiku_20260302_183608.md) | `csb_org_security` | `mcp-remote-direct` | 6 | 0.771 | 1.000 |
+| [csb_org_security_haiku_20260302_210829](runs/csb_org_security_haiku_20260302_210829.md) | `csb_org_security` | `baseline-local-direct` | 1 | 0.000 | 0.000 |
+| [csb_org_security_haiku_20260302_210829](runs/csb_org_security_haiku_20260302_210829.md) | `csb_org_security` | `mcp-remote-direct` | 4 | 0.119 | 0.500 |
+| [csb_org_security_haiku_20260302_210835](runs/csb_org_security_haiku_20260302_210835.md) | `csb_org_security` | `baseline-local-direct` | 1 | 0.000 | 0.000 |
+| [csb_org_security_haiku_20260302_210835](runs/csb_org_security_haiku_20260302_210835.md) | `csb_org_security` | `mcp-remote-direct` | 4 | 0.193 | 0.500 |
+| [csb_org_security_haiku_20260302_210842](runs/csb_org_security_haiku_20260302_210842.md) | `csb_org_security` | `baseline-local-direct` | 1 | 0.000 | 0.000 |
+| [csb_org_security_haiku_20260302_210842](runs/csb_org_security_haiku_20260302_210842.md) | `csb_org_security` | `mcp-remote-direct` | 4 | 0.200 | 0.500 |
+| [csb_org_security_haiku_20260302_212645](runs/csb_org_security_haiku_20260302_212645.md) | `csb_org_security` | `baseline-local-direct` | 3 | 0.231 | 0.667 |
+| [csb_org_security_haiku_20260302_212645](runs/csb_org_security_haiku_20260302_212645.md) | `csb_org_security` | `mcp-remote-direct` | 3 | 0.127 | 0.667 |
 | [csb_sdlc_build_haiku_20260227_025524](runs/csb_sdlc_build_haiku_20260227_025524.md) | `csb_sdlc_build` | `baseline-local-direct` | 3 | 0.513 | 1.000 |
 | [csb_sdlc_build_haiku_20260227_034711](runs/csb_sdlc_build_haiku_20260227_034711.md) | `csb_sdlc_build` | `baseline-local-direct` | 1 | 0.500 | 1.000 |
 | [csb_sdlc_build_haiku_20260227_123839](runs/csb_sdlc_build_haiku_20260227_123839.md) | `csb_sdlc_build` | `baseline-local-direct` | 8 | 0.641 | 1.000 |
