Commit fa3a1c0

Normalize SDLC config labels in official results export
1 parent 83bb9c8 commit fa3a1c0

539 files changed, +26566 -49441 lines changed

README.md

Lines changed: 2 additions & 0 deletions
@@ -261,6 +261,8 @@ This writes:
261261
Suite summaries are deduplicated to the latest result per
262262
`suite + config + task_name`; full historical rows remain in
263263
`official_results.json` under `all_tasks`.
264+
For SDLC suites, export normalizes legacy config labels:
265+
`baseline` -> `baseline-local-direct`, `mcp` -> `mcp-remote-direct`.
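The label normalization described in the added lines can be sketched as follows. This is a minimal illustration, not the export script's actual code: the names `SDLC_SUITES`, `LEGACY_CONFIG_LABELS`, and `normalize_config` are hypothetical, and only the suite list and the two label mappings are taken from the documentation changed in this commit.

```python
# Hypothetical sketch: constant and function names are assumptions, not the repo's API.
# The suite list and the label mapping come from the docs changed in this commit.
SDLC_SUITES = frozenset({
    "ccb_build", "ccb_debug", "ccb_design", "ccb_document",
    "ccb_fix", "ccb_secure", "ccb_test", "ccb_understand",
})

LEGACY_CONFIG_LABELS = {
    "baseline": "baseline-local-direct",
    "mcp": "mcp-remote-direct",
}


def normalize_config(suite: str, config: str) -> str:
    """Map a legacy config label to its canonical form, for SDLC suites only."""
    if suite in SDLC_SUITES:
        # Labels that are already canonical (or unknown) pass through unchanged.
        return LEGACY_CONFIG_LABELS.get(config, config)
    # Non-SDLC suites keep whatever label they were recorded with.
    return config
```

Under these assumptions, `normalize_config("ccb_build", "baseline")` yields `baseline-local-direct`, while a non-SDLC suite such as `ccb_mcp_security` keeps a bare `mcp` label, which matches the summary table later in this diff.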
264266

265267
Serve locally:
266268

docs/OFFICIAL_RESULTS_BROWSER.md

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,12 @@ Suite-level views and top-level summaries are deduplicated to one canonical row
1818
per `suite + config + task_name` (latest by task `started_at`). Full historical
1919
rows are preserved in `data/official_results.json` as `all_tasks`.
2020

21+
For SDLC suites (`ccb_build`, `ccb_debug`, `ccb_design`, `ccb_document`,
22+
`ccb_fix`, `ccb_secure`, `ccb_test`, `ccb_understand`), legacy config labels
23+
are normalized during export:
24+
- `baseline` -> `baseline-local-direct`
25+
- `mcp` -> `mcp-remote-direct`
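Once legacy labels are normalized, rows from older runs share a `suite + config + task_name` key with newer runs, so the deduplication described above keeps only the latest one. A rough sketch of that step, assuming each row is a dict with `suite`, `config`, `task_name`, and an ISO 8601 `started_at` string; the function name `dedupe_latest` and the row shape are illustrative assumptions, not the exporter's real code.

```python
from datetime import datetime


def dedupe_latest(rows):
    """Keep one row per (suite, config, task_name): the one with the latest started_at.

    Hypothetical helper; the row shape is an assumption for illustration.
    """
    latest = {}
    for row in rows:
        key = (row["suite"], row["config"], row["task_name"])
        kept = latest.get(key)
        if kept is None or (
            datetime.fromisoformat(row["started_at"])
            > datetime.fromisoformat(kept["started_at"])
        ):
            latest[key] = row
    return list(latest.values())


# Illustrative rows only: the timestamps are made up for the example.
rows = [
    {"suite": "ccb_build", "config": "baseline-local-direct",
     "task_name": "bustub-hyperloglog-impl-001",
     "run_dir": "ccb_build_haiku_022326",
     "started_at": "2026-02-23T12:00:00+00:00"},
    {"suite": "ccb_build", "config": "baseline-local-direct",
     "task_name": "bustub-hyperloglog-impl-001",
     "run_dir": "ccb_build_haiku_20260225_234223",
     "started_at": "2026-02-25T23:42:23+00:00"},
]
canonical = dedupe_latest(rows)
```

With the rows above, `canonical` holds a single entry, the one from the later run. Before normalization the first row would have carried the `baseline` label and survived as a separate summary line, which is why the duplicate `baseline` and `mcp` rows disappear from the suite table in this commit.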
26+
2127
## Usage
2228

2329
```bash

docs/official_results/README.md

Lines changed: 17 additions & 33 deletions
@@ -2,7 +2,7 @@
22

33
This bundle is generated from `runs/official/` and includes only valid scored tasks (`passed`/`failed` with numeric reward).
44

5-
Generated: `2026-02-27T02:23:03.814992+00:00`
5+
Generated: `2026-02-27T02:26:00.850511+00:00`
66

77
## Local Browse
88

@@ -17,25 +17,15 @@ Historical reruns/backfills remain available in `data/official_results.json` und
1717

1818
| Suite | Config | Valid Tasks | Mean Reward | Pass Rate |
1919
|---|---|---:|---:|---:|
20-
| [ccb_build](suites/ccb_build.md) | `baseline` | 19 | 0.511 | 0.789 |
2120
| [ccb_build](suites/ccb_build.md) | `baseline-local-direct` | 20 | 0.527 | 0.800 |
22-
| [ccb_build](suites/ccb_build.md) | `mcp` | 25 | 0.372 | 0.640 |
2321
| [ccb_build](suites/ccb_build.md) | `mcp-remote-direct` | 25 | 0.372 | 0.640 |
24-
| [ccb_debug](suites/ccb_debug.md) | `baseline` | 20 | 0.670 | 1.000 |
2522
| [ccb_debug](suites/ccb_debug.md) | `baseline-local-direct` | 20 | 0.670 | 1.000 |
26-
| [ccb_debug](suites/ccb_debug.md) | `mcp` | 20 | 0.487 | 0.600 |
2723
| [ccb_debug](suites/ccb_debug.md) | `mcp-remote-direct` | 20 | 0.487 | 0.600 |
28-
| [ccb_design](suites/ccb_design.md) | `baseline` | 13 | 0.770 | 1.000 |
2924
| [ccb_design](suites/ccb_design.md) | `baseline-local-direct` | 20 | 0.753 | 0.950 |
30-
| [ccb_design](suites/ccb_design.md) | `mcp` | 20 | 0.718 | 1.000 |
3125
| [ccb_design](suites/ccb_design.md) | `mcp-remote-direct` | 20 | 0.718 | 1.000 |
32-
| [ccb_document](suites/ccb_document.md) | `baseline` | 14 | 0.904 | 1.000 |
3326
| [ccb_document](suites/ccb_document.md) | `baseline-local-direct` | 20 | 0.847 | 1.000 |
34-
| [ccb_document](suites/ccb_document.md) | `mcp` | 15 | 0.953 | 1.000 |
3527
| [ccb_document](suites/ccb_document.md) | `mcp-remote-direct` | 25 | 0.802 | 1.000 |
36-
| [ccb_fix](suites/ccb_fix.md) | `baseline` | 17 | 0.535 | 0.706 |
3728
| [ccb_fix](suites/ccb_fix.md) | `baseline-local-direct` | 28 | 0.428 | 0.571 |
38-
| [ccb_fix](suites/ccb_fix.md) | `mcp` | 17 | 0.538 | 0.647 |
3929
| [ccb_fix](suites/ccb_fix.md) | `mcp-remote-direct` | 28 | 0.467 | 0.571 |
4030
| [ccb_mcp_compliance](suites/ccb_mcp_compliance.md) | `baseline-local-artifact` | 1 | 0.375 | 1.000 |
4131
| [ccb_mcp_compliance](suites/ccb_mcp_compliance.md) | `baseline-local-direct` | 6 | 0.668 | 1.000 |
@@ -85,17 +75,11 @@ Historical reruns/backfills remain available in `data/official_results.json` und
8575
| [ccb_mcp_security](suites/ccb_mcp_security.md) | `mcp` | 2 | 0.821 | 1.000 |
8676
| [ccb_mcp_security](suites/ccb_mcp_security.md) | `mcp-remote-artifact` | 4 | 0.777 | 1.000 |
8777
| [ccb_mcp_security](suites/ccb_mcp_security.md) | `mcp-remote-direct` | 16 | 0.705 | 1.000 |
88-
| [ccb_secure](suites/ccb_secure.md) | `baseline` | 18 | 0.688 | 0.944 |
8978
| [ccb_secure](suites/ccb_secure.md) | `baseline-local-direct` | 20 | 0.669 | 0.950 |
90-
| [ccb_secure](suites/ccb_secure.md) | `mcp` | 18 | 0.705 | 1.000 |
9179
| [ccb_secure](suites/ccb_secure.md) | `mcp-remote-direct` | 22 | 0.645 | 0.909 |
92-
| [ccb_test](suites/ccb_test.md) | `baseline` | 9 | 0.472 | 0.778 |
9380
| [ccb_test](suites/ccb_test.md) | `baseline-local-direct` | 20 | 0.480 | 0.750 |
94-
| [ccb_test](suites/ccb_test.md) | `mcp` | 8 | 0.555 | 0.625 |
9581
| [ccb_test](suites/ccb_test.md) | `mcp-remote-direct` | 31 | 0.403 | 0.613 |
96-
| [ccb_understand](suites/ccb_understand.md) | `baseline` | 13 | 0.592 | 0.692 |
9782
| [ccb_understand](suites/ccb_understand.md) | `baseline-local-direct` | 20 | 0.660 | 0.800 |
98-
| [ccb_understand](suites/ccb_understand.md) | `mcp` | 13 | 0.841 | 1.000 |
9983
| [ccb_understand](suites/ccb_understand.md) | `mcp-remote-direct` | 20 | 0.851 | 1.000 |
10084

10185
<details>
@@ -106,23 +90,23 @@ Historical reruns/backfills remain available in `data/official_results.json` und
10690
|---|---|---|---:|---:|---:|
10791
| [build_haiku_20260223_124805](runs/build_haiku_20260223_124805.md) | `ccb_build` | `baseline-local-direct` | 19 | 0.511 | 0.789 |
10892
| [build_haiku_20260223_124805](runs/build_haiku_20260223_124805.md) | `ccb_build` | `mcp-remote-direct` | 25 | 0.372 | 0.640 |
109-
| [ccb_build_haiku_022326](runs/ccb_build_haiku_022326.md) | `ccb_build` | `baseline` | 19 | 0.511 | 0.789 |
110-
| [ccb_build_haiku_022326](runs/ccb_build_haiku_022326.md) | `ccb_build` | `mcp` | 25 | 0.372 | 0.640 |
93+
| [ccb_build_haiku_022326](runs/ccb_build_haiku_022326.md) | `ccb_build` | `baseline-local-direct` | 19 | 0.511 | 0.789 |
94+
| [ccb_build_haiku_022326](runs/ccb_build_haiku_022326.md) | `ccb_build` | `mcp-remote-direct` | 25 | 0.372 | 0.640 |
11195
| [ccb_build_haiku_20260225_234223](runs/ccb_build_haiku_20260225_234223.md) | `ccb_build` | `baseline-local-direct` | 1 | 0.820 | 1.000 |
11296
| [ccb_build_haiku_20260226_015500_backfill](runs/ccb_build_haiku_20260226_015500_backfill.md) | `ccb_build` | `baseline-local-direct` | 1 | 0.820 | 1.000 |
113-
| [ccb_debug_haiku_022326](runs/ccb_debug_haiku_022326.md) | `ccb_debug` | `baseline` | 20 | 0.670 | 1.000 |
114-
| [ccb_debug_haiku_022326](runs/ccb_debug_haiku_022326.md) | `ccb_debug` | `mcp` | 20 | 0.487 | 0.600 |
115-
| [ccb_design_haiku_022326](runs/ccb_design_haiku_022326.md) | `ccb_design` | `baseline` | 13 | 0.770 | 1.000 |
116-
| [ccb_design_haiku_022326](runs/ccb_design_haiku_022326.md) | `ccb_design` | `mcp` | 20 | 0.718 | 1.000 |
97+
| [ccb_debug_haiku_022326](runs/ccb_debug_haiku_022326.md) | `ccb_debug` | `baseline-local-direct` | 20 | 0.670 | 1.000 |
98+
| [ccb_debug_haiku_022326](runs/ccb_debug_haiku_022326.md) | `ccb_debug` | `mcp-remote-direct` | 20 | 0.487 | 0.600 |
99+
| [ccb_design_haiku_022326](runs/ccb_design_haiku_022326.md) | `ccb_design` | `baseline-local-direct` | 13 | 0.770 | 1.000 |
100+
| [ccb_design_haiku_022326](runs/ccb_design_haiku_022326.md) | `ccb_design` | `mcp-remote-direct` | 20 | 0.718 | 1.000 |
117101
| [ccb_design_haiku_20260225_234223](runs/ccb_design_haiku_20260225_234223.md) | `ccb_design` | `baseline-local-direct` | 7 | 0.723 | 0.857 |
118102
| [ccb_design_haiku_20260226_015500_backfill](runs/ccb_design_haiku_20260226_015500_backfill.md) | `ccb_design` | `baseline-local-direct` | 7 | 0.723 | 0.857 |
119-
| [ccb_document_haiku_022326](runs/ccb_document_haiku_022326.md) | `ccb_document` | `baseline` | 14 | 0.904 | 1.000 |
120-
| [ccb_document_haiku_022326](runs/ccb_document_haiku_022326.md) | `ccb_document` | `mcp` | 15 | 0.953 | 1.000 |
103+
| [ccb_document_haiku_022326](runs/ccb_document_haiku_022326.md) | `ccb_document` | `baseline-local-direct` | 14 | 0.904 | 1.000 |
104+
| [ccb_document_haiku_022326](runs/ccb_document_haiku_022326.md) | `ccb_document` | `mcp-remote-direct` | 15 | 0.953 | 1.000 |
121105
| [ccb_document_haiku_20260224_174311](runs/ccb_document_haiku_20260224_174311.md) | `ccb_document` | `baseline-local-direct` | 5 | 0.658 | 1.000 |
122106
| [ccb_document_haiku_20260224_174311](runs/ccb_document_haiku_20260224_174311.md) | `ccb_document` | `mcp-remote-direct` | 5 | 0.720 | 1.000 |
123107
| [ccb_document_haiku_20260226_015500_backfill](runs/ccb_document_haiku_20260226_015500_backfill.md) | `ccb_document` | `baseline-local-direct` | 1 | 1.000 | 1.000 |
124-
| [ccb_fix_haiku_022326](runs/ccb_fix_haiku_022326.md) | `ccb_fix` | `baseline` | 17 | 0.535 | 0.706 |
125-
| [ccb_fix_haiku_022326](runs/ccb_fix_haiku_022326.md) | `ccb_fix` | `mcp` | 17 | 0.538 | 0.647 |
108+
| [ccb_fix_haiku_022326](runs/ccb_fix_haiku_022326.md) | `ccb_fix` | `baseline-local-direct` | 17 | 0.535 | 0.706 |
109+
| [ccb_fix_haiku_022326](runs/ccb_fix_haiku_022326.md) | `ccb_fix` | `mcp-remote-direct` | 17 | 0.538 | 0.647 |
126110
| [ccb_fix_haiku_20260224_203138](runs/ccb_fix_haiku_20260224_203138.md) | `ccb_fix` | `baseline-local-direct` | 1 | 0.710 | 1.000 |
127111
| [ccb_fix_haiku_20260224_203138](runs/ccb_fix_haiku_20260224_203138.md) | `ccb_fix` | `mcp-remote-direct` | 1 | 0.740 | 1.000 |
128112
| [ccb_fix_haiku_20260226_015500_backfill](runs/ccb_fix_haiku_20260226_015500_backfill.md) | `ccb_fix` | `baseline-local-direct` | 2 | 0.235 | 0.500 |
@@ -254,18 +238,18 @@ Historical reruns/backfills remain available in `data/official_results.json` und
254238
| [ccb_mcp_security_haiku_20260226_035633_variance](runs/ccb_mcp_security_haiku_20260226_035633_variance.md) | `ccb_mcp_security` | `baseline-local-direct` | 1 | 0.586 | 1.000 |
255239
| [ccb_mcp_security_haiku_20260226_035633_variance](runs/ccb_mcp_security_haiku_20260226_035633_variance.md) | `ccb_mcp_security` | `mcp-remote-direct` | 4 | 0.731 | 1.000 |
256240
| [ccb_mcp_security_haiku_20260226_205845](runs/ccb_mcp_security_haiku_20260226_205845.md) | `ccb_mcp_security` | `baseline-local-direct` | 3 | 0.682 | 1.000 |
257-
| [ccb_secure_haiku_022326](runs/ccb_secure_haiku_022326.md) | `ccb_secure` | `baseline` | 18 | 0.688 | 0.944 |
258-
| [ccb_secure_haiku_022326](runs/ccb_secure_haiku_022326.md) | `ccb_secure` | `mcp` | 18 | 0.705 | 1.000 |
241+
| [ccb_secure_haiku_022326](runs/ccb_secure_haiku_022326.md) | `ccb_secure` | `baseline-local-direct` | 18 | 0.688 | 0.944 |
242+
| [ccb_secure_haiku_022326](runs/ccb_secure_haiku_022326.md) | `ccb_secure` | `mcp-remote-direct` | 18 | 0.705 | 1.000 |
259243
| [ccb_secure_haiku_20260224_213146](runs/ccb_secure_haiku_20260224_213146.md) | `ccb_secure` | `baseline-local-direct` | 2 | 0.500 | 1.000 |
260244
| [ccb_secure_haiku_20260224_213146](runs/ccb_secure_haiku_20260224_213146.md) | `ccb_secure` | `mcp-remote-direct` | 2 | 0.250 | 0.500 |
261-
| [ccb_test_haiku_022326](runs/ccb_test_haiku_022326.md) | `ccb_test` | `baseline` | 9 | 0.472 | 0.778 |
262-
| [ccb_test_haiku_022326](runs/ccb_test_haiku_022326.md) | `ccb_test` | `mcp` | 8 | 0.555 | 0.625 |
245+
| [ccb_test_haiku_022326](runs/ccb_test_haiku_022326.md) | `ccb_test` | `baseline-local-direct` | 9 | 0.472 | 0.778 |
246+
| [ccb_test_haiku_022326](runs/ccb_test_haiku_022326.md) | `ccb_test` | `mcp-remote-direct` | 8 | 0.555 | 0.625 |
263247
| [ccb_test_haiku_20260224_180149](runs/ccb_test_haiku_20260224_180149.md) | `ccb_test` | `baseline-local-direct` | 11 | 0.486 | 0.727 |
264248
| [ccb_test_haiku_20260224_180149](runs/ccb_test_haiku_20260224_180149.md) | `ccb_test` | `mcp-remote-direct` | 11 | 0.387 | 0.727 |
265249
| [ccb_test_haiku_20260226_015500_backfill](runs/ccb_test_haiku_20260226_015500_backfill.md) | `ccb_test` | `baseline-local-direct` | 1 | 0.370 | 1.000 |
266250
| [ccb_test_haiku_20260226_015500_backfill](runs/ccb_test_haiku_20260226_015500_backfill.md) | `ccb_test` | `mcp-remote-direct` | 1 | 0.900 | 1.000 |
267-
| [ccb_understand_haiku_022426](runs/ccb_understand_haiku_022426.md) | `ccb_understand` | `baseline` | 13 | 0.592 | 0.692 |
268-
| [ccb_understand_haiku_022426](runs/ccb_understand_haiku_022426.md) | `ccb_understand` | `mcp` | 13 | 0.841 | 1.000 |
251+
| [ccb_understand_haiku_022426](runs/ccb_understand_haiku_022426.md) | `ccb_understand` | `baseline-local-direct` | 13 | 0.592 | 0.692 |
252+
| [ccb_understand_haiku_022426](runs/ccb_understand_haiku_022426.md) | `ccb_understand` | `mcp-remote-direct` | 13 | 0.841 | 1.000 |
269253
| [debug_haiku_20260223_154724](runs/debug_haiku_20260223_154724.md) | `ccb_debug` | `baseline-local-direct` | 20 | 0.670 | 1.000 |
270254
| [debug_haiku_20260223_154724](runs/debug_haiku_20260223_154724.md) | `ccb_debug` | `mcp-remote-direct` | 20 | 0.487 | 0.600 |
271255
| [design_haiku_20260223_124652](runs/design_haiku_20260223_124652.md) | `ccb_design` | `baseline-local-direct` | 13 | 0.770 | 1.000 |

docs/official_results/audits/ccb_build_haiku_022326--baseline--bustub-hyperloglog-impl-001.json

Lines changed: 1 addition & 1 deletion
@@ -1545,7 +1545,7 @@
15451545
"trajectory_sha256": "abbebfc8576932ca312f951ee3a35f89e0f496fe4d9f73a0d6e0636ed1bf90e3",
15461546
"transcript_sha256": "77f0ce290b2a4792e74680f6660294911f9f1344620ec45ca3bd48b8d679b875"
15471547
},
1548-
"config": "baseline",
1548+
"config": "baseline-local-direct",
15491549
"result_path": "runs/official/ccb_build_haiku_022326/baseline/ccb_build_bustub-hyperloglog-impl-001_baseline-local-direct/bustub-hyperloglog-impl-001__3icLMXy/result.json",
15501550
"run_dir": "ccb_build_haiku_022326",
15511551
"suite": "ccb_build",

docs/official_results/audits/ccb_build_haiku_022326--baseline--cgen-deps-install-001.json

Lines changed: 1 addition & 1 deletion
@@ -330,7 +330,7 @@
330330
"trajectory_sha256": "ff3384c500d18b0583923cce59ff33eca8ebccb6bf5f8036891dd4541220aeb8",
331331
"transcript_sha256": "d2ae71534ecb860c1ea1d630e24ec2ae205a5855c943bf5be7f27a0ad0203c2c"
332332
},
333-
"config": "baseline",
333+
"config": "baseline-local-direct",
334334
"result_path": "runs/official/ccb_build_haiku_022326/baseline/ccb_build_cgen-deps-install-001_baseline-local-direct/cgen-deps-install-001__YJMCyCc/result.json",
335335
"run_dir": "ccb_build_haiku_022326",
336336
"suite": "ccb_build",

docs/official_results/audits/ccb_build_haiku_022326--baseline--codecoverage-deps-install-001.json

Lines changed: 1 addition & 1 deletion
@@ -186,7 +186,7 @@
186186
"trajectory_sha256": "093c0b5eeda449343e256407e77fda3b6730bf23e64f387c37ed01d48f751e44",
187187
"transcript_sha256": "01400ee26a61ee41337b4f8f626360d621a46793c314f2e7d3c95cae5b6f8004"
188188
},
189-
"config": "baseline",
189+
"config": "baseline-local-direct",
190190
"result_path": "runs/official/ccb_build_haiku_022326/baseline/ccb_build_codecoverage-deps-install-001_baseline-local-direct/codecoverage-deps-install-001__kBsG3jC/result.json",
191191
"run_dir": "ccb_build_haiku_022326",
192192
"suite": "ccb_build",

docs/official_results/audits/ccb_build_haiku_022326--baseline--dotenv-expand-deps-install-001.json

Lines changed: 1 addition & 1 deletion
@@ -222,7 +222,7 @@
222222
"trajectory_sha256": "36520034216371213010f3b839a46a551e0409962f04d7d711061d508e0e0152",
223223
"transcript_sha256": "2dad31fb9dfc8a8897b1aad1cff1018cb9e19fefe54cf2a40631629855aac208"
224224
},
225-
"config": "baseline",
225+
"config": "baseline-local-direct",
226226
"result_path": "runs/official/ccb_build_haiku_022326/baseline/ccb_build_dotenv-expand-deps-install-001_baseline-local-direct/dotenv-expand-deps-install-001__QRwDqLP/result.json",
227227
"run_dir": "ccb_build_haiku_022326",
228228
"suite": "ccb_build",

docs/official_results/audits/ccb_build_haiku_022326--baseline--dotnetkoans-deps-install-001.json

Lines changed: 1 addition & 1 deletion
@@ -609,7 +609,7 @@
609609
"trajectory_sha256": "bc4dc3e8f53a378180991ec401a0ee083c1d450c5a240585970ceb0709d6a63a",
610610
"transcript_sha256": "9425c05527d987f77110c66e2f7937a343ef4717fc22fd80e2ac55981f7a8bd9"
611611
},
612-
"config": "baseline",
612+
"config": "baseline-local-direct",
613613
"result_path": "runs/official/ccb_build_haiku_022326/baseline/ccb_build_dotnetkoans-deps-install-001_baseline-local-direct/dotnetkoans-deps-install-001__KDVfh8k/result.json",
614614
"run_dir": "ccb_build_haiku_022326",
615615
"suite": "ccb_build",

docs/official_results/audits/ccb_build_haiku_022326--baseline--envoy-grpc-server-impl-001.json

Lines changed: 1 addition & 1 deletion
@@ -182,7 +182,7 @@
182182
"trajectory_sha256": "bc95e9df3b586c376ae26ecf5a6ff3b2210b4a8bbef4fec7b5a50b743040515c",
183183
"transcript_sha256": "7abc87f22a578c868e80a96954a2ce40aa1f6aeab3665675e46f632903c8b015"
184184
},
185-
"config": "baseline",
185+
"config": "baseline-local-direct",
186186
"result_path": "runs/official/ccb_build_haiku_022326/baseline/ccb_build_envoy-grpc-server-impl-001_baseline-local-direct/envoy-grpc-server-impl-001__wJGXPrq/result.json",
187187
"run_dir": "ccb_build_haiku_022326",
188188
"suite": "ccb_build",

docs/official_results/audits/ccb_build_haiku_022326--baseline--eslint-markdown-deps-install-001.json

Lines changed: 1 addition & 1 deletion
@@ -331,7 +331,7 @@
331331
"trajectory_sha256": "a9c0c99a23acbe82c5f789945b7c6c76d58ef2995fc7ad0d8a2891506026d131",
332332
"transcript_sha256": "702c41d505ee09d3ceb307f959b02549303208766e942061f3a6b8a70963a37d"
333333
},
334-
"config": "baseline",
334+
"config": "baseline-local-direct",
335335
"result_path": "runs/official/ccb_build_haiku_022326/baseline/ccb_build_eslint-markdown-deps-install-001_baseline-local-direct/eslint-markdown-deps-install-001__St58aVB/result.json",
336336
"run_dir": "ccb_build_haiku_022326",
337337
"suite": "ccb_build",
