Commit e0d5be6

Add GitHub-auditable official results and trace export
1 parent 801ef23 commit e0d5be6

1,053 files changed, +486704 -8 lines changed

README.md

Lines changed: 25 additions & 0 deletions
@@ -238,6 +238,31 @@ The report generator produces:

See `python3 scripts/generate_eval_report.py --help` for all options.

### Publishable Official Results + Trace Browser

To export GitHub-friendly official results (valid scored tasks only) with parsed
trace summaries and local browsing UI:

```bash
python3 scripts/export_official_results.py \
  --runs-dir ./runs/official/ \
  --output-dir ./docs/official_results/
```

This writes:
- `docs/official_results/README.md` -- run/config score summary
- `docs/official_results/runs/*.md` -- per-run task tables
- `docs/official_results/tasks/*.md` -- per-task metrics + parsed tool/trace view
- `docs/official_results/data/official_results.json` -- machine-readable dataset
- `docs/official_results/audits/*.json` -- per-task audit artifacts (checksums + parsed trace events)
- `docs/official_results/index.html` -- interactive local browser
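
The audit artifacts listed above are described as carrying checksums. The diff does not show the audit JSON schema, so the following is only a minimal sketch of how a reader might recompute a SHA-256 digest for a file and compare it with a value copied from an audit artifact. The helper names `sha256_of_file` and `matches_checksum` are hypothetical and are not part of the exporter.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string (case-insensitive).

    Where the expected value comes from is an assumption: the audit JSON
    layout is not shown in this commit, so the caller supplies the string.
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat for large trace files; for small files, `hashlib.sha256(path.read_bytes()).hexdigest()` gives the same digest.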

Serve locally:

```bash
python3 scripts/export_official_results.py --serve
```

For the full multi-layer evaluation pipeline (verifier, LLM judge, statistical analysis, dual-score reporting), see [docs/EVALUATION_PIPELINE.md](docs/EVALUATION_PIPELINE.md).

---

docs/OFFICIAL_RESULTS_BROWSER.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
# Official Results Browser

Use this workflow to publish valid official scores with easy-to-view parsed traces.

## What It Exports

`python3 scripts/export_official_results.py` scans `runs/official/` and exports only valid scored tasks (status `passed`/`failed` with numeric reward) into a static bundle:

- `docs/official_results/README.md` - run/config score summary
- `docs/official_results/runs/*.md` - per-run task tables
- `docs/official_results/tasks/*.md` - per-task metrics and parsed trace/tool summaries
- `docs/official_results/data/official_results.json` - machine-readable data
- `docs/official_results/audits/*.json` - per-task audit payloads with trace parsing and SHA256 checksums
- `docs/official_results/index.html` - local interactive browser
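
The "valid scored task" rule stated above (status `passed` or `failed`, with a numeric reward) can be illustrated with a small predicate. This is a sketch under assumptions: the field names `status` and `reward` and the function `is_valid_scored_task` are hypothetical, and the exporter's actual record layout is not visible in this diff.

```python
from numbers import Real

# Statuses named in the documentation as eligible for export.
VALID_STATUSES = ("passed", "failed")


def is_valid_scored_task(task: dict) -> bool:
    """Return True when a task record has an eligible status and a numeric reward.

    Assumes a record shaped like {"status": ..., "reward": ...}.
    """
    status = task.get("status")
    reward = task.get("reward")
    # bool is a subclass of int, so exclude it to avoid treating True as 1.
    is_numeric = isinstance(reward, Real) and not isinstance(reward, bool)
    return status in VALID_STATUSES and is_numeric
```

For example, `[t for t in tasks if is_valid_scored_task(t)]` would drop records with a missing reward or any other status.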

## Usage

```bash
python3 scripts/export_official_results.py \
  --runs-dir ./runs/official/ \
  --output-dir ./docs/official_results/
```

Filter to specific run(s):

```bash
python3 scripts/export_official_results.py \
  --run ccb_mcp_compliance_haiku_20260226_205845 \
  --run ccb_mcp_domain_haiku_20260226_205845
```
```
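
The command above passes `--run` more than once. The script's source is not shown in this view, so the following is only a sketch of one common way a repeatable flag can be collected with `argparse` and applied as a filter. `build_parser` and `select_runs` are hypothetical names, and the real exporter may parse its arguments differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a repeatable --run option (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of a repeatable --run filter")
    parser.add_argument(
        "--run",
        action="append",  # each occurrence appends to a list
        default=None,     # stays None when the flag is never given
        metavar="RUN_NAME",
        help="limit the export to this run directory name (repeatable)",
    )
    return parser


def select_runs(all_runs, requested):
    """Keep only requested runs, preserving the original order of all_runs."""
    if not requested:
        return list(all_runs)
    wanted = set(requested)
    return [name for name in all_runs if name in wanted]
```

With this shape, omitting `--run` selects every run, which matches the unfiltered invocation shown earlier.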

Serve locally after export:

```bash
python3 scripts/export_official_results.py --serve
```

## Notes

- The exporter prefers `task_metrics.json` when present and falls back to transcript parsing for tool-call extraction.
- Task pages link to bundled `audits/*.json` so GitHub viewers can audit without local `runs/official/`.
- If `runs/official/MANIFEST.json` exists, export is automatically scoped to run directories tracked in the manifest.
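
The first note describes a prefer-then-fall-back lookup. A minimal sketch of that pattern follows, assuming `task_metrics.json` sits directly in a task directory and that transcript parsing is supplied by the caller. The function `load_task_metrics` and the `parse_transcript` callback are hypothetical; neither the transcript format nor the exporter's real function names appear in this diff.

```python
import json
from pathlib import Path
from typing import Callable


def load_task_metrics(task_dir: Path, parse_transcript: Callable[[Path], dict]) -> dict:
    """Prefer task_metrics.json; otherwise fall back to a transcript parser.

    parse_transcript is a caller-provided function that receives the task
    directory and returns a metrics dict (its behavior is an assumption).
    """
    metrics_path = task_dir / "task_metrics.json"
    if metrics_path.is_file():
        with metrics_path.open(encoding="utf-8") as handle:
            return json.load(handle)
    # No precomputed metrics: derive them from the transcript instead.
    return parse_transcript(task_dir)
```

Passing the parser in as a callback keeps the fallback path swappable and easy to test without a real transcript on disk.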

docs/official_results/README.md

Lines changed: 200 additions & 7 deletions
Large diffs are not rendered by default.
