Description
When running compare_results.py against two run directories where the judge was enabled and quality scores are present in results.json, the quality section of the comparison report shows N/A for all metrics instead of the actual scores.
Steps to reproduce
- Run two full evaluations with judge.enabled: true in your config.
- Compare them: python scripts/compare_results.py results/run-A results/run-B
- Observe the Quality section of the output: all values show N/A.
Expected behaviour
The comparison should display the pairwise win rates, absolute scores, and per-category breakdowns from both runs so you can track quality changes between experiments.
Actual behaviour
The Quality section renders N/A for every metric, even when results.json contains complete judge output.
Environment
- Tested on the main branch as of 2026-05-22
- Python 3.14, Windows 11
- Both runs had judge.enabled: true and completed without errors
Notes
This is likely a key mismatch when the script reads the quality block from results.json. The evaluation and individual reporting pipelines (report.py) read quality data correctly — the issue appears to be specific to the comparison script.
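To illustrate the suspected failure mode, here is a minimal sketch of how a key mismatch would produce N/A even though the data is present. The key names ("quality", "judge_scores", "win_rate") and both helper functions are hypothetical; I have not confirmed the actual results.json schema or how compare_results.py reads it.

```python
# Hypothetical sketch: key names ("quality", "judge_scores", "win_rate") are
# assumptions for illustration, not the actual results.json schema.


def read_metric(results: dict, block_key: str, metric: str):
    """Return the metric from the named block, or "N/A" if any key is absent."""
    block = results.get(block_key)
    if not isinstance(block, dict):
        return "N/A"
    return block.get(metric, "N/A")


def diagnose(results: dict, expected_key: str) -> str:
    """Report the top-level keys when the expected block is missing."""
    if expected_key in results:
        return f"found '{expected_key}'"
    return f"missing '{expected_key}'; top-level keys: {sorted(results)}"


# Stand-in for a parsed results.json (invented data).
results = {"quality": {"win_rate": 0.5}}

print(read_metric(results, "quality", "win_rate"))       # matching key: value is found
print(read_metric(results, "judge_scores", "win_rate"))  # mismatched key: falls back to N/A
print(diagnose(results, "judge_scores"))
```

If the comparison script silently falls back to a default like this, logging the keys actually present in results.json (as diagnose does here) should make the mismatch easy to confirm.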