compare_results.py: quality metrics always return N/A when judge was enabled

## Description

When running `compare_results.py` against two run directories where the judge was enabled and quality scores are present in `results.json`, the quality section of the comparison report shows **N/A** for all metrics instead of the actual scores.

## Steps to reproduce

1. Run two full evaluations with `judge.enabled: true` in your config.
2. Compare them:
   ```bash
   python scripts/compare_results.py results/run-A results/run-B
   ```
3. Observe the **Quality** section of the output — all values show `N/A`.

## Expected behaviour

The comparison should display the pairwise win rates, absolute scores, and per-category breakdowns from both runs so you can track quality changes between experiments.

## Actual behaviour

The **Quality** section renders `N/A` for every metric, even when `results.json` contains complete judge output.

## Environment

- Tested on the `main` branch as of 2026-05-22
- Python 3.14, Windows 11
- Both runs had `judge.enabled: true` and completed without errors

## Notes

This is likely a key mismatch when the script reads the quality block from `results.json`. The evaluation and individual reporting pipelines (`report.py`) read quality data correctly — the issue appears to be specific to the comparison script.
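One way to test the key-mismatch hypothesis is to list where quality-related keys actually sit in each run's `results.json` and compare that with what `compare_results.py` looks up. The snippet below is only a diagnostic sketch: the helper names `find_key_paths` and `report_quality_keys` are hypothetical, and the substrings `quality` and `judge` are guesses at how the block might be named, not the real schema.

```python
import json
from pathlib import Path


def find_key_paths(node, needles=("quality", "judge"), prefix=""):
    """Return dotted paths of every dict key whose name contains a needle.

    The needles are assumptions about how the quality block is named;
    adjust them to match the real results.json layout.
    """
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            if any(needle in str(key).lower() for needle in needles):
                paths.append(path)
            # Keep descending so nested quality blocks are found too.
            paths.extend(find_key_paths(value, needles, path))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            paths.extend(find_key_paths(item, needles, f"{prefix}[{index}]"))
    return paths


def report_quality_keys(run_dir):
    """Load <run_dir>/results.json and list key paths that look quality-related."""
    with open(Path(run_dir) / "results.json", encoding="utf-8") as handle:
        data = json.load(handle)
    return find_key_paths(data)
```

Calling `report_quality_keys("results/run-A")` and `report_quality_keys("results/run-B")` should show the paths present in each file. If those paths differ from the key the comparison script reads, that would support the mismatch explanation; if they match, the cause is probably elsewhere.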
