compare_results.py: quality metrics always return N/A when judge was enabled #1

@jpsantoscosta

Description

When running compare_results.py against two run directories where the judge was enabled and quality scores are present in results.json, the quality section of the comparison report shows N/A for all metrics instead of the actual scores.

Steps to reproduce

  1. Run two full evaluations with judge.enabled: true in your config.
  2. Compare them:
    python scripts/compare_results.py results/run-A results/run-B
  3. Observe the Quality section of the output — all values show N/A.

Expected behaviour

The comparison should display the pairwise win rates, absolute scores, and per-category breakdowns from both runs so you can track quality changes between experiments.

Actual behaviour

Quality section renders N/A for every metric even when results.json contains complete judge output.

Environment

  • Tested on the main branch as of 2026-05-22
  • Python 3.14, Windows 11
  • Both runs had judge.enabled: true and completed without errors

Notes

This is likely a key mismatch when the script reads the quality block from results.json. The evaluation and individual reporting pipelines (report.py) read quality data correctly — the issue appears to be specific to the comparison script.
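
To confirm the suspected key mismatch, it may help to list the key paths actually present in each run's results.json and compare them with the keys compare_results.py looks up. The sketch below is only a diagnostic aid, not code from the repository: the helper names are hypothetical, and it assumes the judge output sits under a top-level "quality" key, which may not match the real schema.

```python
import json
from pathlib import Path


def key_paths(node, prefix=""):
    """Yield a dotted path for every key in a nested dict."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from key_paths(value, path)


def quality_key_paths(run_dir, block="quality"):
    """Return sorted key paths under `block` in <run_dir>/results.json.

    The default block name "quality" is an assumption about the schema.
    An empty list means the block is missing or has no keys.
    """
    with open(Path(run_dir) / "results.json", encoding="utf-8") as fh:
        results = json.load(fh)
    return sorted(key_paths(results.get(block, {})))


if __name__ == "__main__":
    # Run directories taken from the reproduction steps above.
    for run_dir in ("results/run-A", "results/run-B"):
        try:
            print(run_dir, quality_key_paths(run_dir))
        except FileNotFoundError:
            print(run_dir, "results.json not found")
```

If the printed paths differ from the names the comparison script reads, that would point to the mismatch; if the list is empty, the block name itself is probably different.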
