[fix] Render numeric/discrete evaluator metrics as values, not raw stats#4586

Open
mmabrouk wants to merge 1 commit into main from fix/eval-discrete-metric-display

@mmabrouk
Member

@mmabrouk mmabrouk commented Jun 8, 2026

Context

Running an evaluation whose evaluator returns integer counts (an LLM judge that reports passed and total, e.g. passed=2 / total=4) showed those columns in the scenarios table as a raw stats blob:

{"type":"numeric/discrete","count":1,"max":4,"min":4,"sum":4,"mean":4,...}

The same fields in the testcase drawer rendered passed as false instead of the count. Two separate display paths each mishandled the numeric/discrete metric type. score (numeric/continuous) and success (binary) rendered fine, which is why only total and passed looked wrong.

Changes

Two fixes.

1. Table cell showed the raw stats object. unwrapStatsForCompare reduces a per-scenario stats object to a scalar, but only handled binary and numeric/continuous. A numeric/discrete value fell through unchanged and got JSON-stringified into the cell. Added the numeric/discrete case so it reduces to mean/sum like the continuous one. This also fixes filtering on discrete metrics, which previously compared the predicate against the raw object.

Before (table cell): {"type":"numeric/discrete","count":1,"mean":4,...}
After: 4
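
A minimal sketch of the unwrapping described in fix 1, written against the stats blob shown above. It is not the actual body of unwrapStatsForCompare: the name unwrapStatsSketch, the isRecord helper, and the mean-then-sum fallback order are assumptions made for illustration, and only the fields visible in the blob are relied on.

```typescript
// Hypothetical helper: true for plain objects, false for arrays and null.
function isRecord(value: unknown): value is Record<string, unknown> {
    return typeof value === "object" && value !== null && !Array.isArray(value)
}

// Sketch only. Reduces a per-scenario numeric stats object to a scalar.
// Anything that is not a numeric stats object is returned unchanged.
function unwrapStatsSketch(value: unknown): unknown {
    if (!isRecord(value)) return value

    const type = value.type
    // Before the fix, "numeric/discrete" was missing from this check,
    // so discrete stats fell through and were stringified as-is.
    const isNumericStats = type === "numeric/continuous" || type === "numeric/discrete"
    if (!isNumericStats) return value

    // Assumed fallback order: prefer mean, then sum.
    for (const key of ["mean", "sum"]) {
        const candidate = value[key]
        if (typeof candidate === "number" && isFinite(candidate)) return candidate
    }
    return value
}

const discreteStats = {type: "numeric/discrete", count: 1, max: 4, min: 4, sum: 4, mean: 4}
const cellValue = unwrapStatsSketch(discreteStats)

// A filter predicate such as "total >= 3" can now compare against a number
// instead of the raw object.
const passesFilter = typeof cellValue === "number" && cellValue >= 3
console.log(cellValue, passesFilter)
```

With a single scenario (count of 1), mean and sum coincide, so either field yields the value the evaluator returned.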

2. Drawer coerced the count to a boolean. The drawer coerces a metric to a boolean when its field name contains success or passed, so a passed count of 0 rendered as false. Removed passed from that name heuristic (kept success). A genuinely boolean passed still renders true/false through formatMetricDisplay, and an explicit boolean metricType still coerces.

Before (drawer): Rubric Correctness (passed) = false
After: Rubric Correctness (passed) = 0
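
The name heuristic in fix 2 can be sketched as follows. This is an illustration under stated assumptions, not the code in columnAccess.ts: the MetricDescriptor shape, the function names, the case-insensitive match, and the literal "boolean" used for an explicit metricType are all hypothetical.

```typescript
// Hypothetical descriptor; the real column metadata may carry other fields.
interface MetricDescriptor {
    metricType?: string
    valueKey?: string
    metricKey?: string
    path?: string
}

// Sketch only. Decides whether a non-boolean value should be coerced to true/false.
function shouldCoerceToBooleanSketch(descriptor: MetricDescriptor): boolean {
    // Assumption: an explicit boolean type is spelled "boolean".
    if (descriptor.metricType === "boolean") return true

    // Only "success" remains in the name heuristic; "passed" was removed
    // because evaluators may report it as an integer count.
    const names = [descriptor.valueKey, descriptor.metricKey, descriptor.path]
    return names.some(
        (name) => typeof name === "string" && name.toLowerCase().indexOf("success") !== -1,
    )
}

// Sketch only. Formats a metric value for display.
function displayMetricSketch(descriptor: MetricDescriptor, value: unknown): string {
    // A genuinely boolean value always renders as true/false.
    if (typeof value === "boolean") return String(value)
    if (shouldCoerceToBooleanSketch(descriptor)) return String(Boolean(value))
    return String(value)
}

console.log(displayMetricSketch({valueKey: "passed"}, 0))
console.log(displayMetricSketch({valueKey: "success"}, 0))
```

Under these assumptions a passed count of 0 is displayed as 0, while a success value of 0 is still coerced to false.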

Tests

  • Lint and type-check pass on both files.
  • Verified on a live eval run. The scenarios table now shows total and passed as integers (one row total=4 passed=0, another passed=4), and the drawer shows passed as the count. success still renders false/true correctly in both the table and the drawer.

@dosubot dosubot Bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jun 8, 2026
@vercel

vercel Bot commented Jun 8, 2026

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| agenta-documentation | Ready | Preview, Comment | Jun 8, 2026 3:54pm |

@dosubot dosubot Bot added Bug Report Something isn't working Frontend labels Jun 8, 2026
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 10ff8bd8-75d0-4348-b21b-a480e513ea15

📥 Commits

Reviewing files that changed from the base of the PR and between 535ca8d and 5818261.

📒 Files selected for processing (2)
  • web/oss/src/components/EvalRunDetails/atoms/table/columnAccess.ts
  • web/packages/agenta-entities/src/evaluationRun/etl/rowPredicateFilter.ts
✅ Files skipped from review due to trivial changes (1)
  • web/packages/agenta-entities/src/evaluationRun/etl/rowPredicateFilter.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • web/oss/src/components/EvalRunDetails/atoms/table/columnAccess.ts

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes
    • Refined boolean metric interpretation so that only metrics whose name contains "success" are coerced to true/false; metrics named "passed" are no longer coerced by name.
    • Improved numeric metric comparisons to extract and compare dominant numeric values from discrete metrics, aligning their behavior with continuous metrics for more accurate comparisons.

Walkthrough

This PR updates metric value interpretation logic in evaluation run components. The inferBooleanMetric function is refined to coerce only fields containing "success" in their names, excluding "passed" fields. Separately, unwrapStatsForCompare is extended to handle discrete numeric statistics by extracting their dominant numeric value (mean, sum, or count) for comparison, matching the behavior for continuous and regular numeric types.

Changes

Metric Value Interpretation Updates

| Layer / File(s) | Summary |
|---|---|
| Boolean metric coercion refinement (web/oss/src/components/EvalRunDetails/atoms/table/columnAccess.ts) | inferBooleanMetric now coerces only fields whose name contains "success" (by valueKey, metricKey, or path), with an explicit comment documenting the exclusion of "passed" fields from coercion. |
| Discrete numeric statistics unwrapping (web/packages/agenta-entities/src/evaluationRun/etl/rowPredicateFilter.ts) | unwrapStatsForCompare documentation is updated and its condition is expanded to include type: "numeric/discrete", enabling discrete numeric stats to be unwrapped via the same mean/sum/count extraction path used for continuous and regular numeric types. |

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: fixing numeric/discrete metric rendering from raw stats objects to displayable values. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing comprehensive context, explaining both fixes, and documenting the manual testing performed. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 60.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions
Contributor

github-actions Bot commented Jun 8, 2026

Railway Preview Environment

| Field | Value |
|---|---|
| Preview URL | https://gateway-production-eeb3.up.railway.app/w |
| Project | agenta-oss-pr-4586 |
| Image tag | pr-4586-0a624d8 |
| Status | Deployed |
| Railway logs | Open logs |
| Workflow logs | View workflow run |
| Updated at | 2026-06-08T16:08:35.600Z |
