[fix] Render numeric/discrete evaluator metrics as values, not raw stats by mmabrouk · Pull Request #4586 · Agenta-AI/agenta

mmabrouk · 2026-06-08T15:49:38Z

Context

Running an evaluation whose evaluator returns integer counts (an LLM judge that reports passed and total, e.g. passed=2 / total=4) showed those columns in the scenarios table as a raw stats blob:

{"type":"numeric/discrete","count":1,"max":4,"min":4,"sum":4,"mean":4,...}

In the testcase drawer, passed rendered as false instead of the count. Two separate display paths each mishandled these metrics: the table did not handle the numeric/discrete metric type, and the drawer coerced passed to a boolean based on its field name. score (numeric/continuous) and success (binary) rendered fine, which is why only total and passed looked wrong.

Changes

Two fixes.

1. Table cell showed the raw stats object. unwrapStatsForCompare reduces a per-scenario stats object to a scalar, but only handled binary and numeric/continuous. A numeric/discrete value fell through unchanged and got JSON-stringified into the cell. Added the numeric/discrete case so it reduces to mean/sum like the continuous one. This also fixes filtering on discrete metrics, which previously compared the predicate against the raw object.

Before (table cell): {"type":"numeric/discrete","count":1,"mean":4,...}
After: 4
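
As a rough illustration of the first fix, the unwrapping can be sketched as below. This is not the code from rowPredicateFilter.ts: the function name `unwrapStatsSketch`, the `StatsLike` type, and the mean, then sum, then count fallback order are assumptions made for the example, and the handling of binary stats is left out.

```typescript
// Hypothetical shape of a per-scenario stats object. The field names follow
// the blob shown in the description; everything else is assumed.
type StatsLike = {
    type?: string
    count?: number
    mean?: number
    sum?: number
}

// Stats types that this sketch reduces to a single number.
const NUMERIC_STATS_TYPES = ["numeric/continuous", "numeric/discrete"]

// Reduce a stats object to a scalar so it can be displayed or compared.
// Anything that is not a recognised numeric stats object is returned unchanged.
function unwrapStatsSketch(value: unknown): unknown {
    if (value === null || typeof value !== "object" || Array.isArray(value)) {
        return value
    }
    const stats = value as StatsLike
    if (typeof stats.type === "string" && NUMERIC_STATS_TYPES.indexOf(stats.type) !== -1) {
        // Assumed fallback order: mean first, then sum, then count.
        const candidates = [stats.mean, stats.sum, stats.count]
        for (const candidate of candidates) {
            if (typeof candidate === "number" && isFinite(candidate)) {
                return candidate
            }
        }
    }
    return value
}

const cell = unwrapStatsSketch({
    type: "numeric/discrete",
    count: 1,
    max: 4,
    min: 4,
    sum: 4,
    mean: 4,
})
console.log(cell)
```

With a per-scenario count of 1, mean and sum coincide, so either choice yields the value the evaluator reported. A filter predicate such as `value >= 3` would then compare against the number rather than the object.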

2. Drawer coerced the count to a boolean. The drawer coerces a metric to a boolean when its field name contains success or passed, so a passed count of 0 rendered as false. Removed passed from that name heuristic (kept success). A genuinely boolean passed still renders true/false through formatMetricDisplay, and an explicit boolean metricType still coerces.

Before (drawer): Rubric Correctness (passed) = false
After: Rubric Correctness (passed) = 0
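
A minimal sketch of the narrowed name heuristic follows. It is illustrative only and is not the body of inferBooleanMetric: the `MetricRef` type, the function name `shouldCoerceToBooleanSketch`, the `"boolean"` literal for the explicit metric type, and the case-insensitive match are all assumptions.

```typescript
// Hypothetical descriptor for a metric column; the real code reads similar
// identifiers (valueKey, metricKey, path), but this type is assumed.
type MetricRef = {
    valueKey?: string
    metricKey?: string
    path?: string
    metricType?: string
}

// Decide whether a metric value should be coerced to true/false for display.
function shouldCoerceToBooleanSketch(ref: MetricRef): boolean {
    // An explicitly boolean metric type always coerces.
    // The literal "boolean" is an assumption for this sketch.
    if (ref.metricType === "boolean") {
        return true
    }
    // Name heuristic: only "success" triggers coercion. "passed" is
    // deliberately absent, because it may carry an integer count.
    const names = [ref.valueKey, ref.metricKey, ref.path]
    return names.some(
        (name) => typeof name === "string" && name.toLowerCase().indexOf("success") !== -1,
    )
}

console.log(shouldCoerceToBooleanSketch({valueKey: "passed"}))
console.log(shouldCoerceToBooleanSketch({valueKey: "success"}))
```

Under this sketch a passed count of 0 is left as a number, while a passed metric declared with an explicit boolean type would still be coerced.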

Tests

Lint and type-check pass on both files.
Verified on a live eval run. The scenarios table now shows total and passed as integers (one row total=4 passed=0, another passed=4), and the drawer shows passed as the count. success still renders false/true correctly in both the table and the drawer.

vercel · 2026-06-08T15:49:43Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agenta-documentation | Ready | Preview, Comment | Jun 8, 2026 3:54pm |

coderabbitai · 2026-06-08T15:49:57Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

📥 Commits

Reviewing files that changed from the base of the PR and between 535ca8d and 5818261.

📒 Files selected for processing (2)

web/oss/src/components/EvalRunDetails/atoms/table/columnAccess.ts
web/packages/agenta-entities/src/evaluationRun/etl/rowPredicateFilter.ts

✅ Files skipped from review due to trivial changes (1)

web/packages/agenta-entities/src/evaluationRun/etl/rowPredicateFilter.ts

🚧 Files skipped from review as they are similar to previous changes (1)

web/oss/src/components/EvalRunDetails/atoms/table/columnAccess.ts

📝 Walkthrough

Summary by CodeRabbit

Bug Fixes
- Refined boolean metric interpretation to consistently treat metrics identified by "success", improving when values are shown as true/false in reports and dashboards.
- Improved numeric metric comparisons to extract and compare dominant numeric values from discrete metrics, aligning their behavior with continuous metrics for more accurate comparisons.

Walkthrough

This PR updates metric value interpretation logic in evaluation run components. The inferBooleanMetric function is refined to coerce only fields containing "success" in their names, excluding "passed" fields. Separately, unwrapStatsForCompare is extended to handle discrete numeric statistics by extracting their dominant numeric value (mean, sum, or count) for comparison, matching the behavior for continuous and regular numeric types.

Changes

Metric Value Interpretation Updates

| Layer / File(s) | Summary |
| --- | --- |
| Boolean metric coercion refinement `web/oss/src/components/EvalRunDetails/atoms/table/columnAccess.ts` | `inferBooleanMetric` now coerces only fields whose name contains `"success"` (by `valueKey`, `metricKey`, or `path`), with an explicit comment documenting the exclusion of `"passed"` fields from coercion. |
| Discrete numeric statistics unwrapping `web/packages/agenta-entities/src/evaluationRun/etl/rowPredicateFilter.ts` | `unwrapStatsForCompare` documentation is updated and its condition is expanded to include `type: "numeric/discrete"`, enabling discrete numeric stats to be unwrapped via the same mean/sum/count extraction path used for continuous and regular numeric types. |

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: fixing numeric/discrete metric rendering from raw stats objects to displayable values. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing comprehensive context, explaining both fixes, and documenting the manual testing performed. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 60.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

github-actions · 2026-06-08T16:08:36Z

Railway Preview Environment


| Field | Value |
| --- | --- |
| Preview URL | https://gateway-production-eeb3.up.railway.app/w |
| Project | `agenta-oss-pr-4586` |
| Image tag | `pr-4586-0a624d8` |
| Status | Deployed |
| Railway logs | Open logs |
| Workflow logs | View workflow run |

Updated at 2026-06-08T16:08:35.600Z

dosubot Bot added the size:S label (this PR changes 10-29 lines, ignoring generated files) Jun 8, 2026

dosubot Bot added the Bug Report (something isn't working) and Frontend labels Jun 8, 2026

vercel Bot deployed to Preview June 8, 2026 15:50

fix(frontend): render numeric/discrete evaluator metrics as scalars

5818261

mmabrouk force-pushed the fix/eval-discrete-metric-display branch from 535ca8d to 5818261 June 8, 2026 15:53

mmabrouk requested a review from ardaerzin June 8, 2026 15:54

vercel Bot deployed to Preview June 8, 2026 15:54

ardaerzin approved these changes Jun 8, 2026

mmabrouk wants to merge 1 commit into main from fix/eval-discrete-metric-display
