[fix] Render numeric/discrete evaluator metrics as values, not raw stats #4586
mmabrouk wants to merge 1 commit into
CodeRabbit walkthrough: This PR updates metric value interpretation logic in evaluation run components.
535ca8d to 5818261
Context
Running an evaluation whose evaluator returns integer counts (an LLM judge that reports `passed` and `total`, e.g. `passed=2 / total=4`) showed those columns in the scenarios table as a raw stats blob. The same fields in the testcase drawer rendered `passed` as `false` instead of the count.
Two separate display paths each mishandled the `numeric/discrete` metric type. `score` (`numeric/continuous`) and `success` (`binary`) rendered fine, which is why only `total` and `passed` looked wrong.
Changes
Two fixes.
1. Table cell showed the raw stats object. `unwrapStatsForCompare` reduces a per-scenario stats object to a scalar, but only handled `binary` and `numeric/continuous`. A `numeric/discrete` value fell through unchanged and got JSON-stringified into the cell. Added the `numeric/discrete` case so it reduces to mean/sum like the continuous one. This also fixes filtering on discrete metrics, which previously compared the predicate against the raw object.
   Before (table cell): `{"type":"numeric/discrete","count":1,"mean":4,...}`
   After: `4`
2. Drawer coerced the count to a boolean. The drawer coerced a metric to a boolean when its field name contained `success` or `passed`, so a `passed` count of `0` rendered as `false`. Removed `passed` from that name heuristic (kept `success`). A genuinely boolean `passed` still renders `true`/`false` through `formatMetricDisplay`, and an explicit boolean `metricType` still coerces.
   Before (drawer): `Rubric Correctness (passed)=false`
   After: `Rubric Correctness (passed)=0`
Tests
- The table shows `total` and `passed` as integers (one row `total=4 passed=0`, another `passed=4`), and the drawer shows `passed` as the count.
- `success` still renders `false`/`true` correctly in both the table and the drawer.
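For readers who want the two fixes in code form, here is a minimal sketch of the behavior described under Changes. It is not the code from this PR. The names `StatsLike`, `unwrapNumericStatsSketch`, `shouldCoerceToBoolean`, and `displayDrawerValueSketch` are hypothetical, the stats shape is inferred only from the example blob above, and both the mean-before-sum order and the use of `"binary"` as the explicit boolean `metricType` are assumptions.

```typescript
// Hypothetical stats shape, inferred from the example blob in this PR.
interface StatsLike {
  type?: string;
  count?: number;
  mean?: number;
  sum?: number;
}

function isStatsLike(value: unknown): value is StatsLike {
  return typeof value === "object" && value !== null && "type" in value;
}

// Fix 1 (sketch): reduce a numeric stats object to a scalar.
// Assumption: prefer mean, fall back to sum, else leave the value untouched.
function unwrapNumericStatsSketch(value: unknown): unknown {
  if (!isStatsLike(value)) return value;
  switch (value.type) {
    case "numeric/continuous":
    case "numeric/discrete": // the case that previously fell through
      return value.mean ?? value.sum ?? value;
    default:
      return value; // binary and other types are out of scope for this sketch
  }
}

// Fix 2 (sketch): the name heuristic matches "success" only, no longer "passed".
const BOOLEAN_NAME_HINT = /success/i;

function shouldCoerceToBoolean(fieldName: string, metricType?: string): boolean {
  // Assumption: "binary" stands in for an explicit boolean metric type.
  if (metricType === "binary") return true;
  return BOOLEAN_NAME_HINT.test(fieldName);
}

function displayDrawerValueSketch(
  fieldName: string,
  value: unknown,
  metricType?: string,
): string {
  // A genuinely boolean value always renders as true/false.
  if (typeof value === "boolean") return String(value);
  if (shouldCoerceToBoolean(fieldName, metricType)) return String(Boolean(value));
  return String(value);
}

// Usage: a discrete stats object reduces to its mean,
// and a `passed` count of 0 stays a count.
const cell = unwrapNumericStatsSketch({ type: "numeric/discrete", count: 1, mean: 4 }); // 4
const drawer = displayDrawerValueSketch("passed", 0); // "0"
```

Listing `numeric/discrete` next to `numeric/continuous` in the same `case` group matches the description that the discrete case reduces like the continuous one. A filter predicate then compares against the scalar as well.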