+{"actor":"sjarmak","comment":null,"created_at":"2026-03-09T17:34:50Z","event_type":"claimed","id":40,"issue_id":"CodeScaleBench-25b.4","new_value":"{\"assignee\":\"sjarmak\",\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-25b.4\",\"title\":\"Update reporting to separate reward, pass status, and scorer family\",\"description\":\"Goal\\nMake downstream analysis reflect the real semantics of the verifier outputs instead of collapsing everything into one comparable-looking scalar.\\n\\nScope\\n- Thread validation_result metadata into report generation.\\n- Expose pass_threshold and passed alongside reward.\\n- Label evaluator families in summaries and comparisons.\\n- Add caveats or partitioned views where scorer families are not directly comparable.\\n\\nWhy\\nA 0.6 from oracle F1 is not the same construct as a 0.6 from a checklist or repo-grep verifier.\",\"acceptance_criteria\":\"1. Reports surface scorer family and output contract for canonical tasks. 2. Continuous reward and solved/pass status are reported separately. 3. Aggregate reporting avoids direct cross-family comparisons unless calibrated or clearly caveated.\",\"status\":\"open\",\"priority\":2,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-09T16:05:19Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-09T16:05:19Z\"}"}