# Canonical Evaluation Policy

This document defines the stable evaluation policy for the canonical
CodeScaleBench task set.

Use this document when you need to answer four questions precisely:

- what every canonical task must do
- what is allowed to vary by verifier family
- how artifact-oriented task variants relate to deterministic verification
- how reporting should interpret reward versus pass/fail

## Universal Policy

These rules apply to every canonical task, regardless of suite or verifier
family:

- Every task has a deterministic verifier.
- Every deterministic verifier writes `/logs/verifier/reward.txt`.
- Canonical verifiers should also write
  `/logs/verifier/validation_result.json`.
- `validation_result.json` is the semantic verifier contract; `reward.txt` is
  the scalar compatibility artifact.
- Reporting must preserve continuous `reward` separately from pass semantics.

The deterministic verifier is the authoritative benchmark outcome producer.
Artifact-oriented flows do not replace it; they give the verifier a structured
or family-specific input surface.

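The dual-artifact rule above can be sketched as a small helper a verifier might call at the end of a run. This is a minimal sketch, not benchmark API: the function name `write_verifier_outputs`, its parameters, and the caller-supplied `log_dir` (in place of a hard-coded `/logs/verifier`) are all illustrative assumptions.

```python
import json
from pathlib import Path


def write_verifier_outputs(log_dir, reward, passed, scorer_family,
                           output_contract, pass_threshold=1.0):
    """Hypothetical helper: emit both canonical verifier artifacts.

    Writes reward.txt (scalar compatibility artifact) and
    validation_result.json (semantic verifier contract).
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    # reward.txt carries only the continuous scalar, for compatibility.
    (log_dir / "reward.txt").write_text(f"{reward}\n")
    # validation_result.json carries the semantics; `passed` is the
    # authoritative solved/pass flag, never recomputed from reward > 0.
    result = {
        "status": "completed",
        "scorable": True,
        "scorer_family": scorer_family,
        "reward": reward,
        "pass_threshold": pass_threshold,
        "passed": passed,
        "output_contract": output_contract,
    }
    (log_dir / "validation_result.json").write_text(json.dumps(result, indent=2))
    return result
```
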
## Hybrid Output Policy

Canonical tasks intentionally use a hybrid output model. The benchmark does
not require one universal agent artifact format.

Supported output-contract patterns include:

- `answer_json_native`: the verifier directly scores a structured
  `/workspace/answer.json` contract
- `answer_json_bridge`: an artifact-oriented image or wrapper maps structured
  agent output into an existing deterministic verifier flow
- `repo_state`: the verifier scores repository state and tests, with no
  required structured artifact
- other family-specific contracts such as `solution_json` or
  `report_markdown`

Implications:

- Deterministic verification is universal.
- Artifact support is family-specific.
- `answer.json` is common, but it is not universal benchmark policy.
- Presence of `Dockerfile.artifact_only` does not imply the same verifier
  family or the same artifact semantics across tasks.

The maintained snapshot of current canonical coverage lives in
`configs/canonical_evaluation_audit.json`. Use that audit to answer
family-level questions such as which suites are `answer_json_native`,
`answer_json_bridge`, or still migrating to `validation_result.json`.

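The `answer_json_bridge` pattern can be sketched as a thin wrapper in front of an existing deterministic verifier. Everything here is an illustrative assumption: the function name, and the hypothetical `apply_answer` callable that materializes the agent's structured output into whatever state the verifier actually scores.

```python
import json
from pathlib import Path


def bridge_answer_json(workspace, apply_answer):
    """Hypothetical bridge: map /workspace/answer.json into a verifier flow.

    `apply_answer` is an assumed callable that projects the structured
    answer into the state the deterministic verifier scores (files, repo
    edits, etc.). The bridge does not replace the verifier; it only adapts
    the agent-facing output surface.
    """
    answer_path = Path(workspace) / "answer.json"
    if not answer_path.exists():
        # A missing required artifact is an invalid-output condition,
        # not an ordinary benchmark miss.
        return {"status": "invalid_output", "scorable": False}
    answer = json.loads(answer_path.read_text())
    apply_answer(answer)
    return {"status": "bridged", "scorable": True}
```
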
## Canonical Verifier Contract

Canonical verifiers should publish semantics through
`/logs/verifier/validation_result.json` using
`docs/reference/VALIDATION_RESULT_SCHEMA.md`.

That sidecar is where verifiers declare:

- `status` and `scorable`
- `scorer_family`
- `reward`
- `pass_threshold`
- `passed`
- `output_contract`
- `sub_scores`
- structured failure context

Downstream consumers should treat `passed` as the authoritative solved/pass
flag. They should not recompute solved status from `reward > 0`.

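A sidecar under this contract might look like the following. The top-level field names mirror the list above; the values and the nested shapes of `sub_scores` and the failure context are purely illustrative, and the authoritative shapes live in `docs/reference/VALIDATION_RESULT_SCHEMA.md`.

```json
{
  "status": "completed",
  "scorable": true,
  "scorer_family": "test_ratio",
  "reward": 0.7,
  "pass_threshold": 1.0,
  "passed": false,
  "output_contract": "repo_state",
  "sub_scores": {"unit_tests": 0.7},
  "failure": {"stage": "tests", "detail": "3 of 10 tests failed"}
}
```

Note that `reward` can be nonzero while `passed` is false, which is exactly why consumers must not recompute solved status from `reward > 0`.
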
## Reporting Policy

Reporting and export code must keep these concepts separate:

- `reward`: continuous scalar produced by the deterministic verifier
- `passed`: authoritative pass/fail flag from verifier semantics
- `pass_threshold`: task or family policy threshold
- `scorer_family`: family that gives meaning to the reward
- `output_contract`: verifier-facing output mode

Mean reward is still useful, but mixed-family aggregates require caveats. A
reward of 0.7 under `test_ratio`, `oracle_checks`, and `checklist` scoring
means three different things, and the three should not be treated as silently
calibrated equivalents.

Operationally:

- use `passed` / `status` for pass-rate tables when available
- use `reward` for continuous-score summaries
- surface `scorer_family` and `output_contract` in reports and exports
- caveat or partition mixed-family reward aggregates

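The operational rules above can be sketched as two report helpers. The helper names and the assumed result-dict shape (a list of `validation_result.json`-style dicts) are illustrations, not benchmark API.

```python
def pass_rate(results):
    """Pass rate over scorable results, using the authoritative `passed`
    flag; solved status is never recomputed from reward > 0."""
    scored = [r for r in results if r.get("scorable", False)]
    passed = [r for r in scored if r.get("passed") is True]
    return len(passed) / len(scored) if scored else 0.0


def mean_reward_by_family(results):
    """Partition continuous rewards by scorer_family before aggregating,
    since rewards are not calibrated across families."""
    by_family = {}
    for r in results:
        by_family.setdefault(r.get("scorer_family", "unknown"), []).append(r["reward"])
    return {fam: sum(vals) / len(vals) for fam, vals in by_family.items()}
```
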
## Launch And Validation Expectations

Preflight checks, smoke runs, and launch docs should assume:

- the deterministic verifier always exists
- required artifacts come from the task's published output contract
- missing required artifacts are invalid-output conditions, not ordinary
  benchmark misses
- artifact-oriented image variants must preserve the same verifier semantics,
  even when the agent-facing output path differs by family

The benchmark should therefore validate artifact expectations from task
metadata and verifier contract, not from a blanket assumption that every task
must produce `/workspace/answer.json`.
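
A minimal preflight sketch under these expectations follows. The `task_metadata` shape, in particular the `required_artifacts` key, is a hypothetical illustration rather than a defined schema; the point is that expectations come from metadata, not a blanket `answer.json` assumption.

```python
from pathlib import Path


def preflight_artifacts(task_metadata, workspace):
    """Hypothetical preflight: check artifacts the task's published output
    contract actually requires, which may be none at all (e.g. repo_state)."""
    missing = [
        rel for rel in task_metadata.get("required_artifacts", [])
        if not (Path(workspace) / rel).exists()
    ]
    if missing:
        # Missing required artifacts are invalid-output conditions,
        # not ordinary benchmark misses.
        return {"ok": False, "invalid_output": True, "missing": missing}
    return {"ok": True, "invalid_output": False, "missing": []}
```
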