
Commit c7e31ad

docs: document canonical hybrid evaluation policy
1 parent 8f18d96 commit c7e31ad


8 files changed: +187 -4 lines changed


docs/AGENT_INTERFACE.md

Lines changed: 11 additions & 0 deletions
@@ -40,6 +40,12 @@ Each task has a `time_limit_sec` field in `task.toml` (typically 300-1800 second
 
 The agent modifies files in the workspace to solve the task. After the agent finishes (or times out), the verifier runs `tests/test.sh` to evaluate the result.
 
+The required agent output is task-specific. Some tasks are scored from repo
+state alone, some require a structured artifact such as
+`/workspace/answer.json`, and some use other published output paths. Agents
+should follow the task's declared contract rather than assuming one universal
+artifact format.
+
 ### Verification
 
 The test script (`tests/test.sh`) is uploaded by Harbor to `/tests/` in the container at runtime. It is **not** present in the workspace directory. The script:
@@ -51,6 +57,11 @@ The test script (`tests/test.sh`) is uploaded by Harbor to `/tests/` in the cont
 4. May use non-zero exit codes to distinguish scored failure from verifier/runtime failure;
    Harbor still reads the scalar reward artifact.
 
+For canonical tasks, `reward.txt` remains the compatibility artifact, while
+`validation_result.json` carries the semantic outcome: scorer family,
+authoritative `passed`, `pass_threshold`, output contract, and invalid-output
+context.
+
 ### Result Format
 
 Harbor produces a `result.json` for each task containing:
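The `reward.txt` / `validation_result.json` split described above can be sketched in stdlib Python. This is an illustrative helper, not the benchmark's actual verifier code: the function name and defaults are assumptions, while the paths and field names come from the docs in this commit.

```python
import json
import os

def emit_verifier_result(reward, pass_threshold=1.0, scorer_family="test_ratio",
                         output_contract=None, log_dir="/logs/verifier"):
    """Write the scalar compatibility artifact and the semantic sidecar."""
    os.makedirs(log_dir, exist_ok=True)
    # reward.txt: scalar compatibility artifact that Harbor reads.
    with open(os.path.join(log_dir, "reward.txt"), "w") as f:
        f.write(f"{reward:.2f}\n")
    # validation_result.json: semantic contract with the authoritative
    # pass/fail flag, scorer family, and output-contract metadata.
    result = {
        "status": "scored",
        "scorable": True,
        "scorer_family": scorer_family,
        "reward": reward,
        "pass_threshold": pass_threshold,
        "passed": reward >= pass_threshold,
        "output_contract": output_contract
        or {"primary_path": None, "required_artifact": False},
        "sub_scores": {},
    }
    with open(os.path.join(log_dir, "validation_result.json"), "w") as f:
        json.dump(result, f, indent=2)
    return result
```

The point of the sketch is that the scalar and the semantic sidecar are produced together, so downstream consumers never have to reverse-engineer pass semantics from the scalar alone.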

docs/EVALUATION_PIPELINE.md

Lines changed: 14 additions & 0 deletions
@@ -11,6 +11,10 @@ retrieval/IR evaluation pipeline (normalized retrieval events, file/chunk IR
 metrics, utilization probes, taxonomy, and emitted artifacts), see
 [RETRIEVAL_EVAL_SPEC.md](RETRIEVAL_EVAL_SPEC.md).
 
+For canonical-task policy, read
+[docs/reference/CANONICAL_EVALUATION_POLICY.md](reference/CANONICAL_EVALUATION_POLICY.md)
+alongside this pipeline document.
+
 ---
 
 ## Pipeline Layers
@@ -56,6 +60,11 @@ in [docs/reference/VALIDATION_RESULT_SCHEMA.md](reference/VALIDATION_RESULT_SCHE
 so downstream reporting can preserve scorer family, pass semantics, and failure
 context.
 
+This is the core hybrid-policy rule: deterministic verifier reward is
+universal, but the agent-facing output contract is family-specific. Some tasks
+score repo state directly, some natively score `answer.json`, and some use
+artifact-oriented bridge variants that still feed the same verifier semantics.
+
 Verifier types are documented in [SCORING_SEMANTICS.md](SCORING_SEMANTICS.md).
 
 ### Verifier Debug Mode
@@ -242,6 +251,11 @@ my-fix-task-002 | 1.00 | 0.75 | -0.25 | medium [DIVERGENT]
 Tasks where `abs(verifier_reward - judge_score) > 0.3` are flagged `[DIVERGENT]`
 for manual review.
 
+For canonical deterministic reporting, treat continuous reward and pass/fail as
+distinct dimensions. Report generators should use verifier `passed` /
+`pass_threshold` metadata when available and surface `scorer_family` plus
+`output_contract` so mixed-family reward aggregates are explicitly caveated.
+
 ---
 
 ## Generating Reports
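As a sketch of the reporting rules above, a report generator might keep pass rate and per-family reward separate like this. The record fields follow the `validation_result.json` contract named in this commit; the helper itself is hypothetical.

```python
from collections import defaultdict

def summarize(records):
    """Summarize verifier records, keeping pass/fail separate from reward
    and partitioning reward aggregates by scorer_family."""
    pass_total = pass_count = 0
    by_family = defaultdict(list)
    for rec in records:
        # Use the authoritative `passed` flag when present; never
        # recompute solved status from `reward > 0`.
        if "passed" in rec:
            pass_total += 1
            pass_count += bool(rec["passed"])
        by_family[rec.get("scorer_family", "unknown")].append(rec["reward"])
    return {
        "pass_rate": pass_count / pass_total if pass_total else None,
        "mean_reward_by_family": {
            fam: sum(vals) / len(vals) for fam, vals in by_family.items()
        },
    }
```

Partitioning by family makes the mixed-family caveat structural: a mean reward is only ever reported within one scorer family.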

docs/REPORT_CONTEXT.md

Lines changed: 9 additions & 3 deletions
@@ -124,9 +124,12 @@ grep/glob/read. MCP-Full agents have truncated source and must use
 Sourcegraph MCP tools (keyword search, semantic search, go-to-definition,
 find-references, deep search, etc.).
 
-For Org tasks, an artifact evaluation variant is also used:
-- `baseline-local-artifact`: full local code, structured `answer.json` output
-- `mcp-remote-artifact`: truncated source, MCP tools, structured `answer.json` output
+Canonical tasks may also have artifact-oriented variants, but those variants
+follow a hybrid policy rather than one universal output format. Some tasks use
+native `answer.json`, some use bridge-mode structured artifacts, and some are
+still fundamentally repo-state verifiers. The maintained audit snapshot in
+`configs/canonical_evaluation_audit.json` is the source of truth for current
+family-level coverage and migration status.
 
 ### 3.2 Verification Pipeline
 
@@ -139,6 +142,9 @@ The evaluation uses a multi-layer pipeline:
 `/logs/verifier/validation_result.json` sidecar so scorer family,
 pass/fail semantics, sub-scores, and invalid-output context are preserved.
 
+Deterministic verifier reward is the universal policy. Artifact support is
+family-specific input to that verifier layer, not a replacement for it.
+
 2. **Optional LLM judge**: Post-hoc qualitative scoring across five
    dimensions (correctness 0.30, completeness 0.25, code quality 0.20,
    retrieval quality 0.15, efficiency 0.10) with multi-round voting.
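The five judge dimensions and weights above sum to 1.0 and imply a simple weighted score. A minimal sketch, assuming per-dimension scores in [0, 1]; the snake_case dimension keys are hypothetical, the weights are from the text:

```python
# Judge dimension weights from the pipeline description (sum to 1.0).
JUDGE_WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.25,
    "code_quality": 0.20,
    "retrieval_quality": 0.15,
    "efficiency": 0.10,
}

def judge_score(scores):
    """Weighted judge score in [0, 1] from per-dimension scores in [0, 1]."""
    assert abs(sum(JUDGE_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(JUDGE_WEIGHTS[d] * scores[d] for d in JUDGE_WEIGHTS)
```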

docs/SCORING_SEMANTICS.md

Lines changed: 11 additions & 1 deletion
@@ -19,6 +19,11 @@ Canonical tasks should normalize these families into
 `docs/reference/VALIDATION_RESULT_SCHEMA.md`. The reward type determines the
 meaning of `reward` and `sub_scores`, but not the top-level contract.
 
+See `docs/reference/CANONICAL_EVALUATION_POLICY.md` for the stable policy that
+ties these families together: deterministic verifier reward is universal,
+artifact support is hybrid and family-specific, and reporting must keep reward
+separate from pass semantics.
+
 ## Per-Verifier Scoring (Active Suites)
 
 Tasks are organized into 8 SDLC-phase suites (`csb_sdlc_understand` through `csb_sdlc_debug`)
@@ -209,6 +214,10 @@ only when non-empty.
 Org tasks use a unified oracle check library for deterministic scoring,
 with optional rubric judge for Deep Search synthesis tasks.
 
+This section is Org-specific. The `/workspace/answer.json` format below is not
+the universal canonical benchmark contract; other families may use bridge-mode
+artifacts or repo-state verification instead.
+
 ### Oracle Checks (scripts/csb_metrics/oracle_checks.py)
 
 All Org tasks are scored by `oracle_checks.py`, a stdlib-only Python
@@ -238,7 +247,8 @@ composite == 0 (total failure). Harbor reads the score from `/logs/verifier/rewa
 
 ### Agent Answer Format
 
-Agents write `/workspace/answer.json`:
+Agents write `/workspace/answer.json` for Org tasks using the native
+answer-artifact contract:
 
 ```json
 {

docs/reference/CANONICAL_EVALUATION_POLICY.md
Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
+# Canonical Evaluation Policy
+
+This document defines the stable evaluation policy for the canonical
+CodeScaleBench task set.
+
+Use this document when you need to answer four questions precisely:
+
+- what every canonical task must do
+- what is allowed to vary by verifier family
+- how artifact-oriented task variants relate to deterministic verification
+- how reporting should interpret reward versus pass/fail
+
+## Universal Policy
+
+These rules apply to every canonical task, regardless of suite or verifier
+family:
+
+- Every task has a deterministic verifier.
+- Every deterministic verifier writes `/logs/verifier/reward.txt`.
+- Canonical verifiers should also write
+  `/logs/verifier/validation_result.json`.
+- `validation_result.json` is the semantic verifier contract; `reward.txt` is
+  the scalar compatibility artifact.
+- Reporting must preserve continuous `reward` separately from pass semantics.
+
+The deterministic verifier is the authoritative benchmark outcome producer.
+Artifact-oriented flows do not replace it; they give the verifier a structured
+or family-specific input surface.
+
+## Hybrid Output Policy
+
+Canonical tasks intentionally use a hybrid output model. The benchmark does
+not require one universal agent artifact format.
+
+Supported output-contract patterns include:
+
+- `answer_json_native`: the verifier directly scores a structured
+  `/workspace/answer.json` contract
+- `answer_json_bridge`: an artifact-oriented image or wrapper maps structured
+  agent output into an existing deterministic verifier flow
+- `repo_state`: the verifier scores repository state and tests, with no
+  required structured artifact
+- other family-specific contracts such as `solution_json` or
+  `report_markdown`
+
+Implications:
+
+- Deterministic verification is universal.
+- Artifact support is family-specific.
+- `answer.json` is common, but it is not universal benchmark policy.
+- Presence of `Dockerfile.artifact_only` does not imply the same verifier
+  family or the same artifact semantics across tasks.
+
+The maintained snapshot of current canonical coverage lives in
+`configs/canonical_evaluation_audit.json`. Use that audit to answer
+family-level questions such as which suites are `answer_json_native`,
+`answer_json_bridge`, or still migrating to `validation_result.json`.
+
+## Canonical Verifier Contract
+
+Canonical verifiers should publish semantics through
+`/logs/verifier/validation_result.json` using
+`docs/reference/VALIDATION_RESULT_SCHEMA.md`.
+
+That sidecar is where verifiers declare:
+
+- `status` and `scorable`
+- `scorer_family`
+- `reward`
+- `pass_threshold`
+- `passed`
+- `output_contract`
+- `sub_scores`
+- structured failure context
+
+Downstream consumers should treat `passed` as the authoritative solved/pass
+flag. They should not recompute solved status from `reward > 0`.
+
+## Reporting Policy
+
+Reporting and export code must keep these concepts separate:
+
+- `reward`: continuous scalar produced by the deterministic verifier
+- `passed`: authoritative pass/fail flag from verifier semantics
+- `pass_threshold`: task or family policy threshold
+- `scorer_family`: family that gives meaning to the reward
+- `output_contract`: verifier-facing output mode
+
+Mean reward is still useful, but mixed-family aggregates require caveats. A
+0.7 from `test_ratio`, `oracle_checks`, and `checklist` should not be treated
+as silently calibrated equivalents.
+
+Operationally:
+
+- use `passed` / `status` for pass-rate tables when available
+- use `reward` for continuous-score summaries
+- surface `scorer_family` and `output_contract` in reports and exports
+- caveat or partition mixed-family reward aggregates
+
+## Launch And Validation Expectations
+
+Preflight checks, smoke runs, and launch docs should assume:
+
+- the deterministic verifier always exists
+- required artifacts come from the task's published output contract
+- missing required artifacts are invalid-output conditions, not ordinary
+  benchmark misses
+- artifact-oriented image variants must preserve the same verifier semantics,
+  even when the agent-facing output path differs by family
+
+The benchmark should therefore validate artifact expectations from task
+metadata and verifier contract, not from a blanket assumption that every task
+must produce `/workspace/answer.json`.
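The preflight expectation above can be made concrete with a small check driven by the task's declared contract. A minimal sketch: the helper name and return values are assumptions, while the `primary_path` / `required_artifact` fields come from the schema this policy references.

```python
import os

def check_artifacts(output_contract):
    """Classify a run's artifact state from its declared output contract.

    Returns "ok", or "invalid_output" when a required artifact is missing.
    A contract with no primary_path (e.g. repo_state) never fails this check.
    """
    path = output_contract.get("primary_path")
    if path is None:
        return "ok"  # e.g. repo_state: no structured artifact required
    if os.path.exists(path):
        return "ok"
    # A missing required artifact is an invalid-output condition,
    # not an ordinary benchmark miss.
    if output_contract.get("required_artifact"):
        return "invalid_output"
    return "ok"
```

Because the check reads the contract rather than hard-coding `/workspace/answer.json`, it behaves correctly for native, bridge, and repo-state families alike.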

docs/reference/README.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ Stable specifications and policy/reference documents.
 - `docs/WORKFLOW_METRICS.md`
 
 ## Evaluation / Scoring
+- `docs/reference/CANONICAL_EVALUATION_POLICY.md`
 - `docs/SCORING_SEMANTICS.md`
 - `docs/EVALUATION_PIPELINE.md`
 - `docs/reference/VALIDATION_RESULT_SCHEMA.md`

docs/reference/TASK_CONTRACT.md

Lines changed: 15 additions & 0 deletions
@@ -26,6 +26,11 @@ The task image, instruction, and verifier must still agree on:
 - which files the verifier is allowed to depend on
 - what counts as a valid task outcome versus an infrastructure invalid
 
+For canonical tasks, this task-level execution contract sits inside the hybrid
+evaluation policy documented in
+`docs/reference/CANONICAL_EVALUATION_POLICY.md`: deterministic verifier reward
+is universal, while required artifacts remain family-specific.
+
 ## Required Task Contract
 
 Every task should expose one canonical contract:
@@ -41,6 +46,10 @@ Recommended defaults:
 - `TASK_OUTPUT=/logs/agent/solution.md` for narrative answers
 - `TASK_OUTPUT=/workspace/solution.json`, `/workspace/review.json`, or `/workspace/answer.json` for structured-output tasks
 
+Not every canonical task requires `answer.json`. `TASK_OUTPUT` should describe
+the actual verifier-facing contract for that family, including repo-state tasks
+that do not require a structured artifact.
+
 If a task uses `/app` instead of `/workspace`, that is valid, but the task must
 use it consistently across:

@@ -75,6 +84,12 @@ and `validation_result.json` as the semantic verifier contract. The JSON
 sidecar is where verifiers should record scorer family, pass semantics,
 sub-scores, and failure context.
 
+Artifact-oriented variants should preserve this separation. A wrapper that asks
+the agent for `answer.json` may feed or bridge into an existing deterministic
+verifier, but it does not change the underlying requirement that reward and
+pass/fail semantics come from the verifier contract rather than from the mere
+presence of an artifact.
+
 At minimum, verifiers should:
 
 - emit a clear error for missing required output

docs/reference/VALIDATION_RESULT_SCHEMA.md

Lines changed: 13 additions & 0 deletions
@@ -8,6 +8,11 @@ This schema standardizes verifier semantics across scalar-only shell verifiers,
 answer.json artifact verifiers, repo-state verifiers, and oracle-based promoted
 tasks. It is intentionally simple enough to emit from shell or Python.
 
+This schema is the canonical semantic contract for hybrid evaluation. It
+applies whether the task is scored from repo state, native `answer.json`,
+bridge-mode structured output, or another family-specific artifact contract.
+It does not imply that every canonical task uses the same output artifact.
+
 ## Required Top-Level Fields
 
 Every canonical `validation_result.json` should emit these keys, even when the
@@ -30,6 +35,10 @@ Downstream consumers should treat `passed` as authoritative. `pass_threshold`
 is included so reporting can preserve task policy, but parsers should not
 recompute `passed` from `reward` alone.
 
+Likewise, consumers should not infer artifact policy from `reward` or from the
+presence of `validation_result.json`; the authoritative artifact semantics live
+under `output_contract`.
+
 ## Required `output_contract` Fields
 
 `output_contract` should always contain:
@@ -40,6 +49,10 @@ recompute `passed` from `reward` alone.
 | `primary_path` | string or `null` | Primary artifact path the verifier expected, if any |
 | `required_artifact` | boolean | Whether a missing primary artifact makes the run unscorable |
 
+`output_contract` is the bridge between the universal verifier contract and
+family-specific task IO. Reporting and validation should use it instead of
+assuming that every canonical task expects `/workspace/answer.json`.
+
 ## Failure Object
 
 When `status != "scored"`, `failure` should be populated with:
