
Commit c7e31ad

docs: document canonical hybrid evaluation policy
1 parent 8f18d96 commit c7e31ad


8 files changed: +187 -4 lines changed


docs/AGENT_INTERFACE.md

Lines changed: 11 additions & 0 deletions
@@ -40,6 +40,12 @@ Each task has a `time_limit_sec` field in `task.toml` (typically 300-1800 second
 
 The agent modifies files in the workspace to solve the task. After the agent finishes (or times out), the verifier runs `tests/test.sh` to evaluate the result.
 
+The required agent output is task-specific. Some tasks are scored from repo
+state alone, some require a structured artifact such as
+`/workspace/answer.json`, and some use other published output paths. Agents
+should follow the task's declared contract rather than assuming one universal
+artifact format.
+
 ### Verification
 
 The test script (`tests/test.sh`) is uploaded by Harbor to `/tests/` in the container at runtime. It is **not** present in the workspace directory. The script:
@@ -51,6 +57,11 @@ The test script (`tests/test.sh`) is uploaded by Harbor to `/tests/` in the cont
 4. May use non-zero exit codes to distinguish scored failure from verifier/runtime failure;
    Harbor still reads the scalar reward artifact.
 
+For canonical tasks, `reward.txt` remains the compatibility artifact, while
+`validation_result.json` carries the semantic outcome: scorer family,
+authoritative `passed`, `pass_threshold`, output contract, and invalid-output
+context.
+
 ### Result Format
 
 Harbor produces a `result.json` for each task containing:
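The `reward.txt` / `validation_result.json` split described above can be sketched in stdlib Python. This is an illustrative helper, not the benchmark's actual verifier code: the function name and defaults are assumptions, while the paths and field names come from the docs in this commit.

```python
import json
import os

def emit_verifier_result(reward, pass_threshold=1.0, scorer_family="test_ratio",
                         output_contract=None, log_dir="/logs/verifier"):
    """Write the scalar compatibility artifact and the semantic sidecar."""
    os.makedirs(log_dir, exist_ok=True)
    # reward.txt: scalar compatibility artifact that Harbor reads.
    with open(os.path.join(log_dir, "reward.txt"), "w") as f:
        f.write(f"{reward:.2f}\n")
    # validation_result.json: semantic contract with the authoritative
    # pass/fail flag, scorer family, and output-contract metadata.
    result = {
        "status": "scored",
        "scorable": True,
        "scorer_family": scorer_family,
        "reward": reward,
        "pass_threshold": pass_threshold,
        "passed": reward >= pass_threshold,
        "output_contract": output_contract
        or {"primary_path": None, "required_artifact": False},
        "sub_scores": {},
    }
    with open(os.path.join(log_dir, "validation_result.json"), "w") as f:
        json.dump(result, f, indent=2)
    return result
```

The point of the sketch is that the scalar and the semantic sidecar are produced together, so downstream consumers never have to reverse-engineer pass semantics from the scalar alone.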

docs/EVALUATION_PIPELINE.md

Lines changed: 14 additions & 0 deletions
@@ -11,6 +11,10 @@ retrieval/IR evaluation pipeline (normalized retrieval events, file/chunk IR
 metrics, utilization probes, taxonomy, and emitted artifacts), see
 [RETRIEVAL_EVAL_SPEC.md](RETRIEVAL_EVAL_SPEC.md).
 
+For canonical-task policy, read
+[docs/reference/CANONICAL_EVALUATION_POLICY.md](reference/CANONICAL_EVALUATION_POLICY.md)
+alongside this pipeline document.
+
 ---
 
 ## Pipeline Layers
@@ -56,6 +60,11 @@ in [docs/reference/VALIDATION_RESULT_SCHEMA.md](reference/VALIDATION_RESULT_SCHE
 so downstream reporting can preserve scorer family, pass semantics, and failure
 context.
 
+This is the core hybrid-policy rule: deterministic verifier reward is
+universal, but the agent-facing output contract is family-specific. Some tasks
+score repo state directly, some natively score `answer.json`, and some use
+artifact-oriented bridge variants that still feed the same verifier semantics.
+
 Verifier types are documented in [SCORING_SEMANTICS.md](SCORING_SEMANTICS.md).
 
 ### Verifier Debug Mode
@@ -242,6 +251,11 @@ my-fix-task-002 | 1.00 | 0.75 | -0.25 | medium [DIVERGENT]
 Tasks where `abs(verifier_reward - judge_score) > 0.3` are flagged `[DIVERGENT]`
 for manual review.
 
+For canonical deterministic reporting, treat continuous reward and pass/fail as
+distinct dimensions. Report generators should use verifier `passed` /
+`pass_threshold` metadata when available and surface `scorer_family` plus
+`output_contract` so mixed-family reward aggregates are explicitly caveated.
+
 ---
 
 ## Generating Reports
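As a sketch of the reporting rules above, a report generator might keep pass rate and per-family reward separate like this. The record fields follow the `validation_result.json` contract named in this commit; the helper itself is hypothetical.

```python
from collections import defaultdict

def summarize(records):
    """Summarize verifier records, keeping pass/fail separate from reward
    and partitioning reward aggregates by scorer_family."""
    pass_total = pass_count = 0
    by_family = defaultdict(list)
    for rec in records:
        # Use the authoritative `passed` flag when present; never
        # recompute solved status from `reward > 0`.
        if "passed" in rec:
            pass_total += 1
            pass_count += bool(rec["passed"])
        by_family[rec.get("scorer_family", "unknown")].append(rec["reward"])
    return {
        "pass_rate": pass_count / pass_total if pass_total else None,
        "mean_reward_by_family": {
            fam: sum(vals) / len(vals) for fam, vals in by_family.items()
        },
    }
```

Partitioning by family makes the mixed-family caveat structural: a mean reward is only ever reported within one scorer family.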

docs/REPORT_CONTEXT.md

Lines changed: 9 additions & 3 deletions
@@ -124,9 +124,12 @@ grep/glob/read. MCP-Full agents have truncated source and must use
 Sourcegraph MCP tools (keyword search, semantic search, go-to-definition,
 find-references, deep search, etc.).
 
-For Org tasks, an artifact evaluation variant is also used:
-- `baseline-local-artifact`: full local code, structured `answer.json` output
-- `mcp-remote-artifact`: truncated source, MCP tools, structured `answer.json` output
+Canonical tasks may also have artifact-oriented variants, but those variants
+follow a hybrid policy rather than one universal output format. Some tasks use
+native `answer.json`, some use bridge-mode structured artifacts, and some are
+still fundamentally repo-state verifiers. The maintained audit snapshot in
+`configs/canonical_evaluation_audit.json` is the source of truth for current
+family-level coverage and migration status.
 
 ### 3.2 Verification Pipeline
 
@@ -139,6 +142,9 @@ The evaluation uses a multi-layer pipeline:
 `/logs/verifier/validation_result.json` sidecar so scorer family,
 pass/fail semantics, sub-scores, and invalid-output context are preserved.
 
+Deterministic verifier reward is the universal policy. Artifact support is
+family-specific input to that verifier layer, not a replacement for it.
+
 2. **Optional LLM judge**: Post-hoc qualitative scoring across five
    dimensions (correctness 0.30, completeness 0.25, code quality 0.20,
    retrieval quality 0.15, efficiency 0.10) with multi-round voting.
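The five judge dimensions and weights above sum to 1.0 and imply a simple weighted score. A minimal sketch, assuming per-dimension scores in [0, 1]; the snake_case dimension keys are hypothetical, the weights are from the text:

```python
# Judge dimension weights from the pipeline description (sum to 1.0).
JUDGE_WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.25,
    "code_quality": 0.20,
    "retrieval_quality": 0.15,
    "efficiency": 0.10,
}

def judge_score(scores):
    """Weighted judge score in [0, 1] from per-dimension scores in [0, 1]."""
    assert abs(sum(JUDGE_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(JUDGE_WEIGHTS[d] * scores[d] for d in JUDGE_WEIGHTS)
```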

docs/SCORING_SEMANTICS.md

Lines changed: 11 additions & 1 deletion
@@ -19,6 +19,11 @@ Canonical tasks should normalize these families into
 `docs/reference/VALIDATION_RESULT_SCHEMA.md`. The reward type determines the
 meaning of `reward` and `sub_scores`, but not the top-level contract.
 
+See `docs/reference/CANONICAL_EVALUATION_POLICY.md` for the stable policy that
+ties these families together: deterministic verifier reward is universal,
+artifact support is hybrid and family-specific, and reporting must keep reward
+separate from pass semantics.
+
 ## Per-Verifier Scoring (Active Suites)
 
 Tasks are organized into 8 SDLC-phase suites (`csb_sdlc_understand` through `csb_sdlc_debug`)
@@ -209,6 +214,10 @@ only when non-empty.
 Org tasks use a unified oracle check library for deterministic scoring,
 with optional rubric judge for Deep Search synthesis tasks.
 
+This section is Org-specific. The `/workspace/answer.json` format below is not
+the universal canonical benchmark contract; other families may use bridge-mode
+artifacts or repo-state verification instead.
+
 ### Oracle Checks (scripts/csb_metrics/oracle_checks.py)
 
 All Org tasks are scored by `oracle_checks.py`, a stdlib-only Python
@@ -238,7 +247,8 @@ composite == 0 (total failure). Harbor reads the score from `/logs/verifier/rewa
 
 ### Agent Answer Format
 
-Agents write `/workspace/answer.json`:
+Agents write `/workspace/answer.json` for Org tasks using the native
+answer-artifact contract:
 
 ```json
 {

docs/reference/CANONICAL_EVALUATION_POLICY.md
Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
+# Canonical Evaluation Policy
+
+This document defines the stable evaluation policy for the canonical
+CodeScaleBench task set.
+
+Use this document when you need to answer four questions precisely:
+
+- what every canonical task must do
+- what is allowed to vary by verifier family
+- how artifact-oriented task variants relate to deterministic verification
+- how reporting should interpret reward versus pass/fail
+
+## Universal Policy
+
+These rules apply to every canonical task, regardless of suite or verifier
+family:
+
+- Every task has a deterministic verifier.
+- Every deterministic verifier writes `/logs/verifier/reward.txt`.
+- Canonical verifiers should also write
+  `/logs/verifier/validation_result.json`.
+- `validation_result.json` is the semantic verifier contract; `reward.txt` is
+  the scalar compatibility artifact.
+- Reporting must preserve continuous `reward` separately from pass semantics.
+
+The deterministic verifier is the authoritative benchmark outcome producer.
+Artifact-oriented flows do not replace it; they give the verifier a structured
+or family-specific input surface.
+
+## Hybrid Output Policy
+
+Canonical tasks intentionally use a hybrid output model. The benchmark does
+not require one universal agent artifact format.
+
+Supported output-contract patterns include:
+
+- `answer_json_native`: the verifier directly scores a structured
+  `/workspace/answer.json` contract
+- `answer_json_bridge`: an artifact-oriented image or wrapper maps structured
+  agent output into an existing deterministic verifier flow
+- `repo_state`: the verifier scores repository state and tests, with no
+  required structured artifact
+- other family-specific contracts such as `solution_json` or
+  `report_markdown`
+
+Implications:
+
+- Deterministic verification is universal.
+- Artifact support is family-specific.
+- `answer.json` is common, but it is not universal benchmark policy.
+- Presence of `Dockerfile.artifact_only` does not imply the same verifier
+  family or the same artifact semantics across tasks.
+
+The maintained snapshot of current canonical coverage lives in
+`configs/canonical_evaluation_audit.json`. Use that audit to answer
+family-level questions such as which suites are `answer_json_native`,
+`answer_json_bridge`, or still migrating to `validation_result.json`.
+
+## Canonical Verifier Contract
+
+Canonical verifiers should publish semantics through
+`/logs/verifier/validation_result.json` using
+`docs/reference/VALIDATION_RESULT_SCHEMA.md`.
+
+That sidecar is where verifiers declare:
+
+- `status` and `scorable`
+- `scorer_family`
+- `reward`
+- `pass_threshold`
+- `passed`
+- `output_contract`
+- `sub_scores`
+- structured failure context
+
+Downstream consumers should treat `passed` as the authoritative solved/pass
+flag. They should not recompute solved status from `reward > 0`.
+
+## Reporting Policy
+
+Reporting and export code must keep these concepts separate:
+
+- `reward`: continuous scalar produced by the deterministic verifier
+- `passed`: authoritative pass/fail flag from verifier semantics
+- `pass_threshold`: task or family policy threshold
+- `scorer_family`: family that gives meaning to the reward
+- `output_contract`: verifier-facing output mode
+
+Mean reward is still useful, but mixed-family aggregates require caveats. A
+0.7 from `test_ratio`, `oracle_checks`, and `checklist` should not be treated
+as silently calibrated equivalents.
+
+Operationally:
+
+- use `passed` / `status` for pass-rate tables when available
+- use `reward` for continuous-score summaries
+- surface `scorer_family` and `output_contract` in reports and exports
+- caveat or partition mixed-family reward aggregates
+
+## Launch And Validation Expectations
+
+Preflight checks, smoke runs, and launch docs should assume:
+
+- the deterministic verifier always exists
+- required artifacts come from the task's published output contract
+- missing required artifacts are invalid-output conditions, not ordinary
+  benchmark misses
+- artifact-oriented image variants must preserve the same verifier semantics,
+  even when the agent-facing output path differs by family
+
+The benchmark should therefore validate artifact expectations from task
+metadata and verifier contract, not from a blanket assumption that every task
+must produce `/workspace/answer.json`.
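The preflight expectation above can be made concrete with a small check driven by the task's declared contract. A minimal sketch: the helper name and return values are assumptions, while the `primary_path` / `required_artifact` fields come from the schema this policy references.

```python
import os

def check_artifacts(output_contract):
    """Classify a run's artifact state from its declared output contract.

    Returns "ok", or "invalid_output" when a required artifact is missing.
    A contract with no primary_path (e.g. repo_state) never fails this check.
    """
    path = output_contract.get("primary_path")
    if path is None:
        return "ok"  # e.g. repo_state: no structured artifact required
    if os.path.exists(path):
        return "ok"
    # A missing required artifact is an invalid-output condition,
    # not an ordinary benchmark miss.
    if output_contract.get("required_artifact"):
        return "invalid_output"
    return "ok"
```

Because the check reads the contract rather than hard-coding `/workspace/answer.json`, it behaves correctly for native, bridge, and repo-state families alike.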

docs/reference/README.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ Stable specifications and policy/reference documents.
 - `docs/WORKFLOW_METRICS.md`
 
 ## Evaluation / Scoring
+- `docs/reference/CANONICAL_EVALUATION_POLICY.md`
 - `docs/SCORING_SEMANTICS.md`
 - `docs/EVALUATION_PIPELINE.md`
 - `docs/reference/VALIDATION_RESULT_SCHEMA.md`

docs/reference/TASK_CONTRACT.md

Lines changed: 15 additions & 0 deletions
@@ -26,6 +26,11 @@ The task image, instruction, and verifier must still agree on:
 - which files the verifier is allowed to depend on
 - what counts as a valid task outcome versus an infrastructure invalid
 
+For canonical tasks, this task-level execution contract sits inside the hybrid
+evaluation policy documented in
+`docs/reference/CANONICAL_EVALUATION_POLICY.md`: deterministic verifier reward
+is universal, while required artifacts remain family-specific.
+
 ## Required Task Contract
 
 Every task should expose one canonical contract:
@@ -41,6 +46,10 @@ Recommended defaults:
 - `TASK_OUTPUT=/logs/agent/solution.md` for narrative answers
 - `TASK_OUTPUT=/workspace/solution.json`, `/workspace/review.json`, or `/workspace/answer.json` for structured-output tasks
 
+Not every canonical task requires `answer.json`. `TASK_OUTPUT` should describe
+the actual verifier-facing contract for that family, including repo-state tasks
+that do not require a structured artifact.
+
 If a task uses `/app` instead of `/workspace`, that is valid, but the task must
 use it consistently across:

@@ -75,6 +84,12 @@ and `validation_result.json` as the semantic verifier contract. The JSON
 sidecar is where verifiers should record scorer family, pass semantics,
 sub-scores, and failure context.
 
+Artifact-oriented variants should preserve this separation. A wrapper that asks
+the agent for `answer.json` may feed or bridge into an existing deterministic
+verifier, but it does not change the underlying requirement that reward and
+pass/fail semantics come from the verifier contract rather than from the mere
+presence of an artifact.
+
 At minimum, verifiers should:
 
 - emit a clear error for missing required output

docs/reference/VALIDATION_RESULT_SCHEMA.md

Lines changed: 13 additions & 0 deletions
@@ -8,6 +8,11 @@ This schema standardizes verifier semantics across scalar-only shell verifiers,
 answer.json artifact verifiers, repo-state verifiers, and oracle-based promoted
 tasks. It is intentionally simple enough to emit from shell or Python.
 
+This schema is the canonical semantic contract for hybrid evaluation. It
+applies whether the task is scored from repo state, native `answer.json`,
+bridge-mode structured output, or another family-specific artifact contract.
+It does not imply that every canonical task uses the same output artifact.
+
 ## Required Top-Level Fields
 
 Every canonical `validation_result.json` should emit these keys, even when the
@@ -30,6 +35,10 @@ Downstream consumers should treat `passed` as authoritative. `pass_threshold`
 is included so reporting can preserve task policy, but parsers should not
 recompute `passed` from `reward` alone.
 
+Likewise, consumers should not infer artifact policy from `reward` or from the
+presence of `validation_result.json`; the authoritative artifact semantics live
+under `output_contract`.
+
 ## Required `output_contract` Fields
 
 `output_contract` should always contain:
@@ -40,6 +49,10 @@ recompute `passed` from `reward` alone.
 | `primary_path` | string or `null` | Primary artifact path the verifier expected, if any |
 | `required_artifact` | boolean | Whether a missing primary artifact makes the run unscorable |
 
+`output_contract` is the bridge between the universal verifier contract and
+family-specific task IO. Reporting and validation should use it instead of
+assuming that every canonical task expects `/workspace/answer.json`.
+
 ## Failure Object
 
 When `status != "scored"`, `failure` should be populated with:
