Commit 1c19f51

bd: backup 2026-03-09 17:09
1 parent 5bbe675 commit 1c19f51

3 files changed: 5 additions, 4 deletions

.beads/backup/backup_state.json

Lines changed: 3 additions & 3 deletions
@@ -1,10 +1,10 @@
 {
-"last_dolt_commit": "9lp64kdd2mq6lb7fcsfr6r3dv7r39p19",
+"last_dolt_commit": "ks1iasqarfn6gv35g5vqm35h3u4uv0dr",
 "last_event_id": 0,
-"timestamp": "2026-03-09T16:28:56.687035219Z",
+"timestamp": "2026-03-09T17:09:30.455394351Z",
 "counts": {
 "issues": 15,
-"events": 36,
+"events": 37,
 "comments": 0,
 "dependencies": 10,
 "labels": 0,

.beads/backup/events.jsonl

Lines changed: 1 addition & 0 deletions
@@ -34,3 +34,4 @@
 {"actor":"sjarmak","comment":null,"created_at":"2026-03-09T16:23:21Z","event_type":"updated","id":34,"issue_id":"CodeScaleBench-25b.1","new_value":"{\"notes\":\"Canonical evaluation audit landed in configs/canonical_evaluation_audit.json, generated by scripts/audit_canonical_evaluation_contract.py.\\n\\nHeadline counts from the checked-in audit:\\n- 275 tasks audited (264 active, 11 excluded)\\n- answer_json modes: 136 native, 89 bridge, 50 none\\n- artifact primary outputs among artifact-capable tasks: 217 /workspace/answer.json, 8 /app/solution.json, 7 repo-state only, 1 unspecified\\n- 42 tasks missing Dockerfile.artifact_only\\n- 16 artifact-capable tasks do not actually use answer.json\\n- 217 tasks do not yet emit validation_result-style structured verifier output\\n\\nThis confirms the canonical benchmark is hybrid at the benchmark level, but not uniformly dual-mode or uniformly structured-output across all 275 tasks.\"}","old_value":"{\"id\":\"CodeScaleBench-25b.1\",\"title\":\"Audit canonical task evaluator families and output contracts\",\"description\":\"Goal\\nProduce a canonical audit of the 275 selected tasks so follow-on work is driven by facts instead of assumptions.\\n\\nScope\\n- Classify each task by verifier family (oracle-checks, checklist, repo-state heuristic, test-ratio, F1-hybrid, etc.).\\n- Record expected agent output contract (solution.json, answer.json, review.json bridge, patched repo, report markdown, etc.).\\n- Record whether Dockerfile.artifact_only and Dockerfile.artifact_baseline exist.\\n- Record whether the verifier already emits validation_result-style structured output.\\n\\nWhy\\nCurrent support is uneven across SDLC suites, and we need a canonical source of truth before standardizing contracts or closing gaps.\",\"acceptance_criteria\":\"1. All 275 canonical tasks are classified by evaluator family, expected output artifact(s), and current artifact-mode support. 2. The audit identifies exact tasks missing artifact_only or answer.json bridge support. 3. Results are stored in repo-visible metadata or a generated audit artifact that can drive follow-on work.\",\"notes\":\"-\",\"status\":\"in_progress\",\"priority\":1,\"issue_type\":\"task\",\"assignee\":\"sjarmak\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-09T16:05:19Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-09T16:23:05Z\"}"}
 {"actor":"sjarmak","comment":null,"created_at":"2026-03-09T16:24:17Z","event_type":"closed","id":35,"issue_id":"CodeScaleBench-25b.1","new_value":"Done","old_value":""}
 {"actor":"sjarmak","comment":null,"created_at":"2026-03-09T16:28:56Z","event_type":"claimed","id":36,"issue_id":"CodeScaleBench-25b.2","new_value":"{\"assignee\":\"sjarmak\",\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-25b.2\",\"title\":\"Define standard validation_result schema for canonical verifiers\",\"description\":\"Goal\\nDefine the common verifier output contract that all canonical tasks should converge on in addition to reward.txt.\\n\\nScope\\n- Specify required and optional fields.\\n- Distinguish continuous reward from pass/fail semantics.\\n- Cover both deterministic verifiers and answer.json-derived artifact verifiers.\\n- Ensure the schema can represent partial credit, verifier failures, and missing output cleanly.\\n\\nWhy\\nToday reward.txt is universal, but the richer semantics are inconsistent and often lost.\",\"acceptance_criteria\":\"1. A single validation_result schema is defined for canonical tasks. 2. The schema includes scorer_family, reward, pass_threshold, passed, sub_scores, output_contract, and failure/error context. 3. The schema is documented in reference docs and is implementable from shell/Python verifiers without ambiguity.\",\"status\":\"open\",\"priority\":1,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-09T16:05:19Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-09T16:05:19Z\"}"}
+{"actor":"sjarmak","comment":null,"created_at":"2026-03-09T17:09:30Z","event_type":"claimed","id":37,"issue_id":"CodeScaleBench-25b.3","new_value":"{\"assignee\":\"sjarmak\",\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-25b.3\",\"title\":\"Close artifact-mode coverage gaps in the 275 canonical tasks\",\"description\":\"Goal\\nBring the canonical set closer to the intended hybrid evaluation model by closing answer.json/artifact support gaps where feasible.\\n\\nKnown baseline\\nA local audit found 42 canonical tasks without Dockerfile.artifact_only support. SDLC suites are the main source of gaps.\\n\\nScope\\n- Add artifact-mode Dockerfiles and verifier bridges where appropriate.\\n- Reuse answer_json_verifier_lib.sh when it fits; avoid bespoke one-offs.\\n- For tasks that should remain deterministic-only, document the reason explicitly.\\n\\nWhy\\nThe canonical benchmark should not imply universal hybrid evaluation if a non-trivial subset cannot actually run that way.\",\"acceptance_criteria\":\"1. Every canonical task either supports artifact_only evaluation or is explicitly marked as a documented exception. 2. Missing artifact-mode tasks are either remediated or tracked individually with rationale. 3. Representative smoke coverage exists for each family touched by the remediation work.\",\"status\":\"open\",\"priority\":1,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-09T16:05:19Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-09T16:05:19Z\"}"}
