bd: backup 2026-03-09 17:09

sjarmak · sjarmak · commit 1c19f51af251 · 2026-03-09T17:09:30.000Z
diff --git a/.beads/backup/backup_state.json b/.beads/backup/backup_state.json
@@ -1,10 +1,10 @@
 {
-  "last_dolt_commit": "9lp64kdd2mq6lb7fcsfr6r3dv7r39p19",
+  "last_dolt_commit": "ks1iasqarfn6gv35g5vqm35h3u4uv0dr",
   "last_event_id": 0,
-  "timestamp": "2026-03-09T16:28:56.687035219Z",
+  "timestamp": "2026-03-09T17:09:30.455394351Z",
   "counts": {
     "issues": 15,
-    "events": 36,
+    "events": 37,
     "comments": 0,
     "dependencies": 10,
     "labels": 0,
diff --git a/.beads/backup/events.jsonl b/.beads/backup/events.jsonl
@@ -34,3 +34,4 @@
 {"actor":"sjarmak","comment":null,"created_at":"2026-03-09T16:23:21Z","event_type":"updated","id":34,"issue_id":"CodeScaleBench-25b.1","new_value":"{\"notes\":\"Canonical evaluation audit landed in configs/canonical_evaluation_audit.json, generated by scripts/audit_canonical_evaluation_contract.py.\\n\\nHeadline counts from the checked-in audit:\\n- 275 tasks audited (264 active, 11 excluded)\\n- answer_json modes: 136 native, 89 bridge, 50 none\\n- artifact primary outputs among artifact-capable tasks: 217 /workspace/answer.json, 8 /app/solution.json, 7 repo-state only, 1 unspecified\\n- 42 tasks missing Dockerfile.artifact_only\\n- 16 artifact-capable tasks do not actually use answer.json\\n- 217 tasks do not yet emit validation_result-style structured verifier output\\n\\nThis confirms the canonical benchmark is hybrid at the benchmark level, but not uniformly dual-mode or uniformly structured-output across all 275 tasks.\"}","old_value":"{\"id\":\"CodeScaleBench-25b.1\",\"title\":\"Audit canonical task evaluator families and output contracts\",\"description\":\"Goal\\nProduce a canonical audit of the 275 selected tasks so follow-on work is driven by facts instead of assumptions.\\n\\nScope\\n- Classify each task by verifier family (oracle-checks, checklist, repo-state heuristic, test-ratio, F1-hybrid, etc.).\\n- Record expected agent output contract (solution.json, answer.json, review.json bridge, patched repo, report markdown, etc.).\\n- Record whether Dockerfile.artifact_only and Dockerfile.artifact_baseline exist.\\n- Record whether the verifier already emits validation_result-style structured output.\\n\\nWhy\\nCurrent support is uneven across SDLC suites, and we need a canonical source of truth before standardizing contracts or closing gaps.\",\"acceptance_criteria\":\"1. All 275 canonical tasks are classified by evaluator family, expected output artifact(s), and current artifact-mode support. 2. The audit identifies exact tasks missing artifact_only or answer.json bridge support. 3. Results are stored in repo-visible metadata or a generated audit artifact that can drive follow-on work.\",\"notes\":\"-\",\"status\":\"in_progress\",\"priority\":1,\"issue_type\":\"task\",\"assignee\":\"sjarmak\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-09T16:05:19Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-09T16:23:05Z\"}"}
 {"actor":"sjarmak","comment":null,"created_at":"2026-03-09T16:24:17Z","event_type":"closed","id":35,"issue_id":"CodeScaleBench-25b.1","new_value":"Done","old_value":""}
 {"actor":"sjarmak","comment":null,"created_at":"2026-03-09T16:28:56Z","event_type":"claimed","id":36,"issue_id":"CodeScaleBench-25b.2","new_value":"{\"assignee\":\"sjarmak\",\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-25b.2\",\"title\":\"Define standard validation_result schema for canonical verifiers\",\"description\":\"Goal\\nDefine the common verifier output contract that all canonical tasks should converge on in addition to reward.txt.\\n\\nScope\\n- Specify required and optional fields.\\n- Distinguish continuous reward from pass/fail semantics.\\n- Cover both deterministic verifiers and answer.json-derived artifact verifiers.\\n- Ensure the schema can represent partial credit, verifier failures, and missing output cleanly.\\n\\nWhy\\nToday reward.txt is universal, but the richer semantics are inconsistent and often lost.\",\"acceptance_criteria\":\"1. A single validation_result schema is defined for canonical tasks. 2. The schema includes scorer_family, reward, pass_threshold, passed, sub_scores, output_contract, and failure/error context. 3. The schema is documented in reference docs and is implementable from shell/Python verifiers without ambiguity.\",\"status\":\"open\",\"priority\":1,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-09T16:05:19Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-09T16:05:19Z\"}"}
+{"actor":"sjarmak","comment":null,"created_at":"2026-03-09T17:09:30Z","event_type":"claimed","id":37,"issue_id":"CodeScaleBench-25b.3","new_value":"{\"assignee\":\"sjarmak\",\"status\":\"in_progress\"}","old_value":"{\"id\":\"CodeScaleBench-25b.3\",\"title\":\"Close artifact-mode coverage gaps in the 275 canonical tasks\",\"description\":\"Goal\\nBring the canonical set closer to the intended hybrid evaluation model by closing answer.json/artifact support gaps where feasible.\\n\\nKnown baseline\\nA local audit found 42 canonical tasks without Dockerfile.artifact_only support. SDLC suites are the main source of gaps.\\n\\nScope\\n- Add artifact-mode Dockerfiles and verifier bridges where appropriate.\\n- Reuse answer_json_verifier_lib.sh when it fits; avoid bespoke one-offs.\\n- For tasks that should remain deterministic-only, document the reason explicitly.\\n\\nWhy\\nThe canonical benchmark should not imply universal hybrid evaluation if a non-trivial subset cannot actually run that way.\",\"acceptance_criteria\":\"1. Every canonical task either supports artifact_only evaluation or is explicitly marked as a documented exception. 2. Missing artifact-mode tasks are either remediated or tracked individually with rationale. 3. Representative smoke coverage exists for each family touched by the remediation work.\",\"status\":\"open\",\"priority\":1,\"issue_type\":\"task\",\"owner\":\"sjarmak@users.noreply.github.com\",\"created_at\":\"2026-03-09T16:05:19Z\",\"created_by\":\"sjarmak\",\"updated_at\":\"2026-03-09T16:05:19Z\"}"}
diff --git a/.beads/backup/issues.jsonl b/.beads/backup/issues.jsonl