352 | 352 | [2026-02-20 21:03:25 UTC] Iteration 5 no story markers found |
353 | 353 | [2026-02-20 21:03:25 UTC] Iteration 5 complete |
354 | 354 | [2026-02-20 21:03:27 UTC] Iteration 6 started |
| 355 | + |
| 356 | +## 2026-02-20 - US-015: Starter tasks — Category E: Onboarding comprehension (2 tasks) |
| 357 | +- Authored 2 complete tasks in benchmarks/ccb_mcp_onboarding/: |
| 358 | + |
| 359 | +**CCX-onboard-041** (E41 — API consumption, python-ml-stack fixture): |
| 360 | + - Scenario: New engineer audits which pandas source files execute a `from scipy.stats import` statement at runtime |
| 361 | + - Oracle: 4 files in pandas-dev/pandas: |
| 362 | + - pandas/core/nanops.py (kendalltau, spearmanr for correlation methods) |
| 363 | + - pandas/plotting/_matplotlib/misc.py (gaussian_kde) |
| 364 | + - pandas/plotting/_matplotlib/hist.py (gaussian_kde) |
| 365 | + - pandas/tests/groupby/test_reductions.py (sem) |
| 366 | + - eval.sh: file_set_match + provenance |
| 367 | + - Validity gate: VALID (gold=1.0, empty=0.0) |
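A minimal sketch of what the validity gate above checks, assuming a set-overlap scorer. The `file_set_match` function here is a hypothetical stand-in (an F1 over file paths); the real scoring in eval.sh is not shown in this log and may differ. Only the four oracle paths and the gold=1.0 / empty=0.0 expectation come from the entry itself.

```python
# Hypothetical scorer: F1 overlap between predicted and gold file sets.
# This is an assumption for illustration, not the eval.sh implementation.
def file_set_match(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    if hits == 0:
        # Covers the empty-answer case and the no-overlap case.
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Oracle file list as recorded for CCX-onboard-041.
GOLD = [
    "pandas/core/nanops.py",
    "pandas/plotting/_matplotlib/misc.py",
    "pandas/plotting/_matplotlib/hist.py",
    "pandas/tests/groupby/test_reductions.py",
]


def validity_gate(scorer, gold):
    """A task is VALID when the gold answer scores 1.0 and an empty answer scores 0.0."""
    gold_score = scorer(gold, gold)
    empty_score = scorer([], gold)
    return gold_score == 1.0 and empty_score == 0.0


if __name__ == "__main__":
    print("VALID" if validity_gate(file_set_match, GOLD) else "INVALID")
```

The gate guards against two failure modes: a scorer that cannot reach full marks on its own oracle, and a scorer that awards credit for saying nothing.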
| 368 | + |
| 369 | +**CCX-onboard-050-ds** (E50 — end-to-end flow, Deep Search variant, kubernetes-ecosystem fixture): |
| 370 | + - Scenario: Onboarding engineer traces Deployment creation flow end-to-end across 3 repos |
| 371 | + - Open-ended instruction: "Explain how a client creates a Deployment — trace through all 3 layers" |
| 372 | + - Oracle chain: |
| 373 | + - sg-benchmarks/kubernetes-client-go: kubernetes/typed/apps/v1/deployment.go (Create) |
| 374 | + - kubernetes/kubernetes: pkg/registry/apps/deployment/strategy.go (PrepareForCreate) |
| 375 | + - etcd-io/etcd: server/storage/mvcc/kvstore_txn.go (Put) |
| 376 | + - eval.sh: dependency_chain + provenance |
| 377 | + - tests/criteria.json: 4 AAA quality rubric criteria (flow_completeness, cross_repo_synthesis, technical_accuracy, onboarding_clarity) |
| 378 | + - deepsearch_relevant=true in selection file |
| 379 | + - rubric_judge weight=0.4 in task_spec.json (supplementary, not in eval.sh) |
| 380 | + - Validity gate: VALID (gold=1.0, empty=0.0) |
| 381 | + |
| 382 | +- Both tasks registered in configs/selected_mcp_unique_tasks.json (8 total tasks now) |
| 383 | +- Files changed: benchmarks/ccb_mcp_onboarding/ (2 new task dirs with 19 files total), configs/selected_mcp_unique_tasks.json, prd.json, progress.txt |
| 384 | +- **Learnings for future iterations:** |
| 385 | + - For "API consumption" oracles, use the exact import pattern (`from scipy.stats import`) to get a bounded, verifiable oracle, not `import scipy.stats` (which is broader) |
| 386 | + - Production-only vs all-files oracle: including test files gives a more realistic "complete audit" scenario |
| 387 | + - Deep Search variant design: open-ended cross-repo synthesis question + criteria.json is the right pattern |
| 388 | + - kubernetes-client-go typed client: `kubernetes/typed/apps/v1/deployment.go` has a Create() method that POSTs to the REST endpoint |
| 389 | + - etcd Put function: `server/storage/mvcc/kvstore_txn.go:Put()` is the canonical write path |
| 390 | + - criteria.json uses AAA pattern: each metric description has "Accurate: ... Attributed: ... Actionable: ..." |
| 391 | + - deepsearch_relevant in task.toml is a new field; checked that task.toml does not reject unknown fields (Harbor reads task.toml and ignores extra fields) |
| 392 | +--- |
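The import-pattern learning above can be illustrated with a short sketch. It assumes a plain regex scan over file contents; the pattern names and the `classify` helper are hypothetical, and this is not necessarily how the oracle was actually built.

```python
import re

# Narrow pattern: `from scipy.stats import ...`, including indented
# (function-local) imports. Does not match `from scipy.stats.foo import`.
FROM_IMPORT = re.compile(r"^\s*from\s+scipy\.stats\s+import\b", re.MULTILINE)

# Broader pattern: `import scipy.stats`, `import scipy.stats as st`, etc.
PLAIN_IMPORT = re.compile(r"^\s*import\s+scipy\.stats\b", re.MULTILINE)


def classify(source: str) -> dict:
    """Report which of the two import styles appear in a source string."""
    return {
        "from_import": bool(FROM_IMPORT.search(source)),
        "plain_import": bool(PLAIN_IMPORT.search(source)),
    }


if __name__ == "__main__":
    lazy = "def kde():\n    from scipy.stats import gaussian_kde\n    return gaussian_kde\n"
    plain = "import scipy.stats as st\n"
    print(classify(lazy))
    print(classify(plain))
```

Keeping the two patterns separate makes the oracle's scope explicit: a task that asks only about `from scipy.stats import` can be checked against the narrow pattern alone.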