
Commit 0843578

sjarmak and claude committed
docs: promote 7 backlog tasks to 170, fix linux base images, update all counts
- Promote 7 Phase 2 backlog tasks (3 ccb_document, 4 ccb_test) to selection
- Remove duplicate cgen-deps-install-001 entry from selection file
- Fix build_linux_base_images.sh to try stable kernel tree before mainline
- Add docs/REPORT_CONTEXT.md with paper context and preliminary results
- Update task counts (157→170 SDLC, 182 total) across 15 documentation files
- Mark backlog_ccb_document.json and backlog_ccb_test.json as completed
- Modernize AGENT_INTERFACE.md to reflect 2-config model (drop SG_base)
- Fix WORKFLOW_METRICS.md category table to use actual CCB suite names
- Fix SCORING_SEMANTICS.md section header and add suite structure context

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
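The duplicate removal and the count update listed in this message can be checked mechanically. The following is a minimal Python sketch, not part of the commit: the helper names `find_duplicate_task_ids` and `count_by_benchmark`, the `SUITE_TARGETS` dict, and the inline `sample_tasks` list are hypothetical and exist only for illustration. The `task_id` and `benchmark` field names come from the selection-file entries in the diff, and the per-suite targets are copied from the README suite table and directory listing in this commit.

```python
from collections import Counter

# Per-suite task counts as stated in the README after this commit.
SUITE_TARGETS = {
    "ccb_understand": 20,
    "ccb_design": 20,
    "ccb_fix": 25,
    "ccb_build": 25,
    "ccb_test": 20,
    "ccb_document": 20,
    "ccb_secure": 20,
    "ccb_debug": 20,
}


def find_duplicate_task_ids(tasks):
    """Return the task_ids that appear more than once, sorted."""
    counts = Counter(task["task_id"] for task in tasks)
    return sorted(task_id for task_id, n in counts.items() if n > 1)


def count_by_benchmark(tasks):
    """Return a dict mapping each suite name to its number of entries."""
    return dict(Counter(task["benchmark"] for task in tasks))


# Hypothetical sample: the repeated cgen-deps-install-001 entry imitates
# the kind of duplicate this commit removed from the selection file.
sample_tasks = [
    {"task_id": "cgen-deps-install-001", "benchmark": "ccb_build"},
    {"task_id": "cgen-deps-install-001", "benchmark": "ccb_build"},
    {"task_id": "docgen-inline-001", "benchmark": "ccb_document"},
    {"task_id": "test-unitgen-py-001", "benchmark": "ccb_test"},
]

duplicates = find_duplicate_task_ids(sample_tasks)
suite_counts = count_by_benchmark(sample_tasks)

print("duplicates:", duplicates)
print("entries per suite:", suite_counts)
print("SDLC target total:", sum(SUITE_TARGETS.values()))
```

Run against the real task list, an empty `duplicates` result and a `suite_counts` dict equal to `SUITE_TARGETS` would be consistent with the 170-task total claimed above. How the task list is keyed inside `selected_benchmark_tasks.json` is not visible in this diff, so loading the file is left out of the sketch.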
1 parent fd41857 commit 0843578

18 files changed: +657 -118 lines changed

CLAUDE.md

Lines changed: 5 additions & 3 deletions
@@ -4,9 +4,10 @@ This file is the operational quick-reference for benchmark maintenance.
 `AGENTS.md` mirrors this file.
 
 ## Benchmark Overview
-8 SDLC phase suites + 6 MCP-unique suites. SDLC tasks measure code quality
-across phases: build, debug, design, document, fix, secure, test, understand.
-MCP-unique tasks measure org-scale cross-repo discovery and retrieval.
+8 SDLC phase suites + 8 MCP-unique suites (6 active, 2 deferred). SDLC tasks
+measure code quality across phases: build, debug, design, document, fix,
+secure, test, understand. MCP-unique tasks measure org-scale cross-repo
+discovery and retrieval.
 See `README.md` for the full suite table and `docs/TASK_CATALOG.md` for
 per-task details. See `docs/MCP_UNIQUE_TASKS.md` for the MCP-unique extension.
 
@@ -28,6 +29,7 @@ per-task details. See `docs/MCP_UNIQUE_TASKS.md` for the MCP-unique extension.
 - `docs/LEADERBOARD.md` - ranking policy
 - `docs/SUBMISSION.md` - submission format
 - `docs/SKILLS.md` - AI agent skill system overview
+- `docs/REPORT_CONTEXT.md` - paper context: design approach and preliminary results
 - `skills/` - operational runbooks for AI agents (see `skills/README.md`)
 
 ## Git Policy

README.md

Lines changed: 14 additions & 14 deletions
@@ -16,17 +16,15 @@ Eight suites organized by software development lifecycle phase:
 | `ccb_design` | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
 | `ccb_fix` | Bug Repair | 25 | Diagnosing and fixing real bugs across production codebases |
 | `ccb_build` | Feature & Refactoring | 25 | New features, refactoring, dependency management |
-| `ccb_test` | Testing & QA | 20 | Code review, performance testing, code search validation |
-| `ccb_document` | Documentation | 20 | API references, architecture docs, migration guides |
+| `ccb_test` | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
+| `ccb_document` | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
 | `ccb_secure` | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
 | `ccb_debug` | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
 | **Total** | | **170** | |
 
-*ccb_test* and *ccb_document* currently have 14 and 13 tasks on disk (target 20 each); see `docs/backlog_ccb_test.json` and `docs/backlog_ccb_document.json` for the growth plan.
-
 ## MCP-Unique Suites (Org-Scale Context Retrieval)
 
-Six additional suites measure what local-only agents *cannot* do: cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.
+Six additional suites measure cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.
 
 | Suite | Category | Tasks | Description |
 |-------|----------|------:|-------------|
@@ -38,6 +36,8 @@ Six additional suites measure what local-only agents *cannot* do: cross-repo dis
 | `ccb_mcp_platform` | J: Platform Knowledge | 1 | Service template discovery and tribal knowledge |
 | **Total** | | **12** | |
 
+The table above shows the 12 tasks evaluated in official runs. The full MCP-unique catalog has 20 tasks across 8 suites (including compliance and migration, pending first runs). **Combined catalog total: 190 tasks** (170 SDLC + 20 MCP-unique).
+
 Both baseline and MCP-Full agents have access to **all repos** in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.
 
 See [docs/MCP_UNIQUE_TASKS.md](docs/MCP_UNIQUE_TASKS.md) for the full task system, authoring guide, and oracle evaluation framework. See [docs/MCP_UNIQUE_CALIBRATION.md](docs/MCP_UNIQUE_CALIBRATION.md) for oracle coverage analysis.
@@ -64,10 +64,10 @@ benchmarks/ # Task definitions organized by SDLC phase + MCP-unique
 ccb_build/ # Feature & Refactoring (25 tasks)
 ccb_debug/ # Debugging & Investigation (20 tasks)
 ccb_design/ # Architecture & Design (20 tasks)
-ccb_document/ # Documentation (13 tasks)
+ccb_document/ # Documentation (20 tasks)
 ccb_fix/ # Bug Repair (25 tasks)
 ccb_secure/ # Security & Compliance (20 tasks)
-ccb_test/ # Testing & QA (14 tasks)
+ccb_test/ # Testing & QA (20 tasks)
 ccb_understand/ # Requirements & Discovery (20 tasks)
 ccb_mcp_crossrepo_tracing/ # MCP-unique: cross-repo dependency tracing (3 tasks)
 ccb_mcp_security/ # MCP-unique: vulnerability remediation (2 tasks)
@@ -81,10 +81,10 @@ configs/ # Run configs and task selection
 build_2config.sh # Phase wrapper: Build (25 tasks)
 debug_2config.sh # Phase wrapper: Debug (20 tasks)
 design_2config.sh # Phase wrapper: Design (20 tasks)
-document_2config.sh # Phase wrapper: Document (13 tasks)
+document_2config.sh # Phase wrapper: Document (20 tasks)
 fix_2config.sh # Phase wrapper: Fix (25 tasks)
 secure_2config.sh # Phase wrapper: Secure (20 tasks)
-test_2config.sh # Phase wrapper: Test (14 tasks)
+test_2config.sh # Phase wrapper: Test (20 tasks)
 run_selected_tasks.sh # Unified runner for all tasks
 validate_one_per_benchmark.sh # Pre-flight smoke (1 task per suite)
 selected_benchmark_tasks.json # Canonical SDLC task selection with metadata
@@ -172,10 +172,10 @@ For the full multi-layer evaluation pipeline (verifier, LLM judge, statistical a
 
 ### SDLC Tasks
 
-The unified runner executes all 157 SDLC tasks across the 2-config matrix:
+The unified runner executes all 170 SDLC tasks across the 2-config matrix:
 
 ```bash
-# Run all 157 SDLC tasks across 2 configs
+# Run all 170 SDLC tasks across 2 configs
 bash configs/run_selected_tasks.sh
 
 # Run only the baseline config
@@ -197,16 +197,16 @@ bash configs/understand_2config.sh # 20 Requirements & Discovery tasks
 bash configs/design_2config.sh # 20 Architecture & Design tasks
 bash configs/debug_2config.sh # 20 Debugging & Investigation tasks
 bash configs/secure_2config.sh # 20 Security & Compliance tasks
-bash configs/test_2config.sh # 14 Testing & QA tasks
-bash configs/document_2config.sh # 13 Documentation tasks
+bash configs/test_2config.sh # 20 Testing & QA tasks
+bash configs/document_2config.sh # 20 Documentation tasks
 ```
 
 ### MCP-Unique Tasks
 
 MCP-unique tasks use a separate selection file:
 
 ```bash
-# Run all 12 MCP-unique tasks across 2 configs
+# Run all MCP-unique tasks across 2 configs
 bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json
 
 # Filter by use-case category

benchmarks/README.md

Lines changed: 8 additions & 8 deletions
@@ -1,8 +1,8 @@
 # CodeContextBench Benchmarks
 
-157 tasks organized into 8 suites aligned with the software development lifecycle (SDLC). Each suite targets a distinct phase of engineering work. The canonical task selection is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json).
+170 tasks organized into 8 suites aligned with the software development lifecycle (SDLC). Each suite targets a distinct phase of engineering work. The canonical task selection is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json).
 
-See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodology and [`docs/PRD_SDLC_SUITE_REORGANIZATION.md`](../docs/PRD_SDLC_SUITE_REORGANIZATION.md) for the migration rationale.
+See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodology.
 
 ---
 
@@ -14,11 +14,11 @@ See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodol
 | `ccb_design` | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
 | `ccb_fix` | Bug Repair | 25 | Diagnosing and fixing real bugs across production codebases |
 | `ccb_build` | Feature & Refactoring | 25 | New features, refactoring, dependency management |
-| `ccb_test` | Testing & QA | 14 | Code review, performance testing, code search validation |
-| `ccb_document` | Documentation | 13 | API references, architecture docs, migration guides |
+| `ccb_test` | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
+| `ccb_document` | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
 | `ccb_secure` | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
 | `ccb_debug` | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
-| **Total** | | **157** | |
+| **Total** | | **170** | |
 
 ---
 
@@ -148,7 +148,7 @@ New feature implementation, code refactoring, and dependency management tasks.
 
 ---
 
-## ccb_test (14 tasks) — Testing & QA
+## ccb_test (20 tasks) — Testing & QA
 
 Code review with injected defects, performance testing, and code search validation.
 
@@ -171,7 +171,7 @@ Code review with injected defects, performance testing, and code search validati
 
 ---
 
-## ccb_document (13 tasks) — Documentation
+## ccb_document (20 tasks) — Documentation
 
 API reference generation, architecture documentation, and migration guide creation.
 
@@ -269,7 +269,7 @@ Each task follows this layout:
 ## Running Benchmarks
 
 ```bash
-# Run all 157 tasks across 2 configs (Baseline + MCP-Full)
+# Run all 170 tasks across 2 configs (Baseline + MCP-Full)
 bash configs/run_selected_tasks.sh
 
 # Run a single SDLC phase

configs/selected_benchmark_tasks.json

Lines changed: 158 additions & 26 deletions
@@ -5,12 +5,12 @@
 "generated_by": "SDLC suite migration from migration_map.json",
 "generated_date": "2026-02-18",
 "total_available": 835,
-"total_selected": 164,
+"total_selected": 170,
 "migration_source": "migration_map.json (157 mapped tasks across 8 SDLC suites)",
 "target_total": 170,
-"target_note": "ccb_test and ccb_document target 20 each (see docs/backlog_ccb_test.json, docs/backlog_ccb_document.json)",
-"last_updated": "2026-02-22",
-"note": "Updated all 8 SDLC suites to use renamed on-disk task directories. Removed 2 orphaned ccb_understand tasks."
+"target_note": "All suites at target: ccb_test=20, ccb_document=20. Phase 2 backlog fully promoted.",
+"last_updated": "2026-02-23",
+"note": "Promoted 7 Phase 2 backlog tasks (3 ccb_document, 4 ccb_test). Removed duplicate cgen-deps-install-001 entry."
 },
 "methodology": {
 "description": "Tasks reorganized from 17 legacy suites into 8 SDLC-phase suites. Each task retains its original scoring and metadata; only the suite assignment and task_dir are updated to reflect the new SDLC taxonomy. Tasks with status 'archived' or 'retired' in migration_map.json are excluded.",
@@ -1622,28 +1622,6 @@
 "files_count": 6,
 "files_count_source": "task_metrics_run"
 },
-{
-"task_id": "cgen-deps-install-001",
-"benchmark": "ccb_build",
-"sdlc_phase": "Implementation (feature)",
-"language": "python",
-"difficulty": "medium",
-"category": "dependency-inference",
-"repo": "",
-"mcp_benefit_score": 0.55,
-"mcp_breakdown": {
-"context_complexity": 0.5,
-"cross_file_deps": 0.4,
-"semantic_search_potential": 0.6,
-"task_category_weight": 0.7
-},
-"selection_rationale": "New SDLC task: dependency inference from DIBench",
-"task_dir": "ccb_build/cgen-deps-install-001",
-"context_length": 500000,
-"context_length_source": "mcp_breakdown_proxy",
-"files_count": 8,
-"files_count_source": "mcp_breakdown_proxy"
-},
 {
 "task_id": "django-composite-field-recover-001",
 "benchmark": "ccb_understand",
@@ -3018,6 +2996,160 @@
 "repo": "charmbracelet/wish",
 "mcp_benefit_score": 0.75,
 "selection_rationale": "SDLC phase task (auto-generated)"
+},
+{
+"task_id": "docgen-inline-001",
+"benchmark": "ccb_document",
+"sdlc_phase": "Documentation",
+"language": "python",
+"difficulty": "medium",
+"category": "inline_docstring_generation",
+"repo": "django/django",
+"mcp_benefit_score": 0.82,
+"mcp_breakdown": {
+"context_complexity": 0.8,
+"cross_file_deps": 0.7,
+"semantic_search_potential": 0.85,
+"task_category_weight": 0.9
+},
+"selection_rationale": "Phase 2 backlog promotion: inline docstring generation (Python variant)",
+"task_dir": "ccb_document/docgen-inline-001",
+"context_length": 850000,
+"context_length_source": "mcp_breakdown_proxy",
+"files_count": 12,
+"files_count_source": "mcp_breakdown_proxy"
+},
+{
+"task_id": "docgen-runbook-001",
+"benchmark": "ccb_document",
+"sdlc_phase": "Documentation",
+"language": "go",
+"difficulty": "hard",
+"category": "runbook_writing",
+"repo": "prometheus/prometheus",
+"mcp_benefit_score": 0.88,
+"mcp_breakdown": {
+"context_complexity": 0.9,
+"cross_file_deps": 0.85,
+"semantic_search_potential": 0.9,
+"task_category_weight": 0.85
+},
+"selection_rationale": "Phase 2 backlog promotion: operational runbook for TSDB compaction",
+"task_dir": "ccb_document/docgen-runbook-001",
+"context_length": 900000,
+"context_length_source": "mcp_breakdown_proxy",
+"files_count": 15,
+"files_count_source": "mcp_breakdown_proxy"
+},
+{
+"task_id": "docgen-runbook-002",
+"benchmark": "ccb_document",
+"sdlc_phase": "Documentation",
+"language": "cpp",
+"difficulty": "hard",
+"category": "runbook_writing",
+"repo": "envoyproxy/envoy",
+"mcp_benefit_score": 0.87,
+"mcp_breakdown": {
+"context_complexity": 0.9,
+"cross_file_deps": 0.85,
+"semantic_search_potential": 0.85,
+"task_category_weight": 0.85
+},
+"selection_rationale": "Phase 2 backlog promotion: troubleshooting runbook for Envoy connection pool",
+"task_dir": "ccb_document/docgen-runbook-002",
+"context_length": 900000,
+"context_length_source": "mcp_breakdown_proxy",
+"files_count": 14,
+"files_count_source": "mcp_breakdown_proxy"
+},
+{
+"task_id": "test-coverage-gap-001",
+"benchmark": "ccb_test",
+"sdlc_phase": "Testing & QA",
+"language": "cpp",
+"difficulty": "hard",
+"category": "test-coverage-gap-analysis",
+"repo": "envoyproxy/envoy",
+"mcp_benefit_score": 0.75,
+"mcp_breakdown": {
+"context_complexity": 0.85,
+"cross_file_deps": 0.7,
+"semantic_search_potential": 0.7,
+"task_category_weight": 0.75
+},
+"selection_rationale": "Phase 2 backlog promotion: coverage gap analysis for Envoy HTTP filter chain",
+"task_dir": "ccb_test/test-coverage-gap-001",
+"context_length": 800000,
+"context_length_source": "mcp_breakdown_proxy",
+"files_count": 10,
+"files_count_source": "mcp_breakdown_proxy"
+},
+{
+"task_id": "test-coverage-gap-002",
+"benchmark": "ccb_test",
+"sdlc_phase": "Testing & QA",
+"language": "java",
+"difficulty": "hard",
+"category": "test-coverage-gap-analysis",
+"repo": "apache/kafka",
+"mcp_benefit_score": 0.73,
+"mcp_breakdown": {
+"context_complexity": 0.8,
+"cross_file_deps": 0.7,
+"semantic_search_potential": 0.7,
+"task_category_weight": 0.75
+},
+"selection_rationale": "Phase 2 backlog promotion: coverage gap analysis for Kafka consumer group coordinator",
+"task_dir": "ccb_test/test-coverage-gap-002",
+"context_length": 800000,
+"context_length_source": "mcp_breakdown_proxy",
+"files_count": 10,
+"files_count_source": "mcp_breakdown_proxy"
+},
+{
+"task_id": "test-integration-002",
+"benchmark": "ccb_test",
+"sdlc_phase": "Testing & QA",
+"language": "go",
+"difficulty": "hard",
+"category": "integration-test-authoring",
+"repo": "navidrome/navidrome",
+"mcp_benefit_score": 0.76,
+"mcp_breakdown": {
+"context_complexity": 0.8,
+"cross_file_deps": 0.75,
+"semantic_search_potential": 0.75,
+"task_category_weight": 0.75
+},
+"selection_rationale": "Phase 2 backlog promotion: integration tests for media scanning pipeline",
+"task_dir": "ccb_test/test-integration-002",
+"context_length": 600000,
+"context_length_source": "mcp_breakdown_proxy",
+"files_count": 8,
+"files_count_source": "mcp_breakdown_proxy"
+},
+{
+"task_id": "test-unitgen-py-001",
+"benchmark": "ccb_test",
+"sdlc_phase": "Testing & QA",
+"language": "python",
+"difficulty": "medium",
+"category": "unit-test-generation",
+"repo": "django/django",
+"mcp_benefit_score": 0.78,
+"mcp_breakdown": {
+"context_complexity": 0.8,
+"cross_file_deps": 0.7,
+"semantic_search_potential": 0.8,
+"task_category_weight": 0.8
+},
+"selection_rationale": "Phase 2 backlog promotion: unit test generation for Django cache middleware (Python variant)",
+"task_dir": "ccb_test/test-unitgen-py-001",
+"context_length": 500000,
+"context_length_source": "mcp_breakdown_proxy",
+"files_count": 8,
+"files_count_source": "mcp_breakdown_proxy"
 }
 ]
 }
