sourcegraph
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 5 additions & 3 deletions
diff --git a/README.md b/README.md
Lines changed: 14 additions & 14 deletions
diff --git a/benchmarks/README.md b/benchmarks/README.md
Lines changed: 8 additions & 8 deletions
diff --git a/configs/selected_benchmark_tasks.json b/configs/selected_benchmark_tasks.json
Lines changed: 158 additions & 26 deletions
@@ -4,9 +4,10 @@ This file is the operational quick-reference for benchmark maintenance.
 `AGENTS.md` mirrors this file.
 
 ## Benchmark Overview
-8 SDLC phase suites + 6 MCP-unique suites. SDLC tasks measure code quality
-across phases: build, debug, design, document, fix, secure, test, understand.
-MCP-unique tasks measure org-scale cross-repo discovery and retrieval.
+8 SDLC phase suites + 8 MCP-unique suites (6 active, 2 deferred). SDLC tasks
+measure code quality across phases: build, debug, design, document, fix,
+secure, test, understand. MCP-unique tasks measure org-scale cross-repo
+discovery and retrieval.
 See `README.md` for the full suite table and `docs/TASK_CATALOG.md` for
 per-task details. See `docs/MCP_UNIQUE_TASKS.md` for the MCP-unique extension.
 
@@ -28,6 +29,7 @@ per-task details. See `docs/MCP_UNIQUE_TASKS.md` for the MCP-unique extension.
 - `docs/LEADERBOARD.md` - ranking policy
 - `docs/SUBMISSION.md` - submission format
 - `docs/SKILLS.md` - AI agent skill system overview
+- `docs/REPORT_CONTEXT.md` - paper context: design approach and preliminary results
 - `skills/` - operational runbooks for AI agents (see `skills/README.md`)
 
 ## Git Policy
 
@@ -16,17 +16,15 @@ Eight suites organized by software development lifecycle phase:
 | `ccb_design` | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
 | `ccb_fix` | Bug Repair | 25 | Diagnosing and fixing real bugs across production codebases |
 | `ccb_build` | Feature & Refactoring | 25 | New features, refactoring, dependency management |
-| `ccb_test` | Testing & QA | 20 | Code review, performance testing, code search validation |
-| `ccb_document` | Documentation | 20 | API references, architecture docs, migration guides |
+| `ccb_test` | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
+| `ccb_document` | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
 | `ccb_secure` | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
 | `ccb_debug` | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
 | **Total** | | **170** | |
 
-*ccb_test* and *ccb_document* currently have 14 and 13 tasks on disk (target 20 each); see `docs/backlog_ccb_test.json` and `docs/backlog_ccb_document.json` for the growth plan.
-
 ## MCP-Unique Suites (Org-Scale Context Retrieval)
 
-Six additional suites measure what local-only agents *cannot* do: cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.
+Six additional suites measure cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.
 
 | Suite | Category | Tasks | Description |
 |-------|----------|------:|-------------|
@@ -38,6 +36,8 @@ Six additional suites measure what local-only agents *cannot* do: cross-repo dis
 | `ccb_mcp_platform` | J: Platform Knowledge | 1 | Service template discovery and tribal knowledge |
 | **Total** | | **12** | |
 
+The table above shows the 12 tasks evaluated in official runs. The full MCP-unique catalog has 20 tasks across 8 suites (including compliance and migration, pending first runs). **Combined catalog total: 190 tasks** (170 SDLC + 20 MCP-unique).
+
 Both baseline and MCP-Full agents have access to **all repos** in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.
 
 See [docs/MCP_UNIQUE_TASKS.md](docs/MCP_UNIQUE_TASKS.md) for the full task system, authoring guide, and oracle evaluation framework. See [docs/MCP_UNIQUE_CALIBRATION.md](docs/MCP_UNIQUE_CALIBRATION.md) for oracle coverage analysis.
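
The totals quoted in this README change can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the repository. The per-suite counts are copied from the updated suite table and directory listing, the earlier `ccb_test` and `ccb_document` counts (14 and 13) come from the removed lines, and the variable names are made up for the example.

```python
# Per-suite SDLC task counts as stated in the updated README.
sdlc_counts = {
    "ccb_understand": 20,
    "ccb_design": 20,
    "ccb_fix": 25,
    "ccb_build": 25,
    "ccb_test": 20,
    "ccb_document": 20,
    "ccb_secure": 20,
    "ccb_debug": 20,
}

# Counts before this change: ccb_test and ccb_document were below target.
previous_counts = dict(sdlc_counts, ccb_test=14, ccb_document=13)

sdlc_total = sum(sdlc_counts.values())
previous_total = sum(previous_counts.values())

# Full MCP-unique catalog size quoted in the README (12 of these are
# evaluated in official runs; the rest are pending first runs).
mcp_catalog_total = 20
combined_total = sdlc_total + mcp_catalog_total

print(f"SDLC before: {previous_total}, SDLC after: {sdlc_total}")
print(f"Combined catalog: {combined_total}")
```

Under these figures the SDLC total moves from 157 to 170, and adding the 20-task MCP-unique catalog gives the combined total of 190 quoted above.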
@@ -64,10 +64,10 @@ benchmarks/              # Task definitions organized by SDLC phase + MCP-unique
   ccb_build/             #   Feature & Refactoring (25 tasks)
   ccb_debug/             #   Debugging & Investigation (20 tasks)
   ccb_design/            #   Architecture & Design (20 tasks)
-  ccb_document/          #   Documentation (13 tasks)
+  ccb_document/          #   Documentation (20 tasks)
   ccb_fix/               #   Bug Repair (25 tasks)
   ccb_secure/            #   Security & Compliance (20 tasks)
-  ccb_test/              #   Testing & QA (14 tasks)
+  ccb_test/              #   Testing & QA (20 tasks)
   ccb_understand/        #   Requirements & Discovery (20 tasks)
   ccb_mcp_crossrepo_tracing/  #   MCP-unique: cross-repo dependency tracing (3 tasks)
   ccb_mcp_security/      #   MCP-unique: vulnerability remediation (2 tasks)
@@ -81,10 +81,10 @@ configs/                 # Run configs and task selection
   build_2config.sh       #   Phase wrapper: Build (25 tasks)
   debug_2config.sh       #   Phase wrapper: Debug (20 tasks)
   design_2config.sh      #   Phase wrapper: Design (20 tasks)
-  document_2config.sh    #   Phase wrapper: Document (13 tasks)
+  document_2config.sh    #   Phase wrapper: Document (20 tasks)
   fix_2config.sh         #   Phase wrapper: Fix (25 tasks)
   secure_2config.sh      #   Phase wrapper: Secure (20 tasks)
-  test_2config.sh        #   Phase wrapper: Test (14 tasks)
+  test_2config.sh        #   Phase wrapper: Test (20 tasks)
   run_selected_tasks.sh  #   Unified runner for all tasks
   validate_one_per_benchmark.sh  # Pre-flight smoke (1 task per suite)
   selected_benchmark_tasks.json  # Canonical SDLC task selection with metadata
@@ -172,10 +172,10 @@ For the full multi-layer evaluation pipeline (verifier, LLM judge, statistical a
 
 ### SDLC Tasks
 
-The unified runner executes all 157 SDLC tasks across the 2-config matrix:
+The unified runner executes all 170 SDLC tasks across the 2-config matrix:
 
 ```bash
-# Run all 157 SDLC tasks across 2 configs
+# Run all 170 SDLC tasks across 2 configs
 bash configs/run_selected_tasks.sh
 
 # Run only the baseline config
@@ -197,16 +197,16 @@ bash configs/understand_2config.sh       # 20 Requirements & Discovery tasks
 bash configs/design_2config.sh           # 20 Architecture & Design tasks
 bash configs/debug_2config.sh            # 20 Debugging & Investigation tasks
 bash configs/secure_2config.sh           # 20 Security & Compliance tasks
-bash configs/test_2config.sh             # 14 Testing & QA tasks
-bash configs/document_2config.sh         # 13 Documentation tasks
+bash configs/test_2config.sh             # 20 Testing & QA tasks
+bash configs/document_2config.sh         # 20 Documentation tasks
 ```
 
 ### MCP-Unique Tasks
 
 MCP-unique tasks use a separate selection file:
 
 ```bash
-# Run all 12 MCP-unique tasks across 2 configs
+# Run all MCP-unique tasks across 2 configs
 bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json
 
 # Filter by use-case category
 
@@ -1,8 +1,8 @@
 # CodeContextBench Benchmarks
 
-157 tasks organized into 8 suites aligned with the software development lifecycle (SDLC). Each suite targets a distinct phase of engineering work. The canonical task selection is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json).
+170 tasks organized into 8 suites aligned with the software development lifecycle (SDLC). Each suite targets a distinct phase of engineering work. The canonical task selection is in [`selected_benchmark_tasks.json`](../configs/selected_benchmark_tasks.json).
 
-See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodology and [`docs/PRD_SDLC_SUITE_REORGANIZATION.md`](../docs/PRD_SDLC_SUITE_REORGANIZATION.md) for the migration rationale.
+See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodology.
 
 ---
 
@@ -14,11 +14,11 @@ See [`docs/TASK_SELECTION.md`](../docs/TASK_SELECTION.md) for selection methodol
 | `ccb_design` | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
 | `ccb_fix` | Bug Repair | 25 | Diagnosing and fixing real bugs across production codebases |
 | `ccb_build` | Feature & Refactoring | 25 | New features, refactoring, dependency management |
-| `ccb_test` | Testing & QA | 14 | Code review, performance testing, code search validation |
-| `ccb_document` | Documentation | 13 | API references, architecture docs, migration guides |
+| `ccb_test` | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
+| `ccb_document` | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
 | `ccb_secure` | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
 | `ccb_debug` | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
-| **Total** | | **157** | |
+| **Total** | | **170** | |
 
 ---
 
@@ -148,7 +148,7 @@ New feature implementation, code refactoring, and dependency management tasks.
 
 ---
 
-## ccb_test (14 tasks) — Testing & QA
+## ccb_test (20 tasks) — Testing & QA
 
 Code review with injected defects, performance testing, and code search validation.
 
@@ -171,7 +171,7 @@ Code review with injected defects, performance testing, and code search validati
 
 ---
 
-## ccb_document (13 tasks) — Documentation
+## ccb_document (20 tasks) — Documentation
 
 API reference generation, architecture documentation, and migration guide creation.
 
@@ -269,7 +269,7 @@ Each task follows this layout:
 ## Running Benchmarks
 
 ```bash
-# Run all 157 tasks across 2 configs (Baseline + MCP-Full)
+# Run all 170 tasks across 2 configs (Baseline + MCP-Full)
 bash configs/run_selected_tasks.sh
 
 # Run a single SDLC phase
 
@@ -5,12 +5,12 @@
     "generated_by": "SDLC suite migration from migration_map.json",
     "generated_date": "2026-02-18",
     "total_available": 835,
-    "total_selected": 164,
+    "total_selected": 170,
     "migration_source": "migration_map.json (157 mapped tasks across 8 SDLC suites)",
     "target_total": 170,
-    "target_note": "ccb_test and ccb_document target 20 each (see docs/backlog_ccb_test.json, docs/backlog_ccb_document.json)",
-    "last_updated": "2026-02-22",
-    "note": "Updated all 8 SDLC suites to use renamed on-disk task directories. Removed 2 orphaned ccb_understand tasks."
+    "target_note": "All suites at target: ccb_test=20, ccb_document=20. Phase 2 backlog fully promoted.",
+    "last_updated": "2026-02-23",
+    "note": "Promoted 7 Phase 2 backlog tasks (3 ccb_document, 4 ccb_test). Removed duplicate cgen-deps-install-001 entry."
   },
   "methodology": {
     "description": "Tasks reorganized from 17 legacy suites into 8 SDLC-phase suites. Each task retains its original scoring and metadata; only the suite assignment and task_dir are updated to reflect the new SDLC taxonomy. Tasks with status 'archived' or 'retired' in migration_map.json are excluded.",
@@ -1622,28 +1622,6 @@
       "files_count": 6,
       "files_count_source": "task_metrics_run"
     },
-    {
-      "task_id": "cgen-deps-install-001",
-      "benchmark": "ccb_build",
-      "sdlc_phase": "Implementation (feature)",
-      "language": "python",
-      "difficulty": "medium",
-      "category": "dependency-inference",
-      "repo": "",
-      "mcp_benefit_score": 0.55,
-      "mcp_breakdown": {
-        "context_complexity": 0.5,
-        "cross_file_deps": 0.4,
-        "semantic_search_potential": 0.6,
-        "task_category_weight": 0.7
-      },
-      "selection_rationale": "New SDLC task: dependency inference from DIBench",
-      "task_dir": "ccb_build/cgen-deps-install-001",
-      "context_length": 500000,
-      "context_length_source": "mcp_breakdown_proxy",
-      "files_count": 8,
-      "files_count_source": "mcp_breakdown_proxy"
-    },
     {
       "task_id": "django-composite-field-recover-001",
       "benchmark": "ccb_understand",
@@ -3018,6 +2996,160 @@
       "repo": "charmbracelet/wish",
       "mcp_benefit_score": 0.75,
       "selection_rationale": "SDLC phase task (auto-generated)"
+    },
+    {
+      "task_id": "docgen-inline-001",
+      "benchmark": "ccb_document",
+      "sdlc_phase": "Documentation",
+      "language": "python",
+      "difficulty": "medium",
+      "category": "inline_docstring_generation",
+      "repo": "django/django",
+      "mcp_benefit_score": 0.82,
+      "mcp_breakdown": {
+        "context_complexity": 0.8,
+        "cross_file_deps": 0.7,
+        "semantic_search_potential": 0.85,
+        "task_category_weight": 0.9
+      },
+      "selection_rationale": "Phase 2 backlog promotion: inline docstring generation (Python variant)",
+      "task_dir": "ccb_document/docgen-inline-001",
+      "context_length": 850000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 12,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "docgen-runbook-001",
+      "benchmark": "ccb_document",
+      "sdlc_phase": "Documentation",
+      "language": "go",
+      "difficulty": "hard",
+      "category": "runbook_writing",
+      "repo": "prometheus/prometheus",
+      "mcp_benefit_score": 0.88,
+      "mcp_breakdown": {
+        "context_complexity": 0.9,
+        "cross_file_deps": 0.85,
+        "semantic_search_potential": 0.9,
+        "task_category_weight": 0.85
+      },
+      "selection_rationale": "Phase 2 backlog promotion: operational runbook for TSDB compaction",
+      "task_dir": "ccb_document/docgen-runbook-001",
+      "context_length": 900000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 15,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "docgen-runbook-002",
+      "benchmark": "ccb_document",
+      "sdlc_phase": "Documentation",
+      "language": "cpp",
+      "difficulty": "hard",
+      "category": "runbook_writing",
+      "repo": "envoyproxy/envoy",
+      "mcp_benefit_score": 0.87,
+      "mcp_breakdown": {
+        "context_complexity": 0.9,
+        "cross_file_deps": 0.85,
+        "semantic_search_potential": 0.85,
+        "task_category_weight": 0.85
+      },
+      "selection_rationale": "Phase 2 backlog promotion: troubleshooting runbook for Envoy connection pool",
+      "task_dir": "ccb_document/docgen-runbook-002",
+      "context_length": 900000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 14,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "test-coverage-gap-001",
+      "benchmark": "ccb_test",
+      "sdlc_phase": "Testing & QA",
+      "language": "cpp",
+      "difficulty": "hard",
+      "category": "test-coverage-gap-analysis",
+      "repo": "envoyproxy/envoy",
+      "mcp_benefit_score": 0.75,
+      "mcp_breakdown": {
+        "context_complexity": 0.85,
+        "cross_file_deps": 0.7,
+        "semantic_search_potential": 0.7,
+        "task_category_weight": 0.75
+      },
+      "selection_rationale": "Phase 2 backlog promotion: coverage gap analysis for Envoy HTTP filter chain",
+      "task_dir": "ccb_test/test-coverage-gap-001",
+      "context_length": 800000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 10,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "test-coverage-gap-002",
+      "benchmark": "ccb_test",
+      "sdlc_phase": "Testing & QA",
+      "language": "java",
+      "difficulty": "hard",
+      "category": "test-coverage-gap-analysis",
+      "repo": "apache/kafka",
+      "mcp_benefit_score": 0.73,
+      "mcp_breakdown": {
+        "context_complexity": 0.8,
+        "cross_file_deps": 0.7,
+        "semantic_search_potential": 0.7,
+        "task_category_weight": 0.75
+      },
+      "selection_rationale": "Phase 2 backlog promotion: coverage gap analysis for Kafka consumer group coordinator",
+      "task_dir": "ccb_test/test-coverage-gap-002",
+      "context_length": 800000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 10,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "test-integration-002",
+      "benchmark": "ccb_test",
+      "sdlc_phase": "Testing & QA",
+      "language": "go",
+      "difficulty": "hard",
+      "category": "integration-test-authoring",
+      "repo": "navidrome/navidrome",
+      "mcp_benefit_score": 0.76,
+      "mcp_breakdown": {
+        "context_complexity": 0.8,
+        "cross_file_deps": 0.75,
+        "semantic_search_potential": 0.75,
+        "task_category_weight": 0.75
+      },
+      "selection_rationale": "Phase 2 backlog promotion: integration tests for media scanning pipeline",
+      "task_dir": "ccb_test/test-integration-002",
+      "context_length": 600000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 8,
+      "files_count_source": "mcp_breakdown_proxy"
+    },
+    {
+      "task_id": "test-unitgen-py-001",
+      "benchmark": "ccb_test",
+      "sdlc_phase": "Testing & QA",
+      "language": "python",
+      "difficulty": "medium",
+      "category": "unit-test-generation",
+      "repo": "django/django",
+      "mcp_benefit_score": 0.78,
+      "mcp_breakdown": {
+        "context_complexity": 0.8,
+        "cross_file_deps": 0.7,
+        "semantic_search_potential": 0.8,
+        "task_category_weight": 0.8
+      },
+      "selection_rationale": "Phase 2 backlog promotion: unit test generation for Django cache middleware (Python variant)",
+      "task_dir": "ccb_test/test-unitgen-py-001",
+      "context_length": 500000,
+      "context_length_source": "mcp_breakdown_proxy",
+      "files_count": 8,
+      "files_count_source": "mcp_breakdown_proxy"
     }
   ]
 }
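
The metadata note in this file records a removed duplicate `cgen-deps-install-001` entry and a `total_selected` of 170 (164 previously, plus 7 promoted tasks, minus 1 duplicate). A consistency check along the following lines could catch that class of drift before a run. This is a hedged sketch, not a script from the repository: the top-level key names `metadata` and `tasks` are assumptions (the diff does not show them), `check_selection` is a hypothetical helper, and the sample data is a tiny stand-in built from task IDs that appear in this diff.

```python
from collections import Counter


def check_selection(doc, meta_key="metadata", tasks_key="tasks"):
    """Report duplicate task IDs, per-suite counts, and whether the
    declared total matches the number of entries.

    meta_key and tasks_key are assumed names; adjust them to the
    actual layout of the selection file.
    """
    tasks = doc[tasks_key]
    id_counts = Counter(task["task_id"] for task in tasks)
    duplicates = sorted(tid for tid, n in id_counts.items() if n > 1)
    per_suite = Counter(task["benchmark"] for task in tasks)
    declared = doc[meta_key]["total_selected"]
    return {
        "duplicates": duplicates,
        "per_suite": dict(per_suite),
        "declared_total": declared,
        "actual_total": len(tasks),
        "total_matches": declared == len(tasks),
    }


# Stand-in data: three distinct tasks, one of them listed twice.
sample = {
    "metadata": {"total_selected": 3},
    "tasks": [
        {"task_id": "cgen-deps-install-001", "benchmark": "ccb_build"},
        {"task_id": "docgen-inline-001", "benchmark": "ccb_document"},
        {"task_id": "test-unitgen-py-001", "benchmark": "ccb_test"},
        {"task_id": "cgen-deps-install-001", "benchmark": "ccb_build"},
    ],
}

report = check_selection(sample)
print(report)
```

For the real file, the same function could be applied to the result of `json.load` on `configs/selected_benchmark_tasks.json`, provided the assumed key names are replaced with the ones the file actually uses.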