Commit 7241caf

sjarmak and claude committed
docs: refine blog post with Results Explorer examples and timing detail
- Add back specific task examples (bustub-hyperloglog, CCX-compliance-124)
- Fix session count in build section (~600 for original build period)
- Add per-category timing breakdown (Org -63s, SDLC -23s)
- Expand MCP tool search strategy breakdown

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 30eb9b9 commit 7241caf

File tree

1 file changed (+9 -3 lines)


docs/BLOG_POST.md

Lines changed: 9 additions & 3 deletions
```diff
@@ -153,11 +153,13 @@ Where cost is mixed, speed is not. MCP is substantially faster across the board,
 | Wall clock | 369 | 403.1 | 356.2 | -46.8 |
 | Agent execution | 369 | 237.1 | 146.7 | -90.4 |
 
+Agent execution time (excluding infrastructure overhead) tells the cleaner story: the agent's problem-solving phase is 38% shorter with MCP. Org tasks see the biggest speedup (-63 seconds wall clock, -19.8%), because remote search eliminates the need to read many local files when source is truncated. SDLC tasks are also faster (-23 seconds, -4.4%), though the improvement is smaller since local code is already available.
+
 On suites where MCP improves reward (Org security and incident especially), you get better results faster and cheaper. Where MCP hurts (debug), you get worse results faster and at slightly higher cost. This is useful signal for figuring out where these tools are worth using.
 
 ## MCP Tool Usage Patterns
 
-Agents overwhelmingly default to keyword search. Deep Search was almost never invoked organically (6 tasks, 8 calls across 602 MCP runs). The agent relies on keyword search (4,813 calls) and file reading (6,324 calls) as its primary tools. Natural language search is used in ~42% of tasks but contributes only 587 calls. Agents seem to have a strong preference for exact keyword matching over semantic search, even when they are told outright about these tools.
+Agents overwhelmingly default to keyword search. Deep Search was almost never invoked organically (6 tasks, 8 calls across 602 MCP runs). The agent relies on keyword search (4,813 calls) and file reading (6,324 calls) as its primary MCP tools. Natural language search is used in ~42% of tasks but contributes only 587 calls vs 4,813 for keyword search. The search strategy breakdown: the vast majority of tasks use keyword-only or keyword-dominant approaches, with natural language search as a secondary fallback, and Deep Search effectively ignored. Agents seem to have a strong preference for exact keyword matching over semantic search, even when they are told outright about these tools.
 
 ## Auditable Results (Transcripts!)
 
```
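As a quick sanity check on the figures in this hunk: the 38% agent-execution speedup follows directly from the two means in the table, and the tool-call counts imply a rough per-run search rate. A minimal sketch (the per-run rate and percentage shares are derived here, not stated in the post):

```python
# Sanity-check the timing and tool-usage numbers quoted in the diff above.
# All inputs are copied from the post; the derived figures are my own arithmetic.

baseline_exec_s = 237.1  # mean agent-execution time, baseline (seconds)
mcp_exec_s = 146.7       # mean agent-execution time, MCP config (seconds)

speedup = (baseline_exec_s - mcp_exec_s) / baseline_exec_s
print(f"Agent execution is {speedup:.0%} shorter with MCP")  # 38%

# Tool-call mix across the 602 MCP runs.
keyword_calls, nl_calls, deep_search_calls = 4813, 587, 8
mcp_runs = 602
print(f"{keyword_calls / mcp_runs:.1f} keyword searches per run")
search_total = keyword_calls + nl_calls + deep_search_calls
print(f"NL search is {nl_calls / search_total:.1%} of all search calls")
```

Nothing surprising, but it confirms the -90.4 s delta and the 38% figure are consistent with each other, and puts the keyword-search dominance at roughly 8 calls per run.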

````diff
@@ -175,15 +177,19 @@ In addition to being able to navigate the results via markdowns, if you clone th
 python3 scripts/export_official_results.py --serve
 ```
 
-You get a local results explorer where you can browse every task run. It shows the full set of task runs across all suites, configs, and runs.
+You get a local results explorer where you can browse every task run. It shows 4,132 total task evaluations across all suites, configs, and runs.
 
 The Official Results Browser lets you filter by suite, task run, config, and status. Every row links to the task's repo, benchmark definition, trajectory, and audit trail.
 
+Drilling into a specific task, here's a baseline run of bustub-hyperloglog-impl-001 in the feature suite. You can see it took 2269 seconds (nearly 38 minutes), made 175 tool calls, consumed 28.5M input tokens, and scored a reward of 0.167. The full conversation history (388 messages) is right there for inspection.
+
+And an example MCP-augmented run: mcp_CCX-compliance-124 in the compliance suite. 115 seconds total, 20 tool calls, MCP ratio of 0.950, reward of 0.7419. The agent trace starts with "I'll help you audit the CSP enforcement infrastructure in Firefox" and immediately goes to the relevant dom/security/ directory via Sourcegraph MCP tools.
+
 Each task detail view has expandable sections for the tool breakdown, context metrics / IR analysis, and the complete conversation history. You can verify not just whether the agent succeeded, but how it approached the task, what tools it used, and where it went wrong or right.
 
 ## How I Built This (And What Broke)
 
-I built CodeScaleBench almost entirely with Claude Code, the same AI coding agent I used for the benchmark runs. ~1000 conversation sessions over about a month, producing the task selection pipeline, 190+ Docker environment variants, a 3,500-line IR evaluation pipeline, a 7-function oracle scoring system, and helper skills for everything from benchmark design to pre-flight validation to results QA.
+I built CodeScaleBench almost entirely with Claude Code, the same AI coding agent I used for the initial benchmark runs. ~600 conversation sessions over about a month, producing the task selection pipeline, 190+ Docker environment variants, a 3,500-line IR evaluation pipeline, a 7-function oracle scoring system, and helper skills for everything from benchmark design to pre-flight validation to results QA.
 
 The process taught me a lot about where AI-assisted development works well and where issues pop up.
 
````
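The "MCP ratio" of 0.950 shown for the CCX-compliance-124 run is consistent with a simple definition: the share of a run's tool calls that went through MCP tools, which for 20 total calls works out to 19 of 20. A hedged sketch, assuming that definition (the explorer's actual formula and field names may differ):

```python
def mcp_ratio(mcp_tool_calls: int, total_tool_calls: int) -> float:
    """Assumed definition: share of tool calls served by MCP tools.

    This is an illustrative reconstruction, not the explorer's code.
    """
    if total_tool_calls == 0:
        return 0.0
    return round(mcp_tool_calls / total_tool_calls, 3)

# The CCX-compliance-124 run above: 20 tool calls, reported ratio 0.950,
# matching 19 MCP calls out of 20 total.
print(mcp_ratio(19, 20))  # 0.95
```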
