Commit 0447be5

committed
Add baseline-vs-MCP precision/recall/F1 IR tables
1 parent 1f61178 commit 0447be5

4 files changed: +48 -12 lines changed

docs/BLOG_POST.md

Lines changed: 4 additions & 4 deletions
@@ -157,10 +157,10 @@ The refreshed retrieval pipeline run confirms moderate retrieval quality overall

 On the computable subset, aggregated baseline vs MCP retrieval metrics are:

-| Config Type | n | File Recall | MRR | MAP | Context Efficiency |
-|-------------|---|-------------|-----|-----|--------------------|
-| baseline | 132 | 0.330 | 0.346 | 0.231 | 0.184 |
-| mcp | 179 | 0.556 | 0.378 | 0.267 | 0.204 |
+| Config Type | n | File Recall | Precision@5 | Recall@5 | F1@5 | MRR |
+|-------------|---|-------------|-------------|----------|------|-----|
+| baseline | 132 | 0.330 | 0.212 | 0.237 | 0.185 | 0.346 |
+| mcp | 179 | 0.556 | 0.215 | 0.248 | 0.200 | 0.378 |

 But better retrieval doesn't always mean better outcomes. Still investigating this but likely finding the right files is necessary but not sufficient. The agent still has to correctly apply what it finds, and in some tasks the local code modification step is where removing local code availability from the MCP run environment hurts more than others.

docs/analysis/analysis_refresh_tables_20260303.json

Lines changed: 20 additions & 2 deletions
@@ -440,14 +440,32 @@
 "file_recall": 0.3295,
 "mrr": 0.3462,
 "map_score": 0.2307,
-"context_efficiency": 0.1843
+"context_efficiency": 0.1843,
+"precision@1": 0.2727,
+"recall@1": 0.0769,
+"f1@1": 0.1048,
+"precision@5": 0.2121,
+"recall@5": 0.2368,
+"f1@5": 0.185,
+"precision@10": 0.1424,
+"recall@10": 0.2848,
+"f1@10": 0.1579
 },
 "mcp": {
 "n": 179,
 "file_recall": 0.5558,
 "mrr": 0.3778,
 "map_score": 0.2667,
-"context_efficiency": 0.2043
+"context_efficiency": 0.2043,
+"precision@1": 0.3073,
+"recall@1": 0.1005,
+"f1@1": 0.1309,
+"precision@5": 0.2145,
+"recall@5": 0.2476,
+"f1@5": 0.2001,
+"precision@10": 0.1419,
+"recall@10": 0.3227,
+"f1@10": 0.1724
 }
 },
 "ir_by_config": {
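
A quick sanity check on the added `f1@5` values, written as a minimal sketch. It only uses the numbers from the hunk above; the `reported` dictionary and the `harmonic_mean` helper are illustrative names, not part of the repository. The reported F1@5 is lower than the harmonic mean of the aggregated precision@5 and recall@5, which is consistent with (though it does not prove) F1 being computed per task and then averaged.

```python
# Values copied from the JSON hunk above; the structure here is illustrative only.
reported = {
    "baseline": {"precision@5": 0.2121, "recall@5": 0.2368, "f1@5": 0.185},
    "mcp": {"precision@5": 0.2145, "recall@5": 0.2476, "f1@5": 0.2001},
}


def harmonic_mean(p: float, r: float) -> float:
    """F1 computed from a single precision/recall pair."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


gaps = {}
for config, metrics in reported.items():
    f1_of_means = harmonic_mean(metrics["precision@5"], metrics["recall@5"])
    # A positive gap means the reported F1 is NOT the harmonic mean of the aggregates.
    gaps[config] = f1_of_means - metrics["f1@5"]
    print(f"{config}: F1 of aggregates={f1_of_means:.4f}, reported={metrics['f1@5']:.4f}")
```

If the aggregation really is a per-task average, the table's F1@5 column should be read as "mean of per-task F1", not as a function of the Precision@5 and Recall@5 columns beside it.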

docs/analysis/analysis_set_metrics_20260303.json

Lines changed: 20 additions & 2 deletions
@@ -440,14 +440,32 @@
 "file_recall": 0.3295,
 "mrr": 0.3462,
 "map_score": 0.2307,
-"context_efficiency": 0.1843
+"context_efficiency": 0.1843,
+"precision@1": 0.2727,
+"recall@1": 0.0769,
+"f1@1": 0.1048,
+"precision@5": 0.2121,
+"recall@5": 0.2368,
+"f1@5": 0.185,
+"precision@10": 0.1424,
+"recall@10": 0.2848,
+"f1@10": 0.1579
 },
 "mcp": {
 "n": 179,
 "file_recall": 0.5558,
 "mrr": 0.3778,
 "map_score": 0.2667,
-"context_efficiency": 0.2043
+"context_efficiency": 0.2043,
+"precision@1": 0.3073,
+"recall@1": 0.1005,
+"f1@1": 0.1309,
+"precision@5": 0.2145,
+"recall@5": 0.2476,
+"f1@5": 0.2001,
+"precision@10": 0.1419,
+"recall@10": 0.3227,
+"f1@10": 0.1724
 }
 },
 "ir_by_config": {

docs/technical_reports/TECHNICAL_REPORT_V2.md

Lines changed: 4 additions & 4 deletions
@@ -942,10 +942,10 @@ This indicates retrieval quality remains moderate on computable tasks, but groun

 **IR aggregates by configuration type (baseline vs MCP):**

-| Config Type | n | File Recall | MRR | MAP | Context Efficiency |
-|-------------|---|-------------|-----|-----|--------------------|
-| baseline | 132 | 0.3295 | 0.3462 | 0.2307 | 0.1843 |
-| mcp | 179 | 0.5558 | 0.3778 | 0.2667 | 0.2043 |
+| Config Type | n | File Recall | Precision@5 | Recall@5 | F1@5 | MRR |
+|-------------|---|-------------|-------------|----------|------|-----|
+| baseline | 132 | 0.3295 | 0.2121 | 0.2368 | 0.1850 | 0.3462 |
+| mcp | 179 | 0.5558 | 0.2145 | 0.2476 | 0.2001 | 0.3778 |

 MCP runs show higher recall and slightly higher ranking/efficiency metrics on computable retrieval tasks.
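
For readers unfamiliar with the new columns, here is a minimal sketch of how precision@k, recall@k and F1@k are commonly computed per task and then macro-averaged. The commit does not show the repository's actual implementation, so the function names (`prf_at_k`, `macro_average`), the toy file lists, and the convention of dividing precision by `k` even when fewer than `k` files were retrieved are all assumptions.

```python
from typing import Dict, List, Sequence, Set


def prf_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> Dict[str, float]:
    """Precision@k, recall@k and F1@k for one task.

    Assumption: precision divides by k, not by the number of files actually retrieved.
    """
    top_k = list(dict.fromkeys(retrieved))[:k]  # de-duplicate while keeping rank order
    hits = sum(1 for path in top_k if path in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def macro_average(per_task: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each metric over tasks (every task weighs the same)."""
    if not per_task:
        return {}
    return {key: sum(t[key] for t in per_task) / len(per_task) for key in per_task[0]}


# Hypothetical toy data: (ranked retrieved files, ground-truth relevant files).
tasks = [
    (["a.py", "b.py", "c.py", "d.py", "e.py"], {"a.py", "x.py"}),
    (["m.py", "n.py"], {"n.py"}),
]
scores = [prf_at_k(retrieved, relevant, k=5) for retrieved, relevant in tasks]
summary = macro_average(scores)
print(summary)
```

Under this convention a task that retrieves fewer than `k` files is penalized on precision, which is one plausible reason Precision@5 could stay nearly flat between configurations while File Recall differs substantially.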
