Commit 38092ed

sjarmak and claude committed

feat: US-020 - Report generator integration for retrieval metrics

- Add collect_retrieval_data() to ccb_metrics/discovery.py: walks the runs dir and collects retrieval_metrics.json files from task output directories
- Export collect_retrieval_data from ccb_metrics/__init__.py
- Add 4 MCP Retrieval Performance table builders to generate_eval_report.py: per-task coverage/timing, per-suite aggregates, baseline vs MCP-Full comparison, and MCP tool discovery breakdown
- Backwards-compatible: section omitted when no retrieval_metrics.json found
- All py_compile checks pass; repo health gate passes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1 parent b758c58 commit 38092ed

File tree

5 files changed: +313 −3 lines


ralph-mcp-unique/prd.json

Lines changed: 1 addition & 1 deletion
@@ -603,7 +603,7 @@
         "python3 scripts/generate_eval_report.py runs without errors"
       ],
       "priority": 20,
-      "passes": false,
+      "passes": true,
       "notes": "The report should clearly show that baseline has lower oracle_coverage on mcp_only repos."
     },
     {

ralph-mcp-unique/progress.txt

Lines changed: 27 additions & 0 deletions
@@ -513,3 +513,30 @@
 - DIR_PREFIX_TO_SUITE run dir prefix: `ccb_mcp_crossrepo_tracing_` (with trailing underscore) maps to suite name `ccb_mcp_crossrepo_tracing`
 - mcp_benefit_score assigned by difficulty: easy=0.70, medium=0.80-0.85, hard=0.90-0.95 (DS variants = 0.95)
 ---
+[2026-02-20 21:39:55 UTC] Iteration 10 no story markers found
+[2026-02-20 21:39:55 UTC] Iteration 10 complete
+[2026-02-20 21:39:57 UTC] Iteration 11 started
+
+## 2026-02-20 - US-020: Report generator integration for retrieval metrics
+- Extended `scripts/generate_eval_report.py` to include 'MCP Retrieval Performance' section
+- Added `collect_retrieval_data(runs_dir)` to `scripts/ccb_metrics/discovery.py` — walks same dir structure as `discover_runs()`, collects `retrieval_metrics.json` from each task output directory
+- Exported `collect_retrieval_data` from `scripts/ccb_metrics/__init__.py`
+- Added 4 new table builder functions to `generate_eval_report.py`:
+  - `_build_retrieval_per_task()` — oracle_coverage, time-to-first-hit, repos/orgs per task
+  - `_build_retrieval_per_suite()` — mean coverage/repos per suite aggregate
+  - `_build_retrieval_comparison()` — baseline vs MCP-Full delta per task
+  - `_build_retrieval_tool_breakdown()` — which MCP tools drive discovery, aggregated per suite
+- Section is backwards-compatible: omitted when no `retrieval_metrics.json` files found
+- `generate_report()` collects retrieval data and passes it to the table builders
+- All 4 table builders return None when no data — they are guarded by `if _has_retrieval_data(retrieval_data)`
+- Quality checks: `python3 -m py_compile` passes for all changed files
+- `python3 scripts/generate_eval_report.py --help` runs without errors
+- Repo health check: PASSED
+
+- Files changed: `scripts/ccb_metrics/discovery.py`, `scripts/ccb_metrics/__init__.py`, `scripts/generate_eval_report.py`, `ralph-mcp-unique/prd.json`
+- **Learnings for future iterations:**
+  - `collect_retrieval_data` uses same skip/dedup patterns as `discover_runs` — latest batch wins
+  - Retrieval comparison table uses heuristic: baseline = config without "sourcegraph"/"mcp" in name; mcp = config with "sourcegraph_full"/"mcp_full"
+  - Table builders are None-returning optional functions — consistent with other optional tables (swebench_partial, search_patterns, etc.)
+  - Collecting retrieval data is a separate pass over the runs dir (not integrated into task discovery) to avoid schema changes to TaskMetrics
+---
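For reference, the fields the new tables read from each `retrieval_metrics.json` can be sketched as a sample payload. This is a hedged reconstruction: the field names come from this diff's table builders, while the values and tool names (`sg_search`, `sg_read_file`) are invented for illustration.

```python
import json

# Hypothetical retrieval_metrics.json payload. Field names match what the
# new table builders read; values and tool names are made up.
sample = {
    "oracle_coverage": 0.75,               # fraction of oracle files discovered
    "time_to_first_oracle_hit_ms": 12345,  # may be absent/None if never hit
    "unique_repos_touched": 3,
    "unique_orgs_touched": 2,
    "mcp_tool_counts": {"sg_search": 7, "sg_read_file": 4},
}

# Round-trip through JSON, as the collector would when reading from disk
parsed = json.loads(json.dumps(sample, indent=2))
print(parsed["oracle_coverage"])
```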

scripts/ccb_metrics/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -1,7 +1,7 @@
 """CCB Metrics — data models and extractors for CodeContextBench evaluation."""
 
 from .models import TaskMetrics, RunMetrics, EvalReport
-from .discovery import discover_runs
+from .discovery import discover_runs, collect_retrieval_data
 from .extractors import extract_run_config
 from .task_selection import (
     load_selected_tasks,
@@ -15,6 +15,7 @@
     "RunMetrics",
     "EvalReport",
     "discover_runs",
+    "collect_retrieval_data",
     "extract_run_config",
     "load_selected_tasks",
     "build_task_index",

scripts/ccb_metrics/discovery.py

Lines changed: 65 additions & 0 deletions
@@ -405,3 +405,68 @@ def discover_runs(runs_dir: str | Path) -> list[RunMetrics]:
         results.append(run)
 
     return results
+
+
+def collect_retrieval_data(
+    runs_dir: str | Path,
+) -> dict[tuple[str, str, str], dict]:
+    """Collect retrieval_metrics.json files from all task output directories.
+
+    Walks the same directory structure as :func:`discover_runs` and collects
+    ``retrieval_metrics.json`` files written by
+    ``scripts/ccb_metrics/retrieval.py``.
+
+    Args:
+        runs_dir: Path to the runs/official/ (or staging) directory.
+
+    Returns:
+        Dict mapping ``(benchmark, config_name, task_id)`` to the parsed
+        retrieval metrics dict. Empty dict if no files are found.
+        When the same task appears in multiple batch directories, the latest
+        batch's data is kept (same dedup policy as :func:`discover_runs`).
+    """
+    runs_dir = Path(runs_dir)
+    result: dict[tuple[str, str, str], dict] = {}
+
+    if not runs_dir.is_dir():
+        return result
+
+    _SKIP_PATTERNS = (
+        "archive", "__broken", "__duplicate", "__all_errored", "__partial", "__integrated"
+    )
+
+    for run_dir in sorted(runs_dir.iterdir()):
+        if not run_dir.is_dir():
+            continue
+        run_name = run_dir.name
+        if any(pat in run_name for pat in _SKIP_PATTERNS):
+            continue
+        benchmark = normalize_benchmark_name(_infer_benchmark(run_name))
+
+        for config_dir in sorted(run_dir.iterdir()):
+            if not config_dir.is_dir():
+                continue
+            config_name = config_dir.name
+
+            for batch_dir in sorted(config_dir.iterdir()):
+                if not batch_dir.is_dir() or not _is_batch_dir(batch_dir):
+                    continue
+
+                for task_dir in sorted(batch_dir.iterdir()):
+                    if not _is_task_dir(task_dir):
+                        continue
+
+                    ret_path = task_dir / "retrieval_metrics.json"
+                    if not ret_path.is_file():
+                        continue
+
+                    task_id = _extract_task_id(task_dir.name)
+                    try:
+                        data = json.loads(ret_path.read_text())
+                    except (OSError, json.JSONDecodeError):
+                        continue
+
+                    # Latest batch wins (same dedup policy as discover_runs)
+                    result[(benchmark, config_name, task_id)] = data
+
+    return result
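The walk-and-dedup behavior can be exercised standalone. The sketch below is a simplified stand-in, not the real function: it omits the benchmark/config inference and the `_is_batch_dir`/`_is_task_dir` helpers (which are not shown in this diff), but it demonstrates the "latest batch wins" policy that `sorted()` plus dict overwrite produces.

```python
import json
import tempfile
from pathlib import Path


def collect_latest(runs_dir: Path) -> dict[str, dict]:
    """Simplified sketch of the collect_retrieval_data walk: batch dirs are
    visited in sorted order, so a later batch overwrites an earlier one for
    the same task name ("latest batch wins")."""
    result: dict[str, dict] = {}
    for batch_dir in sorted(runs_dir.iterdir()):  # sorted => latest batch last
        if not batch_dir.is_dir():
            continue
        for task_dir in sorted(batch_dir.iterdir()):
            ret_path = task_dir / "retrieval_metrics.json"
            if not ret_path.is_file():
                continue
            try:
                result[task_dir.name] = json.loads(ret_path.read_text())
            except (OSError, json.JSONDecodeError):
                continue  # unreadable file: skip, same as the real collector
    return result


# Build a tiny runs tree: the same task appears in two batches.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for batch, cov in [("batch_01", 0.4), ("batch_02", 0.9)]:
        task = root / batch / "task_a"
        task.mkdir(parents=True)
        (task / "retrieval_metrics.json").write_text(
            json.dumps({"oracle_coverage": cov})
        )
    data = collect_latest(root)

print(data["task_a"]["oracle_coverage"])  # batch_02 wins: 0.9
```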

scripts/generate_eval_report.py

Lines changed: 218 additions & 1 deletion
@@ -32,7 +32,7 @@
 if str(_SCRIPT_DIR) not in sys.path:
     sys.path.insert(0, str(_SCRIPT_DIR))
 
-from ccb_metrics import discover_runs, EvalReport, RunMetrics
+from ccb_metrics import discover_runs, collect_retrieval_data, EvalReport, RunMetrics
 from ccb_metrics.task_selection import (
     load_selected_tasks,
     build_task_index,
@@ -515,6 +515,193 @@ def _build_swebench_partial(runs: list[RunMetrics]) -> Optional[tuple[list[str],
     return headers, rows
 
 
+# ---------------------------------------------------------------------------
+# MCP Retrieval Performance tables
+# ---------------------------------------------------------------------------
+
+# Type alias: (benchmark, config_name, task_id) -> retrieval metrics dict
+_RetrievalData = dict[tuple[str, str, str], dict]
+
+
+def _has_retrieval_data(retrieval_data: _RetrievalData) -> bool:
+    return bool(retrieval_data)
+
+
+def _build_retrieval_per_task(
+    runs: list[RunMetrics],
+    retrieval_data: _RetrievalData,
+) -> Optional[tuple[list[str], list[list[str]]]]:
+    """Table: per-task oracle coverage, time-to-first-hit, repos/orgs touched."""
+    # Collect rows for any task that has retrieval data
+    rows = []
+    for r in sorted(runs, key=lambda x: (x.benchmark, x.config_name)):
+        for t in sorted(r.tasks, key=lambda x: x.task_id):
+            key = (r.benchmark, r.config_name, t.task_id)
+            m = retrieval_data.get(key)
+            if m is None:
+                continue
+            ttfh = m.get("time_to_first_oracle_hit_ms")
+            rows.append([
+                r.benchmark,
+                r.config_name,
+                t.task_id,
+                _fmt(m.get("oracle_coverage")),
+                f"{int(ttfh):,}" if ttfh is not None else "-",
+                str(m.get("unique_repos_touched", 0)),
+                str(m.get("unique_orgs_touched", 0)),
+            ])
+
+    if not rows:
+        return None
+
+    headers = [
+        "Suite", "Config", "Task",
+        "Oracle Coverage", "Time-to-First-Hit (ms)",
+        "Repos Touched", "Orgs Touched",
+    ]
+    return headers, rows
+
+
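`_fmt` is defined elsewhere in `generate_eval_report.py` and is not part of this diff. A plausible stand-in (an assumption, not the repo's actual helper), together with the thousands-separator expression the per-task builder uses for time-to-first-hit, behaves like this:

```python
def fmt(value, digits: int = 2) -> str:
    """Hypothetical stand-in for the report's _fmt helper: fixed-decimal
    formatting, with "-" for missing values."""
    if value is None:
        return "-"
    return f"{value:.{digits}f}"


def fmt_ms(ttfh) -> str:
    # Same expression the per-task builder uses for time-to-first-hit
    return f"{int(ttfh):,}" if ttfh is not None else "-"


print(fmt(0.756))     # 0.76
print(fmt(None))      # -
print(fmt_ms(12345))  # 12,345
```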
+def _build_retrieval_per_suite(
+    runs: list[RunMetrics],
+    retrieval_data: _RetrievalData,
+) -> Optional[tuple[list[str], list[list[str]]]]:
+    """Table: per-suite aggregate retrieval metrics."""
+    # Group by (benchmark, config_name)
+    agg: dict[tuple[str, str], list[dict]] = {}
+    for r in runs:
+        for t in r.tasks:
+            key = (r.benchmark, r.config_name, t.task_id)
+            m = retrieval_data.get(key)
+            if m is None:
+                continue
+            gkey = (r.benchmark, r.config_name)
+            agg.setdefault(gkey, []).append(m)
+
+    if not agg:
+        return None
+
+    headers = [
+        "Suite", "Config", "Tasks",
+        "Mean Coverage", "Mean Repos Touched", "Mean Orgs Touched",
+    ]
+    rows = []
+    for (bench, config) in sorted(agg.keys()):
+        items = agg[(bench, config)]
+        n = len(items)
+        mean_cov = _safe_mean([m.get("oracle_coverage") for m in items])
+        mean_repos = _safe_mean([m.get("unique_repos_touched") for m in items])
+        mean_orgs = _safe_mean([m.get("unique_orgs_touched") for m in items])
+        rows.append([
+            bench,
+            config,
+            str(n),
+            _fmt(mean_cov),
+            _fmt(mean_repos, 1),
+            _fmt(mean_orgs, 1),
+        ])
+    return headers, rows
+
+
+def _build_retrieval_comparison(
+    runs: list[RunMetrics],
+    retrieval_data: _RetrievalData,
+) -> Optional[tuple[list[str], list[list[str]]]]:
+    """Table: baseline vs MCP-Full oracle coverage comparison per task."""
+    # Identify baseline and mcp configs
+    configs = sorted({r.config_name for r in runs})
+    # Heuristic: baseline has no "mcp" or "sourcegraph" in name; sg_full has "sourcegraph_full"
+    baseline_configs = [c for c in configs if "sourcegraph" not in c.lower() and "mcp" not in c.lower()]
+    mcp_configs = [c for c in configs if "sourcegraph_full" in c.lower() or "mcp_full" in c.lower()]
+
+    if not baseline_configs or not mcp_configs:
+        return None
+
+    # Build (benchmark, task_id) -> {config -> metrics} lookup
+    lookup: dict[tuple[str, str], dict[str, dict]] = {}
+    for r in runs:
+        for t in r.tasks:
+            key = (r.benchmark, r.config_name, t.task_id)
+            m = retrieval_data.get(key)
+            if m is None:
+                continue
+            task_key = (r.benchmark, t.task_id)
+            lookup.setdefault(task_key, {})[r.config_name] = m
+
+    rows = []
+    for (bench, task_id) in sorted(lookup.keys()):
+        cmap = lookup[(bench, task_id)]
+        for bl_config in baseline_configs:
+            for mcp_config in mcp_configs:
+                bl = cmap.get(bl_config)
+                mcp = cmap.get(mcp_config)
+                if bl is None and mcp is None:
+                    continue
+                bl_cov = bl.get("oracle_coverage") if bl else None
+                mcp_cov = mcp.get("oracle_coverage") if mcp else None
+                delta = (mcp_cov - bl_cov) if (bl_cov is not None and mcp_cov is not None) else None
+                bl_orgs = str(bl.get("unique_orgs_touched", 0)) if bl else "-"
+                mcp_orgs = str(mcp.get("unique_orgs_touched", 0)) if mcp else "-"
+                rows.append([
+                    bench,
+                    task_id,
+                    _fmt(bl_cov),
+                    _fmt(mcp_cov),
+                    _fmt(delta) if delta is not None else "-",
+                    bl_orgs,
+                    mcp_orgs,
+                ])
+
+    if not rows:
+        return None
+
+    headers = [
+        "Suite", "Task",
+        "Baseline Coverage", "MCP-Full Coverage", "Delta",
+        "Baseline Orgs", "MCP Orgs",
+    ]
+    return headers, rows
+
+
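The baseline/MCP pairing heuristic from the comparison builder can be isolated in a small sketch. The config names used here are invented for illustration; only the substring logic comes from the diff.

```python
def classify_configs(configs: list[str]) -> tuple[list[str], list[str]]:
    """Split run configs with the comparison table's heuristic:
    baseline = no "sourcegraph"/"mcp" anywhere in the name,
    mcp      = name contains "sourcegraph_full" or "mcp_full"."""
    baseline = [
        c for c in configs
        if "sourcegraph" not in c.lower() and "mcp" not in c.lower()
    ]
    mcp = [
        c for c in configs
        if "sourcegraph_full" in c.lower() or "mcp_full" in c.lower()
    ]
    return baseline, mcp


# Hypothetical config names: note "sourcegraph_search_only" is neither
# baseline nor MCP-Full under this heuristic, so it is excluded.
baseline, mcp = classify_configs(
    ["claude_baseline", "sourcegraph_full", "sourcegraph_search_only"]
)
print(baseline)  # ['claude_baseline']
print(mcp)       # ['sourcegraph_full']
```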
+def _build_retrieval_tool_breakdown(
+    runs: list[RunMetrics],
+    retrieval_data: _RetrievalData,
+) -> Optional[tuple[list[str], list[list[str]]]]:
+    """Table: which MCP tools drive oracle discovery, aggregated per suite."""
+    # Aggregate mcp_tool_counts across all tasks with retrieval data
+    # Key: (benchmark, config_name, tool_name) -> total_calls
+    tool_agg: dict[tuple[str, str, str], int] = {}
+    found_any = False
+
+    for r in runs:
+        for t in r.tasks:
+            key = (r.benchmark, r.config_name, t.task_id)
+            m = retrieval_data.get(key)
+            if m is None:
+                continue
+            mcp_counts = m.get("mcp_tool_counts") or {}
+            for tool, count in mcp_counts.items():
+                found_any = True
+                agg_key = (r.benchmark, r.config_name, tool)
+                tool_agg[agg_key] = tool_agg.get(agg_key, 0) + count
+
+    if not found_any:
+        return None
+
+    # Sort by (benchmark, config, count desc)
+    sorted_items = sorted(
+        tool_agg.items(),
+        key=lambda x: (x[0][0], x[0][1], -x[1]),
+    )
+
+    headers = ["Suite", "Config", "MCP Tool", "Total Calls"]
+    rows = [
+        [bench, config, tool, str(count)]
+        for (bench, config, tool), count in sorted_items
+    ]
+    return headers, rows
+
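The tool-breakdown sort key (suite and config ascending, call count descending) is worth seeing in isolation. A minimal sketch with invented suite, config, and tool names:

```python
# Hypothetical aggregated counts: (benchmark, config, tool) -> total_calls
tool_agg = {
    ("suite_a", "mcp_full", "sg_search"): 3,
    ("suite_a", "mcp_full", "sg_read_file"): 9,
    ("suite_b", "mcp_full", "sg_search"): 5,
}

# Same key as the builder: ascending (benchmark, config), descending count
sorted_items = sorted(tool_agg.items(), key=lambda x: (x[0][0], x[0][1], -x[1]))

rows = [[b, c, t, str(n)] for (b, c, t), n in sorted_items]
print(rows[0])  # ['suite_a', 'mcp_full', 'sg_read_file', '9']
```

Negating the count inside the tuple avoids a second sort pass with `reverse=True`, which would also flip the suite/config ordering.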
 # ---------------------------------------------------------------------------
 # Report generation
 # ---------------------------------------------------------------------------
@@ -585,6 +772,14 @@ def generate_report(
     hc_path.write_text(json.dumps(harness_configs, indent=2) + "\n")
     print(f"Written: {hc_path}")
 
+    # Collect MCP retrieval data (backwards-compatible: empty dict if no files found)
+    print(f"Collecting retrieval metrics from: {runs_dir}")
+    retrieval_data = collect_retrieval_data(runs_dir)
+    if retrieval_data:
+        print(f"Found retrieval_metrics.json for {len(retrieval_data)} task(s).")
+    else:
+        print("No retrieval_metrics.json found — MCP Retrieval Performance section will be omitted.")
+
     # Build all tables
     tables: list[tuple[str, str, list[str], list[list[str]]]] = []
@@ -649,6 +844,28 @@ def generate_report(
         h, r = mcp_corr
         tables.append(("Performance by MCP Benefit Score", "mcp_benefit_correlation", h, r))
 
+    # MCP Retrieval Performance section (only when retrieval_metrics.json data exists)
+    if _has_retrieval_data(retrieval_data):
+        ret_per_task = _build_retrieval_per_task(runs, retrieval_data)
+        if ret_per_task:
+            h, r = ret_per_task
+            tables.append(("MCP Retrieval Performance — Per Task", "retrieval_per_task", h, r))
+
+        ret_per_suite = _build_retrieval_per_suite(runs, retrieval_data)
+        if ret_per_suite:
+            h, r = ret_per_suite
+            tables.append(("MCP Retrieval Performance — Per Suite", "retrieval_per_suite", h, r))
+
+        ret_cmp = _build_retrieval_comparison(runs, retrieval_data)
+        if ret_cmp:
+            h, r = ret_cmp
+            tables.append(("MCP Retrieval Performance — Baseline vs MCP-Full", "retrieval_comparison", h, r))
+
+        ret_tools = _build_retrieval_tool_breakdown(runs, retrieval_data)
+        if ret_tools:
+            h, r = ret_tools
+            tables.append(("MCP Retrieval Performance — Tool Discovery Breakdown", "retrieval_tool_breakdown", h, r))
+
     # Write REPORT.md
     md_lines = [
         "# CodeContextBench Evaluation Report",
