Commit c7d78c5

sjarmak and claude committed
Fix GT coverage pipeline: 42% → 99.1% benchmark event coverage
- Move benchmarks/ccb_contextbench/ → calibration/curator_calibration/ to stop calibration tasks from polluting benchmark GT scans
- Fix audit_gt_coverage.py: don't short-circuit on invalid-schema ground_truth.json when valid oracle_answer.json exists; recognize expected_files format from scaling-gap tasks
- Fix update_gt_registry.py: handle expected_files format; replace stale entries instead of accumulating them (248 → 402 entries)
- Fix normalize_retrieval_events.py: strip bl_/sgonly_/artifact_ prefixes and Harbor suffixes in _normalize_task_name(); add case-insensitive GT registry fallback for uppercase CCX- task names
- Update script references for contextbench path change
- Clean up 958 stale prefixed event files from retrieval_events/

Benchmark task GT: 100% (404/404 tasks)
Benchmark event GT: 99.1% (4516/4559 events)
Registry: 402 entries covering all benchmark tasks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
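
The normalize_retrieval_events.py fix described above can be sketched roughly as follows. This is an illustration only, not the code from the commit: the three prefixes come from the commit message, but the shape of a "Harbor suffix" is not shown anywhere on this page, so the `HARBOR_SUFFIX` pattern is a placeholder assumption. The function names `normalize_task_name` and `lookup_gt`, and the registry being a plain dict, are likewise hypothetical.

```python
import re

# Run-type prefixes named in the commit message.
RUN_PREFIXES = ("bl_", "sgonly_", "artifact_")

# ASSUMPTION: the commit does not show what a Harbor suffix looks like.
# A trailing "__" plus exactly 7 alphanumeric characters is used here
# purely for illustration; substitute the real pattern.
HARBOR_SUFFIX = re.compile(r"__[A-Za-z0-9]{7}$")


def normalize_task_name(name: str) -> str:
    """Strip at most one run-type prefix and one trailing suffix."""
    for prefix in RUN_PREFIXES:
        if name.startswith(prefix):
            name = name[len(prefix):]
            break
    return HARBOR_SUFFIX.sub("", name)


def lookup_gt(registry: dict, task_name: str):
    """Look up ground truth: exact key first, then case-insensitive."""
    key = normalize_task_name(task_name)
    if key in registry:
        return registry[key]
    # Fallback for names whose case differs from the registry key,
    # e.g. an uppercase "CCX-" task name against a lowercase entry.
    lowered = {k.lower(): v for k, v in registry.items()}
    return lowered.get(key.lower())
```

Trying the exact key before the case-insensitive fallback keeps the common path cheap and means the fallback only matters for names that would otherwise miss, which is presumably what kept the prefixed and uppercase event files out of the coverage count.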
1 parent 773096d commit c7d78c5

507 files changed: +13082 -4277 lines changed


benchmarks/ccb_contextbench/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/environment/Dockerfile renamed to calibration/curator_calibration/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/environment/Dockerfile

File renamed without changes.

benchmarks/ccb_contextbench/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/environment/Dockerfile.sg_only renamed to calibration/curator_calibration/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/environment/Dockerfile.sg_only

Lines changed: 2 additions & 2 deletions
@@ -4,9 +4,9 @@
 
 FROM ubuntu:22.04
 
-ENV SOURCEGRAPH_REPO_NAME=sg-evals/ponyc--3ed6c09a
-
 ENV DEBIAN_FRONTEND=noninteractive
+ENV SOURCEGRAPH_REPOS="sg-evals/ponyc--3ed6c09a"
+
 RUN apt-get update && apt-get install -y --no-install-recommends \
 git \
 ca-certificates \

benchmarks/ccb_contextbench/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/instruction.md renamed to calibration/curator_calibration/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/instruction.md

File renamed without changes.
Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
+# IMPORTANT: Source Code Access
+
+**Local source files are not present.** Your workspace does not contain source code. You **MUST** use Sourcegraph MCP tools to discover, read, and understand code before making any changes.
+
+**Target Repositories (version-pinned mirrors):**
+
+- `github.com/sg-evals/ponyc--3ed6c09a` — use `repo:^github.com/sg-evals/ponyc--3ed6c09a$` filter
+
+Scope ALL keyword_search/nls_search queries to these repos.
+Use the repo name as the `repo` parameter for read_file/go_to_definition/find_references.
+
+
+## Required Workflow
+
+1. **Search first** — Use MCP tools to find relevant files and understand existing patterns
+2. **Read remotely** — Use `sg_read_file` to read full file contents from Sourcegraph
+3. **Edit locally** — Use Edit, Write, and Bash to create or modify files in your working directory
+4. **Verify locally** — Run tests with Bash to check your changes
+
+## Tool Selection
+
+| Goal | Tool |
+|------|------|
+| Exact symbol/string | `sg_keyword_search` |
+| Concepts/semantic search | `sg_nls_search` |
+| Trace usage/callers | `sg_find_references` |
+| See implementation | `sg_go_to_definition` |
+| Read full file | `sg_read_file` |
+| Browse structure | `sg_list_files` |
+| Find repos | `sg_list_repos` |
+| Search commits | `sg_commit_search` |
+| Track changes | `sg_diff_search` |
+| Compare versions | `sg_compare_revisions` |
+
+**Decision logic:**
+1. Know the exact symbol? → `sg_keyword_search`
+2. Know the concept, not the name? → `sg_nls_search`
+3. Need definition of a symbol? → `sg_go_to_definition`
+4. Need all callers/references? → `sg_find_references`
+5. Need full file content? → `sg_read_file`
+
+## Scoping (Always Do This)
+
+```
+repo:^github.com/ORG/REPO$ # Exact repo (preferred)
+repo:github.com/ORG/ # All repos in org
+file:.*\.ts$ # TypeScript only
+file:src/api/ # Specific directory
+```
+
+Start narrow. Expand only if results are empty.
+
+## Efficiency Rules
+
+- Chain searches logically: search → read → references → definition
+- Don't re-search for the same pattern; use results from prior calls
+- Prefer `sg_keyword_search` over `sg_nls_search` when you have exact terms
+- Read 2-3 related files before synthesising, rather than one at a time
+- Don't read 20+ remote files without writing code — once you understand the pattern, start implementing
+
+## If Stuck
+
+If MCP search returns no results:
+1. Broaden the search query (synonyms, partial identifiers)
+2. Try `sg_nls_search` for semantic matching
+3. Use `sg_list_files` to browse the directory structure
+4. Use `sg_list_repos` to verify the repository name
+
+---
+
+# Fix: Multi-SWE-Bench__c__maintenance__bugfix__634fe9b8
+
+**Repository:** github.com/sg-evals/ponyc--3ed6c09a (mirror of ponylang/ponyc)
+**Language:** c
+**Category:** contextbench_cross_validation
+
+## Description
+
+Fix extracting docstring from constructors
+
+from classes or actors with initialized fields
+
+by executing `sugar_docstring` on the constructors before adding the initializers. This will actually execute `sugar_docstring` twice on such constructors, but as it is idempotent, there is no danger here.
+
+The docstrings have been inlined into the constructor body before.
+
+Now they will show up correctly in the generated docs.
+
+Fixes #2582
+
+## Task
+
+Diagnose and fix the issue described above. The repository has been cloned at the relevant commit. Make the necessary code changes to resolve the bug.
+
+## Success Criteria
+
+Your code changes should resolve the described issue. The implementation will be verified against the expected patch using diff similarity scoring.
+
+**Time Limit:** 30 minutes

benchmarks/ccb_contextbench/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/task.toml renamed to calibration/curator_calibration/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/task.toml

File renamed without changes.

benchmarks/ccb_contextbench/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/tests/expected.patch renamed to calibration/curator_calibration/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/tests/expected.patch

File renamed without changes.

benchmarks/ccb_contextbench/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/tests/gold_context.json renamed to calibration/curator_calibration/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/tests/gold_context.json

File renamed without changes.

benchmarks/ccb_contextbench/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/tests/sgonly_verifier_wrapper.sh renamed to calibration/curator_calibration/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/tests/sgonly_verifier_wrapper.sh

File renamed without changes.

benchmarks/ccb_contextbench/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/tests/test.sh renamed to calibration/curator_calibration/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/tests/test.sh

File renamed without changes.

benchmarks/ccb_contextbench/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/tests/verify_diff.py renamed to calibration/curator_calibration/cb-multi-swe-bench__c__maintenance__bugfix__634fe9b8/tests/verify_diff.py

File renamed without changes.
