
Commit 81b8f7d

sjarmak and claude committed
feat: add 8 dual-mode MCP-unique tasks (113-120) with oracle curation
Generate, customize, and register 8 new dual-mode tasks spanning 6 suites
(incident, migration, compliance, crossrepo_tracing, platform, domain)
across 5 repo sets (grafana, kafka, django, kubernetes, flink, strata).

Oracle curation:
- Populate task_spec.json oracle (required_files, required_symbols,
  dependency_chains) for all 8 tasks using Sourcegraph-verified data
- Add 4 evaluation checks per task: file_set_match, symbol_resolution,
  dependency_chain, keyword_presence
- Create ground_truth.json for tasks 115, 118 (needed for IR metrics)
- Align K8s oracle repo names with MCP mirror names (116, 117)

Bug fixes:
- Fix orphaned `fi` in 7 direct_verifier.sh files (bash syntax error from
  parent task copy stripping)
- Fix copy_parent_verifier() in customize_mcp_skeletons.py: replace broken
  fi-stripping heuristic with block-tracking state machine
- Fix .go → .java/.py file extension in instruction_mcp.md output format
  examples for non-Go tasks (114, 115, 118, 119, 120)

Also includes: 3 new repo set fixtures, IR analysis improvements, ground
truth registry update, retrieval eval smoke test script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent ae1e970 commit 81b8f7d
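
The commit message mentions replacing a fi-stripping heuristic with a block-tracking state machine, but the body of `copy_parent_verifier()` is not shown on this page. The following is only a minimal sketch of the general technique, assuming the goal is to drop one `if ... fi` block from a shell script without leaving an orphaned `fi`. The function name `strip_if_block` and the `marker` parameter are hypothetical, and the sketch assumes every `if` and `fi` sits on its own line.

```python
import re

IF_OPEN = re.compile(r"^\s*if\b")
FI_CLOSE = re.compile(r"^\s*fi\b")


def strip_if_block(lines, marker):
    """Drop the `if ... fi` block whose opening line contains `marker`.

    Hypothetical helper: tracks nesting depth so the matching `fi`
    is removed together with the block, never left behind.
    """
    kept = []
    depth = 0  # 0 means we are outside the block being dropped
    for line in lines:
        if depth == 0:
            if IF_OPEN.match(line) and marker in line:
                depth = 1  # entering the block to drop
                continue
            kept.append(line)
        else:
            # Inside the dropped block: only track nesting, keep nothing.
            if IF_OPEN.match(line):
                depth += 1
            elif FI_CLOSE.match(line):
                depth -= 1
    return kept
```

Counting depth rather than deleting the next `fi` seen is what keeps nested conditionals balanced. One-line forms such as `if cond; then cmd; fi` are outside what this sketch handles.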

125 files changed: +15590 -134 lines changed

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+FROM ubuntu:22.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Base tools
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    ca-certificates \
+    curl \
+    python3 \
+    python3 python3-pip \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace
+
+# Clone local checkout repos (baseline config: agent has local access to these)
+RUN git clone --depth 1 --branch 674eda1c https://github.com/sg-evals/django--674eda1c /workspace/django--674eda1c
+
+# Initialize git identity for agent commits
+RUN git config --global user.email "agent@example.com" && \
+    git config --global user.name "Agent" && \
+    git config --global safe.directory '*'
+
+# Create log directories
+RUN mkdir -p /logs/agent /logs/verifier
+
+ENTRYPOINT []
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+# ccx-compliance-115 — artifact_only variant
+# No local repo clone — agent uses Sourcegraph MCP exclusively for code access.
+# Agent produces answer.json artifact; verifier scores the artifact.
+
+FROM ubuntu:22.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV SOURCEGRAPH_REPOS="sg-evals/django--674eda1c"
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    ca-certificates \
+    python3 \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace
+
+# Empty workspace — agent discovers code via MCP tools only
+RUN git init && \
+    git config user.email "agent@example.com" && \
+    git config user.name "Agent" && \
+    git config --global safe.directory '*'
+
+# Create log directories
+RUN mkdir -p /logs/agent /logs/verifier
+
+# Mark artifact-only mode — verifiers and eval scripts check this flag
+RUN touch /tmp/.artifact_only_mode
+
+ENTRYPOINT []
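
The Dockerfile comments say that verifiers and eval scripts check a marker file, but the checking code is not part of this excerpt. A minimal sketch of such a check might look like the following. The function name `detect_mode`, the `flag_dir` parameter, the `"baseline"` label, and the precedence when both markers exist are all assumptions; only the two marker file names come from the Dockerfiles in this commit.

```python
from pathlib import Path


def detect_mode(flag_dir="/tmp"):
    """Return the variant implied by the marker files in `flag_dir`.

    Hypothetical helper. Falls back to "baseline" when no marker exists.
    """
    flags = Path(flag_dir)
    # Assumption: artifact-only takes precedence if both markers are present.
    if (flags / ".artifact_only_mode").exists():
        return "artifact_only"
    if (flags / ".sg_only_mode").exists():
        return "sg_only"
    return "baseline"
```

Taking the directory as a parameter keeps the check testable outside the container, where `/tmp` may hold unrelated files.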
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+# CCX-compliance-115 — sg_only variant
+# No local repo clone — agent uses Sourcegraph MCP exclusively for code access.
+# The verifier clones mirror repos at verification time (no /repo_full/ backup).
+
+FROM ubuntu:22.04
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV SOURCEGRAPH_REPOS="sg-evals/django--674eda1c"
+
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git \
+    ca-certificates \
+    python3 \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+
+WORKDIR /workspace
+
+# Empty workspace — agent discovers code via MCP tools only
+RUN git init && \
+    git config user.email "agent@example.com" && \
+    git config user.name "Agent" && \
+    git config --global safe.directory '*'
+
+# Create log directories
+RUN mkdir -p /logs/agent /logs/verifier
+
+# Mark sg_only mode — verifiers and eval scripts check this flag
+RUN touch /tmp/.sg_only_mode
+
+RUN echo '{"workdir":"/workspace","repos":[{"mirror":"sg-evals/django--674eda1c","target_dir":"django"}]}' > /tmp/.sg_only_clone_manifest.json
+
+ENTRYPOINT []
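
The sg_only variant writes a clone manifest that the verifier is said to use when cloning mirror repos at verification time. The verifier itself is not shown here, so the following is only a sketch of how the manifest could be turned into `git clone` commands. The function name `clone_commands`, the `https://github.com/` URL prefix (borrowed from the baseline Dockerfile), and the `--depth 1` flag are assumptions.

```python
import json
from pathlib import PurePosixPath


def clone_commands(manifest_text):
    """Build one `git clone` argument list per repo in the manifest.

    Hypothetical helper: returns the commands instead of running them,
    so a caller can log or inspect them before execution.
    """
    manifest = json.loads(manifest_text)
    workdir = PurePosixPath(manifest["workdir"])
    commands = []
    for repo in manifest["repos"]:
        # Assumption: mirrors are hosted on GitHub under the same name.
        url = f"https://github.com/{repo['mirror']}"
        dest = workdir / repo["target_dir"]
        commands.append(["git", "clone", "--depth", "1", url, str(dest)])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. With the manifest above, the single mirror would land in `/workspace/django`.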
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+# Compliance Audit: Django Session Key Rotation Concurrency Safety
+
+## Your Task
+
+Audit Django's session framework for concurrency safety in the session key rotation path. Find: 1. The Python source file in `django/contrib/sessions/` that implements `cycle_key()` — the method called during login to rotate session keys. 2. The session backend base class file that defines `create()` and `_get_new_session_key()` — the methods responsible for generating and persisting new session keys. 3. The database backend file that implements the actual `create()` with a database INSERT. 4. Identify whether `create()` handles key collisions (duplicate session keys) or silently overwrites. Report the repo, file path, class name, and method name for each, plus a brief note on whether collision handling exists.
+
+## Context
+
+You are working on a codebase task involving repos from the compliance domain.
+
+## Available Resources
+
+The local `/workspace/` directory contains: sg-evals/django--674eda1c.
+
+**Note:** Additional repositories are accessible via Sourcegraph MCP tools:
+- `sg-evals/django--674eda1c` (django/django)
+
+## Output Format
+
+Create a file at `/workspace/answer.json` with your findings in the following structure:
+
+```json
+{
+  "files": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.py"}
+  ],
+  "symbols": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.py", "symbol": "SymbolName"}
+  ],
+  "chain": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.py", "symbol": "FunctionName"}
+  ],
+  "text": "Narrative explanation of your findings, citing repos and file paths."
+}
+```
+
+Include only the fields relevant to this task. Your answer is evaluated against a closed-world oracle — completeness matters.
+
+## Evaluation
+
+Your answer will be scored on:
+- **File recall and precision**: Did you find all relevant files?
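
The scoring code behind "file recall and precision" is not included in this excerpt. As a rough illustration of what those two numbers mean for the `files` list in `answer.json`, here is a sketch that compares an answer against an oracle list. The function name `file_precision_recall` and the choice to key each entry on the `(repo, path)` pair are assumptions, not the benchmark's actual check.

```python
def file_precision_recall(answer_files, oracle_files):
    """Compare two lists of {"repo": ..., "path": ...} entries.

    Hypothetical helper. Precision is the share of answered files that
    are in the oracle; recall is the share of oracle files that were found.
    """
    answered = {(f["repo"], f["path"]) for f in answer_files}
    expected = {(f["repo"], f["path"]) for f in oracle_files}
    hits = answered & expected
    precision = len(hits) / len(answered) if answered else 0.0
    recall = len(hits) / len(expected) if expected else 0.0
    return precision, recall
```

Under this reading, listing extra files lowers precision and omitting oracle files lowers recall, which is why a closed-world oracle rewards complete but focused answers.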
Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
+# IMPORTANT: Source Code Access
+
+**Local source files are not present.** Your workspace does not contain source code. You **MUST** use Sourcegraph MCP tools to discover, read, and understand code before making any changes.
+
+**Target Repositories (version-pinned mirrors):**
+
+- `github.com/sg-evals/django--674eda1c` — use `repo:^github.com/sg-evals/django--674eda1c$` filter
+
+Scope ALL keyword_search/nls_search queries to these repos.
+Use the repo name as the `repo` parameter for read_file/go_to_definition/find_references.
+
+
+## Required Workflow
+
+1. **Search first** — Use MCP tools to find relevant files and understand existing patterns
+2. **Read remotely** — Use `sg_read_file` to read full file contents from Sourcegraph
+3. **Edit locally** — Use Edit, Write, and Bash to create or modify files in your working directory
+4. **Verify locally** — Run tests with Bash to check your changes
+
+## Tool Selection
+
+| Goal | Tool |
+|------|------|
+| Exact symbol/string | `sg_keyword_search` |
+| Concepts/semantic search | `sg_nls_search` |
+| Trace usage/callers | `sg_find_references` |
+| See implementation | `sg_go_to_definition` |
+| Read full file | `sg_read_file` |
+| Browse structure | `sg_list_files` |
+| Find repos | `sg_list_repos` |
+| Search commits | `sg_commit_search` |
+| Track changes | `sg_diff_search` |
+| Compare versions | `sg_compare_revisions` |
+
+**Decision logic:**
+1. Know the exact symbol? -> `sg_keyword_search`
+2. Know the concept, not the name? -> `sg_nls_search`
+3. Need definition of a symbol? -> `sg_go_to_definition`
+4. Need all callers/references? -> `sg_find_references`
+5. Need full file content? -> `sg_read_file`
+
+## Scoping (Always Do This)
+
+```
+repo:^github.com/ORG/REPO$  # Exact repo (preferred)
+repo:github.com/ORG/        # All repos in org
+file:.*\.ts$                # TypeScript only
+file:src/api/               # Specific directory
+```
+
+Start narrow. Expand only if results are empty.
+
+## Efficiency Rules
+
+- Chain searches logically: search -> read -> references -> definition
+- Don't re-search for the same pattern; use results from prior calls
+- Prefer `sg_keyword_search` over `sg_nls_search` when you have exact terms
+- Read 2-3 related files before synthesising, rather than one at a time
+- Don't read 20+ remote files without writing code — once you understand the pattern, start implementing
+
+## If Stuck
+
+If MCP search returns no results:
+1. Broaden the search query (synonyms, partial identifiers)
+2. Try `sg_nls_search` for semantic matching
+3. Use `sg_list_files` to browse the directory structure
+4. Use `sg_list_repos` to verify the repository name
+
+---
+
+**Sourcegraph Repositories:** `github.com/sg-evals/django--674eda1c`
+
+# Compliance Audit: Django Session Key Rotation Concurrency Safety
+
+## Your Task
+
+Audit Django's session framework for concurrency safety in the session key rotation path. Find: 1. The Python source file in `django/contrib/sessions/` that implements `cycle_key()` — the method called during login to rotate session keys. 2. The session backend base class file that defines `create()` and `_get_new_session_key()` — the methods responsible for generating and persisting new session keys. 3. The database backend file that implements the actual `create()` with a database INSERT. 4. Identify whether `create()` handles key collisions (duplicate session keys) or silently overwrites. Report the repo, file path, class name, and method name for each, plus a brief note on whether collision handling exists.
+
+## Context
+
+You are working on a codebase task involving repos from the compliance domain.
+
+## Available Resources
+
+The local `/workspace/` directory contains: sg-evals/django--674eda1c.
+
+**Note:** Additional repositories are accessible via Sourcegraph MCP tools:
+- `sg-evals/django--674eda1c` (django/django)
+
+## Output Format
+
+Create a file at `/workspace/answer.json` with your findings in the following structure:
+
+```json
+{
+  "files": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.py"}
+  ],
+  "symbols": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.py", "symbol": "SymbolName"}
+  ],
+  "chain": [
+    {"repo": "org/repo-name", "path": "relative/path/to/file.py", "symbol": "FunctionName"}
+  ],
+  "text": "Narrative explanation of your findings, citing repos and file paths."
+}
+```
+
+Include only the fields relevant to this task. Your answer is evaluated against a closed-world oracle — completeness matters.
+
+## Evaluation
+
+Your answer will be scored on:
+- **File recall and precision**: Did you find all relevant files?
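
The Scoping section of the instructions above asks for every search to carry an exact `repo:` filter. A small helper can make that hard to forget. This is only a sketch: the function name `scoped_query` and its parameters are hypothetical, and how the resulting string is passed to a tool such as `sg_keyword_search` is not specified in this commit.

```python
def scoped_query(repo, terms, file_pattern=None):
    """Prefix search terms with an exact-repo filter and optional file filter.

    Hypothetical helper that follows the `repo:^...$` and `file:` patterns
    shown in the Scoping section; it does not call any search API.
    """
    parts = [f"repo:^{repo}$"]
    if file_pattern:
        parts.append(f"file:{file_pattern}")
    parts.append(terms)
    return " ".join(parts)
```

For example, `scoped_query("github.com/sg-evals/django--674eda1c", "cycle_key")` yields `repo:^github.com/sg-evals/django--674eda1c$ cycle_key`, which matches the filter given for the target repository.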
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+version = "1.0"
+
+[metadata]
+name = "CCX-compliance-115"
+description = "Compliance Audit: Django Session Key Rotation Concurrency Safety"
+license = "Apache-2.0"
+
+[task]
+id = "CCX-compliance-115"
+repo = "sg-evals/django--674eda1c"
+category = "compliance-audit"
+language = "python"
+difficulty = "hard"
+time_limit_sec = 900
+mcp_suite = "ccb_mcp_compliance"
+use_case_id = 115
+repo_set_id = "django-web-framework"
+mcp_unique = true
+verification_modes = ["artifact", "direct"]
+
+[verification]
+type = "test"
+command = "bash /tests/test.sh"
+
+reward_type = "score"
+description = "Compliance Audit: Django Session Key Rotation Concurrency Safety"
+
+[environment]
+build_timeout_sec = 600.0
