Commit eff6719

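fix: resolve P1 high-priority issues

Core fixes:
- interactor: fix a bug where the set returned by asyncio.wait() was incorrectly indexed
- interactor: report an error when the reference solution does not exist instead of silently skipping it; pass_rate is 0.0 when no tests are run
- stress_test: use the full generator protocol (6 arguments) by default instead of passing only the seed
- checker: distinguish FAIL (the checker itself failed) from WA (the contestant's answer is wrong)

Documentation fixes:
- README: correct the tool count to 15 and remove the description of the nonexistent autocode_ prefix
- validator/checker/interactor: correct the file save paths in the tool descriptions

Test improvements:
- fix test_packaging.py using a nonexistent prompt name
- add tests covering interactor reference-solution validation, the pass_rate default, and the checker FAIL verdict
- fix an asyncio deprecation warning

All 142 tests pass.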
1 parent 05108f2 commit eff6719

10 files changed, +169 -43 lines changed

.github/workflows/ci.yml

Lines changed: 2 additions & 4 deletions
@@ -34,12 +34,11 @@ jobs:
       - run: uv run mypy src/
 
   test-unit:
-    runs-on: ${{ matrix.os }}
+    runs-on: ubuntu-latest
     strategy:
       fail-fast: false
       matrix:
-        os: [ubuntu-latest, windows-latest, macos-latest]
-        python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
+        python-version: ["3.10", "3.14"]
     steps:
       - uses: actions/checkout@v4
       - uses: actions/setup-python@v5
@@ -51,7 +50,6 @@ jobs:
       - run: uv sync --all-extras
       - run: uv run pytest tests/ -v -m "not integration" --cov --cov-report=xml
       - uses: codecov/codecov-action@v4
-        if: matrix.os == 'ubuntu-latest'
         with:
           files: coverage.xml
           flags: unit-tests

README.md

Lines changed: 10 additions & 11 deletions
@@ -7,14 +7,14 @@
 
 **An MCP Server for competitive programming problem creation, implementing the Validator-Generator-Checker framework from the AutoCode paper.**
 
-AutoCode MCP Server provides 14 atomic tools that enable AI assistants to create, validate, and test competitive programming problems. It handles compilation, execution, stress testing, and test data generation—letting the AI focus on problem design and solution logic.
+AutoCode MCP Server provides 15 atomic tools that enable AI assistants to create, validate, and test competitive programming problems. It handles compilation, execution, stress testing, and test data generation—letting the AI focus on problem design and solution logic.
 
 [中文文档](README_CN.md)
 
 ## Features
 
 - **Validator-Generator-Checker Framework** — Automated validation of input correctness, multi-strategy test generation, and output verification based on the AutoCode paper
-- **14 Atomic Tools** — File operations, solution building, stress testing, validator/generator/checker construction, and more
+- **15 Atomic Tools** — File operations, solution building, stress testing, validator/generator/checker construction, and more
 - **testlib.h Support** — Full integration with the competitive programming standard library for validators, generators, and checkers
 - **Multi-Strategy Generation** — Four generation strategies: tiny (exhaustive), random, extreme (edge cases), and TLE-inducing
 - **Stress Testing** — Automated comparison between optimal and brute-force solutions with configurable trial counts
@@ -189,11 +189,11 @@ For development or custom installations:
 
 ### Verify Installation
 
-After configuration, restart your MCP client and check that tools are available. You should see 14 tools prefixed with `autocode_`.
+After configuration, restart your MCP client and check that tools are available. You should see 15 tools available.
 
 ## Tools Reference
 
-AutoCode provides 14 atomic tools organized into 7 groups. All tools return a unified format:
+AutoCode provides 15 atomic tools organized into 7 groups. All tools return a unified format:
 
 ```json
 {
@@ -241,7 +241,7 @@ AutoCode provides 14 atomic tools organized into 7 groups. All tools return a un
 
 | Tool | Description | Key Parameters |
 |------|-------------|----------------|
-| `interactor_build` | Build interactor for interactive problems | `problem_dir`, `code`, `test_scenarios` |
+| `interactor_build` | Build interactor for interactive problems | `problem_dir`, `code`, `reference_solution_path`, `mutant_solutions` |
 
 ### Stress Testing
 
@@ -253,9 +253,9 @@ AutoCode provides 14 atomic tools organized into 7 groups. All tools return a un
 
 | Tool | Description | Key Parameters |
 |------|-------------|----------------|
-| `problem_create` | Initialize problem directory | `problem_dir`, `title`, `time_limit`, `memory_limit` |
+| `problem_create` | Initialize problem directory | `problem_dir`, `problem_name` |
 | `problem_generate_tests` | Generate final test data | `problem_dir`, `test_count` |
-| `problem_pack_polygon` | Package for Polygon platform | `problem_dir`, `output_dir` |
+| `problem_pack_polygon` | Package for Polygon platform | `problem_dir`, `time_limit`, `memory_limit` |
 
 ## Workflow Tutorial: A+B Problem
 
@@ -266,9 +266,7 @@ This tutorial walks through creating a simple A+B problem using AutoCode tools.
 ```python
 problem_create(
     problem_dir="problems/ab",
-    title="A + B",
-    time_limit=1000,
-    memory_limit=256
+    problem_name="A + B"
 )
 ```
 
@@ -393,7 +391,8 @@ problem_generate_tests(
 ```python
 problem_pack_polygon(
     problem_dir="problems/ab",
-    output_dir="polygon/ab"
+    time_limit=1,
+    memory_limit=256
 )
 ```
 

src/autocode_mcp/tools/checker.py

Lines changed: 10 additions & 4 deletions
@@ -26,8 +26,8 @@ def description(self) -> str:
         return """Build and verify the output checker.
 
         Implements Algorithm 3 from the paper:
-        1. Save the code to problem_dir/checker.cpp
-        2. Compile it to checker.exe
+        1. Save the code to problem_dir/files/checker.cpp
+        2. Compile it to files/checker.exe
         3. Run the test scenarios to verify accuracy
         4. Return the accuracy and detailed results
 
@@ -59,7 +59,7 @@ def input_schema(self) -> dict:
                             "reference_output": {"type": "string"},
                             "expected_verdict": {
                                 "type": "string",
-                                "enum": ["AC", "WA", "PE"],
+                                "enum": ["AC", "WA", "PE", "FAIL"],
                             },
                         },
                         "required": ["input", "contestant_output", "reference_output"],
@@ -155,7 +155,13 @@ async def execute(
             # Checker return code convention (testlib.h):
             # 0 = AC, 1 = WA, 2 = PE, 3+ = Fail (checker error)
             verdict_map = {0: "AC", 1: "WA", 2: "PE"}
-            actual_verdict = verdict_map.get(run_result.return_code, "WA")
+            if run_result.return_code in verdict_map:
+                actual_verdict = verdict_map[run_result.return_code]
+            elif run_result.return_code >= 3:
+                # The checker itself failed: report FAIL rather than WA
+                actual_verdict = "FAIL"
+            else:
+                actual_verdict = "WA"
 
             # Check for a timeout
             if run_result.timed_out:
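
The exit-code handling in the last checker.py hunk can be read as a small pure function. The following is a minimal sketch, assuming the testlib.h convention quoted in the diff's own comment (0 = AC, 1 = WA, 2 = PE, 3 or higher = checker failure). The name `classify_exit_code` is hypothetical and does not appear in the repository.

```python
def classify_exit_code(return_code: int) -> str:
    """Map a checker exit code to a verdict string (illustrative helper)."""
    verdict_map = {0: "AC", 1: "WA", 2: "PE"}
    if return_code in verdict_map:
        return verdict_map[return_code]
    if return_code >= 3:
        # The checker itself failed; this is not the contestant's fault.
        return "FAIL"
    # Anything else (for example a negative code) falls back to WA,
    # mirroring the else branch in the diff.
    return "WA"
```

Before this commit, every unknown code collapsed into WA through `verdict_map.get(..., "WA")`, so a crashing checker was indistinguishable from a wrong answer.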

src/autocode_mcp/tools/interactor.py

Lines changed: 30 additions & 10 deletions
@@ -26,8 +26,8 @@ def description(self) -> str:
         return """Build and verify the interactor.
 
         Implements Algorithm 4 from the paper:
-        1. Save the code to problem_dir/interactor.cpp
-        2. Compile it to interactor.exe
+        1. Save the code to problem_dir/files/interactor.cpp
+        2. Compile it to files/interactor.exe
         3. Run mutation tests to verify its discriminating power
         4. Return pass_rate and fail_rate
 
@@ -101,20 +101,33 @@ async def execute(
                 compile_log=compile_result.stderr,
             )
 
-        # If neither a reference solution nor mutant solutions are provided, return success directly
+        # If neither a reference solution nor mutant solutions are provided, return success directly (but with pass_rate 0)
         if not reference_solution_path and not mutant_solutions:
             return ToolResult.ok(
                 source_path=source_path,
                 binary_path=binary_path,
                 compile_log=compile_result.stderr,
+                pass_rate=0.0,
+                fail_rate=0.0,
+                pass_count=0,
+                pass_total=0,
+                fail_count=0,
+                fail_total=0,
                 message="Interactor built successfully (no validation performed)",
             )
 
         # Verify the pass rate of the correct solution
         pass_count = 0
         pass_total = 0
 
-        if reference_solution_path and os.path.exists(reference_solution_path):
+        if reference_solution_path:
+            # Check that the reference solution exists; report an error instead of silently skipping
+            if not os.path.exists(reference_solution_path):
+                return ToolResult.fail(
+                    f"Reference solution not found: {reference_solution_path}",
+                    source_path=source_path,
+                    binary_path=binary_path,
+                )
             pass_total = 1
             # Run the interaction test: the reference solution should be accepted
             test_result = await self._run_interactor_test(
@@ -138,7 +151,8 @@ async def execute(
                 if test_result.get("verdict") != "AC":
                     fail_count += 1
 
-        pass_rate = pass_count / pass_total if pass_total > 0 else 1.0
+        # Compute the pass rate: 0 when there are no tests, not 1.0
+        pass_rate = pass_count / pass_total if pass_total > 0 else 0.0
         fail_rate = fail_count / fail_total if fail_total > 0 else 0.0
 
         return ToolResult.ok(
@@ -209,8 +223,15 @@ async def _run_interactor_test(
                 return_when=asyncio.FIRST_COMPLETED,
             )
 
-            # Check for a timeout
-            timed_out = any(task.get_name() == "sleep" for task in done)
+            # Check for a timeout: asyncio.wait returns sets, so iterate over them
+            timed_out = False
+            comm_task = None
+            for task in done:
+                if task.get_name() == "sleep":
+                    timed_out = True
+                else:
+                    comm_task = task
+
             if timed_out:
                 interactor.kill()
                 solution.kill()
@@ -219,7 +240,6 @@ async def _run_interactor_test(
                 return {"verdict": "TLE", "reason": "Timeout"}
 
             # Get the communication result
-            comm_task = done[0] if done else None
             if comm_task and not comm_task.cancelled():
                 result = comm_task.result()
                 return result
@@ -282,9 +302,9 @@ async def pipe_data(reader, writer, name: str):
                 ),
             ]
 
-            # Wait for either process to finish
+            # Wait for either process to finish; use create_task to avoid the deprecation warning
             await asyncio.wait(
-                [interactor.wait(), solution.wait()],
+                [asyncio.create_task(interactor.wait()), asyncio.create_task(solution.wait())],
                 return_when=asyncio.FIRST_COMPLETED,
             )
         except NotImplementedError:
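
The `done[0]` bug fixed in this file comes from the fact that `asyncio.wait()` returns a pair of sets, and sets cannot be indexed. The sketch below illustrates the same pattern in isolation, assuming a single work coroutine raced against a named sleep task. The function name `wait_with_timeout` and the task name `"comm"` are hypothetical; only the `"sleep"` task name and the returned dictionaries follow the diff.

```python
import asyncio


async def wait_with_timeout(work, timeout: float):
    """Race a coroutine against a timer and pick the winner out of the done set."""
    comm = asyncio.create_task(work, name="comm")
    timer = asyncio.create_task(asyncio.sleep(timeout), name="sleep")

    # asyncio.wait takes tasks (not bare coroutines) and returns two sets.
    done, pending = await asyncio.wait(
        {comm, timer}, return_when=asyncio.FIRST_COMPLETED
    )

    # Cancel whichever task lost the race and let the cancellation settle.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    # Sets are unordered and not subscriptable, so iterate instead of done[0].
    timed_out = False
    comm_task = None
    for task in done:
        if task.get_name() == "sleep":
            timed_out = True
        else:
            comm_task = task

    if timed_out or comm_task is None:
        return {"verdict": "TLE", "reason": "Timeout"}
    return comm_task.result()
```

Wrapping the coroutines with `asyncio.create_task` before calling `asyncio.wait` is also what the last hunk does, since passing bare coroutines to `asyncio.wait` is deprecated and rejected by newer Python versions.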

src/autocode_mcp/tools/stress_test.py

Lines changed: 13 additions & 4 deletions
@@ -125,7 +125,7 @@ async def execute(
         for i in range(1, trials + 1):
             # 1. Generate the input data
             gen_result = await self._generate_input(
-                gen_exe, input_path, i, seed=i, timeout=timeout, generator_args=generator_args
+                gen_exe, input_path, i, seed=i, timeout=timeout, n_max=n_max, generator_args=generator_args
             )
             if not gen_result["success"]:
                 return ToolResult.fail(
@@ -201,6 +201,7 @@ async def _generate_input(
         round_num: int,
         seed: int,
         timeout: int,
+        n_max: int = 100,
         generator_args: dict | None = None,
     ) -> dict:
         """
@@ -212,6 +213,7 @@ async def _generate_input(
             round_num: current round
             seed: random seed
             timeout: timeout (seconds)
+            n_max: maximum value of N (used by the default protocol)
             generator_args: full generator arguments (optional)
 
         Returns:
@@ -225,13 +227,20 @@ async def _generate_input(
                 str(seed),
                 generator_args.get("type", "2"),
                 str(generator_args.get("n_min", 1)),
-                str(generator_args.get("n_max", 100)),
+                str(generator_args.get("n_max", n_max)),
                 str(generator_args.get("t_min", 1)),
                 str(generator_args.get("t_max", 1)),
             ]
         else:
-            # Simple protocol: gen.exe <seed>
-            cmd_args = [str(seed)]
+            # Use the full protocol by default, consistent with generator_run and problem_generate_tests
+            cmd_args = [
+                str(seed),
+                "2",  # type=random
+                "1",  # n_min
+                str(n_max),  # n_max taken from the parameter
+                "1",  # t_min
+                "1",  # t_max
+            ]
 
         gen_result = await run_binary_with_args(
             gen_exe,
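
The two branches in the last hunk build the same six-argument command line (seed, type, n_min, n_max, t_min, t_max), so they can be expressed as one helper. This is a minimal sketch of that argument construction only. The name `build_generator_args` is hypothetical, and the sketch wraps the `type` value in `str()` for uniformity, which the diff does not do.

```python
from typing import Optional


def build_generator_args(
    seed: int,
    n_max: int = 100,
    generator_args: Optional[dict] = None,
) -> list:
    """Return the six positional arguments passed to the generator binary."""
    # An empty or missing dict means every field takes its default,
    # which matches the else branch in the diff.
    args = generator_args or {}
    return [
        str(seed),
        str(args.get("type", "2")),  # "2" is the random strategy per the diff
        str(args.get("n_min", 1)),
        str(args.get("n_max", n_max)),  # explicit n_max overrides the parameter
        str(args.get("t_min", 1)),
        str(args.get("t_max", 1)),
    ]
```

With the old simple protocol only `[str(seed)]` was passed, so a generator written for the six-argument form would have received too few arguments during stress testing.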

src/autocode_mcp/tools/validator.py

Lines changed: 2 additions & 2 deletions
@@ -26,8 +26,8 @@ def description(self) -> str:
         return """Build and verify the data validator.
 
         Implements Algorithm 1 from the paper:
-        1. Save the code to problem_dir/val.cpp
-        2. Compile it to val.exe
+        1. Save the code to problem_dir/files/val.cpp
+        2. Compile it to files/val.exe
         3. Run the test cases to verify robustness
         4. Return the score and detailed results
 

tests/test_packaging.py

Lines changed: 98 additions & 1 deletion
@@ -73,7 +73,7 @@ async def test_mcp_get_prompt_result_type():
     from autocode_mcp.server import get_prompt
 
     # Test an existing prompt
-    result = await get_prompt("validator_workflow")
+    result = await get_prompt("validator")
     assert isinstance(result, GetPromptResult)
     assert len(result.messages) > 0
 
@@ -117,3 +117,100 @@ def test_all_template_files_exist():
     for template in expected_templates:
         path = os.path.join(TEMPLATES_DIR, template)
         assert os.path.exists(path), f"Template not found: {template}"
+
+
+@pytest.mark.asyncio
+async def test_interactor_reference_solution_not_found():
+    """Test that interactor_build reports an error instead of silently skipping when the reference solution does not exist."""
+    from autocode_mcp.server import call_tool, register_all_tools
+
+    register_all_tools()
+
+    import tempfile
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        # Test a nonexistent reference solution path
+        result = await call_tool("interactor_build", {
+            "problem_dir": tmpdir,
+            "code": '#include "testlib.h"\nint main() { return 0; }',
+            "reference_solution_path": os.path.join(tmpdir, "nonexistent.exe"),
+        })
+
+        assert result.isError is True
+        assert "Reference solution not found" in result.structuredContent.get("error", "")
+
+
+@pytest.mark.asyncio
+async def test_interactor_pass_rate_without_tests():
+    """Test that interactor_build reports pass_rate 0 rather than 1.0 when there are no tests."""
+    from autocode_mcp.server import call_tool, register_all_tools
+
+    register_all_tools()
+
+    import tempfile
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        # Provide neither a reference solution nor mutant solutions
+        result = await call_tool("interactor_build", {
+            "problem_dir": tmpdir,
+            "code": '#include "testlib.h"\nint main() { return 0; }',
+        })
+
+        assert result.isError is False
+        # pass_rate is in the data field
+        data = result.structuredContent.get("data", {})
+        assert data.get("pass_rate", 1.0) == 0.0
+
+
+@pytest.mark.asyncio
+async def test_checker_fail_verdict():
+    """Test that checker_build distinguishes FAIL from WA."""
+    from autocode_mcp.server import call_tool, register_all_tools
+
+    register_all_tools()
+
+    import tempfile
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        # Create a checker that returns a non-standard exit code
+        # quitf(_fail, ...) in testlib.h exits with code 3+
+        checker_code = '''
+#include "testlib.h"
+int main(int argc, char* argv[]) {
+    registerTestlibCmd(argc, argv);
+    // Force a FAIL verdict
+    quitf(_fail, "Checker internal error");
+    return 3;
+}
+'''
+        result = await call_tool("checker_build", {
+            "problem_dir": tmpdir,
+            "code": checker_code,
+            "test_scenarios": [
+                {
+                    "input": "1",
+                    "contestant_output": "1",
+                    "reference_output": "1",
+                    "expected_verdict": "FAIL",
+                },
+            ],
+        })
+
+        assert result.isError is False
+        test_results = result.structuredContent.get("test_results", [])
+        if test_results:
+            # Should be recognized as FAIL rather than WA
+            assert test_results[0].get("actual_verdict") == "FAIL"
+
+
+def test_all_prompts_exist():
+    """Test that every declared prompt exists."""
+    from autocode_mcp.prompts import get_prompt, list_prompts
+
+    prompts = list_prompts()
+    assert len(prompts) == 6
+
+    for name in prompts:
+        content = get_prompt(name)
+        assert content, f"Prompt '{name}' is empty"
+        assert len(content) > 100, f"Prompt '{name}' seems too short"
