fix: resolve P1 high-priority issues

SummerOneTwo · SummerOneTwo · commit eff6719eb4c0 · 2026-04-09T14:47:29.000+08:00
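Core fixes:
- interactor: fix a bug where the set returned by asyncio.wait() was indexed like a list
- interactor: report an error when the reference solution is missing instead of silently skipping it; pass_rate is 0.0 when no tests run
- stress_test: use the full generator protocol (6 arguments) by default instead of passing only the seed
- checker: distinguish FAIL (the checker itself failed) from WA (the contestant's answer is wrong)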
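
The FAIL/WA distinction in the checker fix above can be sketched as a small standalone function. This is only an illustration of the mapping used in the checker.py hunk of this commit; the function name `classify_exit_code` is hypothetical and does not appear in the repository.

```python
def classify_exit_code(return_code: int) -> str:
    """Map a testlib.h checker exit code to a verdict string.

    Hypothetical helper mirroring the branch logic in the checker.py hunk.
    """
    # testlib.h convention: 0 = AC, 1 = WA, 2 = PE, 3+ = checker failure
    verdict_map = {0: "AC", 1: "WA", 2: "PE"}
    if return_code in verdict_map:
        return verdict_map[return_code]
    if return_code >= 3:
        # The checker itself failed, so this is not the contestant's fault
        return "FAIL"
    # Remaining codes (e.g. negative values) fall back to WA, as in the diff
    return "WA"


if __name__ == "__main__":
    for code in (0, 1, 2, 3, 7, -9):
        print(code, classify_exit_code(code))
```

Before this change, any exit code outside 0 to 2 was reported as WA, which would blame the contestant for a broken checker.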

Documentation fixes:
- README: correct the tool count to 15 and remove the description of the nonexistent autocode_ prefix
- validator/checker/interactor: correct the file save paths in the tool descriptions

Test improvements:
- Fix test_packaging.py, which used a nonexistent prompt name
- Add tests covering interactor reference-solution validation, the pass_rate default, and the checker FAIL verdict
- Fix the asyncio deprecation warning
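
The two asyncio fixes in this commit (iterating over the set returned by asyncio.wait(), and wrapping coroutines in tasks before waiting) can be sketched together as follows. This is a minimal standalone sketch, not the code from interactor.py; `first_finished` and `quick` are hypothetical names, and only the task name "sleep" is taken from the diff.

```python
import asyncio


async def first_finished(work, timeout):
    """Race the coroutine `work` against a timer; return (timed_out, result)."""
    # Wrap coroutines in named tasks: passing bare coroutines to
    # asyncio.wait() is deprecated and rejected by newer Python versions.
    comm = asyncio.create_task(work, name="comm")
    timer = asyncio.create_task(asyncio.sleep(timeout), name="sleep")

    done, pending = await asyncio.wait(
        {comm, timer}, return_when=asyncio.FIRST_COMPLETED
    )

    # `done` is a set, so done[0] would raise TypeError; iterate instead.
    timed_out = False
    finished = None
    for task in done:
        if task.get_name() == "sleep":
            timed_out = True
        else:
            finished = task

    # Cancel whichever task lost the race and let the cancellation settle.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    if timed_out or finished is None:
        return True, None
    return False, finished.result()


async def quick():
    await asyncio.sleep(0.01)
    return "AC"


if __name__ == "__main__":
    print(asyncio.run(first_finished(quick(), 1.0)))
    print(asyncio.run(first_finished(asyncio.sleep(1.0), 0.01)))
```

As in the diff, a timeout takes precedence if both tasks happen to finish in the same iteration.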

All 142 tests pass
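
For reference, the six-argument generator protocol adopted by the stress_test change (seed, type, n_min, n_max, t_min, t_max) can be sketched as below. The helper name `build_generator_args` is hypothetical, and it merges the two branches of the diff into one function for brevity; the defaults are the ones visible in the stress_test.py hunk.

```python
def build_generator_args(seed, n_max=100, generator_args=None):
    """Build the argv tail for a generator: seed type n_min n_max t_min t_max.

    Hypothetical helper; defaults follow the stress_test.py hunk.
    """
    args = generator_args or {}
    return [
        str(seed),
        str(args.get("type", "2")),     # "2" is the random type in the diff
        str(args.get("n_min", 1)),
        str(args.get("n_max", n_max)),  # falls back to the n_max parameter
        str(args.get("t_min", 1)),
        str(args.get("t_max", 1)),
    ]


if __name__ == "__main__":
    print(build_generator_args(7, n_max=50))
    print(build_generator_args(7, generator_args={"type": "3", "n_max": 10}))
```

Without explicit generator_args, the generator still receives all six arguments rather than the seed alone.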
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -34,12 +34,11 @@ jobs:
       - run: uv run mypy src/
 
   test-unit:
-    runs-on: ${{ matrix.os }}
+    runs-on: ubuntu-latest
     strategy:
       fail-fast: false
       matrix:
-        os: [ubuntu-latest, windows-latest, macos-latest]
-        python-version: ["3.10", "3.11", "3.12", "3.13", "3.14"]
+        python-version: ["3.10", "3.14"]
     steps:
       - uses: actions/checkout@v4
       - uses: actions/setup-python@v5
@@ -51,7 +50,6 @@ jobs:
       - run: uv sync --all-extras
       - run: uv run pytest tests/ -v -m "not integration" --cov --cov-report=xml
       - uses: codecov/codecov-action@v4
-        if: matrix.os == 'ubuntu-latest'
         with:
           files: coverage.xml
           flags: unit-tests
diff --git a/README.md b/README.md
@@ -7,14 +7,14 @@
 
 **An MCP Server for competitive programming problem creation, implementing the Validator-Generator-Checker framework from the AutoCode paper.**
 
-AutoCode MCP Server provides 14 atomic tools that enable AI assistants to create, validate, and test competitive programming problems. It handles compilation, execution, stress testing, and test data generation—letting the AI focus on problem design and solution logic.
+AutoCode MCP Server provides 15 atomic tools that enable AI assistants to create, validate, and test competitive programming problems. It handles compilation, execution, stress testing, and test data generation—letting the AI focus on problem design and solution logic.
 
 [中文文档](README_CN.md)
 
 ## Features
 
 - **Validator-Generator-Checker Framework** — Automated validation of input correctness, multi-strategy test generation, and output verification based on the AutoCode paper
-- **14 Atomic Tools** — File operations, solution building, stress testing, validator/generator/checker construction, and more
+- **15 Atomic Tools** — File operations, solution building, stress testing, validator/generator/checker construction, and more
 - **testlib.h Support** — Full integration with the competitive programming standard library for validators, generators, and checkers
 - **Multi-Strategy Generation** — Four generation strategies: tiny (exhaustive), random, extreme (edge cases), and TLE-inducing
 - **Stress Testing** — Automated comparison between optimal and brute-force solutions with configurable trial counts
@@ -189,11 +189,11 @@ For development or custom installations:
 
 ### Verify Installation
 
-After configuration, restart your MCP client and check that tools are available. You should see 14 tools prefixed with `autocode_`.
+After configuration, restart your MCP client and check that tools are available. You should see 15 tools available.
 
 ## Tools Reference
 
-AutoCode provides 14 atomic tools organized into 7 groups. All tools return a unified format:
+AutoCode provides 15 atomic tools organized into 7 groups. All tools return a unified format:
 
 ```json
 {
@@ -241,7 +241,7 @@ AutoCode provides 14 atomic tools organized into 7 groups. All tools return a un
 
 | Tool | Description | Key Parameters |
 |------|-------------|----------------|
-| `interactor_build` | Build interactor for interactive problems | `problem_dir`, `code`, `test_scenarios` |
+| `interactor_build` | Build interactor for interactive problems | `problem_dir`, `code`, `reference_solution_path`, `mutant_solutions` |
 
 ### Stress Testing
 
@@ -253,9 +253,9 @@ AutoCode provides 14 atomic tools organized into 7 groups. All tools return a un
 
 | Tool | Description | Key Parameters |
 |------|-------------|----------------|
-| `problem_create` | Initialize problem directory | `problem_dir`, `title`, `time_limit`, `memory_limit` |
+| `problem_create` | Initialize problem directory | `problem_dir`, `problem_name` |
 | `problem_generate_tests` | Generate final test data | `problem_dir`, `test_count` |
-| `problem_pack_polygon` | Package for Polygon platform | `problem_dir`, `output_dir` |
+| `problem_pack_polygon` | Package for Polygon platform | `problem_dir`, `time_limit`, `memory_limit` |
 
 ## Workflow Tutorial: A+B Problem
 
@@ -266,9 +266,7 @@ This tutorial walks through creating a simple A+B problem using AutoCode tools.
 ```python
 problem_create(
     problem_dir="problems/ab",
-    title="A + B",
-    time_limit=1000,
-    memory_limit=256
+    problem_name="A + B"
 )
 ```
 
@@ -393,7 +391,8 @@ problem_generate_tests(
 ```python
 problem_pack_polygon(
     problem_dir="problems/ab",
-    output_dir="polygon/ab"
+    time_limit=1,
+    memory_limit=256
 )
 ```
 
diff --git a/src/autocode_mcp/tools/checker.py b/src/autocode_mcp/tools/checker.py
@@ -26,8 +26,8 @@ def description(self) -> str:
         return """Build and verify the output checker.
 
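         Implemented following Algorithm 3 of the paper:
-        1. Save the code to problem_dir/checker.cpp
-        2. Compile it into checker.exe
+        1. Save the code to problem_dir/files/checker.cpp
+        2. Compile it into files/checker.exe
         3. Run the test scenarios to verify accuracy
         4. Return the accuracy and detailed results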
 
@@ -59,7 +59,7 @@ def input_schema(self) -> dict:
                             "reference_output": {"type": "string"},
                             "expected_verdict": {
                                 "type": "string",
-                                "enum": ["AC", "WA", "PE"],
+                                "enum": ["AC", "WA", "PE", "FAIL"],
                             },
                         },
                         "required": ["input", "contestant_output", "reference_output"],
@@ -155,7 +155,13 @@ async def execute(
                 # Checker exit-code convention (testlib.h):
                 # 0 = AC, 1 = WA, 2 = PE, 3+ = Fail (checker error)
                 verdict_map = {0: "AC", 1: "WA", 2: "PE"}
-                actual_verdict = verdict_map.get(run_result.return_code, "WA")
+                if run_result.return_code in verdict_map:
+                    actual_verdict = verdict_map[run_result.return_code]
+                elif run_result.return_code >= 3:
+                    # The checker itself failed: report FAIL rather than WA
+                    actual_verdict = "FAIL"
+                else:
+                    actual_verdict = "WA"
 
                 # Check for timeout
                 if run_result.timed_out:
diff --git a/src/autocode_mcp/tools/interactor.py b/src/autocode_mcp/tools/interactor.py
@@ -26,8 +26,8 @@ def description(self) -> str:
         return """Build and verify the interactor.
 
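         Implemented following Algorithm 4 of the paper:
-        1. Save the code to problem_dir/interactor.cpp
-        2. Compile it into interactor.exe
+        1. Save the code to problem_dir/files/interactor.cpp
+        2. Compile it into files/interactor.exe
         3. Run mutation tests to verify that it can tell correct and incorrect solutions apart
         4. Return pass_rate and fail_rate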
 
@@ -101,20 +101,33 @@ async def execute(
                 compile_log=compile_result.stderr,
             )
 
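-        # If neither a reference solution nor mutant solutions are provided, return success immediately
+        # If neither a reference solution nor mutant solutions are provided, return success immediately (with pass_rate set to 0)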
         if not reference_solution_path and not mutant_solutions:
             return ToolResult.ok(
                 source_path=source_path,
                 binary_path=binary_path,
                 compile_log=compile_result.stderr,
+                pass_rate=0.0,
+                fail_rate=0.0,
+                pass_count=0,
+                pass_total=0,
+                fail_count=0,
+                fail_total=0,
                 message="Interactor built successfully (no validation performed)",
             )
 
         # Measure the pass rate of the correct solution
         pass_count = 0
         pass_total = 0
 
-        if reference_solution_path and os.path.exists(reference_solution_path):
+        if reference_solution_path:
+            # Check that the reference solution exists; fail with an error instead of silently skipping it
+            if not os.path.exists(reference_solution_path):
+                return ToolResult.fail(
+                    f"Reference solution not found: {reference_solution_path}",
+                    source_path=source_path,
+                    binary_path=binary_path,
+                )
             pass_total = 1
             # Run the interactive test: the reference solution should be accepted
             test_result = await self._run_interactor_test(
@@ -138,7 +151,8 @@ async def execute(
                     if test_result.get("verdict") != "AC":
                         fail_count += 1
 
-        pass_rate = pass_count / pass_total if pass_total > 0 else 1.0
+        # Compute the pass rate: 0.0 when no tests ran, not 1.0
+        pass_rate = pass_count / pass_total if pass_total > 0 else 0.0
         fail_rate = fail_count / fail_total if fail_total > 0 else 0.0
 
         return ToolResult.ok(
@@ -209,8 +223,15 @@ async def _run_interactor_test(
                     return_when=asyncio.FIRST_COMPLETED,
                 )
 
-                # Check for timeout
-                timed_out = any(task.get_name() == "sleep" for task in done)
+                # Check for timeout: asyncio.wait returns sets, so iterate over `done`
+                timed_out = False
+                comm_task = None
+                for task in done:
+                    if task.get_name() == "sleep":
+                        timed_out = True
+                    else:
+                        comm_task = task
+
                 if timed_out:
                     interactor.kill()
                     solution.kill()
@@ -219,7 +240,6 @@ async def _run_interactor_test(
                     return {"verdict": "TLE", "reason": "Timeout"}
 
                 # Fetch the communication result
-                comm_task = done[0] if done else None
                 if comm_task and not comm_task.cancelled():
                     result = comm_task.result()
                     return result
@@ -282,9 +302,9 @@ async def pipe_data(reader, writer, name: str):
                 ),
             ]
 
-            # Wait for either process to finish
+            # Wait for either process to finish; wrap in create_task to avoid the deprecation warning
             await asyncio.wait(
-                [interactor.wait(), solution.wait()],
+                [asyncio.create_task(interactor.wait()), asyncio.create_task(solution.wait())],
                 return_when=asyncio.FIRST_COMPLETED,
             )
         except NotImplementedError:
diff --git a/src/autocode_mcp/tools/stress_test.py b/src/autocode_mcp/tools/stress_test.py
@@ -125,7 +125,7 @@ async def execute(
             for i in range(1, trials + 1):
                 # 1. Generate the input data
                 gen_result = await self._generate_input(
-                    gen_exe, input_path, i, seed=i, timeout=timeout, generator_args=generator_args
+                    gen_exe, input_path, i, seed=i, timeout=timeout, n_max=n_max, generator_args=generator_args
                 )
                 if not gen_result["success"]:
                     return ToolResult.fail(
@@ -201,6 +201,7 @@ async def _generate_input(
         round_num: int,
         seed: int,
         timeout: int,
+        n_max: int = 100,
         generator_args: dict | None = None,
     ) -> dict:
         """
@@ -212,6 +213,7 @@ async def _generate_input(
             round_num: current round number
             seed: random seed
             timeout: timeout in seconds
+            n_max: maximum value of N (used by the default protocol)
             generator_args: full generator arguments (optional)
 
         Returns:
@@ -225,13 +227,20 @@ async def _generate_input(
                     str(seed),
                     generator_args.get("type", "2"),
                     str(generator_args.get("n_min", 1)),
-                    str(generator_args.get("n_max", 100)),
+                    str(generator_args.get("n_max", n_max)),
                     str(generator_args.get("t_min", 1)),
                     str(generator_args.get("t_max", 1)),
                 ]
             else:
-                # Simple protocol: gen.exe <seed>
-                cmd_args = [str(seed)]
+                # Use the full protocol by default, consistent with generator_run and problem_generate_tests
+                cmd_args = [
+                    str(seed),
+                    "2",           # type=random
+                    "1",           # n_min
+                    str(n_max),    # n_max comes from the parameter
+                    "1",           # t_min
+                    "1",           # t_max
+                ]
 
             gen_result = await run_binary_with_args(
                 gen_exe,
diff --git a/src/autocode_mcp/tools/validator.py b/src/autocode_mcp/tools/validator.py
@@ -26,8 +26,8 @@ def description(self) -> str:
         return """Build and verify the data validator.
 
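         Implemented following Algorithm 1 of the paper:
-        1. Save the code to problem_dir/val.cpp
-        2. Compile it into val.exe
+        1. Save the code to problem_dir/files/val.cpp
+        2. Compile it into files/val.exe
         3. Run the test cases to verify robustness
         4. Return the score and detailed results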
 
diff --git a/tests/test_packaging.py b/tests/test_packaging.py
@@ -73,7 +73,7 @@ async def test_mcp_get_prompt_result_type():
     from autocode_mcp.server import get_prompt
 
     # Test a prompt that exists
-    result = await get_prompt("validator_workflow")
+    result = await get_prompt("validator")
     assert isinstance(result, GetPromptResult)
     assert len(result.messages) > 0
 
@@ -117,3 +117,100 @@ def test_all_template_files_exist():
     for template in expected_templates:
         path = os.path.join(TEMPLATES_DIR, template)
         assert os.path.exists(path), f"Template not found: {template}"
+
+
+@pytest.mark.asyncio
+async def test_interactor_reference_solution_not_found():
+    """Test that interactor_build fails when the reference solution is missing instead of silently skipping it."""
+    from autocode_mcp.server import call_tool, register_all_tools
+
+    register_all_tools()
+
+    import tempfile
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        # Use a reference solution path that does not exist
+        result = await call_tool("interactor_build", {
+            "problem_dir": tmpdir,
+            "code": '#include "testlib.h"\nint main() { return 0; }',
+            "reference_solution_path": os.path.join(tmpdir, "nonexistent.exe"),
+        })
+
+        assert result.isError is True
+        assert "Reference solution not found" in result.structuredContent.get("error", "")
+
+
+@pytest.mark.asyncio
+async def test_interactor_pass_rate_without_tests():
+    """Test that interactor_build reports pass_rate 0.0, not 1.0, when no tests run."""
+    from autocode_mcp.server import call_tool, register_all_tools
+
+    register_all_tools()
+
+    import tempfile
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        # Provide neither a reference solution nor mutant solutions
+        result = await call_tool("interactor_build", {
+            "problem_dir": tmpdir,
+            "code": '#include "testlib.h"\nint main() { return 0; }',
+        })
+
+        assert result.isError is False
+        # pass_rate lives in the data field
+        data = result.structuredContent.get("data", {})
+        assert data.get("pass_rate", 1.0) == 0.0
+
+
+@pytest.mark.asyncio
+async def test_checker_fail_verdict():
+    """Test that checker_build distinguishes FAIL from WA."""
+    from autocode_mcp.server import call_tool, register_all_tools
+
+    register_all_tools()
+
+    import tempfile
+
+    with tempfile.TemporaryDirectory() as tmpdir:
+        # Create a checker that returns a non-standard exit code
+        # testlib.h's quitf(_fail, ...) exits with code 3+
+        checker_code = '''
+#include "testlib.h"
+int main(int argc, char* argv[]) {
+    registerTestlibCmd(argc, argv);
+    // Force a FAIL verdict
+    quitf(_fail, "Checker internal error");
+    return 3;
+}
+'''
+        result = await call_tool("checker_build", {
+            "problem_dir": tmpdir,
+            "code": checker_code,
+            "test_scenarios": [
+                {
+                    "input": "1",
+                    "contestant_output": "1",
+                    "reference_output": "1",
+                    "expected_verdict": "FAIL",
+                },
+            ],
+        })
+
+        assert result.isError is False
+        test_results = result.structuredContent.get("test_results", [])
+        if test_results:
+            # Should be recognized as FAIL rather than WA
+            assert test_results[0].get("actual_verdict") == "FAIL"
+
+
+def test_all_prompts_exist():
+    """Test that every declared prompt exists."""
+    from autocode_mcp.prompts import get_prompt, list_prompts
+
+    prompts = list_prompts()
+    assert len(prompts) == 6
+
+    for name in prompts:
+        content = get_prompt(name)
+        assert content, f"Prompt '{name}' is empty"
+        assert len(content) > 100, f"Prompt '{name}' seems too short"
diff --git a/tests/test_tools/test_interactor.py b/tests/test_tools/test_interactor.py
diff --git a/tests/test_tools/test_stress_test.py b/tests/test_tools/test_stress_test.py
diff --git a/uv.lock b/uv.lock