Commit 7576337

fix mcp test generation stability and recovery
Made-with: Cursor
1 parent 76505cc commit 7576337

17 files changed

Lines changed: 1297 additions & 38 deletions

CLAUDE.md

Lines changed: 4 additions & 4 deletions
@@ -88,9 +88,9 @@ AutoCode/
 ├── statements/  # Problem statement
 │   └── README.md
 └── tests/  # Generated test data
-    ├── 01.in
-    ├── 01.ans
-    └── ...
+    ├── 01.in
+    ├── 01.ans / 01.out (controlled by answer_ext)
+    └── ...
 ```

 ## Problem-setting workflow
@@ -102,7 +102,7 @@ AutoCode/
 5. Build the generator (`generator_build`)
 6. Run the stress test (`stress_test_run`, completed_rounds == total_rounds)
 7. Build a checker if needed (`checker_build`, accuracy >= 0.9)
-8. Generate test data (`problem_generate_tests`, generated_test_count > 0, and extreme/tle cases make up at least half of the final set; best effort when candidates are insufficient)
+8. Generate test data (`problem_generate_tests`, generated_test_count > 0, supports `answer_ext`; extreme/tle cases make up at least half of the final set; best effort when candidates are insufficient; an interrupted long run can be continued with `resume=true`)
 9. Verify the test data (`problem_verify_tests`, passed)
 10. Pack for Polygon (`problem_pack_polygon`)

README.md

Lines changed: 6 additions & 4 deletions
@@ -246,7 +246,7 @@ AutoCode provides 15 atomic tools in 7 groups. All tools return a unified
 | Tool | Description | Key parameters |
 |------|------|----------|
 | `problem_create` | Initialize the problem directory | `problem_dir`, `problem_name` |
-| `problem_generate_tests` | Generate the final test data (extreme/tle cases make up at least half of the final set; best effort when candidates are insufficient) | `problem_dir`, `test_count` |
+| `problem_generate_tests` | Generate the final test data (extreme/tle cases make up at least half of the final set; best effort when candidates are insufficient) | `problem_dir`, `test_count`, `answer_ext`, `resume`, `hard_timeout_seconds` |
 | `problem_verify_tests` | Verify test data quality (including a hard check on the extreme/tle ratio) | `problem_dir`, `tests_dir`, `verify_types` |
 | `problem_pack_polygon` | Pack into Polygon format | `problem_dir`, `time_limit`, `memory_limit` |

@@ -375,11 +375,13 @@ All 1000 rounds passed
 ```python
 problem_generate_tests(
     problem_dir="problems/ab",
-    test_count=50
+    test_count=50,
+    answer_ext=".out",  # optional, defaults to .ans
+    hard_timeout_seconds=600
 )
 ```

-Note: among the tests finally written, `extreme` (type=3) and `tle` (type=4) together make up at least half; if the candidate pool lacks enough limit cases, the tool satisfies this as far as the available candidates allow and returns the corresponding statistics fields.
+Note: among the tests finally written, `extreme` (type=3) and `tle` (type=4) together make up at least half; if the candidate pool lacks enough limit cases, the tool satisfies this as far as the available candidates allow and returns the corresponding statistics fields. If a long-running task is interrupted, use `resume=true` to continue from the checkpoint.

 ### Step 7: Pack into Polygon format

@@ -499,7 +501,7 @@ problems/your-problem/
 │   └── README.md  # Problem statement
 ├── tests/
 │   ├── 01.in  # Test input
-│   ├── 01.ans  # Expected output
+│   ├── 01.ans/.out  # Expected output (controlled by answer_ext)
 │   └── ...
 └── problem.xml  # Polygon configuration
 ```
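The README changes above say that `answer_ext` decides whether the expected output is written as `01.ans` or `01.out`. The following is a minimal sketch of that naming rule, assuming two-digit zero-padded indices as shown in the directory tree; `case_file_names` is a hypothetical helper for illustration, not a function from AutoCode.

```python
def case_file_names(index: int, answer_ext: str = ".ans") -> tuple[str, str]:
    """Return (input_name, answer_name) for one test case.

    Hypothetical helper: mirrors the documented layout (01.in + 01.ans/.out).
    """
    # Accept both ".out" and "out" so a missing dot cannot yield "01out".
    ext = answer_ext if answer_ext.startswith(".") else "." + answer_ext
    stem = f"{index:02d}"
    return f"{stem}.in", f"{stem}{ext}"
```

With the default, `case_file_names(1)` gives `("01.in", "01.ans")`; passing `".out"` switches only the answer file name.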

agents/autocode-workflow.md

Lines changed: 3 additions & 1 deletion
@@ -25,6 +25,8 @@ Always work through this sequence unless the task is explicitly outside problem

 When the user asks for a later step directly, explain which prerequisite step is missing and complete the missing work first.

-When running `problem_generate_tests`, enforce test quality: final test data should contain at least half limit-oriented cases (`type=3` extreme + `type=4` tle) when candidate availability allows.
+When running `problem_generate_tests`, enforce test quality: final test data should contain at least half limit-oriented cases (`type=3` extreme + `type=4` tle) when candidate availability allows. Also enforce that generator logic for type=3 and type=4 is semantically different (type=4 should include targeted worst-case patterns, not only max-parameter scaling).
+
+For long-running `problem_generate_tests`, warn that new user messages can interrupt MCP execution. If interrupted, prefer resuming with checkpoint (`resume=true`) rather than restarting from scratch.

 Treat hook feedback as authoritative. If a hook denies a tool call, fix the workflow gap instead of retrying the same call.
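The diff does not show how the checkpoint behind `resume=true` is stored. As a rough illustration of the resume-instead-of-restart idea only, here is a sketch that assumes a JSON file listing finished test indices; the file format, the function name and the `stop_after` parameter (which simulates an interruption) are all hypothetical.

```python
import json
import os


def generate_with_checkpoint(total, checkpoint_path, resume=False, stop_after=None):
    """Produce test indices 1..total, saving progress after each one.

    Hypothetical sketch: `stop_after` simulates an interrupted MCP call.
    """
    done = []
    if resume and os.path.exists(checkpoint_path):
        # Pick up the indices that an earlier, interrupted run completed.
        with open(checkpoint_path, encoding="utf-8") as f:
            done = json.load(f)["done"]

    for index in range(len(done) + 1, total + 1):
        if stop_after is not None and index > stop_after:
            break  # simulated interruption
        done.append(index)  # stand-in for actually generating test `index`
        with open(checkpoint_path, "w", encoding="utf-8") as f:
            json.dump({"done": done}, f)
    return done
```

Calling it once with `stop_after=2` and then again with `resume=True` continues at test 3; calling it without `resume` ignores the checkpoint and starts over.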

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -80,6 +80,7 @@ exclude_lines = [

 [dependency-groups]
 dev = [
+    "twine>=6.2.0",
     "types-psutil>=7.2.2.20260402",
     "types-pywin32>=311.0.0.20260402",
     "types-pyyaml>=6.0.12.20250915",

scripts/workflow_guard.py

Lines changed: 2 additions & 0 deletions
@@ -272,6 +272,8 @@ def session_start() -> int:
         "problem_validate(validation_passed) -> "
         "problem_generate_tests(generated_test_count > 0, and prefer >=50% type3/type4 in final tests when candidates are sufficient) -> "
         "problem_verify_tests(passed) -> problem_pack_polygon. "
+        "When running long problem_generate_tests tasks, avoid sending new chat messages because that can interrupt MCP calls; if interrupted, resume with checkpoint state (resume=true). "
+        "Generator quality gate: ensure type=3 and type=4 branches are semantically different, and type=4 includes targeted worst-case patterns rather than only max parameters. "
         "If a hook blocks a step, complete the missing prerequisite instead of retrying blindly."
     )
     print(
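The session hint above restates the quality gate: at least half of the final tests should be type 3/4 when the candidate pool allows, with best effort otherwise. One way to read that rule is sketched below; the selection order, the dictionary shape of a candidate and the returned statistics names are assumptions, since the real selection code is not part of the hunks shown.

```python
import math

LIMIT_TYPES = {3, 4}  # extreme and tle


def select_final_tests(candidates, test_count):
    """Pick up to test_count candidates, reserving ceil(test_count/2) for type 3/4."""
    limit = [c for c in candidates if c["type"] in LIMIT_TYPES]
    other = [c for c in candidates if c["type"] not in LIMIT_TYPES]

    quota = math.ceil(test_count / 2)
    picked_limit = limit[:quota]  # best effort: may be shorter than quota
    picked_other = other[: test_count - len(picked_limit)]

    # If the non-limit pool ran dry, top up with leftover limit cases.
    shortfall = test_count - len(picked_limit) - len(picked_other)
    picked_limit += limit[len(picked_limit): len(picked_limit) + shortfall]

    stats = {
        "quota": quota,
        "limit_count": len(picked_limit),
        "quota_met": len(picked_limit) >= quota,
    }
    return picked_limit + picked_other, stats
```

When only three limit cases exist for a ten-test set, the sketch still returns ten tests but reports `quota_met` as false, which matches the "best effort when candidates are insufficient" wording.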

skills/autocode-workflow/SKILL.md

Lines changed: 3 additions & 1 deletion
@@ -233,9 +233,10 @@ CRITICAL: Must pass validation before generating final tests
 Tool: problem_generate_tests
 Required: problem_dir
 Recommended: test_count=50, enable_dedup=true, enable_validator_filter=true
-Output: tests/01.in ~ tests/50.in + corresponding .ans files
+Output: tests/01.in ~ tests/50.in + corresponding answer files (`.ans` by default, or configured `answer_ext` such as `.out`)
 Verify: Check generated_tests count matches test_count
 Quality Gate: In final tests, type 3/4 (extreme + tle) should be >= ceil(test_count/2) when candidates are sufficient
+Long-running note: sending new user messages may interrupt MCP execution; prefer waiting, or resume with `resume=true` if interrupted.
 ```

 ### Phase 9: Packaging
@@ -337,6 +338,7 @@ Before considering the problem complete:
 - [ ] Sample files validated (problem_validate passed)
 - [ ] Final test data generated (50+ tests)
 - [ ] Final test data has at least 50% extreme/tle cases when candidate pool allows
+- [ ] type=3/type=4 generation logic is semantically different (not just max-parameter duplication)
 - [ ] Polygon package created

 ## Example Complete Workflow

src/autocode_mcp/prompts/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -65,6 +65,7 @@
 - First ensure that at least half of the final tests are extreme/tle (type=3/4; best effort when candidates are insufficient)
 - Then balance the distribution
 - Sample
+- Avoid sending new messages during long tasks (this may interrupt the MCP call); if interrupted, prefer continuing via resume/checkpoint

 ## Quality metrics
 - Consistency > 90%
@@ -124,6 +125,7 @@
 - type=2 (random): random data
 - type=3 (extreme): extreme data (overflow, precision, hash collisions)
 - type=4 (tle): TLE-inducing data
+- Require a substantive difference between the type=3 and type=4 branches; type=4 should include targeted worst-case constructions rather than relying only on maxing out n_max/t_max

 ### Code template
 ```cpp

src/autocode_mcp/server.py

Lines changed: 19 additions & 1 deletion
@@ -32,7 +32,12 @@
 from .tools.file_ops import FileReadTool, FileSaveTool
 from .tools.generator import GeneratorBuildTool, GeneratorRunTool
 from .tools.interactor import InteractorBuildTool
-from .tools.problem import ProblemCreateTool, ProblemGenerateTestsTool, ProblemPackPolygonTool
+from .tools.problem import (
+    ProblemCleanupProcessesTool,
+    ProblemCreateTool,
+    ProblemGenerateTestsTool,
+    ProblemPackPolygonTool,
+)
 from .tools.solution import SolutionBuildTool, SolutionRunTool
 from .tools.stress_test import StressTestRunTool
 from .tools.test_verify import ProblemVerifyTestsTool
@@ -68,6 +73,7 @@ def register_all_tools() -> None:
     # Problem tool group
     register_tool(ProblemCreateTool())
     register_tool(ProblemGenerateTestsTool())
+    register_tool(ProblemCleanupProcessesTool())
     register_tool(ProblemVerifyTestsTool())
     register_tool(ProblemPackPolygonTool())
     register_tool(ProblemValidateTool())
@@ -118,6 +124,18 @@ async def call_tool(name: str, arguments: dict[str, Any]) -> CallToolResult:
             structuredContent=result_dict,
             isError=not result.success,
         )
+    except asyncio.CancelledError:
+        cancel_result = ToolResult.fail(
+            "Tool call interrupted by cancellation",
+            interrupted=True,
+            resume_hint="Retry with resume=true if tool supports checkpoints",
+        )
+        cancel_dict = cancel_result.to_dict()
+        return CallToolResult(
+            content=[TextContent(type="text", text=json.dumps(cancel_dict, ensure_ascii=False))],
+            structuredContent=cancel_dict,
+            isError=True,
+        )
     except Exception as e:
         error_result = ToolResult.fail(str(e))
         error_dict = error_result.to_dict()
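The new `except asyncio.CancelledError` branch needs to be separate because, since Python 3.8, `CancelledError` derives from `BaseException` and is not caught by `except Exception`. The sketch below reproduces the pattern with plain dictionaries in place of `ToolResult` and `CallToolResult`; `call_tool_sketch` and `demo` are illustrative names, not part of the server.

```python
import asyncio


async def call_tool_sketch(work):
    """Await `work()` and turn cancellation into a structured failure."""
    try:
        value = await work()
        return {"success": True, "value": value}
    except asyncio.CancelledError:
        # Swallowing CancelledError is deliberate here: the caller receives
        # a result payload instead of a bare cancellation.
        return {
            "success": False,
            "error": "Tool call interrupted by cancellation",
            "interrupted": True,
        }


async def demo():
    async def slow_tool():
        await asyncio.sleep(10)
        return "done"

    task = asyncio.create_task(call_tool_sketch(slow_tool))
    await asyncio.sleep(0.01)  # let the task start before cancelling it
    task.cancel()
    return await task
```

Running `asyncio.run(demo())` returns the failure dictionary rather than raising. The trade-off of this design is that the task no longer reports itself as cancelled, so whether the payload reaches the client depends on how the surrounding framework treats a cancelled request.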

src/autocode_mcp/tools/generator.py

Lines changed: 50 additions & 0 deletions
@@ -8,6 +8,7 @@

 import hashlib
 import os
+import re

 from ..utils.compiler import run_binary, run_binary_with_args
 from ..utils.platform import get_exe_extension
@@ -58,6 +59,16 @@ def input_schema(self) -> dict:
                     "description": "Compiler name",
                     "default": "g++",
                 },
+                "enable_semantic_check": {
+                    "type": "boolean",
+                    "description": "Whether to enable the static semantic check for type=3/type=4",
+                    "default": True,
+                },
+                "strict_semantic_check": {
+                    "type": "boolean",
+                    "description": "Whether to fail outright when the static semantic check does not pass",
+                    "default": False,
+                },
             },
             "required": ["problem_dir"],
             "anyOf": [
@@ -72,6 +83,8 @@ async def execute(
         code: str | None = None,
         source_path: str | None = None,
         compiler: str = "g++",
+        enable_semantic_check: bool = True,
+        strict_semantic_check: bool = False,
     ) -> ToolResult:
         """Build the generator."""
         resolved, err = resolve_source(problem_dir, code, source_path)
@@ -107,15 +120,52 @@ async def execute(

         binary_size = os.path.getsize(binary_path) if os.path.exists(binary_path) else 0

+        semantic_check = self._check_type34_semantics(resolved.code) if enable_semantic_check else {"enabled": False}
+        if (
+            enable_semantic_check
+            and strict_semantic_check
+            and not semantic_check.get("passed", True)
+        ):
+            return ToolResult.fail(
+                "Generator semantic check failed: type=3/type=4 lack substantial difference",
+                semantic_check=semantic_check,
+            )
+
         return ToolResult.ok(
             source_path=compile_source,
             canonical_path=canonical_path,
             binary_path=binary_path,
             binary_size=binary_size,
             compile_log=compile_result.stderr,
+            semantic_check=semantic_check,
             message="Generator built successfully",
         )

+    def _check_type34_semantics(self, code: str) -> dict:
+        has_type3 = bool(re.search(r"type\s*==\s*3", code))
+        has_type4 = bool(re.search(r"type\s*==\s*4", code))
+        if not has_type3 or not has_type4:
+            return {
+                "enabled": True,
+                "passed": False,
+                "reason": "generator lacks explicit type==3/type==4 branches",
+                "hint": "Design different logic for type=3 and type=4; do not rely on parameter scaling alone",
+            }
+
+        type3_blocks = re.findall(r"type\s*==\s*3[\s\S]{0,240}", code)
+        type4_blocks = re.findall(r"type\s*==\s*4[\s\S]{0,240}", code)
+        norm3 = " ".join(type3_blocks).replace(" ", "")
+        norm4 = " ".join(type4_blocks).replace(" ", "")
+        output_lines = [line.strip() for line in code.splitlines() if "cout" in line or "printf" in line]
+        duplicate_outputs = len(set(output_lines)) <= 1 and len(output_lines) > 0
+        similar = norm3 == norm4 or (norm3 and norm4 and abs(len(norm3) - len(norm4)) < 10) or duplicate_outputs
+        return {
+            "enabled": True,
+            "passed": not similar,
+            "reason": "" if not similar else "type=3/type=4 branch snippets are too similar",
+            "hint": "Add targeted worst-case constructions for type=4 instead of only maxing out n_max/t_max",
+        }
+

 class GeneratorRunTool(Tool):
     """Run the multi-strategy data generator."""

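The `_check_type34_semantics` heuristic added in `src/autocode_mcp/tools/generator.py` compares 240-character windows after each `type == 3` and `type == 4` match. To see how it behaves on concrete input, the sketch below copies its logic into a standalone function (result messages shortened, `enabled` and `hint` fields dropped) and feeds it synthetic strings; the inputs are made up for illustration and are not real generators.

```python
import re


def check_type34_semantics(code: str) -> dict:
    """Standalone copy of the heuristic from the diff, for experimentation."""
    has_type3 = bool(re.search(r"type\s*==\s*3", code))
    has_type4 = bool(re.search(r"type\s*==\s*4", code))
    if not has_type3 or not has_type4:
        return {"passed": False, "reason": "missing type==3/type==4 branch"}

    # Each window is the match plus at most 240 following characters.
    type3_blocks = re.findall(r"type\s*==\s*3[\s\S]{0,240}", code)
    type4_blocks = re.findall(r"type\s*==\s*4[\s\S]{0,240}", code)
    norm3 = " ".join(type3_blocks).replace(" ", "")
    norm4 = " ".join(type4_blocks).replace(" ", "")
    output_lines = [ln.strip() for ln in code.splitlines() if "cout" in ln or "printf" in ln]
    duplicate_outputs = len(set(output_lines)) <= 1 and len(output_lines) > 0
    similar = norm3 == norm4 or (norm3 and norm4 and abs(len(norm3) - len(norm4)) < 10) or duplicate_outputs
    return {"passed": not similar, "reason": "too similar" if similar else ""}


# Synthetic inputs: the branch bodies differ in content ("a" vs "b").
long_a = "if(type==3){" + "a" * 300 + "}"
long_b = "if(type==4){" + "b" * 300 + "}"
short_b = "if(type==4){b}"
```

On these inputs the check rejects `long_a + long_b` even though the two bodies share no content, because both windows are truncated to the same length and the `< 10` length test fires. It accepts `long_a + short_b`, where the lengths differ. This suggests the length comparison may produce false positives on generators whose branches are each followed by more than 240 characters of code, which is presumably why `strict_semantic_check` defaults to `False`.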