Commit 161a80c

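feat: add problem_validate tool to validate statement samples

Add a new MCP tool, problem_validate, that checks the correctness of statement samples and sample files:
- Verify that the sample answers in the statement are correct (by running sol)
- Verify that the sample files under tests/ match sol's output
- Support multiple sample formats (Markdown code blocks, plain text)

Workflow changes:
- stress_test_run -> problem_validate -> problem_generate_tests
- problem_validate must pass before problem_generate_tests

Fixes for issues found in Codex review:
- Return failure rather than success when there are no samples
- Support extracting samples in plain-text format
- Correctly clear the cached state when re-validation fails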
1 parent aa5269d commit 161a80c

9 files changed

Lines changed: 966 additions & 19 deletions

agents/autocode-workflow.md

Lines changed: 3 additions & 2 deletions
@@ -19,8 +19,9 @@ Always work through this sequence unless the task is explicitly outside problem
 5. `generator_build`
 6. `stress_test_run`
 7. `checker_build` when the problem requires a non-exact checker
-8. `problem_generate_tests`
-9. `problem_pack_polygon`
+8. `problem_validate`
+9. `problem_generate_tests`
+10. `problem_pack_polygon`

 When the user asks for a later step directly, explain which prerequisite step is missing and complete the missing work first.
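The instruction to name the missing prerequisite can be sketched as a small lookup over the step order. This is only an illustration: `WORKFLOW` is abbreviated to the steps visible in this hunk, and `missing_prerequisites` and `OPTIONAL` are hypothetical names that do not appear in the repository.

```python
# Hypothetical helper; step names come from the numbered list in the hunk above.
WORKFLOW = [
    "generator_build",
    "stress_test_run",
    "checker_build",  # only needed when the problem requires a non-exact checker
    "problem_validate",
    "problem_generate_tests",
    "problem_pack_polygon",
]
OPTIONAL = {"checker_build"}


def missing_prerequisites(requested: str, completed: set[str]) -> list[str]:
    """Return the required steps before `requested` that are not completed yet."""
    index = WORKFLOW.index(requested)
    return [
        step
        for step in WORKFLOW[:index]
        if step not in completed and step not in OPTIONAL
    ]


done = {"generator_build", "stress_test_run"}
print(missing_prerequisites("problem_generate_tests", done))
```

With the new ordering, asking for `problem_generate_tests` right after a passing stress test reports `problem_validate` as the step to complete first.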

scripts/workflow_guard.py

Lines changed: 28 additions & 2 deletions
@@ -49,6 +49,9 @@ def infer_state(problem_dir: str) -> dict[str, Any]:
         "stress_total_rounds": 0,
         "checker_ready": (root / "files" / "checker.cpp").exists() or any(root.glob("files/checker.*")),
         "checker_accuracy": None,
+        "statement_validated": False,
+        "sample_files_validated": False,
+        "validation_passed": False,
         "tests_generated": any((root / "tests").glob("*.in")) if (root / "tests").exists() else False,
         "generated_test_count": len(list((root / "tests").glob("*.in"))) if (root / "tests").exists() else 0,
         "packaged": (root / "problem.xml").exists(),
@@ -115,7 +118,8 @@ def pre_tool(payload: dict[str, Any]) -> int:
         "generator_build": "You must complete validator_build first, with validator accuracy >= 0.9.",
         "stress_test_run": "You must complete validator_build (accuracy >= 0.9) and generator_build before running stress_test_run.",
         "checker_build": "stress_test_run must pass (completed_rounds == total_rounds) before building the checker.",
-        "problem_generate_tests": "stress_test_run must pass (completed_rounds == total_rounds) before generating the final test data.",
+        "problem_validate": "stress_test_run must pass (completed_rounds == total_rounds) before running validation.",
+        "problem_generate_tests": "problem_validate must pass (validation succeeded) before generating the final test data.",
         "problem_pack_polygon": "The final test data must be generated first, with a generated count > 0, before packaging.",
     }

@@ -153,7 +157,13 @@ def pre_tool(payload: dict[str, Any]) -> int:
         deny(reasons["checker_build"])
         return 0

-    if short_name == "problem_generate_tests" and not state["stress_passed"]:
+    if short_name == "problem_validate" and not state["stress_passed"]:
+        deny(reasons["problem_validate"])
+        return 0
+
+    if short_name == "problem_generate_tests" and not (
+        state["stress_passed"] and state.get("validation_passed", False)
+    ):
         deny(reasons["problem_generate_tests"])
         return 0
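The two gates added in this hunk can be read as one pure function over the state dictionary. The sketch below is a simplified restatement under assumptions: `blocked_reason` is a hypothetical name, the messages are paraphrased, and the real hook calls `deny(...)` and returns `0` rather than returning a string.

```python
from typing import Any, Optional


def blocked_reason(short_name: str, state: dict[str, Any]) -> Optional[str]:
    """Return a denial message if the tool may not run yet, otherwise None."""
    stress_ok = bool(state.get("stress_passed", False))
    validated = bool(state.get("validation_passed", False))

    # problem_validate only needs a passing stress test.
    if short_name == "problem_validate" and not stress_ok:
        return "stress_test_run must pass before problem_validate"

    # problem_generate_tests needs both the stress test and the validation.
    if short_name == "problem_generate_tests" and not (stress_ok and validated):
        return "problem_validate must pass before problem_generate_tests"

    return None


state = {"stress_passed": True, "validation_passed": False}
print(blocked_reason("problem_validate", state))
print(blocked_reason("problem_generate_tests", state))
```

Reading `validation_passed` with a default of `False` mirrors the `state.get("validation_passed", False)` call in the diff, so state files written before this commit are treated as not yet validated.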

@@ -173,6 +183,17 @@ def post_tool(payload: dict[str, Any]) -> int:
         return 0

     success, data = parse_tool_result(payload)
+
+    # Special case: the state must also be updated when problem_validate fails,
+    # so that a failed re-validation clears the stale validation_passed flag.
+    if short_name == "problem_validate" and not success:
+        state = load_state(problem_dir)
+        state["statement_validated"] = data.get("statement_samples", {}).get("validated", False)
+        state["sample_files_validated"] = data.get("sample_files", {}).get("validated", False)
+        state["validation_passed"] = False
+        save_state(problem_dir, state)
+        return 0
+
     if not success:
         return 0

@@ -200,6 +221,10 @@ def post_tool(payload: dict[str, Any]) -> int:
         accuracy = data.get("accuracy")
         state["checker_accuracy"] = accuracy
         state["checker_ready"] = accuracy is None or accuracy >= 0.9
+    elif short_name == "problem_validate":
+        state["statement_validated"] = data.get("statement_samples", {}).get("validated", False)
+        state["sample_files_validated"] = data.get("sample_files", {}).get("validated", False)
+        state["validation_passed"] = success
     elif short_name == "problem_generate_tests":
         generated_tests = data.get("generated_tests", [])
         state["tests_generated"] = bool(generated_tests)
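The success and failure branches for `problem_validate` write the same three keys, which is what lets a failed re-run clear an earlier pass. A minimal sketch of that update, assuming the result payload has the `statement_samples` / `sample_files` shape read in the diff; `apply_validate_result` is a hypothetical helper, and the real hook persists the state with `load_state` and `save_state` instead of returning a copy.

```python
from typing import Any


def apply_validate_result(
    state: dict[str, Any], success: bool, data: dict[str, Any]
) -> dict[str, Any]:
    """Return a copy of `state` updated from one problem_validate result."""
    updated = dict(state)
    updated["statement_validated"] = data.get("statement_samples", {}).get("validated", False)
    updated["sample_files_validated"] = data.get("sample_files", {}).get("validated", False)
    # On failure this is False, which overwrites any earlier True.
    updated["validation_passed"] = success
    return updated


state: dict[str, Any] = {"stress_passed": True}
state = apply_validate_result(
    state, True, {"statement_samples": {"validated": True}, "sample_files": {"validated": True}}
)
# A later run fails because a sample in the statement was edited.
state = apply_validate_result(
    state, False, {"statement_samples": {"validated": False}, "sample_files": {"validated": True}}
)
print(state["validation_passed"])
```

After the second call `validation_passed` is `False` again, so the `problem_generate_tests` gate closes until validation is re-run successfully.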
@@ -218,6 +243,7 @@ def session_start() -> int:
         "validator_build(accuracy >= 0.9) -> generator_build -> "
         "stress_test_run(completed_rounds == total_rounds) -> "
         "checker_build if needed (accuracy >= 0.9) -> "
+        "problem_validate(validation_passed) -> "
         "problem_generate_tests(generated_test_count > 0) -> problem_pack_polygon. "
         "If a hook blocks a step, complete the missing prerequisite instead of retrying blindly."
     )

skills/autocode-workflow/SKILL.md

Lines changed: 57 additions & 13 deletions
@@ -52,13 +52,19 @@ Based on the paper "AutoCode: LLMs as Problem Setters for Competitive Programmin
 │ │ (for non-exact problems) │ │
 │ └────────────────────┬────────────────────┘ │
 │ │ │
-│ Phase 7: Test Generation │
+│ Phase 7: Sample Validation │
+│ ┌────────────────────┴────────────────────┐ │
+│ │ problem_validate │ Validate statement samples │
+│ │ (statement_samples + sample_files) │ and test files │
+│ └────────────────────┬────────────────────┘ │
+│ │ │
+│ Phase 8: Test Generation │
 │ ┌────────────────────┴────────────────────┐ │
 │ │ problem_generate_tests │ Generate final test data │
 │ │ (dedup + validator filter + balance) │ │
 │ └────────────────────┬────────────────────┘ │
 │ │ │
-│ Phase 8: Packaging │
+│ Phase 9: Packaging │
 │ ┌────────────────────┴────────────────────┐ │
 │ │ problem_pack_polygon │ Export for Codeforces/Polygon │
 │ └─────────────────────────────────────────┘ │
@@ -198,9 +204,31 @@ Verify: Check accuracy >= 0.9
 ]
 ```

-### Phase 7: Test Generation
+### Phase 7: Sample Validation

-**Step 7.1: Generate Final Tests**
+**Step 7.1: Validate Statement Samples**
+```
+Tool: problem_validate
+Required: problem_dir
+Optional: statement_samples (if not provided, auto-extract from README.md)
+Output: validation results for statement_samples and sample_files
+Verify: Check success=true, all samples passed
+CRITICAL: Must pass validation before generating final tests
+```
+
+**Validation Types:**
+- `statement_samples`: Validate samples in problem statement (README.md)
+- `sample_files`: Validate sample files in tests/ directory
+
+**If validation fails:**
+1. Check the failing sample's expected output
+2. Run sol manually to verify correct output
+3. Update README.md or sample files as needed
+4. Re-run validation
+
+### Phase 8: Test Generation
+
+**Step 8.1: Generate Final Tests**
 ```
 Tool: problem_generate_tests
 Required: problem_dir
@@ -209,9 +237,9 @@ Output: tests/01.in ~ tests/50.in + corresponding .ans files
 Verify: Check generated_tests count matches test_count
 ```

-### Phase 8: Packaging
+### Phase 9: Packaging

-**Step 8.1: Pack for Polygon**
+**Step 9.1: Pack for Polygon**
 ```
 Tool: problem_pack_polygon
 Required: problem_dir
@@ -254,16 +282,18 @@ Generate 3-5 mutant solutions with common bugs:
 | 4 | `generator_build` | Step 3 | `success=true`, gen.exe exists |
 | 5 | `stress_test_run` | Step 4 | `"All N rounds passed"` |
 | 6 | `checker_build` (optional) | Step 5 | `accuracy >= 0.9` |
-| 7 | `problem_generate_tests` | Step 5 or 6 | `generated_tests == test_count` |
-| 8 | `problem_pack_polygon` | Step 7 | `success=true` |
+| 7 | `problem_validate` | Step 5 or 6 | `success=true`, all samples passed |
+| 8 | `problem_generate_tests` | Step 7 | `generated_tests == test_count` |
+| 9 | `problem_pack_polygon` | Step 8 | `success=true` |

 ### FORBIDDEN Actions

 1. **NEVER** call `generator_build` before `validator_build`
 2. **NEVER** call `stress_test_run` before building BOTH sol AND brute
-3. **NEVER** call `problem_generate_tests` before stress test passes
-4. **NEVER** skip stress test verification
-5. **NEVER** proceed if any step returns `success=false`
+3. **NEVER** call `problem_validate` before stress test passes
+4. **NEVER** call `problem_generate_tests` before validation passes
+5. **NEVER** skip stress test verification
+6. **NEVER** proceed if any step returns `success=false`

 ## Error Recovery

@@ -284,6 +314,13 @@ Generate 3-5 mutant solutions with common bugs:
 3. Fix the buggy solution
 4. Rebuild and re-run stress test

+### Validation Failure
+1. The result contains `statement_samples` or `sample_files` details
+2. Check which sample failed (expected vs actual output)
+3. Verify correct output by running sol manually
+4. Update README.md or sample files with correct output
+5. Re-run validation
+
 ## Quality Checklist

 Before considering the problem complete:
@@ -295,6 +332,8 @@ Before considering the problem complete:
 - [ ] Generator produces valid inputs
 - [ ] Stress test passes 1000+ rounds
 - [ ] (If applicable) Checker passes 90%+ scenarios
+- [ ] Statement samples validated (problem_validate passed)
+- [ ] Sample files validated (problem_validate passed)
 - [ ] Final test data generated (50+ tests)
 - [ ] Polygon package created
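The two sample-validation items in this checklist come down to comparing each sample's expected output with what sol prints. The sketch below shows one way that comparison could work; it is not the tool's implementation. The whitespace-tolerant `normalize` rule, the `check_samples` helper, and the in-process `fake_sol` stand-in (an A+B solution) are all assumptions made for illustration. The only behaviour taken from the commit message is that a run with no samples counts as a failure.

```python
from typing import Callable


def normalize(text: str) -> list[str]:
    """Strip trailing whitespace per line and drop trailing blank lines."""
    lines = [line.rstrip() for line in text.splitlines()]
    while lines and lines[-1] == "":
        lines.pop()
    return lines


def check_samples(
    samples: list[tuple[str, str]], run_sol: Callable[[str], str]
) -> list[dict]:
    """Return one record per sample whose expected output differs from sol's."""
    failures = []
    for index, (sample_input, expected) in enumerate(samples, start=1):
        actual = run_sol(sample_input)
        if normalize(actual) != normalize(expected):
            failures.append({"sample": index, "expected": expected, "actual": actual})
    return failures


def fake_sol(sample_input: str) -> str:
    # Stand-in for running the real sol binary: adds two integers.
    a, b = map(int, sample_input.split())
    return f"{a + b}\n"


samples = [("1 2\n", "3\n"), ("5 7\n", "13\n")]
failures = check_samples(samples, fake_sol)
# Per the commit message, having no samples at all must not count as success.
success = bool(samples) and not failures
print(success, failures)
```

Here the second sample is reported because the statement claims `13` while the stand-in solution prints `12`, which is the situation the recovery steps address by correcting README.md or the sample file.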

@@ -329,11 +368,15 @@ assert result["completed_rounds"] == result["total_rounds"]
 result = checker_build(problem_dir="problems/ab", code=checker_code, test_scenarios=checker_tests)
 assert result["accuracy"] >= 0.9

-# Phase 7: Generate Tests
+# Phase 7: Validate Samples
+result = problem_validate(problem_dir="problems/ab")
+assert result["success"] == True
+
+# Phase 8: Generate Tests
 result = problem_generate_tests(problem_dir="problems/ab", test_count=50)
 assert len(result["generated_tests"]) == 50

-# Phase 8: Package
+# Phase 9: Package
 result = problem_pack_polygon(problem_dir="problems/ab", time_limit=1, memory_limit=256)
 assert result["success"] == True
 ```
@@ -356,4 +399,5 @@ If the user asks to skip steps (e.g., "just generate tests"), you MUST:
 | `checker_build` | Algorithm 3 | Build output verification |
 | `interactor_build` | Algorithm 4 | Build interactive problem handler |
 | `stress_test_run` | - | Verify solution correctness |
+| `problem_validate` | - | Validate statement samples and sample files |
 | `problem_generate_tests` | - | Generate final test dataset |
