Commit 08268de

feat: enforce limit-ratio quality gate and bump 0.8.0 (#4)
* feat: enforce extreme/tle ratio in final tests and bump to 0.8.0

  Guarantee final generated tests prioritize limit-oriented coverage by requiring at least half type=3/4 cases by default, and verify this via manifest-backed quality checks with an explicit opt-out. Also synchronize workflow docs and plugin/package versions for the 0.8.0 release line.

  Made-with: Cursor

* fix: align balance description and preserve duplicates in sampling

  Address Copilot review by matching schema wording with actual deterministic ordering and preventing unconditional signature-based de-duplication during final sampling, so enable_dedup=false semantics remain effective.

  Made-with: Cursor
1 parent 75ea114 commit 08268de

16 files changed

Lines changed: 441 additions & 54 deletions

.claude-plugin/plugin.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
 "name": "autocode",
-"version": "0.7.0",
+"version": "0.8.0",
 "description": "Claude Code plugin for competitive programming problem-setting workflows.",
 "author": {
 "name": "SummerOneTwo",

CHANGELOG.md

Lines changed: 8 additions & 0 deletions
@@ -5,6 +5,14 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [0.8.0] - 2026-04-28
+
+### Improvements
+
+- **Limit-case ratio for final tests**: the `problem_generate_tests` sampling strategy now first ensures that `type=3/4` cases (extreme + tle) make up at least half of the final test set (best effort when candidates are insufficient), and returns the statistics fields `limit_case_count`, `limit_case_minimum_required`, and `limit_case_quota_met`.
+- **Hard constraint at verification**: `problem_verify_tests` adds a `limit_ratio` check (enabled by default) that uses the generation manifest to enforce that `type=3/4` cases make up at least half of the final tests; verification fails outright if they do not. The check can be disabled explicitly with `enable_limit_ratio=false`.
+- **Docs and workflow sync**: updated the README, workflow skill, agent prompt, and prompts text to describe the same quality gate: at least half of the final tests are limit cases.
+
 ## [0.7.0] - 2026-04-27
 
 ### Features
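
The sampling behavior described in the 0.8.0 entry can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `select_final_tests` helper and the dict-based candidate shape are hypothetical, and only the three `limit_case_*` field names and the `type=3/4` convention come from the changelog. It also assumes the quota is `ceil(test_count / 2)`, as the SKILL.md quality gate states.

```python
import math

LIMIT_TYPES = {3, 4}  # type=3 extreme, type=4 tle (per the changelog)


def select_final_tests(candidates, test_count):
    """Pick up to test_count candidates, filling the limit-case quota first.

    candidates: list of dicts with a "type" key (hypothetical shape).
    Order is preserved within each group, so the result is deterministic.
    """
    required = math.ceil(test_count / 2)
    limit = [c for c in candidates if c["type"] in LIMIT_TYPES]
    other = [c for c in candidates if c["type"] not in LIMIT_TYPES]

    # Take as many limit cases as the quota needs (best effort if too few).
    chosen_limit = limit[:required]
    # Fill remaining slots with other cases, then with leftover limit cases.
    remaining = test_count - len(chosen_limit)
    chosen_other = other[:remaining]
    remaining -= len(chosen_other)
    extra_limit = limit[required:required + remaining]

    limit_count = len(chosen_limit) + len(extra_limit)
    return {
        "tests": chosen_limit + chosen_other + extra_limit,
        "limit_case_count": limit_count,
        "limit_case_minimum_required": required,
        "limit_case_quota_met": limit_count >= required,
    }
```

When the pool holds fewer limit cases than the quota, the sketch still returns a full test set and reports `limit_case_quota_met` as false, which is one way to read "best effort when candidates are insufficient".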

CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -102,7 +102,7 @@ AutoCode/
 5. Build the generator (`generator_build`)
 6. Run stress tests (`stress_test_run`, completed_rounds == total_rounds)
 7. Build a checker if needed (`checker_build`, accuracy >= 0.9)
-8. Generate test data (`problem_generate_tests`, generated_test_count > 0)
+8. Generate test data (`problem_generate_tests`, generated_test_count > 0, and extreme/tle cases make up at least half of the final tests; best effort when candidates are insufficient)
 9. Verify test data (`problem_verify_tests`, passed)
 10. Package for Polygon (`problem_pack_polygon`)
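
Steps 8 and 9 pair generation with a verification pass. A rough sketch of what a `limit_ratio`-style check over a generation manifest might look like is shown here. The `check_limit_ratio` function, the manifest entry shape, and the result keys other than the `limit_case_*` names are assumptions made for illustration; only `limit_ratio`, `enable_limit_ratio`, and the `type=3/4` convention appear in this commit.

```python
import math


def check_limit_ratio(manifest_tests, enable_limit_ratio=True):
    """Return a result dict for a limit_ratio-style check.

    manifest_tests: list of {"file": ..., "type": int} entries, a
    hypothetical manifest shape used only for this illustration.
    """
    if not enable_limit_ratio:
        # Explicit opt-out: the check is skipped and does not fail.
        return {"check": "limit_ratio", "passed": True, "skipped": True}

    total = len(manifest_tests)
    limit_count = sum(1 for t in manifest_tests if t["type"] in (3, 4))
    required = math.ceil(total / 2)
    return {
        "check": "limit_ratio",
        "passed": limit_count >= required,
        "skipped": False,
        "limit_case_count": limit_count,
        "limit_case_minimum_required": required,
    }
```

Unlike the best-effort behavior at generation time, this check is strict: a manifest with too few limit cases fails unless the caller passes `enable_limit_ratio=False`.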

README.md

Lines changed: 6 additions & 1 deletion
@@ -246,7 +246,8 @@ AutoCode provides 15 atomic tools in 7 groups. All tools return a unified
 | Tool | Description | Key parameters |
 |------|------|----------|
 | `problem_create` | Initialize the problem directory | `problem_dir`, `problem_name` |
-| `problem_generate_tests` | Generate final test data | `problem_dir`, `test_count` |
+| `problem_generate_tests` | Generate final test data (extreme/tle cases make up at least half of the final set; best effort when candidates are insufficient) | `problem_dir`, `test_count` |
+| `problem_verify_tests` | Verify test data quality (including a hard check on the extreme/tle ratio) | `problem_dir`, `tests_dir`, `verify_types` |
 | `problem_pack_polygon` | Package in Polygon format | `problem_dir`, `time_limit`, `memory_limit` |
 
 ## Workflow tutorial: the A+B problem
@@ -378,6 +379,8 @@ problem_generate_tests(
 )
 ```
 
+Note: among the tests finally written, `extreme` (type=3) and `tle` (type=4) together make up at least half. If the candidates contain too few limit cases, the tool satisfies the quota as far as the available candidates allow and returns the corresponding statistics fields.
+
 ### Step 7: Package in Polygon format
 
 ```python
@@ -477,6 +480,8 @@ problem_pack_polygon(
 | `extreme` | 3 | Edge cases: overflow, precision, hash collisions |
 | `tle` | 4 | Performance test data that induces TLE |
 
+By default, the `problem_generate_tests` sampling strategy first ensures that `extreme` + `tle` make up at least 50% of the final test set; the remaining slots are then balanced according to the configuration (or filled in deterministic order).
+
 ### File structure
 
 ```

agents/autocode-workflow.md

Lines changed: 2 additions & 0 deletions
@@ -25,4 +25,6 @@ Always work through this sequence unless the task is explicitly outside problem
 
 When the user asks for a later step directly, explain which prerequisite step is missing and complete the missing work first.
 
+When running `problem_generate_tests`, enforce test quality: final test data should contain at least half limit-oriented cases (`type=3` extreme + `type=4` tle) when candidate availability allows.
+
 Treat hook feedback as authoritative. If a hook denies a tool call, fix the workflow gap instead of retrying the same call.

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "autocode-mcp"
-version = "0.7.0"
+version = "0.8.0"
 description = "MCP Server for competitive programming problem creation, based on AutoCode paper"
 readme = "README.md"
 requires-python = ">=3.10"

scripts/workflow_guard.py

Lines changed: 1 addition & 1 deletion
@@ -270,7 +270,7 @@ def session_start() -> int:
 "stress_test_run(completed_rounds == total_rounds) -> "
 "checker_build if needed (accuracy >= 0.9) -> "
 "problem_validate(validation_passed) -> "
-"problem_generate_tests(generated_test_count > 0) -> "
+"problem_generate_tests(generated_test_count > 0, and prefer >=50% type3/type4 in final tests when candidates are sufficient) -> "
 "problem_verify_tests(passed) -> problem_pack_polygon. "
 "If a hook blocks a step, complete the missing prerequisite instead of retrying blindly."
 )

skills/autocode-workflow/SKILL.md

Lines changed: 4 additions & 2 deletions
@@ -61,7 +61,7 @@ Based on the paper "AutoCode: LLMs as Problem Setters for Competitive Programmin
 │ Phase 8: Test Generation │
 │ ┌────────────────────┴────────────────────┐ │
 │ │ problem_generate_tests │ Generate final test data │
-│ │ (dedup + validator filter + balance) │ │
+│ │ (dedup + validator filter + extreme>=50%)│ │
 │ └────────────────────┬────────────────────┘ │
 │ │ │
 │ Phase 9: Packaging │
@@ -235,6 +235,7 @@ Required: problem_dir
 Recommended: test_count=50, enable_dedup=true, enable_validator_filter=true
 Output: tests/01.in ~ tests/50.in + corresponding .ans files
 Verify: Check generated_tests count matches test_count
+Quality Gate: In final tests, type 3/4 (extreme + tle) should be >= ceil(test_count/2) when candidates are sufficient
 ```
 
 ### Phase 9: Packaging
@@ -283,7 +284,7 @@ Generate 3-5 mutant solutions with common bugs:
 | 5 | `stress_test_run` | Step 4 | `"All N rounds passed"` |
 | 6 | `checker_build` (optional) | Step 5 | `accuracy >= 0.9` |
 | 7 | `problem_validate` | Step 5 or 6 | `success=true`, all samples passed |
-| 8 | `problem_generate_tests` | Step 7 | `generated_tests == test_count` |
+| 8 | `problem_generate_tests` | Step 7 | `generated_tests == test_count` and `type3+type4 >= ceil(test_count/2)` (if candidates sufficient) |
 | 9 | `problem_pack_polygon` | Step 8 | `success=true` |
 
 ### FORBIDDEN Actions
@@ -335,6 +336,7 @@ Before considering the problem complete:
 - [ ] Statement samples validated (problem_validate passed)
 - [ ] Sample files validated (problem_validate passed)
 - [ ] Final test data generated (50+ tests)
+- [ ] Final test data has at least 50% extreme/tle cases when candidate pool allows
 - [ ] Polygon package created
 
 ## Example Complete Workflow
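
The Step 8 criterion above combines an exact count with a ceiling-based quota. As a worked illustration, `ceil(50/2) = 25` and `ceil(51/2) = 26`. The sketch below encodes one possible reading of "if candidates sufficient": the helper names and the `limit_candidates` input are hypothetical, and the rule that a too-small pool must be used in full is an assumption, not something the skill file states.

```python
def required_limit_cases(test_count: int) -> int:
    """ceil(test_count / 2) using integer arithmetic."""
    return (test_count + 1) // 2


def step8_gate(generated_tests, test_count, limit_in_final, limit_candidates):
    """Evaluate the Step 8 success criterion under one reading of the table.

    limit_candidates is the number of type 3/4 cases that were available
    before sampling (a hypothetical input, not a documented field).
    """
    if generated_tests != test_count:
        return False
    required = required_limit_cases(test_count)
    if limit_candidates < required:
        # Pool too small: the ratio requirement is waived, but all
        # available limit cases are expected to be used (assumption).
        return limit_in_final == limit_candidates
    return limit_in_final >= required
```

For example, `step8_gate(50, 50, 25, 30)` passes, while `step8_gate(50, 50, 24, 30)` does not, because 30 limit candidates were enough to reach the quota of 25.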

src/autocode_mcp/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 """
 import os
 
-__version__ = "0.7.0"
+__version__ = "0.8.0"
 
 # Get the templates directory path (a directory inside the package)
 _PACKAGE_DIR = os.path.dirname(__file__)

src/autocode_mcp/prompts/__init__.py

Lines changed: 5 additions & 3 deletions
@@ -62,7 +62,8 @@
 ## 3. Post-processing
 - Filter out invalid inputs with the Validator
 - Deduplicate (based on signature)
-- Balance the distribution
+- First ensure that at least half of the final tests are extreme/tle (type=3/4; best effort when candidates are insufficient)
+- Then balance the distribution
 - Sample
 
 ## Quality metrics
@@ -141,8 +142,9 @@
 ### Post-processing
 1. Validator filtering
 2. Deduplication (MD5 signature)
-3. Balance the distribution
-4. Sample
+3. First ensure that extreme/tle (type=3/4) make up at least half of the final tests (best effort when candidates are insufficient)
+4. Balance the distribution over the remaining slots
+5. Sample
 """
 
 # Checker construction prompt
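
The first two post-processing steps in the prompt (Validator filtering, then MD5-signature deduplication) can be sketched as below. This is an illustration only: `signature`, `filter_and_dedup`, and the `is_valid` callable are hypothetical names, and the `enable_dedup` flag is wired in the way the commit message describes, so that duplicates survive when it is false.

```python
import hashlib


def signature(test_input: str) -> str:
    """MD5 hex digest of the input text, used as a dedup key."""
    return hashlib.md5(test_input.encode("utf-8")).hexdigest()


def filter_and_dedup(inputs, is_valid, enable_dedup=True):
    """Apply steps 1-2: validator filter, then optional dedup.

    is_valid: callable standing in for the Validator (hypothetical).
    Input order is preserved; the first copy of a duplicate is kept.
    """
    seen = set()
    kept = []
    for text in inputs:
        if not is_valid(text):
            continue
        if enable_dedup:
            sig = signature(text)
            if sig in seen:
                continue
            seen.add(sig)
        kept.append(text)
    return kept
```

Keeping deduplication behind the flag at this stage, rather than repeating it unconditionally during sampling, is what lets `enable_dedup=false` keep its meaning through to the final test set.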
