Measuring how well coding agents plan against a real-world product specification.
This repository contains a standardized benchmark for evaluating the planning capabilities of LLM-based coding agents. Instead of measuring code generation speed or test-pass rates, this benchmark answers a different question: When given a complex, multi-document product spec, how well can an agent produce a comprehensive implementation plan?
| Rank | Agent / Tool | Model | Overall | Critical | Important | Detail | Req. Total |
|---|---|---|---|---|---|---|---|
| 🥇 | Kimi CLI | Qwen3.6-Plus | 94.4% | 100.0% | 91.8% | 100.0% | 99 |
| 🥈 | Kimi CLI | Qwen3.6-35B | 93.9% | 96.7% | 92.5% | 100.0% | 99 |
| 🥉 | Kimi CLI | minimax-m2.7 | 93.0% | 91.7% | 94.0% | 75.0% | 99 |
| 4 | Kimi CLI | Kimi K2.6 preview | 92.4% | 100.0% | 88.8% | 100.0% | 99 |
| 5 | Claude Code | Kimi K2.6 preview | 91.4% | 96.7% | 89.6% | 75.0% | 99 |
| 6 | Claude Code | Opus 4.6 (high) | 89.7% | 93.3% | 87.1% | 87.5% | 73 |
| 7 | Codex CLI | GPT 5.3 Codex (xhigh) | 88.5% | 91.9% | 88.1% | 75.0% | 87 |
| 8 | Kimi CLI | GLM-5.1 | 88.4% | 98.3% | 85.1% | 100.0% | 99 |
| 9 | Kimi CLI | Qwen3.5-35B-uncensored | 85.4% | 96.7% | 80.6% | 75.0% | 99 |
| 10 | Claude Code | Opus 4.5 | 82.9% | 85.0% | 82.9% | 77.8% | 70 |
| 11 | Claude Code | Sonnet 4.6 (high) | 75.6% | 82.1% | 71.9% | 75.0% | 78 |
| 12 | Claude Code | Sonnet 4.5 | 66.5% | 74.4% | 60.6% | 62.5% | 94 |
| 13 | Gemini CLI (no plan mode) | Gemini 3.2 Pro | 49.4% | 69.1% | 40.5% | 7.1% | 83 |
| 14 | Antigravity | Gemini 3.2 Pro | 47.5% | 63.3% | 42.5% | 20.0% | 80 |
| 15 | Cursor | Gemini 3.2 Pro | 42.1% | 54.2% | 39.1% | 16.7% | 38 |
| 16 | Gemini CLI | Gemini 3.2 Pro | 36.5% | 42.0% | 35.7% | 0.0% | 63 |
Note: Requirement totals vary across runs (from 38 to 99 in the table above) because some evaluators extracted or consolidated requirements differently. The frozen canonical catalog is the authoritative denominator, but the scope of an agent's plan can influence how an evaluator counts requirements. For a fair comparison, focus on the Overall score, which is always normalized against the run's own denominator.
Most coding benchmarks focus on implementation: can the agent write code that compiles and passes tests? This benchmark focuses on planning: can the agent read, understand, and synthesize a complex Product Requirements Document (PRD) into a coherent, complete implementation plan?
The benchmark uses a real, non-trivial product spec (a media-discovery application with AI chat, voice interaction, collections, search, and export features) spread across multiple interdependent documents. The spec is rich enough that a surface-level read will miss critical constraints and relationships.
Planning quality is scored against a frozen canonical requirement catalog (`evaluator/requirements_catalog_v1.md`) that contains approximately 80 to 100 requirements across 10 functional areas, each tagged by severity:
- Critical: Must be addressed for the product to function
- Important: Required for a complete, polished product
- Detail: Fine-grained behaviors, edge cases, and polish
Coverage is scored with a weighted formula:
score = (full_count × 1.0 + partial_count × 0.5) / total_count × 100
An honest evaluator audits the plan requirement-by-requirement. No partial credit for hand-waving.
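As a minimal sketch of the formula above, assuming each requirement receives exactly one of the three statuses used in the evaluation step (`full`, `partial`, `missing`), the score could be computed as follows. The function name and the example counts are hypothetical and are not taken from the evaluator itself.

```python
def coverage_score(full_count: int, partial_count: int, total_count: int) -> float:
    """Weighted coverage as a percentage: full = 1.0, partial = 0.5, missing = 0."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if full_count < 0 or partial_count < 0 or full_count + partial_count > total_count:
        raise ValueError("counts must be non-negative and sum to at most total_count")
    return (full_count * 1.0 + partial_count * 0.5) / total_count * 100


# Hypothetical counts: 93 full, 1 partial, 5 missing out of 99 requirements.
print(f"{coverage_score(93, 1, 99):.1f}%")  # prints "94.4%"
```

The per-status counts behind any leaderboard row are not stated in this README, so the numbers above only illustrate how a score of that size can arise.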
.
├── docs/prd/                             # The product specification
│   ├── product_prd.md                    # Core product requirements
│   ├── infra_rider_prd.md                # Infrastructure & build constraints
│   └── supporting_docs/                  # Technical schemas, AI prompting, UX details
├── evaluator/
│   └── requirements_catalog_v1.md        # Frozen scoring denominator
├── tools/
│   └── fetch_evaluator.py                # Downloads/updates the evaluator bundle
├── results/                              # Benchmark outputs (see below)
├── 1-START_HERE.md                       # Step 1 prompt: generate a plan
├── 2-EVALUATE_PLAN.md                    # Step 2 prompt: evaluate the plan
├── 3-PLAN_EVAL_REPORT.md                 # Optional fallback: re-render HTML report
├── runClaude.sh                          # Automated runner for Claude Code
├── runKimi.sh                            # Automated runner for Kimi CLI
├── INSTRUCTIONS.md                       # Development guidelines & architecture patterns
└── AGENTS.md / CLAUDE.md / GEMINI.md     # Agent-specific auto-loaded instructions
All completed benchmark runs are stored in the results/ folder, organized by agent/tool and model. Each run produces:
- `PLAN.md`: The implementation plan generated by the agent
- `PLAN_EVAL.md`: The human-readable coverage evaluation
- `PLAN_EVAL_REPORT.html`: A stakeholder-ready visual report
- Qwen3.6-Plus tops the leaderboard. Kimi CLI with Qwen3.6-Plus achieved the highest score at 94.4%, with perfect Critical coverage (100.0%) and perfect Detail coverage (100.0%). This suggests that smaller, efficient open-weight models can excel at planning tasks.
- Kimi CLI dominates the top tier. Six of the top nine positions are held by Kimi CLI across different models (Qwen3.6-Plus, Qwen3.6-35B, minimax-m2.7, Kimi K2.6 preview, GLM-5.1, Qwen3.5-35B-uncensored), showing consistently strong planning performance.
- Claude Sonnet models show a significant gap. Sonnet 4.6 (high) at 75.6% and Sonnet 4.5 at 66.5% fall well behind Opus 4.6 (89.7%) and the Kimi CLI runs, with Important coverage their weakest category (71.9% and 60.6%).
- Gemini 3.2 Pro consistently underperforms, with even the best Gemini-based run (Gemini CLI without plan mode) only reaching 49.4%. Detail coverage is particularly weak across all Gemini runs, ranging from 0.0% to 20.0%.
- Tool choice matters. The same model (Gemini 3.2 Pro) scores very differently depending on whether it runs through Gemini CLI without plan mode (49.4%), Antigravity (47.5%), Cursor (42.1%), or Gemini CLI (36.5%). This suggests that prompting strategy and context management can matter as much as raw model capability.
Two bash scripts are provided for fully automated benchmark execution against specific CLI tools:
Prerequisites: claude CLI installed and in your PATH.
./runClaude.sh

This script will:
- Check prerequisites (`claude`, `python3`).
- Step 1: Launch Claude Code with the prompt `Read 1-START_HERE.md and follow its instructions.` to generate `results/PLAN.md`.
- Verify `results/PLAN.md` was produced.
- Fetch the evaluator bundle if missing (`python3 tools/fetch_evaluator.py`).
- Step 2: Launch a fresh Claude Code session with the prompt `Read 2-EVALUATE_PLAN.md and follow its instructions.` to generate `results/PLAN_EVAL.md` and `results/PLAN_EVAL_REPORT.html`.
- Print execution timing for both steps.
Prerequisites: kimi CLI installed and in your PATH.
./runKimi.sh

This script follows the same two-step workflow as `runClaude.sh`, but uses the Kimi CLI with the `--print --yolo --work-dir` flags for non-interactive execution.
Both scripts:
- Run each step in isolation (fresh context), matching the manual workflow.
- Time each step and report total elapsed time.
- Exit with an error if expected output files are missing.
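The shared behavior listed above (isolated steps, timing, and a hard failure on missing outputs) can be sketched in Python for tools that the bash scripts do not cover. This is an illustrative sketch, not the contents of `runClaude.sh` or `runKimi.sh`; the `run_step` helper is hypothetical, and the command used to invoke your agent is left as a parameter because it differs per CLI.

```python
import subprocess
import time
from pathlib import Path


def run_step(command: list[str], expected_outputs: list[Path]) -> float:
    """Run one benchmark step as a separate process and return elapsed seconds.

    Raises FileNotFoundError if any expected output file is missing afterwards.
    """
    start = time.monotonic()
    # A new process per step keeps each step in a fresh context.
    subprocess.run(command, check=True)  # raises CalledProcessError on non-zero exit
    elapsed = time.monotonic() - start

    missing = [str(path) for path in expected_outputs if not path.is_file()]
    if missing:
        raise FileNotFoundError(f"expected output(s) not produced: {', '.join(missing)}")
    return elapsed
```

A caller would invoke `run_step` once per step, for example passing the agent command for the Step 1 prompt together with `[Path("results/PLAN.md")]`, and then sum the returned durations to report total elapsed time.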
If you prefer to run the benchmark manually, or if you are using an agent/tool not covered by the bash scripts, follow these steps exactly. Each step must be run in a fresh conversation/context to maximize available context window and isolate steps for re-runnability.
Open a fresh conversation with your coding agent and say:
Read `1-START_HERE.md` and follow its instructions.
The agent will:
- Read the full PRD in `docs/prd/` (starting with `product_prd.md`, then `infra_rider_prd.md`, then all supporting documents recursively).
- Synthesize a comprehensive implementation plan.
- Write it to `results/PLAN.md`.
Important: This is planning only. The agent must not start implementing code.
Output: results/PLAN.md
Open a new conversation (fresh context) and say:
Read `2-EVALUATE_PLAN.md` and follow its instructions.
The agent will:
- Read the frozen requirement catalog at `evaluator/requirements_catalog_v1.md`.
- Read the PRD files for semantic context.
- Read the plan from `results/PLAN.md`.
- Audit every requirement for coverage (`full`, `partial`, or `missing`).
- Write the evaluation to `results/PLAN_EVAL.md`.
- Generate a stakeholder-ready HTML report at `results/PLAN_EVAL_REPORT.html`.
If evaluator/requirements_catalog_v1.md is missing, run python3 tools/fetch_evaluator.py first.
Requires: results/PLAN.md from Step 1
Outputs: results/PLAN_EVAL.md, results/PLAN_EVAL_REPORT.html
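To illustrate how an audit of this kind could roll up into the Critical, Important, and Detail columns of the leaderboard, here is a hedged sketch. It assumes that each per-severity figure applies the same weighted formula to the requirements of that severity only; the README does not state this explicitly. The data structure, function name, and sample audit are all hypothetical.

```python
from collections import defaultdict

# Weights follow the scoring formula: full = 1.0, partial = 0.5, missing = 0.
WEIGHTS = {"full": 1.0, "partial": 0.5, "missing": 0.0}


def coverage_by_severity(audit: list[tuple[str, str]]) -> dict[str, float]:
    """Return weighted coverage (percent) per severity, plus an "overall" entry.

    `audit` holds one (severity, status) pair per requirement.
    """
    earned: dict[str, float] = defaultdict(float)
    totals: dict[str, int] = defaultdict(int)
    for severity, status in audit:
        weight = WEIGHTS[status]  # a KeyError flags an unknown status
        for key in (severity, "overall"):
            earned[key] += weight
            totals[key] += 1
    return {key: earned[key] / totals[key] * 100 for key in totals}


# Hypothetical four-requirement audit, for illustration only.
audit = [
    ("critical", "full"),
    ("critical", "partial"),
    ("important", "full"),
    ("detail", "missing"),
]
print(coverage_by_severity(audit))
# {'critical': 75.0, 'overall': 62.5, 'important': 100.0, 'detail': 0.0}
```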
If results/PLAN_EVAL.md already exists and you only need to regenerate the HTML report (e.g., after a styling tweak), open a fresh conversation and say:
Read `3-PLAN_EVAL_REPORT.md` and follow its instructions.
Requires: results/PLAN_EVAL.md
Output: results/PLAN_EVAL_REPORT.html
Each step consumes a significant portion of the agent's context window. Starting fresh ensures:
- Maximum tokens available for the task at hand.
- Isolation between steps: you can re-run evaluation without regenerating the plan.
- Cleaner reasoning: the evaluator should not be primed by the plan generation process.
When the benchmark PRD asks an agent to plan or reason about code architecture, the following patterns are expected (defined in INSTRUCTIONS.md):
- Fractal Architecture: Pages → Features → Sub-Features, each self-contained.
- Humble Components: TSX files contain markup only; logic lives in custom hooks.
- No Magic Numbers: All constants and styling tokens extracted to config/theme.
- Co-location: Feature-specific code lives inside the feature's directory.
These standards are part of what the agent must account for when planning.
To add a new benchmark run:
- Create a new folder under `results/` with a descriptive name: `{Tool}_@_{Model}[_{variant}]/`.
- Run the benchmark using either the automated scripts or the manual workflow.
- Ensure all four artifacts are present:
  - `PLAN.md`
  - `PLAN_EVAL.md`
  - `PLAN_EVAL_REPORT.html`
  - `run_metadata.json` (if using the control workflow)
- Update this README with the new scores in the leaderboard table.
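Before opening a contribution, the folder name and artifact checklist above can be verified with a small helper. The following is a sketch under stated assumptions: both function names are hypothetical, the tool and model strings are used verbatim (the README does not say how spaces or special characters should be handled), and `run_metadata.json` is only checked when you indicate that the control workflow was used.

```python
from pathlib import Path
from typing import Optional

REQUIRED_ARTIFACTS = ("PLAN.md", "PLAN_EVAL.md", "PLAN_EVAL_REPORT.html")
CONTROL_ARTIFACTS = ("run_metadata.json",)  # only expected with the control workflow


def run_folder_name(tool: str, model: str, variant: Optional[str] = None) -> str:
    """Build a results folder name following {Tool}_@_{Model}[_{variant}]."""
    name = f"{tool}_@_{model}"
    return f"{name}_{variant}" if variant else name


def missing_artifacts(run_dir: Path, control_workflow: bool = False) -> list[str]:
    """List the expected artifact files that are absent from run_dir."""
    expected = REQUIRED_ARTIFACTS + (CONTROL_ARTIFACTS if control_workflow else ())
    return [name for name in expected if not (run_dir / name).is_file()]
```

For example, `missing_artifacts(Path("results") / run_folder_name("MyTool", "MyModel"))` would return an empty list once the three standard artifacts are in place; `MyTool` and `MyModel` are placeholder names.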
This benchmark is provided for research and comparison purposes. The PRD documents and requirement catalog represent a realistic product specification used strictly for evaluating agent planning capabilities.