Personal LLM Benchmark

Measuring how well coding agents plan against a real-world product specification.

This repository contains a standardized benchmark for evaluating the planning capabilities of LLM-based coding agents. Instead of measuring code generation speed or test-pass rates, this benchmark answers a different question: When given a complex, multi-document product spec, how well can an agent produce a comprehensive implementation plan?

Leaderboard

Leaderboard (Ordered by Overall Score)

Rank	Agent / Tool	Model	Overall	Critical	Important	Detail	Req. Total
🥇	Kimi CLI	Qwen3.6-Plus	94.4%	100.0%	91.8%	100.0%	99
🥈	Kimi CLI	Qwen3.6-35B	93.9%	96.7%	92.5%	100.0%	99
🥉	Kimi CLI	minimax-m2.7	93.0%	91.7%	94.0%	75.0%	99
4	Kimi CLI	Kimi K2.6 preview	92.4%	100.0%	88.8%	100.0%	99
5	Claude Code	Kimi K2.6 preview	91.4%	96.7%	89.6%	75.0%	99
6	Claude Code	Opus 4.6 (high)	89.7%	93.3%	87.1%	87.5%	73
7	Codex CLI	GPT 5.3 Codex (xhigh)	88.5%	91.9%	88.1%	75.0%	87
8	Kimi CLI	GLM-5.1	88.4%	98.3%	85.1%	100.0%	99
9	Kimi CLI	Qwen3.5-35B-uncensored	85.4%	96.7%	80.6%	75.0%	99
10	Claude Code	Opus 4.5	82.9%	85.0%	82.9%	77.8%	70
11	Claude Code	Sonnet 4.6 (high)	75.6%	82.1%	71.9%	75.0%	78
12	Claude Code	Sonnet 4.5	66.5%	74.4%	60.6%	62.5%	94
13	Gemini CLI (no plan mode)	Gemini 3.2 Pro	49.4%	69.1%	40.5%	7.1%	83
14	Antigravity	Gemini 3.2 Pro	47.5%	63.3%	42.5%	20.0%	80
15	Cursor	Gemini 3.2 Pro	42.1%	54.2%	39.1%	16.7%	38
16	Gemini CLI	Gemini 3.2 Pro	36.5%	42.0%	35.7%	0.0%	63

Note: Requirement totals vary across runs (from 38 to 99 in the table above) because some evaluators extracted or consolidated requirements differently. The frozen canonical catalog is the authoritative denominator, but agents may produce plans of varying scope that influence how evaluators count. When comparing runs, focus on the Overall score, which is always normalized against the run's own denominator.

The Concept

Most coding benchmarks focus on implementation — can the agent write code that compiles and passes tests? This benchmark focuses on planning — can the agent read, understand, and synthesize a complex Product Requirements Document (PRD) into a coherent, complete implementation plan?

The benchmark uses a real, non-trivial product spec (a media-discovery application with AI chat, voice interaction, collections, search, and export features) spread across multiple interdependent documents. The spec is rich enough that a surface-level read will miss critical constraints and relationships.

Scoring Philosophy

Planning quality is scored against a frozen canonical requirement catalog (evaluator/requirements_catalog_v1.md) that contains approximately 80–100 requirements across 10 functional areas, each tagged by severity:

Critical — Must be addressed for the product to function
Important — Required for a complete, polished product
Detail — Fine-grained behaviors, edge cases, and polish

Coverage is scored with a weighted formula:

score = (full_count × 1.0 + partial_count × 0.5) / total_count × 100

The evaluator audits the plan requirement by requirement. Partial credit is reserved for requirements the plan addresses concretely but incompletely; hand-waving earns none.
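As a rough illustration of the formula above, the following Python sketch computes the coverage score from a list of per-requirement verdicts. The verdict labels (full, partial, missing) are the ones used in the evaluation step; the helper name `coverage_score` and the sample data are assumptions made for this example and are not part of the evaluator.

```python
from collections import Counter

# Credit per verdict, taken from the formula above; "missing" earns nothing.
WEIGHTS = {"full": 1.0, "partial": 0.5, "missing": 0.0}


def coverage_score(verdicts):
    """Return the weighted coverage percentage for a list of verdicts.

    Hypothetical helper: `verdicts` holds one of "full", "partial", or
    "missing" for each requirement in the denominator.
    """
    if not verdicts:
        raise ValueError("at least one requirement is needed")
    counts = Counter(verdicts)
    unknown = set(counts) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown verdicts: {sorted(unknown)}")
    earned = sum(WEIGHTS[v] * n for v, n in counts.items())
    return earned * 100 / len(verdicts)


# Made-up sample: 8 full and 2 partial verdicts out of 10 requirements.
sample = ["full"] * 8 + ["partial"] * 2
print(f"{coverage_score(sample):.1f}%")  # (8 + 1) / 10 * 100 = 90.0%
```

The same function can be applied to the subset of verdicts in one severity tier to obtain the Critical, Important, and Detail columns, assuming each tier is scored with the same weights.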

Repository Structure

.
├── docs/prd/                          # The product specification
│   ├── product_prd.md                 # Core product requirements
│   ├── infra_rider_prd.md             # Infrastructure & build constraints
│   └── supporting_docs/               # Technical schemas, AI prompting, UX details
├── evaluator/
│   └── requirements_catalog_v1.md     # Frozen scoring denominator
├── tools/
│   └── fetch_evaluator.py             # Downloads/updates the evaluator bundle
├── results/                           # Benchmark outputs (see below)
├── 1-START_HERE.md                    # Step 1 prompt: generate a plan
├── 2-EVALUATE_PLAN.md                 # Step 2 prompt: evaluate the plan
├── 3-PLAN_EVAL_REPORT.md              # Optional fallback: re-render HTML report
├── runClaude.sh                       # Automated runner for Claude Code
├── runKimi.sh                         # Automated runner for Kimi CLI
├── INSTRUCTIONS.md                    # Development guidelines & architecture patterns
└── AGENTS.md / CLAUDE.md / GEMINI.md  # Agent-specific auto-loaded instructions

Benchmark Results

All completed benchmark runs are stored in the results/ folder, organized by agent/tool and model. Each run produces:

PLAN.md — The implementation plan generated by the agent
PLAN_EVAL.md — The human-readable coverage evaluation
PLAN_EVAL_REPORT.html — A stakeholder-ready visual report

Key Observations

Qwen3.6-Plus tops the leaderboard. Kimi CLI with Qwen3.6-Plus achieved the highest overall score at 94.4%, with 100.0% critical coverage and 100.0% detail coverage. This suggests that smaller, efficient open-weight models can excel at planning tasks.
Kimi CLI dominates the top tier. Six of the top nine positions are held by Kimi CLI, each with a different model (Qwen3.6-Plus, Qwen3.6-35B, minimax-m2.7, Kimi K2.6 preview, GLM-5.1, Qwen3.5-35B-uncensored), showing consistently strong planning performance.
Claude Sonnet models show a significant gap. Sonnet 4.6 (high) at 75.6% and Sonnet 4.5 at 66.5% fall well behind Opus 4.6 (89.7%) and the Kimi CLI runs, with the largest shortfall in Important coverage (71.9% and 60.6%).
Gemini 3.2 Pro consistently underperforms: even the best Gemini-based run (Gemini CLI without plan mode) reaches only 49.4%. Detail coverage is particularly weak across all Gemini runs, ranging from 0.0% to 20.0%.
Tool choice matters. The same model (Gemini 3.2 Pro) scores very differently depending on whether it runs through Gemini CLI without plan mode (49.4%), Antigravity (47.5%), Cursor (42.1%), or Gemini CLI (36.5%), suggesting that prompting strategy and context management may matter as much as raw model capability.

Running the Benchmark

Automated Runners

Two bash scripts are provided for fully automated benchmark execution against specific CLI tools:

`runClaude.sh` — Claude Code Runner

Prerequisites: claude CLI and python3 installed and in your PATH.

./runClaude.sh

This script will:

Check prerequisites (claude, python3).
Step 1: Launch Claude Code with the prompt "Read 1-START_HERE.md and follow its instructions." to generate results/PLAN.md.
Verify results/PLAN.md was produced.
Fetch the evaluator bundle if missing (python3 tools/fetch_evaluator.py).
Step 2: Launch a fresh Claude Code session with the prompt "Read 2-EVALUATE_PLAN.md and follow its instructions." to generate results/PLAN_EVAL.md and results/PLAN_EVAL_REPORT.html.
Print execution timing for both steps.

`runKimi.sh` — Kimi CLI Runner

Prerequisites: kimi CLI installed and in your PATH.

./runKimi.sh

This script follows the same two-step workflow as runClaude.sh, but uses the Kimi CLI with --print --yolo --work-dir flags for non-interactive execution.

Both scripts:

Run each step in isolation (fresh context), matching the manual workflow.
Time each step and report total elapsed time.
Exit with an error if expected output files are missing.
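The shared behavior of the two runners can be sketched in Python as below. This is not a transcription of runClaude.sh or runKimi.sh: the function `run_benchmark` and the `run_agent` callable are hypothetical, and how `run_agent` actually invokes a CLI is deliberately left to the caller. The sketch also leaves out the evaluator fetch that the scripts perform between the two steps.

```python
import time
from pathlib import Path

# Prompts quoted from the workflow described in this README.
PLAN_PROMPT = "Read 1-START_HERE.md and follow its instructions."
EVAL_PROMPT = "Read 2-EVALUATE_PLAN.md and follow its instructions."


def run_benchmark(repo_root, run_agent):
    """Run both steps in isolation and return per-step timings in seconds.

    `run_agent(prompt, repo_root)` is a hypothetical callable that starts a
    fresh, non-interactive agent session and blocks until it finishes.
    """
    root = Path(repo_root)
    steps = [
        ("plan", PLAN_PROMPT, ["results/PLAN.md"]),
        ("evaluate", EVAL_PROMPT,
         ["results/PLAN_EVAL.md", "results/PLAN_EVAL_REPORT.html"]),
    ]
    timings = {}
    for name, prompt, outputs in steps:
        started = time.monotonic()
        run_agent(prompt, root)  # one fresh session per step
        timings[name] = time.monotonic() - started
        # Stop with an error if the step did not write its expected files.
        missing = [o for o in outputs if not (root / o).is_file()]
        if missing:
            raise FileNotFoundError(f"step '{name}' did not produce: {missing}")
    timings["total"] = timings["plan"] + timings["evaluate"]
    return timings
```

Passing the agent invocation in as a callable keeps the timing and output checks identical across tools, which is the property the two scripts share.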

Manual Workflow (Step by Step)

If you prefer to run the benchmark manually, or if you are using an agent/tool not covered by the bash scripts, follow these steps exactly. Each step must be run in a fresh conversation/context to maximize available context window and isolate steps for re-runnability.

Step 1: Generate the Plan

Open a fresh conversation with your coding agent and say:

Read 1-START_HERE.md and follow its instructions.

The agent will:

Read the full PRD in docs/prd/ (starting with product_prd.md, then infra_rider_prd.md, then all supporting documents recursively).
Synthesize a comprehensive implementation plan.
Write it to results/PLAN.md.

Important: This is planning only. The agent must not start implementing code.

Output: results/PLAN.md

Step 2: Evaluate the Plan

Open a new conversation (fresh context) and say:

Read 2-EVALUATE_PLAN.md and follow its instructions.

The agent will:

Read the frozen requirement catalog at evaluator/requirements_catalog_v1.md.
Read the PRD files for semantic context.
Read the plan from results/PLAN.md.
Audit every requirement for coverage (full, partial, or missing).
Write the evaluation to results/PLAN_EVAL.md.
Generate a stakeholder-ready HTML report at results/PLAN_EVAL_REPORT.html.

If evaluator/requirements_catalog_v1.md is missing, run python3 tools/fetch_evaluator.py first.

Requires: results/PLAN.md from Step 1
Outputs: results/PLAN_EVAL.md, results/PLAN_EVAL_REPORT.html
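The catalog check before Step 2 can be automated with a small guard such as the one below. It is a sketch under two assumptions: that tools/fetch_evaluator.py takes no arguments, and that running it from the repository root writes evaluator/requirements_catalog_v1.md. The function name `ensure_catalog` and its `fetch` hook are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

CATALOG = Path("evaluator/requirements_catalog_v1.md")
FETCH_SCRIPT = Path("tools/fetch_evaluator.py")


def ensure_catalog(repo_root=".", fetch=None):
    """Fetch the evaluator bundle only when the frozen catalog is absent.

    Returns True if a fetch was triggered and False if the catalog was
    already present. `fetch` is a hypothetical hook for testing; by default
    the repository's fetch script is run with the current interpreter.
    """
    root = Path(repo_root)
    if (root / CATALOG).is_file():
        return False
    if fetch is None:
        # Assumption: the script needs no arguments when run from the root.
        subprocess.run([sys.executable, str(FETCH_SCRIPT)], cwd=root, check=True)
    else:
        fetch(root)
    if not (root / CATALOG).is_file():
        raise FileNotFoundError(f"{CATALOG} is still missing after fetching")
    return True
```

Calling `ensure_catalog()` from the repository root before starting the evaluation session makes the step safe to re-run, since nothing is fetched when the catalog is already in place.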

Optional Step 3: Re-render the Report

If results/PLAN_EVAL.md already exists and you only need to regenerate the HTML report (e.g., after a styling tweak), open a fresh conversation and say:

Read 3-PLAN_EVAL_REPORT.md and follow its instructions.

Requires: results/PLAN_EVAL.md
Output: results/PLAN_EVAL_REPORT.html

Why Fresh Conversations?

Each step consumes a significant portion of the agent's context window. Starting fresh ensures:

Maximum tokens available for the task at hand.
Isolation between steps — you can re-run evaluation without regenerating the plan.
Cleaner reasoning — the evaluator should not be primed by the plan generation process.

Development Guidelines

When the benchmark PRD asks an agent to plan or reason about code architecture, the following patterns are expected (defined in INSTRUCTIONS.md):

Fractal Architecture: Pages → Features → Sub-Features, each self-contained.
Humble Components: TSX files contain markup only; logic lives in custom hooks.
No Magic Numbers: All constants and styling tokens extracted to config/theme.
Co-location: Feature-specific code lives inside the feature's directory.

These standards are part of what the agent must account for when planning.

Contributing a New Run

To add a new benchmark run:

Create a new folder under results/ with a descriptive name: {Tool}_@_{Model}[_{variant}]/.
Run the benchmark using either the automated scripts or the manual workflow.
Ensure the following artifacts are present:
- PLAN.md
- PLAN_EVAL.md
- PLAN_EVAL_REPORT.html
- run_metadata.json (if using the control workflow)
Update this README with the new scores in the leaderboard table.
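Before updating the leaderboard, a contributed run can be sanity-checked with a sketch like the following. The helper names `run_folder_name` and `missing_artifacts` are hypothetical, and the README does not say how spaces or special characters in tool and model names should be written, so the example passes names through unchanged.

```python
from pathlib import Path

REQUIRED = ("PLAN.md", "PLAN_EVAL.md", "PLAN_EVAL_REPORT.html")
OPTIONAL = ("run_metadata.json",)  # expected only for the control workflow


def run_folder_name(tool, model, variant=None):
    """Build a folder name in the {Tool}_@_{Model}[_{variant}] pattern."""
    name = f"{tool}_@_{model}"
    return f"{name}_{variant}" if variant else name


def missing_artifacts(run_dir, control_workflow=False):
    """List the expected artifact files that are absent from `run_dir`."""
    expected = REQUIRED + (OPTIONAL if control_workflow else ())
    return [f for f in expected if not (Path(run_dir) / f).is_file()]


# Illustrative names only; use whatever spelling matches existing folders.
print(run_folder_name("KimiCLI", "Qwen3.6-Plus"))          # KimiCLI_@_Qwen3.6-Plus
print(run_folder_name("ClaudeCode", "Opus-4.6", "high"))   # ClaudeCode_@_Opus-4.6_high
```

An empty list from `missing_artifacts` means the folder holds every file expected for that workflow.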

License

This benchmark is provided for research and comparison purposes. The PRD documents and requirement catalog represent a realistic product specification used strictly for evaluating agent planning capabilities.
