mark-sch/Personal-LLM-benchmark


Personal LLM Benchmark

Measuring how well coding agents plan against a real-world product specification.

This repository contains a standardized benchmark for evaluating the planning capabilities of LLM-based coding agents. Instead of measuring code generation speed or test-pass rates, this benchmark answers a different question: When given a complex, multi-document product spec, how well can an agent produce a comprehensive implementation plan?


Leaderboard

Leaderboard (Ordered by Overall Score)

| Rank | Agent / Tool | Model | Overall | Critical | Important | Detail | Req. Total |
|------|--------------|-------|---------|----------|-----------|--------|------------|
| 🥇 | Kimi CLI | Qwen3.6-Plus | 94.4% | 100.0% | 91.8% | 100.0% | 99 |
| 🥈 | Kimi CLI | Qwen3.6-35B | 93.9% | 96.7% | 92.5% | 100.0% | 99 |
| 🥉 | Kimi CLI | minimax-m2.7 | 93.0% | 91.7% | 94.0% | 75.0% | 99 |
| 4 | Kimi CLI | Kimi K2.6 preview | 92.4% | 100.0% | 88.8% | 100.0% | 99 |
| 5 | Claude Code | Kimi K2.6 preview | 91.4% | 96.7% | 89.6% | 75.0% | 99 |
| 6 | Claude Code | Opus 4.6 (high) | 89.7% | 93.3% | 87.1% | 87.5% | 73 |
| 7 | Codex CLI | GPT 5.3 Codex (xhigh) | 88.5% | 91.9% | 88.1% | 75.0% | 87 |
| 8 | Kimi CLI | GLM-5.1 | 88.4% | 98.3% | 85.1% | 100.0% | 99 |
| 9 | Kimi CLI | Qwen3.5-35B-uncensored | 85.4% | 96.7% | 80.6% | 75.0% | 99 |
| 10 | Claude Code | Opus 4.5 | 82.9% | 85.0% | 82.9% | 77.8% | 70 |
| 11 | Claude Code | Sonnet 4.6 (high) | 75.6% | 82.1% | 71.9% | 75.0% | 78 |
| 12 | Claude Code | Sonnet 4.5 | 66.5% | 74.4% | 60.6% | 62.5% | 94 |
| 13 | Gemini CLI (no plan mode) | Gemini 3.2 Pro | 49.4% | 69.1% | 40.5% | 7.1% | 83 |
| 14 | Antigravity | Gemini 3.2 Pro | 47.5% | 63.3% | 42.5% | 20.0% | 80 |
| 15 | Cursor | Gemini 3.2 Pro | 42.1% | 54.2% | 39.1% | 16.7% | 38 |
| 16 | Gemini CLI | Gemini 3.2 Pro | 36.5% | 42.0% | 35.7% | 0.0% | 63 |

Note: Requirement totals vary across runs (from 38 to 99 in the table above) because some evaluators extracted or consolidated requirements differently. The frozen canonical catalog is the authoritative denominator, but agents may produce plans of varying scope, which influences how evaluators count. For a fair comparison, focus on the Overall score, which is always normalized against the run's own denominator.


The Concept

Most coding benchmarks focus on implementation: can the agent write code that compiles and passes tests? This benchmark focuses on planning: can the agent read, understand, and synthesize a complex Product Requirements Document (PRD) into a coherent, complete implementation plan?

The benchmark uses a real, non-trivial product spec (a media-discovery application with AI chat, voice interaction, collections, search, and export features) spread across multiple interdependent documents. The spec is rich enough that a surface-level read will miss critical constraints and relationships.

Scoring Philosophy

Planning quality is scored against a frozen canonical requirement catalog (evaluator/requirements_catalog_v1.md) that contains approximately 80–100 requirements across 10 functional areas, each tagged by severity:

  • Critical: Must be addressed for the product to function
  • Important: Required for a complete, polished product
  • Detail: Fine-grained behaviors, edge cases, and polish

Coverage is scored with a weighted formula:

score = (full_count × 1.0 + partial_count × 0.5) / total_count × 100

An honest evaluator audits the plan requirement-by-requirement. No partial credit for hand-waving.
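
As an illustration of the formula above, here is a minimal Python sketch. The function name `coverage_score` and the input validation are assumptions made for this example and are not part of the evaluator; the counts in the usage line are hypothetical and do not come from any run in this repository.

```python
def coverage_score(full_count: int, partial_count: int, total_count: int) -> float:
    """Weighted coverage percentage: full = 1.0, partial = 0.5, missing = 0."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if full_count < 0 or partial_count < 0:
        raise ValueError("counts must be non-negative")
    if full_count + partial_count > total_count:
        raise ValueError("full + partial cannot exceed total_count")
    return (full_count * 1.0 + partial_count * 0.5) / total_count * 100


# Hypothetical counts: 80 fully covered, 10 partially covered, 10 missing.
print(round(coverage_score(80, 10, 100), 1))  # 85.0
```

Because a partial counts for half, two partially covered requirements contribute as much as one fully covered requirement.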


Repository Structure

.
├── docs/prd/                          # The product specification
│   ├── product_prd.md                 # Core product requirements
│   ├── infra_rider_prd.md             # Infrastructure & build constraints
│   └── supporting_docs/               # Technical schemas, AI prompting, UX details
├── evaluator/
│   └── requirements_catalog_v1.md     # Frozen scoring denominator
├── tools/
│   └── fetch_evaluator.py             # Downloads/updates the evaluator bundle
├── results/                           # Benchmark outputs (see below)
├── 1-START_HERE.md                    # Step 1 prompt: generate a plan
├── 2-EVALUATE_PLAN.md                 # Step 2 prompt: evaluate the plan
├── 3-PLAN_EVAL_REPORT.md              # Optional fallback: re-render HTML report
├── runClaude.sh                       # Automated runner for Claude Code
├── runKimi.sh                         # Automated runner for Kimi CLI
├── INSTRUCTIONS.md                    # Development guidelines & architecture patterns
└── AGENTS.md / CLAUDE.md / GEMINI.md  # Agent-specific auto-loaded instructions

Benchmark Results

All completed benchmark runs are stored in the results/ folder, organized by agent/tool and model. Each run produces:

  • PLAN.md: The implementation plan generated by the agent
  • PLAN_EVAL.md: The human-readable coverage evaluation
  • PLAN_EVAL_REPORT.html: A stakeholder-ready visual report

Key Observations

  • Qwen3.6-Plus tops the leaderboard. Kimi CLI with Qwen3.6-Plus achieved the highest overall score at 94.4%, with perfect critical coverage (100.0%) and perfect detail coverage (100.0%). This suggests that smaller, efficient open-weight models can excel at planning tasks.
  • Kimi CLI dominates the top tier. Six of the top nine positions are held by Kimi CLI across different models (Qwen3.6-Plus, Qwen3.6-35B, minimax-m2.7, Kimi K2.6 preview, GLM-5.1, Qwen3.5-35B-uncensored), showing consistently strong planning performance.
  • Claude Sonnet models show a significant gap. Sonnet 4.6 (high) at 75.6% and Sonnet 4.5 at 66.5% fall well behind Opus 4.6 (89.7%) and the Kimi CLI runs, with their weakest results in the Important tier (71.9% and 60.6%).
  • Gemini 3.2 Pro consistently underperforms: even the best Gemini-based run (Gemini CLI without plan mode) reaches only 49.4%. Detail coverage is particularly weak across all Gemini runs, ranging from 0.0% to 20.0%.
  • Tool choice matters. The same model (Gemini 3.2 Pro) scores very differently depending on whether it runs through Gemini CLI without plan mode (49.4%), Antigravity (47.5%), Cursor (42.1%), or Gemini CLI (36.5%), suggesting that prompting strategy and context management can matter as much as raw model capability.

Running the Benchmark

Automated Runners

Two bash scripts are provided for fully automated benchmark execution against specific CLI tools:

runClaude.sh: Claude Code Runner

Prerequisites: claude CLI and python3 installed and in your PATH.

./runClaude.sh

This script will:

  1. Check prerequisites (claude, python3).
  2. Step 1: Launch Claude Code with the prompt "Read 1-START_HERE.md and follow its instructions." to generate results/PLAN.md.
  3. Verify results/PLAN.md was produced.
  4. Fetch the evaluator bundle if missing (python3 tools/fetch_evaluator.py).
  5. Step 2: Launch a fresh Claude Code session with the prompt "Read 2-EVALUATE_PLAN.md and follow its instructions." to generate results/PLAN_EVAL.md and results/PLAN_EVAL_REPORT.html.
  6. Print execution timing for both steps.

runKimi.sh: Kimi CLI Runner

Prerequisites: kimi CLI installed and in your PATH.

./runKimi.sh

This script follows the same two-step workflow as runClaude.sh, but uses the Kimi CLI with --print --yolo --work-dir flags for non-interactive execution.

Both scripts:

  • Run each step in isolation (fresh context), matching the manual workflow.
  • Time each step and report total elapsed time.
  • Exit with an error if expected output files are missing.
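
The shared behavior of both scripts (isolated steps, per-step timing, failure on missing outputs) can be sketched in Python as follows. This is not the content of runClaude.sh or runKimi.sh: `run_step` is a hypothetical helper, and the agent command is passed in by the caller because the exact CLI flags depend on the tool. The demo uses a stand-in command in a temporary directory instead of a real agent.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def run_step(name, command, expected_outputs, cwd="."):
    """Run one benchmark step as its own process and verify its outputs.

    `command` is an argument list for the agent CLI (an assumption: the
    real flags are tool-specific and are not defined here).
    """
    started = time.monotonic()
    # A separate process per step mirrors the fresh-context requirement.
    subprocess.run(command, cwd=cwd, check=True)
    elapsed = time.monotonic() - started

    missing = [p for p in expected_outputs if not (Path(cwd) / p).is_file()]
    if missing:
        raise FileNotFoundError(f"{name}: missing expected output(s): {missing}")
    print(f"{name} finished in {elapsed:.1f}s")
    return elapsed


if __name__ == "__main__":
    # Stand-in for an agent: a tiny Python process that writes the plan file.
    with tempfile.TemporaryDirectory() as workdir:
        Path(workdir, "results").mkdir()
        stand_in = [sys.executable, "-c",
                    "open('results/PLAN.md', 'w').write('# plan')"]
        total = run_step("Step 1", stand_in, ["results/PLAN.md"], cwd=workdir)
        print(f"Total elapsed: {total:.1f}s")
```

Raising on a missing file rather than continuing keeps Step 2 from evaluating a plan that was never written.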

Manual Workflow (Step by Step)

If you prefer to run the benchmark manually, or if you are using an agent/tool not covered by the bash scripts, follow these steps exactly. Each step must be run in a fresh conversation/context to maximize available context window and isolate steps for re-runnability.

Step 1: Generate the Plan

Open a fresh conversation with your coding agent and say:

Read 1-START_HERE.md and follow its instructions.

The agent will:

  1. Read the full PRD in docs/prd/ (starting with product_prd.md, then infra_rider_prd.md, then all supporting documents recursively).
  2. Synthesize a comprehensive implementation plan.
  3. Write it to results/PLAN.md.

Important: This is planning only. The agent must not start implementing code.

Output: results/PLAN.md

Step 2: Evaluate the Plan

Open a new conversation (fresh context) and say:

Read 2-EVALUATE_PLAN.md and follow its instructions.

The agent will:

  1. Read the frozen requirement catalog at evaluator/requirements_catalog_v1.md.
  2. Read the PRD files for semantic context.
  3. Read the plan from results/PLAN.md.
  4. Audit every requirement for coverage (full, partial, or missing).
  5. Write the evaluation to results/PLAN_EVAL.md.
  6. Generate a stakeholder-ready HTML report at results/PLAN_EVAL_REPORT.html.

If evaluator/requirements_catalog_v1.md is missing, run python3 tools/fetch_evaluator.py first.

Requires: results/PLAN.md from Step 1
Outputs: results/PLAN_EVAL.md, results/PLAN_EVAL_REPORT.html
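
Before starting Step 2, the two prerequisites named above can be checked with a short script. The sketch below is illustrative only: `step2_blockers` is a hypothetical helper, not part of the repository, and it only reports what is missing rather than running anything.

```python
from pathlib import Path

CATALOG = "evaluator/requirements_catalog_v1.md"
PLAN = "results/PLAN.md"


def step2_blockers(repo_root="."):
    """Return the reasons Step 2 cannot start yet (empty list if ready)."""
    root = Path(repo_root)
    blockers = []
    if not (root / PLAN).is_file():
        blockers.append(f"{PLAN} is missing: run Step 1 first")
    if not (root / CATALOG).is_file():
        blockers.append(
            f"{CATALOG} is missing: run python3 tools/fetch_evaluator.py"
        )
    return blockers


# Prints one line per missing prerequisite; prints nothing when ready.
for reason in step2_blockers():
    print(reason)
```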

Optional Step 3: Re-render the Report

If results/PLAN_EVAL.md already exists and you only need to regenerate the HTML report (e.g., after a styling tweak), open a fresh conversation and say:

Read 3-PLAN_EVAL_REPORT.md and follow its instructions.

Requires: results/PLAN_EVAL.md
Output: results/PLAN_EVAL_REPORT.html


Why Fresh Conversations?

Each step consumes a significant portion of the agent's context window. Starting fresh ensures:

  • Maximum tokens available for the task at hand.
  • Isolation between steps: you can re-run evaluation without regenerating the plan.
  • Cleaner reasoning: the evaluator should not be primed by the plan generation process.

Development Guidelines

When the benchmark PRD asks an agent to plan or reason about code architecture, the following patterns are expected (defined in INSTRUCTIONS.md):

  • Fractal Architecture: Pages → Features → Sub-Features, each self-contained.
  • Humble Components: TSX files contain markup only; logic lives in custom hooks.
  • No Magic Numbers: All constants and styling tokens extracted to config/theme.
  • Co-location: Feature-specific code lives inside the feature's directory.

These standards are part of what the agent must account for when planning.


Contributing a New Run

To add a new benchmark run:

  1. Create a new folder under results/ with a descriptive name: {Tool}_@_{Model}[_{variant}]/.
  2. Run the benchmark using either the automated scripts or the manual workflow.
  3. Ensure all four artifacts are present:
    • PLAN.md
    • PLAN_EVAL.md
    • PLAN_EVAL_REPORT.html
    • run_metadata.json (if using the control workflow)
  4. Update this README with the new scores in the leaderboard table.
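
The naming pattern and artifact checklist above can be verified with a small helper before opening a contribution. This is a sketch under stated assumptions: `run_folder_name` and `missing_artifacts` are hypothetical names, the tool and model strings in the usage line are placeholders, and how spaces or special characters in names should be handled is not specified by the pattern, so the strings are used as given.

```python
from pathlib import Path

REQUIRED = ("PLAN.md", "PLAN_EVAL.md", "PLAN_EVAL_REPORT.html")
OPTIONAL = ("run_metadata.json",)  # only expected for the control workflow


def run_folder_name(tool, model, variant=None):
    """Build a folder name following {Tool}_@_{Model}[_{variant}]."""
    name = f"{tool}_@_{model}"
    return f"{name}_{variant}" if variant else name


def missing_artifacts(run_dir, include_optional=False):
    """List expected artifact files that are absent from run_dir."""
    names = REQUIRED + (OPTIONAL if include_optional else ())
    return [n for n in names if not (Path(run_dir) / n).is_file()]


# Placeholder tool/model names, used only to show the pattern.
print(run_folder_name("ExampleTool", "ExampleModel", "high"))
# ExampleTool_@_ExampleModel_high
```

Calling `missing_artifacts(Path("results") / folder_name)` returns an empty list when the run folder is complete.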

License

This benchmark is provided for research and comparison purposes. The PRD documents and requirement catalog represent a realistic product specification used strictly for evaluating agent planning capabilities.
