Edit surface benchmark suite for ChangeDown. Measures how different editing tool surfaces affect agent efficiency (tokens, rounds, duration) when performing document editing tasks.
# Build the harness
npm run build -w packages/benchmarks
# Run a single trial
npm run trial -w packages/benchmarks -- --surface C --task task5
# Run the full benchmark matrix
npm run benchmark -w packages/benchmarks
# Analyze token usage for a completed run
npx tsx packages/benchmarks/harness/analyze-tokens.ts results/<run-dir>
# Analyze all runs in batch
npx tsx packages/benchmarks/harness/analyze-tokens.ts --all results/
# Run tests
cd packages/benchmarks && npx vitest runpackages/benchmarks/
├── harness/ # TypeScript benchmark runner (OpenCode CLI-based)
│ ├── analyze-tokens.ts # Post-run token analyzer (tiktoken)
│ ├── run-full-benchmark.ts # Full benchmark matrix runner
│ ├── run-trial-cli.ts # Single trial runner
│ └── __tests__/ # Vitest test suite
│ └── analyze-tokens.test.ts
├── fixtures/ # Task documents and prompts
│ ├── task1-rename/ # Multi-file concept rename (4 docs, ~20 rename sites)
│ ├── task2-audit/ # Discovery-driven audit (1 doc, 6 planted issues)
│ ├── task3-restructure/ # Section restructure (1 doc, ~300 lines)
│ ├── task4-review/ # Review and amend cycle (1 doc, existing CriticMarkup)
│ ├── task5-copyedit/ # Single-file copyedit (1 doc, 22 planted errors)
│ ├── benchmark-adr/ # Early-era ADR fixture with golden file
│ └── prompts.json # All task prompts (instructed + outcome-only variants)
├── results/ # Benchmark run outputs
│ ├── canonical/ # Post-bugfix, reproducible runs (cite these)
│ ├── exploratory/ # Smoke, pre-fix, one-off experiments
│ └── outcome-only/ # No tool instructions in prompt
├── docs/ # Documentation index
│ ├── THREADS.md # Open threads and next steps
│ ├── plans/ # Symlinks to docs/plans/ (6 benchmark design docs)
│ ├── research/ # Symlinks to docs/research/ (15 benchmark research docs)
│ └── presentations/ # Symlinks to docs/research/ (slide decks, PDFs)
└── golden/ # Expected outputs for correctness scoring (FUTURE)
Post-run tool that reads events.jsonl from benchmark runs, tokenizes MCP tool payloads with tiktoken, and produces a token-audit.json with per-tool breakdowns and verification against API-reported totals.
# Analyze a single run
npx tsx packages/benchmarks/harness/analyze-tokens.ts results/G-task1_minimax-m2.5
# Analyze all runs in batch
npx tsx packages/benchmarks/harness/analyze-tokens.ts --all results/
# Via npm script (from packages/benchmarks/)
npm run analyze-tokens -- <run-dir>Output (token-audit.json):
verification— API-reported totals vs tokenized tool payloads, withaccountingGap(fraction of tokens not attributable to tool payloads — system prompt, schemas, and conversation history account for the gap, typically 90-96%)perTool— Per-tool breakdown: call count, total input/output tokens, averages per callperStep— Per-step breakdown with API token counts and individual tool callsmeta— Tokenizer info (tiktoken/cl100k_base) and timestamp
Tests:
cd packages/benchmarks && npx vitest run harness/__tests__/analyze-tokens.test.ts| Surface | Mechanism | Description |
|---|---|---|
| A | Raw file edit/write | Baseline — agent uses standard file tools |
| B | propose_change (old_text/new_text) |
One change per tool call, string matching |
| C | propose_batch (LINE:HASH + at/op DSL) |
Batch changes, stable addressing, MCP tools |
| D | sc CLI via Bash |
Same ops as C but via shell commands, no MCP schema overhead |
Canonical Task 5 (copyedit, 22 fixes):
| Surface | Tools | Rounds | Output Tokens | vs. A |
|---|---|---|---|---|
| A | 26 | 27 | 6,730 | baseline |
| B | 30 | 29 | 18,444 | 2.7x worse |
| C | 7 | 7 | 2,971 | 2.3x better |
Outcome-only (no tool instructions): C is 3x faster, 6.5x fewer tokens than A.
See docs/research/2026-02-15-edit-surface-benchmark-findings.md for full analysis.
| Directory | Classification | Notes |
|---|---|---|
canonical/task1-v2 |
Canonical | Post-bugfix, instructed, Sonnet 4.5 |
canonical/task5-v2 |
Canonical | Post-bugfix, instructed, Sonnet 4.5 |
outcome-only/task5 |
Canonical | No tool instructions, post-bugfix |
exploratory/canonical-task1-pre-v2 |
Exploratory | Superseded by v2 |
exploratory/canonical-task5-pre-bugfix |
Exploratory | Surface C broken (4 propose_batch bugs) |
exploratory/smoke-* |
Exploratory | Validation runs |
exploratory/early-* |
Exploratory | First isolated runs |
exploratory/v2-A-svelte |
Exploratory | Svelte prompt variant |
exploratory/v2-C-propose |
Exploratory | Led to desire-to-close finding |
exploratory/results-smoke |
Exploratory | Separate smoke run set |
- Deliberation Effectiveness Benchmark — Early benchmark concept
- OpenCode Harness Design — CLI harness architecture
- OpenCode Harness Implementation — Implementation plan
- Edit Surface Benchmark Design — 4-task benchmark specification
- Edit Surface Benchmark Implementation — Implementation plan
- Benchmark of Our Dreams — Pre-registered protocol with 4 claims
- Benchmark Findings — Canonical results (cite this)
- Tool Surface Audit — Verification of findings against raw data
- Raw Traces — Verbatim events.jsonl excerpts
- Raw Traces Appendix — Extended traces
- Show Your Work A vs C — Human-readable tool-call walkthrough
- Edit Surface Comparison — Token-level comparison
- Character-Level Edit Gap — 3.4x overhead analysis
- Desire to Close — Agent self-review behavior finding
- Skeptical Hand Check — Independent verification
- Emergent Audit Behavior — Behavioral observations
- Batch Efficiency Proposal — ADR restructure benchmark
- Initial Workflow Results — Early Qwen results
- Compact Mode Feedback — First use observations
- Agent Stress Test — Batch stress testing
- Clean Surface Comparison — Cleaned version
- SC CLI UX Research — Sonnet first-use CLI testing (Surface D motivation)
Tasks (rows) vs Surfaces (columns). Checkmarks = prompts defined in fixtures/prompts.json.
| Task | A | B | C | D | Notes |
|---|---|---|---|---|---|
| task1 (rename) | x | x | x | x | Multi-file, ~20 rename sites |
| task2 (audit) | x | x | x | x | Discovery-driven, 6 planted issues |
| task3 (restructure) | x | x | x | x | Structural moves, cross-references |
| task4 (review) | - | x | x | x | Accept/reject/amend/respond cycle |
| task5 (copyedit) | x | x | x | x | 22 character-level fixes |
| task5_outcome | x | x | x | x | No tool instructions in prompt |
| task5_v2 | x | - | x | x | Minimal instruction variant |
Run status: Only tasks 1 and 5 have canonical results (Surfaces A/B/C). Surface D and tasks 2-4 have zero runs.
See THREADS.md for 13 tracked open items.