Headline: On a representative set of 10 real engineering workflows, answering through Context-Simplo's MCP tools used ~75% fewer tokens than the same questions answered with a grep-and-read workflow — while returning the same (or better) answers.
This document explains what we measure, how to reproduce it, and the numbers
we get. Everything here is generated by scripts in scripts/ against
a live server, so you can re-run it on your own machine and your own repos.
An AI coding assistant has a fixed context budget. Every token it spends finding code (listing directories, reading whole files, re-grepping) is a token it can't spend reasoning about your problem. The grep-and-read loop is especially wasteful because the model usually reads an entire file to extract one function, then reads three more files to find the callers.
Context-Simplo pre-indexes the repository into a graph + vector store and answers structural questions directly: "where is this symbol", "who calls it", "what breaks if I change it". The responses are compact, structured JSON instead of raw file dumps.
Token cost is approximated from response size with the standard heuristic
1 token ≈ 4 bytes (see estimateTokens in
scripts/benchmark.ts). We measure:
- Tool-list overhead — the per-turn cost of advertising the tool schemas.
- Per-scenario response tokens — the cost of each workflow's answer.
- Top-K identities — the actual symbols/files returned, so we can verify the cheaper answer is still the correct answer (no capability regression).
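The byte-based heuristic above can be sketched in a few lines. This is a minimal illustration, assuming bytes are counted as UTF-8 and the result is rounded up; the real estimateTokens in scripts/benchmark.ts may round or count differently. The name estimateTokensSketch and the two sample payloads are made up for the example and do not reflect the actual wire format.

```typescript
// Hypothetical sketch of a byte-based token estimate (1 token ≈ 4 bytes).
// Assumption: bytes are counted as UTF-8 and the result is rounded up.
const BYTES_PER_TOKEN = 4;

function estimateTokensSketch(text: string): number {
  const byteLength = new TextEncoder().encode(text).length;
  return Math.ceil(byteLength / BYTES_PER_TOKEN);
}

// Two made-up payloads carrying the same fact: shorter keys mean fewer bytes,
// which means fewer estimated tokens.
const verbosePayload = JSON.stringify({
  symbolName: "estimateTokens",
  filePath: "scripts/benchmark.ts",
});
const compactPayload = JSON.stringify({
  n: "estimateTokens",
  f: "scripts/benchmark.ts",
});

console.log(estimateTokensSketch(verbosePayload), estimateTokensSketch(compactPayload));
```

Because the estimate depends only on byte length, it is cheap and deterministic, but it will not match any specific tokenizer exactly.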
The 10 workflows come from scripts/benchmark-scenarios.ts
and map to day-to-day engineering tasks:
| ID | Workflow | Tool |
|---|---|---|
| W1 | Onboarding: architecture overview | explain_architecture |
| W2 | Symbol lookup before an edit | find_symbol |
| W3 | Pre-refactor caller scan | find_callers |
| W4 | Refactor blast radius | get_impact_radius |
| W5 | Conceptual exploration | semantic_search |
| W6 | Literal name search | exact_search |
| W7 | Hybrid exploratory search | hybrid_search |
| W8 | Path between two functions | find_path |
| W9 | Pre-release dead-code sweep | find_dead_code |
| W10 | Complexity hotspot scan | find_complex_functions |
Recorded runs in bench/ compare the v0.1.0 wire format against the
v0.2.0 compact format on the same indexed repository (same node/edge counts):
| Metric | v0.1.0 (baseline) | v0.2.0 (candidate) | Reduction |
|---|---|---|---|
| Total scenario tokens | 13,041 | 3,391 | 74.0% |
| Tool-list overhead | 1,710 | 1,627 | 4.9% |
Biggest wins are on the high-volume tools, where verbose JSON keys and full snippets dominated:
| Scenario | v0.1.0 | v0.2.0 | Reduction |
|---|---|---|---|
| W5 conceptual search | 3,658 | 96 | 97.4% |
| W7 hybrid search | 3,666 | 656 | 82.1% |
| W4 impact radius | 2,949 | 1,285 | 56.4% |
| W1 architecture | 1,701 | 835 | 50.9% |
Crucially, the cheaper answers returned the same top-K symbols — the savings come from format, not from dropping information.
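The Reduction column in the tables above follows from the two token counts as 1 − candidate / baseline. A small sketch, using a hypothetical helper name (reductionPercent is not taken from the scripts), reproduces two of the reported figures:

```typescript
// Reduction = 1 - candidate / baseline, expressed as a percentage.
// reductionPercent is an illustrative helper, not part of the harness.
function reductionPercent(baselineTokens: number, candidateTokens: number): number {
  if (baselineTokens <= 0) {
    throw new Error("baselineTokens must be positive");
  }
  return (1 - candidateTokens / baselineTokens) * 100;
}

console.log(reductionPercent(13041, 3391).toFixed(1)); // total scenario tokens: prints "74.0"
console.log(reductionPercent(3658, 96).toFixed(1));    // W5 conceptual search: prints "97.4"
```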
The internal benchmark measures the tool surface. The bigger story is the agent's total context cost to answer a real question. In a head-to-head comparison, an assistant analyzed this repository two ways:
| Approach | Approx. tokens to build understanding |
|---|---|
| Traditional (glob + read + grep) | ~42,000 |
| Context-Simplo MCP | ~6,000 |
| Reduction | ~85% |
The traditional path spent most of its budget reading whole files
(README, package.json, index.ts, server.ts) to extract a few facts. The MCP
path queried pre-indexed structure and got compact answers with exact line numbers,
call relationships, and impact radius — capabilities the grep path can't produce at
all without even more reading.
Numbers vary with repo size, query mix, and model. Treat ~75–85% as the observed range, not a guarantee. Re-run the harness on your repo for your number.
Prerequisites: a running Context-Simplo server (default http://localhost:3001/mcp)
with at least one repository indexed.
# 1. Record a run (writes bench/<label>.json and bench/<label>.md)
pnpm tsx scripts/benchmark.ts --label my-run
# 2. (Optional) Record a baseline in the legacy wire format for comparison
pnpm tsx scripts/benchmark.ts --label my-baseline --profile v1-full
# 3. Compare two runs and emit a report (exits non-zero if the ship gate fails)
pnpm tsx scripts/benchmark-compare.ts \
  bench/my-baseline.json bench/my-run.json \
  --report bench/REPORT.md

Point at a different server with MCP_URL=http://host:port/mcp.
benchmark-compare.ts enforces three rules and fails CI if any break:
- Aggregate token savings must be ≥ 30%.
- No individual scenario may get more expensive.
- Zero capability regressions (the cheaper answer must still contain the known-correct symbols).

Caveats to keep in mind when reading these numbers:
- Token counts are a byte-based approximation, not a specific tokenizer's output. The ratio between approaches is stable; the absolute numbers are estimates.
- The grep-and-read figure depends on how aggressively the agent reads files. We report a realistic, not worst-case, traversal.
- Results depend on repository size and the query mix. The harness captures
  repositoryState (file/node/edge counts) in every run so comparisons stay apples-to-apples.
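
The three ship-gate rules that benchmark-compare.ts enforces can be sketched as a single check. This is an illustration only: the ScenarioResult shape, the checkShipGate name, and the choice to treat the baseline's top-K symbols as the known-correct set are all assumptions, and the real script and the run files in bench/ may be structured differently.

```typescript
// Hypothetical shapes; not the actual bench/ file format.
interface ScenarioResult {
  id: string;      // e.g. "W5"
  tokens: number;  // estimated response tokens
  topK: string[];  // symbols/files returned
}

interface GateResult {
  passed: boolean;
  failures: string[];
}

function checkShipGate(
  baseline: ScenarioResult[],
  candidate: ScenarioResult[],
  minSavings: number = 0.3, // rule 1 threshold: aggregate savings >= 30%
): GateResult {
  const failures: string[] = [];
  let baselineTotal = 0;
  let candidateTotal = 0;

  for (const before of baseline) {
    const after = candidate.find((s) => s.id === before.id);
    if (!after) {
      failures.push(`${before.id}: missing from candidate run`);
      continue;
    }
    baselineTotal += before.tokens;
    candidateTotal += after.tokens;

    // Rule 2: no individual scenario may get more expensive.
    if (after.tokens > before.tokens) {
      failures.push(`${before.id}: ${before.tokens} -> ${after.tokens} tokens`);
    }

    // Rule 3: zero capability regressions.
    // Assumption: the baseline top-K is the known-correct symbol set.
    const missing = before.topK.filter((sym) => after.topK.indexOf(sym) === -1);
    if (missing.length > 0) {
      failures.push(`${before.id}: lost ${missing.join(", ")}`);
    }
  }

  // Rule 1: aggregate token savings must meet the threshold.
  const savings = baselineTotal > 0 ? 1 - candidateTotal / baselineTotal : 0;
  if (savings < minSavings) {
    failures.push(`aggregate savings ${(savings * 100).toFixed(1)}% below threshold`);
  }

  return { passed: failures.length === 0, failures };
}

// Made-up data using two token counts from the tables above.
const baselineRun: ScenarioResult[] = [
  { id: "W5", tokens: 3658, topK: ["symbolA", "symbolB"] },
  { id: "W1", tokens: 1701, topK: ["symbolC"] },
];
const candidateRun: ScenarioResult[] = [
  { id: "W5", tokens: 96, topK: ["symbolA", "symbolB"] },
  { id: "W1", tokens: 835, topK: ["symbolC"] },
];
console.log(checkShipGate(baselineRun, candidateRun));
```

Collecting every failure rather than stopping at the first one makes the report more useful when several scenarios regress at once.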