|
| 1 | +# mcpproxy benchmark harness |
| 2 | + |
| 3 | +The reproducible numbers behind mcpproxy's marketing claims — **token reduction**, |
| 4 | +**discovery accuracy**, and **latency** — comparing three ways an agent can be |
| 5 | +wired to upstream MCP tools. |
| 6 | + |
| 7 | +> Roadmap item #19 (MCP-42). In-repo (`bench/`), reproducible, intended to be |
| 8 | +> refreshed on release. Reports are **never committed** (Spec 065 CN-003); only |
| 9 | +> code, fixtures, and this methodology are versioned. |
| 10 | +
|
| 11 | +## The three modes |
| 12 | + |
| 13 | +| Mode | What the agent sees in context | mcpproxy server | |
| 14 | +|------|--------------------------------|-----------------| |
| 15 | +| `baseline` | Every upstream tool definition, loaded directly | (no proxy discovery) | |
| 16 | +| `retrieve_tools` | `retrieve_tools` + `call_tool_read/write/destructive` + `read_cache` + `code_execution` + management tools; tools found on demand via BM25 | `callToolServer` | |
| 17 | +| `code_execution` | `code_execution` + `retrieve_tools` + management tools; many tools orchestrated from sandboxed JS in one round-trip | `codeExecServer` | |
| 18 | + |
| 19 | +Both proxy modes also append the shared **management tool set** — |
| 20 | +`upstream_servers`, `quarantine_security`, `search_servers`, `list_registries` |
| 21 | +— that the live routing-mode servers expose. These count against the proxy |
| 22 | +context cost: omitting them undercounts that cost and inflates the savings. |
| 23 | + |
| 24 | +The per-mode catalog is **derived directly from the live tool builders** |
| 25 | +(`buildCallToolModeTools` / `buildCodeExecModeTools` in |
| 26 | +`internal/server/mcp_routing.go`, via `server.ProxyModeToolDefs`), so it can |
| 27 | +never drift from production. |
| 28 | + |
| 29 | +## What ships today (deterministic, offline) |
| 30 | + |
| 31 | +The **token-reduction** measurement is fully deterministic and runs with no |
| 32 | +network or LLM: |
| 33 | + |
| 34 | +```bash |
| 35 | +go run ./bench/cmd/bench # scores the committed Spec 065 corpus |
| 36 | +go test ./bench/ # unit + invariant tests |
| 37 | +``` |
| 38 | + |
| 39 | +It counts the context-token cost of each mode over a **frozen tool corpus** and |
| 40 | +reports the savings of each proxy mode versus the baseline. Output: a |
| 41 | +`report.json` and a self-contained `dashboard.html` in `bench/results/` |
| 42 | +(gitignored). |
| 43 | + |
| 44 | +#### Current deterministic result |
| 45 | + |
| 46 | +Over the 45-tool Spec 065 reference corpus, counting **tool name + description |
| 47 | +only** (schemas excluded uniformly — see limitations), `cl100k_base`: |
| 48 | + |
| 49 | +| Mode | Context tools | Tokens | Savings vs. baseline | |
| 50 | +|------|---------------|--------|----------------------| |
| 51 | +| `baseline` | 45 | 1730 | — | |
| 52 | +| `retrieve_tools` | 10 | 1431 | **~17%** | |
| 53 | +| `code_execution` | 6 | 986 | **~43%** | |
| 54 | + |
| 55 | +These are deliberately modest: the proxy context here is the *full* per-mode |
| 56 | +tool set (discovery + call-tool variants + management tools), and the corpus is |
| 57 | +small. Savings grow toward the asymptote as the upstream tool count rises (the |
| 58 | +baseline grows linearly while the proxy context stays fixed) — always quote the |
| 59 | +corpus size alongside a percentage. Reproduce with `go run ./bench/cmd/bench`. |
| 60 | + |
| 61 | +### Scoring rubric — token reduction |
| 62 | + |
| 63 | +- **Tool universe**: the frozen Spec 065 snapshot |
| 64 | + `specs/065-evaluation-foundation/datasets/corpus_v1.tools.json` — 45 tools |
| 65 | + across 7 no-auth reference servers. Frozen + versioned so scoring never runs |
| 66 | + against a drifting corpus (CN-002). |
| 67 | +- **Tokenizer**: `tiktoken cl100k_base`, a widely-used reproducible BPE |
| 68 | + (already a repo dependency). It is a **model-agnostic estimator**; exact |
| 69 | + counts for a specific pinned model (e.g. Claude) will differ, but the |
| 70 | + *relative* savings between modes are stable. |
| 71 | +- **Proxy-mode tools**: the *complete* per-mode catalog, derived from the live |
| 72 | + server builders — discovery, the call-tool variants, `code_execution`, **and |
| 73 | + the shared management tool set** (`upstream_servers`, `quarantine_security`, |
| 74 | + `search_servers`, `list_registries`). Nothing the agent actually sees is |
| 75 | + dropped from the proxy cost. |
| 76 | +- **Cost of a tool**: `name + "\n" + description`. JSON input schemas are |
| 77 | + excluded **uniformly** across all modes (the committed corpus snapshot does |
| 78 | + not carry schemas). |
| 79 | +- **Savings** for a mode `m`: `1 - tokens(m) / tokens(baseline)`. |
| 80 | + |
| 81 | +### Known limitations (read before quoting a number) |
| 82 | + |
| 83 | +- **Schemas excluded — direction is not clean.** Input schemas are dropped from |
| 84 | + *both* sides. The 45 baseline tools lose their schemas, but so do the proxy |
| 85 | + modes' management tools (e.g. `upstream_servers` carries a large multi-field |
| 86 | + schema). So the name+description-only number is **not** unambiguously |
| 87 | + conservative — it is its own well-defined metric. The live run below adds full |
| 88 | + schemas from `GET /api/v1/tools` for the exact headline number; quote that for |
| 89 | + marketing, not this offline estimate. |
| 90 | +- **Savings scale with tool count.** The 45-tool reference corpus is small; real |
| 91 | + deployments expose hundreds–thousands of tools, where the baseline grows |
| 92 | + linearly and the proxy context stays fixed, so savings approach the asymptote. |
| 93 | + Quote the corpus size alongside any percentage. |
| 94 | +- **`cl100k_base` ≠ the pinned model's tokenizer.** Pinning the exact tokenizer |
| 95 | + for the headline model is tracked as a follow-up (see "Roadmap"). |
| 96 | + |
| 97 | +## What is scoped but not yet built (follow-ups) |
| 98 | + |
| 99 | +These require decisions and/or other roles, so they are tracked as child issues |
| 100 | +rather than landed here: |
| 101 | + |
| 102 | +- **Live run with full schemas + accuracy + latency** — boot mcpproxy over the |
| 103 | + Spec 065 `snapshot-servers.config.json` (see `docker-compose.yml`), pull |
| 104 | + `GET /api/v1/tools` for exact schemas, and: |
| 105 | + - **Accuracy**: replay the Spec 065 retrieval golden set |
| 106 | + (`retrieval_golden_v1.json`) through `retrieve_tools` and score Recall@k / |
| 107 | + MRR / nDCG (deterministic, no LLM) — reuses the D1 scorer. |
| 108 | + - **Latency**: measure proxy-side `retrieve_tools` search latency vs. the |
| 109 | + fixed cost of loading all tools. |
| 110 | +- **End-to-end task success with a pinned LLM** — requires a pinned model + an |
| 111 | + LLM-call budget; this is the only part that costs spend. |
| 112 | +- **CI publish-on-release-tag → public static dashboard** — Release/DevOps lane. |
| 113 | + |
| 114 | +## Dataset sources & provenance |
| 115 | + |
| 116 | +- Tool corpus + retrieval golden set: Spec 065 frozen datasets |
| 117 | + (`specs/065-evaluation-foundation/datasets/`), generated from 7 permissively |
| 118 | + reachable no-auth reference servers (filesystem, git, memory, sqlite, fetch, |
| 119 | + time, sequential-thinking). |
| 120 | +- Proxy + management tool definitions: derived at run time from the live server |
| 121 | + tool builders (`internal/server/mcp_routing.go` → |
| 122 | + `buildCallToolModeTools` / `buildCodeExecModeTools`, exposed via |
| 123 | + `internal/server.ProxyModeToolDefs`). No hand-maintained fixture — the |
| 124 | + benchmark cannot drift from the tools the proxy actually serves. |
| 125 | + |
| 126 | +## Reproducible live run (skeleton) |
| 127 | + |
| 128 | +`docker-compose.yml` boots mcpproxy over the frozen reference-server config so |
| 129 | +the corpus and live tool list are reproducible across machines. Wiring the live |
| 130 | +accuracy/latency scorers into it is the follow-up above. |
| 131 | + |
| 132 | +## Reviewer contact |
| 133 | + |
| 134 | +Methodology questions / disputes: open an issue in `smart-mcp-proxy/mcpproxy-go` |
| 135 | +and tag the maintainers, or comment on the roadmap benchmark ticket (MCP-42). |
0 commit comments