
Commit 4a24175

feat(bench): token-reduction benchmark harness over frozen corpus (MCP-42) (#747)
* feat(bench): token-reduction benchmark harness over frozen corpus (MCP-42)

  Ship the first, fully deterministic slice of the roadmap-#19 benchmark: the token-reduction numbers behind the "massive token savings" claim. Reuses the frozen Spec 065 tool corpus (45 tools, 7 reference servers) as a versioned, non-drifting universe and tiktoken cl100k_base (already a dep) as a reproducible, model-agnostic estimator.

  Compares the static context cost of the three routing modes over the corpus and reports per-mode savings:
  - baseline (all upstream tools loaded directly)
  - retrieve_tools (BM25 discovery + call_tool variants)
  - code_execution (orchestration + retrieve_tools)

  Real proxy tool defs are captured verbatim from internal/server/mcp.go into bench/proxy_tools_v1.json (provenance recorded). Emits report.json + a self-contained dashboard.html (gitignored; reports never committed, per Spec 065 CN-003).

  Conservative by construction: excluding input schemas uniformly understates the baseline, so measured savings (65.5% / 70.3% on the 45-tool corpus) are a floor. Methodology, limitations, and the scoped-but-not-yet-built follow-ups (live run with full schemas + accuracy/latency, LLM e2e, CI publish) are in bench/README.md.

  Related #MCP-42
  Co-Authored-By: Paperclip <noreply@paperclip.ing>

* fix(bench): drop stale line numbers from provenance; add WriteReports smoke test

  KimiReviewer finding 2: code_execution is at line 626 in mcp.go at 89f06b5, not 675 as claimed. Line numbers drift with unrelated edits and the function names are the stable identifier, so remove all line numbers from the provenance comment to prevent future rot.

  KimiReviewer finding 3: add TestWriteReports_SmokeTest covering WriteReports output (JSON round-trips to Report, HTML is non-empty and contains all mode names).

  All 5 tests pass; golangci-lint v2 clean.

  Related #MCP-42
  Co-Authored-By: Paperclip <noreply@paperclip.ing>

* fix(bench): derive per-mode tool catalog from live server builders incl. management tools (MCP-3161)

  The token-reduction benchmark scored only 6 hand-maintained proxy tools and omitted the shared management tool set (upstream_servers, quarantine_security, search_servers, list_registries) that both routing modes append via buildManagementTools. That undercounted the proxy-mode context cost and inflated the headline savings (Codex finding on PR #747).

  Replace bench/proxy_tools_v1.json with server.ProxyModeToolDefs, which builds the catalog from the live builders (buildCallToolModeTools / buildCodeExecModeTools in internal/server/mcp_routing.go) so it can never drift from production and always reflects the tools the agent actually sees. This also fixes a second drift: the fixture's retrieve_tools descriptions did not match the per-mode builder descriptions.

  Corrected figures over the 45-tool Spec 065 corpus (name+description only): retrieve_tools ~17% (10 tools), code_execution ~43% (6 tools). Updated README and notes; the schema-exclusion claim is no longer unambiguously conservative now that large-schema management tools are in the proxy cost.

  Tests: bench asserts both modes include the 4 management tools; internal/server pins ProxyModeToolDefs to the builders so the catalog can't silently drift.

  Related #747

  ---------

  Co-authored-by: Paperclip <noreply@paperclip.ing>
1 parent e3ae24a commit 4a24175

10 files changed

Lines changed: 883 additions & 0 deletions


bench/.gitignore

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
# Benchmark run artifacts are never committed (Spec 065 CN-003).
results/

bench/README.md

Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
# mcpproxy benchmark harness

This harness produces the reproducible numbers behind mcpproxy's marketing
claims (**token reduction**, **discovery accuracy**, and **latency**) by
comparing three ways an agent can be wired to upstream MCP tools.

> Roadmap item #19 (MCP-42). In-repo (`bench/`), reproducible, intended to be
> refreshed on release. Reports are **never committed** (Spec 065 CN-003); only
> code, fixtures, and this methodology are versioned.

## The three modes

| Mode | What the agent sees in context | mcpproxy server |
|------|--------------------------------|-----------------|
| `baseline` | Every upstream tool definition, loaded directly | (no proxy discovery) |
| `retrieve_tools` | `retrieve_tools` + `call_tool_read/write/destructive` + `read_cache` + `code_execution` + management tools; tools found on demand via BM25 | `callToolServer` |
| `code_execution` | `code_execution` + `retrieve_tools` + management tools; many tools orchestrated from sandboxed JS in one round-trip | `codeExecServer` |

Both proxy modes also append the shared **management tool set**
(`upstream_servers`, `quarantine_security`, `search_servers`, `list_registries`)
that the live routing-mode servers expose. These count against the proxy
context cost: omitting them undercounts that cost and inflates the savings.

The per-mode catalog is **derived directly from the live tool builders**
(`buildCallToolModeTools` / `buildCodeExecModeTools` in
`internal/server/mcp_routing.go`, via `server.ProxyModeToolDefs`), so it can
never drift from production.

## What ships today (deterministic, offline)

The **token-reduction** measurement is fully deterministic and runs with no
network or LLM:

```bash
go run ./bench/cmd/bench   # scores the committed Spec 065 corpus
go test ./bench/           # unit + invariant tests
```

It counts the context-token cost of each mode over a **frozen tool corpus** and
reports the savings of each proxy mode versus the baseline. Output: a
`report.json` and a self-contained `dashboard.html` in `bench/results/`
(gitignored).

### Current deterministic result

Over the 45-tool Spec 065 reference corpus, counting **tool name + description
only** (schemas excluded uniformly; see limitations), `cl100k_base`:

| Mode | Context tools | Tokens | Savings vs. baseline |
|------|---------------|--------|----------------------|
| `baseline` | 45 | 1730 | |
| `retrieve_tools` | 10 | 1431 | **~17%** |
| `code_execution` | 6 | 986 | **~43%** |

These are deliberately modest: the proxy context here is the *full* per-mode
tool set (discovery + call-tool variants + management tools), and the corpus is
small. Savings approach 100% asymptotically as the upstream tool count rises
(the baseline grows linearly while the proxy context stays fixed), so always
quote the corpus size alongside a percentage. Reproduce with
`go run ./bench/cmd/bench`.

### Scoring rubric: token reduction

- **Tool universe**: the frozen Spec 065 snapshot
  `specs/065-evaluation-foundation/datasets/corpus_v1.tools.json`: 45 tools
  across 7 no-auth reference servers. Frozen + versioned so scoring never runs
  against a drifting corpus (CN-002).
- **Tokenizer**: `tiktoken cl100k_base`, a widely-used reproducible BPE
  (already a repo dependency). It is a **model-agnostic estimator**; exact
  counts for a specific pinned model (e.g. Claude) will differ, but the
  *relative* savings between modes are stable.
- **Proxy-mode tools**: the *complete* per-mode catalog, derived from the live
  server builders: discovery, the call-tool variants, `code_execution`, **and
  the shared management tool set** (`upstream_servers`, `quarantine_security`,
  `search_servers`, `list_registries`). Nothing the agent actually sees is
  dropped from the proxy cost.
- **Cost of a tool**: `name + "\n" + description`. JSON input schemas are
  excluded **uniformly** across all modes (the committed corpus snapshot does
  not carry schemas).
- **Savings** for a mode `m`: `1 - tokens(m) / tokens(baseline)`.

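As a worked instance of the savings formula above, here is a minimal Go sketch that plugs in the token counts from the result table. The `savings` helper is illustrative and is not taken from the harness; only the formula and the three token counts come from this README.

```go
package main

import (
	"errors"
	"fmt"
)

// savings applies the rubric's definition: 1 - tokens(m) / tokens(baseline).
// It is a hypothetical helper, not the harness's own function.
func savings(modeTokens, baselineTokens int) (float64, error) {
	if baselineTokens <= 0 {
		return 0, errors.New("baseline token count must be positive")
	}
	return 1 - float64(modeTokens)/float64(baselineTokens), nil
}

func main() {
	const baselineTokens = 1730 // baseline row of the result table

	modes := []struct {
		name   string
		tokens int
	}{
		{"retrieve_tools", 1431},
		{"code_execution", 986},
	}

	for _, m := range modes {
		s, err := savings(m.tokens, baselineTokens)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("%-16s %.1f%% fewer tokens\n", m.name, s*100)
	}
}
```

With the table's counts this evaluates to about 17.3% for `retrieve_tools` and 43.0% for `code_execution`, which is where the rounded ~17% and ~43% figures come from.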
### Known limitations (read before quoting a number)

- **Schemas excluded; direction is not clean.** Input schemas are dropped from
  *both* sides. The 45 baseline tools lose their schemas, but so do the proxy
  modes' management tools (e.g. `upstream_servers` carries a large multi-field
  schema). So the name+description-only number is **not** unambiguously
  conservative; it is its own well-defined metric. The live run below adds full
  schemas from `GET /api/v1/tools` for the exact headline number; quote that for
  marketing, not this offline estimate.
- **Savings scale with tool count.** The 45-tool reference corpus is small; real
  deployments expose hundreds to thousands of tools, where the baseline grows
  linearly and the proxy context stays fixed, so savings approach 100%.
  Quote the corpus size alongside any percentage.
- **`cl100k_base` ≠ the pinned model's tokenizer.** Pinning the exact tokenizer
  for the headline model is tracked as a follow-up (see "Roadmap").

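To make the "savings scale with tool count" point concrete, the sketch below extrapolates the measured numbers to larger tool universes. It rests on a loud assumption that is not in the README: that the average per-tool cost stays at the corpus average (1730 tokens over 45 tools) as the universe grows. `projectedSavings` is a hypothetical helper, and its output is a projection, not a benchmark result.

```go
package main

import "fmt"

// projectedSavings extrapolates the savings ratio to upstreamTools tools.
// ASSUMPTION: per-tool cost stays at corpusTokens/corpusTools, and the proxy
// context (proxyTokens) does not change with the number of upstream tools.
func projectedSavings(proxyTokens, corpusTokens, corpusTools, upstreamTools int) float64 {
	if corpusTokens <= 0 || corpusTools <= 0 || upstreamTools <= 0 {
		return 0
	}
	perTool := float64(corpusTokens) / float64(corpusTools)
	baseline := perTool * float64(upstreamTools)
	return 1 - float64(proxyTokens)/baseline
}

func main() {
	const (
		corpusTokens   = 1730 // measured baseline, 45-tool corpus
		corpusTools    = 45
		retrieveTokens = 1431 // measured retrieve_tools context
		codeExecTokens = 986  // measured code_execution context
	)

	for _, n := range []int{45, 450, 4500} {
		fmt.Printf("%5d tools: retrieve_tools %.1f%%, code_execution %.1f%%\n",
			n,
			100*projectedSavings(retrieveTokens, corpusTokens, corpusTools, n),
			100*projectedSavings(codeExecTokens, corpusTokens, corpusTools, n),
		)
	}
}
```

Under that assumption a 450-tool universe would land at roughly 92% and 94%. This is arithmetic on the formula rather than a measurement, so it should not be quoted as a result.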
## What is scoped but not yet built (follow-ups)

These require decisions and/or other roles, so they are tracked as child issues
rather than landed here:

- **Live run with full schemas + accuracy + latency**: boot mcpproxy over the
  Spec 065 `snapshot-servers.config.json` (see `docker-compose.yml`), pull
  `GET /api/v1/tools` for exact schemas, and:
  - **Accuracy**: replay the Spec 065 retrieval golden set
    (`retrieval_golden_v1.json`) through `retrieve_tools` and score Recall@k /
    MRR / nDCG (deterministic, no LLM); reuses the D1 scorer.
  - **Latency**: measure proxy-side `retrieve_tools` search latency vs. the
    fixed cost of loading all tools.
- **End-to-end task success with a pinned LLM**: requires a pinned model + an
  LLM-call budget; this is the only part that costs spend.
- **CI publish-on-release-tag → public static dashboard**: Release/DevOps lane.

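The accuracy follow-up names Recall@k and MRR without defining them. The following is a minimal sketch of the two metrics for a single query, not the D1 scorer itself (which is not shown here). The tool IDs, the `idSet` type, and both function names are hypothetical; nDCG is left out for brevity.

```go
package main

import "fmt"

// idSet is a hypothetical set of relevant tool IDs for one golden query.
type idSet map[string]struct{}

// recallAtK returns the fraction of relevant IDs found in the top k results.
// It assumes the ranked list contains no duplicate IDs.
func recallAtK(ranked []string, relevant idSet, k int) float64 {
	if len(relevant) == 0 || k <= 0 {
		return 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	hits := 0
	for _, id := range ranked[:k] {
		if _, ok := relevant[id]; ok {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

// reciprocalRank returns 1/rank of the first relevant result, or 0 if none.
// MRR is the mean of this value over all queries in the golden set.
func reciprocalRank(ranked []string, relevant idSet) float64 {
	for i, id := range ranked {
		if _, ok := relevant[id]; ok {
			return 1 / float64(i+1)
		}
	}
	return 0
}

func main() {
	// Made-up ranking and relevance labels, for illustration only.
	ranked := []string{"git:git_log", "filesystem:read_file", "git:git_diff"}
	relevant := idSet{"filesystem:read_file": {}, "git:git_status": {}}

	fmt.Printf("Recall@3 = %.2f\n", recallAtK(ranked, relevant, 3))
	fmt.Printf("RR       = %.2f\n", reciprocalRank(ranked, relevant))
}
```

Both metrics depend only on the ranked IDs and the golden labels, which is why the README can describe this scoring as deterministic and LLM-free.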
## Dataset sources & provenance

- Tool corpus + retrieval golden set: Spec 065 frozen datasets
  (`specs/065-evaluation-foundation/datasets/`), generated from 7 permissively
  reachable no-auth reference servers (filesystem, git, memory, sqlite, fetch,
  time, sequential-thinking).
- Proxy + management tool definitions: derived at run time from the live server
  tool builders (`internal/server/mcp_routing.go`
  `buildCallToolModeTools` / `buildCodeExecModeTools`, exposed via
  `internal/server.ProxyModeToolDefs`). No hand-maintained fixture: the
  benchmark cannot drift from the tools the proxy actually serves.

## Reproducible live run (skeleton)

`docker-compose.yml` boots mcpproxy over the frozen reference-server config so
the corpus and live tool list are reproducible across machines. Wiring the live
accuracy/latency scorers into it is the follow-up above.

## Reviewer contact

Methodology questions / disputes: open an issue in `smart-mcp-proxy/mcpproxy-go`
and tag the maintainers, or comment on the roadmap benchmark ticket (MCP-42).

bench/cmd/bench/main.go

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
// Command bench runs the mcpproxy token-reduction benchmark over a frozen tool
// corpus and writes a JSON report plus a static HTML dashboard.
//
// Usage:
//
//	go run ./bench/cmd/bench [-corpus PATH] [-out DIR] [-encoding NAME]
//
// With no flags it scores the committed Spec 065 frozen corpus and writes the
// reports to bench/results/ (gitignored; reports are never committed, per the
// Spec 065 CN-003 repo rule).
package main

import (
	"flag"
	"fmt"
	"log"
	"os"

	"github.com/smart-mcp-proxy/mcpproxy-go/bench"
)

func main() {
	corpusPath := flag.String("corpus", "specs/065-evaluation-foundation/datasets/corpus_v1.tools.json", "path to the frozen tool corpus snapshot")
	outDir := flag.String("out", "bench/results", "output directory for report.json and dashboard.html")
	encoding := flag.String("encoding", bench.DefaultEncoding, "tiktoken encoding name")
	flag.Parse()

	tk, err := bench.NewTokenizer(*encoding)
	if err != nil {
		log.Fatalf("bench: %v", err)
	}
	corpus, err := bench.LoadCorpus(*corpusPath)
	if err != nil {
		log.Fatalf("bench: %v", err)
	}

	report := bench.ComputeReport(tk, corpus)
	jsonPath, htmlPath, err := report.WriteReports(*outDir)
	if err != nil {
		log.Fatalf("bench: %v", err)
	}

	fmt.Fprintf(os.Stdout, "mcpproxy token-reduction benchmark (corpus %s, %d tools, %s)\n", report.CorpusVersion, report.CorpusTools, report.Encoding)
	for _, m := range report.Modes {
		if m.Mode == bench.ModeBaseline {
			fmt.Fprintf(os.Stdout, " %-16s %6d tokens (%d tools) baseline\n", m.Mode, m.Tokens, m.ContextTools)
			continue
		}
		fmt.Fprintf(os.Stdout, " %-16s %6d tokens (%d tools) %.1f%% fewer tokens\n", m.Mode, m.Tokens, m.ContextTools, m.SavingsRatio*100)
	}
	fmt.Fprintf(os.Stdout, "wrote %s and %s\n", jsonPath, htmlPath)
}

bench/docker-compose.yml

Lines changed: 37 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,37 @@
1+
# Reproducible benchmark substrate (skeleton).
2+
#
3+
# Boots mcpproxy over the frozen Spec 065 reference-server config so the tool
4+
# corpus and live tool list are identical across machines. The live
5+
# accuracy/latency scorers (see bench/README.md "follow-ups") attach to this.
6+
#
7+
# Usage:
8+
# docker compose -f bench/docker-compose.yml up --build
9+
# # then, against the running proxy on 127.0.0.1:8092:
10+
# # GET /api/v1/tools -> full tool defs (with schemas) for the live token run
11+
# # retrieve_tools -> Recall@k accuracy over retrieval_golden_v1.json
12+
#
13+
# The committed corpus_v1 snapshot was frozen from exactly this config
14+
# (specs/065-evaluation-foundation/datasets/README.md), so a live snapshot here
15+
# reproduces it (modulo upstream-server version drift — pin images before
16+
# publishing headline numbers).
17+
services:
18+
mcpproxy:
19+
build:
20+
context: ..
21+
dockerfile: Dockerfile
22+
command:
23+
- serve
24+
- --config=/data/snapshot-servers.config.json
25+
- --data-dir=/data/state
26+
- --listen=0.0.0.0:8092
27+
environment:
28+
MCPPROXY_API_KEY: eval-corpus-snapshot
29+
ports:
30+
- "127.0.0.1:8092:8092"
31+
volumes:
32+
# The frozen, servable reference-server config (7 no-auth servers).
33+
- ../specs/065-evaluation-foundation/datasets/snapshot-servers.config.json:/data/snapshot-servers.config.json:ro
34+
- bench-state:/data/state
35+
36+
volumes:
37+
bench-state:

bench/proxytools.go

Lines changed: 40 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,40 @@
1+
package bench
2+
3+
import (
4+
"github.com/smart-mcp-proxy/mcpproxy-go/internal/config"
5+
"github.com/smart-mcp-proxy/mcpproxy-go/internal/server"
6+
)
7+
8+
// ProxyToolsForMode returns the built-in mcpproxy proxy + management tool
9+
// definitions that occupy the agent's context window in the given routing mode.
10+
//
11+
// The catalog is derived directly from the live server tool builders
12+
// (internal/server.ProxyModeToolDefs → buildCallToolModeTools /
13+
// buildCodeExecModeTools in internal/server/mcp_routing.go). This is the single
14+
// source of truth: both routing modes append the shared management tool set
15+
// (upstream_servers, quarantine_security, search_servers, list_registries), so
16+
// deriving from the builders guarantees the benchmark counts the real per-mode
17+
// context cost and can never drift from production by re-introducing the
18+
// undercount that inflated the headline savings (MCP-3161).
19+
func ProxyToolsForMode(mode string) []Tool {
20+
var routingMode string
21+
switch mode {
22+
case ModeCodeExecution:
23+
routingMode = config.RoutingModeCodeExecution
24+
case ModeRetrieveTools:
25+
routingMode = config.RoutingModeRetrieveTools
26+
default:
27+
return nil
28+
}
29+
30+
defs := server.ProxyModeToolDefs(routingMode)
31+
out := make([]Tool, 0, len(defs))
32+
for _, d := range defs {
33+
out = append(out, Tool{
34+
ToolID: "mcpproxy:" + d.Name,
35+
Name: d.Name,
36+
Description: d.Description,
37+
})
38+
}
39+
return out
40+
}
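
The `[]Tool` values built above carry only a name and a description, which lines up with the README's per-tool cost of `name + "\n" + description`. The scoring code itself is not part of the files shown here, so the following is only a rough sketch of how such a slice might be costed. The `TokenCounter` interface, the whitespace-based `wordCounter` stand-in (used instead of tiktoken so the example runs offline), and the per-tool summation in `modeCost` are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// Tool mirrors the three fields populated by ProxyToolsForMode.
type Tool struct {
	ToolID      string
	Name        string
	Description string
}

// TokenCounter is a hypothetical abstraction over a tokenizer.
type TokenCounter interface {
	Count(text string) int
}

// wordCounter is a stand-in that counts whitespace-separated words.
// It is NOT cl100k_base; real token counts will differ.
type wordCounter struct{}

func (wordCounter) Count(text string) int { return len(strings.Fields(text)) }

// modeCost sums the cost of each tool, where a tool's cost is the token count
// of name + "\n" + description. Summing per tool (rather than tokenizing one
// concatenated string) is an assumption of this sketch.
func modeCost(tc TokenCounter, tools []Tool) int {
	total := 0
	for _, t := range tools {
		total += tc.Count(t.Name + "\n" + t.Description)
	}
	return total
}

func main() {
	// Made-up tool definitions, for illustration only.
	tools := []Tool{
		{ToolID: "example:read_file", Name: "read_file", Description: "Read a file from disk"},
		{ToolID: "example:git_log", Name: "git_log", Description: "Show commit history"},
	}
	fmt.Println("context cost (stand-in units):", modeCost(wordCounter{}, tools))
}
```

Keeping the tokenizer behind an interface would let the same cost function run against a real BPE encoder or a pinned model's tokenizer without changing the summation logic.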

bench/report.go

Lines changed: 105 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,105 @@
1+
package bench
2+
3+
import (
4+
"encoding/json"
5+
"fmt"
6+
"html/template"
7+
"os"
8+
"path/filepath"
9+
)
10+
11+
// WriteJSON writes the report as indented JSON to path.
12+
func (r *Report) WriteJSON(path string) error {
13+
data, err := json.MarshalIndent(r, "", " ")
14+
if err != nil {
15+
return fmt.Errorf("marshal report: %w", err)
16+
}
17+
if err := os.WriteFile(path, append(data, '\n'), 0o644); err != nil {
18+
return fmt.Errorf("write %q: %w", path, err)
19+
}
20+
return nil
21+
}
22+
23+
// WriteHTML renders the report as a self-contained static dashboard. The output
24+
// is a single file with no external assets so it can be published as-is to a
25+
// static host (CI release-tag publishing is tracked as a follow-up).
26+
func (r *Report) WriteHTML(path string) error {
27+
tmpl, err := template.New("dashboard").Funcs(template.FuncMap{
28+
"pct": func(f float64) string { return fmt.Sprintf("%.1f%%", f*100) },
29+
}).Parse(dashboardHTML)
30+
if err != nil {
31+
return fmt.Errorf("parse template: %w", err)
32+
}
33+
f, err := os.Create(path)
34+
if err != nil {
35+
return fmt.Errorf("create %q: %w", path, err)
36+
}
37+
defer f.Close()
38+
if err := tmpl.Execute(f, r); err != nil {
39+
return fmt.Errorf("render dashboard: %w", err)
40+
}
41+
return nil
42+
}
43+
44+
// WriteReports writes both report.json and dashboard.html into dir.
45+
func (r *Report) WriteReports(dir string) (jsonPath, htmlPath string, err error) {
46+
if err = os.MkdirAll(dir, 0o755); err != nil {
47+
return "", "", fmt.Errorf("mkdir %q: %w", dir, err)
48+
}
49+
jsonPath = filepath.Join(dir, "report.json")
50+
htmlPath = filepath.Join(dir, "dashboard.html")
51+
if err = r.WriteJSON(jsonPath); err != nil {
52+
return "", "", err
53+
}
54+
if err = r.WriteHTML(htmlPath); err != nil {
55+
return "", "", err
56+
}
57+
return jsonPath, htmlPath, nil
58+
}
59+
60+
const dashboardHTML = `<!doctype html>
61+
<html lang="en">
62+
<head>
63+
<meta charset="utf-8">
64+
<meta name="viewport" content="width=device-width, initial-scale=1">
65+
<title>mcpproxy benchmark — token reduction</title>
66+
<style>
67+
:root { color-scheme: light dark; }
68+
body { font: 16px/1.5 system-ui, sans-serif; max-width: 880px; margin: 2rem auto; padding: 0 1rem; }
69+
h1 { margin-bottom: .25rem; }
70+
.sub { opacity: .7; margin-top: 0; }
71+
table { border-collapse: collapse; width: 100%; margin: 1.5rem 0; }
72+
th, td { padding: .6rem .8rem; text-align: right; border-bottom: 1px solid #8884; }
73+
th:first-child, td:first-child { text-align: left; }
74+
.save { font-weight: 600; color: #1a8f3c; }
75+
code { background: #8881; padding: .1rem .35rem; border-radius: 4px; }
76+
.notes { font-size: .9rem; opacity: .8; }
77+
.notes li { margin: .3rem 0; }
78+
</style>
79+
</head>
80+
<body>
81+
<h1>mcpproxy benchmark</h1>
82+
<p class="sub">Token cost of loading tools into an agent's context, by routing mode.</p>
83+
<p>Corpus <code>{{.CorpusVersion}}</code> &middot; {{.CorpusTools}} tools &middot; encoding <code>{{.Encoding}}</code></p>
84+
<table>
85+
<thead>
86+
<tr><th>Mode</th><th>Tools in context</th><th>Context tokens</th><th>Savings vs. baseline</th></tr>
87+
</thead>
88+
<tbody>
89+
{{range .Modes}}
90+
<tr>
91+
<td><code>{{.Mode}}</code></td>
92+
<td>{{.ContextTools}}</td>
93+
<td>{{.Tokens}}</td>
94+
<td class="save">{{if eq .Mode "baseline"}}&mdash;{{else}}{{pct .SavingsRatio}}{{end}}</td>
95+
</tr>
96+
{{end}}
97+
</tbody>
98+
</table>
99+
<h2>Methodology notes</h2>
100+
<ul class="notes">
101+
{{range .Notes}}<li>{{.}}</li>{{end}}
102+
</ul>
103+
</body>
104+
</html>
105+
`
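
The commit message mentions a smoke test that checks the JSON output round-trips to `Report`. That test is not among the files shown here, so the following is only a minimal sketch of the round-trip idea. The `Report` and `ModeResult` structs are hypothetical mirrors inferred from the fields used in `main.go` and the dashboard template; the real types may carry JSON tags or extra fields, and the sample values (including the `"v1"` version string and the note text) are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// ModeResult is a hypothetical mirror of one per-mode row.
type ModeResult struct {
	Mode         string
	ContextTools int
	Tokens       int
	SavingsRatio float64
}

// Report is a hypothetical mirror of the benchmark report.
type Report struct {
	CorpusVersion string
	CorpusTools   int
	Encoding      string
	Modes         []ModeResult
	Notes         []string
}

// roundTrip marshals the report to indented JSON and decodes it again.
func roundTrip(r Report) (Report, error) {
	data, err := json.MarshalIndent(r, "", "  ")
	if err != nil {
		return Report{}, fmt.Errorf("marshal report: %w", err)
	}
	var out Report
	if err := json.Unmarshal(data, &out); err != nil {
		return Report{}, fmt.Errorf("unmarshal report: %w", err)
	}
	return out, nil
}

// sameReport reports whether two reports are deeply equal.
func sameReport(a, b Report) bool { return reflect.DeepEqual(a, b) }

// sampleReport builds a placeholder report using the README's token counts.
func sampleReport() Report {
	return Report{
		CorpusVersion: "v1", // placeholder version string
		CorpusTools:   45,
		Encoding:      "cl100k_base",
		Modes: []ModeResult{
			{Mode: "baseline", ContextTools: 45, Tokens: 1730},
			{Mode: "retrieve_tools", ContextTools: 10, Tokens: 1431, SavingsRatio: 1 - 1431.0/1730},
		},
		Notes: []string{"placeholder methodology note"},
	}
}

func main() {
	in := sampleReport()
	out, err := roundTrip(in)
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Println("round trip preserved report:", sameReport(in, out))
}
```

A check like this would catch fields that are silently dropped by an unexported name or a mistyped JSON tag before the dashboard is published.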
