Commit e3588fa

feat(bench): live benchmark run — full schemas + Recall@k + latency (MCP-42a) (#748)
* feat(bench): live benchmark run — full schemas + Recall@k + latency (MCP-42a)

Extends the bench/ harness (PR #747) with a live run against a running proxy:

- Exact token number: GET /api/v1/tools pulls upstream tools WITH full JSON input schemas; proxy-mode tools carry their live schemas via the extended server.ProxyModeToolDefs (BenchProxyToolDef.Schema). Schemas are counted on BOTH sides so the headline savings is authoritative, and the headline is withheld (authoritative_headline=false) if any proxy tool lacks a schema (the MCP-3161 overstatement guard).
- Accuracy: replays the Spec 065 retrieval golden set through the proxy BM25 search (GET /api/v1/index/search) and scores Recall@{1,3,5,10}/MRR/nDCG@10/MAP against graded labels (deterministic, no LLM). Field names mirror Spec 065 score-report.schema.json.
- Latency: client-measured per-query search latency (p50/p95/p99/max) vs. the one-shot load-all-tools cost (server "took" is a 0ms stub).

CLI: `go run ./bench/cmd/bench -live -proxy URL -api-key KEY`. Reports stay gitignored (CN-003). All metric math + the live client are unit-tested with httptest stubs; the docker-compose substrate is the live-reproduction path.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* fix(bench): preserve upstream schemas through /api/v1/tools baseline

ConvertGenericToolsToTyped read generic["schema"], but every producer of the generic tool map (runtime/server GetServerTools, mcp.go) emits the upstream input schema under "inputSchema". The /api/v1/tools response therefore dropped every schema, so the MCP-42a live benchmark baseline was silently a description-only token count instead of the required full-schema count, while still able to emit authoritative_headline=true.

- Read "inputSchema" first in the converter, keep "schema" as a legacy fallback.
- Gate the live headline on baseline schemas too (BaselineSchemasCounted via anyHaveSchema): a systematically schema-less baseline now withholds the headline instead of claiming a full-schema baseline it never had.
- Tests: converter preserves inputSchema (+legacy schema fallback); headline withheld when the baseline carries no schemas.

Related #748

* fix(bench): conform live retrieval report to Spec 065 score-report schema

Addresses CodexReviewer finding on PR #748 / MCP-3167: the live `retrieval` payload emitted flat metric fields, but score-report.schema.json requires nested `retrieval.metrics` + `retrieval.gate`. Restructure RetrievalMetrics into {metrics, gate} so live_report.json validates against the contract, proven by a new jsonschema-validation test (TestRetrievalMetricsConformsToScoreReportSchema). A standalone live run has no stored baseline, so gate.passed is true by construction (CI regression-gating against a committed baseline is MCP-3133).

Co-Authored-By: Paperclip <noreply@paperclip.ing>

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>
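The key-lookup order described in the second fix ("inputSchema" first, "schema" as a legacy fallback) can be sketched as below. This is only an illustration under assumptions: the helper name `schemaFromGeneric` is hypothetical, the generic tool is assumed to be a `map[string]interface{}`, and none of this is the actual `ConvertGenericToolsToTyped` code, which this page does not show.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// schemaFromGeneric is a hypothetical helper (not the real converter).
// It returns the tool's input schema as raw JSON, preferring the
// "inputSchema" key and falling back to the legacy "schema" key.
func schemaFromGeneric(generic map[string]interface{}) (json.RawMessage, bool) {
	for _, key := range []string{"inputSchema", "schema"} {
		v, ok := generic[key]
		if !ok || v == nil {
			continue
		}
		raw, err := json.Marshal(v)
		if err != nil {
			continue // unserializable value: try the next key
		}
		return raw, true
	}
	return nil, false
}

func main() {
	// Illustrative tool map; the tool name is made up for the example.
	tool := map[string]interface{}{
		"name":        "read_file",
		"inputSchema": map[string]interface{}{"type": "object"},
	}
	raw, ok := schemaFromGeneric(tool)
	fmt.Println(ok, string(raw))
}
```

Checking "inputSchema" first means a producer that emits both keys is read the same way as one that emits only the current key, while older producers that only emit "schema" keep working.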
1 parent 4a24175 commit e3588fa

14 files changed

Lines changed: 1340 additions & 31 deletions

bench/README.md

Lines changed: 42 additions & 11 deletions
@@ -94,19 +94,48 @@ corpus size alongside a percentage. Reproduce with `go run ./bench/cmd/bench`.
 - **`cl100k_base` ≠ the pinned model's tokenizer.** Pinning the exact tokenizer
   for the headline model is tracked as a follow-up (see "Roadmap").
 
+## Live run — full schemas + accuracy + latency
+
+The live run boots mcpproxy over the Spec 065 reference-server config and
+measures the three headline claims against a *running* proxy. Everything here is
+still deterministic and LLM-free.
+
+```bash
+# 1. Boot the reproducible substrate (proxy + 7 no-auth reference servers)
+docker compose -f bench/docker-compose.yml up --build -d
+
+# 2. Score against the running proxy (writes bench/results/live_report.json)
+go run ./bench/cmd/bench -live -proxy http://127.0.0.1:8092 -api-key eval-corpus-snapshot
+```
+
+What it adds over the offline token run:
+
+- **Exact token number (full schemas).** Pulls `GET /api/v1/tools` for the
+  upstream tools *with their full JSON input schemas* and counts them against
+  the proxy modes — whose management-tool schemas come from the same live
+  builders as the offline run (`server.ProxyModeToolDefs`). Because schemas are
+  counted on **both** sides, the savings is authoritative.
+  - **Safety valve (MCP-3161):** if any proxy tool is missing a schema, counting
+    the baseline's schemas alone would *overstate* savings, so the run
+    **withholds the headline %** and reports raw token totals only
+    (`authoritative_headline: false`). Never quote a withheld run.
+- **Accuracy.** Replays `retrieval_golden_v1.json` through the proxy's BM25
+  search (`GET /api/v1/index/search`) and scores **Recall@{1,3,5,10}, MRR,
+  nDCG@10, MAP** against the graded labels. Deterministic (BM25), so a single
+  run is reported (`runs_averaged: 1`). The emitted `retrieval` block **conforms
+  to** the Spec 065 `score-report.schema.json` shape — nested `metrics` + `gate`
+  (verified by a schema-validation test). A standalone live run has no stored
+  baseline to regress against, so `gate.passed` is `true` by construction;
+  CI regression-gating against a committed baseline is the MCP-3133 lane.
+- **Latency.** Client-measured per-query search latency (p50/p95/p99/max) vs.
+  the one-shot cost of loading all tools. Measured client-side on purpose: the
+  server's `SearchToolsResponse.took` field is currently a `"0ms"` stub.
+
 ## What is scoped but not yet built (follow-ups)
 
 These require decisions and/or other roles, so they are tracked as child issues
 rather than landed here:
 
-- **Live run with full schemas + accuracy + latency** — boot mcpproxy over the
-  Spec 065 `snapshot-servers.config.json` (see `docker-compose.yml`), pull
-  `GET /api/v1/tools` for exact schemas, and:
-  - **Accuracy**: replay the Spec 065 retrieval golden set
-    (`retrieval_golden_v1.json`) through `retrieve_tools` and score Recall@k /
-    MRR / nDCG (deterministic, no LLM) — reuses the D1 scorer.
-  - **Latency**: measure proxy-side `retrieve_tools` search latency vs. the
-    fixed cost of loading all tools.
 - **End-to-end task success with a pinned LLM** — requires a pinned model + an
   LLM-call budget; this is the only part that costs spend.
 - **CI publish-on-release-tag → public static dashboard** — Release/DevOps lane.
@@ -123,11 +152,13 @@ rather than landed here:
   `internal/server.ProxyModeToolDefs`). No hand-maintained fixture — the
   benchmark cannot drift from the tools the proxy actually serves.
 
-## Reproducible live run (skeleton)
+## Reproducible live run
 
 `docker-compose.yml` boots mcpproxy over the frozen reference-server config so
-the corpus and live tool list are reproducible across machines. Wiring the live
-accuracy/latency scorers into it is the follow-up above.
+the corpus and live tool list are reproducible across machines. The live
+accuracy/latency/full-schema scorers attach to it via `-live` (see "Live run"
+above). Pin the upstream-server images before publishing headline numbers
+(image drift can change the tool corpus).
 
 ## Reviewer contact
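The accuracy and latency metrics named in the README's "Live run" section can be illustrated with a small sketch. It is not the scorer from this commit, whose metric code is not shown on this page. It assumes binary relevance (any positively graded label counts as relevant), unique IDs in the ranking, and the nearest-rank percentile method; the function names and the tool IDs in `main` are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// recallAtK returns the fraction of relevant tool IDs found in the top k
// entries of ranked. IDs in ranked are assumed to be unique.
func recallAtK(ranked []string, relevant map[string]bool, k int) float64 {
	if len(relevant) == 0 || k <= 0 {
		return 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	hits := 0
	for _, id := range ranked[:k] {
		if relevant[id] {
			hits++
		}
	}
	return float64(hits) / float64(len(relevant))
}

// reciprocalRank returns 1/rank of the first relevant result, or 0 when no
// relevant result was returned. MRR is the mean of this value over all queries.
func reciprocalRank(ranked []string, relevant map[string]bool) float64 {
	for i, id := range ranked {
		if relevant[id] {
			return 1 / float64(i+1)
		}
	}
	return 0
}

// percentile returns the p-th percentile (0 < p <= 100) of the samples using
// the nearest-rank method. The input slice is not modified.
func percentile(samplesMs []float64, p float64) float64 {
	if len(samplesMs) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samplesMs...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical ranking and labels in the server:tool ID form.
	ranked := []string{"fs:read_file", "git:log", "fs:write_file"}
	relevant := map[string]bool{"git:log": true, "fs:write_file": true}
	fmt.Printf("Recall@1=%.3f Recall@3=%.3f RR=%.3f\n",
		recallAtK(ranked, relevant, 1), recallAtK(ranked, relevant, 3),
		reciprocalRank(ranked, relevant))

	// Hypothetical client-measured latencies in milliseconds.
	latenciesMs := []float64{4.2, 3.1, 5.0, 3.8, 12.5}
	fmt.Printf("p50=%.1fms p95=%.1fms max=%.1fms\n",
		percentile(latenciesMs, 50), percentile(latenciesMs, 95), percentile(latenciesMs, 100))
}
```

With few samples, nearest-rank p95 and p99 collapse onto the maximum, which is one reason to report `max` and the sample count alongside the percentiles.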

bench/cmd/bench/main.go

Lines changed: 70 additions & 10 deletions
@@ -1,16 +1,22 @@
-// Command bench runs the mcpproxy token-reduction benchmark over a frozen tool
-// corpus and writes a JSON report plus a static HTML dashboard.
+// Command bench runs the mcpproxy benchmark.
 //
-// Usage:
+// Default (offline) mode scores the committed Spec 065 frozen corpus for
+// token reduction and writes a JSON report plus a static HTML dashboard:
 //
 //	go run ./bench/cmd/bench [-corpus PATH] [-out DIR] [-encoding NAME]
 //
-// With no flags it scores the committed Spec 065 frozen corpus and writes the
-// reports to bench/results/ (gitignored — reports are never committed, per the
-// Spec 065 CN-003 repo rule).
+// Live mode boots against a running proxy (see bench/docker-compose.yml) to add
+// the exact-token comparison (full schemas), retrieval accuracy (Recall@k / MRR
+// / nDCG over the golden set), and search latency:
+//
+//	go run ./bench/cmd/bench -live [-proxy URL] [-api-key KEY] [-golden PATH]
+//
+// Reports land in bench/results/ (gitignored — reports are never committed, per
+// the Spec 065 CN-003 repo rule).
 package main
 
 import (
+	"context"
 	"flag"
 	"fmt"
 	"log"
@@ -21,21 +27,33 @@ import (
 
 func main() {
 	corpusPath := flag.String("corpus", "specs/065-evaluation-foundation/datasets/corpus_v1.tools.json", "path to the frozen tool corpus snapshot")
-	outDir := flag.String("out", "bench/results", "output directory for report.json and dashboard.html")
+	outDir := flag.String("out", "bench/results", "output directory for reports")
 	encoding := flag.String("encoding", bench.DefaultEncoding, "tiktoken encoding name")
+	live := flag.Bool("live", false, "run the live benchmark against a running proxy (full schemas + accuracy + latency)")
+	proxy := flag.String("proxy", "http://127.0.0.1:8092", "live proxy base URL")
+	apiKey := flag.String("api-key", "eval-corpus-snapshot", "live proxy API key (X-API-Key)")
+	goldenPath := flag.String("golden", "specs/065-evaluation-foundation/datasets/retrieval_golden_v1.json", "path to the retrieval golden set")
 	flag.Parse()
 
-	tk, err := bench.NewTokenizer(*encoding)
+	if *live {
+		runLive(*proxy, *apiKey, *goldenPath, *outDir)
+		return
+	}
+	runOffline(*corpusPath, *encoding, *outDir)
+}
+
+func runOffline(corpusPath, encoding, outDir string) {
+	tk, err := bench.NewTokenizer(encoding)
 	if err != nil {
 		log.Fatalf("bench: %v", err)
 	}
-	corpus, err := bench.LoadCorpus(*corpusPath)
+	corpus, err := bench.LoadCorpus(corpusPath)
 	if err != nil {
 		log.Fatalf("bench: %v", err)
 	}
 
 	report := bench.ComputeReport(tk, corpus)
-	jsonPath, htmlPath, err := report.WriteReports(*outDir)
+	jsonPath, htmlPath, err := report.WriteReports(outDir)
 	if err != nil {
 		log.Fatalf("bench: %v", err)
 	}
@@ -50,3 +68,45 @@ func main() {
 	}
 	fmt.Fprintf(os.Stdout, "wrote %s and %s\n", jsonPath, htmlPath)
 }
+
+func runLive(proxy, apiKey, goldenPath, outDir string) {
+	golden, err := bench.LoadGoldenSet(goldenPath)
+	if err != nil {
+		log.Fatalf("bench: %v", err)
+	}
+	client := bench.NewLiveClient(proxy, apiKey)
+	report, err := bench.RunLive(context.Background(), client, golden)
+	if err != nil {
+		log.Fatalf("bench: %v", err)
+	}
+	jsonPath, err := report.WriteJSON(outDir)
+	if err != nil {
+		log.Fatalf("bench: %v", err)
+	}
+
+	fmt.Fprintf(os.Stdout, "mcpproxy LIVE benchmark (proxy %s, %s)\n", report.Proxy, report.Encoding)
+	tr := report.Tokens
+	fmt.Fprintf(os.Stdout, " tokens: %d upstream tools, baseline %d tokens (with full schemas)\n", tr.UpstreamTools, tr.BaselineTokens)
+	for _, m := range tr.Modes {
+		if m.Mode == bench.ModeBaseline {
+			continue
+		}
+		if tr.AuthoritativeHeadline {
+			fmt.Fprintf(os.Stdout, " %-16s %6d tokens %.1f%% fewer\n", m.Mode, m.Tokens, m.SavingsRatio*100)
+		} else {
+			fmt.Fprintf(os.Stdout, " %-16s %6d tokens (savings withheld — see notes)\n", m.Mode, m.Tokens)
+		}
+	}
+	if !tr.AuthoritativeHeadline {
+		for _, n := range tr.Notes {
+			fmt.Fprintf(os.Stdout, " NOTE: %s\n", n)
+		}
+	}
+	r := report.Retrieval
+	fmt.Fprintf(os.Stdout, " accuracy (%d queries): Recall@1=%.3f Recall@5=%.3f MRR=%.3f nDCG@10=%.3f MAP=%.3f\n",
+		r.QueryCount, r.Metrics.RecallAt[1], r.Metrics.RecallAt[5], r.Metrics.MRR, r.Metrics.NDCGAt10, r.Metrics.MAP)
+	l := report.Latency
+	fmt.Fprintf(os.Stdout, " latency (%d searches): p50=%.1fms p95=%.1fms p99=%.1fms max=%.1fms; load-all-tools=%.1fms\n",
+		l.Samples, l.P50ms, l.P95ms, l.P99ms, l.MaxMs, l.LoadAllToolsMs)
+	fmt.Fprintf(os.Stdout, "wrote %s\n", jsonPath)
+}
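`runLive` prints a savings percentage only when `tr.AuthoritativeHeadline` is set. The decision behind that flag is not part of this file, so the following is a sketch of how such a gate could work, based only on the commit message (the headline is withheld when any proxy tool lacks a schema or when the baseline carries no schemas). The function names are hypothetical, and the formula `1 - mode/baseline` for the savings ratio is an assumption, not something this page confirms.

```go
package main

import "fmt"

// savingsRatio returns the assumed headline ratio 1 - mode/baseline.
// The boolean is false when the baseline is not positive, in which case
// no ratio can be computed.
func savingsRatio(baselineTokens, modeTokens int) (float64, bool) {
	if baselineTokens <= 0 {
		return 0, false
	}
	return 1 - float64(modeTokens)/float64(baselineTokens), true
}

// headlineAuthoritative is a hypothetical gate: the percentage may be quoted
// only when schemas were counted on both sides of the comparison.
func headlineAuthoritative(baselineHasSchemas, everyProxyToolHasSchema bool) bool {
	return baselineHasSchemas && everyProxyToolHasSchema
}

func main() {
	// Made-up token totals, for illustration only.
	baseline, mode := 1000, 250
	ratio, ok := savingsRatio(baseline, mode)
	if ok && headlineAuthoritative(true, true) {
		fmt.Printf("%d tokens, %.1f%% fewer\n", mode, ratio*100)
	} else {
		fmt.Printf("%d tokens (savings withheld)\n", mode)
	}
}
```

Requiring both conditions keeps either failure mode from producing a quotable number: a schema-less proxy side would overstate savings, and a schema-less baseline would understate the baseline cost.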

bench/live.go

Lines changed: 168 additions & 0 deletions
@@ -0,0 +1,168 @@
+package bench
+
+import (
+	"context"
+	"encoding/json"
+	"fmt"
+	"io"
+	"net/http"
+	"net/url"
+	"os"
+	"strconv"
+	"time"
+)
+
+// LiveClient talks to a running mcpproxy instance (e.g. the bench
+// docker-compose substrate on 127.0.0.1:8092) over its REST API. It is used by
+// the live benchmark run to pull the exact tool definitions (with schemas) and
+// to replay the retrieval golden set through the proxy's BM25 search.
+type LiveClient struct {
+	BaseURL string
+	APIKey  string
+	HTTP    *http.Client
+}
+
+// NewLiveClient builds a LiveClient for baseURL (e.g. "http://127.0.0.1:8092")
+// authenticating with apiKey via the X-API-Key header.
+func NewLiveClient(baseURL, apiKey string) *LiveClient {
+	return &LiveClient{
+		BaseURL: baseURL,
+		APIKey:  apiKey,
+		HTTP:    &http.Client{Timeout: 30 * time.Second},
+	}
+}
+
+// successEnvelope is the standard mcpproxy REST response wrapper
+// ({"success":true,"data":{...}}). Data is decoded lazily by each caller.
+type successEnvelope struct {
+	Success bool            `json:"success"`
+	Data    json.RawMessage `json:"data"`
+	Error   string          `json:"error,omitempty"`
+}
+
+// getJSON performs an authenticated GET and unmarshals the envelope's data
+// field into out.
+func (c *LiveClient) getJSON(ctx context.Context, path string, out interface{}) error {
+	req, err := http.NewRequestWithContext(ctx, http.MethodGet, c.BaseURL+path, nil)
+	if err != nil {
+		return fmt.Errorf("build request %q: %w", path, err)
+	}
+	if c.APIKey != "" {
+		req.Header.Set("X-API-Key", c.APIKey)
+	}
+	resp, err := c.HTTP.Do(req)
+	if err != nil {
+		return fmt.Errorf("GET %q: %w", path, err)
+	}
+	defer resp.Body.Close()
+	body, err := io.ReadAll(resp.Body)
+	if err != nil {
+		return fmt.Errorf("read %q: %w", path, err)
+	}
+	if resp.StatusCode != http.StatusOK {
+		return fmt.Errorf("GET %q: status %d: %s", path, resp.StatusCode, string(body))
+	}
+	var env successEnvelope
+	if err := json.Unmarshal(body, &env); err != nil {
+		return fmt.Errorf("decode envelope %q: %w", path, err)
+	}
+	if !env.Success {
+		return fmt.Errorf("GET %q: api error: %s", path, env.Error)
+	}
+	if err := json.Unmarshal(env.Data, out); err != nil {
+		return fmt.Errorf("decode data %q: %w", path, err)
+	}
+	return nil
+}
+
+// apiTool mirrors contracts.Tool for the fields the benchmark needs. The schema
+// is kept raw so its exact serialized form is what gets tokenized.
+type apiTool struct {
+	Name        string          `json:"name"`
+	ServerName  string          `json:"server_name"`
+	Description string          `json:"description"`
+	Schema      json.RawMessage `json:"schema,omitempty"`
+}
+
+// FetchUpstreamTools pulls the consolidated tool list (GET /api/v1/tools) and
+// returns every upstream tool with its full JSON input schema, ready to feed
+// into schema-aware token counting for the baseline.
+func (c *LiveClient) FetchUpstreamTools(ctx context.Context) ([]Tool, error) {
+	var resp struct {
+		Tools []apiTool `json:"tools"`
+	}
+	if err := c.getJSON(ctx, "/api/v1/tools", &resp); err != nil {
+		return nil, err
+	}
+	tools := make([]Tool, 0, len(resp.Tools))
+	for _, t := range resp.Tools {
+		tools = append(tools, Tool{
+			ToolID:      t.ServerName + ":" + t.Name,
+			Server:      t.ServerName,
+			Name:        t.Name,
+			Description: t.Description,
+			Schema:      normalizeSchema(t.Schema),
+		})
+	}
+	return tools, nil
+}
+
+// normalizeSchema treats an empty JSON object ("{}") or JSON null the same as an
+// absent schema so a tool with no real parameters does not inflate token counts.
+func normalizeSchema(raw json.RawMessage) json.RawMessage {
+	switch string(raw) {
+	case "", "null", "{}":
+		return nil
+	default:
+		return raw
+	}
+}
+
+// Search replays one query through the proxy's BM25 tool search
+// (GET /api/v1/index/search) and returns the ranked tool IDs (server:tool,
+// best first) plus the client-measured round-trip latency.
+//
+// Latency is measured client-side on purpose: the server's SearchToolsResponse
+// "took" field is currently a hardcoded "0ms" stub (internal/httpapi
+// handleSearchTools), so it cannot be trusted as the proxy-side timing.
+func (c *LiveClient) Search(ctx context.Context, query string, limit int) (ranked []string, latency time.Duration, err error) {
+	q := url.Values{}
+	q.Set("q", query)
+	q.Set("limit", strconv.Itoa(limit))
+	path := "/api/v1/index/search?" + q.Encode()
+
+	var resp struct {
+		Results []struct {
+			Tool  apiTool `json:"tool"`
+			Score float64 `json:"score"`
+		} `json:"results"`
+	}
+	start := time.Now()
+	err = c.getJSON(ctx, path, &resp)
+	latency = time.Since(start)
+	if err != nil {
+		return nil, latency, err
+	}
+	ranked = make([]string, 0, len(resp.Results))
+	for _, r := range resp.Results {
+		ranked = append(ranked, r.Tool.ServerName+":"+r.Tool.Name)
+	}
+	return ranked, latency, nil
+}
+
+// LoadGoldenSet reads the Spec 065 retrieval golden set
+// (retrieval_golden_v1.json) from disk.
+func LoadGoldenSet(path string) (*GoldenSet, error) {
+	data, err := os.ReadFile(path)
+	if err != nil {
+		return nil, fmt.Errorf("read golden set %q: %w", path, err)
+	}
+	var g GoldenSet
+	if err := json.Unmarshal(data, &g); err != nil {
+		return nil, fmt.Errorf("parse golden set %q: %w", path, err)
+	}
+	if len(g.Queries) == 0 {
+		return nil, fmt.Errorf("golden set %q contains no queries", path)
+	}
+	return &g, nil
+}
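The commit message says the live client is unit-tested with httptest stubs, but those tests are not shown on this page. The sketch below shows one way the envelope handling in `getJSON` could be exercised against a stub server. It is a standalone approximation rather than the project's test: `envelope`, `firstToolID`, and `runStub` are hypothetical names, the API key and tool payload are made up, and only the `/api/v1/tools` path, the `X-API-Key` header, and the `{"success":...,"data":...}` wrapper come from the diff above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// envelope mirrors the {"success":true,"data":{...}} wrapper from the diff.
type envelope struct {
	Success bool            `json:"success"`
	Data    json.RawMessage `json:"data"`
	Error   string          `json:"error,omitempty"`
}

// firstToolID is a hypothetical caller: it GETs /api/v1/tools, unwraps the
// envelope, and returns the first tool as "server:tool".
func firstToolID(baseURL, apiKey string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+"/api/v1/tools", nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("X-API-Key", apiKey)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("status %d: %s", resp.StatusCode, body)
	}
	var env envelope
	if err := json.Unmarshal(body, &env); err != nil {
		return "", err
	}
	if !env.Success {
		return "", fmt.Errorf("api error: %s", env.Error)
	}
	var data struct {
		Tools []struct {
			Name       string `json:"name"`
			ServerName string `json:"server_name"`
		} `json:"tools"`
	}
	if err := json.Unmarshal(env.Data, &data); err != nil {
		return "", err
	}
	if len(data.Tools) == 0 {
		return "", fmt.Errorf("no tools in response")
	}
	return data.Tools[0].ServerName + ":" + data.Tools[0].Name, nil
}

// runStub serves a canned envelope from an in-process httptest server and
// runs firstToolID against it.
func runStub(success bool) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("X-API-Key") != "test-key" {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		if success {
			fmt.Fprint(w, `{"success":true,"data":{"tools":[{"name":"read_file","server_name":"fs"}]}}`)
			return
		}
		fmt.Fprint(w, `{"success":false,"error":"index not ready"}`)
	}))
	defer srv.Close()
	return firstToolID(srv.URL, "test-key")
}

func main() {
	id, err := runStub(true)
	fmt.Println(id, err)
	_, err = runStub(false)
	fmt.Println(err)
}
```

A stub like this covers both envelope branches (the `success: true` decode path and the `success: false` error path) without needing the docker-compose substrate.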
