Operator runbook. Harness = ~/repos/mcp-eval; datasets versioned in specs/065-evaluation-foundation/datasets/.
Prerequisites: the mcp-eval repo (~/repos/mcp-eval, uv), a built mcpproxy binary, and the eval API key the harness already manages.
- Confirm the search endpoint is reachable:
curl -H "X-API-Key: <key>" "http://127.0.0.1:8080/api/v1/index/search?q=docker&limit=5"
→ JSON [{server,tool,score,…}].
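The reachability check above can be followed by a quick shape check on the response body. The sketch below assumes the endpoint returns a JSON array of objects carrying at least `server`, `tool`, and `score`, as the example output suggests; `check_search_payload` is a hypothetical helper, not part of mcp-eval.

```python
import json

# Keys taken from the response shape shown in the runbook; anything else is ignored.
REQUIRED_KEYS = {"server", "tool", "score"}


def check_search_payload(raw: str) -> list[str]:
    """Return a list of problems found in a search response body (empty list = OK)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(payload, list):
        return ["expected a JSON array of results"]
    problems = []
    for i, item in enumerate(payload):
        if not isinstance(item, dict):
            problems.append(f"result {i}: expected an object")
            continue
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"result {i}: missing {sorted(missing)}")
    return problems
```

Piping the curl output into a script that calls this helper gives a pass/fail signal before any dataset work starts.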
# documented, repeatable; produces datasets/corpus_v1.json from current tool descriptions
mcp-eval datasets snapshot --out specs/065-evaluation-foundation/datasets/corpus_v1.json
Commit corpus_v1.json. A later refresh becomes corpus_v2.json (never mutate v1).
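The "never mutate v1" rule amounts to always writing a new, higher-numbered file. A minimal sketch of that naming rule, assuming snapshots follow the `corpus_v<N>.json` pattern used above; `next_corpus_name` is a hypothetical helper, not an mcp-eval command.

```python
import re


def next_corpus_name(existing: list[str]) -> str:
    """Return the next unused corpus_v<N>.json name given the files already present."""
    versions = [
        int(m.group(1))
        for name in existing
        if (m := re.fullmatch(r"corpus_v(\d+)\.json", name))
    ]
    # Start at v1 when no snapshot exists yet; otherwise bump the highest version.
    return f"corpus_v{max(versions, default=0) + 1}.json"
```

Passing the current directory listing to this helper yields the `--out` target for a refresh, so an existing snapshot is never overwritten.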
- Generate synthetic queries (3–5 per tool, paraphrased; never name the tool):
mcp-eval datasets gen-queries --corpus corpus_v1.json --out retrieval_golden_v1.json
- Add human-verified hard negatives (near-duplicate tools across servers).
- Validate against contracts/retrieval-dataset.schema.json (every tool_id ∈ corpus). Commit.
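The membership constraint (every tool_id ∈ corpus) is a cross-file check that a JSON Schema alone may not express. The sketch below shows one way to test it, assuming both files are JSON arrays of objects and that each object has a `tool_id` field; the actual layout is defined by the schema and may differ, and `find_orphan_queries` is a hypothetical name.

```python
def find_orphan_queries(corpus: list[dict], golden: list[dict]) -> list[dict]:
    """Return golden entries whose tool_id does not appear in the frozen corpus."""
    known_ids = {tool["tool_id"] for tool in corpus}
    return [query for query in golden if query.get("tool_id") not in known_ids]


# Illustrative data only; the IDs and queries are made up.
corpus = [{"tool_id": "docker:run"}, {"tool_id": "github:create_issue"}]
golden = [
    {"query": "start a container from an image", "tool_id": "docker:run"},
    {"query": "file a bug report", "tool_id": "jira:create_ticket"},  # not in corpus
]
orphans = find_orphan_queries(corpus, golden)
print([q["tool_id"] for q in orphans])  # prints ['jira:create_ticket']
```

An empty result means the golden set is consistent with the corpus version it was generated from.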
- Vendor license-clear malicious samples (DVMCP MIT + self-authored per category) + benign descriptions from the corpus + attack-resembling hard negatives.
- Each entry carries label, category, provenance.license (CI rejects entries with a missing or restricted license; see FR-007/CN-005).
- Validate against contracts/security-corpus.schema.json. Commit.
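The license rule can be sketched as a small pre-commit check. This assumes each entry nests the license under `provenance.license` as described above; the allow-list values are hypothetical placeholders, and the real accepted set is whatever FR-007/CN-005 define.

```python
# Hypothetical allow-list; replace with the licenses the spec actually permits.
ALLOWED_LICENSES = {"MIT", "self-authored"}


def license_violations(entries: list[dict]) -> list[tuple[int, str]]:
    """Return (index, reason) pairs for entries with a missing or disallowed license."""
    violations = []
    for i, entry in enumerate(entries):
        license_name = (entry.get("provenance") or {}).get("license")
        if not license_name:
            violations.append((i, "missing license"))
        elif license_name not in ALLOWED_LICENSES:
            violations.append((i, f"restricted license: {license_name}"))
    return violations
```

A non-empty result is the condition under which CI would reject the corpus.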
mcp-eval retrieval --corpus corpus_v1.json --golden retrieval_golden_v1.json \
--baseline datasets/baseline_v1.json --runs 1
# → Recall@1/3/5/10, MRR, nDCG@10, baseline deltas; HTML/JSON report (not committed)
Self-test (SC-003): re-run against a deliberately degraded index → scores drop.
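For readers checking the reported numbers by hand, the retrieval metrics can be sketched as below. This uses the standard binary-relevance definitions of Recall@k, MRR, and nDCG@k; it is not taken from the mcp-eval source, and the harness may differ in details such as tie handling. Ranked lists are assumed to contain no duplicate tool IDs.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant tools that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant tool, or 0.0 if none is retrieved."""
    for rank, tool_id in enumerate(ranked, start=1):
        if tool_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Binary-gain nDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, tool_id in enumerate(ranked[:k], start=1)
        if tool_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

A degraded index pushes the expected tool further down each ranked list, which lowers all three metrics; that is the behavior the SC-003 self-test looks for.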
# bridge: cmd/scan-eval (in mcpproxy-go) emits per-entry detector verdicts as JSON
go run ./cmd/scan-eval --corpus specs/065-evaluation-foundation/datasets/security_corpus_v1.json --out /tmp/verdicts.json
mcp-eval security --verdicts /tmp/verdicts.json --corpus security_corpus_v1.json \
--baseline datasets/baseline_v1.json --runs 3
# → per-detector precision/recall/F1/FPR + pass/fail vs FPR ceiling & recall floor
Self-test (SC-004): add a noisy detector → its FPR rises visibly.
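The per-detector metrics follow from a confusion matrix over the labelled corpus. A minimal sketch using the standard definitions, assuming each detector verdict reduces to a boolean "flagged" and each corpus label to a boolean "malicious"; the real verdict JSON emitted by cmd/scan-eval may carry more detail, and `detector_metrics` is a hypothetical name.

```python
def detector_metrics(flagged: list[bool], malicious: list[bool]) -> dict[str, float]:
    """Compute precision, recall, F1 and false-positive rate for one detector."""
    if len(flagged) != len(malicious):
        raise ValueError("verdicts and labels must have the same length")

    pairs = list(zip(flagged, malicious))
    tp = sum(f and m for f, m in pairs)          # flagged, truly malicious
    fp = sum(f and not m for f, m in pairs)      # flagged, actually benign
    fn = sum(not f and m for f, m in pairs)      # missed malicious entry
    tn = sum(not f and not m for f, m in pairs)  # correctly left alone

    def ratio(num: float, den: float) -> float:
        # Report 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "fpr": ratio(fp, fp + tn),
    }
```

A noisy detector flags more benign entries, so `fp` grows and `fpr` rises while recall may stay flat; that is the signal the SC-004 self-test checks.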
.github/workflows/eval.yml: freeze corpus → run D1 + D2 → fail if Recall@5 < baseline−tolerance or any detector FPR > ceiling. Reports uploaded as run artifacts (not committed).
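The gate condition in the workflow can be sketched as a single function whose non-empty result fails the job. The parameter names and the idea of passing metrics as plain numbers are assumptions for illustration; the actual workflow reads these values from the harness reports and baseline_v1.json.

```python
def gate(
    recall_at_5: float,
    baseline_recall_at_5: float,
    tolerance: float,
    detector_fpr: dict[str, float],
    fpr_ceiling: float,
) -> list[str]:
    """Return human-readable failure reasons; an empty list means the gate passes."""
    failures = []
    # Retrieval regression: Recall@5 fell below baseline minus tolerance.
    if recall_at_5 < baseline_recall_at_5 - tolerance:
        failures.append(
            f"Recall@5 {recall_at_5:.3f} < baseline {baseline_recall_at_5:.3f} - {tolerance:.3f}"
        )
    # Security regression: any single detector above the FPR ceiling fails the run.
    for name, fpr in sorted(detector_fpr.items()):
        if fpr > fpr_ceiling:
            failures.append(f"detector {name}: FPR {fpr:.3f} > ceiling {fpr_ceiling:.3f}")
    return failures
```

In a CI step, the script would print each failure and exit non-zero when the list is non-empty, for example with `sys.exit(1 if failures else 0)`.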
- SC-001/002: one command each → the metric sets.
- SC-003/004: degraded index ↓ retrieval; noisy detector ↑ FPR.
- SC-005: CI fails on either regression.
- SC-006: retrieval_golden_v1.json is consumed unchanged by the future GEPA (D5) loop.
- SC-007: unchanged system reproduces baseline within tolerance.
- SC-008: every security entry has category + license; no restricted corpus vendored.
D3 dynamic ASR (AgentDojo/DVMCP through live proxy) · D4 sensitive-data eval · D5 GEPA (fitness = this D1 set) · D6 full CI dashboard.