Phase 1 Data Model: Evaluation Foundation (D1+D2)

No new persistent storage in mcpproxy-go. The "data model" is the versioned JSON dataset + report artifacts. Schemas formalized in contracts/.

1. Tool corpus snapshot (`datasets/corpus_v1.json`)

Frozen list the evaluations score against (CN-002).

Fields: version, generated_from (source + note), tools[] where each = { tool_id: "<server>:<tool>", server, tool, description, schema }.
Invariant: tool_id unique; immutable once committed — refresh = new corpus_v2.json (FR-012).

The D1 labeled set.

Fields: corpus_version (FK → corpus snapshot), queries[] where each = { id, query (paraphrased intent, never names the tool — R-C), labels: [{ tool_id, relevance: 0|1|2 }], notes }.
Validation: every labels[].tool_id exists in the referenced corpus; ≥1 label with relevance≥1; hard-negative queries flagged (notes: "hard-negative: near-dup of <tool_id>").
Used by: RetrievalScorer (D1), and unchanged by GEPA (D5) as fitness (FR-011/SC-006).

The D2 labeled set.

Fields: entries[] where each = { id, description (the tool description text under test), label: "malicious"|"benign", category: "tool_poisoning"|"prompt_injection"|"shadowing"|"rug_pull"|"benign"|"hard_negative", provenance: { source, license } }.
Validation: every entry has label + category + provenance.license (FR-007); no entry whose provenance.license is redistribution-restricted (CN-005); ≥1 hard-negative benign per attack category.

Fields: corpus_version, golden_version, metrics: { recall_at: {1,3,5,10}, mrr, ndcg_at_10, map }, per_query[], baseline_delta: { recall_at_5, mrr, … }, gate: { passed, tolerance }.

Fields: per_detector[] = { detector, precision, recall, f1, fpr, tp, fp, tn, fn }, runs_averaged: N, gate: { detector, fpr_ceiling, recall_floor, passed }.

datasets/baseline_v1.json: reference metric values + per-gate thresholds (Recall@5 tolerance; per-detector FPR ceiling + recall floor). The CI gate (FR-009) diffs a fresh report against this.

INV-1: golden-set tool_ids ⊆ corpus tool_ids (no dangling labels).
INV-2: removing a labeled tool from the corpus drives that query's Recall to 0 (US1 #3 / SC-003 — proves the scorer isn't trivially passing).
INV-3: a known tool-poisoning entry is a true positive; an attack-resembling benign entry that gets flagged increments FP and is visible (SC-004).
INV-4: every security entry carries category + license; build fails if any is missing or restricted (FR-007/CN-005).
INV-5: re-running on an unchanged system reproduces baseline within tolerance (SC-007).