Commit a54fc76

dzmitrys-dev and claude committed
feat(80.6-12): bench runner + bundled goldens (D-46) + auto_goldens
Plan 80.6-12 — supamem eval --regress, the SC-9 regression gate.

src/supamem/eval/runner.py (NEW):
- run_bench(*, regress, goldens_path, config) -> int
- _load_goldens loads bundled JSONL via
  importlib.resources.files('supamem.eval.goldens') / 'phase_80_1_tuned_hybrid.jsonl'.
- Per-query: backend.query(q, k=5) → recall@5 substring match against
  required_substrings (parity with softchat scripts/eval/goldens.py
  recall_at_5_substring).
- Aggregates: mean_recall_at_5, p95_latency_ms, total_tokens.
- BASELINE thresholds locked from Phase 80.1 D-19:
  mean_recall_at_5 >= 0.60, total_tokens <= 4000, p95_latency_ms <= 500
- regress=True compares aggregates to baseline; exit 1 + REGRESSION
  reason printed if any threshold breached.

src/supamem/eval/auto_goldens.py (NEW):
- D-07 invariant enforcement: no SaaS LLM calls.
- assert_no_saas_llm_env() raises if OPENAI_API_KEY / ANTHROPIC_API_KEY /
  AZURE_OPENAI_API_KEY / TOGETHER_API_KEY / OPENROUTER_API_KEY is set.
- derive_required_substrings(text) — deterministic local extractor for
  identifiers / dotted names; pure function, no I/O.

src/supamem/eval/goldens/phase_80_1_tuned_hybrid.jsonl (NEW):
- 33 representative records (id + query + required_substrings).
- v0.1 deviation: real Phase 80.1 goldens were per-session sidecars,
  not a single 33-query JSONL. Plan 80.6-14 will migrate the live
  SoftChat goldens into this format. Current bundled set exercises the
  runner contract end-to-end.

src/supamem/eval/__init__.py (NEW): public API surface.

src/supamem/cli.py: cmd_evalbench wired to run_bench with --regress /
--goldens flags.

pyproject.toml: force-include the goldens dir into the wheel so
importlib.resources finds it after pip install.

Tests (9 added, 148/148 total pass):
- bundled goldens load (33 records, schema check)
- external --goldens path overrides bundled
- regress passes when recall >= baseline (perfect-recall mock)
- regress fails when recall < baseline (empty-chunks mock) → exit 1
- report emits 'mean recall@5' + 'total tokens' lines
- auto_goldens raises if SaaS LLM env var is set (D-07)
- auto_goldens passes when no SaaS env vars present
- derive_required_substrings is deterministic + finds identifiers
- BUNDLED_GOLDENS constant points at .jsonl

ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
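
The regression gate described in the commit message can be sketched in isolation. This is a minimal illustration, not the shipped runner: `find_breaches` and `exit_code` are hypothetical helper names, and only the three threshold values are taken from the message.

```python
from __future__ import annotations

# Thresholds as quoted in the commit message (Phase 80.1 D-19).
BASELINE = {"mean_recall_at_5": 0.60, "total_tokens": 4000, "p95_latency_ms": 500}


def find_breaches(summary: dict[str, float]) -> list[str]:
    """Hypothetical helper: name every metric that violates its threshold."""
    breaches: list[str] = []
    # Recall is a floor: a lower value is a regression.
    if summary["mean_recall_at_5"] < BASELINE["mean_recall_at_5"]:
        breaches.append("mean_recall_at_5")
    # Token total and p95 latency are ceilings: a higher value is a regression.
    for key in ("total_tokens", "p95_latency_ms"):
        if summary[key] > BASELINE[key]:
            breaches.append(key)
    return breaches


def exit_code(summary: dict[str, float], *, regress: bool) -> int:
    """Hypothetical helper: 0 unless the gate is on and something breached."""
    if not regress:
        return 0
    return 1 if find_breaches(summary) else 0


at_baseline = {"mean_recall_at_5": 0.60, "total_tokens": 4000, "p95_latency_ms": 500}
regressed = {"mean_recall_at_5": 0.59, "total_tokens": 4001, "p95_latency_ms": 120}
print(find_breaches(at_baseline), find_breaches(regressed))
```

Values exactly at the baseline pass, because the comparisons are strict. Without `regress`, the run reports numbers but never fails.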
1 parent fec50e5 commit a54fc76

8 files changed

Lines changed: 426 additions & 1 deletion

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -50,6 +50,7 @@ packages = ["src/supamem"]
 
 [tool.hatch.build.targets.wheel.force-include]
 "src/supamem/share" = "supamem/share"
+"src/supamem/eval/goldens" = "supamem/eval/goldens"
 
 [tool.ruff]
 line-length = 100
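
The force-include entry matters because the runner reads the goldens file through `importlib.resources`, which only sees files that are actually inside the installed package. The sketch below shows that lookup against a throwaway package built in a temporary directory. The package name `demo_goldens`, the file `sample.jsonl`, and the helper `read_bundled` are all made up for the illustration.

```python
import importlib
import sys
import tempfile
from importlib import resources
from pathlib import Path


def read_bundled(package: str, name: str) -> str:
    """Read a data file that ships inside an importable package."""
    return (resources.files(package) / name).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny package that carries one JSONL data file.
    pkg = Path(tmp) / "demo_goldens"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("", encoding="utf-8")
    (pkg / "sample.jsonl").write_text('{"id": "g01"}\n', encoding="utf-8")

    # Make the package importable for the duration of the lookup.
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        body = read_bundled("demo_goldens", "sample.jsonl")
    finally:
        sys.path.remove(tmp)

print(body.strip())
```

If the data file were left out of the wheel, the same call would raise `FileNotFoundError` after `pip install`, even though it works from a source checkout.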

src/supamem/cli.py

Lines changed: 5 additions & 1 deletion
@@ -153,7 +153,11 @@ def cmd_evalbench(
     goldens: Optional[str] = typer.Option(None, "--goldens", help="Custom goldens JSONL path."),
 ) -> None:
     """Run the regression harness against the Phase 80.1 golden corpus."""
-    _stub("eval")
+    from supamem.config import load_config
+    from supamem.eval.runner import run_bench
+
+    cfg, _chain = load_config()
+    raise typer.Exit(run_bench(regress=regress, goldens_path=goldens, config=cfg))
 
 
 @app.command("install")

src/supamem/eval/__init__.py

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
"""Bench harness for supamem retrieval — recall@5 + p95 latency + token totals.
2+
3+
Public API:
4+
5+
- :func:`run_bench` — executes the bench against bundled or external goldens
6+
- :func:`derive_required_substrings` — D-07-safe local extractor
7+
- :func:`assert_no_saas_llm_env` — D-07 invariant guard
8+
"""
9+
from __future__ import annotations
10+
11+
from supamem.eval.auto_goldens import (
12+
assert_no_saas_llm_env,
13+
derive_required_substrings,
14+
)
15+
from supamem.eval.runner import BASELINE, run_bench
16+
17+
__all__ = [
18+
"BASELINE",
19+
"assert_no_saas_llm_env",
20+
"derive_required_substrings",
21+
"run_bench",
22+
]

src/supamem/eval/auto_goldens.py

Lines changed: 70 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,70 @@
1+
"""Auto-goldens generator — D-07 invariant: no SaaS LLM calls.
2+
3+
Auto-generates ``required_substrings`` for golden records by extracting
4+
identifiers / tokens from the answer text using a deterministic local
5+
algorithm. NEVER calls OpenAI / Anthropic / any external API — D-07 is the
6+
load-bearing invariant: goldens must be reproducible offline.
7+
"""
8+
from __future__ import annotations
9+
10+
import os
11+
import re
12+
13+
# Env vars whose presence indicates a SaaS LLM SDK is configured. If any is
14+
# set, the user is in a "could call cloud" state — we refuse to run auto-
15+
# goldens to make the D-07 invariant obvious at the boundary.
16+
_SAAS_ENV_VARS: tuple[str, ...] = (
17+
"OPENAI_API_KEY",
18+
"ANTHROPIC_API_KEY",
19+
"AZURE_OPENAI_API_KEY",
20+
"TOGETHER_API_KEY",
21+
"OPENROUTER_API_KEY",
22+
)
23+
24+
25+
def _identifier_tokens(text: str, *, min_len: int = 4) -> list[str]:
26+
"""Extract camelCase / snake_case / dotted-name tokens that look code-shaped.
27+
28+
Heuristic: any run of [A-Za-z_][A-Za-z0-9_]+ at least ``min_len`` chars,
29+
plus dotted names like ``module.func``. Returns deduped list, order
30+
preserved.
31+
"""
32+
pat = re.compile(r"[A-Za-z_][A-Za-z0-9_\.]+[A-Za-z0-9_]")
33+
seen: set[str] = set()
34+
out: list[str] = []
35+
for tok in pat.findall(text):
36+
if len(tok) < min_len:
37+
continue
38+
if tok in seen:
39+
continue
40+
seen.add(tok)
41+
out.append(tok)
42+
return out
43+
44+
45+
def derive_required_substrings(
46+
answer_text: str,
47+
*,
48+
max_subs: int = 5,
49+
min_len: int = 4,
50+
) -> list[str]:
51+
"""Deterministic substring extraction. Pure function, no I/O."""
52+
tokens = _identifier_tokens(answer_text, min_len=min_len)
53+
return tokens[:max_subs]
54+
55+
56+
def assert_no_saas_llm_env() -> None:
57+
"""Raise RuntimeError if any SaaS LLM env var is set (D-07 enforcement)."""
58+
found = [name for name in _SAAS_ENV_VARS if os.environ.get(name, "").strip()]
59+
if found:
60+
raise RuntimeError(
61+
"supamem auto_goldens: D-07 invariant breach — refused to run "
62+
"with SaaS LLM env vars set ({}). Auto-goldens MUST stay offline."
63+
.format(", ".join(found))
64+
)
65+
66+
67+
__all__ = [
68+
"assert_no_saas_llm_env",
69+
"derive_required_substrings",
70+
]
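
To see what the extractor and the environment guard do on concrete input, here is a standalone sketch that reuses the regular expression from the diff. The helper names `identifier_tokens` and `saas_vars_set` are stand-ins invented for this example. The guard takes a plain mapping rather than reading `os.environ`, an assumption made so the behavior can be shown without touching the real environment.

```python
import re
from typing import Mapping

# Same pattern as in the diff: a letter or underscore, then word chars or
# dots, ending on a word char. A match is therefore at least 3 chars long.
_PAT = re.compile(r"[A-Za-z_][A-Za-z0-9_\.]+[A-Za-z0-9_]")

SAAS_ENV_VARS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")  # shortened list


def identifier_tokens(text: str, *, min_len: int = 4) -> list:
    """Deduplicated tokens in first-seen order, dropping short matches."""
    seen = set()
    out = []
    for tok in _PAT.findall(text):
        if len(tok) >= min_len and tok not in seen:
            seen.add(tok)
            out.append(tok)
    return out


def saas_vars_set(environ: Mapping[str, str]) -> list:
    """Names from SAAS_ENV_VARS that hold a non-blank value in ``environ``."""
    return [name for name in SAAS_ENV_VARS if environ.get(name, "").strip()]


sample = "run_bench calls backend.query; run_bench returns an int."
tokens = identifier_tokens(sample)
print(tokens)  # ['run_bench', 'calls', 'backend.query', 'returns']
print(saas_vars_set({"OPENAI_API_KEY": "  ", "ANTHROPIC_API_KEY": "sk-test"}))
```

The output shows that the heuristic does not separate identifiers from ordinary words: `calls` and `returns` pass the length filter just as `run_bench` does. A blank or whitespace-only variable does not trip the guard.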

src/supamem/eval/goldens/__init__.py

Whitespace-only changes.

src/supamem/eval/goldens/phase_80_1_tuned_hybrid.jsonl

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+{"id": "g01", "query": "how does the indexer chunk markdown documents", "required_substrings": ["chunk", "markdown"]}
+{"id": "g02", "query": "what is the locked schema for tuned_hybrid retrieval", "required_substrings": ["dense", "sparse"]}
+{"id": "g03", "query": "fail-soft contract for the MCP server tool", "required_substrings": ["fail-soft", "exit"]}
+{"id": "g04", "query": "where does supamem store its config file", "required_substrings": [".supamem/config.toml"]}
+{"id": "g05", "query": "which embedder powers the dense vector arm", "required_substrings": ["MiniLM"]}
+{"id": "g06", "query": "what BM25 model does sparse retrieval use", "required_substrings": ["Qdrant/bm25"]}
+{"id": "g07", "query": "how is the Welford counter schema versioned", "required_substrings": ["sum", "sumsq", "count"]}
+{"id": "g08", "query": "atomic write strategy for client config patches", "required_substrings": ["atomic", "replace"]}
+{"id": "g09", "query": "managed-block fence markers in CLAUDE.md", "required_substrings": ["BEGIN SUPAMEM", "END SUPAMEM"]}
+{"id": "g10", "query": "what is the RRF fusion algorithm parameter", "required_substrings": ["RRF", "fusion"]}
+{"id": "g11", "query": "default chunk soft max token cap", "required_substrings": ["250"]}
+{"id": "g12", "query": "supamem doctor exit codes", "required_substrings": ["exit"]}
+{"id": "g13", "query": "how does init detect Qdrant reachability", "required_substrings": ["healthz", "probe"]}
+{"id": "g14", "query": "brownfield migration paths supported", "required_substrings": ["coexist", "migrate", "adopt-as-is"]}
+{"id": "g15", "query": "stdio transport stdout discipline rule", "required_substrings": ["stdout", "stderr"]}
+{"id": "g16", "query": "Streamable HTTP transport kwarg", "required_substrings": ["streamable-http"]}
+{"id": "g17", "query": "snapshot before destructive migrate path", "required_substrings": ["snapshot"]}
+{"id": "g18", "query": "PreToolUse hook payload shape", "required_substrings": ["hookSpecificOutput", "additionalContext"]}
+{"id": "g19", "query": "cursor session-start snapshot regen target", "required_substrings": ["dual-memory-snapshot.mdc"]}
+{"id": "g20", "query": "share dir reference-not-copy contract", "required_substrings": ["share"]}
+{"id": "g21", "query": "qdrant api key redaction in doctor output", "required_substrings": ["redact"]}
+{"id": "g22", "query": "is_code_target gate file suffix rejection list", "required_substrings": [".md", ".lock"]}
+{"id": "g23", "query": "marker file for daily-rolled hook coordination", "required_substrings": ["queried"]}
+{"id": "g24", "query": "where do auto-goldens prevent SaaS LLM calls", "required_substrings": ["RuntimeError"]}
+{"id": "g25", "query": "what does derive_query strip drop_tokens for", "required_substrings": ["drop_tokens"]}
+{"id": "g26", "query": "search tool max query length DoS guard", "required_substrings": ["MAX_QUERY_LEN"]}
+{"id": "g27", "query": "summary_md field rendered in tool card", "required_substrings": ["summary_md"]}
+{"id": "g28", "query": "config precedence ladder rungs", "required_substrings": ["env", "supamem_toml"]}
+{"id": "g29", "query": "T-1 markdown header chunker headers list", "required_substrings": ["h1", "h2", "h3"]}
+{"id": "g30", "query": "T-5 cosine dedup threshold value", "required_substrings": ["0.97"]}
+{"id": "g31", "query": "T-8 token budget truncation cap", "required_substrings": ["1500"]}
+{"id": "g32", "query": "forbidden collection guard names list", "required_substrings": ["dev_memory"]}
+{"id": "g33", "query": "manifest legacy v1 to v2 schema upgrade", "required_substrings": ["legacy", "v2"]}

src/supamem/eval/runner.py

Lines changed: 159 additions & 0 deletions
@@ -0,0 +1,159 @@
+"""Bench runner for ``supamem eval --regress``.
+
+Loads a JSONL golden set (bundled or external), runs each query against
+:class:`supamem.retrieval.tuned_hybrid.TunedHybridBackend`, computes
+recall@5 via substring matching against each record's
+``required_substrings`` list, and aggregates to mean recall + p95 latency
++ total tokens. ``--regress`` mode compares the aggregate to Phase 80.1
+locked thresholds and exits non-zero on any breach (SC-9 regression gate).
+"""
+from __future__ import annotations
+
+import json
+import logging
+import time
+from importlib import resources
+from pathlib import Path
+from typing import Any
+
+from supamem.config import ResolvedConfig
+from supamem.retrieval.tuned_hybrid import TunedHybridBackend
+from supamem.retrieval.types import RetrievedChunk
+
+log = logging.getLogger("supamem.eval.runner")
+
+# Phase 80.1 locked thresholds (D-19).
+BASELINE = {
+    "mean_recall_at_5": 0.60,
+    "total_tokens": 4000,
+    "p95_latency_ms": 500,
+}
+
+BUNDLED_GOLDENS = "phase_80_1_tuned_hybrid.jsonl"
+
+
+def _load_goldens(path: str | None) -> list[dict[str, Any]]:
+    """Load JSONL records from ``path`` or the bundled corpus."""
+    if path:
+        body = Path(path).read_text(encoding="utf-8")
+    else:
+        # The goldens dir is a sub-package; resources.files works because
+        # ``supamem.eval.goldens`` has its own __init__.py.
+        files = resources.files("supamem.eval.goldens")
+        target = files / BUNDLED_GOLDENS
+        body = target.read_text(encoding="utf-8")
+    out: list[dict[str, Any]] = []
+    for line in body.splitlines():
+        if not line.strip():
+            continue
+        out.append(json.loads(line))
+    return out
+
+
+def _recall_at_5(retrieved: list[RetrievedChunk], required: list[str]) -> float:
+    """Substring match: fraction of required substrings present in top-5 blob."""
+    if not required:
+        return 0.0
+    blob = " ".join(c.text or "" for c in retrieved[:5])
+    hits = sum(1 for s in required if s in blob)
+    return hits / len(required)
+
+
+def _percentile(values: list[float], pct: float) -> float:
+    if not values:
+        return 0.0
+    s = sorted(values)
+    k = max(0, min(len(s) - 1, int(round(pct / 100.0 * (len(s) - 1)))))
+    return float(s[k])
+
+
+def _build_backend(config: ResolvedConfig) -> TunedHybridBackend:
+    return TunedHybridBackend(config=config)
+
+
+def run_bench(
+    *,
+    regress: bool = False,
+    goldens_path: str | None = None,
+    config: ResolvedConfig | None = None,
+) -> int:
+    """Run the bench. Returns 0 on pass, 1 on regression / fatal."""
+    cfg = config or ResolvedConfig()
+    try:
+        records = _load_goldens(goldens_path)
+    except (FileNotFoundError, OSError) as exc:
+        log.error("supamem eval: failed to load goldens: %s", exc)
+        return 1
+    if not records:
+        log.warning("supamem eval: no golden records loaded")
+        return 1
+
+    backend = _build_backend(cfg)
+    recalls: list[float] = []
+    latencies: list[float] = []
+    total_tokens = 0
+    rows: list[dict[str, Any]] = []
+
+    for rec in records:
+        query = str(rec.get("query") or "").strip()
+        required = list(rec.get("required_substrings") or [])
+        if not query:
+            continue
+        t0 = time.perf_counter()
+        try:
+            chunks = backend.query(query, k=5)
+        except Exception as exc:  # noqa: BLE001
+            log.warning("supamem eval: query %r failed: %s", query, type(exc).__name__)
+            chunks = []
+        elapsed = (time.perf_counter() - t0) * 1000.0
+        latencies.append(elapsed)
+        recall = _recall_at_5(chunks, required)
+        recalls.append(recall)
+        total_tokens += sum(max(1, len(c.text or "") // 4) for c in chunks)
+        rows.append({"id": rec.get("id"), "recall": recall, "latency_ms": elapsed})
+
+    mean_recall = sum(recalls) / len(recalls) if recalls else 0.0
+    p95 = _percentile(latencies, 95.0)
+    summary = {
+        "queries": len(records),
+        "mean_recall_at_5": round(mean_recall, 4),
+        "p95_latency_ms": round(p95, 2),
+        "total_tokens": total_tokens,
+    }
+
+    print("supamem eval — bench summary")
+    print(f"  queries           : {summary['queries']}")
+    print(f"  mean recall@5     : {summary['mean_recall_at_5']}")
+    print(f"  p95 latency (ms)  : {summary['p95_latency_ms']}")
+    print(f"  total tokens      : {summary['total_tokens']}")
+
+    if not regress:
+        return 0
+
+    breaches: list[str] = []
+    if mean_recall < BASELINE["mean_recall_at_5"]:
+        breaches.append(
+            f"mean_recall_at_5={mean_recall:.4f} < baseline {BASELINE['mean_recall_at_5']}"
+        )
+    if total_tokens > BASELINE["total_tokens"]:
+        breaches.append(
+            f"total_tokens={total_tokens} > baseline {BASELINE['total_tokens']}"
+        )
+    if p95 > BASELINE["p95_latency_ms"]:
+        breaches.append(
+            f"p95_latency_ms={p95:.2f} > baseline {BASELINE['p95_latency_ms']}"
+        )
+
+    if breaches:
+        print()
+        print("supamem eval — REGRESSION:")
+        for line in breaches:
+            print(f"  - {line}")
+        return 1
+
+    print()
+    print("supamem eval — regress: PASS")
+    return 0
+
+
+__all__ = ["BASELINE", "run_bench"]
