
Commit 3201cc3

cdeust and claude committed
feat(verif): BEAM-10M LLM head-to-head harness scaffold (Stage 0)

Wires the harness skeleton at benchmarks/llm_head_to_head/ per pre-registered
protocol tasks/beam-10m-llm-head-to-head-protocol.md (Fisher v3, 2026-04-30).
NO API spend yet; code-only Stage 0.

Components:
- data_loader (196 BEAM-10M items, gold-supporting turn map)
- prompts/{answer,judge}.md (verbatim, SHA-256 anchored at run time)
- 4 condition builders: long_context_truncator (A, recency-truncated),
  retriever_baselines (B, vanilla cosine top-20), cortex_caller (C,
  production handler; anti-cheating §11.1), oracle_loader (D)
- generator (Anthropic + Google + OpenAI with retry/backoff/jitter,
  env-var keys, secret-audit on manifest write)
- judge (cross-vendor: Opus judges Google/OpenAI, GPT-4o judges Haiku;
  blind shuffle via stable per-question seed)
- manifest emitter (protocol §10 schema; secret audit; cost-tracking
  read-modify-write)
- orchestrator (wires conditions × generators × judges; --dry-run cost
  estimator)
- pilot.py (Stage 1 entry: B+C on Haiku 4.5; dry-run only in this PR)

Tests (23 passing): four new files including the production-handler-invariant
unit test that asserts cortex_caller.py uses no monkey-patching, only kwargs
from the production schema, and a single load-bearing import of
mcp_server.handlers.recall::handler. The retriever_baselines isolation test
asserts NO Cortex-stack imports / call sites. The manifest test covers
required fields, secret audit, and cost tracking.

Dry-run smoke (pilot.py --dry-run --n 3) builds all 4 conditions clean:
condition A truncates a 19,895-turn BEAM-10M conversation to 195k input
tokens (recency-keep), conditions B/C placeholder per scaffold contract,
condition D loads gold turns. Stage 2.1 cost estimate $43.95 matches
protocol §7 ($44).

Known follow-up: manifest.py (376 LOC) and generator.py (375 LOC) are
slightly over the project 300-LOC ceiling (~25%, within Medium-stakes 20%
flexibility per coding-standards §10). Natural split lines exist
(vendor-adapter functions, audit/write/cost helpers); deferred to a
post-freeze refactor since the scaffold is what gets pre-registered.

Closes Stage 0 of protocol §12 timeline. Pilot fires next; full Stage 2.1
(Haiku 4.5 panel, ~$40-55) gated on pilot pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
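
The commit message mentions a "blind shuffle via stable per-question seed" for the judge, but judge.py is not part of the diff shown here. The following is only a rough sketch of how such a shuffle could work, not the commit's implementation. The helper names `per_question_seed` and `blind_shuffle`, the SHA-256-derived delta, and the `answer_N` labels are all assumptions made for illustration; only the base seed value comes from data_loader.py.

```python
import hashlib
import random

# Base seed value as declared in data_loader.py (JUDGE_SHUFFLE_BASE).
JUDGE_SHUFFLE_BASE = 20260501


def per_question_seed(question_id: str, base: int = JUDGE_SHUFFLE_BASE) -> int:
    """Derive a reproducible seed from the base and the question id.

    Hypothetical scheme: the real per-question delta in judge.py is not shown.
    hashlib is used rather than built-in hash(), because hash() of a str is
    salted per process and would not be stable across runs.
    """
    digest = hashlib.sha256(question_id.encode("utf-8")).digest()
    delta = int.from_bytes(digest[:4], "big")
    return base + delta


def blind_shuffle(question_id: str, answers: dict[str, str]) -> list[tuple[str, str]]:
    """Return (blind_label, answer_text) pairs in a seed-determined order.

    ``answers`` maps a condition name (for example "A".."D") to its answer.
    The judge would see only the blind labels, never the condition names.
    """
    conditions = sorted(answers)  # fix the input order before shuffling
    rng = random.Random(per_question_seed(question_id))
    rng.shuffle(conditions)
    labels = [f"answer_{i + 1}" for i in range(len(conditions))]
    return [(label, answers[cond]) for label, cond in zip(labels, conditions)]
```

Seeding a private `random.Random` instance, instead of the module-level generator, keeps the order independent of whatever other code has drawn random numbers before the judge runs.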
1 parent 551a411 commit 3201cc3

19 files changed

Lines changed: 2834 additions & 0 deletions
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+"""BEAM-10M LLM Head-to-Head Harness.
+
+Stage-0 scaffold for the pre-registered protocol at
+``tasks/beam-10m-llm-head-to-head-protocol.md`` (v3, frozen 2026-04-30).
+
+Four conditions feed the SAME generator prompt:
+  A: naive long-context (recency-truncated to model window)
+  B: standard top-20 vector RAG (Lewis 2020, no Cortex stack)
+  C: Cortex-assembled (production ``handlers.recall.handler``)
+  D: Oracle (gold ``source_chat_ids`` turns)
+
+NO API spend at scaffold stage; the orchestrator's ``--dry-run`` mode
+must produce all four context blocks without firing any HTTP requests.
+"""
+
+# precondition: package import is side-effect free; no API keys read here.
+# postcondition: re-exporting module names resolves cleanly so callers can
+#   ``from benchmarks.llm_head_to_head import data_loader`` without import-
+#   time network or DB access.
+
+__all__ = [
+    "data_loader",
+    "long_context_truncator",
+    "retriever_baselines",
+    "cortex_caller",
+    "oracle_loader",
+    "generator",
+    "judge",
+    "manifest",
+    "orchestrator",
+    "pilot",
+]
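
Condition A is described above as "recency-truncated to model window", but long_context_truncator.py is not included in the hunks shown. As a minimal sketch of what recency-keep truncation can look like: walk the turns from newest to oldest and stop once the token budget is spent. The function names, the `content` key on turn dicts, and the whitespace token counter are assumptions for illustration; a real harness would presumably use the target model's tokenizer.

```python
from typing import Callable


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def recency_truncate(
    turns: list[dict],
    budget: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> list[dict]:
    """Keep the most recent turns whose combined cost fits within ``budget``.

    Turns are assumed to be in chronological order with their text under the
    hypothetical key ``content``. The result preserves chronological order.
    """
    kept: list[dict] = []
    used = 0
    for turn in reversed(turns):  # newest first
        cost = count_tokens(turn.get("content", ""))
        if used + cost > budget:
            break  # older turns are dropped once the budget is exhausted
        kept.append(turn)
        used += cost
    kept.reverse()  # restore oldest-to-newest order
    return kept
```

Stopping at the first turn that does not fit, rather than skipping it and continuing, keeps the retained context a contiguous suffix of the conversation.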
Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
+"""Condition C: Cortex-assembled context.
+
+PROTOCOL §11.1 ANTI-CHEATING (load-bearing invariant for the whole study):
+
+This module MUST invoke the production handler entry point
+``mcp_server.handlers.recall.handler`` directly, with arguments that
+already exist in the production schema. NO monkey-patching.
+NO benchmark-only kwargs. NO ``--benchmark-mode`` flag. NO alternative code path.
+
+The unit test ``tests_py/handlers/test_beam_anticheat.py`` reads THIS
+file's source code and asserts:
+  1. The only import targeting ``mcp_server.handlers.recall`` is exactly
+     ``from mcp_server.handlers.recall import handler``.
+  2. No call to ``setattr``, ``__class__``, or any monkey-patch primitive.
+  3. The kwargs passed to ``handler({...})`` are a subset of the keys
+     declared in ``recall.schema['inputSchema']['properties']``.
+
+If you change this file, the anti-cheating test must still pass without
+modification, OR a protocol addendum must be filed (§11 forbids silent
+deviation).
+
+precondition: the production memory store has been seeded with the BEAM
+  conversation's memories under ``domain="beam"`` (the orchestrator does
+  this via the production ``remember`` handler, not a benchmark shortcut).
+postcondition: returns the same memory dicts the production handler would
+  return for an interactive call with the same query: same ranking, same
+  enrichments (PL/pgSQL WRRF + FlashRank + prospective + co-activation +
+  rules + strategic ordering + replay tracking).
+invariant: this module's import of ``handler`` is the SOLE link between
+  the benchmark and the production stack. Removing this import and
+  re-running condition C must produce a clean ImportError, not a silent
+  fallback path.
+"""
+
+from __future__ import annotations
+
+import asyncio
+from typing import Any
+
+# THE LOAD-BEARING IMPORT. Do not change without filing a protocol addendum.
+from mcp_server.handlers.recall import handler  # noqa: E402
+
+
+# Pre-registered max_results value matching condition B's k=20 (protocol §2.C
+# uses the same retrieval depth as B so the comparison isolates the stack).
+CORTEX_MAX_RESULTS = 20
+
+
+def cortex_recall(question: str, domain: str = "beam") -> list[dict[str, Any]]:
+    """Call the production recall handler, exactly as production does.
+
+    pre: ``question`` is non-empty; ``domain`` matches what the orchestrator
+        seeded via the production remember handler.
+    post: returns a list of memory dicts (possibly empty): whatever the
+        production handler returned. We do NOT post-process, re-rank, or
+        filter; the handler IS the production behaviour.
+    """
+    if not question or not question.strip():
+        return []
+
+    # The production handler is async. Run it on a fresh loop so the
+    # benchmark orchestrator (synchronous) can call us. This is the same
+    # pattern any synchronous caller of an MCP tool uses.
+    args = {
+        "query": question,
+        "domain": domain,
+        "max_results": CORTEX_MAX_RESULTS,
+    }
+    response = asyncio.run(handler(args))
+
+    # The production handler returns {"results": [...], "total": N, ...}
+    # per ``recall.py::_handler_impl``. We pull the ``results`` list and
+    # return it verbatim.
+    if isinstance(response, dict):
+        results = response.get("results", [])
+        if isinstance(results, list):
+            return results
+    return []
+
+
+def passages_to_context(memories: list[dict[str, Any]], separator: str = "\n\n") -> str:
+    """Concatenate Cortex-returned memories into the answer prompt.
+
+    pre: memories is already ranked best-first by the production handler
+        (FlashRank + strategic ordering already applied).
+    post: returns a string; empty when memories is empty. The format
+        preserves the production ranking; the caller does NOT shuffle.
+    """
+    return separator.join(
+        m.get("content", "") for m in memories if m.get("content")
+    )
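
The module docstring above says a unit test reads this file's source and checks three properties, but the test file itself is not in the hunks shown. The sketch below illustrates one way such a static check could be written with the standard-library `ast` module. It is an illustration under assumptions, not the contents of `test_beam_anticheat.py`: the function names, the set of forbidden calls, and the way the allowed keys are passed in are all hypothetical.

```python
import ast

RECALL_MODULE = "mcp_server.handlers.recall"
EXPECTED_IMPORT = "from mcp_server.handlers.recall import handler"
FORBIDDEN_CALLS = {"setattr", "delattr"}  # assumed list of patch primitives


def recall_imports(tree: ast.AST) -> list[str]:
    """Return the source text of every import that targets the recall module."""
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ImportFrom):
            names = {alias.name for alias in node.names}
            if node.module == RECALL_MODULE or (
                node.module == "mcp_server.handlers" and "recall" in names
            ):
                found.append(ast.unparse(node))
        elif isinstance(node, ast.Import):
            if any(alias.name.startswith(RECALL_MODULE) for alias in node.names):
                found.append(ast.unparse(node))
    return found


def forbidden_uses(tree: ast.AST) -> list[str]:
    """Return names of monkey-patch primitives that appear in the source."""
    hits = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in FORBIDDEN_CALLS):
            hits.append(node.func.id)
        if isinstance(node, ast.Attribute) and node.attr == "__class__":
            hits.append("__class__")
    return hits


def handler_arg_keys(tree: ast.AST) -> set[str]:
    """Collect string keys of dict literals that are passed to handler(...)."""
    dicts, names = [], set()
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id == "handler" and node.args):
            first = node.args[0]
            if isinstance(first, ast.Dict):
                dicts.append(first)
            elif isinstance(first, ast.Name):
                names.add(first.id)  # dict was built in a variable first
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and isinstance(node.value, ast.Dict):
            if any(isinstance(t, ast.Name) and t.id in names for t in node.targets):
                dicts.append(node.value)
    return {
        key.value
        for d in dicts
        for key in d.keys
        if isinstance(key, ast.Constant) and isinstance(key.value, str)
    }


def audit_source(source: str, allowed_keys: set[str]) -> list[str]:
    """Return a list of violations; an empty list means the source passes."""
    tree = ast.parse(source)
    problems = []
    imports = recall_imports(tree)
    if imports != [EXPECTED_IMPORT]:
        problems.append(f"unexpected recall imports: {imports}")
    problems.extend(f"forbidden primitive: {name}" for name in forbidden_uses(tree))
    extra = handler_arg_keys(tree) - allowed_keys
    if extra:
        problems.append(f"kwargs outside schema: {sorted(extra)}")
    return problems
```

In a real test, `allowed_keys` would presumably come from `recall.schema['inputSchema']['properties']` as the docstring describes. Parsing with `ast` never imports or executes the audited file, so the check itself cannot trigger production side effects.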
Lines changed: 176 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,176 @@
+"""BEAM-10M data loader: 196 items, source_chat_ids → gold-supporting turns.
+
+Reuses ``benchmarks.beam.data`` (the existing 196-item discovery + chat-id
+flattening). This module is the SINGLE source of truth for "what items are
+in the protocol universe": every condition builder receives items from
+``load_items()`` so they all see the same questions in the same order.
+
+precondition: HuggingFace ``datasets`` package installed; network reachable
+  on first run (cached afterwards) per ``benchmarks/beam/data.py::load_beam_dataset``.
+postcondition: ``load_items()`` returns exactly 196 BeamItem records when
+  the BEAM-10M split is reachable. Item count mismatch → raises ValueError
+  per protocol §5 ("This is the universe"). Order is dataset-iteration
+  order, which is deterministic across runs because the HF dataset is
+  content-addressed.
+invariant: source_chat_ids on each item are GLOBAL turn IDs (post-flatten
+  by ``extract_10m_chat``), matching what's in the conversation ``turns``
+  list. The gold-supporting-turn lookup in ``oracle_loader.py`` depends
+  on this.
+"""
+
+from __future__ import annotations
+
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Iterator
+
+# Re-use the existing BEAM loader without modifying it.
+sys.path.insert(0, str(Path(__file__).resolve().parents[2]))
+from benchmarks.beam.data import (  # noqa: E402
+    extract_10m_chat,
+    extract_conversation_turns,
+    load_beam_dataset,
+    parse_probing_questions,
+    turns_to_memories,
+)
+
+
+# Pre-registered universe size from protocol §5 (Tavakoli et al. 2026, Table 2).
+EXPECTED_ITEM_COUNT = 196
+
+# Pre-registered RNG seeds (protocol §10 manifest, §11.5 anti-cheating).
+SHUFFLE_SEED_BASE = 20260501
+JUDGE_SHUFFLE_BASE = 20260501  # same base, per-question delta in judge.py
+BOOTSTRAP_SEED = 20260503
+
+
+@dataclass(frozen=True)
+class BeamItem:
+    """One BEAM-10M probing question + its conversation context.
+
+    Hash is ``question_id`` only; ``__eq__`` skips ``turns``/``memories`` but
+    compares all other fields, so an item loaded twice still deduplicates.
+    """
+
+    question_id: str
+    conversation_idx: int
+    ability: str
+    question: str
+    gold_answer: str
+    source_chat_ids: tuple[int, ...]
+    # Conversation context, shared by reference (not copied) at construction
+    # time. The ``turns`` list is the GLOBAL-id-numbered flat list produced
+    # by ``extract_10m_chat`` + ``extract_conversation_turns``.
+    turns: list[dict] = field(default_factory=list, hash=False, compare=False)
+    memories: list[dict] = field(default_factory=list, hash=False, compare=False)
+
+    def __hash__(self) -> int:
+        return hash(self.question_id)
+
+
+def _flatten_source_ids(raw: object) -> tuple[int, ...]:
+    """source_chat_ids may be list[int] or dict-of-lists. Flatten to tuple[int].
+
+    pre: raw is whatever BEAM emits in ``probing_questions[ability][i]``.
+    post: returns a tuple of int turn IDs (possibly empty for abstention).
+    """
+    if isinstance(raw, dict):
+        out: list[int] = []
+        for v in raw.values():
+            if isinstance(v, list):
+                out.extend(i for i in v if isinstance(i, int))
+            elif isinstance(v, int):
+                out.append(v)
+        return tuple(out)
+    if isinstance(raw, list):
+        return tuple(i for i in raw if isinstance(i, int))
+    return ()
+
+
+def iter_items(split: str = "10M") -> Iterator[BeamItem]:
+    """Yield BeamItems in dataset-iteration order.
+
+    pre: split == "10M" for the protocol (other splits accepted for smoke
+        tests but emit a warning to stderr).
+    post: every yielded item has non-empty ``turns``; conversations with no
+        turns or no parseable probing_questions are skipped, as are
+        questions with empty text.
+    """
+    if split != "10M":
+        print(
+            f"[data_loader] WARNING: split={split} is not the pre-registered "
+            "10M universe; results not protocol-valid.",
+            file=sys.stderr,
+        )
+
+    ds = load_beam_dataset(split)
+    for conv_idx, conversation in enumerate(ds):
+        # BEAM-10M aggregates 10 sub-plans into one ~10M-token convo.
+        if split == "10M":
+            chat = extract_10m_chat(conversation)
+        else:
+            chat = conversation.get("chat", "")
+
+        turns = extract_conversation_turns(chat)
+        memories = turns_to_memories(turns)
+        if not turns:
+            continue
+
+        raw_pq = conversation.get("probing_questions", "{}")
+        questions = parse_probing_questions(raw_pq)
+        if not questions:
+            continue
+
+        for ability, qs in questions.items():
+            if not isinstance(qs, list):
+                qs = [qs]
+            for q_idx, q in enumerate(qs):
+                if not isinstance(q, dict):
+                    continue
+                question_text = q.get("question", "")
+                if not question_text:
+                    continue
+                yield BeamItem(
+                    question_id=f"conv{conv_idx:03d}-{ability}-{q_idx:02d}",
+                    conversation_idx=conv_idx,
+                    ability=ability,
+                    question=question_text,
+                    gold_answer=q.get("answer", "") or "",
+                    source_chat_ids=_flatten_source_ids(q.get("source_chat_ids", [])),
+                    turns=turns,
+                    memories=memories,
+                )
+
+
+def load_items(split: str = "10M", strict: bool = True) -> list[BeamItem]:
+    """Materialise all items into a list. Verifies the universe size.
+
+    pre: ``strict=True`` enforces the 196-item invariant from protocol §5.
+    post: returns ``EXPECTED_ITEM_COUNT`` items in dataset-iteration order
+        when ``strict=True`` and split=="10M". Mismatch → ValueError. When
+        ``strict=False`` (smoke / dry-run), accepts any count and warns.
+    """
+    items = list(iter_items(split))
+    if strict and split == "10M" and len(items) != EXPECTED_ITEM_COUNT:
+        raise ValueError(
+            f"BEAM-10M item count mismatch: expected {EXPECTED_ITEM_COUNT} "
+            f"per protocol §5 (Tavakoli et al. 2026), got {len(items)}. "
+            "Universe drift requires a protocol addendum, not a silent run."
+        )
+    if not strict and len(items) != EXPECTED_ITEM_COUNT:
+        print(
+            f"[data_loader] non-strict: got {len(items)} items "
+            f"(expected {EXPECTED_ITEM_COUNT}). Use only for dry-run/smoke.",
+            file=sys.stderr,
+        )
+    return items
+
+
+def turn_lookup(item: BeamItem) -> dict[int, dict]:
+    """Map global turn-id → turn dict, for oracle retrieval.
+
+    pre: ``item.turns`` is the global-numbered flat list.
+    post: returned dict has one entry per turn; key == turn['id'].
+    """
+    return {t["id"]: t for t in item.turns if isinstance(t.get("id"), int)}
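
The module docstring notes that the gold-supporting-turn lookup in oracle_loader.py depends on `source_chat_ids` being global turn IDs, yet oracle_loader.py is not among the hunks shown. As a hedged illustration of how condition D might resolve gold turns from the same id-to-turn mapping that `turn_lookup` builds: the function name `gold_turns` and the choice to report unresolved IDs rather than raise are assumptions, not the commit's behaviour.

```python
def gold_turns(
    turns: list[dict], source_chat_ids: tuple[int, ...]
) -> tuple[list[dict], list[int]]:
    """Resolve gold turn IDs against a flat, globally numbered turn list.

    Returns (found_turns, missing_ids). ``found_turns`` follows the order of
    ``source_chat_ids``; ``missing_ids`` lists IDs with no matching turn,
    which would signal that the global-ID invariant has been broken.
    """
    # Same mapping that turn_lookup() builds from item.turns.
    by_id = {t["id"]: t for t in turns if isinstance(t.get("id"), int)}
    found = [by_id[i] for i in source_chat_ids if i in by_id]
    missing = [i for i in source_chat_ids if i not in by_id]
    return found, missing
```

Returning the unresolved IDs alongside the matches lets a caller decide whether a gap should abort the run or only be logged; an empty `source_chat_ids` (the abstention case mentioned in `_flatten_source_ids`) simply yields two empty lists.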
