Commit a74b419

docs: 0.21 capture-integrity directives in SKILL.md + README (#37)
The four new primitives shipped in #36 (RawProviderSink, assertLlmRoute, assertRunCaptured, traceAnalystOnRunComplete) are operational discipline, not just analytical surface — they only prevent the launch-grade-failure bug class if consumers actually wire them. SKILL.md now encodes them as required directives with shipped-bug rationale; README has a runnable composed example linking back to SKILL.md.

SKILL.md:
- "Decide where to start" table: 5 new rows for researchReport + the four capture-integrity primitives.
- Production-rigor primitives table: rows for researchReport, RawProviderSink, assertLlmRoute, assertRunCaptured, onRunComplete hooks.
- New "Capture integrity (REQUIRED for launch-grade adoption)" section with four directives — each carries the why, the shape, and the shipped incident (blueprint-agent matrix run).
- Pitfalls 12-15: async researchReport, minPairs default bump, custom-header redaction, hook-error swallow semantics.

README:
- Core Pieces table: rows for the four capture-integrity primitives.
- New "Capture integrity (0.21+)" section with the composed runnable example and a back-link to SKILL.md.
1 parent d755bee commit a74b419

2 files changed

Lines changed: 155 additions & 1 deletion

.claude/skills/agent-eval/SKILL.md

Lines changed: 116 additions & 0 deletions
@@ -47,6 +47,11 @@ If a term below isn't in this table or in `docs/concepts.md`, that's a bug — f
| Standardize a paper-grade run record (snapshot-pinned, hashed, costed) | `RunRecord` + `validateRunRecord` |
| Detect silent judge fallback / calibration drift / distribution shift | `runCanaries` |
| Emit an A/B summary table or Pareto / gain figure spec | `summaryTable` / `paretoChart` / `gainHistogram` |
| Build a launch-decision-grade research report (paired stats, ROPE, MDE, fingerprint, methodology) | `researchReport` (§Research reports) |
| Capture every provider HTTP request/response for forensics | `RawProviderSink` + `LlmClientOptions.rawSink` (§Capture integrity Directive 1) |
| Fail loud if the eval would silently use the wrong route | `assertLlmRoute` (§Capture integrity Directive 2) |
| Assert at run-end that the artifact is complete | `assertRunCaptured` + `throwIfRunIncomplete` (§Capture integrity Directive 3) |
| Auto-execute the trace analyst on every run | `traceAnalystOnRunComplete` + `TraceEmitterOptions.onRunComplete` (§Capture integrity Directive 4) |
| Stable hook for an external research-driver agent | `Researcher` (interface) + `NoopResearcher` (placeholder) |

Extend, don't fork — see §"Extend, don't duplicate."
@@ -62,6 +67,11 @@ Extend, don't fork — see §"Extend, don't duplicate."
| `pairedBootstrap`, `pairedWilcoxon`, `bhAdjust` | `paired-stats.ts` | Stats primitives. Pass `seed` to `pairedBootstrap` when the result feeds a CI / promotion decision. |
| `runCanaries` | `canary.ts` | Silent fallback (constant confidence), calibration drift (KS), distribution shift (chi-square). Returns a report; doesn't fail tests — wire it to a notification. |
| `summaryTable`, `paretoChart`, `gainHistogram` | `summary-report.ts` | A/B reporting. `summaryTable` emits markdown with bootstrap CIs + paired Wilcoxon p (BH-adjusted) + Cohen's d. The other two return vega-lite-friendly specs. |
| `researchReport` | `summary-report.ts` | Async, launch-decision-grade artifact: paired-evidence-only verdicts (`promote` / `hold` / `equivalent` / `reject` / `needs_more_data`), ROPE, Pr(Δ>0), per-candidate MDE via `pairedMde`, SHA-256 `runFingerprint`, optional `preregistrationHash`, embedded methodology. See [`docs/research-report-methodology.md`](../../../docs/research-report-methodology.md). |
| `RawProviderSink` + `callLlm({ rawSink })` | `trace/raw-provider-sink.ts`, `llm-client.ts` | First-class HTTP-level capture alongside `LlmSpan`. `Authorization` / `X-Api-Key` / credential-shaped body fields auto-redacted; `event.redactedFields` records what was stripped. `FileSystemRawProviderSink` rolls at 32 MiB. **Every eval run wires this** — see Directive 1. |
| `assertLlmRoute` | `llm-client.ts` | Pure preflight guard. Throws `LlmRouteAssertionError` on missing baseUrl, blocked URL, missing auth, wrong provider. Call once at matrix-runner construction. See Directive 2. |
| `assertRunCaptured` + `throwIfRunIncomplete` | `trace/integrity.ts` | Read-only run-completion check. `requireRawCoverageOfLlmSpans` catches the bug class where structured spans were emitted but raw HTTP capture went to a different sink. See Directive 3. |
| `onRunComplete` hooks + `traceAnalystOnRunComplete` | `trace/emitter.ts`, `trace-analyst/hook.ts` | Declarative auto-orchestration after `endRun` / `abortRun`. Errors are swallowed and logged by default (auto-orchestration must not crash the underlying flow). See Directive 4. |
| `Researcher` (interface) + `NoopResearcher` | `researcher.ts` | Stable hook for an external agent that drives the meta-loop. Real implementations live downstream. |
| `BenchmarkAdapter` + `routing` benchmark | `benchmarks/` | One adapter contract + the synthetic routing task we own. Reference wrappers for GSM8K and SWE-Bench-Lite live under `examples/benchmarks/`. `BENCHMARK_SPLIT_SEED = "agent-eval-v1"` — never change it. |

@@ -299,6 +309,104 @@ Fail closed; use `// muffle-ok: <reason>` for the rare exception.

---

## Capture integrity (REQUIRED for launch-grade adoption)

A run that *appears* successful but lost its forensic evidence is worse than a failed run — a launch reviewer can't distinguish "we measured a real win" from "we measured nothing on the wrong route." The four directives below are the operational discipline that turns the analytical primitives into a launch-grade artifact. **Skip one and the consumer's run is descriptive, not anchoring** — the same failure mode that prompted the 0.21.0 release.

If you're wrapping agent-eval in a matrix runner, propose-review loop, or `BuilderSession`-driven sweep, you wire all four. Trace evidence + paired stats + held-out gate is the analytical surface; capture + route guard + integrity assertion + auto-orchestration is what makes that surface trustworthy.

### Directive 1 — every eval run wires a `RawProviderSink`

```ts
import { FileSystemRawProviderSink, callLlm } from '@tangle-network/agent-eval'
const sink = new FileSystemRawProviderSink({ dir: `${workDir}/raw-events` })
await callLlm(req, { rawSink: sink, traceContext: { runId, spanId }, ...llmOpts })
```

**Why**: `LlmSpan` records *intent* (model, messages, output, token counts). The raw HTTP body is *ground truth*. Token counts can lie, and a proxy can echo a `model` value different from the one that actually answered. Without raw capture you cannot answer "did the verifier hit the wrong route?" or "where did the reasoning tokens go?" after the fact.

**Default redaction** strips `Authorization` / `X-Api-Key` / `X-Auth-Token` / `Cookie` headers and credential-shaped body fields (`apiKey`, `bearer`, `password`, `secret`, `token`, `refresh_token`, …). `event.redactedFields` records the paths so a reviewer sees what was stripped without exposing values. Every retry attempt produces its own `request` and `response` (or `error`) event with `attemptIndex`.
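
As a rough illustration of the header half of that behavior, the sketch below strips a list of sensitive headers and records the stripped paths rather than the values. It is not agent-eval's redactor: `redactHeaders`, `RedactionResult`, the `[REDACTED]` placeholder, and the `headers.<name>` path format are all hypothetical, and `X-Org-Token` stands in for any proxy-specific auth header that a default list would not know about.

```typescript
// Hypothetical sketch; names and shapes are illustrative, not agent-eval's API.
type HeaderMap = Record<string, string>

interface RedactionResult {
  headers: HeaderMap
  redactedFields: string[] // paths of stripped values, never the values themselves
}

// Header names from the default list described above, compared case-insensitively.
const DEFAULT_SENSITIVE_HEADERS = ['authorization', 'x-api-key', 'x-auth-token', 'cookie']

function redactHeaders(headers: HeaderMap, extraSensitive: string[] = []): RedactionResult {
  const sensitive = new Set(
    [...DEFAULT_SENSITIVE_HEADERS, ...extraSensitive].map((h) => h.toLowerCase()),
  )
  const out: HeaderMap = {}
  const redactedFields: string[] = []
  for (const [name, value] of Object.entries(headers)) {
    if (sensitive.has(name.toLowerCase())) {
      out[name] = '[REDACTED]'
      redactedFields.push(`headers.${name}`)
    } else {
      out[name] = value
    }
  }
  return { headers: out, redactedFields }
}

// A custom proxy header is only stripped because the caller listed it explicitly.
const redactionExample = redactHeaders(
  { Authorization: 'Bearer abc', 'X-Org-Token': 'org-secret', 'Content-Type': 'application/json' },
  ['X-Org-Token'],
)
```

Without the second argument, `X-Org-Token` would pass through untouched, which is the leak a strip-list design invites whenever a non-standard auth scheme is in play.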

**Sinks**: `InMemoryRawProviderSink` (tests, dev), `FileSystemRawProviderSink` (rolls at 32 MiB, NDJSON), `NoopRawProviderSink` (when explicitly opting out — annotate why). DuckDB / Langfuse / object-store implementations land downstream against the same interface.

**Shipped incident**: `blueprint-agent` matrix run failed launch review because raw events were never written; structured spans alone could not answer "was the verifier hitting the free-tier router?"

### Directive 2 — assert the route at preflight

```ts
import { assertLlmRoute } from '@tangle-network/agent-eval'
assertLlmRoute(llmOpts, {
  requireExplicitBaseUrl: true, // never silently fall back to DEFAULT_BASE_URL
  allowedBaseUrls: [/api\.openai\.com/, /router\.tangle\.tools/],
  requireAuth: true,
  expectedProvider: 'openai', // optional: pin the resolved provider
})
```

**Why**: with `baseUrl` undefined, `callLlm` falls back to `DEFAULT_BASE_URL`. An eval sweep that quietly targets the public/free-tier route produces launch-decision-grade artifacts on the wrong provider — the report scores something the operator never intended to ship. Pure function, no I/O — call from constructors, CI gates, preflight validators.

`LlmRouteAssertionError.code` is structured (`no_explicit_base_url` | `base_url_blocked` | `base_url_not_allowed` | `no_auth` | `wrong_provider`) for programmatic recovery.
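
Branching on a structured code might look like the following self-contained sketch. `RouteCheckError`, `checkRoute`, and `preflight` are hypothetical stand-ins written for this example (they do not call `assertLlmRoute`), the `apiKey` field is an assumed option name, and only three of the five codes are modeled.

```typescript
// Hypothetical stand-in for a route guard; not the library's implementation.
type RouteIssueCode = 'no_explicit_base_url' | 'base_url_not_allowed' | 'no_auth'

class RouteCheckError extends Error {
  readonly code: RouteIssueCode
  constructor(code: RouteIssueCode, message: string) {
    super(message)
    this.name = 'RouteCheckError'
    this.code = code
    // Keep `instanceof` working when compiled to older targets.
    Object.setPrototypeOf(this, RouteCheckError.prototype)
  }
}

interface RouteOptions {
  baseUrl?: string
  apiKey?: string // assumed field name for the credential
}

// Pure check: no I/O, throws on the first violated rule.
function checkRoute(opts: RouteOptions, allowedBaseUrls: RegExp[]): void {
  const { baseUrl, apiKey } = opts
  if (!baseUrl) {
    throw new RouteCheckError('no_explicit_base_url', 'baseUrl must be set explicitly')
  }
  if (!allowedBaseUrls.some((pattern) => pattern.test(baseUrl))) {
    throw new RouteCheckError('base_url_not_allowed', `baseUrl ${baseUrl} is not on the allowlist`)
  }
  if (!apiKey) {
    throw new RouteCheckError('no_auth', 'no credential configured')
  }
}

// Callers map each code to a recovery action instead of parsing messages.
function preflight(opts: RouteOptions): RouteIssueCode | 'ok' {
  try {
    checkRoute(opts, [/api\.openai\.com/])
    return 'ok'
  } catch (err) {
    if (err instanceof RouteCheckError) return err.code
    throw err // anything else is a genuine bug, so let it surface
  }
}
```

Keeping the check free of I/O is what makes it cheap to call from a constructor or a CI gate, before any request is sent.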

**Shipped incident**: same `blueprint-agent` matrix run silently used the public router; 0.21 ships this so the next consumer fails closed at preflight.

### Directive 3 — assert the run captured before declaring done

```ts
import { assertRunCaptured, throwIfRunIncomplete } from '@tangle-network/agent-eval'
const report = await assertRunCaptured(store, emitter.runId, {
  llmSpansMin: 1,
  judgeSpansMin: 1,
  rawSink,
  requireRawCoverageOfLlmSpans: true, // every LlmSpan has a matching raw `request` event
  requireOutcome: true,
})
throwIfRunIncomplete(report) // strict; or branch on report.issues for retry
```

**Why**: a run can complete with `status='completed'` and zero raw events (sink wired to wrong dir, fs error swallowed, integrity wired but disk full). Without an end-of-run assertion the partial-capture bug class is invisible until launch review. `requireRawCoverageOfLlmSpans` specifically catches the case where the structured `LlmSpan` was emitted but the raw HTTP capture went to a different sink — the highest-stakes silent failure in the eval pipeline.

Issue codes: `no_run` | `missing_llm_spans` | `missing_judge_spans` | `missing_tool_spans` | `missing_raw_events` | `no_raw_sink` | `orphan_llm_span` | `missing_outcome`.
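
The `orphan_llm_span` case can be pictured as a set difference between the span IDs of structured spans and the span IDs seen on raw `request` events. The sketch below assumes raw events carry the `spanId` passed in `traceContext`; the `LlmSpanLike` and `RawEventLike` shapes and `findOrphanLlmSpans` are hypothetical simplifications, not the library's types.

```typescript
// Hypothetical, simplified shapes; the real span and event records carry more fields.
interface LlmSpanLike {
  spanId: string
}

interface RawEventLike {
  spanId: string
  kind: 'request' | 'response' | 'error'
}

// Returns the IDs of spans that have no raw `request` event in the given sink contents.
function findOrphanLlmSpans(spans: LlmSpanLike[], rawEvents: RawEventLike[]): string[] {
  const covered = new Set<string>()
  for (const event of rawEvents) {
    if (event.kind === 'request') covered.add(event.spanId)
  }
  return spans.map((span) => span.spanId).filter((id) => !covered.has(id))
}

const orphans = findOrphanLlmSpans(
  [{ spanId: 'a' }, { spanId: 'b' }, { spanId: 'c' }],
  [
    { spanId: 'a', kind: 'request' },
    { spanId: 'a', kind: 'response' },
    { spanId: 'b', kind: 'response' }, // response without a captured request
  ],
)
// orphans → ['b', 'c']
```

A non-empty result is the situation described above: the structured span exists, but the raw capture for it went somewhere else or was never written.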

### Directive 4 — auto-execute the trace analyst via hook, not out-of-band

```ts
import { TraceEmitter } from '@tangle-network/agent-eval'
import { traceAnalystOnRunComplete } from '@tangle-network/agent-eval/traces'

const emitter = new TraceEmitter(store, {
  onRunComplete: [
    traceAnalystOnRunComplete({ analyze: { source, ai }, save: writeAnalysis }),
  ],
})
```

**Why**: out-of-band steps get skipped (CI flag forgotten, env var missing, "I'll run it manually after"). Declarative hooks fire as part of `endRun` / `abortRun` and never get omitted. Hook errors are swallowed and recorded as `log` events by default — auto-orchestration must not crash the underlying flow. Opt into propagation with `hookErrors: 'throw'` for tests.
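
The difference between the two error modes can be sketched as below. This is a hypothetical runner, not `TraceEmitter`'s code: it is synchronous for brevity (real hooks are presumably async), `runCompletionHooks` and `CompletionHook` are invented names, and `'swallow'` is only an illustrative label for the default behavior.

```typescript
// Hypothetical sketch of hook error handling; synchronous for brevity.
type CompletionHook = () => void
type HookErrorMode = 'swallow' | 'throw' // 'swallow' is an illustrative label for the default

function runCompletionHooks(
  hooks: CompletionHook[],
  mode: HookErrorMode,
  log: (message: string) => void,
): void {
  for (const hook of hooks) {
    try {
      hook()
    } catch (err) {
      if (mode === 'throw') throw err
      // Default path: record the failure and keep running the remaining hooks.
      const reason = err instanceof Error ? err.message : String(err)
      log(`hook failed: ${reason}`)
    }
  }
}

const hookLog: string[] = []
const hooksThatRan: string[] = []
runCompletionHooks(
  [
    () => { throw new Error('disk full') },
    () => { hooksThatRan.push('analyst') },
  ],
  'swallow',
  (message) => { hookLog.push(message) },
)
// hookLog → ['hook failed: disk full']; the second hook still ran.
```

The trade-off is visible in the example: the run survives a failing hook, but the only trace of the failure is a log entry, so a hook that must succeed needs the `'throw'` mode or an explicit assertion elsewhere.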

**Shipped incident**: `blueprint-agent` matrix run never produced an analyst artifact for a sweep the consumer expected to be self-analyzing.

### Composed shape — the four together

```ts
const sink = new FileSystemRawProviderSink({ dir: `${workDir}/raw-events` })
assertLlmRoute(llmOpts, { requireExplicitBaseUrl: true, allowedBaseUrls, requireAuth: true })

const emitter = new TraceEmitter(store, {
  onRunComplete: [traceAnalystOnRunComplete({ analyze: analystOpts, save: writeAnalysis })],
})
await emitter.startRun({ scenarioId, layer: 'app-runtime' })
// LLM calls flow through callLlm with `{ rawSink: sink, traceContext: { runId, spanId } }`.
await emitter.endRun({ pass, score })

const integrity = await assertRunCaptured(store, emitter.runId, {
  llmSpansMin: 1, rawSink: sink, requireRawCoverageOfLlmSpans: true, requireOutcome: true,
})
throwIfRunIncomplete(integrity)
```

If you're skipping any of the four for a reason that isn't "this is a unit test, capture is irrelevant," document the reason inline. The cost of capture is one NDJSON file; the cost of skipping it is the next launch decision.

---

## Pitfalls

1. **Pin the model snapshot.** `validateRunRecord` rejects bare aliases like `claude-sonnet-4-6`. Record `claude-sonnet-4-6@2025-04-15`. Aliases re-map silently; a bare-alias row can't be re-evaluated.
@@ -323,6 +431,14 @@ Fail closed; use `// muffle-ok: <reason>` for the rare exception.

11. **`Researcher` is an interface, not an implementation.** Real brains live downstream. Keeping this stub-only is what keeps the contract stable.

12. **`researchReport` is async (0.21+).** It computes the run fingerprint with Web Crypto, so every caller must `await` it. The caller most easily missed is a synchronous test helper.

13. **`researchReport.minPairs` defaults to 20 (0.21+).** The pre-0.21 default of 6 was only a soft floor for a rigorous report; it was raised because it invited promotion calls on under-powered evidence. The hard floor (`RESEARCH_REPORT_HARD_PAIR_FLOOR`) is 6 and overrides any caller setting below it.

14. **`RawProviderSink` redaction is allowlist-of-strip, not allowlist-of-keep.** It removes a known list of sensitive fields and passes everything else through. The default redactor strips well-known auth headers and credential-shaped body fields, but a custom header your proxy uses won't be auto-stripped. If a non-standard auth scheme is in play (`X-Org-Token`, etc.), pass a `redactor` that extends `defaultProviderRedactor`. The cost of a leaked token in NDJSON is high.

15. **Hook errors are swallowed and logged by default.** `TraceEmitterOptions.onRunComplete` hooks that throw don't crash the run — that's intentional, auto-orchestration must not fail the underlying flow. If a hook is *load-bearing* for the run's correctness (e.g. a gate that must pass before declaring success), set `hookErrors: 'throw'` or wire the gate as an explicit assertion outside the hook.
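
The interaction between the default and the hard floor in Pitfall 13 can be read as a clamp. The sketch below is an assumption about that behavior rather than the library's code: `effectiveMinPairs` is a hypothetical helper, and the two constants simply mirror the numbers quoted in the pitfall.

```typescript
// Hypothetical helper illustrating Pitfall 13; not exported by agent-eval.
const DEFAULT_MIN_PAIRS = 20 // 0.21+ default quoted in Pitfall 13
const HARD_PAIR_FLOOR = 6 // stands in for RESEARCH_REPORT_HARD_PAIR_FLOOR

// The caller's setting wins unless it falls below the hard floor.
function effectiveMinPairs(requested?: number): number {
  return Math.max(HARD_PAIR_FLOOR, requested ?? DEFAULT_MIN_PAIRS)
}
```

Under this reading, a caller who passes `minPairs: 3` still gets a floor of 6, while a caller who passes nothing gets 20.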

---

## Regression tests worth writing

README.md

Lines changed: 39 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -111,9 +111,47 @@ import { renderReleaseReport } from '@tangle-network/agent-eval/reporting'
| Compare prompt/tool/retrieval policies over full trajectories | `runMultiShotOptimization` |
| Gate releases with paired evidence and holdouts | `evaluateReleaseConfidence`, `HeldOutGate` |
| Explain regressions across trace corpora | `TraceAnalyst` / `analyzeTraces` |
114-
| Report a launch decision | `renderReleaseReport`, `summaryTable`, `paretoChart`, `gainHistogram` |
114+
| Report a launch decision | `renderReleaseReport`, `researchReport`, `summaryTable`, `paretoChart`, `gainHistogram` |
| Capture every provider HTTP request / response for forensics | `RawProviderSink`, `LlmClientOptions.rawSink` |
| Fail loud if an eval would silently use the wrong route | `assertLlmRoute` |
| Assert at run-end that the artifact is complete | `assertRunCaptured`, `throwIfRunIncomplete` |
| Auto-execute the trace analyst on every run | `traceAnalystOnRunComplete` + `TraceEmitterOptions.onRunComplete` |
| Model missing context separately from bad reasoning | `KnowledgeRequirement`, `KnowledgeBundle` |

### Capture integrity (0.21+)

Launch-grade benchmark runs need four things that are easy to forget in glue
code: (1) raw HTTP capture alongside the structured spans so a reviewer can
verify which route answered, (2) a preflight assertion that the configured
client points at the intended provider, (3) a run-end assertion that the
expected events were actually written, and (4) auto-execution of the trace
analyst as part of the run lifecycle. The wiring fits in a few lines:

```ts
import {
  TraceEmitter, FileSystemRawProviderSink, callLlm, assertLlmRoute,
  assertRunCaptured, throwIfRunIncomplete,
} from '@tangle-network/agent-eval'
import { traceAnalystOnRunComplete } from '@tangle-network/agent-eval/traces'

const sink = new FileSystemRawProviderSink({ dir: `${workDir}/raw-events` })
assertLlmRoute(llmOpts, { requireExplicitBaseUrl: true, allowedBaseUrls, requireAuth: true })

const emitter = new TraceEmitter(store, {
  onRunComplete: [traceAnalystOnRunComplete({ analyze: analystOpts, save })],
})
await emitter.startRun(/* ... */)
// LLM calls flow through callLlm with `{ rawSink: sink, traceContext: { runId, spanId } }`.
await emitter.endRun({ pass, score })

throwIfRunIncomplete(await assertRunCaptured(store, emitter.runId, {
  llmSpansMin: 1, rawSink: sink, requireRawCoverageOfLlmSpans: true, requireOutcome: true,
}))
```

Directives, rationale, and shipped-bug context are in
[`SKILL.md` § Capture integrity](./.claude/skills/agent-eval/SKILL.md#capture-integrity-required-for-launch-grade-adoption).

## Examples

Runnable examples live in
