You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
docs: 0.21 capture-integrity directives in SKILL.md + README (#37)
The four new primitives shipped in #36 (RawProviderSink, assertLlmRoute,
assertRunCaptured, traceAnalystOnRunComplete) are operational discipline,
not just analytical surface — they only prevent the launch-grade-failure
bug class if consumers actually wire them. SKILL.md now encodes them as
required directives with shipped-bug rationale; README has a runnable
composed example linking back to SKILL.md.
SKILL.md:
- "Decide where to start" table: 5 new rows for researchReport + the four
capture-integrity primitives.
- Production-rigor primitives table: rows for researchReport, RawProviderSink,
assertLlmRoute, assertRunCaptured, onRunComplete hooks.
- New "Capture integrity (REQUIRED for launch-grade adoption)" section
with four directives — each carries the why, the shape, and the shipped
incident (blueprint-agent matrix run).
- Pitfalls 12-15: async researchReport, minPairs default bump, custom-header
redaction, hook-error swallow semantics.
README:
- Core Pieces table: rows for the four capture-integrity primitives.
- New "Capture integrity (0.21+)" section with the composed runnable example
and a back-link to SKILL.md.
|`pairedBootstrap`, `pairedWilcoxon`, `bhAdjust`|`paired-stats.ts`| Stats primitives. Pass `seed` to `pairedBootstrap` when the result feeds a CI / promotion decision. |
63
68
|`runCanaries`|`canary.ts`| Silent fallback (constant confidence), calibration drift (KS), distribution shift (chi-square). Returns a report; doesn't fail tests — wire it to a notification. |
64
69
|`summaryTable`, `paretoChart`, `gainHistogram`|`summary-report.ts`| A/B reporting. `summaryTable` emits markdown with bootstrap CIs + paired Wilcoxon p (BH-adjusted) + Cohen's d. The other two return vega-lite-friendly specs. |
|`RawProviderSink` + `callLlm({ rawSink })`|`trace/raw-provider-sink.ts`, `llm-client.ts`| First-class HTTP-level capture alongside `LlmSpan`. `Authorization` / `X-Api-Key` / credential-shaped body fields auto-redacted; `event.redactedFields` records what was stripped. `FileSystemRawProviderSink` rolls at 32 MiB. **Every eval run wires this** — see Directive 1. |
72
+
|`assertLlmRoute`|`llm-client.ts`| Pure preflight guard. Throws `LlmRouteAssertionError` on missing baseUrl, blocked URL, missing auth, wrong provider. Call once at matrix-runner construction. See Directive 2. |
73
+
|`assertRunCaptured` + `throwIfRunIncomplete`|`trace/integrity.ts`| Read-only run-completion check. `requireRawCoverageOfLlmSpans` catches the bug class where structured spans were emitted but raw HTTP capture went to a different sink. See Directive 3. |
74
+
|`onRunComplete` hooks + `traceAnalystOnRunComplete`|`trace/emitter.ts`, `trace-analyst/hook.ts`| Declarative auto-orchestration after `endRun` / `abortRun`. Errors are swallowed and logged by default (auto-orchestration must not crash the underlying flow). See Directive 4. |
65
75
|`Researcher` (interface) + `NoopResearcher`|`researcher.ts`| Stable hook for an external agent that drives the meta-loop. Real implementations live downstream. |
66
76
|`BenchmarkAdapter` + `routing` benchmark |`benchmarks/`| One adapter contract + the synthetic routing task we own. Reference wrappers for GSM8K and SWE-Bench-Lite live under `examples/benchmarks/`. `BENCHMARK_SPLIT_SEED = "agent-eval-v1"` — never change it. |
67
77
@@ -299,6 +309,104 @@ Fail closed; use `// muffle-ok: <reason>` for the rare exception.
299
309
300
310
---
301
311
312
+
## Capture integrity (REQUIRED for launch-grade adoption)
313
+
314
+
A run that *appears* successful but lost its forensic evidence is worse than a failed run — a launch reviewer can't distinguish "we measured a real win" from "we measured nothing on the wrong route." The four directives below are the operational discipline that turns the analytical primitives into a launch-grade artifact. **Skip one and the consumer's run is descriptive, not anchoring** — the same failure mode that prompted the 0.21.0 release.
315
+
316
+
If you're wrapping agent-eval in a matrix runner, propose-review loop, or `BuilderSession`-driven sweep, you wire all four. Trace evidence + paired stats + held-out gate is the analytical surface; capture + route guard + integrity assertion + auto-orchestration is what makes that surface trustworthy.
317
+
318
+
### Directive 1 — every eval run wires a `RawProviderSink`
**Why**: `LlmSpan` records *intent* (model, messages, output, token counts). The raw HTTP body is *ground truth*. Token counts can lie; a proxy can echo a different `model` than answered. Without raw capture you cannot answer "did the verifier hit the wrong route?" or "where did the reasoning tokens go?" after the fact.
327
+
328
+
**Default redaction** strips `Authorization` / `X-Api-Key` / `X-Auth-Token` / `Cookie` headers and credential-shaped body fields (`apiKey`, `bearer`, `password`, `secret`, `token`, `refresh_token`, …). `event.redactedFields` records the paths so a reviewer sees what was stripped without exposing values. Every retry attempt produces its own `request` and `response` (or `error`) event with `attemptIndex`.
329
+
330
+
**Sinks**: `InMemoryRawProviderSink` (tests, dev), `FileSystemRawProviderSink` (rolls at 32 MiB, NDJSON), `NoopRawProviderSink` (when explicitly opting out — annotate why). DuckDB / Langfuse / object-store implementations land downstream against the same interface.
331
+
332
+
**Shipped incident**: `blueprint-agent` matrix run failed launch review because raw events were never written; structured spans alone could not answer "was the verifier hitting the free-tier router?"
expectedProvider: 'openai', // optional: pin the resolved provider
343
+
})
344
+
```
345
+
346
+
**Why**: with `baseUrl` undefined, `callLlm` falls back to `DEFAULT_BASE_URL`. An eval sweep that quietly targets the public/free-tier route produces launch-decision-grade artifacts on the wrong provider — the report scores something the operator never intended to ship. Pure function, no I/O — call from constructors, CI gates, preflight validators.
347
+
348
+
`LlmRouteAssertionError.code` is structured (`no_explicit_base_url` | `base_url_blocked` | `base_url_not_allowed` | `no_auth` | `wrong_provider`) for programmatic recovery.
349
+
350
+
**Shipped incident**: same `blueprint-agent` matrix run silently used the public router; 0.21 ships this so the next consumer fails closed at preflight.
351
+
352
+
### Directive 3 — assert the run captured before declaring done
requireRawCoverageOfLlmSpans: true, // every LlmSpan has a matching raw `request` event
361
+
requireOutcome: true,
362
+
})
363
+
throwIfRunIncomplete(report) // strict; or branch on report.issues for retry
364
+
```
365
+
366
+
**Why**: a run can complete with `status='completed'` and zero raw events (sink wired to wrong dir, fs error swallowed, integrity wired but disk full). Without an end-of-run assertion the partial-capture bug class is invisible until launch review. `requireRawCoverageOfLlmSpans` specifically catches the case where the structured `LlmSpan` was emitted but the raw HTTP capture went to a different sink — the highest-stakes silent failure in the eval pipeline.
traceAnalystOnRunComplete({ analyze: { source, ai }, save: writeAnalysis }),
379
+
],
380
+
})
381
+
```
382
+
383
+
**Why**: out-of-band steps get skipped (CI flag forgotten, env var missing, "I'll run it manually after"). Declarative hooks fire as part of `endRun` / `abortRun` and never get omitted. Hook errors are swallowed and recorded as `log` events by default — auto-orchestration must not crash the underlying flow. Opt into propagation with `hookErrors: 'throw'` for tests.
384
+
385
+
**Shipped incident**: `blueprint-agent` matrix run never produced an analyst artifact for a sweep the consumer expected to be self-analyzing.
If you're skipping any of the four for a reason that isn't "this is a unit test, capture is irrelevant," document the reason inline. The cost of capture is one NDJSON file; the cost of skipping it is the next launch decision.
407
+
408
+
---
409
+
302
410
## Pitfalls
303
411
304
412
1.**Pin the model snapshot.**`validateRunRecord` rejects bare aliases like `claude-sonnet-4-6`. Record `claude-sonnet-4-6@2025-04-15`. Aliases re-map silently; a bare-alias row can't be re-evaluated.
@@ -323,6 +431,14 @@ Fail closed; use `// muffle-ok: <reason>` for the rare exception.
323
431
324
432
11.**`Researcher` is an interface, not an implementation.** Real brains live downstream. Keeping this stub-only is what keeps the contract stable.
325
433
434
+
12.**`researchReport` is async (0.21+).** Web Crypto is used for the run fingerprint; `await` it. The only caller you might miss is a synchronous test helper.
435
+
436
+
13.**`researchReport.minPairs` defaults to 20 (0.21+).** The pre-0.21 default was 6; that was the soft floor of a rigorous report and got bumped because the previous default invited promotion calls on under-powered evidence. The hard floor (`RESEARCH_REPORT_HARD_PAIR_FLOOR`) is 6 and overrides any caller setting below it.
437
+
438
+
14.**`RawProviderSink` redaction is allowlist-of-strip, not allowlist-of-keep.** The default redactor strips well-known auth headers and credential-shaped body fields, but a custom header your proxy uses won't be auto-stripped. If a non-standard auth scheme is in play (`X-Org-Token`, etc.), pass a `redactor` that extends `defaultProviderRedactor`. The cost of a leaked token in NDJSON is high.
439
+
440
+
15.**Hook errors are swallowed and logged by default.**`TraceEmitterOptions.onRunComplete` hooks that throw don't crash the run — that's intentional, auto-orchestration must not fail the underlying flow. If a hook is *load-bearing* for the run's correctness (e.g. a gate that must pass before declaring success), set `hookErrors: 'throw'` or wire the gate as an explicit assertion outside the hook.
0 commit comments