
Commit d763d00

feat: 0.22.0 — EvalCampaign + replay + always-valid + outcome calibration (#42)
* feat: 0.22.0 — runEvalCampaign, capture integrity by construction

0.21 shipped the four capture-integrity primitives (RawProviderSink, assertLlmRoute, assertRunCaptured, onRunComplete hooks) as opt-in. Every consumer still had to wire them by hand, and the bug class blueprint-agent reported (forgotten wiring → silent partial-capture) reappears the moment a new consumer adopts agent-eval cold. 0.22 makes the right thing the default path.

runEvalCampaign is an opinionated matrix runner that owns the integrity surface so consumers stop reinventing it. What it owns:

- assertLlmRoute() once at preflight, with requireExplicitBaseUrl + requireAuth defaults. Misconfigured routes never burn a run.
- Per cell: TraceStore + RawProviderSink + TraceEmitter constructed from caller-supplied factories. The runner receives an LlmClientOptions pre-wired with rawSink + traceContext — calling an LLM without capturing it requires actively bypassing the campaign.
- assertRunCaptured() after every endRun with requireRawCoverageOfLlmSpans + requireOutcome defaults. Failure policy: throw | mark_failed | log (default mark_failed; sibling cells continue).
- onRunComplete hooks — pass traceAnalystOnRunComplete to auto-run the analyst as part of the run lifecycle.
- End of campaign: researchReport over the collected RunRecords with the campaign fingerprint + preregistrationHash baked in.

Determinism + isolation:

- Default runId is a stable hash of (campaignId, variantId, scenarioId, seed). Re-running the same campaign produces the same ids.
- Campaign fingerprint is a SHA-256 over the canonicalised plan (variants, scenarios, seeds, splitTag, baseUrl, provider, preregistrationHash) — stable across permutations.
- Local async worker pool, default concurrency 1.

Failure isolation:

- Runner throws → cell marked failed, others continue.
- Integrity fails → routed by the onIntegrityFailure policy.
- Genuine non-runner exceptions propagate (don't mask bugs).

Surface:

- runEvalCampaign exported from root and @tangle-network/agent-eval/optimization.
- Types: CampaignRunner, CampaignRunContext, CampaignRunOutcome, CampaignVariant, CampaignScenario, EvalCampaignOptions, EvalCampaignResult, FailedRun, CampaignIntegrityPolicy, CampaignFactoryParams.
- NoopRawProviderSink.list() now returns [] so explicit opt-out from capture is not flagged as no_raw_sink by assertRunCaptured. Opt-out remains a deliberate choice — the caller still has to override integrity expectations to admit the run.

Tests:

- 883 / 883 passing (+16 dedicated runEvalCampaign cases): happy path, research report end-to-end, fingerprint stability across permutations, preregistration passthrough, route preflight failures, validation errors, runner-throws-with-isolation, all three integrity policies, concurrency.
- typecheck + build clean.

Docs:

- SKILL.md: new "EvalCampaign — preferred starting point" section BEFORE the capture-integrity directive list, with a full runnable example and explicit when-not-to-use guidance pointing at runMultiShotOptimization, runPromptEvolution, runAgentControlLoop.
- Discoverability rows added to the "Decide where to start" and "Production-rigor primitives" tables.

Version lockstep: npm 0.22.0 ↔ PyPI agent-eval-rpc 0.22.0.

Migration: existing consumers don't need to change; runEvalCampaign is additive. The recommended path is to replace hand-rolled matrix runners with a single runEvalCampaign call on the next eval-runner refactor. The capture-integrity directives go from "things you might forget" to "things the framework owns."
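A minimal sketch of the determinism contract above (illustrative only — the hashing details and helper below are assumptions, not the shipped implementation):

```ts
// Illustrative sketch, not the shipped code. The changelog's contract: the default runId is a
// stable hash of (campaignId, variantId, scenarioId, seed), so re-running the same campaign
// plan reproduces the same ids.
import { createHash } from 'node:crypto'

function defaultRunId(campaignId: string, variantId: string, scenarioId: string, seed: number): string {
  // NUL separator keeps the encoding unambiguous; the truncation length is an arbitrary choice here.
  const key = [campaignId, variantId, scenarioId, String(seed)].join('\u0000')
  return createHash('sha256').update(key).digest('hex').slice(0, 16)
}

defaultRunId('launch-2026-q2', 'baseline', 'task-1', 3) // → identical on every re-run
```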
* feat: 0.22.0 — replay, anytime-valid sequential, outcome calibration

Three primitives that compound on top of the EvalCampaign artifact:

1. Replay-from-raw-events: every captured campaign is a re-runnable artifact. ReplayCache + createReplayFetch turn yesterday's raw provider events into a deterministic fetch-shaped cache. Re-judge, re-score, or determinism-audit without burning a single LLM token.
2. Anytime-valid sequential evaluation: pairedEvalueSequence and evaluateInterimReleaseConfidence ship the predictable plug-in betting martingale of Waudby-Smith & Ramdas (2024) paired with the empirical Bernstein confidence sequence of Howard et al. (2021). Type-I error is bounded by α at every stopping time — peek at every campaign tick without inflating the false-discovery rate. Tested under the null at α = 0.05 on 100 synthetic series; the bound holds.
3. Outcome calibration loop: rubricPredictiveValidity joins canonical RunRecords to a DeploymentOutcomeStore and ranks rubrics by |spearman| against the outcomes that actually matter. Verdict bucketing (load_bearing / informative / decorative) tells you which rubrics earn their promotion power and which are merely decorative. Without this loop every rubric is faith-based.

Each is a standalone primitive but they compose:

- Replay makes outcome calibration cheaper to retrofit (re-score past runs with new rubrics without re-burning tokens).
- Sequential makes campaign cadence honest (peek every Tuesday).
- Outcome calibration tells sequential which rubrics to peek at.

Surface (root + subpaths):

- Root: ReplayCache, createReplayFetch, iterateRawCalls, ReplayCacheMissError, pairedEvalueSequence, evaluateInterimReleaseConfidence
- traces subpath: replay primitives
- reporting subpath: sequential primitives + rubricPredictiveValidity
- meta-eval barrel: rubricPredictiveValidity (alongside existing correlationStudy / OutcomeStore / calibrationCurve)

Tests:

- 910 / 910 passing (+27 dedicated cases across 3 new files): replay (cache build, lookup, miss policies, fallback, pass-through), sequential (continue/promote/reject/equivalent, type-I bound under the null, p-value monotonicity, clipping, configuration validation), rubric predictive validity (load-bearing vs decorative ranking, rubric discovery, minSamples / skipped-runs / no-data handling, sign-aware verdict).

Docs:

- SKILL.md: new sections "Replay & sequential evaluation" and "Outcome calibration", with runnable examples and citations.
- README.md: new Core Pieces rows.
- methodology doc: the "out of scope" entries for sequential inference and outcome calibration are now "shipped in 0.22", with references.

Migration: all four primitives are additive. Recommended sequence:

- Replace hand-rolled matrix runners with runEvalCampaign.
- Wire evaluateInterimReleaseConfidence into the rolling-campaign loop.
- Replay on every eval R&D iteration (free).
- Run rubricPredictiveValidity quarterly once ≥ 30 outcome rows exist.

References:

- Howard, S. R., Ramdas, A., McAuliffe, J., & Sekhon, J. (2021). Time-uniform, nonparametric, nonasymptotic confidence sequences. Annals of Statistics, 49(2), 1055–1080.
- Waudby-Smith, I., & Ramdas, A. (2024). Estimating means of bounded random variables by betting. Journal of the Royal Statistical Society: Series B, 86(1), 1–27.
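A sketch of the recommended sequence's second step — feed each campaign tick's paired deltas into `evaluateInterimReleaseConfidence` and stop the moment the anytime-valid verdict is decisive. Only the two library calls are real; `MAX_WEEKS`, `campaignOptions`, `seedsForWeek`, `candidateIds`, and `pairedDeltasByCandidate` are hypothetical glue for illustration:

```ts
// Sketch of a rolling-campaign loop with anytime-valid interim looks. Only runEvalCampaign and
// evaluateInterimReleaseConfidence come from the library; everything else here is placeholder glue.
import { runEvalCampaign } from '@tangle-network/agent-eval'
import { evaluateInterimReleaseConfidence } from '@tangle-network/agent-eval/reporting'

const allRuns: unknown[] = [] // accumulate RunRecords across weekly ticks

for (let week = 0; week < MAX_WEEKS; week++) {
  const tick = await runEvalCampaign({ ...campaignOptions, seeds: seedsForWeek(week) })
  allRuns.push(...tick.runs)

  const verdict = evaluateInterimReleaseConfidence({
    deltaSeries: candidateIds.map((id) => ({ candidateId: id, deltas: pairedDeltasByCandidate(allRuns, id) })),
    alpha: 0.05,
    rope: { low: -0.02, high: 0.02 },
  })
  // Anytime-valid: acting on any tick's verdict keeps type-I error bounded by α.
  if (verdict.recommendation.decision !== 'continue') break
}
```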
1 parent ac0c593 commit d763d00

21 files changed

Lines changed: 2448 additions & 7 deletions

.claude/skills/agent-eval/SKILL.md

Lines changed: 144 additions & 0 deletions
@@ -48,6 +48,10 @@ If a term below isn't in this table or in `docs/concepts.md`, that's a bug — f
| Detect silent judge fallback / calibration drift / distribution shift | `runCanaries` |
| Emit an A/B summary table or Pareto / gain figure spec | `summaryTable` / `paretoChart` / `gainHistogram` |
| Build a launch-decision-grade research report (paired stats, ROPE, MDE, fingerprint, methodology) | `researchReport` (§Research reports) |
| Run a matrix of variants × scenarios × seeds with capture integrity by construction | `runEvalCampaign` (§EvalCampaign — preferred starting point for new evals) |
| Re-run / re-judge / determinism-audit a past campaign for free | `ReplayCache` + `createReplayFetch` (§Replay & sequential evaluation) |
| Ship the moment evidence is decisive, with anytime-valid α control across rolling looks | `pairedEvalueSequence`, `evaluateInterimReleaseConfidence` (§Replay & sequential evaluation) |
| Tell load-bearing rubrics from decorative ones using deployment outcomes | `rubricPredictiveValidity` (§Outcome calibration) |
| Capture every provider HTTP request/response for forensics | `RawProviderSink` + `LlmClientOptions.rawSink` (§Capture integrity Directive 1) |
| Fail loud if the eval would silently use the wrong route | `assertLlmRoute` (§Capture integrity Directive 2) |
| Assert at run-end that the artifact is complete | `assertRunCaptured` + `throwIfRunIncomplete` (§Capture integrity Directive 3) |
@@ -68,6 +72,10 @@ Extend, don't fork — see §"Extend, don't duplicate."
| `runCanaries` | `canary.ts` | Silent fallback (constant confidence), calibration drift (KS), distribution shift (chi-square). Returns a report; doesn't fail tests — wire it to a notification. |
| `summaryTable`, `paretoChart`, `gainHistogram` | `summary-report.ts` | A/B reporting. `summaryTable` emits markdown with bootstrap CIs + paired Wilcoxon p (BH-adjusted) + Cohen's d. The other two return vega-lite-friendly specs. |
| `researchReport` | `summary-report.ts` | Async, launch-decision-grade artifact: paired-evidence-only verdicts (`promote` / `hold` / `equivalent` / `reject` / `needs_more_data`), ROPE, Pr(Δ>0), per-candidate MDE via `pairedMde`, SHA-256 `runFingerprint`, optional `preregistrationHash`, embedded methodology. See [`docs/research-report-methodology.md`](../../../docs/research-report-methodology.md). |
| `runEvalCampaign` | `eval-campaign.ts` | The capture-integrity directives, made structural. Variants × scenarios × seeds → `RunRecord[]` + integrity reports + (optional) `researchReport`. Wires `assertLlmRoute` at preflight, builds `TraceStore` + `RawProviderSink` + `TraceEmitter` per run, asserts `requireRawCoverageOfLlmSpans` at run-end, runs the analyst on completion. See §EvalCampaign. |
| `ReplayCache` + `createReplayFetch` + `iterateRawCalls` | `replay.ts` | Turns a populated `RawProviderSink` into a `(canonical request → cached response)` cache + a `fetch`-shaped shim. Pass via `LlmClientOptions.fetch` and `callLlm` reads from the cache transparently; zero LLM cost for re-judging, post-hoc scoring, or determinism audits. See §Replay & sequential evaluation. |
| `pairedEvalueSequence`, `evaluateInterimReleaseConfidence` | `sequential.ts` | Anytime-valid sequential evaluation: predictable plug-in betting martingale (Waudby-Smith & Ramdas 2024) + empirical Bernstein confidence sequence (Howard et al. 2021). Verdict at every interim look is type-I-error-controlled at α regardless of how many times you peeked. Pair with `runEvalCampaign` for ship-when-decisive. |
| `rubricPredictiveValidity` | `meta-eval/rubric-predictive-validity.ts` | The outcome-calibration loop: joins campaign `RunRecord`s to deployment `OutcomeStore` and ranks rubrics by `\|spearman\|` against each outcome metric, with bootstrap CI. Buckets: `'load_bearing' \| 'informative' \| 'decorative'`. Use to deprecate decorative rubrics, re-weight composites, trigger recalibration when validity drops. |
| `RawProviderSink` + `callLlm({ rawSink })` | `trace/raw-provider-sink.ts`, `llm-client.ts` | First-class HTTP-level capture alongside `LlmSpan`. `Authorization` / `X-Api-Key` / credential-shaped body fields auto-redacted; `event.redactedFields` records what was stripped. `FileSystemRawProviderSink` rolls at 32 MiB. **Every eval run wires this** — see Directive 1. |
| `assertLlmRoute` | `llm-client.ts` | Pure preflight guard. Throws `LlmRouteAssertionError` on missing baseUrl, blocked URL, missing auth, wrong provider. Call once at matrix-runner construction. See Directive 2. |
| `assertRunCaptured` + `throwIfRunIncomplete` | `trace/integrity.ts` | Read-only run-completion check. `requireRawCoverageOfLlmSpans` catches the bug class where structured spans were emitted but raw HTTP capture went to a different sink. See Directive 3. |
@@ -309,6 +317,142 @@ Fail closed; use `// muffle-ok: <reason>` for the rare exception.

---

## Replay & sequential evaluation (0.22+)

Once `runEvalCampaign` standardises the output (every run is a `RunRecord` plus a SHA-256-keyed raw-event log), two compounding capabilities open up:

### Replay — every past run is a re-runnable artifact

Trying a new judge no longer means re-burning a sweep. Build a `ReplayCache` from the populated `RawProviderSink` of a previous run, install `createReplayFetch(cache)` as the `fetch` for `callLlm`, and the network call resolves out of the cache.

```ts
import { ReplayCache, createReplayFetch } from '@tangle-network/agent-eval/traces'

const cache = await ReplayCache.fromSink(yesterdayCampaignSink)
const replayFetch = createReplayFetch(cache, { onMiss: 'fail-closed' })

await callLlm(req, { ...llmOpts, fetch: replayFetch }) // zero LLM cost
```

The cache hashes a canonical projection of the request body (`model + messages + temperature + max_tokens|max_completion_tokens + response_format`), so insertion-order quirks don't cause spurious misses. `onMiss` is `'throw' | 'fallback' | 'fail-closed'` — pick `fail-closed` for "I expect 100% replay; flag any new request as a determinism bug."

For post-hoc scoring that doesn't even need a `fetch` shim, iterate the cached `(request, response)` pairs directly with `iterateRawCalls(sink)` and run your scorer in pure TS.
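A sketch of that post-hoc path — no `fetch` shim, just iterate the capture and re-score in process. The exact shape of what `iterateRawCalls` yields and the `scoreText` rubric are assumptions here:

```ts
// Sketch: re-score yesterday's campaign from the raw capture alone, zero LLM calls.
// Assumes each yielded call exposes its provider response body OpenAI-style; adapt to your provider.
import { iterateRawCalls } from '@tangle-network/agent-eval/traces'

const rescored: number[] = []
for await (const { response } of iterateRawCalls(yesterdayCampaignSink)) {
  const text = response?.body?.choices?.[0]?.message?.content ?? ''
  rescored.push(scoreText(text)) // scoreText: your new pure-TS rubric
}
```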
### Sequential — ship the moment evidence is decisive

Real consumers run campaigns weekly / nightly / per-PR. Each new look at the data silently inflates type-I error — the BH-FDR guarantee covered the *first* analysis only. `pairedEvalueSequence(deltas, opts)` and `evaluateInterimReleaseConfidence({ deltaSeries })` ship time-uniform inference: type-I error is bounded by α at *every* stopping time.

```ts
import { evaluateInterimReleaseConfidence } from '@tangle-network/agent-eval/reporting'

const verdict = evaluateInterimReleaseConfidence({
  deltaSeries: candidates.map((c) => ({ candidateId: c.id, deltas: c.pairedDeltas })),
  alpha: 0.05,
  rope: { low: -0.02, high: 0.02 },
})
// → recommendation: { decision: 'promote_now' | 'continue' | 'reject_now' | 'equivalent', candidateId }
```

Methodology: predictable plug-in betting martingale (Waudby-Smith & Ramdas 2024) for the e-value, empirical Bernstein confidence sequence (Howard et al. 2021) for the running mean. Use `decisionFiredAt` to early-stop campaigns that are decisive at, say, 30 paired observations rather than burning all 100 you budgeted for.

**Common pattern:** call after every campaign tick. The recommendation is anytime-valid; if it returns `'continue'`, keep running; if it returns `'promote_now'` or `'reject_now'`, stop and act.
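For a single candidate you can call `pairedEvalueSequence` directly; a sketch, with the option and result-field names beyond what's documented above treated as assumptions:

```ts
// Sketch: lower-level entry point for one candidate's paired deltas.
// alpha and decisionFiredAt are documented above; the rest of the shape is assumed here.
import { pairedEvalueSequence } from '@tangle-network/agent-eval/reporting'

const seq = pairedEvalueSequence(candidate.pairedDeltas, { alpha: 0.05 })
if (seq.decisionFiredAt !== undefined) {
  console.log(`decisive after ${seq.decisionFiredAt} paired observations — stop the campaign early`)
}
```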
---

## Outcome calibration — does the rubric actually predict deployment? (0.22+)

Without this loop every rubric is faith-based. `rubricPredictiveValidity` joins canonical `RunRecord`s to a `DeploymentOutcomeStore` (matched on `runId`), computes Pearson + Spearman + bootstrap CI per (rubric, outcome) pair, and ranks rubrics by `|spearman|` against the outcomes that actually matter (revenue, retention, CSAT, churn, support-tickets, …).

```ts
import { rubricPredictiveValidity } from '@tangle-network/agent-eval/reporting'
import { FileSystemOutcomeStore } from '@tangle-network/agent-eval'

const validity = await rubricPredictiveValidity({
  runs: lastQuarterRuns, // RunRecord[] from runEvalCampaign
  outcomes: new FileSystemOutcomeStore({ root: PROD_OUTCOMES }),
  outcomeMetrics: ['revenue_lift', 'retention_30d', 'csat'],
})

for (const r of validity.ranked) {
  console.log(`${r.rubric} → ${r.bestOutcome}: ρ=${r.spearman.toFixed(2)} (${r.verdict})`)
}
```

Verdict bucketing on `|spearman|`:

- `load_bearing` ≥ 0.7 — keep, weight heavily, defend in launch reviews.
- `informative` ≥ 0.4 — useful as one signal among many; don't gate on it alone.
- `decorative` < 0.4 — score is uncorrelated with the outcome that matters; deprecate, demote in composite weighting, or trigger recalibration. **A rubric with a strong negative correlation against a desired outcome buckets as `load_bearing` by magnitude — inspect the sign before promoting it.**

Wire this on a quarterly cadence. When a previously-load-bearing rubric drifts toward `decorative` it's almost always one of: (a) the model has shifted, (b) the user base has changed, (c) the rubric has been overfit to last quarter's failure modes. Each has a different fix; the calibration check distinguishes them.
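A sketch of what the quarterly job does with those buckets — `deprecateRubric` and `flagForRecalibration` are placeholders for your own pipeline hooks, not library exports:

```ts
// Sketch: act on rubricPredictiveValidity verdicts in a quarterly calibration job.
// deprecateRubric / flagForRecalibration are placeholder hooks, not library exports.
for (const r of validity.ranked) {
  if (r.verdict === 'decorative') {
    deprecateRubric(r.rubric) // uncorrelated with the outcomes that matter — drop or down-weight
  } else if (r.verdict === 'load_bearing' && r.spearman < 0) {
    flagForRecalibration(r.rubric) // strong but inverted — inspect the sign before promoting
  }
}
```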
`correlationStudy` continues to ship for the lower-level case of joining a `TraceStore` to an `OutcomeStore` over arbitrary eval metrics. `rubricPredictiveValidity` is the campaign-shaped wrapper purpose-built for the `RunRecord` artifact.

---

## EvalCampaign — preferred starting point for new evals (0.22+)

The four capture-integrity directives below are the operational discipline. **`runEvalCampaign` is what wires them by construction.** New consumers should reach for the campaign primitive first; the directives become "things the framework owns," not "things you might forget."

```ts
import { runEvalCampaign } from '@tangle-network/agent-eval'
import { traceAnalystOnRunComplete, FileSystemTraceStore } from '@tangle-network/agent-eval/traces'

const result = await runEvalCampaign({
  campaignId: 'launch-2026-q2',
  commitSha: process.env.GIT_SHA!,
  variants: [
    { id: 'baseline', payload: { prompt: PROMPTS.v1 } },
    { id: 'cand-tool-repair', payload: { prompt: PROMPTS.v2 } },
  ],
  scenarios: scenarios, // [{ scenarioId: 'task-1' }, ...]
  seeds: [0, 1, 2, 3, 4],
  llmOpts: { baseUrl, apiKey, defaultTimeoutMs: 60_000 },
  storeFactory: ({ runId }) => new FileSystemTraceStore({ root: `${WORK}/trace/${runId}` }),
  workDir: WORK, // FileSystemRawProviderSink lands at WORK/raw-events/<runId>/
  onRunComplete: [traceAnalystOnRunComplete({ analyze: analystOpts, save })],
  preregistrationHash: signedManifest.contentHash,
  report: { comparator: 'baseline', rope: { low: -0.02, high: 0.02 } },
  runner: async (ctx) => {
    await ctx.emitter.startRun({ scenarioId: ctx.scenarioId, layer: 'app-runtime' })
    const { result } = await callLlmJson(req(ctx.variant), ctx.llmOpts) // raw HTTP captured by construction
    const score = await judgeOutput(result.content, ctx.scenarioId, ctx.llmOpts)
    await ctx.emitter.endRun({ pass: score > 0.5, score })
    return {
      pass: score > 0.5,
      score,
      costUsd: result.costUsd ?? 0,
      tokenUsage: { input: result.usage.promptTokens, output: result.usage.completionTokens },
      model: 'claude-sonnet-4-6@2025-04-15',
      promptHash: hashPrompt(ctx.variant.prompt),
      configHash: hashConfig(ctx.variant),
    }
  },
})

// result.runs: RunRecord[] for downstream pipelines
// result.integrityReports: per-run capture-integrity reports
// result.failedRuns: cells that threw or failed integrity (mark_failed default)
// result.report: researchReport — promote/hold/equivalent/reject + methodology
// result.campaignFingerprint: SHA-256 over the canonicalised plan
```

**What the campaign owns** so the consumer doesn't:
- `assertLlmRoute(llmOpts, { requireExplicitBaseUrl: true, requireAuth: true })` once at preflight.
- A fresh `TraceStore` and `RawProviderSink` per cell; the runner gets an `LlmClientOptions` already wired with `rawSink` + `traceContext`. Calling an LLM without capturing it requires actively bypassing the campaign.
- `assertRunCaptured(store, runId, { requireRawCoverageOfLlmSpans: true, requireOutcome: true })` after every `endRun` — integrity failures are routed by the campaign's failure policy (see the sketch after this list).
- Auto-execution of `traceAnalystOnRunComplete` if you pass an analyst config in `onRunComplete`.
- `researchReport` over the collected runs at the end with the campaign's `preregistrationHash` baked in.
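Routing of integrity failures is a campaign-level policy (`throw | mark_failed | log`, default `mark_failed`; sibling cells continue). A sketch of opting into hard-fail — the option name follows the changelog's `onIntegrityFailure`, and the exact options shape is an assumption:

```ts
// Sketch: strict integrity — any cell that fails assertRunCaptured aborts the campaign.
// Option name taken from the 0.22.0 changelog; exact EvalCampaignOptions shape is assumed here.
const strict = await runEvalCampaign({
  ...campaignOptions,
  onIntegrityFailure: 'throw', // 'throw' | 'mark_failed' | 'log' — default 'mark_failed'
})
```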
**When NOT to use the campaign:**
- Trajectory-shaped GEPA optimization → `runMultiShotOptimization` (steered prompts, paired seeds, intermediate metrics).
- Prompt + code evolution with mutation, sandbox pools, lineage → `runPromptEvolution` + `createCompositeMutator`.
- Long-running agent control loops with budgets → `runAgentControlLoop` (the campaign is for *measurement*, not the live runtime).

The four directives below remain the source of truth for *why* the campaign does what it does. Read them when something fails — the issue codes (`missing_raw_events`, `orphan_llm_span`, `no_explicit_base_url`, …) are the campaign's failure modes too.

---

## Capture integrity (REQUIRED for launch-grade adoption)

A run that *appears* successful but lost its forensic evidence is worse than a failed run — a launch reviewer can't distinguish "we measured a real win" from "we measured nothing on the wrong route." The four directives below are the operational discipline that turns the analytical primitives into a launch-grade artifact. **Skip one and the consumer's run is descriptive, not anchoring** — the same failure mode that prompted the 0.21.0 release.
