Commit d763d00
feat: 0.22.0 — EvalCampaign + replay + always-valid + outcome calibration (#42)
* feat: 0.22.0 — runEvalCampaign, capture integrity by construction
0.21 shipped the four capture-integrity primitives (RawProviderSink,
assertLlmRoute, assertRunCaptured, onRunComplete hooks) as opt-in.
Every consumer still had to wire them by hand, and the bug class
blueprint-agent reported (forgotten wiring → silent partial-capture)
reappears the moment a new consumer adopts agent-eval cold.
0.22 makes the right thing the default path. runEvalCampaign is an
opinionated matrix runner that owns the integrity surface so consumers
stop reinventing it.
What it owns:
- assertLlmRoute() once at preflight, with requireExplicitBaseUrl +
requireAuth defaults. Misconfigured routes never burn a run.
- Per cell: TraceStore + RawProviderSink + TraceEmitter constructed
from caller-supplied factories. The runner receives an
LlmClientOptions pre-wired with rawSink + traceContext — calling an
LLM without capturing it requires actively bypassing the campaign.
- assertRunCaptured() after every endRun with
requireRawCoverageOfLlmSpans + requireOutcome defaults. Failure
policy: throw | mark_failed | log (default mark_failed; sibling
cells continue).
- onRunComplete hooks — pass traceAnalystOnRunComplete to auto-run
the analyst as part of the run lifecycle.
- End of campaign: researchReport over the collected RunRecords with
the campaign fingerprint + preregistrationHash baked in.
Determinism + isolation:
- Default runId is a stable hash of (campaignId, variantId, scenarioId,
seed). Re-running the same campaign produces the same ids.
- Campaign fingerprint is a SHA-256 over the canonicalised plan
(variants, scenarios, seeds, splitTag, baseUrl, provider,
preregistrationHash) — stable across permutations.
- Local async worker pool, default concurrency 1.
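The deterministic-id idea above can be sketched in a few lines. This is an illustrative reconstruction, not the library's implementation: the field names (campaignId, variantId, scenarioId, seed) come from the text, while the canonicalisation (a fixed-order JSON array) and the 16-character truncation are assumptions.

```typescript
// Hypothetical sketch: stable runId via SHA-256 over a canonicalised tuple.
import { createHash } from "node:crypto";

function stableRunId(
  campaignId: string,
  variantId: string,
  scenarioId: string,
  seed: number,
): string {
  // JSON over a fixed-order array is order-stable by construction,
  // so re-running the same campaign yields the same ids.
  const canonical = JSON.stringify([campaignId, variantId, scenarioId, seed]);
  return createHash("sha256").update(canonical).digest("hex").slice(0, 16);
}
```

The same canonicalise-then-hash pattern extends to the campaign fingerprint: hash the full plan in a fixed field order so permutations of the input collections do not change the digest.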
Failure isolation:
- Runner throws → cell marked failed, others continue.
- Integrity fails → routed by onIntegrityFailure policy.
- Genuine non-runner exceptions propagate (don't mask bugs).
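The isolation semantics can be illustrated with a minimal self-contained sketch. The names here (runCells, CellResult) are invented for the example and are not the library's API; only the policy vocabulary (throw / mark_failed / log, with mark_failed as default) comes from the text.

```typescript
// Sketch of per-cell failure isolation under the three policies.
type IntegrityPolicy = "throw" | "mark_failed" | "log";

interface CellResult {
  cellId: string;
  status: "ok" | "failed";
}

function runCells(
  cells: { cellId: string; run: () => void }[],
  policy: IntegrityPolicy = "mark_failed",
): CellResult[] {
  const results: CellResult[] = [];
  for (const cell of cells) {
    try {
      cell.run();
      results.push({ cellId: cell.cellId, status: "ok" });
    } catch (err) {
      if (policy === "throw") throw err; // surface immediately, abort campaign
      if (policy === "log") console.error(`cell ${cell.cellId} failed:`, err);
      // mark_failed (the default): record the failure, siblings continue
      results.push({ cellId: cell.cellId, status: "failed" });
    }
  }
  return results;
}
```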
Surface:
- runEvalCampaign exported from root and @tangle-network/agent-eval/optimization.
- Types: CampaignRunner, CampaignRunContext, CampaignRunOutcome,
CampaignVariant, CampaignScenario, EvalCampaignOptions,
EvalCampaignResult, FailedRun, CampaignIntegrityPolicy,
CampaignFactoryParams.
- NoopRawProviderSink.list() now returns [] so explicit opt-out from
capture is not flagged as no_raw_sink by assertRunCaptured. Opt-out
remains a deliberate choice — caller still has to override
integrity expectations to admit the run.
Tests:
- 883 / 883 passing (+16 dedicated runEvalCampaign cases): happy path,
research report end-to-end, fingerprint stability across
permutations, preregistration passthrough, route preflight
failures, validation errors, runner-throws-with-isolation, all
three integrity policies, concurrency.
- typecheck + build clean.
Docs:
- SKILL.md: new "EvalCampaign — preferred starting point" section
BEFORE the capture-integrity directive list, with a full runnable
example and explicit when-not-to-use guidance pointing at
runMultiShotOptimization, runPromptEvolution, runAgentControlLoop.
- Discoverability rows added to the "Decide where to start" and
"Production-rigor primitives" tables.
Version lockstep: npm 0.22.0 ↔ PyPI agent-eval-rpc 0.22.0.
Migration: existing consumers don't need to change. runEvalCampaign is
additive. The recommended path is to replace hand-rolled matrix runners
with a single runEvalCampaign call on the next eval-runner refactor.
The capture-integrity directives go from "things you might forget" to
"things the framework owns."
* feat: 0.22.0 — replay, anytime-valid sequential, outcome calibration
Three primitives that compound on top of the EvalCampaign artifact:
1. Replay-from-raw-events: every captured campaign is a re-runnable
artifact. ReplayCache + createReplayFetch turn yesterday's raw
provider events into a deterministic fetch-shaped cache. Re-judge,
re-score, or determinism-audit without burning a single LLM token.
2. Anytime-valid sequential evaluation: pairedEvalueSequence and
evaluateInterimReleaseConfidence ship the predictable plug-in
betting martingale of Waudby-Smith & Ramdas (2024) paired with the
empirical Bernstein confidence sequence of Howard et al. (2021).
Type-I error is bounded by α at every stopping time — peek at every
campaign tick without inflating the false-positive rate. Tested under
the null at α=0.05 on 100 synthetic series; the bound holds.
3. Outcome calibration loop: rubricPredictiveValidity joins canonical
RunRecords to a DeploymentOutcomeStore and ranks rubrics by
|spearman| against the outcomes that actually matter. Verdict
bucketing (load_bearing / informative / decorative) tells you which
rubrics earn their promotion power and which merely ride along.
Without this loop every rubric is faith-based.
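The replay idea reduces to a keyed cache with a fetch-shaped front. The sketch below is illustrative only — the real ReplayCache / createReplayFetch API and its miss policies may differ; the names MiniReplayCache and makeReplayFetch are invented for the example.

```typescript
// Sketch: captured raw provider events become a deterministic lookup.
type RawEvent = { key: string; body: string };

class MiniReplayCache {
  private byKey = new Map<string, string>();
  constructor(events: RawEvent[]) {
    for (const e of events) this.byKey.set(e.key, e.body);
  }
  lookup(key: string): string | undefined {
    return this.byKey.get(key);
  }
}

// A fetch-shaped function: hits the cache first; on a miss, either
// throws (strict replay, zero tokens burned) or falls through to a
// caller-supplied live fetch.
function makeReplayFetch(
  cache: MiniReplayCache,
  onMiss: "throw" | ((key: string) => string),
) {
  return (key: string): string => {
    const hit = cache.lookup(key);
    if (hit !== undefined) return hit;
    if (onMiss === "throw") throw new Error(`replay miss: ${key}`);
    return onMiss(key);
  };
}
```

Re-judging or re-scoring yesterday's campaign then means running the same pipeline with the replay fetch swapped in for the live one.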
Each is a standalone primitive but they compose:
- Replay makes outcome-calibration cheaper to retrofit (re-score past
runs with new rubrics without re-burning).
- Sequential makes campaign cadence honest (peek every Tuesday).
- Outcome calibration tells sequential which rubrics to peek at.
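The sequential primitive can be sketched as a betting martingale in the spirit of the predictable plug-in of Waudby-Smith & Ramdas (2024). This is a simplified worked example, not the shipped pairedEvalueSequence: the bet sizing and clipping constants are illustrative. The null is that paired scores x_t ∈ [0, 1] have mean 1/2 (no difference between variants); under the null the wealth process is a nonnegative martingale, so by Ville's inequality P(∃t: W_t ≥ 1/α) ≤ α — you may peek and stop at any time.

```typescript
// Sketch: anytime-valid e-process via a predictable betting martingale.
function eProcess(xs: number[], alpha = 0.05): { wealth: number; reject: boolean } {
  let wealth = 1;
  let sum = 0;
  let n = 0;
  let reject = false;
  for (const x of xs) {
    // Predictable bet: uses only past observations, clipped so the
    // wealth multiplier 1 + lambda * (x - 1/2) stays strictly positive.
    const mean = n > 0 ? sum / n : 0.5;
    const lambda = Math.max(-1.8, Math.min(1.8, 4 * (mean - 0.5)));
    wealth *= 1 + lambda * (x - 0.5);
    if (wealth >= 1 / alpha) reject = true; // anytime-valid rejection
    sum += x;
    n += 1;
  }
  return { wealth, reject };
}
```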
Surface (root + subpaths):
- Root: ReplayCache, createReplayFetch, iterateRawCalls, ReplayCacheMissError,
pairedEvalueSequence, evaluateInterimReleaseConfidence
- traces subpath: replay primitives
- reporting subpath: sequential primitives + rubricPredictiveValidity
- meta-eval barrel: rubricPredictiveValidity (alongside existing
correlationStudy / OutcomeStore / calibrationCurve)

Tests:
- 910 / 910 passing (+27 dedicated cases across 3 new files):
replay (cache build, lookup, miss policies, fallback, pass-through),
sequential (continue/promote/reject/equivalent, type-I bound under
the null, p-value monotonicity, clipping, configuration validation),
rubric predictive validity (load-bearing vs decorative ranking,
rubric discovery, minSamples / skipped-runs / no-data handling,
sign-aware verdict).
Docs:
- SKILL.md: new sections "Replay & sequential evaluation" and
"Outcome calibration", with runnable examples and citations.
- README.md: new Core Pieces rows.
- methodology doc: "out of scope" entries for sequential inference and
outcome calibration are now "shipped in 0.22" with the references.
Migration: all four primitives are additive. Recommended sequence:
- Replace hand-rolled matrix runners with runEvalCampaign.
- Wire evaluateInterimReleaseConfidence into the rolling-campaign loop.
- Replay on every eval R&D iteration (free).
- Run rubricPredictiveValidity quarterly once ≥ 30 outcome rows exist.
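The ranking step of the calibration loop can be sketched as Spearman correlation (Pearson over ranks) plus verdict bucketing. The bucket thresholds here (0.5 / 0.2) are invented for the example; the shipped rubricPredictiveValidity may bucket differently.

```typescript
// Sketch: rank rubric scores against deployment outcomes by |Spearman|.
function ranks(xs: number[]): number[] {
  const order = xs.map((v, i) => [v, i] as const).sort((a, b) => a[0] - b[0]);
  const r = new Array<number>(xs.length).fill(0);
  order.forEach(([, i], rank) => (r[i] = rank + 1));
  return r;
}

function spearman(a: number[], b: number[]): number {
  const ra = ranks(a);
  const rb = ranks(b);
  const n = a.length;
  const mean = (xs: number[]) => xs.reduce((s, v) => s + v, 0) / n;
  const ma = mean(ra);
  const mb = mean(rb);
  let num = 0, da = 0, db = 0;
  for (let i = 0; i < n; i++) {
    num += (ra[i] - ma) * (rb[i] - mb);
    da += (ra[i] - ma) ** 2;
    db += (rb[i] - mb) ** 2;
  }
  return num / Math.sqrt(da * db);
}

// Hypothetical thresholds: high |rho| earns promotion power.
function verdict(rho: number): "load_bearing" | "informative" | "decorative" {
  const a = Math.abs(rho);
  return a >= 0.5 ? "load_bearing" : a >= 0.2 ? "informative" : "decorative";
}
```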
References:
- Howard, S. R., Ramdas, A., McAuliffe, J., Sekhon, J. (2021).
Time-uniform, nonparametric, nonasymptotic confidence sequences.
Annals of Statistics, 49(2), 1055–1080.
- Waudby-Smith, I., Ramdas, A. (2024). Estimating means of bounded
random variables by betting. JRSS B, 86(1), 1–27.
21 files changed
Lines changed: 2448 additions & 7 deletions