@@ -83,11 +83,49 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
 | `evaluateActionPolicy` | Generic action preflight for approval, budget, expected-outcome, and kill-criteria checks. | [feature-guide.md](./docs/feature-guide.md) |
 | `ExperimentTracker`, steering optimizers, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
 | `runMultiShotOptimization`, `trialTraceFromMultiShotTrial` | GEPA-style optimization for variable-length agent trajectories with ASI, paired seeds, and optional held-out promotion gating. | [multi-shot-optimization.md](./docs/multi-shot-optimization.md) |
+| `evaluateReleaseConfidence`, `assertReleaseConfidence` | Release scorecard that composes corpus coverage, search/holdout run evidence, ASI diagnostics, overfit checks, and cost/latency budgets. | §Release confidence |
 | `runPromptEvolution`, `createCompositeMutator`, `createSandboxPool`, `createSandboxCodeMutator`, `MutationTelemetry`, `LineageRecorder`, `CostLedger`, `JsonlTrialCache` | Prompt + code evolution loops with bounded sandbox pools, durable JSONL telemetry, plateau-detecting composite mutators, crash-resumable trial cache. | §Evolution loop |
 | `reflective-mutation` (`buildReflectionPrompt`, `parseReflectionResponse`, `DEFAULT_MUTATION_PRIMITIVES`) | Trace-conditioned LLM mutator that reasons over top/bottom trials instead of blind rewrites. | inline JSDoc |
 | `correlationStudy`, `OutcomeStore`, `ProductRegistry` | Meta-eval: do our scores predict deployment outcomes (revenue, retention)? | inline JSDoc |
 | Telemetry (`telemetry/`, `telemetry/file`) | OTLP export, trace replay, file sinks. | inline JSDoc |
 
+## Release confidence
+
+Use `evaluateReleaseConfidence` at the release boundary for every consuming
+agent surface. It fails closed unless the release has a versioned corpus,
+search and holdout run evidence, score/pass-rate evidence, ASI for failures,
+and budget/overfit checks. Single-shot and multi-shot apps use the same path:
+single-shot traces are just trace evidence with `turnCount: 1`.
+
+```ts
+import {
+  evaluateReleaseConfidence,
+  releaseTraceEvidenceFromMultiShotTrials,
+} from '@tangle-network/agent-eval'
+
+const scorecard = evaluateReleaseConfidence({
+  target: 'blueprint-agent/autoresearch',
+  candidateId: 'candidate-v3',
+  baselineId: 'baseline',
+  dataset: await dataset.manifest(),
+  runs: [...candidateRuns, ...baselineRuns],
+  traces: releaseTraceEvidenceFromMultiShotTrials(result.evolution.generations.flatMap((g) => g.trials)),
+  gateDecision: result.gate?.decision,
+  thresholds: {
+    minScenarioCount: 50,
+    minSearchRuns: 50,
+    minHoldoutRuns: 20,
+    minPassRate: 0.9,
+    minMeanScore: 0.8,
+    maxOverfitGap: 0.1,
+    maxMeanCostUsd: 0.05,
+    maxP95WallMs: 120_000,
+  },
+})
+
+if (!scorecard.promote) throw new Error(scorecard.summary)
+```
+
 ## Evolution loop
 
 For agent tasks that run across many chat turns or tool calls, start with
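Review note on the "fails closed" wording in the added Release confidence section: the sketch below illustrates one way such a gate could behave, where missing evidence counts as a failure rather than a pass. It is not the `@tangle-network/agent-eval` implementation. `RunEvidence`, `Thresholds`, `Verdict`, and `gate` are hypothetical names; only the threshold field names are borrowed from the example in the diff. It also assumes that "overfit gap" means the search mean score minus the holdout mean score, and that p95 is computed by nearest rank, neither of which the diff states.

```typescript
// Hypothetical types for illustration; NOT the @tangle-network/agent-eval API.
type Split = 'search' | 'holdout'

interface RunEvidence {
  split: Split
  passed: boolean
  score: number // assumed to lie in [0, 1]
  wallMs: number
}

interface Thresholds {
  minSearchRuns: number
  minHoldoutRuns: number
  minPassRate: number
  maxOverfitGap: number
  maxP95WallMs: number
}

interface Verdict {
  promote: boolean
  failures: string[]
}

const mean = (xs: number[]): number =>
  xs.reduce((a, b) => a + b, 0) / xs.length

// Nearest-rank 95th percentile; an empty input yields +Infinity so that
// any latency budget check on it fails rather than passes.
const p95 = (xs: number[]): number => {
  const sorted = [...xs].sort((a, b) => a - b)
  const idx = Math.max(0, Math.ceil(0.95 * sorted.length) - 1)
  return sorted[idx] ?? Number.POSITIVE_INFINITY
}

function gate(runs: RunEvidence[], t: Thresholds): Verdict {
  const failures: string[] = []
  const search = runs.filter((r) => r.split === 'search')
  const holdout = runs.filter((r) => r.split === 'holdout')

  // Fail closed: absent or insufficient evidence is itself a failure.
  if (search.length === 0 || search.length < t.minSearchRuns) {
    failures.push('insufficient search runs')
  }
  if (holdout.length === 0 || holdout.length < t.minHoldoutRuns) {
    failures.push('insufficient holdout runs')
  }

  // Metrics are only computed when there is evidence to compute them from.
  if (holdout.length > 0) {
    const passRate = holdout.filter((r) => r.passed).length / holdout.length
    if (passRate < t.minPassRate) failures.push('holdout pass rate below minimum')
    if (p95(holdout.map((r) => r.wallMs)) > t.maxP95WallMs) {
      failures.push('holdout p95 wall time over budget')
    }
  }
  if (search.length > 0 && holdout.length > 0) {
    // Assumed definition: search mean score minus holdout mean score.
    const gap = mean(search.map((r) => r.score)) - mean(holdout.map((r) => r.score))
    if (gap > t.maxOverfitGap) failures.push('overfit gap too large')
  }

  return { promote: failures.length === 0, failures }
}

// Usage: with no runs at all, the verdict lists the missing evidence.
const verdict = gate([], {
  minSearchRuns: 50,
  minHoldoutRuns: 20,
  minPassRate: 0.9,
  maxOverfitGap: 0.1,
  maxP95WallMs: 120_000,
})
console.log(verdict.promote, verdict.failures)
```

Collecting every failure instead of returning on the first one keeps the verdict useful as a summary, which is roughly what the `scorecard.summary` usage in the diff suggests callers expect.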