Commit 7e875ae

feat(confidence): add release scorecard gate
1 parent 7fdc886 commit 7e875ae

4 files changed

Lines changed: 740 additions & 0 deletions

README.md

Lines changed: 38 additions & 0 deletions
@@ -83,11 +83,49 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
 | `evaluateActionPolicy` | Generic action preflight for approval, budget, expected-outcome, and kill-criteria checks. | [feature-guide.md](./docs/feature-guide.md) |
 | `ExperimentTracker`, steering optimizers, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
 | `runMultiShotOptimization`, `trialTraceFromMultiShotTrial` | GEPA-style optimization for variable-length agent trajectories with ASI, paired seeds, and optional held-out promotion gating. | [multi-shot-optimization.md](./docs/multi-shot-optimization.md) |
+| `evaluateReleaseConfidence`, `assertReleaseConfidence` | Release scorecard that composes corpus coverage, search/holdout run evidence, ASI diagnostics, overfit checks, and cost/latency budgets. | §Release confidence |
 | `runPromptEvolution`, `createCompositeMutator`, `createSandboxPool`, `createSandboxCodeMutator`, `MutationTelemetry`, `LineageRecorder`, `CostLedger`, `JsonlTrialCache` | Prompt + code evolution loops with bounded sandbox pools, durable JSONL telemetry, plateau-detecting composite mutators, crash-resumable trial cache. | §Evolution loop |
 | `reflective-mutation` (`buildReflectionPrompt`, `parseReflectionResponse`, `DEFAULT_MUTATION_PRIMITIVES`) | Trace-conditioned LLM mutator that reasons over top/bottom trials instead of blind rewrites. | inline JSDoc |
 | `correlationStudy`, `OutcomeStore`, `ProductRegistry` | Meta-eval: do our scores predict deployment outcomes (revenue, retention)? | inline JSDoc |
 | Telemetry (`telemetry/`, `telemetry/file`) | OTLP export, trace replay, file sinks. | inline JSDoc |

+## Release confidence
+
+Use `evaluateReleaseConfidence` at the release boundary for every consuming
+agent surface. It fails closed unless the release has a versioned corpus,
+search and holdout run evidence, score/pass-rate evidence, ASI for failures,
+and budget/overfit checks. Single-shot and multi-shot apps use the same path:
+single-shot traces are just trace evidence with `turnCount: 1`.
+
+```ts
+import {
+  evaluateReleaseConfidence,
+  releaseTraceEvidenceFromMultiShotTrials,
+} from '@tangle-network/agent-eval'
+
+const scorecard = evaluateReleaseConfidence({
+  target: 'blueprint-agent/autoresearch',
+  candidateId: 'candidate-v3',
+  baselineId: 'baseline',
+  dataset: await dataset.manifest(),
+  runs: [...candidateRuns, ...baselineRuns],
+  traces: releaseTraceEvidenceFromMultiShotTrials(result.evolution.generations.flatMap((g) => g.trials)),
+  gateDecision: result.gate?.decision,
+  thresholds: {
+    minScenarioCount: 50,
+    minSearchRuns: 50,
+    minHoldoutRuns: 20,
+    minPassRate: 0.9,
+    minMeanScore: 0.8,
+    maxOverfitGap: 0.1,
+    maxMeanCostUsd: 0.05,
+    maxP95WallMs: 120_000,
+  },
+})
+
+if (!scorecard.promote) throw new Error(scorecard.summary)
+```
+
 ## Evolution loop

 For agent tasks that run across many chat turns or tool calls, start with
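
The README text added in this hunk says the gate "fails closed" and lists thresholds such as `maxOverfitGap` and `maxP95WallMs`, but the implementation in `release-confidence.ts` is not part of the diff shown here. The following is only a minimal sketch of what a fail-closed threshold check could look like. Every name in it (`RunSketch`, `ThresholdsSketch`, `gateSketch`, `p95`) is hypothetical, and so are the assumptions that pass rate and mean score are measured on holdout runs and that the overfit gap is the search mean score minus the holdout mean score.

```typescript
// Hypothetical shapes for illustration; NOT the types exported by the package.
interface RunSketch {
  split: 'search' | 'holdout'
  passed: boolean
  score: number // assumed to lie in 0..1
  costUsd: number
  wallMs: number
}

interface ThresholdsSketch {
  minSearchRuns: number
  minHoldoutRuns: number
  minPassRate: number
  minMeanScore: number
  maxOverfitGap: number
  maxMeanCostUsd: number
  maxP95WallMs: number
}

// Mean of a list; NaN for an empty list so that "no data" can never pass a check.
const mean = (xs: number[]): number =>
  xs.length === 0 ? NaN : xs.reduce((a, b) => a + b, 0) / xs.length

// Nearest-rank p95: the smallest sample with at least 95% of samples at or below it.
function p95(xs: number[]): number {
  if (xs.length === 0) return NaN
  const sorted = [...xs].sort((a, b) => a - b)
  return sorted[Math.ceil(0.95 * sorted.length) - 1]
}

function gateSketch(
  runs: RunSketch[],
  t: ThresholdsSketch,
): { promote: boolean; issues: string[] } {
  const search = runs.filter((r) => r.split === 'search')
  const holdout = runs.filter((r) => r.split === 'holdout')
  const issues: string[] = []

  // Missing evidence counts as a failure, not as "nothing to check".
  if (search.length < t.minSearchRuns) issues.push('search-runs')
  if (holdout.length < t.minHoldoutRuns) issues.push('holdout-runs')

  const passRate = mean(holdout.map((r) => (r.passed ? 1 : 0)))
  const holdoutScore = mean(holdout.map((r) => r.score))
  // Assumed definition: how much better the candidate looks on data it was tuned on.
  const overfitGap = mean(search.map((r) => r.score)) - holdoutScore

  // Written as !(x >= min) / !(x <= max) so that NaN fails every comparison.
  if (!(passRate >= t.minPassRate)) issues.push('pass-rate')
  if (!(holdoutScore >= t.minMeanScore)) issues.push('mean-score')
  if (!(overfitGap <= t.maxOverfitGap)) issues.push('overfit-gap')
  if (!(mean(runs.map((r) => r.costUsd)) <= t.maxMeanCostUsd)) issues.push('mean-cost')
  if (!(p95(runs.map((r) => r.wallMs)) <= t.maxP95WallMs)) issues.push('p95-wall')

  return { promote: issues.length === 0, issues }
}
```

The design point the sketch tries to capture is that promotion is the exception: an empty `runs` array produces one issue per check instead of passing by default, which is one reasonable reading of "fails closed".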

src/index.ts

Lines changed: 17 additions & 0 deletions
@@ -793,6 +793,23 @@ export type {
   MultiShotVariant,
 } from './multi-shot-optimization'

+export {
+  assertReleaseConfidence,
+  evaluateReleaseConfidence,
+  releaseTraceEvidenceFromMultiShotTrials,
+} from './release-confidence'
+export type {
+  ReleaseConfidenceAxis,
+  ReleaseConfidenceAxisName,
+  ReleaseConfidenceInput,
+  ReleaseConfidenceIssue,
+  ReleaseConfidenceMetrics,
+  ReleaseConfidenceScorecard,
+  ReleaseConfidenceStatus,
+  ReleaseConfidenceThresholds,
+  ReleaseTraceEvidence,
+} from './release-confidence'
+
 // ── 0.14.0: concurrency + persistence + telemetry primitives for evolution loops ──
 export { Mutex } from './concurrency'
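
The export list pairs `evaluateReleaseConfidence` with `assertReleaseConfidence`, but the diff shown here does not include the latter's signature or behavior. One plausible reading is that it packages the `if (!scorecard.promote) throw ...` line from the README example. The sketch below illustrates that pattern with a local, hypothetical helper (`assertPromotable`) and a hypothetical type (`ScorecardSketch`) that models only the two scorecard fields the README example reads, `promote` and `summary`; it makes no claim about the real export.

```typescript
// Only the two fields the README example reads; the real
// ReleaseConfidenceScorecard type may carry more (assumption).
interface ScorecardSketch {
  promote: boolean
  summary: string
}

// Hypothetical helper: throws with the scorecard summary when promotion is
// blocked, otherwise returns the scorecard unchanged so calls can be chained.
function assertPromotable<T extends ScorecardSketch>(scorecard: T): T {
  if (!scorecard.promote) throw new Error(scorecard.summary)
  return scorecard
}
```

A throwing variant suits CI release steps, where an uncaught error fails the job, while the non-throwing scorecard is more convenient when the caller wants to log or display the individual issues first.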
