# Multi-Shot Optimization

`runMultiShotOptimization` is the public adapter for GEPA-style optimization over
variable-length agent conversations.

Use it when the thing you want to improve is not a single model call. Typical
targets are agent system prompts, tool descriptions, routing policies, retrieval
plans, or app-specific scaffolding that affects an entire task trajectory.

The primitive is intentionally small. Your app owns the domain logic:

- `seedVariants`: prompt/config/tool-policy candidates
- `runner`: executes one complete task trajectory for one variant
- `scorer`: scores the trajectory and emits actionable side information (ASI)
- `mutateAdapter`: proposes new variants from top and bottom trials

`agent-eval` owns the release-critical glue:

- stable paired seeds
- search-split prompt evolution
- cost/score Pareto objectives
- failed-run conversion into failed trials
- ASI projection into reflection traces and numeric metrics
- optional paired holdout gating through `HeldOutGate`
- validated `RunRecord` rows for promotion evidence
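
To make "cost/score Pareto objectives" concrete, here is a minimal sketch of a two-objective Pareto front. It is illustrative only: the `Candidate` type and the `dominates` and `paretoFront` helpers are hypothetical names, and the selection logic inside `agent-eval` may differ.

```ts
// Hypothetical shape; the real aggregate type in agent-eval may differ.
type Candidate = { id: string; score: number; cost: number }

// `a` dominates `b` when it is at least as good on both objectives
// (higher score, lower cost) and strictly better on at least one.
function dominates(a: Candidate, b: Candidate): boolean {
  return a.score >= b.score && a.cost <= b.cost && (a.score > b.score || a.cost < b.cost)
}

// Keep every candidate that no other candidate dominates.
function paretoFront(candidates: Candidate[]): Candidate[] {
  return candidates.filter((c) => !candidates.some((other) => dominates(other, c)))
}

const front = paretoFront([
  { id: 'baseline', score: 0.62, cost: 1.0 },
  { id: 'terse', score: 0.6, cost: 0.4 },
  { id: 'verbose', score: 0.61, cost: 1.8 },
  { id: 'tuned', score: 0.71, cost: 1.1 },
])
console.log(front.map((c) => c.id)) // [ 'baseline', 'terse', 'tuned' ]
```

`verbose` drops out because `baseline` scores higher at lower cost. The other three survive because each one wins on at least one objective against every rival.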

## Result Contract

The return shape separates discovery from promotion:

- `searchBestVariant`: best variant on the optimizer-visible search scenarios
- `searchBestAggregate`: aggregate for that search winner
- `promotedVariant`: variant callers should ship
- `promotedAggregate`: aggregate for the promoted variant
- `gate`: holdout decision and evidence, or `null` when no gate ran

If a holdout gate is configured and rejects the search winner,
`promotedVariant` is the baseline. Do not ship `searchBestVariant` directly
unless you intentionally run without a holdout gate.
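
A caller might log the promotion outcome before shipping, as in the sketch below. The `Variant` and `OptimizationResult` types are simplified, hypothetical stand-ins for the fields listed above, and `describePromotion` is not part of the library. The gate evidence is treated as opaque because its shape is not described here.

```ts
// Hypothetical, simplified mirror of the result fields described above.
type Variant = { id: string }
type OptimizationResult = {
  searchBestVariant: Variant
  promotedVariant: Variant
  gate: object | null // opaque here; only null vs. non-null is inspected
}

// Summarize which variant ships and why, without ever returning
// `searchBestVariant` as the thing to deploy.
function describePromotion(result: OptimizationResult): string {
  const { searchBestVariant, promotedVariant, gate } = result
  if (gate === null) {
    return `no gate ran; shipping ${promotedVariant.id} on search evidence only`
  }
  if (promotedVariant.id === searchBestVariant.id) {
    return `gate promoted ${promotedVariant.id}`
  }
  return `gate rejected ${searchBestVariant.id}; shipping ${promotedVariant.id}`
}
```

For example, a result whose search winner is `gen3-a` but whose `promotedVariant` is `baseline` yields `gate rejected gen3-a; shipping baseline`.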

## Actionable Side Information

The scorer should return `asi` rows for concrete failure modes:

```ts
{
  expectationId: 'used-primary-sources',
  message: 'The final answer cited secondary summaries instead of primary sources.',
  severity: 'error',
  responsibleSurface: 'retrieval-policy',
  suggestion: 'Prefer primary-source domains during source-gathering turns.',
}
```

These rows become:

- reflection expectations via `trialTraceFromMultiShotTrial`
- aggregate metrics like `asi.error` and `surface.retrieval-policy`
- trace evidence available to downstream reports

This is the main reason to use this primitive instead of reducing each run to a
single scalar reward.
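
As a rough illustration of how rows could turn into metrics such as `asi.error` and `surface.retrieval-policy`, the sketch below counts rows per severity and per surface. It makes several assumptions: the `AsiRow` type is inferred from the example row, the severity union is a guess, and the real projection may compute rates or weights rather than raw counts.

```ts
// Row shape inferred from the example above; the severity union is an assumption.
type AsiRow = {
  expectationId: string
  message: string
  severity: 'error' | 'warning' | 'info'
  responsibleSurface: string
  suggestion?: string
}

// Count rows per severity and per surface, using `asi.<severity>` and
// `surface.<name>` as metric keys.
function countAsi(rows: AsiRow[]): Record<string, number> {
  const metrics: Record<string, number> = {}
  for (const row of rows) {
    const keys = [`asi.${row.severity}`, `surface.${row.responsibleSurface}`]
    for (const key of keys) metrics[key] = (metrics[key] ?? 0) + 1
  }
  return metrics
}

const metrics = countAsi([
  { expectationId: 'used-primary-sources', message: 'Cited summaries.', severity: 'error', responsibleSurface: 'retrieval-policy' },
  { expectationId: 'stayed-on-task', message: 'Drifted off topic.', severity: 'error', responsibleSurface: 'system-prompt' },
  { expectationId: 'bounded-searches', message: 'Ran redundant searches.', severity: 'warning', responsibleSurface: 'retrieval-policy' },
])
console.log(metrics['asi.error'], metrics['surface.retrieval-policy']) // 2 2
```

Per-surface counts point a mutation step at the component most often blamed for failures, which a single scalar reward cannot do.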

## Holdout Discipline

For release gates, configure `gate`. The first seed variant is the baseline and
`gate.gate.baselineKey` must match its id.

Holdout scenarios must be disjoint from `searchScenarioIds`. The adapter runs
baseline and candidate with the same `(scenarioId, rep)` seed, validates every
row with `validateRunRecord`, then asks `HeldOutGate` whether to promote.

When `gate.searchScenarioIds` is omitted, the adapter reuses
`searchScenarioIds` for the overfit-gap check.
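
The pairing and disjointness rules above can be sketched as follows. This is not the adapter's implementation: `fnv1a`, `pairedSeed`, and `overlap` are hypothetical helpers, and the hash choice is an assumption. The sketch shows two things: a seed that depends only on `(scenarioId, rep)`, so baseline and candidate see the same randomness, and a check you could run on your own scenario lists before configuring a gate.

```ts
// FNV-1a style hash over UTF-16 code units, returned as an unsigned 32-bit integer.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash >>> 0
}

// The seed deliberately ignores the variant id, so every variant run on the
// same (scenarioId, rep) pair receives an identical seed.
function pairedSeed(scenarioId: string, rep: number): number {
  return fnv1a(`${scenarioId}#${rep}`)
}

// Return the holdout ids that also appear in the search split.
// An empty result means the two splits are disjoint.
function overlap(holdoutIds: string[], searchIds: string[]): string[] {
  const search = new Set(searchIds)
  return holdoutIds.filter((id) => search.has(id))
}

const leaked = overlap(['h1', 's2'], ['s1', 's2'])
if (leaked.length > 0) {
  console.warn(`holdout scenarios leaked into search: ${leaked.join(', ')}`)
}
```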

## Minimal Shape

```ts
import {
  runMultiShotOptimization,
  trialTraceFromMultiShotTrial,
  type MultiShotVariant,
} from '@tangle-network/agent-eval'

// `currentPrompt`, `searchScenarios`, `runYourAgentToCompletion`,
// `scoreFullTrajectory`, `proposePromptMutations`, and `deploy` are
// placeholders for code your app supplies.

type Payload = { systemPrompt: string }

// The first seed variant is treated as the baseline.
const baseline: MultiShotVariant<Payload> = {
  id: 'baseline',
  label: 'baseline',
  generation: 0,
  payload: { systemPrompt: currentPrompt },
}

const result = await runMultiShotOptimization<Payload>({
  runId: `research-agent-${Date.now()}`,
  target: 'research-agent-system-prompt',
  seedVariants: [baseline],
  searchScenarioIds: searchScenarios.map((s) => s.id),
  reps: 2,
  generations: 4,
  populationSize: 4,
  scoreConcurrency: 4,
  runner: {
    // Execute one complete task trajectory for one variant.
    async run({ variant, scenarioId, seed }) {
      return runYourAgentToCompletion({ scenarioId, seed, prompt: variant.payload.systemPrompt })
    },
  },
  scorer: {
    // Score the whole trajectory and emit ASI rows.
    async score({ run }) {
      return scoreFullTrajectory(run.trace)
    },
  },
  mutateAdapter: {
    // Propose children from the parent and the traces of its weakest trials.
    async mutate({ parent, bottomTrials, childCount, generation }) {
      const traces = bottomTrials.map((t) => trialTraceFromMultiShotTrial(t))
      return proposePromptMutations({ parent, traces, childCount, generation })
    },
  },
})

// No `gate` is configured in this example, so `result.gate` is `null`.
deploy(result.promotedVariant.payload)
```