
Commit cfb082d

release multi-shot optimization 0.18.0
Release 0.18.0 with the multi-shot optimization adapter, docs, example, version bump, package metadata cleanup, and removal of historical checked-in package tarballs.
1 parent b16d830 commit cfb082d

17 files changed

Lines changed: 1102 additions & 12 deletions

CHANGELOG.md

Lines changed: 26 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,31 @@
11
# Changelog
22

3+
## 0.18.0 — multi-shot optimization
4+
5+
### Added
6+
7+
- `runMultiShotOptimization`, the canonical GEPA-style adapter for
8+
variable-length agent trajectories. It wraps `runPromptEvolution` while
9+
preserving full multi-shot traces, actionable side information, stable paired
10+
seeds, score/cost objectives, and optional held-out promotion gating.
11+
- `trialTraceFromMultiShotTrial`, a bridge from multi-shot trial results into
12+
reflective mutation prompts.
13+
- `ActionableSideInfo`, `MultiShotVariant`, `MultiShotTrace`, `MultiShotRun`,
14+
`MultiShotScore`, `MultiShotTrialResult`, `MultiShotMutateAdapter`, and
15+
related public types.
16+
- `docs/multi-shot-optimization.md` and
17+
`examples/multi-shot-optimization/index.ts`.
18+
19+
### Changed
20+
21+
- The multi-shot result shape explicitly separates `searchBestVariant` from
22+
`promotedVariant`. If a holdout gate rejects the search winner, the promoted
23+
variant is the baseline.
24+
- `runMultiShotOptimization` validates release-critical configuration up front:
25+
unique variant/scenario ids, positive integer run counts, population size,
26+
disjoint search/holdout ids, and a gate baseline key matching the first seed
27+
variant.
28+
329
## 0.17.2 — agent control runtime
430

531
### Added

README.md

Lines changed: 7 additions & 0 deletions
@@ -82,13 +82,20 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
8282
| `FeedbackTrajectory`, `InMemoryFeedbackTrajectoryStore`, `FileSystemFeedbackTrajectoryStore` | Human/environment feedback loops: capture approvals, rejections, choices, revisions, metrics, and policy blocks as train/dev/test/holdout examples. | [feedback-trajectories.md](./docs/feedback-trajectories.md) |
8383
| `evaluateActionPolicy` | Generic action preflight for approval, budget, expected-outcome, and kill-criteria checks. | [feature-guide.md](./docs/feature-guide.md) |
8484
| `ExperimentTracker`, `PromptOptimizer`, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
85+
| `runMultiShotOptimization`, `trialTraceFromMultiShotTrial` | GEPA-style optimization for variable-length agent trajectories with ASI, paired seeds, and optional held-out promotion gating. | [multi-shot-optimization.md](./docs/multi-shot-optimization.md) |
8586
| `runPromptEvolution`, `createCompositeMutator`, `createSandboxPool`, `createSandboxCodeMutator`, `MutationTelemetry`, `LineageRecorder`, `CostLedger`, `JsonlTrialCache` | Prompt + code evolution loops with bounded sandbox pools, durable JSONL telemetry, plateau-detecting composite mutators, crash-resumable trial cache. | §Evolution loop |
8687
| `reflective-mutation` (`buildReflectionPrompt`, `parseReflectionResponse`, `DEFAULT_MUTATION_PRIMITIVES`) | Trace-conditioned LLM mutator that reasons over top/bottom trials instead of blind rewrites. | inline JSDoc |
8788
| `correlationStudy`, `OutcomeStore`, `ProductRegistry` | Meta-eval: do our scores predict deployment outcomes (revenue, retention)? | inline JSDoc |
8889
| Telemetry (`telemetry/`, `telemetry/file`) | OTLP export, trace replay, file sinks. | inline JSDoc |
8990

9091
## Evolution loop
9192

93+
For agent tasks that run across many chat turns or tool calls, start with
94+
[`runMultiShotOptimization`](./docs/multi-shot-optimization.md). It runs the
95+
same prompt-evolution core over full trajectories, carries actionable side
96+
information into reflection, and separates the search winner from the variant
97+
that actually passes held-out promotion.
98+
9299
Closing the loop on a prompt or codebase is **two adapters + a config**. Compose `runPromptEvolution` with `createCompositeMutator` (plateau policy) and you get prompt-only optimization until improvement stalls, then automatic switch to code-channel mutations from a coding agent inside a `SandboxPool`.
93100

94101
```ts

clients/python/README.md

Lines changed: 2 additions & 2 deletions
@@ -140,9 +140,9 @@ All errors carry `.code` and `.details` (the structured payload from the server)
140140

141141
## Versioning
142142

143-
This package is **version-locked** to the npm package. `tangle-agent-eval==0.12.0` ↔ `@tangle-network/agent-eval@0.12.0`. The two ship from the same git tag in the same CI workflow; if either fails to publish, neither does. Mismatched versions are a build-time error.
143+
This package is **version-locked** to the npm package. `tangle-agent-eval==0.18.0` ↔ `@tangle-network/agent-eval@0.18.0`. The two ship from the same git tag in the same CI workflow; if either fails to publish, neither does. Mismatched versions are a build-time error.
144144

145-
`wire_version` is separate. It bumps only on breaking schema changes. A 0.13 server still serves 0.12 clients as long as `wire_version` is the same.
145+
`wire_version` is separate. It bumps only on breaking schema changes. Package versions can differ across releases as long as `wire_version` is the same.
146146

147147
## Development
148148

clients/python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
44

55
[project]
66
name = "tangle-agent-eval"
7-
version = "0.12.0"
7+
version = "0.18.0"
88
description = "Python client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC."
99
readme = "README.md"
1010
requires-python = ">=3.10"

clients/python/src/tangle_agent_eval/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -39,7 +39,7 @@
3939
VersionResponse,
4040
)
4141

42-
__version__ = "0.12.0"
42+
__version__ = "0.18.0"
4343

4444
__all__ = [
4545
"Client",

docs/feature-guide.md

Lines changed: 7 additions & 4 deletions
@@ -34,6 +34,7 @@ trying, and whether a change made them better or worse.
3434
| “Can this action run, or does it need approval?” | `evaluateActionPolicy` | Generic preflight for side effects, budgets, and required evidence. |
3535
| “I need train/dev/test/holdout examples.” | `Dataset` plus feedback trajectory conversion | Stable splits and contamination control. |
3636
| “Which prompt or signature wins?” | `PromptOptimizer`, `OptimizationLoop`, steering optimizers | Runs variants on scenarios and compares scores. |
37+
| “Improve a multi-turn agent over real task traces.” | `runMultiShotOptimization` | GEPA-style trajectory optimization with ASI and held-out promotion. |
3738
| “Improve prompts, then code if prompts plateau.” | `runPromptEvolution`, composite mutator, code mutator | Bounded evolution with telemetry and lineage. |
3839
| “Find why a regression happened.” | bisector, traces, run records | Narrows changes and preserves evidence. |
3940
| “Expose evals to another language.” | Wire protocol and Python client | HTTP/RPC boundary for non-TypeScript apps. |
@@ -104,10 +105,12 @@ generated code -> build/test/runtime gates -> score -> ship or revise
104105

105106
Use when you want Ax/GEPA-style improvement.
106107

107-
1. Build a dataset with train/dev/test/holdout splits.
108-
2. Evaluate variants against the same scenarios.
109-
3. Promote only when paired comparisons and held-out checks support it.
110-
4. Keep run records with prompt hash, model, config, cost, and commit.
108+
1. For variable-length agent tasks, use `runMultiShotOptimization`.
109+
2. Build search/dev/test/holdout splits from the real product loop.
110+
3. Score full trajectories, not just final text.
111+
4. Emit actionable side information for failures the mutator can fix.
112+
5. Promote only `promotedVariant`, never a rejected `searchBestVariant`.
113+
6. Keep run records with prompt hash, model, config, cost, and commit.
111114

112115
Result:
113116

docs/multi-shot-optimization.md

Lines changed: 122 additions & 0 deletions
@@ -0,0 +1,122 @@
1+
# Multi-Shot Optimization
2+
3+
`runMultiShotOptimization` is the public adapter for GEPA-style optimization over
4+
variable-length agent conversations.
5+
6+
Use it when the thing you want to improve is not a single model call. Typical
7+
targets are agent system prompts, tool descriptions, routing policies, retrieval
8+
plans, or app-specific scaffolding that affects an entire task trajectory.
9+
10+
The primitive is intentionally small. Your app owns the domain logic:
11+
12+
- `seedVariants`: prompt/config/tool-policy candidates
13+
- `runner`: executes one complete task trajectory for one variant
14+
- `scorer`: scores the trajectory and emits actionable side information
15+
- `mutateAdapter`: proposes new variants from top and bottom trials
16+
17+
`agent-eval` owns the release-critical glue:
18+
19+
- stable paired seeds
20+
- search-split prompt evolution
21+
- cost/score Pareto objectives
22+
- failed-run conversion into failed trials
23+
- ASI projection into reflection traces and numeric metrics
24+
- optional paired holdout gating through `HeldOutGate`
25+
- validated `RunRecord` rows for promotion evidence
26+
27+
## Result Contract
28+
29+
The return shape separates discovery from promotion:
30+
31+
- `searchBestVariant`: best variant on the optimizer-visible search scenarios
32+
- `searchBestAggregate`: aggregate for that search winner
33+
- `promotedVariant`: variant callers should ship
34+
- `promotedAggregate`: aggregate for the promoted variant
35+
- `gate`: holdout decision and evidence, or `null` when no gate ran
36+
37+
If a holdout gate is configured and rejects the search winner,
38+
`promotedVariant` is the baseline. Do not ship `searchBestVariant` directly
39+
unless you intentionally run without a holdout gate.
40+
41+
## Actionable Side Information
42+
43+
The scorer should return `asi` rows for concrete failure modes:
44+
45+
```ts
46+
{
47+
expectationId: 'used-primary-sources',
48+
message: 'The final answer cited secondary summaries instead of primary sources.',
49+
severity: 'error',
50+
responsibleSurface: 'retrieval-policy',
51+
suggestion: 'Prefer primary-source domains during source-gathering turns.',
52+
}
53+
```
54+
55+
These rows become:
56+
57+
- reflection expectations via `trialTraceFromMultiShotTrial`
58+
- aggregate metrics like `asi.error` and `surface.retrieval-policy`
59+
- trace evidence available to downstream reports
60+
61+
This is the main reason to use this primitive instead of reducing each run to a
62+
single scalar reward.
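As a rough illustration of how ASI rows might turn into aggregate metrics such as `asi.error` and `surface.retrieval-policy`, the sketch below counts rows per severity and per responsible surface. The `AsiRowSketch` type and `projectAsiMetrics` function are hypothetical names written for this note; they are not the library's `ActionableSideInfo` type or its actual projection code, and only the key style is taken from the text above.

```typescript
// Hypothetical local type mirroring the ASI row fields shown above.
// The real `ActionableSideInfo` type may differ.
type AsiRowSketch = {
  expectationId: string
  message: string
  severity: string
  responsibleSurface: string
  suggestion?: string
}

// Count rows under `asi.<severity>` and `surface.<responsibleSurface>` keys.
// Assumption: the adapter aggregates by simple counting; it may do more.
function projectAsiMetrics(rows: AsiRowSketch[]): Record<string, number> {
  const metrics: Record<string, number> = {}
  for (const row of rows) {
    for (const key of [`asi.${row.severity}`, `surface.${row.responsibleSurface}`]) {
      metrics[key] = (metrics[key] ?? 0) + 1
    }
  }
  return metrics
}

const asiMetrics = projectAsiMetrics([
  {
    expectationId: 'used-primary-sources',
    message: 'The final answer cited secondary summaries instead of primary sources.',
    severity: 'error',
    responsibleSurface: 'retrieval-policy',
  },
  {
    expectationId: 'complete-task',
    message: 'The agent did not fully complete the task.',
    severity: 'error',
    responsibleSurface: 'system-prompt',
  },
])

console.log(asiMetrics)
```

Keeping the counts keyed by surface is what lets a mutator see which part of the system (prompt, retrieval policy, tool description) is accumulating failures, which a single scalar score would hide.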
63+
64+
## Holdout Discipline
65+
66+
For release gates, configure `gate`. The first seed variant is the baseline and
67+
`gate.gate.baselineKey` must match its id.
68+
69+
Holdout scenarios must be disjoint from `searchScenarioIds`. The adapter runs
70+
baseline and candidate with the same `(scenarioId, rep)` seed, validates every
71+
row with `validateRunRecord`, then asks `HeldOutGate` whether to promote.
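One way to picture the paired-seed rule is a seed derived only from `(scenarioId, rep)` and never from the variant, so baseline and candidate always face the same randomness. The sketch below is an assumption-laden illustration: `fnv1a32`, `pairedSeed`, `planPairedRuns`, and the `baseSeed` parameter are hypothetical names, and the library's real seed derivation is not shown in this commit.

```typescript
// FNV-1a 32-bit string hash, used here only as a simple deterministic mixer.
function fnv1a32(text: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return hash >>> 0
}

// Hypothetical helper: the seed depends on scenario and rep, not on the variant.
function pairedSeed(baseSeed: number, scenarioId: string, rep: number): number {
  return fnv1a32(`${baseSeed}:${scenarioId}:${rep}`)
}

type PlannedRun = { variantId: string; scenarioId: string; rep: number; seed: number }

// Expand variants x scenarios x reps, reusing one seed per (scenarioId, rep).
function planPairedRuns(
  variantIds: string[],
  scenarioIds: string[],
  reps: number,
  baseSeed: number,
): PlannedRun[] {
  const runs: PlannedRun[] = []
  for (const scenarioId of scenarioIds) {
    for (let rep = 0; rep < reps; rep++) {
      const seed = pairedSeed(baseSeed, scenarioId, rep)
      for (const variantId of variantIds) {
        runs.push({ variantId, scenarioId, rep, seed })
      }
    }
  }
  return runs
}

const plannedRuns = planPairedRuns(
  ['baseline', 'candidate'],
  ['holdout-brief', 'holdout-research'],
  2,
  7,
)
console.log(plannedRuns.length)
```

Because the pair shares a seed, a score difference between baseline and candidate on a given row reflects the variant rather than sampling noise, which is what makes a paired comparison meaningful.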
72+
73+
When `gate.searchScenarioIds` is omitted, the adapter reuses
74+
`searchScenarioIds` for the overfit-gap check.
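The holdout rules in this section can be restated as a small pre-flight check. The sketch below is illustrative only: `GateConfigSketch`, `checkGateConfig`, and `overfitGapScenarioIds` are hypothetical names, the config shape is inferred from the example in this commit, and whether the real adapter throws or reports problems some other way is an assumption not confirmed by the source.

```typescript
// Hypothetical shape, inferred from the `gate` option used in the example file.
type GateConfigSketch = {
  holdoutScenarioIds: string[]
  searchScenarioIds?: string[]
  gate: { baselineKey: string }
}

// Collect human-readable problems instead of throwing, so callers can see all of them.
function checkGateConfig(
  seedVariantIds: string[],
  searchScenarioIds: string[],
  gate: GateConfigSketch,
): string[] {
  const problems: string[] = []
  const search = new Set(searchScenarioIds)
  const overlap = gate.holdoutScenarioIds.filter((id) => search.has(id))
  if (overlap.length > 0) {
    problems.push(`holdout scenarios overlap search scenarios: ${overlap.join(', ')}`)
  }
  if (seedVariantIds.length === 0 || gate.gate.baselineKey !== seedVariantIds[0]) {
    problems.push('gate.gate.baselineKey must match the id of the first seed variant')
  }
  return problems
}

// When the gate does not name its own search ids, fall back to the optimizer's.
function overfitGapScenarioIds(searchScenarioIds: string[], gate: GateConfigSketch): string[] {
  return gate.searchScenarioIds ?? searchScenarioIds
}

const goodGate: GateConfigSketch = {
  holdoutScenarioIds: ['holdout-brief', 'holdout-research'],
  gate: { baselineKey: 'baseline' },
}
const leakyGate: GateConfigSketch = {
  holdoutScenarioIds: ['search-brief'],
  gate: { baselineKey: 'candidate' },
}

console.log(checkGateConfig(['baseline'], ['search-brief'], goodGate))
console.log(checkGateConfig(['baseline'], ['search-brief'], leakyGate))
```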
75+
76+
## Minimal Shape
77+
78+
```ts
79+
import {
80+
runMultiShotOptimization,
81+
trialTraceFromMultiShotTrial,
82+
type MultiShotVariant,
83+
} from '@tangle-network/agent-eval'
84+
85+
type Payload = { systemPrompt: string }
86+
87+
const baseline: MultiShotVariant<Payload> = {
88+
id: 'baseline',
89+
label: 'baseline',
90+
generation: 0,
91+
payload: { systemPrompt: currentPrompt },
92+
}
93+
94+
const result = await runMultiShotOptimization<Payload>({
95+
runId: `research-agent-${Date.now()}`,
96+
target: 'research-agent-system-prompt',
97+
seedVariants: [baseline],
98+
searchScenarioIds: searchScenarios.map((s) => s.id),
99+
reps: 2,
100+
generations: 4,
101+
populationSize: 4,
102+
scoreConcurrency: 4,
103+
runner: {
104+
async run({ variant, scenarioId, seed }) {
105+
return runYourAgentToCompletion({ scenarioId, seed, prompt: variant.payload.systemPrompt })
106+
},
107+
},
108+
scorer: {
109+
async score({ run }) {
110+
return scoreFullTrajectory(run.trace)
111+
},
112+
},
113+
mutateAdapter: {
114+
async mutate({ parent, bottomTrials, childCount, generation }) {
115+
const traces = bottomTrials.map((t) => trialTraceFromMultiShotTrial(t))
116+
return proposePromptMutations({ parent, traces, childCount, generation })
117+
},
118+
},
119+
})
120+
121+
deploy(result.promotedVariant.payload)
122+
```
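The split between `searchBestVariant` and `promotedVariant` described in the Result Contract can be read as a small selection rule. The sketch below is a guess at that rule under stated assumptions: `VariantSketch`, `GateOutcomeSketch`, its `promote` flag, and `selectPromoted` are hypothetical names, and treating the search winner as promoted when no gate ran is an inference from the text rather than something this commit states outright.

```typescript
// Hypothetical minimal shapes; the real result and gate types carry more fields.
type VariantSketch = { id: string }
type GateOutcomeSketch = { promote: boolean } | null

// Pick the variant a caller should ship.
// - gate rejected the search winner: fall back to the baseline
// - gate accepted it: ship the search winner
// - no gate ran (null): assumed here to promote the search winner
function selectPromoted<V extends VariantSketch>(
  baseline: V,
  searchBest: V,
  gate: GateOutcomeSketch,
): V {
  if (gate === null) return searchBest
  return gate.promote ? searchBest : baseline
}

const baselineVariant: VariantSketch = { id: 'baseline' }
const winnerVariant: VariantSketch = { id: 'baseline.g1.0' }

console.log(selectPromoted(baselineVariant, winnerVariant, { promote: false }).id)
console.log(selectPromoted(baselineVariant, winnerVariant, { promote: true }).id)
console.log(selectPromoted(baselineVariant, winnerVariant, null).id)
```

Reading the contract this way makes the deployment rule in the example explicit: code that ships `promotedVariant` stays correct whether or not a gate is configured, while code that ships `searchBestVariant` silently bypasses a rejection.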

docs/wire-protocol.md

Lines changed: 2 additions & 2 deletions
@@ -96,13 +96,13 @@ GET /v1/version
9696
```json
9797
{
9898
"package": "@tangle-network/agent-eval",
99-
"version": "0.12.0",
99+
"version": "0.18.0",
100100
"wireVersion": "1.0.0",
101101
"apiSurface": ["judge", "listRubrics", "version"]
102102
}
103103
```
104104

105-
`version` matches the npm/PyPI package version. `wireVersion` bumps independently — only on breaking schema changes. A 0.13 server still serves 0.12 clients as long as `wireVersion` matches.
105+
`version` matches the npm/PyPI package version. `wireVersion` bumps independently — only on breaking request/response schema changes. Package versions can differ across releases as long as `wireVersion` matches.
106106

107107
### `GET /healthz` — liveness
108108

examples/multi-shot-optimization/index.ts

Lines changed: 114 additions & 0 deletions
@@ -0,0 +1,114 @@
1+
import {
2+
runMultiShotOptimization,
3+
trialTraceFromMultiShotTrial,
4+
type MultiShotVariant,
5+
type RunRecord,
6+
} from '@tangle-network/agent-eval'
7+
8+
type Payload = {
9+
instruction: string
10+
quality: number
11+
}
12+
13+
const baseline: MultiShotVariant<Payload> = {
14+
id: 'baseline',
15+
label: 'baseline',
16+
generation: 0,
17+
payload: {
18+
instruction: 'Complete the user task.',
19+
quality: 0.45,
20+
},
21+
}
22+
23+
const result = await runMultiShotOptimization<Payload>({
24+
runId: 'demo-multi-shot',
25+
target: 'demo-agent-system-prompt',
26+
seedVariants: [baseline],
27+
searchScenarioIds: ['search-brief', 'search-code-review', 'search-research'],
28+
reps: 1,
29+
generations: 2,
30+
populationSize: 2,
31+
scoreConcurrency: 2,
32+
runner: {
33+
async run({ variant, scenarioId }) {
34+
return {
35+
trace: {
36+
scenarioId,
37+
turns: [
38+
{ role: 'user', content: `Run ${scenarioId}` },
39+
{ role: 'assistant', content: `${variant.payload.instruction} quality=${variant.payload.quality}` },
40+
],
41+
output: `quality=${variant.payload.quality}`,
42+
},
43+
costUsd: 0.01,
44+
durationMs: 50,
45+
}
46+
},
47+
},
48+
scorer: {
49+
async score({ variant }) {
50+
return {
51+
score: variant.payload.quality,
52+
ok: true,
53+
asi: variant.payload.quality >= 0.8
54+
? []
55+
: [{
56+
expectationId: 'complete-task',
57+
message: 'The agent did not fully complete the task.',
58+
severity: 'error',
59+
responsibleSurface: 'system-prompt',
60+
suggestion: 'Make completion criteria explicit before final response.',
61+
}],
62+
}
63+
},
64+
},
65+
mutateAdapter: {
66+
async mutate({ parent, bottomTrials, childCount, generation }) {
67+
const traces = bottomTrials.map((trial) => trialTraceFromMultiShotTrial(trial))
68+
const rationale = traces.flatMap((trace) => (trace.expectations ?? []).map((e) => e.phrase)).join('\n')
69+
return Array.from({ length: childCount }, (_, i) => ({
70+
id: `${parent.id}.g${generation}.${i}`,
71+
label: 'completion-focused',
72+
generation,
73+
payload: {
74+
instruction: `${parent.payload.instruction} Verify every requested step before final answer.`,
75+
quality: 0.9,
76+
},
77+
rationale,
78+
}))
79+
},
80+
},
81+
gate: {
82+
holdoutScenarioIds: ['holdout-brief', 'holdout-code-review', 'holdout-research'],
83+
gate: {
84+
baselineKey: 'baseline',
85+
minProductiveRuns: 3,
86+
pairedDeltaThreshold: 0,
87+
seed: 7,
88+
},
89+
toRunRecord: ({ variant, scenarioId, rep, split, seed, trial }): RunRecord => ({
90+
runId: `demo-${variant.id}-${scenarioId}-${rep}-${split}`,
91+
experimentId: scenarioId,
92+
candidateId: variant.id,
93+
seed,
94+
model: 'demo-model@2026-01-01',
95+
promptHash: 'p'.repeat(64),
96+
configHash: 'c'.repeat(64),
97+
commitSha: 'deadbeef',
98+
wallMs: trial.durationMs ?? 0,
99+
costUsd: trial.cost ?? 0,
100+
tokenUsage: { input: 1, output: 1 },
101+
outcome: {
102+
[split === 'holdout' ? 'holdoutScore' : 'searchScore']: trial.score,
103+
raw: { score: trial.score },
104+
},
105+
splitTag: split,
106+
}),
107+
},
108+
})
109+
110+
console.log({
111+
searchBest: result.searchBestVariant.id,
112+
promoted: result.promotedVariant.id,
113+
gate: result.gate?.decision ?? null,
114+
})

package.json

Lines changed: 11 additions & 2 deletions
@@ -1,7 +1,15 @@
11
{
22
"name": "@tangle-network/agent-eval",
3-
"version": "0.17.3",
3+
"version": "0.18.0",
44
"description": "Trace-first evaluation framework for Tangle agents. Core (spans, pipelines, sandbox harness, OTLP export), trust (dataset, red-team, calibration, behavior DSL), builder-of-builders (three-layer eval, resumable sessions, meta-runtime correlation), and frontier (meta-eval correlation study, Process Reward Modeling, bisector).",
5+
"homepage": "https://github.com/tangle-network/agent-eval#readme",
6+
"repository": {
7+
"type": "git",
8+
"url": "git+https://github.com/tangle-network/agent-eval.git"
9+
},
10+
"bugs": {
11+
"url": "https://github.com/tangle-network/agent-eval/issues"
12+
},
513
"type": "module",
614
"main": "./dist/index.js",
715
"types": "./dist/index.d.ts",
@@ -31,7 +39,8 @@
3139
"agent-eval": "dist/cli.js"
3240
},
3341
"files": [
34-
"dist"
42+
"dist",
43+
"docs"
3544
],
3645
"publishConfig": {
3746
"access": "public"
