tangle-network
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 26 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 7 additions & 0 deletions
diff --git a/clients/python/README.md b/clients/python/README.md
Lines changed: 2 additions & 2 deletions
diff --git a/clients/python/pyproject.toml b/clients/python/pyproject.toml
Lines changed: 1 addition & 1 deletion
diff --git a/clients/python/src/tangle_agent_eval/__init__.py b/clients/python/src/tangle_agent_eval/__init__.py
Lines changed: 1 addition & 1 deletion
diff --git a/docs/feature-guide.md b/docs/feature-guide.md
Lines changed: 7 additions & 4 deletions
diff --git a/docs/multi-shot-optimization.md b/docs/multi-shot-optimization.md
Lines changed: 122 additions & 0 deletions
diff --git a/docs/wire-protocol.md b/docs/wire-protocol.md
Lines changed: 2 additions & 2 deletions
diff --git a/examples/multi-shot-optimization/index.ts b/examples/multi-shot-optimization/index.ts
Lines changed: 114 additions & 0 deletions
diff --git a/package.json b/package.json
Lines changed: 11 additions & 2 deletions
@@ -1,5 +1,31 @@
 # Changelog
 
+## 0.18.0 — multi-shot optimization
+
+### Added
+
+- `runMultiShotOptimization`, the canonical GEPA-style adapter for
+  variable-length agent trajectories. It wraps `runPromptEvolution` while
+  preserving full multi-shot traces, actionable side information, stable paired
+  seeds, score/cost objectives, and optional held-out promotion gating.
+- `trialTraceFromMultiShotTrial`, a bridge from multi-shot trial results into
+  reflective mutation prompts.
+- `ActionableSideInfo`, `MultiShotVariant`, `MultiShotTrace`, `MultiShotRun`,
+  `MultiShotScore`, `MultiShotTrialResult`, `MultiShotMutateAdapter`, and
+  related public types.
+- `docs/multi-shot-optimization.md` and
+  `examples/multi-shot-optimization/index.ts`.
+
+### Changed
+
+- The multi-shot result shape explicitly separates `searchBestVariant` from
+  `promotedVariant`. If a holdout gate rejects the search winner, the promoted
+  variant is the baseline.
+- `runMultiShotOptimization` validates release-critical configuration up front:
+  unique variant/scenario ids, positive integer run counts, population size,
+  disjoint search/holdout ids, and a gate baseline key matching the first seed
+  variant.
+
 ## 0.17.2 — agent control runtime
 
 ### Added
 
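Reviewer sketch (not part of the diff): the changelog entry above lists the up-front checks `runMultiShotOptimization` performs. A rough illustration of that kind of validation might look like the following. The `SketchConfig` shape, the function names, and the error strings are all hypothetical and chosen for this example; the library's real option types and messages may differ.

```ts
// Hypothetical, simplified config shape for illustration only.
// The real option types in @tangle-network/agent-eval may differ.
interface SketchConfig {
  seedVariantIds: string[]
  searchScenarioIds: string[]
  holdoutScenarioIds?: string[]
  baselineKey?: string
  reps: number
  generations: number
  populationSize: number
}

// Return every id that appears more than once.
function duplicates(ids: string[]): string[] {
  const seen = new Set<string>()
  const dupes = new Set<string>()
  for (const id of ids) {
    if (seen.has(id)) dupes.add(id)
    seen.add(id)
  }
  return Array.from(dupes)
}

// Collect all problems instead of throwing on the first one.
function validateSketchConfig(cfg: SketchConfig): string[] {
  const errors: string[] = []

  const dupVariants = duplicates(cfg.seedVariantIds)
  if (dupVariants.length > 0) errors.push(`duplicate variant ids: ${dupVariants.join(', ')}`)

  const dupScenarios = duplicates(cfg.searchScenarioIds)
  if (dupScenarios.length > 0) errors.push(`duplicate scenario ids: ${dupScenarios.join(', ')}`)

  const counts: Array<[string, number]> = [
    ['reps', cfg.reps],
    ['generations', cfg.generations],
    ['populationSize', cfg.populationSize],
  ]
  for (const [name, value] of counts) {
    if (!Number.isInteger(value) || value <= 0) errors.push(`${name} must be a positive integer`)
  }

  // Search and holdout scenario ids must not overlap.
  const search = new Set(cfg.searchScenarioIds)
  const overlap = (cfg.holdoutScenarioIds ?? []).filter((id) => search.has(id))
  if (overlap.length > 0) errors.push(`search/holdout overlap: ${overlap.join(', ')}`)

  // The gate baseline key must name the first seed variant.
  if (cfg.baselineKey !== undefined && cfg.baselineKey !== cfg.seedVariantIds[0]) {
    errors.push('baselineKey must match the first seed variant id')
  }
  return errors
}
```

Returning a list of errors rather than throwing on the first failure is a design choice of this sketch; it lets a caller report every misconfiguration in one pass.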
@@ -82,13 +82,20 @@ The recipe for a code-generator eval is in [`SKILL.md` §Minimal working path](.
 | `FeedbackTrajectory`, `InMemoryFeedbackTrajectoryStore`, `FileSystemFeedbackTrajectoryStore` | Human/environment feedback loops: capture approvals, rejections, choices, revisions, metrics, and policy blocks as train/dev/test/holdout examples. | [feedback-trajectories.md](./docs/feedback-trajectories.md) |
 | `evaluateActionPolicy` | Generic action preflight for approval, budget, expected-outcome, and kill-criteria checks. | [feature-guide.md](./docs/feature-guide.md) |
 | `ExperimentTracker`, `PromptOptimizer`, `bisector` | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
+| `runMultiShotOptimization`, `trialTraceFromMultiShotTrial` | GEPA-style optimization for variable-length agent trajectories with ASI, paired seeds, and optional held-out promotion gating. | [multi-shot-optimization.md](./docs/multi-shot-optimization.md) |
 | `runPromptEvolution`, `createCompositeMutator`, `createSandboxPool`, `createSandboxCodeMutator`, `MutationTelemetry`, `LineageRecorder`, `CostLedger`, `JsonlTrialCache` | Prompt + code evolution loops with bounded sandbox pools, durable JSONL telemetry, plateau-detecting composite mutators, crash-resumable trial cache. | §Evolution loop |
 | `reflective-mutation` (`buildReflectionPrompt`, `parseReflectionResponse`, `DEFAULT_MUTATION_PRIMITIVES`) | Trace-conditioned LLM mutator that reasons over top/bottom trials instead of blind rewrites. | inline JSDoc |
 | `correlationStudy`, `OutcomeStore`, `ProductRegistry` | Meta-eval: do our scores predict deployment outcomes (revenue, retention)? | inline JSDoc |
 | Telemetry (`telemetry/`, `telemetry/file`) | OTLP export, trace replay, file sinks. | inline JSDoc |
 
 ## Evolution loop
 
+For agent tasks that run across many chat turns or tool calls, start with
+[`runMultiShotOptimization`](./docs/multi-shot-optimization.md). It runs the
+same prompt-evolution core over full trajectories, carries actionable side
+information into reflection, and separates the search winner from the variant
+that actually passes held-out promotion.
+
 Closing the loop on a prompt or codebase is **two adapters + a config**. Compose `runPromptEvolution` with `createCompositeMutator` (plateau policy) and you get prompt-only optimization until improvement stalls, then automatic switch to code-channel mutations from a coding agent inside a `SandboxPool`.
 
 ```ts
 
@@ -140,9 +140,9 @@ All errors carry `.code` and `.details` (the structured payload from the server)
 
 ## Versioning
 
-This package is **version-locked** to the npm package. `tangle-agent-eval==0.12.0` ↔ `@tangle-network/agent-eval@0.12.0`. The two ship from the same git tag in the same CI workflow; if either fails to publish, neither does. Mismatched versions are a build-time error.
+This package is **version-locked** to the npm package. `tangle-agent-eval==0.18.0` ↔ `@tangle-network/agent-eval@0.18.0`. The two ship from the same git tag in the same CI workflow; if either fails to publish, neither does. Mismatched versions are a build-time error.
 
-`wire_version` is separate. It bumps only on breaking schema changes. A 0.13 server still serves 0.12 clients as long as `wire_version` is the same.
+`wire_version` is separate. It bumps only on breaking schema changes. Package versions can differ across releases as long as `wire_version` is the same.
 
 ## Development
 
 
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "tangle-agent-eval"
-version = "0.12.0"
+version = "0.18.0"
 description = "Python client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC."
 readme = "README.md"
 requires-python = ">=3.10"
 
@@ -39,7 +39,7 @@
     VersionResponse,
 )
 
-__version__ = "0.12.0"
+__version__ = "0.18.0"
 
 __all__ = [
     "Client",
 
@@ -34,6 +34,7 @@ trying, and whether a change made them better or worse.
 | “Can this action run, or does it need approval?” | `evaluateActionPolicy` | Generic preflight for side effects, budgets, and required evidence. |
 | “I need train/dev/test/holdout examples.” | `Dataset` plus feedback trajectory conversion | Stable splits and contamination control. |
 | “Which prompt or signature wins?” | `PromptOptimizer`, `OptimizationLoop`, steering optimizers | Runs variants on scenarios and compares scores. |
+| “Improve a multi-turn agent over real task traces.” | `runMultiShotOptimization` | GEPA-style trajectory optimization with ASI and held-out promotion. |
 | “Improve prompts, then code if prompts plateau.” | `runPromptEvolution`, composite mutator, code mutator | Bounded evolution with telemetry and lineage. |
 | “Find why a regression happened.” | bisector, traces, run records | Narrows changes and preserves evidence. |
 | “Expose evals to another language.” | Wire protocol and Python client | HTTP/RPC boundary for non-TypeScript apps. |
@@ -104,10 +105,12 @@ generated code -> build/test/runtime gates -> score -> ship or revise
 
 Use when you want Ax/GEPA-style improvement.
 
-1. Build a dataset with train/dev/test/holdout splits.
-2. Evaluate variants against the same scenarios.
-3. Promote only when paired comparisons and held-out checks support it.
-4. Keep run records with prompt hash, model, config, cost, and commit.
+1. For variable-length agent tasks, use `runMultiShotOptimization`.
+2. Build search/dev/test/holdout splits from the real product loop.
+3. Score full trajectories, not just final text.
+4. Emit actionable side information for failures the mutator can fix.
+5. Promote only `promotedVariant`, never a rejected `searchBestVariant`.
+6. Keep run records with prompt hash, model, config, cost, and commit.
 
 Result:
 
 
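Reviewer sketch (not part of the diff): step 5 above says to promote only `promotedVariant`. The selection rule the changelog describes can be illustrated as below. `SketchVariant`, `SketchGateResult`, and `selectPromoted` are hypothetical names; in particular, modelling the gate decision as a boolean `promote` field is an assumption, and treating "no gate configured" as "promote the search winner" is implied by the docs rather than stated outright.

```ts
// Hypothetical minimal types; the real MultiShotVariant and gate result carry more fields.
interface SketchVariant {
  id: string
}

interface SketchGateResult {
  promote: boolean // assumption: the real gate decision may be a richer value
}

// Pick the variant a caller should ship.
// - No gate ran (null): assumed to fall through to the search winner.
// - Gate accepted: the search winner is promoted.
// - Gate rejected: fall back to the baseline.
function selectPromoted(
  baseline: SketchVariant,
  searchBest: SketchVariant,
  gate: SketchGateResult | null,
): SketchVariant {
  if (gate === null) return searchBest
  return gate.promote ? searchBest : baseline
}
```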
@@ -0,0 +1,122 @@
+# Multi-Shot Optimization
+
+`runMultiShotOptimization` is the public adapter for GEPA-style optimization over
+variable-length agent conversations.
+
+Use it when the thing you want to improve is not a single model call. Typical
+targets are agent system prompts, tool descriptions, routing policies, retrieval
+plans, or app-specific scaffolding that affects an entire task trajectory.
+
+The primitive is intentionally small. Your app owns the domain logic:
+
+- `seedVariants`: prompt/config/tool-policy candidates
+- `runner`: executes one complete task trajectory for one variant
+- `scorer`: scores the trajectory and emits actionable side information
+- `mutateAdapter`: proposes new variants from top and bottom trials
+
+`agent-eval` owns the release-critical glue:
+
+- stable paired seeds
+- search-split prompt evolution
+- cost/score Pareto objectives
+- failed-run conversion into failed trials
+- ASI projection into reflection traces and numeric metrics
+- optional paired holdout gating through `HeldOutGate`
+- validated `RunRecord` rows for promotion evidence
+
+## Result Contract
+
+The return shape separates discovery from promotion:
+
+- `searchBestVariant`: best variant on the optimizer-visible search scenarios
+- `searchBestAggregate`: aggregate for that search winner
+- `promotedVariant`: variant callers should ship
+- `promotedAggregate`: aggregate for the promoted variant
+- `gate`: holdout decision and evidence, or `null` when no gate ran
+
+If a holdout gate is configured and rejects the search winner,
+`promotedVariant` is the baseline. Do not ship `searchBestVariant` directly
+unless you intentionally run without a holdout gate.
+
+## Actionable Side Information
+
+The scorer should return `asi` rows for concrete failure modes:
+
+```ts
+{
+  expectationId: 'used-primary-sources',
+  message: 'The final answer cited secondary summaries instead of primary sources.',
+  severity: 'error',
+  responsibleSurface: 'retrieval-policy',
+  suggestion: 'Prefer primary-source domains during source-gathering turns.',
+}
+```
+
+These rows become:
+
+- reflection expectations via `trialTraceFromMultiShotTrial`
+- aggregate metrics like `asi.error` and `surface.retrieval-policy`
+- trace evidence available to downstream reports
+
+This is the main reason to use this primitive instead of reducing each run to a
+single scalar reward.
+
+## Holdout Discipline
+
+For release gates, configure `gate`. The first seed variant is the baseline and
+`gate.gate.baselineKey` must match its id.
+
+Holdout scenarios must be disjoint from `searchScenarioIds`. The adapter runs
+baseline and candidate with the same `(scenarioId, rep)` seed, validates every
+row with `validateRunRecord`, then asks `HeldOutGate` whether to promote.
+
+When `gate.searchScenarioIds` is omitted, the adapter reuses
+`searchScenarioIds` for the overfit-gap check.
+
+## Minimal Shape
+
+```ts
+import {
+  runMultiShotOptimization,
+  trialTraceFromMultiShotTrial,
+  type MultiShotVariant,
+} from '@tangle-network/agent-eval'
+
+type Payload = { systemPrompt: string }
+
+const baseline: MultiShotVariant<Payload> = {
+  id: 'baseline',
+  label: 'baseline',
+  generation: 0,
+  payload: { systemPrompt: currentPrompt },
+}
+
+const result = await runMultiShotOptimization<Payload>({
+  runId: `research-agent-${Date.now()}`,
+  target: 'research-agent-system-prompt',
+  seedVariants: [baseline],
+  searchScenarioIds: searchScenarios.map((s) => s.id),
+  reps: 2,
+  generations: 4,
+  populationSize: 4,
+  scoreConcurrency: 4,
+  runner: {
+    async run({ variant, scenarioId, seed }) {
+      return runYourAgentToCompletion({ scenarioId, seed, prompt: variant.payload.systemPrompt })
+    },
+  },
+  scorer: {
+    async score({ run }) {
+      return scoreFullTrajectory(run.trace)
+    },
+  },
+  mutateAdapter: {
+    async mutate({ parent, bottomTrials, childCount, generation }) {
+      const traces = bottomTrials.map((t) => trialTraceFromMultiShotTrial(t))
+      return proposePromptMutations({ parent, traces, childCount, generation })
+    },
+  },
+})
+
+deploy(result.promotedVariant.payload)
+```
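Reviewer sketch (not part of the diff): the "Actionable Side Information" section says ASI rows become aggregate metrics like `asi.error` and `surface.retrieval-policy`. One plausible reading is a per-key count, sketched below. `SketchAsiRow` and `countAsiMetrics` are hypothetical names, and counting (rather than, say, computing a rate) is an assumption; the diff does not show how the library computes these metrics.

```ts
// Hypothetical row shape mirroring the fields shown in the docs example.
interface SketchAsiRow {
  expectationId: string
  message: string
  severity: string
  responsibleSurface: string
  suggestion?: string
}

// Count rows per severity and per responsible surface.
// Key naming follows the `asi.<severity>` and `surface.<name>` examples in the docs.
function countAsiMetrics(rows: SketchAsiRow[]): Record<string, number> {
  const metrics: Record<string, number> = {}
  for (const row of rows) {
    const severityKey = `asi.${row.severity}`
    const surfaceKey = `surface.${row.responsibleSurface}`
    metrics[severityKey] = (metrics[severityKey] ?? 0) + 1
    metrics[surfaceKey] = (metrics[surfaceKey] ?? 0) + 1
  }
  return metrics
}
```

Keeping the failure mode and the responsible surface as separate keys preserves the signal a mutator needs, which a single scalar reward would discard.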
@@ -96,13 +96,13 @@ GET /v1/version
 ```json
 {
   "package": "@tangle-network/agent-eval",
-  "version": "0.12.0",
+  "version": "0.18.0",
   "wireVersion": "1.0.0",
   "apiSurface": ["judge", "listRubrics", "version"]
 }
 ```
 
-`version` matches the npm/PyPI package version. `wireVersion` bumps independently — only on breaking schema changes. A 0.13 server still serves 0.12 clients as long as `wireVersion` matches.
+`version` matches the npm/PyPI package version. `wireVersion` bumps independently — only on breaking request/response schema changes. Package versions can differ across releases as long as `wireVersion` matches.
 
 ### `GET /healthz` — liveness
 
 
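Reviewer sketch (not part of the diff): the revised wording says package versions can differ as long as `wireVersion` matches. A client-side check along those lines could look like this. `SketchVersionResponse` copies the fields from the JSON sample above, and `isWireCompatible` is a hypothetical helper; strict string equality is an assumption drawn from the word "matches".

```ts
// Field names taken from the GET /v1/version sample; the type name is hypothetical.
interface SketchVersionResponse {
  package: string
  version: string
  wireVersion: string
  apiSurface: string[]
}

// Compatibility depends only on wireVersion, never on the package version.
function isWireCompatible(clientWireVersion: string, server: SketchVersionResponse): boolean {
  return clientWireVersion === server.wireVersion
}

// Example: a client built against an older package release still talks to a 0.18.0 server.
const sampleServer: SketchVersionResponse = {
  package: '@tangle-network/agent-eval',
  version: '0.18.0',
  wireVersion: '1.0.0',
  apiSurface: ['judge', 'listRubrics', 'version'],
}
const compatible = isWireCompatible('1.0.0', sampleServer)
```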
@@ -0,0 +1,114 @@
+import {
+  runMultiShotOptimization,
+  trialTraceFromMultiShotTrial,
+  type MultiShotVariant,
+  type RunRecord,
+} from '@tangle-network/agent-eval'
+
+type Payload = {
+  instruction: string
+  quality: number
+}
+
+const baseline: MultiShotVariant<Payload> = {
+  id: 'baseline',
+  label: 'baseline',
+  generation: 0,
+  payload: {
+    instruction: 'Complete the user task.',
+    quality: 0.45,
+  },
+}
+
+const result = await runMultiShotOptimization<Payload>({
+  runId: 'demo-multi-shot',
+  target: 'demo-agent-system-prompt',
+  seedVariants: [baseline],
+  searchScenarioIds: ['search-brief', 'search-code-review', 'search-research'],
+  reps: 1,
+  generations: 2,
+  populationSize: 2,
+  scoreConcurrency: 2,
+  runner: {
+    async run({ variant, scenarioId }) {
+      return {
+        trace: {
+          scenarioId,
+          turns: [
+            { role: 'user', content: `Run ${scenarioId}` },
+            { role: 'assistant', content: `${variant.payload.instruction} quality=${variant.payload.quality}` },
+          ],
+          output: `quality=${variant.payload.quality}`,
+        },
+        costUsd: 0.01,
+        durationMs: 50,
+      }
+    },
+  },
+  scorer: {
+    async score({ variant }) {
+      return {
+        score: variant.payload.quality,
+        ok: true,
+        asi: variant.payload.quality >= 0.8
+          ? []
+          : [{
+              expectationId: 'complete-task',
+              message: 'The agent did not fully complete the task.',
+              severity: 'error',
+              responsibleSurface: 'system-prompt',
+              suggestion: 'Make completion criteria explicit before final response.',
+            }],
+      }
+    },
+  },
+  mutateAdapter: {
+    async mutate({ parent, bottomTrials, childCount, generation }) {
+      const traces = bottomTrials.map((trial) => trialTraceFromMultiShotTrial(trial))
+      const rationale = traces.flatMap((trace) => (trace.expectations ?? []).map((e) => e.phrase)).join('\n')
+      return Array.from({ length: childCount }, (_, i) => ({
+        id: `${parent.id}.g${generation}.${i}`,
+        label: 'completion-focused',
+        generation,
+        payload: {
+          instruction: `${parent.payload.instruction} Verify every requested step before final answer.`,
+          quality: 0.9,
+        },
+        rationale,
+      }))
+    },
+  },
+  gate: {
+    holdoutScenarioIds: ['holdout-brief', 'holdout-code-review', 'holdout-research'],
+    gate: {
+      baselineKey: 'baseline',
+      minProductiveRuns: 3,
+      pairedDeltaThreshold: 0,
+      seed: 7,
+    },
+    toRunRecord: ({ variant, scenarioId, rep, split, seed, trial }): RunRecord => ({
+      runId: `demo-${variant.id}-${scenarioId}-${rep}-${split}`,
+      experimentId: scenarioId,
+      candidateId: variant.id,
+      seed,
+      model: 'demo-model@2026-01-01',
+      promptHash: 'p'.repeat(64),
+      configHash: 'c'.repeat(64),
+      commitSha: 'deadbeef',
+      wallMs: trial.durationMs ?? 0,
+      costUsd: trial.cost ?? 0,
+      tokenUsage: { input: 1, output: 1 },
+      outcome: {
+        [split === 'holdout' ? 'holdoutScore' : 'searchScore']: trial.score,
+        raw: { score: trial.score },
+      },
+      splitTag: split,
+    }),
+  },
+})
+
+console.log({
+  searchBest: result.searchBestVariant.id,
+  promoted: result.promotedVariant.id,
+  gate: result.gate?.decision ?? null,
+})
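Reviewer sketch (not part of the diff): the docs say baseline and candidate run with the same `(scenarioId, rep)` seed. One way to get that property is to derive the seed from those two values only, so the variant id never influences it. The `pairedSeed` helper below is hypothetical and uses a 32-bit FNV-1a hash as a stand-in; the diff does not show how the library actually derives seeds.

```ts
// Derive a deterministic 32-bit seed from (scenarioId, rep).
// The variant id is deliberately excluded so paired runs share a seed.
// FNV-1a is used here purely for illustration.
function pairedSeed(scenarioId: string, rep: number): number {
  const key = `${scenarioId}:${rep}`
  let hash = 0x811c9dc5 // FNV-1a 32-bit offset basis
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) // FNV 32-bit prime, wrapped to 32 bits
  }
  return hash >>> 0 // normalize to an unsigned 32-bit integer
}

// Baseline and candidate both call pairedSeed with the same arguments,
// so any score difference is not due to seed noise.
const baselineSeed = pairedSeed('holdout-brief', 0)
const candidateSeed = pairedSeed('holdout-brief', 0)
```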
@@ -1,7 +1,15 @@
 {
   "name": "@tangle-network/agent-eval",
-  "version": "0.17.3",
+  "version": "0.18.0",
   "description": "Trace-first evaluation framework for Tangle agents. Core (spans, pipelines, sandbox harness, OTLP export), trust (dataset, red-team, calibration, behavior DSL), builder-of-builders (three-layer eval, resumable sessions, meta-runtime correlation), and frontier (meta-eval correlation study, Process Reward Modeling, bisector).",
+  "homepage": "https://github.com/tangle-network/agent-eval#readme",
+  "repository": {
+    "type": "git",
+    "url": "git+https://github.com/tangle-network/agent-eval.git"
+  },
+  "bugs": {
+    "url": "https://github.com/tangle-network/agent-eval/issues"
+  },
   "type": "module",
   "main": "./dist/index.js",
   "types": "./dist/index.d.ts",
@@ -31,7 +39,8 @@
     "agent-eval": "dist/cli.js"
   },
   "files": [
-    "dist"
+    "dist",
+    "docs"
   ],
   "publishConfig": {
     "access": "public"