tangle-network · drewstone · May 14, 2026 · May 14, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,70 @@
 # Changelog
 
+## 0.25.0 — ProductionLoop primitive: close the eval → prod → eval cycle
+
+This release ships the **orchestration layer** that turns the existing
+eval substrate into a continuously-improving production system. Static
+prompts decay; today's regulation flips tomorrow. The pieces to close
+the loop were already in the package (`runMultiShotOptimization`,
+`failureClusterView`, `evaluateReleaseConfidence`, `extractPreferences`,
+`FeedbackTrajectoryStore`, `TraceStore`); this release adds the one
+clean primitive that wires them together end-to-end.
+
+### Added
+
+- **`runProductionLoop({ ... })`** (`src/production-loop.ts`,
+  `@experimental`) — one call = one cycle. Ingests production traces
+  and feedback, clusters failures, runs evolve against the worst
+  cluster, gates with `HeldOutGate` + `evaluateReleaseConfidence`
+  (fail-closed), and — when wired with an `AutoPrClient` — opens a PR
+  with the improved prompt. Idempotent + replayable: same `runId`
+  yields the same plan. Cron / GitHub Actions are the consumer's job;
+  the primitive doesn't own scheduling.
+
+- **`proposeAutomatedPullRequest(client, input)`** + two transports
+  (`src/auto-pr.ts`, `@experimental`):
+    - `httpGithubClient({ token, ... })` — direct REST against
+      `api.github.com`, no extra deps. Idempotent on branch name:
+      existing open PRs are returned, not duplicated.
+    - `ghCliClient({ ... })` — shells out to `gh` for environments
+      where developer auth state is already configured.
+  Both validate inputs (no `..` paths, no whitespace branches, no
+  duplicate file changes) and surface `ValidationError` / `ConfigError`
+  from the typed taxonomy.
+
+- **`POST /v1/feedback` + `POST /v1/traces/ingest`** wire endpoints
+  (`src/wire/`). Both Zod-validated, both append to the configured
+  store (`FeedbackTrajectoryStore` / `TraceStore`). 503 when no store
+  is wired (fail loud, not silent). Traces ingest accepts both
+  `application/json` (`{events:[...]}`) and `application/x-ndjson` for
+  streaming production runtimes. Schemas (`TraceEvent`,
+  `FeedbackTrajectory`, `TracesIngestRequest/Response`,
+  `FeedbackIngestResponse`) added to `openapi.json` for cross-language
+  clients.
+
+- **Optional bearer-token auth** on the wire server, configured with a static
+  token (`createApp({ auth: { bearer: '...' } })`) or a verifier function for
+  rotating tokens. `/healthz` and `/v1/version` remain unprotected
+  (regression: never lock monitoring out of the runtime).
+
+- **`examples/production-loop/`** — synthetic end-to-end demo wiring
+  the loop against in-memory trace + feedback stores and a fake
+  auto-PR client. Shows the failure-cluster trigger, the evolve round,
+  the gate verdict, and the PR-shaped output without requiring
+  credentials or a live model.
+
+### Changed
+
+- **Wire server** (`createApp(opts)`) now accepts optional
+  `IngestionStores` (`{ traceStore?, feedbackStore? }`) and `auth`.
+  Existing zero-arg callers continue to work — judge / rubrics /
+  version / healthz are unchanged.
+
+### Status tags
+
+- Every new export is `@experimental` initially. Pin the patch version
+  if you depend on it. All other 0.24.0 stability tags are preserved.
+
 ## 0.24.0 — DX cleanup: framing, stability tags, lint, taxonomy, strict indices
 
 This release is **DX + correctness**. No production behavior moved; consumer

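The changelog entry above says `POST /v1/traces/ingest` accepts both `application/json` (`{events:[...]}`) and `application/x-ndjson`. As a rough illustration of what that implies for a handler, here is a minimal sketch that normalizes the two body shapes into one event array. `parseIngestBody` and `OpaqueEvent` are hypothetical names, not part of the package, and events are treated as opaque objects because the real `TraceEvent` schema and its Zod validation are not shown in this diff.

```typescript
// Hypothetical helper, not part of @tangle-network/agent-eval.
// Events are opaque records here; the real TraceEvent schema is not reproduced.
type OpaqueEvent = Record<string, unknown>

function isRecord(value: unknown): value is OpaqueEvent {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

function parseIngestBody(contentType: string, body: string): OpaqueEvent[] {
  // Drop parameters such as "; charset=utf-8" before comparing media types.
  const mediaType = (contentType.split(';')[0] ?? '').trim().toLowerCase()

  if (mediaType === 'application/json') {
    const parsed: unknown = JSON.parse(body)
    if (!isRecord(parsed) || !Array.isArray(parsed.events)) {
      throw new Error('expected a JSON object of the form {events:[...]}')
    }
    const events: unknown[] = parsed.events
    if (!events.every(isRecord)) {
      throw new Error('every event must be a JSON object')
    }
    return events as OpaqueEvent[]
  }

  if (mediaType === 'application/x-ndjson') {
    // One JSON object per line; blank lines (and trailing \r) are ignored.
    return body
      .split('\n')
      .map((line) => line.trim())
      .filter((line) => line.length > 0)
      .map((line, index) => {
        const parsed: unknown = JSON.parse(line)
        if (!isRecord(parsed)) {
          throw new Error(`line ${index + 1} is not a JSON object`)
        }
        return parsed
      })
  }

  throw new Error(`unsupported content type: ${mediaType}`)
}
```

Both branches return the same array shape, so whatever appends to the store only has to handle one input type. A streaming runtime would likely parse NDJSON incrementally rather than buffering the whole body as this sketch does.
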
diff --git a/README.md b/README.md
@@ -88,6 +88,75 @@ await product.storeEvalResult(task.id, result)
 Same loop shape in production, replay, benchmark, and optimization. Swap the
 dependencies behind `observe()` and `act()`, never the eval contract.
 
+## Production loop — close the eval → prod → eval cycle (0.25.0)
+
+Static prompts decay. Yesterday's FTC rule flips today; yesterday's tool quirk
+becomes today's incident. The production agents that win are the ones that
+**continuously re-train against live failure modes**.
+
+`runProductionLoop` is the orchestration layer that wires the existing eval
+substrate into a self-improvement cycle you can run on a schedule:
+
+```ts
+import {
+  runProductionLoop,
+  httpGithubClient,
+  FileSystemFeedbackTrajectoryStore,
+} from '@tangle-network/agent-eval'
+import { FileSystemTraceStore } from '@tangle-network/agent-eval/traces'
+
+const result = await runProductionLoop({
+  runId: `weekly-${new Date().toISOString().slice(0, 10)}`,
+  target: 'tax-agent',
+
+  // 1. Where production traces + feedback land. Wire the HTTP ingestion
+  //    endpoints (POST /v1/traces/ingest, POST /v1/feedback) from your
+  //    runtime; the same store reads them here.
+  traceStore: new FileSystemTraceStore({ dir: 'data/prod-traces' }),
+  feedbackStore: new FileSystemFeedbackTrajectoryStore({ dir: 'data/prod-feedback' }),
+
+  // 2. Cluster threshold: act on failure groups ≥ 20 runs or ≥ 5% of corpus.
+  cluster: { minClusterSize: 20, minSeverityRatio: 0.05, maxClustersPerCycle: 1 },
+
+  // 3. Evolve: seed = current prompt, gate against holdout scenarios.
+  evolve: {
+    baselinePrompt: currentSystemPrompt,
+    holdoutScenarios: productionShapeScenarios,
+    runner,                            // your agent driver
+    scorer,                            // calibrated judge or rubric
+    mutator,                           // GEPA-style or addendum-style mutator
+    gate: {
+      baselineKey: 'baseline',
+      minProductiveRuns: 5,
+      pairedDeltaThreshold: 0.03,      // minimum paired Δ vs. baseline on holdout
+      overfitGapThreshold: 0.10,
+    },
+  },
+
+  // 4. Ship: when the gate passes, open a PR with the new prompt.
+  ship: {
+    client: httpGithubClient({ token: process.env.GITHUB_TOKEN! }),
+    repo: { owner: 'tangle-network', name: 'tax-agent' },
+    branchPrefix: 'eval/auto-improve',
+    promptFilePath: 'prompts/tax-agent-system.txt',
+    reviewers: ['drew'],
+  },
+
+  cron: { cadence: 'weekly' },         // surface-only; consumer schedules
+})
+
+console.log(result.decision)            // 'pr_opened' | 'gate_failed' | 'no_actionable_failures' | ...
+console.log(result.pullRequest?.prUrl)  // populated when a PR was opened
+```
+
+The primitive runs **one cycle**. Schedule it in GitHub Actions with a `schedule`
+(cron) trigger, plus `workflow_dispatch` for manual runs. It is **idempotent +
+replayable**: same `runId` → same plan. Gates are fail-closed: a candidate that
+beats baseline on search but overfits on holdout never lands.
+
+Full runnable demo (synthetic traces, no credentials) in
+[`examples/production-loop`](./examples/production-loop/README.md).
+
 ## Self-improvement loop
 
 Eval doesn't end at "pass/fail." Outcomes become training signal, mutation
@@ -222,6 +291,8 @@ and runtime. See [`examples/`](./examples/).
   closed loop — score, reflect, mutate, re-score, repeat.
 - [`examples/fine-tune-with-prime-rl`](./examples/fine-tune-with-prime-rl/README.md):
   RunRecord → preferences → trainer (prime-rl) → next campaign.
+- [`examples/production-loop`](./examples/production-loop/README.md):
+  ingest prod traces + feedback, cluster failures, evolve, gate, open a PR.
 
 ## Docs
 

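The README hunk above describes the gate as fail-closed, with `minProductiveRuns`, `pairedDeltaThreshold`, and `overfitGapThreshold` as its knobs. The sketch below shows one plausible reading of those three checks. It is not the actual `HeldOutGate` implementation: `decideGate`, `PairedRun`, and `GateVerdict` are hypothetical. It also assumes the overfit gap means "candidate mean score on search minus candidate mean score on holdout", which this diff does not confirm.

```typescript
// Hypothetical sketch: the real HeldOutGate may compute its statistics differently.
interface GateConfig {
  minProductiveRuns: number
  pairedDeltaThreshold: number
  overfitGapThreshold: number
}

// One holdout scenario scored under both the baseline and the candidate prompt.
interface PairedRun {
  baselineScore: number
  candidateScore: number
}

interface GateVerdict {
  promote: boolean
  reasons: string[]
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  const upper = sorted[mid] ?? NaN
  const lower = sorted[mid - 1] ?? NaN
  return sorted.length % 2 === 1 ? upper : (lower + upper) / 2
}

function decideGate(
  config: GateConfig,
  holdout: PairedRun[],
  candidateSearchMean: number,
): GateVerdict {
  // Assumption: an errored run is recorded with a non-finite score and is not "productive".
  const productive = holdout.filter(
    (run) => Number.isFinite(run.baselineScore) && Number.isFinite(run.candidateScore),
  )
  if (productive.length === 0 || productive.length < config.minProductiveRuns) {
    return { promote: false, reasons: ['too few productive holdout runs'] }
  }

  const reasons: string[] = []
  const medianDelta = median(productive.map((run) => run.candidateScore - run.baselineScore))
  if (!(medianDelta >= config.pairedDeltaThreshold)) {
    reasons.push(`median paired delta ${medianDelta.toFixed(3)} is below threshold`)
  }

  const overfitGap = candidateSearchMean - mean(productive.map((run) => run.candidateScore))
  if (!(overfitGap <= config.overfitGapThreshold)) {
    reasons.push(`overfit gap ${overfitGap.toFixed(3)} exceeds threshold`)
  }

  // Fail closed: promote only when every check passed. NaN inputs fail the
  // negated comparisons above, so they also block promotion.
  return { promote: reasons.length === 0, reasons }
}
```

Writing each check as `!(value >= threshold)` rather than `value < threshold` is what makes the sketch fail closed on bad data: a `NaN` statistic is neither above nor below the threshold, and the negated form treats it as a failure.
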
diff --git a/clients/python/pyproject.toml b/clients/python/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"
 
 [project]
 name = "agent-eval-rpc"
-version = "0.24.0"
+version = "0.25.0"
 description = "Python RPC client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC. Eval logic runs in the Node runtime; this package is a thin wire client."
 readme = "README.md"
 requires-python = ">=3.10"

diff --git a/examples/production-loop/README.md b/examples/production-loop/README.md
@@ -0,0 +1,79 @@
+# Production loop
+
+End-to-end demo of `runProductionLoop` — the orchestration layer that
+closes eval → prod → eval.
+
+## What it shows
+
+- 8 synthetic production failures (all hitting the same `instruction_following`
+  failure class — missing statute citations on FTC rule questions) seeded
+  into an `InMemoryTraceStore`.
+- 8 matching 👎 user-feedback labels seeded into an
+  `InMemoryFeedbackTrajectoryStore`.
+- One `runProductionLoop` cycle:
+  - `failureClusterView` surfaces the cluster, which crosses the
+    `minClusterSize: 5` threshold.
+  - `runMultiShotOptimization` runs 2 generations × 2 reps over 3
+    holdout scenarios, with an addendum-style mutator that appends a
+    citation directive to the baseline prompt.
+  - `HeldOutGate` checks that the paired-Δ on the holdout split is
+    positive with `minProductiveRuns: 3`.
+  - `evaluateReleaseConfidence` cross-checks pass-rate, mean score,
+    overfit gap, and the gate decision (fail-closed on any axis).
+  - On pass, a fake `AutoPrClient` captures the PR plan — a real
+    deployment would wire `httpGithubClient({ token })` or
+    `ghCliClient()`.
+
+## Run
+
+```sh
+pnpm tsx examples/production-loop/index.ts
+```
+
+## Expected output
+
+```
+═══════════════════════════════════════════════════════════════
+production-loop demo · synthetic prod data → improved prompt
+═══════════════════════════════════════════════════════════════
+runId          : prod-loop-demo-<epoch>
+target         : tax-agent
+decision       : pr_opened
+observed runs  : 8
+observed feedback: 8
+clusters seen  : 1
+acted-on       : class=instruction_following runs=8 scenarios=1
+gate           : promote=true medianΔ=0.450 CI=[0.450, 0.450]
+release status : pass (passRate=...)
+───────────────────────────────────────────────────────────────
+PR opened      : https://github.com/tangle-network/tax-agent/pull/synthetic-1
+branch         : eval/auto-improve/prod-loop-demo-<epoch>
+head SHA       : face-cafe-beef-...
+───────────────────────────────────────────────────────────────
+PR title: tax-agent: production-loop prompt update (prod-loop-demo-<epoch>)
+PR file: prompts/tax-agent-system.txt
+PR body preview:
+  ## Production-loop prompt update — `tax-agent`
+
+  Run id: `prod-loop-demo-<epoch>`
+  Decision: `pr_opened`
+  Observed in this cycle: 8 prod runs, 8 feedback trajectories.
+
+  ### Triggering failure cluster
+  ...
+═══════════════════════════════════════════════════════════════
+```
+
+## Adapt this to your product
+
+| Synthetic                       | Production                                          |
+| ------------------------------- | --------------------------------------------------- |
+| `InMemoryTraceStore`            | `FileSystemTraceStore`, or HTTP-ingest via `POST /v1/traces/ingest` |
+| `InMemoryFeedbackTrajectoryStore` | `FileSystemFeedbackTrajectoryStore`, or HTTP-ingest via `POST /v1/feedback` |
+| deterministic `runner`          | your agent driver invoking real tools               |
+| deterministic `scorer`          | calibrated judge (`callLlmJson` + `Rubric`)         |
+| `captureAutoPrClient()`         | `httpGithubClient({ token })` or `ghCliClient()`    |
+| `main()`                        | GitHub Actions workflow (`schedule` cron + `workflow_dispatch`) |
+
+The primitive is **idempotent** + **replayable**: re-running with the
+same `runId` produces the same plan. Safe to retry on transient errors.
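
The idempotency claim above, together with the input checks listed in the changelog (no `..` paths, no whitespace branches, no duplicate file changes), can be illustrated with a small sketch. Everything here is hypothetical: `PlanInput`, `PrPlan`, `buildPlan`, and `validatePlan` are illustration-only names, not the library's `AutoPrClient` input shape. The branch and title formats are inferred from the demo output shown earlier in this README (`eval/auto-improve/<runId>` and `<target>: production-loop prompt update (<runId>)`).

```typescript
// Hypothetical types and helpers; not the library's actual auto-PR input shape.
interface FileChange {
  path: string
  content: string
}

interface PrPlan {
  branch: string
  title: string
  files: FileChange[]
}

interface PlanInput {
  target: string
  runId: string
  branchPrefix: string
  promptFilePath: string
  prompt: string
}

// Pure function of its inputs: no clock, no randomness. Retrying with the same
// runId therefore targets the same branch with the same contents.
function buildPlan(input: PlanInput): PrPlan {
  return {
    branch: `${input.branchPrefix}/${input.runId}`,
    title: `${input.target}: production-loop prompt update (${input.runId})`,
    files: [{ path: input.promptFilePath, content: input.prompt }],
  }
}

function validatePlan(plan: PrPlan): void {
  if (plan.branch.length === 0 || /\s/.test(plan.branch)) {
    throw new Error(`invalid branch name: "${plan.branch}"`)
  }
  const seen = new Set<string>()
  for (const file of plan.files) {
    // Reject parent-directory segments so a change cannot escape the repo root.
    if (file.path.split(/[\\/]/).some((segment) => segment === '..')) {
      throw new Error(`path must not contain "..": ${file.path}`)
    }
    if (seen.has(file.path)) {
      throw new Error(`duplicate file change: ${file.path}`)
    }
    seen.add(file.path)
  }
}
```

Because the branch name is derived only from `branchPrefix` and `runId`, a client that returns an existing open PR for that branch (as the changelog describes for `httpGithubClient`) makes a retry a no-op rather than a duplicate.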