Merged
65 changes: 65 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,70 @@
# Changelog

## 0.25.0 — ProductionLoop primitive: close the eval → prod → eval cycle

This release ships the **orchestration layer** that turns the existing
eval substrate into a continuously-improving production system. Static
prompts decay; today's regulation flips tomorrow. The pieces to close
the loop were already in the package (`runMultiShotOptimization`,
`failureClusterView`, `evaluateReleaseConfidence`, `extractPreferences`,
`FeedbackTrajectoryStore`, `TraceStore`); this release adds the one
clean primitive that wires them together end-to-end.

### Added

- **`runProductionLoop({ ... })`** (`src/production-loop.ts`,
`@experimental`) — one call = one cycle. Ingests production traces
and feedback, clusters failures, runs evolve against the worst
cluster, gates with `HeldOutGate` + `evaluateReleaseConfidence`
(fail-closed), and — when wired with an `AutoPrClient` — opens a PR
with the improved prompt. Idempotent + replayable: same `runId`
yields the same plan. Cron / GitHub Actions are the consumer's job;
the primitive doesn't own scheduling.

- **`proposeAutomatedPullRequest(client, input)`** + two transports
  (`src/auto-pr.ts`, `@experimental`):
  - `httpGithubClient({ token, ... })`: direct REST against
    `api.github.com`, no extra deps. Idempotent on branch name:
    existing open PRs are returned, not duplicated.
  - `ghCliClient({ ... })`: shells out to `gh` for environments
    where developer auth state is already configured.

  Both validate inputs (no `..` paths, no whitespace branches, no
  duplicate file changes) and surface `ValidationError` / `ConfigError`
  from the typed taxonomy.

- **`POST /v1/feedback` + `POST /v1/traces/ingest`** wire endpoints
(`src/wire/`). Both Zod-validated, both append to the configured
store (`FeedbackTrajectoryStore` / `TraceStore`). 503 when no store
is wired (fail loud, not silent). Traces ingest accepts both
`application/json` (`{events:[...]}`) and `application/x-ndjson` for
streaming production runtimes. Schemas (`TraceEvent`,
`FeedbackTrajectory`, `TracesIngestRequest/Response`,
`FeedbackIngestResponse`) added to `openapi.json` for cross-language
clients.

- **Optional bearer-token auth** on the wire server, configured via
`createApp({ auth: { bearer: '...' } })` or as a verifier function
for rotating tokens. `/healthz` and `/v1/version` remain unprotected
(regression: never lock monitoring out of the runtime).

- **`examples/production-loop/`** — synthetic end-to-end demo wiring
the loop against in-memory trace + feedback stores and a fake
auto-PR client. Shows the failure-cluster trigger, the evolve round,
the gate verdict, and the PR-shaped output without requiring
credentials or a live model.
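
The input checks listed for the auto-PR transports (no `..` paths, no whitespace in branch names, no duplicate file changes) can be pictured with a minimal sketch. Everything here is an assumption made for illustration: `validatePrInput`, `PrInput`, and `FileChange` are hypothetical names, and the real validation in `src/auto-pr.ts` may differ and throws `ValidationError` rather than returning strings.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the package's API.
interface FileChange {
  path: string
  content: string
}

interface PrInput {
  branch: string
  files: FileChange[]
}

// Returns human-readable problems; an empty list means the input passed.
function validatePrInput(input: PrInput): string[] {
  const problems: string[] = []

  // Empty branch names and branch names containing whitespace are rejected.
  if (input.branch.length === 0 || /\s/.test(input.branch)) {
    problems.push(`invalid branch name: "${input.branch}"`)
  }

  const seen = new Set<string>()
  for (const file of input.files) {
    // Reject any ".." segment so a change cannot escape the repo root.
    if (file.path.split('/').includes('..')) {
      problems.push(`path traversal in: ${file.path}`)
    }
    // Reject a second change to a path that was already seen.
    if (seen.has(file.path)) {
      problems.push(`duplicate file change: ${file.path}`)
    }
    seen.add(file.path)
  }
  return problems
}

const demoProblems = validatePrInput({
  branch: 'eval/auto improve',
  files: [
    { path: 'prompts/a.txt', content: 'v1' },
    { path: 'prompts/a.txt', content: 'v2' },
    { path: '../secrets.txt', content: 'x' },
  ],
})
console.log(demoProblems.length) // 3
```

Collecting every problem before failing, rather than stopping at the first, is a design choice of this sketch; it makes a rejected PR plan easier to fix in one pass.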

### Changed

- **Wire server** (`createApp(opts)`) now accepts optional
`IngestionStores` (`{ traceStore?, feedbackStore? }`) and `auth`.
Existing zero-arg callers continue to work — judge / rubrics /
version / healthz are unchanged.
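
As a rough picture of the optional bearer auth, the sketch below accepts either a static token or a verifier function and leaves `/healthz` and `/v1/version` open. It is framework-agnostic and hypothetical: `isAuthorized` and `BearerAuth` are illustrative names, not exports of the package, and the real wire server may parse headers differently.

```typescript
// Illustrative sketch only: isAuthorized and BearerAuth are hypothetical names,
// not exports of @tangle-network/agent-eval.
type BearerAuth = string | ((token: string) => boolean)

// Paths the changelog says stay open so monitoring keeps working.
const UNPROTECTED_PATHS = new Set(['/healthz', '/v1/version'])

function isAuthorized(
  path: string,
  authorizationHeader: string | undefined,
  auth: BearerAuth,
): boolean {
  if (UNPROTECTED_PATHS.has(path)) return true

  // Expect "Bearer <token>"; anything else is rejected.
  const token = /^Bearer (\S+)$/.exec(authorizationHeader ?? '')?.[1]
  if (token === undefined) return false

  // Static secret: plain comparison (a real server would likely prefer a
  // constant-time compare). Verifier function: delegate, e.g. for rotation.
  return typeof auth === 'string' ? token === auth : auth(token)
}

// A verifier backed by a set of currently valid tokens (made-up values).
const rotatingTokens = new Set(['tok-2024-05', 'tok-2024-06'])
const verifier = (t: string): boolean => rotatingTokens.has(t)

console.log(isAuthorized('/healthz', undefined, 'secret'))                // true
console.log(isAuthorized('/v1/feedback', undefined, 'secret'))            // false
console.log(isAuthorized('/v1/feedback', 'Bearer secret', 'secret'))      // true
console.log(isAuthorized('/v1/feedback', 'Bearer tok-2024-06', verifier)) // true
```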

### Status tags

- Every new export is `@experimental` initially. Pin the patch version
if you depend on it. All other 0.24.0 stability tags are preserved.

## 0.24.0 — DX cleanup: framing, stability tags, lint, taxonomy, strict indices

This release is **DX + correctness**. No production behavior moved; consumer
71 changes: 71 additions & 0 deletions README.md
@@ -88,6 +88,75 @@ await product.storeEvalResult(task.id, result)
Same loop shape in production, replay, benchmark, and optimization. Swap the
dependencies behind `observe()` and `act()`, never the eval contract.

## Production loop — close the eval → prod → eval cycle (0.25.0)

Static prompts decay. Yesterday's FTC rule flips today; yesterday's tool quirk
becomes today's incident. The production agents that win are the ones that
**continuously re-train against live failure modes**.

`runProductionLoop` is the orchestration layer that wires the existing eval
substrate into a self-improvement cycle you run on a schedule:

```ts
import {
  runProductionLoop,
  httpGithubClient,
  FileSystemFeedbackTrajectoryStore,
} from '@tangle-network/agent-eval'
import { FileSystemTraceStore } from '@tangle-network/agent-eval/traces'

const result = await runProductionLoop({
  runId: `weekly-${new Date().toISOString().slice(0, 10)}`,
  target: 'tax-agent',

  // 1. Where production traces + feedback land. Wire the HTTP ingestion
  //    endpoints (POST /v1/traces/ingest, POST /v1/feedback) from your
  //    runtime; the same store reads them here.
  traceStore: new FileSystemTraceStore({ dir: 'data/prod-traces' }),
  feedbackStore: new FileSystemFeedbackTrajectoryStore({ dir: 'data/prod-feedback' }),

  // 2. Cluster threshold: act on failure groups ≥ 20 runs or ≥ 5% of corpus.
  cluster: { minClusterSize: 20, minSeverityRatio: 0.05, maxClustersPerCycle: 1 },

  // 3. Evolve: seed = current prompt, gate against holdout scenarios.
  evolve: {
    baselinePrompt: currentSystemPrompt,
    holdoutScenarios: productionShapeScenarios,
    runner, // your agent driver
    scorer, // calibrated judge or rubric
    mutator, // GEPA-style or addendum-style mutator
    gate: {
      baselineKey: 'baseline',
      minProductiveRuns: 5,
      pairedDeltaThreshold: 0.03, // require Nσ improvement on holdout
      overfitGapThreshold: 0.10,
    },
  },

  // 4. Ship: when the gate passes, open a PR with the new prompt.
  ship: {
    client: httpGithubClient({ token: process.env.GITHUB_TOKEN! }),
    repo: { owner: 'tangle-network', name: 'tax-agent' },
    branchPrefix: 'eval/auto-improve',
    promptFilePath: 'prompts/tax-agent-system.txt',
    reviewers: ['drew'],
  },

  cron: { cadence: 'weekly' }, // surface-only; consumer schedules
})

console.log(result.decision) // 'pr_opened' | 'gate_failed' | 'no_actionable_failures' | ...
console.log(result.pullRequest?.prUrl) // populated when a PR was opened
```

The primitive runs **one cycle**. Schedule it from GitHub Actions with a
`schedule` (cron) trigger, plus `workflow_dispatch` for manual runs. It is
**idempotent + replayable**: same `runId` → same plan.
Gate failures are fail-closed — a candidate that beats baseline on search but
overfits on holdout never lands.
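
To make "fail-closed" concrete, here is a minimal sketch of a gate decision using the three thresholds from the config above. It is not the package's `HeldOutGate`: `gatePromotes` and `CandidateStats` are hypothetical, and defining the overfit gap as search gain minus holdout gain is an assumption of this sketch.

```typescript
// Illustrative only: the option names mirror the README's gate config, but the
// decision logic is an assumption, not the package's HeldOutGate.
interface GateThresholds {
  minProductiveRuns: number
  pairedDeltaThreshold: number
  overfitGapThreshold: number
}

interface CandidateStats {
  productiveRuns: number
  searchDelta?: number // candidate minus baseline on the search split
  holdoutDelta?: number // candidate minus baseline on the holdout split
}

function gatePromotes(stats: CandidateStats, t: GateThresholds): boolean {
  const { searchDelta, holdoutDelta } = stats
  // Fail closed: missing or non-finite measurements never promote.
  if (searchDelta === undefined || holdoutDelta === undefined) return false
  if (!Number.isFinite(searchDelta) || !Number.isFinite(holdoutDelta)) return false
  if (stats.productiveRuns < t.minProductiveRuns) return false
  if (holdoutDelta < t.pairedDeltaThreshold) return false
  // Assumed definition: overfit gap = search gain minus holdout gain.
  return searchDelta - holdoutDelta <= t.overfitGapThreshold
}

const thresholds: GateThresholds = {
  minProductiveRuns: 5,
  pairedDeltaThreshold: 0.03,
  overfitGapThreshold: 0.10,
}

// Strong on search, weak on holdout: the gap of about 0.25 exceeds 0.10.
console.log(gatePromotes({ productiveRuns: 6, searchDelta: 0.30, holdoutDelta: 0.05 }, thresholds)) // false
// Modest but consistent gain on both splits.
console.log(gatePromotes({ productiveRuns: 6, searchDelta: 0.08, holdoutDelta: 0.06 }, thresholds)) // true
// Holdout measurement missing: rejected rather than assumed fine.
console.log(gatePromotes({ productiveRuns: 6, searchDelta: 0.08 }, thresholds)) // false
```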

Full runnable demo (synthetic traces, no credentials) in
[`examples/production-loop`](./examples/production-loop/README.md).

## Self-improvement loop

Eval doesn't end at "pass/fail." Outcomes become training signal, mutation
@@ -222,6 +291,8 @@ and runtime. See [`examples/`](./examples/).
closed loop — score, reflect, mutate, re-score, repeat.
- [`examples/fine-tune-with-prime-rl`](./examples/fine-tune-with-prime-rl/README.md):
RunRecord → preferences → trainer (prime-rl) → next campaign.
- [`examples/production-loop`](./examples/production-loop/README.md):
ingest prod traces + feedback, cluster failures, evolve, gate, open a PR.

## Docs

2 changes: 1 addition & 1 deletion clients/python/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "hatchling.build"

[project]
name = "agent-eval-rpc"
-version = "0.24.0"
+version = "0.25.0"
description = "Python RPC client for @tangle-network/agent-eval — judge content against rubrics over HTTP or stdio RPC. Eval logic runs in the Node runtime; this package is a thin wire client."
readme = "README.md"
requires-python = ">=3.10"
79 changes: 79 additions & 0 deletions examples/production-loop/README.md
@@ -0,0 +1,79 @@
# Production loop

End-to-end demo of `runProductionLoop` — the orchestration layer that
closes eval → prod → eval.

## What it shows

- 8 synthetic production failures (all hitting the same `instruction_following`
  failure class: missing statute citations on FTC rule questions) seeded
  into an `InMemoryTraceStore`.
- 8 matching 👎 user-feedback labels seeded into an
  `InMemoryFeedbackTrajectoryStore`.
- One `runProductionLoop` cycle:
  - `failureClusterView` surfaces the cluster, which crosses the
    `minClusterSize: 5` threshold.
  - `runMultiShotOptimization` runs 2 generations × 2 reps over 3
    holdout scenarios, with an addendum-style mutator that appends a
    citation directive to the baseline prompt.
  - `HeldOutGate` checks that the paired-Δ on the holdout split is
    positive with `minProductiveRuns: 3`.
  - `evaluateReleaseConfidence` cross-checks pass-rate, mean score,
    overfit gap, and the gate decision (fail-closed on any axis).
  - On pass, a fake `AutoPrClient` captures the PR plan; a real
    deployment would wire `httpGithubClient({ token })` or
    `ghCliClient()`.
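
The first step of the cycle above, deciding whether a failure cluster is large enough to act on, might look roughly like the sketch below. The shapes and the `selectClusters` helper are hypothetical, not the output of `failureClusterView`. It assumes a cluster qualifies when it meets either threshold, following the "≥ 20 runs or ≥ 5% of corpus" comment in the main README, and the `minSeverityRatio` value is borrowed from that README rather than from this demo.

```typescript
// Hypothetical shapes for illustration; the real failureClusterView output may differ.
interface FailureCluster {
  failureClass: string
  runs: number
}

interface ClusterConfig {
  minClusterSize: number
  minSeverityRatio: number
  maxClustersPerCycle: number
}

// Assumption: a cluster is actionable when it meets EITHER threshold.
function selectClusters(
  clusters: FailureCluster[],
  totalRuns: number,
  cfg: ClusterConfig,
): FailureCluster[] {
  return clusters
    .filter(
      (c) =>
        c.runs >= cfg.minClusterSize ||
        (totalRuns > 0 && c.runs / totalRuns >= cfg.minSeverityRatio),
    )
    .sort((a, b) => b.runs - a.runs) // worst (largest) cluster first
    .slice(0, cfg.maxClustersPerCycle)
}

const demoConfig: ClusterConfig = {
  minClusterSize: 5,
  minSeverityRatio: 0.05, // assumed; taken from the main README example
  maxClustersPerCycle: 1,
}

// The demo's corpus: 8 observed runs, all in one failure class.
const selected = selectClusters(
  [{ failureClass: 'instruction_following', runs: 8 }],
  8,
  demoConfig,
)
console.log(selected.map((c) => c.failureClass).join(',')) // instruction_following
```

With `maxClustersPerCycle: 1`, only the largest qualifying cluster is returned, which matches the "runs evolve against the worst cluster" framing; ranking by run count alone is a simplification of this sketch.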

## Run

```sh
pnpm tsx examples/production-loop/index.ts
```

## Expected output

```
═══════════════════════════════════════════════════════════════
production-loop demo · synthetic prod data → improved prompt
═══════════════════════════════════════════════════════════════
runId : prod-loop-demo-<epoch>
target : tax-agent
decision : pr_opened
observed runs : 8
observed feedback: 8
clusters seen : 1
acted-on : class=instruction_following runs=8 scenarios=1
gate : promote=true medianΔ=0.450 CI=[0.450, 0.450]
release status : pass (passRate=...)
───────────────────────────────────────────────────────────────
PR opened : https://github.com/tangle-network/tax-agent/pull/synthetic-1
branch : eval/auto-improve/prod-loop-demo-<epoch>
head SHA : face-cafe-beef-...
───────────────────────────────────────────────────────────────
PR title: tax-agent: production-loop prompt update (prod-loop-demo-<epoch>)
PR file: prompts/tax-agent-system.txt
PR body preview:
## Production-loop prompt update — `tax-agent`

Run id: `prod-loop-demo-<epoch>`
Decision: `pr_opened`
Observed in this cycle: 8 prod runs, 8 feedback trajectories.

### Triggering failure cluster
...
═══════════════════════════════════════════════════════════════
```

## Adapt this to your product

| Synthetic | Production |
| ------------------------------- | --------------------------------------------------- |
| `InMemoryTraceStore` | `FileSystemTraceStore`, or HTTP-ingest via `POST /v1/traces/ingest` |
| `InMemoryFeedbackTrajectoryStore` | `FileSystemFeedbackTrajectoryStore`, or HTTP-ingest via `POST /v1/feedback` |
| deterministic `runner` | your agent driver invoking real tools |
| deterministic `scorer` | calibrated judge (`callLlmJson` + `Rubric`) |
| `captureAutoPrClient()` | `httpGithubClient({ token })` or `ghCliClient()` |
| `main()`                        | scheduled GitHub Action (`schedule` cron trigger + `workflow_dispatch`) |

The primitive is **idempotent** + **replayable**: re-running with the
same `runId` produces the same plan. Safe to retry on transient errors.
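
Retries only benefit from idempotency if they reuse the same `runId`. One way to get that, sketched below under stated assumptions, is to derive the id from the scheduled period instead of the wall-clock time of the attempt. `weeklyRunId` is a hypothetical helper, not part of the package, and anchoring the week at Monday in UTC is an arbitrary choice.

```typescript
// Hypothetical helper: maps any instant to the Monday (UTC) of its week, so
// every attempt within that week produces the same runId.
function weeklyRunId(now: Date): string {
  const day = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()))
  const daysSinceMonday = (day.getUTCDay() + 6) % 7 // Mon=0 ... Sun=6
  day.setUTCDate(day.getUTCDate() - daysSinceMonday)
  return `weekly-${day.toISOString().slice(0, 10)}`
}

// A first attempt on Wednesday and a retry on Sunday share one id.
console.log(weeklyRunId(new Date('2024-05-15T03:00:00Z'))) // weekly-2024-05-13
console.log(weeklyRunId(new Date('2024-05-19T23:59:00Z'))) // weekly-2024-05-13
```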