feat(0.25.0): ProductionLoop primitive — close the eval→prod→eval cycle by tangletools · Pull Request #49 · tangle-network/agent-eval

tangletools · 2026-05-14T17:27:59Z

Summary

Today, @tangle-network/agent-eval is great for CI/offline evals. It does not yet ship the compound machine — the production cron that ingests live customer traces, clusters failures, runs evolve, opens a PR with the improved prompt, and ships it. That cycle is the durable moat for the 5 product agents we run; without it, we're shipping static prompts that decay as the world changes (today's FTC-non-compete-rule flip is exactly this kind of decay).

The eval substrate is already there. 0.25.0 adds the one clean orchestration layer that wires it up.

What's new

`runProductionLoop({ ... })` — `src/production-loop.ts`

One call = one cycle. Steps:

1. List runs from `traceStore` and trajectories from `feedbackStore` (or HTTP-ingested via the new endpoints).
2. `failureClusterView` groups failed runs; pick the worst cluster above `minClusterSize` / `minSeverityRatio`.
3. Run `runMultiShotOptimization` against the consumer's `holdoutScenarios`, seeded with `baselinePrompt`.
4. `HeldOutGate` gives the paired-Δ verdict; `evaluateReleaseConfidence` cross-checks pass rate, mean score, and overfit gap. Fail-closed on either axis.
5. When `ship` is wired and the gate passes, open a PR with the new prompt via `AutoPrClient`.
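
The decision logic in the steps above (pick a cluster, then gate fail-closed) can be sketched roughly as follows. This is an illustration only, not the code in `src/production-loop.ts`: the types `FailureCluster`, `GateVerdicts`, and `CycleOutcome` and the functions `pickWorstCluster` and `decideCycle` are hypothetical names, and reading `minSeverityRatio` as "share of all failed runs that fall in this cluster" is an assumption.

```typescript
// Hypothetical shapes for illustration; not the package's real types.
interface FailureCluster {
  label: string;
  failedRuns: number; // number of failed runs grouped into this cluster
}

interface GateVerdicts {
  heldOutPassed: boolean; // stand-in for the HeldOutGate paired-delta verdict
  confidencePassed: boolean; // stand-in for the evaluateReleaseConfidence result
}

type CycleOutcome =
  | { action: "skip"; reason: string }
  | { action: "hold"; reason: string }
  | { action: "propose-pr"; cluster: string };

// Assumption: severity ratio = this cluster's failures / all failures in the window.
function pickWorstCluster(
  clusters: FailureCluster[],
  minClusterSize: number,
  minSeverityRatio: number,
): FailureCluster | undefined {
  const totalFailed = clusters.reduce((sum, c) => sum + c.failedRuns, 0);
  if (totalFailed === 0) return undefined;
  return clusters
    .filter(
      (c) =>
        c.failedRuns >= minClusterSize &&
        c.failedRuns / totalFailed >= minSeverityRatio,
    )
    .sort((a, b) => b.failedRuns - a.failedRuns)[0];
}

// Fail-closed: a PR is proposed only when BOTH verdicts pass and ship is wired.
function decideCycle(
  cluster: FailureCluster | undefined,
  gate: GateVerdicts,
  shipWired: boolean,
): CycleOutcome {
  if (!cluster) return { action: "skip", reason: "no cluster above thresholds" };
  if (!(gate.heldOutPassed && gate.confidencePassed)) {
    return { action: "hold", reason: "gate failed" };
  }
  if (!shipWired) return { action: "hold", reason: "ship not wired" };
  return { action: "propose-pr", cluster: cluster.label };
}
```

Keeping the decision a pure function of its inputs is one way to get the replayability the description asks for: feeding the same clusters and verdicts back in yields the same outcome.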

Idempotent + replayable: same runId → same plan. Cron / GitHub Actions are the consumer's job — this primitive runs one cycle.

`proposeAutomatedPullRequest` — `src/auto-pr.ts`

Two transports, both AutoPrClient impls (so tests substitute a fake):

- `httpGithubClient({ token })`: direct REST against api.github.com. No new deps. Idempotent on branch name: existing open PRs are returned, not duplicated. Fast-forwards refs that already exist at a different SHA.
- `ghCliClient()`: shells out to `gh` + `git` for developer-machine workflows.
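
Idempotency on branch name can be approximated with the GitHub REST "list pull requests" endpoint filtered by `head`, creating a PR only when nothing is open for that branch. The sketch below is a guess at the shape, not the body of `httpGithubClient`: `HttpFn`, `HttpResponseLike`, `PrRequest`, `PrRef`, and `openOrReusePr` are hypothetical names, and the HTTP function is injected so a fake can stand in for the network.

```typescript
// Minimal HTTP shape so tests can inject a fake (hypothetical, for illustration).
interface HttpResponseLike {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type HttpFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body?: string },
) => Promise<HttpResponseLike>;

interface PrRequest {
  owner: string;
  repo: string;
  branch: string;
  base: string;
  title: string;
  body: string;
}
interface PrRef {
  number: number;
  html_url: string;
}

async function openOrReusePr(
  http: HttpFn,
  token: string,
  req: PrRequest,
): Promise<{ pr: PrRef; reused: boolean }> {
  const api = `https://api.github.com/repos/${req.owner}/${req.repo}/pulls`;
  const headers: Record<string, string> = {
    Authorization: `Bearer ${token}`,
    Accept: "application/vnd.github+json",
    "Content-Type": "application/json",
  };

  // 1. Look for an open PR whose head is owner:branch.
  const head = encodeURIComponent(`${req.owner}:${req.branch}`);
  const listRes = await http(`${api}?state=open&head=${head}`, { method: "GET", headers });
  if (!listRes.ok) throw new Error(`list pulls failed: ${listRes.status}`);
  const open = (await listRes.json()) as PrRef[];
  const existing = open[0];
  if (existing) return { pr: existing, reused: true };

  // 2. Nothing open for this branch: create the PR.
  const createRes = await http(api, {
    method: "POST",
    headers,
    body: JSON.stringify({ title: req.title, head: req.branch, base: req.base, body: req.body }),
  });
  if (!createRes.ok) throw new Error(`create pull failed: ${createRes.status}`);
  return { pr: (await createRes.json()) as PrRef, reused: false };
}
```

Calling this twice with the same branch should create one PR and return the existing one on the second call, which is the behavior the transport description promises.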

Validates inputs: rejects .. paths, whitespace branch names, duplicate file changes, branch == base.
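
The four rejection rules listed above could look something like the following. This is a minimal sketch under assumed shapes: `PrPlan`, `FileChange`, and `validatePrPlan` are hypothetical names, and the real validation in `src/auto-pr.ts` may throw rather than collect messages.

```typescript
// Hypothetical input shapes for illustration.
interface FileChange {
  path: string;
  content: string;
}
interface PrPlan {
  branch: string;
  base: string;
  files: FileChange[];
}

// Returns a list of problems; an empty list means the plan passed these checks.
function validatePrPlan(plan: PrPlan): string[] {
  const errors: string[] = [];

  if (plan.branch.length === 0 || /\s/.test(plan.branch)) {
    errors.push("branch name must be non-empty and contain no whitespace");
  }
  if (plan.branch === plan.base) {
    errors.push("branch must differ from base");
  }

  const seen = new Set<string>();
  for (const file of plan.files) {
    // Reject any ".." path segment so a change cannot escape the repo root.
    if (file.path.split("/").includes("..")) {
      errors.push(`path contains "..": ${file.path}`);
    }
    if (seen.has(file.path)) {
      errors.push(`duplicate file change: ${file.path}`);
    }
    seen.add(file.path);
  }
  return errors;
}
```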

Wire ingestion — `src/wire/`

- `POST /v1/traces/ingest`: accepts JSON (`{events:[...]}`) or `application/x-ndjson` for streaming production runtimes. Errors are reported per event (one bad event doesn't poison the batch).
- `POST /v1/feedback`: idempotent on `trajectory.id`.
- Both return 503 when no store is wired (fail loud, not silent).
- Optional bearer auth (`createApp({ auth: { bearer: '...' } })`); supports rotating tokens via a verifier function. `/healthz` and `/v1/version` are always exempt, so monitoring is never locked out of the runtime.
- Schemas and responses added to `openapi.json`.
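
Per-event error reporting for an NDJSON body might work along these lines: parse each line independently, keep the good events, and record the line number of each bad one. This is a sketch, not the handler in `src/wire/`; `IngestResult` and `parseNdjson` are hypothetical names, and the real endpoint presumably validates events against the `TraceEvent` schema rather than only checking for a JSON object.

```typescript
// Hypothetical result shape for illustration.
interface IngestResult {
  accepted: unknown[];
  errors: { line: number; message: string }[];
}

// Parse an NDJSON body line by line; a bad line is recorded and skipped
// so the rest of the batch is still accepted.
function parseNdjson(body: string): IngestResult {
  const result: IngestResult = { accepted: [], errors: [] };
  body.split("\n").forEach((raw, index) => {
    const text = raw.trim();
    if (text === "") return; // tolerate blank lines and a trailing newline
    try {
      const event: unknown = JSON.parse(text);
      if (typeof event !== "object" || event === null || Array.isArray(event)) {
        result.errors.push({ line: index + 1, message: "event must be a JSON object" });
        return;
      }
      result.accepted.push(event);
    } catch (err) {
      result.errors.push({
        line: index + 1,
        message: err instanceof Error ? err.message : String(err),
      });
    }
  });
  return result;
}
```

Line numbers in `errors` are 1-based and count blank lines, so a client can map each error back to the exact line it sent.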

`examples/production-loop/`

Runnable synthetic demo: 8 prod failures all in the same instruction_following cluster → cycle fires → addendum-style mutator wins on holdout → fake AutoPrClient captures the PR plan. tests/production-loop-example.test.ts asserts the example shape so CI catches rot.

Diff

| Surface | Before | After |
| --- | --- | --- |
| Build-and-evolve | `runMultiShotOptimization` | unchanged |
| Cluster failures | `failureClusterView` | unchanged |
| Release gate | `evaluateReleaseConfidence` + `HeldOutGate` | unchanged |
| Production trace ingestion | bring-your-own | `POST /v1/traces/ingest` (NDJSON) |
| Production feedback ingestion | bring-your-own | `POST /v1/feedback` |
| Open a PR with the new prompt | bring-your-own | `proposeAutomatedPullRequest` |
| Wire the above end-to-end | bring-your-own | `runProductionLoop` |
| Cron / scheduler | consumer | still consumer (workflow_dispatch + cron) |

The substrate was there. The 5 product agents now get the cycle for free.

Stability

Every new export is @experimental. All 0.24.0 @stable markers preserved.

Tests

33 new tests (1052 total / 1019 baseline; +3.2%).

- `tests/production-loop.test.ts`: cluster → evolve → propose-PR determinism, fail-closed gate behavior, validation.
- `tests/auto-pr.test.ts`: REST sequence (blob → tree → commit → ref → pulls), idempotency on existing open PRs, dry-run skips network.
- `tests/wire-ingestion.test.ts`: JSON + NDJSON ingest, malformed payload → 400 ValidationError, missing store → 503, bearer auth.
- `tests/production-loop-example.test.ts`: regression for the worked example.

Gates

- `pnpm typecheck`: clean (strict + `noUncheckedIndexedAccess`)
- `pnpm exec biome check src`: zero new warnings (14 pre-existing, untouched)
- `pnpm test`: 1052 passed (118 files)
- `pnpm build`: tsup + `dist/openapi.json` regenerated with the new endpoints + components

Test plan

- `pnpm install && pnpm typecheck && pnpm lint && pnpm test && pnpm build` runs clean
- `dist/openapi.json` paths include `/v1/feedback` and `/v1/traces/ingest`; components include `FeedbackTrajectory`, `TraceEvent`, `TracesIngestRequest`, `TracesIngestResponse`, `FeedbackIngestResponse`
- `tests/production-loop-example.test.ts` passes against the worked example fixture
- One of the 5 product agents wires `runProductionLoop` against a `FileSystemTraceStore` (smoke test, not blocking on this PR)

The pieces to close the loop were already in the package (`runMultiShotOptimization`, `failureClusterView`, `evaluateReleaseConfidence`, `extractPreferences`, `FeedbackTrajectoryStore`, `TraceStore`). This release adds the one clean primitive that wires them end-to-end.

- `runProductionLoop({...})`: one call = one cycle. Ingest prod traces + feedback, cluster failures, evolve against the worst cluster, gate fail-closed, open a PR with the new prompt. Idempotent + replayable; cron is the consumer's job.
- `proposeAutomatedPullRequest` + `ghCliClient` / `httpGithubClient` (no new deps; both transports tested with fakes).
- `POST /v1/feedback` + `POST /v1/traces/ingest` (NDJSON-capable) wire endpoints. Optional bearer auth (healthz / version stay exempt).
- `examples/production-loop`: runnable synthetic end-to-end demo.
- 33 new tests (1052 total, up from 1019). Typecheck + biome + build clean. `openapi.json` regenerated.

tangletools · 2026-05-14T17:33:21Z

🔍 Reviewing `5166cd7d`

| Pass | Status | ETA |
| --- | --- | --- |
| Kimi Code K2.6 | Running | ~5-15 min |
| opencode DeepSeek v4 Pro | Running | ~5-15 min |

Agent review running. Reads the actual code. This comment updates in place.

tangletools · #49 · model: kimi-for-coding · started 2026-05-14T17:33:19Z

drewstone merged commit 7d5adc9 into main May 14, 2026
1 check passed

drewstone merged 1 commit into main from feat/0.25.0-production-loop
