feat(0.25.0): ProductionLoop primitive — close the eval→prod→eval cycle#49
Merged
The pieces to close the loop were already in the package
(runMultiShotOptimization, failureClusterView, evaluateReleaseConfidence,
extractPreferences, FeedbackTrajectoryStore, TraceStore). This release
adds the one clean primitive that wires them end-to-end.
- runProductionLoop({...}): one call = one cycle. Ingest prod traces +
feedback, cluster failures, evolve against the worst cluster, gate
fail-closed, open a PR with the new prompt. Idempotent + replayable;
cron is the consumer's job.
- proposeAutomatedPullRequest + ghCliClient / httpGithubClient (no new
deps; both transports tested with fakes).
- POST /v1/feedback + POST /v1/traces/ingest (NDJSON-capable) wire
endpoints. Optional bearer auth (healthz / version stay exempt).
- examples/production-loop: runnable synthetic end-to-end demo.
- 33 new tests (1052 total, up from 1019). Typecheck + biome + build
clean. openapi.json regenerated.
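A minimal sketch of how the optional bearer auth on the wire endpoints might behave, assuming a plain path check ahead of token verification. `isAuthorized`, `TokenVerifier`, and `EXEMPT_PATHS` are hypothetical names for illustration, not the package's API; the exempt paths are the health and version routes that stay open.

```typescript
// Hypothetical helper for illustration; the package's real auth hook may differ.
type TokenVerifier = (token: string) => boolean;

// Monitoring routes stay reachable even when auth is enabled.
const EXEMPT_PATHS = new Set(["/healthz", "/v1/version"]);

function isAuthorized(
  path: string,
  authorizationHeader: string | undefined,
  verify: TokenVerifier,
): boolean {
  if (EXEMPT_PATHS.has(path)) return true;
  if (authorizationHeader === undefined) return false;
  // The auth scheme name is case-insensitive; the token itself is not.
  const token = /^Bearer +(.+)$/i.exec(authorizationHeader)?.[1];
  if (token === undefined) return false;
  return verify(token);
}

// A verifier function makes rotation simple: accept any currently valid token.
const acceptedTokens = new Set(["old-token", "new-token"]);
const rotatingVerifier: TokenVerifier = (token) => acceptedTokens.has(token);

console.log(isAuthorized("/healthz", undefined, rotatingVerifier));              // true
console.log(isAuthorized("/v1/feedback", undefined, rotatingVerifier));          // false
console.log(isAuthorized("/v1/feedback", "Bearer new-token", rotatingVerifier)); // true
```

Passing a verifier rather than a single string lets the old and new tokens overlap during a rotation window. A production verifier would normally use a constant-time comparison, which this sketch omits.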
🔍 Reviewing
| Pass | Status | ETA |
|---|---|---|
| Kimi Code K2.6 | Running | ~5-15 min |
| opencode DeepSeek v4 Pro | Running | ~5-15 min |
Agent review running. Reads the actual code. This comment updates in place.
tangletools · #49 · model: kimi-for-coding · started 2026-05-14T17:33:19Z
Summary
Today, `@tangle-network/agent-eval` is great for CI/offline evals. It does not yet ship the compound machine: the production cron that ingests live customer traces, clusters failures, runs evolve, opens a PR with the improved prompt, and ships it. That cycle is the durable moat for the 5 product agents we run; without it, we're shipping static prompts that decay as the world changes (today's FTC-non-compete-rule flip is exactly this kind of decay).

The eval substrate is already there. 0.25.0 adds the one clean orchestration layer that wires it up.
What's new
`runProductionLoop({ ... })` (`src/production-loop.ts`)

One call = one cycle. Steps:

1. Ingest traces from `traceStore` + trajectories from `feedbackStore` (or HTTP-ingested via the new endpoints).
2. `failureClusterView` groups failed runs; pick the worst above `minClusterSize` / `minSeverityRatio`.
3. `runMultiShotOptimization` against the consumer's `holdoutScenarios`, seeded with `baselinePrompt`.
4. `HeldOutGate` paired-Δ verdict; `evaluateReleaseConfidence` cross-checks pass-rate, mean score, overfit gap. Fail-closed on either axis.
5. If `ship` is wired and the gate passes, open a PR with the new prompt via `AutoPrClient`.

Idempotent + replayable: same `runId` → same plan. Cron / GitHub Actions are the consumer's job; this primitive runs one cycle.

`proposeAutomatedPullRequest` (`src/auto-pr.ts`)

Two transports, both `AutoPrClient` impls (so tests substitute a fake):

- `httpGithubClient({ token })`: direct REST against `api.github.com`. No new deps. Idempotent on branch name: existing open PRs are returned, not duplicated. Fast-forwards refs that already exist at a different SHA.
- `ghCliClient()`: shells out to `gh` + `git` for developer-machine workflows.

Validates inputs: rejects `..` paths, whitespace branch names, duplicate file changes, branch == base.

Wire ingestion (`src/wire/`)

- `POST /v1/traces/ingest`: accepts JSON (`{events:[...]}`) or `application/x-ndjson` for streaming production runtimes. Per-event error reporting (one bad event doesn't poison the batch).
- `POST /v1/feedback`: idempotent on `trajectory.id`.
- Optional bearer auth (`createApp({ auth: { bearer: '...' } })`); supports rotating tokens via verifier function. `/healthz` and `/v1/version` always exempt: never lock monitoring out of the runtime.
- Regenerated `openapi.json`.

`examples/production-loop/`

Runnable synthetic demo: 8 prod failures all in the same `instruction_following` cluster → cycle fires → addendum-style mutator wins on holdout → fake `AutoPrClient` captures the PR plan. `tests/production-loop-example.test.ts` asserts the example shape so CI catches rot.

Diff

- `runMultiShotOptimization`
- `failureClusterView`
- `evaluateReleaseConfidence` + `HeldOutGate`
- `POST /v1/traces/ingest` (NDJSON)
- `POST /v1/feedback`
- `proposeAutomatedPullRequest`
- `runProductionLoop`

The substrate was there. The 5 product agents now get the cycle for free.
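The cluster-selection and gating steps of one cycle can be sketched roughly as below. This is an illustration under stated assumptions, not the code in `src/production-loop.ts`: `FailureCluster`, `GateVerdicts`, `pickWorstCluster`, and `gatePasses` are hypothetical names, and "severity ratio" is assumed here to mean a cluster's share of all failed runs, which the PR does not define.

```typescript
// Hypothetical shapes for illustration only; the package's real types may differ.
interface FailureCluster {
  label: string;
  failedRuns: number;
}

interface GateVerdicts {
  heldOutPassed?: boolean;    // paired-delta verdict on the holdout scenarios
  releaseConfident?: boolean; // pass-rate / mean-score / overfit-gap cross-check
}

// Pick the largest cluster that clears both thresholds, or null if none does.
// Assumption: severity ratio = cluster failures / total failures.
function pickWorstCluster(
  clusters: FailureCluster[],
  minClusterSize: number,
  minSeverityRatio: number,
): FailureCluster | null {
  const totalFailures = clusters.reduce((sum, c) => sum + c.failedRuns, 0);
  if (totalFailures === 0) return null;
  let worst: FailureCluster | null = null;
  for (const cluster of clusters) {
    if (cluster.failedRuns < minClusterSize) continue;
    if (cluster.failedRuns / totalFailures < minSeverityRatio) continue;
    if (worst === null || cluster.failedRuns > worst.failedRuns) worst = cluster;
  }
  return worst;
}

// Fail-closed: a missing or false verdict on either axis blocks the ship step.
function gatePasses(verdicts: GateVerdicts): boolean {
  return verdicts.heldOutPassed === true && verdicts.releaseConfident === true;
}

// Illustrative data: the "formatting" cluster is made up for this sketch.
const demoClusters: FailureCluster[] = [
  { label: "instruction_following", failedRuns: 8 },
  { label: "formatting", failedRuns: 1 },
];
console.log(pickWorstCluster(demoClusters, 3, 0.5)?.label); // "instruction_following"
console.log(gatePasses({ heldOutPassed: true }));           // false: one verdict is missing
```

Comparing each verdict against `true` explicitly, rather than checking for the absence of a failure, is what makes the gate fail-closed: an evaluator that crashes or returns nothing cannot open a PR.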
Stability
Every new export is `@experimental`. All 0.24.0 `@stable` markers preserved.

Tests

- `tests/production-loop.test.ts`: cluster → evolve → propose-PR determinism, fail-closed gate behavior, validation.
- `tests/auto-pr.test.ts`: REST sequence (blob → tree → commit → ref → pulls), idempotency on existing open PRs, dry-run skips network.
- `tests/wire-ingestion.test.ts`: JSON + NDJSON ingest, malformed payload → 400 `ValidationError`, missing store → 503, bearer auth.
- `tests/production-loop-example.test.ts`: regression for the worked example.

Gates

- `pnpm typecheck`: clean (strict + `noUncheckedIndexedAccess`)
- `pnpm exec biome check src`: zero new warnings (14 pre-existing, untouched)
- `pnpm test`: 1052 passed (118 files)
- `pnpm build`: tsup + `dist/openapi.json` regenerated with the new endpoints + components

Test plan

- `pnpm install && pnpm typecheck && pnpm lint && pnpm test && pnpm build` clean
- `dist/openapi.json` paths include `/v1/feedback` and `/v1/traces/ingest`; components include `FeedbackTrajectory`, `TraceEvent`, `TracesIngestRequest`, `TracesIngestResponse`, `FeedbackIngestResponse`
- `tests/production-loop-example.test.ts` passes against the worked example fixture
- `runProductionLoop` against a `FileSystemTraceStore` (smoke test, not blocking on this PR)
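The per-event error reporting described for the NDJSON ingest path can be sketched as follows. This is a minimal illustration, not the handler in `src/wire/`: `parseNdjsonEvents`, `IngestError`, and `IngestResult` are hypothetical names, and the only validation assumed here is that each line parses to a JSON object.

```typescript
// Hypothetical result shape for illustration; the real response types may differ.
interface IngestError {
  line: number; // 1-based line number within the request body
  message: string;
}

interface IngestResult {
  accepted: unknown[];
  errors: IngestError[];
}

// Parse an NDJSON body one line at a time so a bad line is reported
// without discarding the valid events around it.
function parseNdjsonEvents(body: string): IngestResult {
  const accepted: unknown[] = [];
  const errors: IngestError[] = [];
  body.split(/\r?\n/).forEach((raw, index) => {
    const text = raw.trim();
    if (text === "") return; // tolerate blank lines and a trailing newline
    try {
      const value: unknown = JSON.parse(text);
      if (typeof value !== "object" || value === null || Array.isArray(value)) {
        errors.push({ line: index + 1, message: "event must be a JSON object" });
        return;
      }
      accepted.push(value);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push({ line: index + 1, message });
    }
  });
  return { accepted, errors };
}

const demoBody = '{"id":"a"}\nnot json\n{"id":"b"}\n';
const demoResult = parseNdjsonEvents(demoBody);
console.log(demoResult.accepted.length);                // 2
console.log(demoResult.errors.map((e) => e.line));      // [ 2 ]
```

Reporting the line number with each error lets a streaming producer retry only the rejected events instead of resending the whole batch.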