feat(0.25.0): ProductionLoop primitive — close the eval→prod→eval cycle#49

Merged
drewstone merged 1 commit into main from feat/0.25.0-production-loop
May 14, 2026

@tangletools
Contributor

Summary

Today, @tangle-network/agent-eval is great for CI/offline evals. It does not yet ship the compound machine — the production cron that ingests live customer traces, clusters failures, runs evolve, opens a PR with the improved prompt, and ships it. That cycle is the durable moat for the 5 product agents we run; without it, we're shipping static prompts that decay as the world changes (today's FTC-non-compete-rule flip is exactly this kind of decay).

The eval substrate is already there. 0.25.0 adds the one clean orchestration layer that wires it up.

What's new

runProductionLoop({ ... }) in src/production-loop.ts

One call = one cycle. Steps:

  1. List runs from traceStore + trajectories from feedbackStore (or HTTP-ingested via the new endpoints).
  2. failureClusterView groups failed runs; pick the worst above minClusterSize / minSeverityRatio.
  3. runMultiShotOptimization against the consumer's holdoutScenarios, seeded with baselinePrompt.
  4. HeldOutGate paired-Δ verdict; evaluateReleaseConfidence cross-checks pass-rate, mean score, overfit gap. Fail-closed on either axis.
  5. When ship is wired and the gate passes, open a PR with the new prompt via AutoPrClient.

Idempotent + replayable: same runId → same plan. Cron / GitHub Actions are the consumer's job — this primitive runs one cycle.
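
Steps 2 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the code in src/production-loop.ts: the FailureCluster shape, the pickWorstCluster and shouldShip helpers, and the reading of minSeverityRatio as "cluster failures divided by total runs" are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; the package's real types may differ.
interface FailureCluster {
  label: string;
  failedRuns: number;
}

interface SelectionThresholds {
  minClusterSize: number;
  // Assumed meaning: failures in the cluster / total runs in the window.
  minSeverityRatio: number;
}

// Pick the largest cluster that clears both thresholds.
// Ties break on label so that replaying the same input yields the same pick.
function pickWorstCluster(
  clusters: FailureCluster[],
  totalRuns: number,
  thresholds: SelectionThresholds,
): FailureCluster | undefined {
  if (totalRuns <= 0) return undefined;
  const eligible = clusters.filter(
    (c) =>
      c.failedRuns >= thresholds.minClusterSize &&
      c.failedRuns / totalRuns >= thresholds.minSeverityRatio,
  );
  eligible.sort(
    (a, b) =>
      b.failedRuns - a.failedRuns ||
      (a.label < b.label ? -1 : a.label > b.label ? 1 : 0),
  );
  return eligible[0];
}

// Fail-closed: ship only when both verdicts are explicitly true.
// A missing (undefined) verdict counts as a failure.
function shouldShip(
  heldOutPassed: boolean | undefined,
  releaseConfident: boolean | undefined,
): boolean {
  return heldOutPassed === true && releaseConfident === true;
}
```

The deterministic tie-break is one way to keep "same runId → same plan" true when two clusters are the same size.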

proposeAutomatedPullRequest in src/auto-pr.ts

Two transports, both AutoPrClient impls (so tests substitute a fake):

  • httpGithubClient({ token }) — direct REST against api.github.com. No new deps. Idempotent on branch name: existing open PRs are returned, not duplicated. Fast-forwards refs that already exist at a different SHA.
  • ghCliClient() — shells out to gh + git for developer-machine workflows.
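
The "idempotent on branch name" behavior can be illustrated with an in-memory stand-in of the kind a test might substitute. This is a sketch only: InMemoryPrClient, PrPlan, PrResult, and the synchronous propose method are hypothetical names for this example and are not the AutoPrClient interface from src/auto-pr.ts, which is presumably asynchronous.

```typescript
// Hypothetical types for illustration; not the package's AutoPrClient API.
interface PrPlan {
  branch: string;
  base: string;
  title: string;
}

interface PrResult {
  url: string;
  created: boolean; // false when an existing open PR was returned
}

// In-memory fake: remembers open PRs by branch name and never duplicates them.
class InMemoryPrClient {
  private readonly openByBranch = new Map<string, PrResult>();
  private nextNumber = 1;

  propose(plan: PrPlan): PrResult {
    const existing = this.openByBranch.get(plan.branch);
    if (existing !== undefined) {
      // Same branch already has an open PR: return it instead of opening another.
      return { url: existing.url, created: false };
    }
    const result: PrResult = {
      // example.invalid is a reserved placeholder domain, not a real endpoint.
      url: `https://example.invalid/pr/${this.nextNumber}`,
      created: true,
    };
    this.nextNumber += 1;
    this.openByBranch.set(plan.branch, result);
    return result;
  }
}
```

Keying on the branch name means a cron run that retries after a partial failure reuses the PR from the earlier attempt.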

Validates inputs: rejects .. paths, whitespace branch names, duplicate file changes, branch == base.
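
A minimal sketch of those four checks, assuming a hypothetical PrRequest shape and a validatePrRequest helper that collects error messages. The actual function in src/auto-pr.ts may throw instead, and its field names may differ.

```typescript
// Hypothetical request shape for illustration only.
interface FileChange {
  path: string;
  content: string;
}

interface PrRequest {
  branch: string;
  base: string;
  files: FileChange[];
}

// Returns a list of problems; an empty list means the request is acceptable.
function validatePrRequest(req: PrRequest): string[] {
  const errors: string[] = [];

  // Branch names must be non-empty and free of whitespace.
  if (req.branch.length === 0 || /\s/.test(req.branch)) {
    errors.push("branch must be non-empty and contain no whitespace");
  }

  // A PR from a branch into itself is meaningless.
  if (req.branch === req.base) {
    errors.push("branch must differ from base");
  }

  const seen = new Set<string>();
  for (const file of req.files) {
    // Reject any ".." path segment so a change cannot escape the repo root.
    if (file.path.split("/").includes("..")) {
      errors.push(`path must not contain "..": ${file.path}`);
    }
    // Two changes to the same path would make the resulting tree ambiguous.
    if (seen.has(file.path)) {
      errors.push(`duplicate file change: ${file.path}`);
    }
    seen.add(file.path);
  }

  return errors;
}
```

Collecting every problem rather than stopping at the first is a choice made for this sketch; it lets a caller report all input errors in one pass.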

Wire ingestion — src/wire/

  • POST /v1/traces/ingest — accepts JSON ({events:[...]}) or application/x-ndjson for streaming production runtimes. Per-event error reporting (one bad event doesn't poison the batch).
  • POST /v1/feedback — idempotent on trajectory.id.
  • Both 503 when no store is wired (fail loud, not silent).
  • Optional bearer auth (createApp({ auth: { bearer: '...' } })); supports rotating tokens via verifier function. /healthz and /v1/version always exempt — never lock monitoring out of the runtime.
  • Schemas + responses added to openapi.json.
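
The per-event error reporting for NDJSON bodies can be sketched as below. This is an illustration under assumptions, not the handler in src/wire/: parseNdjson and NdjsonParseResult are hypothetical names, and the only validation shown is "each line must be a JSON object", whereas the real endpoint presumably validates against the TraceEvent schema.

```typescript
// Hypothetical result shape for illustration only.
interface NdjsonParseResult {
  accepted: Record<string, unknown>[];
  errors: { line: number; message: string }[];
}

// Parse an application/x-ndjson body line by line.
// A bad line is recorded with its 1-based line number and the batch continues.
function parseNdjson(body: string): NdjsonParseResult {
  const accepted: Record<string, unknown>[] = [];
  const errors: { line: number; message: string }[] = [];

  body.split("\n").forEach((raw, index) => {
    const line = raw.trim();
    if (line === "") return; // tolerate blank lines and a trailing newline

    try {
      const event: unknown = JSON.parse(line);
      if (typeof event !== "object" || event === null || Array.isArray(event)) {
        errors.push({ line: index + 1, message: "event must be a JSON object" });
        return;
      }
      accepted.push(event as Record<string, unknown>);
    } catch (err) {
      errors.push({
        line: index + 1,
        message: err instanceof Error ? err.message : String(err),
      });
    }
  });

  return { accepted, errors };
}
```

Because each line is parsed inside its own try/catch, one malformed event yields one error entry and the remaining events are still accepted.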

examples/production-loop/

Runnable synthetic demo: 8 prod failures all in the same instruction_following cluster → cycle fires → addendum-style mutator wins on holdout → fake AutoPrClient captures the PR plan. tests/production-loop-example.test.ts asserts the example shape so CI catches rot.

Diff

| Surface | Before | After |
| --- | --- | --- |
| Build-and-evolve | runMultiShotOptimization | unchanged |
| Cluster failures | failureClusterView | unchanged |
| Release gate | evaluateReleaseConfidence + HeldOutGate | unchanged |
| Production trace ingestion | bring-your-own | POST /v1/traces/ingest (NDJSON) |
| Production feedback ingestion | bring-your-own | POST /v1/feedback |
| Open a PR with the new prompt | bring-your-own | proposeAutomatedPullRequest |
| Wire the above end-to-end | bring-your-own | runProductionLoop |
| Cron / scheduler | consumer | still consumer (workflow_dispatch + cron) |

The substrate was there. The 5 product agents now get the cycle for free.

Stability

Every new export is @experimental. All 0.24.0 @stable markers preserved.

Tests

  • 33 new tests (1052 total / 1019 baseline; +3.2%).
  • tests/production-loop.test.ts: cluster → evolve → propose-PR determinism, fail-closed gate behavior, validation.
  • tests/auto-pr.test.ts: REST sequence (blob → tree → commit → ref → pulls), idempotency on existing open PRs, dry-run skips network.
  • tests/wire-ingestion.test.ts: JSON + NDJSON ingest, malformed payload → 400 ValidationError, missing store → 503, bearer auth.
  • tests/production-loop-example.test.ts: regression for the worked example.

Gates

  • pnpm typecheck — clean (strict + noUncheckedIndexedAccess)
  • pnpm exec biome check src — zero new warnings (14 pre-existing, untouched)
  • pnpm test — 1052 passed (118 files)
  • pnpm build — tsup + dist/openapi.json regenerated with the new endpoints + components

Test plan

  • pnpm install && pnpm typecheck && pnpm lint && pnpm test && pnpm build clean
  • dist/openapi.json paths include /v1/feedback and /v1/traces/ingest; components include FeedbackTrajectory, TraceEvent, TracesIngestRequest, TracesIngestResponse, FeedbackIngestResponse
  • tests/production-loop-example.test.ts passes against the worked example fixture
  • One of the 5 product agents wires runProductionLoop against a FileSystemTraceStore (smoke test, not blocking on this PR)

The pieces to close the loop were already in the package
(runMultiShotOptimization, failureClusterView, evaluateReleaseConfidence,
extractPreferences, FeedbackTrajectoryStore, TraceStore). This release
adds the one clean primitive that wires them end-to-end.

- runProductionLoop({...}): one call = one cycle. Ingest prod traces +
  feedback, cluster failures, evolve against the worst cluster, gate
  fail-closed, open a PR with the new prompt. Idempotent + replayable;
  cron is the consumer's job.
- proposeAutomatedPullRequest + ghCliClient / httpGithubClient (no new
  deps; both transports tested with fakes).
- POST /v1/feedback + POST /v1/traces/ingest (NDJSON-capable) wire
  endpoints. Optional bearer auth (healthz / version stay exempt).
- examples/production-loop: runnable synthetic end-to-end demo.
- 33 new tests (1052 total, up from 1019). Typecheck + biome + build
  clean. openapi.json regenerated.
@tangletools
Contributor Author

🔍 Reviewing 5166cd7d

| Pass | Status | ETA |
| --- | --- | --- |
| Kimi Code K2.6 | Running | ~5-15 min |
| opencode DeepSeek v4 Pro | Running | ~5-15 min |

Agent review running. Reads the actual code. This comment updates in place.

tangletools · #49 · model: kimi-for-coding · started 2026-05-14T17:33:19Z

@drewstone merged commit 7d5adc9 into main on May 14, 2026
1 check passed