
feat(0.7.0): clean substrate — error taxonomy, canonical RuntimeRun, cost ledger, biome, strict indices #10

Merged
tangletools merged 1 commit into main from feat/0.7.0-error-taxonomy-canonical-runtime-run on May 14, 2026

@tangletools

Why

Two parallel error-tracking paths surfaced in legal-agent's chat handler today:
completeProductionAgentRun from legal's bespoke eval-evidence.ts AND
persistRuntimeRun from api.chat.ts. Both write to the same agentRuns
table, both wrap the same try/catch, neither owns the cost ledger. That's the
rot we're killing. The runtime must own production-run lifecycle so the five
consumer agents (legal, tax, gtm, creative, agent-builder) stop reinventing
it per repo.

This PR is the clean substrate that makes the consumer-side deletions possible.

What changed

Canonical RuntimeRunHandle (NEW)

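```ts
import { runAgentTaskStream, startRuntimeRun } from '@tangle-network/agent-runtime'

// workspaceId, sessionId, agentId, taskSpec, scenarioId, db, agentRuns, task, backend,
// input and telemetry all come from the consumer's route.
const run = startRuntimeRun({
  workspaceId, sessionId, agentId, taskSpec, scenarioId,
  // Pluggable persistence: any store that can upsert a row.
  adapter: { upsert: (row) => db.insert(agentRuns).values(row) },
})

for await (const event of runAgentTaskStream({ task, backend, input })) {
  run.observe(event) // llm_call events update the cost ledger; other events leave it untouched
  if (event.type === 'final') run.complete({ status: ..., resultSummary: ... })
}
await run.persist({ runtimeEvents: telemetry.events })
console.log(run.cost()) // { tokensIn, tokensOut, costUsd, wallMs, llmCalls }
```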

RuntimeRunStateError enforces the state machine (running -> completed | failed | cancelled).
complete() is idempotent for the same status, so retry and cleanup paths don't
double-fire. Persistence is pluggable via a single upsert(row) method. The same
shape works for D1, Postgres and KV.
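The lifecycle rules above can be sketched as a tiny state holder. This is a minimal illustration under assumptions, not the package's code. `RunLifecycle`, `RuntimeRunStateErrorSketch` and the `completions` counter are hypothetical names. The sketch shows the two stated guarantees:

- An out-of-order call throws.
- Repeating `complete()` with the same terminal status is a no-op.

It also assumes that a *different* terminal status after completion is rejected.

```typescript
type RunStatus = 'running' | 'completed' | 'failed' | 'cancelled'
type TerminalStatus = Exclude<RunStatus, 'running'>

// Hypothetical stand-in for the package's RuntimeRunStateError.
class RuntimeRunStateErrorSketch extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'RuntimeRunStateError'
  }
}

// Hypothetical lifecycle holder enforcing running -> completed | failed | cancelled.
class RunLifecycle {
  private status: RunStatus = 'running'
  completions = 0 // counts effective transitions; should never exceed 1

  get current(): RunStatus {
    return this.status
  }

  complete(next: TerminalStatus): void {
    // Same terminal status again (retry/cleanup path): no-op, so nothing fires twice.
    if (this.status === next) return
    // Assumption: any other call after a terminal state is out of order and rejected.
    if (this.status !== 'running') {
      throw new RuntimeRunStateErrorSketch(`cannot move from ${this.status} to ${next}`)
    }
    this.status = next
    this.completions += 1
  }
}

const lifecycle = new RunLifecycle()
lifecycle.complete('completed')
lifecycle.complete('completed') // retry path: ignored
let rejected = false
try {
  lifecycle.complete('failed') // conflicting terminal status
} catch (err) {
  rejected = err instanceof RuntimeRunStateErrorSketch
}
```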

Error taxonomy (replaces bespoke throw new Error)

Re-exported from @tangle-network/agent-eval 0.24.0:
ValidationError, ConfigError, NotFoundError, CaptureIntegrityError,
JudgeError, ReplayError, VerificationError. All extend AgentEvalError.

Runtime-specific subclasses (also extending AgentEvalError):

  • SessionMismatchError: resume requested with a backend of a different kind than the one that owns the session
  • BackendTransportError: a backend HTTP / IPC call returned non-success (carries status)
  • RuntimeRunStateError: RuntimeRunHandle lifecycle methods called out of order

Two user-facing throws in src were migrated (backends.ts, run.ts). Internal
invariant failures would stay plain Error; this package currently has none.
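The sketch below shows why typed errors are friendlier to callers than message matching.

- `AgentEvalErrorSketch` stands in for agent-eval's `AgentEvalError`.
- `BackendTransportErrorSketch` follows the `(kind, message, { status })` call shape quoted in this PR.
- The `kind` and `status` property names are assumptions, not the published API.

```typescript
// Hypothetical stand-in for agent-eval's AgentEvalError base class.
class AgentEvalErrorSketch extends Error {
  constructor(message: string) {
    super(message)
    this.name = new.target.name // subclass name, e.g. 'BackendTransportErrorSketch'
  }
}

// Sketch of a runtime-only subclass that carries the transport status.
// Property names kind/status are assumptions.
class BackendTransportErrorSketch extends AgentEvalErrorSketch {
  readonly kind: string
  readonly status?: number

  constructor(kind: string, message: string, options: { status?: number } = {}) {
    super(message)
    this.kind = kind
    this.status = options.status
  }
}

// Callers branch on type (most specific first) instead of parsing message strings.
function describeFailure(err: unknown): string {
  if (err instanceof BackendTransportErrorSketch) {
    return `transport failure from ${err.kind} backend (status ${err.status ?? 'unknown'})`
  }
  if (err instanceof AgentEvalErrorSketch) return `agent-eval error: ${err.message}`
  return 'unexpected error'
}

const described = describeFailure(
  new BackendTransportErrorSketch('chat', 'chat backend returned 502', { status: 502 }),
)
```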

RuntimeStreamEvent -> agent-eval TraceEvent bridge (NEW)

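```ts
import { createTraceBridge, runAgentTaskStream } from '@tangle-network/agent-runtime'

const bridge = createTraceBridge({ runId, spanId }) // spanId is optional
for await (const event of runAgentTaskStream(...)) {
  // Falsy for dropped events (text_delta, reasoning_delta), so only mapped events are stored.
  const trace = bridge.toTraceEvent(event)
  if (trace) await traceStore.appendEvent(trace)
}
```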

Drop-in replacement for the hand-rolled adapter consumers currently write. Mapping:

  • readiness_end (blocked) -> policy_violation
  • backend_error, and task_end / final with a failed status -> error
  • text_delta / reasoning_delta -> dropped (they belong inside an LlmSpan.output, not as separate trace events)
  • everything else -> log
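The mapping rules above can be sketched as a pure function plus a `drain` helper. The commit message names `toTraceEvent` and `drain` as the bridge's exports. Everything else here is hypothetical:

- The event and trace shapes are simplified stand-ins for `RuntimeStreamEvent` and agent-eval's `TraceEvent`.
- Modelling "blocked" as a boolean field is an assumption.

```typescript
// Simplified, hypothetical shapes; the real RuntimeStreamEvent and TraceEvent differ.
type StreamEventSketch =
  | { type: 'readiness_end'; blocked: boolean }
  | { type: 'backend_error'; message: string }
  | { type: 'task_end' | 'final'; status: 'completed' | 'failed' }
  | { type: 'text_delta' | 'reasoning_delta'; text: string }
  | { type: 'llm_call'; model: string }

type TraceKind = 'policy_violation' | 'error' | 'log'

interface TraceEventSketch {
  runId: string
  kind: TraceKind
  sourceType: StreamEventSketch['type']
}

// Returns undefined for events that should not become standalone trace events.
function toTraceKind(event: StreamEventSketch): TraceKind | undefined {
  switch (event.type) {
    case 'text_delta':
    case 'reasoning_delta':
      return undefined // belongs inside an LLM span's output
    case 'readiness_end':
      return event.blocked ? 'policy_violation' : 'log'
    case 'backend_error':
      return 'error'
    case 'task_end':
    case 'final':
      return event.status === 'failed' ? 'error' : 'log'
    default:
      return 'log' // everything else
  }
}

function createTraceBridgeSketch(runId: string) {
  const toTraceEvent = (event: StreamEventSketch): TraceEventSketch | undefined => {
    const kind = toTraceKind(event)
    return kind ? { runId, kind, sourceType: event.type } : undefined
  }
  // drain maps a batch and skips events with no trace equivalent.
  const drain = (events: StreamEventSketch[]): TraceEventSketch[] => {
    const out: TraceEventSketch[] = []
    for (const event of events) {
      const trace = toTraceEvent(event)
      if (trace) out.push(trace)
    }
    return out
  }
  return { toTraceEvent, drain }
}
```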

Cost ledger

RuntimeStreamEvent gains an llm_call variant:
{ model, tokensIn, tokensOut, costUsd, latencyMs, finishReason }.
handle.observe(event) only mutates state on llm_call, so consumers can pipe
the whole stream through without filtering. handle.cost() returns the
running total any time. Required for "cost per customer task" dashboards.
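The following is a minimal sketch of the ledger behaviour described above. It assumes a simplified event union in which only the `llm_call` fields come from this PR. `createCostLedger` and the injectable clock are hypothetical. Measuring `wallMs` from ledger creation is also an assumption about how the real handle computes it.

```typescript
// Only the llm_call fields are taken from this PR; the rest of the union is illustrative.
type LlmCallEvent = {
  type: 'llm_call'
  model: string
  tokensIn: number
  tokensOut: number
  costUsd: number
  latencyMs: number
  finishReason: string
}
type LedgerEvent = LlmCallEvent | { type: 'text_delta'; text: string } | { type: 'final' }

interface CostSnapshot {
  tokensIn: number
  tokensOut: number
  costUsd: number
  wallMs: number
  llmCalls: number
}

// observe() only reacts to llm_call, so an unfiltered stream can be piped straight in.
function createCostLedger(now: () => number = Date.now) {
  const startedAt = now()
  const totals = { tokensIn: 0, tokensOut: 0, costUsd: 0, llmCalls: 0 }
  return {
    observe(event: LedgerEvent): void {
      if (event.type !== 'llm_call') return
      totals.tokensIn += event.tokensIn
      totals.tokensOut += event.tokensOut
      totals.costUsd += event.costUsd
      totals.llmCalls += 1
    },
    // Running total, readable at any time; wallMs counts from ledger creation.
    cost(): CostSnapshot {
      return { ...totals, wallMs: now() - startedAt }
    },
  }
}
```

Injecting the clock keeps `wallMs` deterministic in tests. Returning a fresh object from `cost()` means callers cannot mutate the running totals.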

DX bar = agent-eval 0.24.0

  • biome.json mirrors agent-eval (lineWidth 100, single quotes, no semis, organize imports)
  • .github/workflows/ci.yml — lint + typecheck + test + build on every PR
  • .github/workflows/publish.yml — tag-driven npm publish (idempotent on rerun)
  • tsconfig.json: noUncheckedIndexedAccess: true enabled; zero new typecheck failures (the codebase already used ?.at(-1) and ?.[0] defensively, so the strict flag was a free win)
  • Every public export carries a @stable JSDoc tag
  • README rewritten — what runtime IS in 2 sentences + a feature table, instead of 20 sentences of taxonomy

Module split

src/index.ts shrank from 1,388 lines to a re-export hub. New layout:

| File | Owns |
|---|---|
| src/types.ts | task/session/adapter/stream-event types |
| src/errors.ts | re-export agent-eval typed errors + 3 runtime-only |
| src/readiness.ts | decideKnowledgeReadiness |
| src/sessions.ts | InMemoryRuntimeSessionStore + helpers |
| src/sanitize.ts | event sanitization + telemetry collectors |
| src/sse.ts | server-sent event encoding |
| src/backends.ts | backend factories + stream normalization |
| src/run.ts | runAgentTask + runAgentTaskStream |
| src/runtime-run.ts | startRuntimeRun + cost ledger (NEW) |
| src/trace-bridge.ts | createTraceBridge (NEW) |

Before / after API

| Before | After |
|---|---|
| `throw new Error('Cannot resume ...')` | `throw new SessionMismatchError(...)` |
| `throw new Error('chat backend returned <status>')` | `throw new BackendTransportError(kind, message, { status })` |
| Consumers wrote `completeProductionAgentRun(handle, outcome)` per repo | `handle.complete({ status, resultSummary, error? })` |
| Consumers wrote `persistRuntimeRun({ db, ... })` per repo | adapter passed once to `startRuntimeRun`, then `handle.persist(metadata?)` |
| Consumers mapped RuntimeStreamEvent -> TraceEvent per repo | `createTraceBridge({ runId }).toTraceEvent(event)` |
| Cost accounting was ad hoc, per event | `handle.cost()` returns `{ tokensIn, tokensOut, costUsd, wallMs, llmCalls }` |

Why completeProductionAgentRun is now @deprecated

The consumer-side pattern at legal-agent/src/lib/.server/eval-evidence.ts:59
duplicated three concerns:

  1. State machine — open-coded running -> completed/failed with no
    idempotency guarantee. Retry paths double-fire logAudit.
  2. Persistence — interleaves with persistRuntimeRun in
    api.chat.ts:515. Two writers, one table, no transaction.
  3. Audit — fires agent_eval.trace.* actions on logAudit directly,
    coupling the run lifecycle to the audit shape.

startRuntimeRun collapses (1) and (2) behind a single state-machine handle
with a pluggable upsert(row) adapter. Consumers route audit from
handle.complete() -> their existing logAudit in the adapter or in a thin
wrapper. The @deprecated marker isn't on this package — it's the pattern
itself that's deprecated; the in-repo callers ditch it on their own version
bumps.
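One way to route audit through a thin wrapper is sketched below. All of the following are hypothetical:

- `withAudit`
- the `LogAudit` signature
- the simplified row shape
- the `agent_run.<status>` action names

The wrapper writes the row first and audits only terminal rows. Its `Set` guard is an extra safeguard of this sketch, so a retried `persist()` does not double-fire the audit.

```typescript
// Hypothetical, simplified row; the real RuntimeRunRow has more fields.
interface RunRowSketch {
  runId: string
  status: 'running' | 'completed' | 'failed' | 'cancelled'
}

interface UpsertAdapter {
  upsert(row: RunRowSketch): Promise<void>
}

// Stand-in for a consumer's existing audit logger; this signature is an assumption.
type LogAudit = (action: string, details: { runId: string }) => Promise<void>

// Decorates the persistence adapter so the run lifecycle never calls logAudit directly.
function withAudit(inner: UpsertAdapter, logAudit: LogAudit): UpsertAdapter {
  const audited = new Set<string>() // runId:status pairs already audited
  return {
    async upsert(row) {
      await inner.upsert(row) // persistence always happens
      if (row.status === 'running') return // audit terminal rows only
      const key = `${row.runId}:${row.status}`
      if (audited.has(key)) return // retried persist(): skip duplicate audit
      audited.add(key)
      await logAudit(`agent_run.${row.status}`, { runId: row.runId })
    },
  }
}
```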

Test plan

  • pnpm run lint clean (biome)
  • pnpm run typecheck clean (strict + noUncheckedIndexedAccess)
  • pnpm run test — 39 tests pass (18 existing + 12 runtime-run + 9 trace-bridge)
  • pnpm run build — ESM + d.ts emit clean (dist/index.js 45.7 KB, dist/index.d.ts 34.4 KB)
  • examples/runtime-run/runtime-run.ts runs end-to-end (toy backend, in-memory adapter)
  • Pending CI run on this PR

Consumer follow-up

Each consumer agent gets its own PR on its own version bump. Tracking:

  • legal-agent — delete eval-evidence.ts::completeProductionAgentRun + api.chat.ts::persistRuntimeRun; adopt startRuntimeRun. Optional: replace mapRuntimeEventToStreamEvent with the trace bridge if the chat client moves to consume agent-eval TraceEvent directly.
  • tax-agent / gtm-agent / creative-agent / agent-builder — same pattern. Audit each repo for a completeProductionAgentRun / persistRuntimeRun equivalent first.

…cost ledger, biome, strict indices

Take ownership of the production-run lifecycle so consumer agents (legal, tax,
gtm, creative, agent-builder) stop reinventing `agentRuns`-row plumbing. Split
the single-file runtime into eight focused modules plus two new ones, adopt the
agent-eval 0.24.0 error taxonomy, add a canonical `RuntimeRunHandle` (cost ledger
+ persistence adapter), and bring DX in line with agent-eval (biome, CI, strict
indices, JSDoc stability tags).

Module split:
- src/types.ts        — task/session/adapter/stream-event types
- src/errors.ts       — re-export agent-eval typed errors + 3 runtime-only ones
- src/readiness.ts    — `decideKnowledgeReadiness` + minimum-score validation
- src/sessions.ts     — InMemoryRuntimeSessionStore + helpers
- src/sanitize.ts     — runtime/stream event sanitization + telemetry collectors
- src/sse.ts          — server-sent event encoding
- src/backends.ts     — backend factories + stream normalization
- src/run.ts          — runAgentTask + runAgentTaskStream
- src/runtime-run.ts  — `startRuntimeRun` lifecycle + cost ledger (NEW)
- src/trace-bridge.ts — `createTraceBridge` mapping RuntimeStreamEvent -> TraceEvent (NEW)

Error taxonomy (replaces bespoke `throw new Error(...)` at user-facing sites):
- ValidationError, ConfigError, NotFoundError, CaptureIntegrityError,
  JudgeError, ReplayError, VerificationError — re-exported from agent-eval
- SessionMismatchError    — runtime-specific (wrong-backend resume)
- BackendTransportError   — runtime-specific (HTTP/IPC non-success)
- RuntimeRunStateError    — runtime-specific (lifecycle methods called out of order)

`RuntimeRunHandle` (canonical production-run lifecycle):
- startRuntimeRun({ workspaceId, sessionId, taskSpec, scenarioId, adapter })
- handle.observe(event)  — llm_call events accumulate the cost ledger
- handle.cost()          — { tokensIn, tokensOut, costUsd, wallMs, llmCalls }
- handle.complete({ status, resultSummary, cost?, error?, metadata? })
- handle.persist(metadata?) — writes `RuntimeRunRow` via adapter.upsert(row)
- handle.toRow(metadata?)   — dry-run for tests + adapter rehearsal
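The sketch below makes the `upsert(row)` contract concrete with a hypothetical Map-backed adapter, of the kind an in-memory example could use. The row shape and the `createInMemoryRunAdapter` name are assumptions. Keying by run id means a second `persist()` for the same run overwrites its row rather than appending. A D1/Postgres upsert or a KV put provides the same property.

```typescript
// Hypothetical row shape; the real RuntimeRunRow carries more columns.
interface RunRowSketch {
  runId: string
  status: 'running' | 'completed' | 'failed' | 'cancelled'
  costUsd: number
}

// Map-backed adapter: upsert replaces by primary key, so repeated writes leave one row per run.
function createInMemoryRunAdapter() {
  const rows = new Map<string, RunRowSketch>()
  return {
    async upsert(row: RunRowSketch): Promise<void> {
      rows.set(row.runId, { ...row }) // store a copy so later caller mutations don't leak in
    },
    get(runId: string): RunRowSketch | undefined {
      return rows.get(runId)
    },
    size(): number {
      return rows.size
    },
  }
}
```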

`RuntimeStreamEvent` gains an `llm_call` variant so the cost ledger has a
canonical input. Existing backends ignore unrecognized event types; new
backends emit `llm_call` per model call.

`createTraceBridge({ runId, spanId? })` exports `toTraceEvent(event)` and
`drain(events)` — drop-in replacement for the hand-rolled adapter
consumers currently write.

DX bar matches agent-eval 0.24.0:
- biome.json mirrors agent-eval (lineWidth 100, single quotes, no semis)
- .github/workflows/ci.yml: lint + typecheck + test + build on every PR
- .github/workflows/publish.yml: tag-driven npm publish (idempotent)
- tsconfig adds `noUncheckedIndexedAccess: true` (clean, 0 new failures)
- All public exports carry `@stable` JSDoc tags
- README rewritten — what runtime IS in 2 sentences, then a feature table

Telemetry sanitization sprouts an `llm_call` case so tokens/cost flow through
the safe envelope. Test coverage rises from 18 -> 39 tests (new modules):
- 12 tests in tests/runtime-run.test.ts (state machine, ledger, persistence)
- 9 tests in tests/trace-bridge.test.ts (mapping, drain, error-kind routing)

Examples gain `examples/runtime-run/` — the canonical pattern for consumer
repos to copy into their chat / task routes.

Breaking changes:
- `src/index.ts` is now a re-export hub. Deep imports (`@tangle-network/agent-runtime/src/...`) were never part of the public API and are not preserved.
- `Cannot resume X session with Y backend` now throws `SessionMismatchError` (was plain Error). Consumers that pattern-match on the message string keep working; consumers that want typed handling can narrow with `instanceof SessionMismatchError`.
- `chat backend returned <status>` now throws `BackendTransportError` carrying `status`. Same migration path.

Consumer follow-up PRs:
- legal-agent: delete `completeProductionAgentRun` + `persistRuntimeRun`; adopt `startRuntimeRun`. Delete the hand-rolled `mapRuntimeEventToStreamEvent` if the chat surface adopts the trace bridge.
- tax-agent, gtm-agent, creative-agent, agent-builder: same pattern.
tangletools merged commit 6e53aac into main on May 14, 2026
1 check passed