LLM Call Metadata Decisions

Date: 2026-05-21

Companion to REDESIGN_DECISIONS.md. Records the design choices behind the token-usage / cost / call-metadata surface that billing and metering consumers depend on. Append-only — each new decision gets the next number; older entries stay as-is even when superseded (superseding entries explicitly reference the entry they replace).

Tracking issue: constructive-planning #907.

Usage shape

  1. Reasoning is a subset of output, not a sibling. output keeps the completion_tokens value the provider reports (which already includes reasoning per OpenAI's wire contract), and reasoning is exposed as a separate read-only count. The totalTokens invariant remains input + output + cacheRead + cacheWrite — adding reasoning to the total would double-count, since the provider already folded it into completion_tokens upstream. Billing derives pure-completion tokens as output - reasoning when it needs a separate rate.

  2. Anthropic reasoning stays zero. The Anthropic Messages API does not expose a reasoning-token count even when extended thinking is on; thinking tokens are folded into output_tokens server-side. We do not fabricate a value or estimate one from the character count of the thinking content.

  3. Ollama reasoning stays zero. Ollama's native API reports only prompt_eval_count and eval_count; there is no reasoning breakdown. Same policy as Anthropic — leave the field at zero rather than guess.

  4. No OpenAI-named alias fields on Usage. The canonical shape stays input / output / reasoning / cacheRead / cacheWrite / totalTokens. Billing and downstream consumers translate at the boundary (prompt_tokens → input, completion_tokens → output, reasoning_tokens → reasoning, total_tokens → totalTokens, plus the cache fields). Adding aliases would either duplicate state or invite drift.

  5. No separate cost rate for reasoning tokens. Reasoning cost is folded into the output rate via model.cost.output. Every model we currently ship prices reasoning at the same rate as output. Add a model.cost.reasoning schedule field only when we onboard a model that prices reasoning separately.
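The shape and invariants in this section can be sketched as follows. This is an illustration rather than the shipped types: the cost sub-object is omitted, the helper names `expectedTotalTokens` and `pureCompletionTokens` are hypothetical, and the token counts are made up.

```typescript
// Canonical usage shape from this section (cost sub-object omitted for brevity).
interface Usage {
  input: number;
  output: number; // provider-reported completion tokens, reasoning included
  reasoning: number; // subset of output; 0 when the provider reports no breakdown
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

// Hypothetical helper: reasoning is deliberately absent from the sum,
// because it is already counted inside output.
function expectedTotalTokens(u: Usage): number {
  return u.input + u.output + u.cacheRead + u.cacheWrite;
}

// Hypothetical helper: tokens a billing consumer would treat as plain
// completion if it ever needs to rate reasoning separately.
function pureCompletionTokens(u: Usage): number {
  return u.output - u.reasoning;
}

// Illustrative numbers only.
const openAiStyle: Usage = {
  input: 1200, output: 800, reasoning: 300, cacheRead: 400, cacheWrite: 0, totalTokens: 2400,
};

// Anthropic and Ollama leave reasoning at 0, so the derivation degrades to output.
const anthropicStyle: Usage = {
  input: 1200, output: 800, reasoning: 0, cacheRead: 0, cacheWrite: 0, totalTokens: 2000,
};
```

With these numbers, `pureCompletionTokens(openAiStyle)` is 500 and `expectedTotalTokens(openAiStyle)` equals `totalTokens`; adding `reasoning` to the sum would overshoot by 300.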

Aggregation surface

  1. Cumulative usage lives on AgentState.totalUsage and on the agent_end and turn_end events. Reset on prompt(), preserved across continue() — matching stepCount semantics. Consumers should not have to re-walk messages[] to derive a sum we already compute. Per-message usage remains accessible at messages[i].usage.

  2. useChat exposes a single usage field (cumulative). The React hook surfaces usage: Usage | null, populated from turn_end/agent_end events and reset to null on each new prompt(). Advanced consumers can still inspect per-message usage by walking messages.
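A minimal sketch of the reset-on-prompt() / preserve-on-continue() behaviour described above. `SketchAgent`, `emptyUsage`, and `turn` are hypothetical names, the in-place `addUsage` signature is an assumption (only the helper's name appears in this document), and per-turn usage is passed in directly instead of coming from a model call. Events carry a two-level copy of the total, following the snapshot convention recorded later in this document.

```typescript
interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  cost: { total: number }; // assumption: the real cost object may carry more fields
}

const emptyUsage = (): Usage => ({
  input: 0, output: 0, reasoning: 0, cacheRead: 0, cacheWrite: 0, totalTokens: 0,
  cost: { total: 0 },
});

// Assumed signature: accumulate `delta` into `target` in place.
function addUsage(target: Usage, delta: Usage): void {
  target.input += delta.input;
  target.output += delta.output;
  target.reasoning += delta.reasoning;
  target.cacheRead += delta.cacheRead;
  target.cacheWrite += delta.cacheWrite;
  target.totalTokens += delta.totalTokens;
  target.cost.total += delta.cost.total;
}

interface UsageEvent {
  type: "turn_end" | "agent_end";
  totalUsage: Usage;
}

class SketchAgent {
  totalUsage: Usage = emptyUsage();
  readonly events: UsageEvent[] = [];

  // New request: the cumulative total starts over.
  prompt(turns: Usage[]): void {
    this.totalUsage = emptyUsage();
    this.run(turns);
  }

  // Continuation: keep accumulating onto the existing total.
  continue(turns: Usage[]): void {
    this.run(turns);
  }

  private run(turns: Usage[]): void {
    for (const t of turns) {
      addUsage(this.totalUsage, t);
      this.emit("turn_end");
    }
    this.emit("agent_end");
  }

  // Emit a two-level copy so listeners never observe later mutations.
  private emit(type: UsageEvent["type"]): void {
    const u = this.totalUsage;
    this.events.push({ type, totalUsage: { ...u, cost: { ...u.cost } } });
  }
}

// Hypothetical helper that builds one turn's usage.
const turn = (input: number, output: number): Usage => ({
  ...emptyUsage(), input, output, totalTokens: input + output,
});

const agent = new SketchAgent();
agent.prompt([turn(100, 20), turn(150, 30)]);
const afterPrompt = agent.totalUsage.input; // 250
agent.continue([turn(200, 40)]);
const afterContinue = agent.totalUsage.input; // 450
agent.prompt([turn(10, 5)]);
const afterSecondPrompt = agent.totalUsage.input; // 10
```

The first `turn_end` event still reports an input of 100 after the later turns run, because each event holds its own copy.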

Provider implementation

  1. Each provider package is standalone — no runtime dependency on agentic-kit core. packages/anthropic, packages/openai, and packages/ollama each inline their own copies of the shared types (Usage, Message, ModelDescriptor, etc.) and their own calculateUsageCost helper. This is deliberate: provider packages must be drop-in usable without pulling the agentic-kit hub. Sync between the canonical type in packages/agentic-kit/src/types.ts and the per-provider copies is a maintenance cost we accept. Any change to Usage must land in all four locations. Earlier plan drafts proposed lifting calculateUsageCost to the shared package and importing it everywhere — that proposal is rejected here. (Only packages/agent depends on agentic-kit; it imports addUsage from the hub for cumulative-usage accumulation.)

  2. Ollama calls a local calculateUsageCost on the final payload. Prior to this change, the Ollama adapter set usage.input, usage.output, and totalTokens but never invoked a cost calculator, so cost.total stayed at zero even when model.cost was populated. Fixed by adding a local calculateUsageCost helper (mirroring the ones in packages/anthropic and packages/openai) and calling it in processPayload after token counts are assigned.

  3. OpenAI no longer double-counts reasoning_tokens into output. Previously, applyUsage did output = completion_tokens + reasoning_tokens — but completion_tokens already includes reasoning per OpenAI's contract. Now: output = completion_tokens, reasoning = reasoning_tokens.

  4. OpenAI totalTokens fallback includes cacheWrite. The prior fallback was total_tokens ?? (input + output + cacheRead), which omitted cacheWrite. The fix is currently a no-op for stock OpenAI (which doesn't emit cache writes), but the old expression broke the totalTokens invariant for OpenAI-compatible endpoints (e.g. OpenRouter) that do.

  5. OpenRouter prompt_tokens_details.cache_write_tokens ingestion is deferred. No billing consumer currently asks for it. When a consumer materializes, we add the read in applyUsage and the cost rate in the relevant model descriptor — both small. Tracking under #907 follow-up.
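The OpenAI extraction and the per-provider cost helper described above might look roughly like this. It is a sketch under stated assumptions, not the shipped code: the signatures of `applyUsage` and `calculateUsageCost`, the `ModelCost` shape, and the per-million-token rate unit are all assumed, as is the choice to subtract cached prompt tokens from `input` so the buckets do not overlap.

```typescript
// Subset of the usage object on an OpenAI Chat Completions response.
interface OpenAIWireUsage {
  prompt_tokens?: number;
  completion_tokens?: number;
  total_tokens?: number;
  prompt_tokens_details?: { cached_tokens?: number };
  completion_tokens_details?: { reasoning_tokens?: number };
}

interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  cost: { total: number }; // assumption: only the total is modelled here
}

// Assumption: rates are USD per million tokens.
interface ModelCost {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

function applyUsage(usage: Usage, wire: OpenAIWireUsage): void {
  const cached = wire.prompt_tokens_details?.cached_tokens ?? 0;
  // Assumption: cached prompt tokens move to cacheRead instead of staying in input.
  usage.input = (wire.prompt_tokens ?? 0) - cached;
  usage.cacheRead = cached;
  // completion_tokens already includes reasoning, so it is not added again.
  usage.output = wire.completion_tokens ?? 0;
  usage.reasoning = wire.completion_tokens_details?.reasoning_tokens ?? 0;
  // The fallback covers all four buckets, cacheWrite included.
  usage.totalTokens =
    wire.total_tokens ??
    (usage.input + usage.output + usage.cacheRead + usage.cacheWrite);
}

// Reasoning has no rate of its own; it is billed through the output rate.
function calculateUsageCost(cost: ModelCost, usage: Usage): void {
  usage.cost.total =
    (usage.input * cost.input +
      usage.output * cost.output +
      usage.cacheRead * cost.cacheRead +
      usage.cacheWrite * cost.cacheWrite) / 1e6;
}

const usage: Usage = {
  input: 0, output: 0, reasoning: 0, cacheRead: 0, cacheWrite: 0, totalTokens: 0,
  cost: { total: 0 },
};

applyUsage(usage, {
  prompt_tokens: 1000,
  completion_tokens: 500,
  prompt_tokens_details: { cached_tokens: 400 },
  completion_tokens_details: { reasoning_tokens: 200 },
  // total_tokens is omitted to exercise the fallback
});
calculateUsageCost({ input: 2, output: 8, cacheRead: 0.5, cacheWrite: 0 }, usage);
```

Here `output` stays at 500 rather than 700, and the fallback total is 1500. Each provider package would carry its own copy of a helper like `calculateUsageCost` rather than importing it from the hub.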

Streaming and abort semantics

  1. Anthropic writes usage.input at message_start and overwrites it on message_delta. This is intentional: it ensures input-token counts survive an early stream abort (the caller has the input cost even if the completion never finishes). OpenAI providers emit usage only in the terminal chunk, so an aborted OpenAI stream yields all-zero usage; this is a provider-API limit, not something we paper over.
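The abort behaviour can be illustrated with a small reducer over a simplified Anthropic event stream. The event types below keep only the usage fields and omit content-block events; `collectUsage` and its `abortAfter` parameter are hypothetical, and treating `input_tokens` on `message_delta` as optional is an assumption.

```typescript
// Simplified stream events: only the fields that carry usage are modelled.
type AnthropicStreamEvent =
  | { type: "message_start"; message: { usage: { input_tokens: number; output_tokens: number } } }
  | { type: "message_delta"; usage: { input_tokens?: number; output_tokens: number } }
  | { type: "message_stop" };

interface StreamUsage {
  input: number;
  output: number;
}

// Hypothetical reducer: `abortAfter` simulates the caller aborting after N events.
function collectUsage(
  events: AnthropicStreamEvent[],
  abortAfter: number = events.length,
): StreamUsage {
  const usage: StreamUsage = { input: 0, output: 0 };
  for (const event of events.slice(0, abortAfter)) {
    if (event.type === "message_start") {
      // First write: input is known before any completion tokens arrive.
      usage.input = event.message.usage.input_tokens;
      usage.output = event.message.usage.output_tokens;
    } else if (event.type === "message_delta") {
      // Overwrite with the later values when the event carries them.
      usage.input = event.usage.input_tokens ?? usage.input;
      usage.output = event.usage.output_tokens;
    }
  }
  return usage;
}

const stream: AnthropicStreamEvent[] = [
  { type: "message_start", message: { usage: { input_tokens: 25, output_tokens: 1 } } },
  { type: "message_delta", usage: { input_tokens: 25, output_tokens: 15 } },
  { type: "message_stop" },
];

const aborted = collectUsage(stream, 1); // input: 25, output: 1
const completed = collectUsage(stream); // input: 25, output: 15
```

A stream that reports usage only in its final chunk would return zeros from the same reducer whenever the abort lands before that chunk.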

Out of scope (deferred, not declined)

  1. Service-tier cost multipliers (OpenAI Responses API flex/priority). Not on the agentic-kit roadmap until we add the Responses-API adapter. Pi-mono applies these as a post-hoc multiplier on usage.cost.*; we'll follow the same pattern when needed.

  2. Audio-token counts. No consumer; add when speech I/O lands.

  3. Per-session persistence / write-through to a database. Billing's consumer pulls from the event stream; storage is downstream of this package's concern.

  4. totalUsage on event emits is a shallow snapshot, not a live reference. The turn_end and agent_end events attach { ...this._state.totalUsage, cost: { ...this._state.totalUsage.cost } } rather than the mutable state object directly. Why: agent_end already does [...this._state.messages] (a shallow array copy) for the same reason: listeners receive a stable value that won't change if the agent continues running. Usage is a two-level object (cost is a nested object literal), so the copy must be two levels deep. A full deep clone (JSON.parse(JSON.stringify(...))) was rejected as overkill for a small, two-level object of numbers; structuredClone was rejected as unnecessary for the same reason. Downstream SSE serialisation (which JSON-serialises the event anyway) would have made a live reference safe in practice, but the copy-on-emit convention is consistent with the messages precedent and makes the event contract independent of the serialisation path.

  5. useChat resets usage at the start of runStream, not at the send / sendMessages / respondWithDecision call sites. All three entry points flow through runStream, so the reset is centralised there. This avoids three separate call-site edits and ensures the reset fires unconditionally for every new request, including decision-resume requests via respondWithDecision. Mirrors the agent-side rule from decision #6, the AgentState.totalUsage entry under Aggregation surface (reset on each new request, not on continue()).

  6. Live provider eval suites are opt-in, .env-loaded, excluded from default pnpm test via testPathIgnorePatterns, and never run in CI. Three suites land: packages/openai/__tests__/openai.live.test.ts, packages/ollama/__tests__/ollama.live.test.ts (extended with a new Ollama live token-usage audit block), and packages/agent/__tests__/agent.live.test.ts. Each suite is gated by <NAMESPACE>_LIVE_SUITE=smoke|extended (e.g. OPENAI_LIVE_SUITE); the pnpm test:live:<provider>{,:smoke,:extended} runners set *_LIVE_READY=1 which both un-ignores the file in Jest config and disables the global.fetch = jest.fn() mock in openai/jest.setup.js. A shared tools/test/load-env.js walks up to find a workspace .env and is silent if absent, so CI is unaffected. Why: empirical wire-shape verification is the only way to confirm load-bearing claims like "completion_tokens already includes reasoning_tokens" — but live suites are expensive (real tokens) and require secrets, so they must stay out of the default loop. How to apply: when changing usage extraction, header construction, or any wire-shape detail, run the matching pnpm test:live:*:extended locally before merging. The .gitignore was updated to cover .env / .env.local to close a secrets-leak gap.

  7. Adapter-default compat must be the base layer of createModel's merge, not the override layer. The original spread order was { ...builtIn.compat, ...this.compat, ...overrides.compat }, which silently clobbered model-specific settings (notably maxTokensField: 'max_completion_tokens' for reasoning-capable models) with the adapter's generic default ('max_tokens'). As a result, OpenAI returned 400 (Unsupported parameter: 'max_tokens') for gpt-5.4-nano. The mock-mode unit tests didn't catch it because the mocked fetch never validated the body; the live smoke test caught it on the very first real call. Why: model-specific knowledge in the built-in catalog is more authoritative than weak adapter defaults, and user-provided overrides are the most authoritative of all. How to apply: the spread order is now { ...this.compat, ...builtIn.compat, ...overrides.compat }, and the same rule applies to headers. Apply the same precedence any time a new merge of compat-like fields is introduced.
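The precedence rule for the compat merge can be demonstrated in isolation. In this sketch the `Compat` interface is reduced to two fields (`supportsTools` is a hypothetical second field), and `mergeCompat` / `mergeCompatOriginal` are illustrative stand-ins for the merge inside createModel, not its actual code.

```typescript
interface Compat {
  maxTokensField?: "max_tokens" | "max_completion_tokens";
  supportsTools?: boolean; // hypothetical second field, for illustration only
}

// Later spreads win, so sources are listed from least to most authoritative:
// adapter default, then built-in catalog entry, then user overrides.
function mergeCompat(adapterDefault: Compat, builtIn: Compat, overrides: Compat = {}): Compat {
  return { ...adapterDefault, ...builtIn, ...overrides };
}

// The original order, kept only for comparison: the adapter default
// overwrites the catalog's model-specific value.
function mergeCompatOriginal(
  adapterDefault: Compat,
  builtIn: Compat,
  overrides: Compat = {},
): Compat {
  return { ...builtIn, ...adapterDefault, ...overrides };
}

const adapterDefault: Compat = { maxTokensField: "max_tokens", supportsTools: true };
const builtIn: Compat = { maxTokensField: "max_completion_tokens" };

const before = mergeCompatOriginal(adapterDefault, builtIn); // maxTokensField: "max_tokens"
const after = mergeCompat(adapterDefault, builtIn); // maxTokensField: "max_completion_tokens"
const forced = mergeCompat(adapterDefault, builtIn, { maxTokensField: "max_tokens" });
```

In `after`, the catalog value survives while the adapter's `supportsTools` default still fills the gap; `forced` shows that an explicit user override continues to win over both.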