Date: 2026-05-21
Companion to REDESIGN_DECISIONS.md. Records the design
choices behind the token-usage / cost / call-metadata surface that billing and
metering consumers depend on. Append-only — each new decision gets the next
number; older entries stay as-is even when superseded (superseding entries
explicitly reference the entry they replace).
Tracking issue: constructive-planning #907.
-
Reasoning is a subset of
output, not a sibling.outputkeeps thecompletion_tokensvalue the provider reports (which already includes reasoning per OpenAI's wire contract), andreasoningis exposed as a separate read-only count. ThetotalTokensinvariant remainsinput + output + cacheRead + cacheWrite— addingreasoningto the total would double-count, since the provider already folded it intocompletion_tokensupstream. Billing derives pure-completion tokens asoutput - reasoningwhen it needs a separate rate. -
Anthropic
reasoningstays zero. The Anthropic Messages API does not expose a reasoning-token count even when extended thinking is on; the cost of thinking blocks is server-side folded intooutput_tokens. We do not fabricate a value or estimate from thinking-content character counts. -
Ollama
reasoningstays zero. Ollama's native API reports onlyprompt_eval_countandeval_count; there is no reasoning breakdown. Same policy as Anthropic — leave the field at zero rather than guess. -
No OpenAI-named alias fields on
Usage. The canonical shape staysinput/output/reasoning/cacheRead/cacheWrite/totalTokens. Billing and downstream consumers translate at the boundary (prompt_tokens → input,completion_tokens → output,reasoning_tokens → reasoning,total_tokens → totalTokens, plus the cache fields). Adding aliases would either duplicate state or invite drift. -
No separate cost rate for reasoning tokens. Reasoning cost is folded into the output rate via
model.cost.output. Every model we currently ship prices reasoning at the same rate as output. Add amodel.cost.reasoningschedule field only when we onboard a model that prices reasoning separately.
-
Cumulative usage lives on
AgentState.totalUsageand on theagent_endandturn_endevents. Reset onprompt(), preserved acrosscontinue()— matchingstepCountsemantics. Consumers should not have to re-walkmessages[]to derive a sum we already compute. Per-message usage remains accessible atmessages[i].usage. -
useChatexposes a singleusagefield (cumulative). The React hook surfacesusage: Usage | null, populated fromturn_end/agent_endevents and reset tonullon each newprompt(). Advanced consumers can still inspect per-message usage by walkingmessages.
-
Each provider package is standalone — no runtime dependency on
agentic-kitcore.packages/anthropic,packages/openai, andpackages/ollamaeach inline their own copies of the shared types (Usage,Message,ModelDescriptor, etc.) and their owncalculateUsageCosthelper. This is deliberate: provider packages must be drop-in usable without pulling the agentic-kit hub. Sync between the canonical type inpackages/agentic-kit/src/types.tsand the per-provider copies is a maintenance cost we accept. Any change toUsagemust land in all four locations. Earlier plan drafts proposed liftingcalculateUsageCostto the shared package and importing it everywhere — that proposal is rejected here. (Onlypackages/agentdepends onagentic-kit; it importsaddUsagefrom the hub for cumulative-usage accumulation.) -
Ollama calls a local
calculateUsageCoston the final payload. Prior to this change, the Ollama adapter setusage.input/usage.output/totalTokensbut never invoked any cost calculator — socost.totalstayed at zero even whenmodel.costwas populated. Fixed by adding a localcalculateUsageCosthelper (mirroring the ones inpackages/anthropicandpackages/openai) and calling it inprocessPayloadafter token counts are assigned. -
OpenAI no longer double-counts
reasoning_tokensintooutput. Previously,applyUsagedidoutput = completion_tokens + reasoning_tokens— butcompletion_tokensalready includes reasoning per OpenAI's contract. Now:output = completion_tokens,reasoning = reasoning_tokens. -
OpenAI
totalTokensfallback includescacheWrite. Prior fallback wasprompt_tokens ?? (input + output + cacheRead)— missingcacheWrite. Currently a no-op for stock OpenAI (which doesn't emit cache writes), but breaks the invariant for OpenAI-compatible endpoints (OpenRouter) that do. -
OpenRouter
prompt_tokens_details.cache_write_tokensingestion is deferred. No billing consumer currently asks for it. When a consumer materializes, we add the read inapplyUsageand the cost rate in the relevant model descriptor — both small. Tracking under #907 follow-up.
- Anthropic writes
usage.inputatmessage_start, and overwrites onmessage_delta. This is intentional: it ensures input-token counts survive an early stream abort (caller has the input cost even if the completion never finishes). OpenAI providers only emit usage at the terminal chunk, so an aborted OpenAI stream yields all-zero usage; this is a provider-API limit, not something we paper over.
-
Service-tier cost multipliers (OpenAI Responses API
flex/priority). Not on the agentic-kit roadmap until we add the Responses-API adapter. Pi-mono applies these as a post-hoc multiplier onusage.cost.*; we'll follow the same pattern when needed. -
Audio-token counts. No consumer; add when speech I/O lands.
-
Per-session persistence / write-through to a database. Billing's consumer pulls from the event stream; storage is downstream of this package's concern.
-
totalUsageon event emits is a shallow snapshot, not a live reference. Theturn_endandagent_endevents attach{ ...this._state.totalUsage, cost: { ...this._state.totalUsage.cost } }rather than the mutable state object directly. Why:agent_endalready does[...this._state.messages](a shallow array copy) for the same reason — listeners receive a stable value that won't change if the agent continues running.Usageis a two-level object (costis a nested object literal), so the copy must be two levels deep. A full deep clone (JSON.parse(JSON.stringify(...))) was rejected as overkill for a flat numeric object;structuredClonewas rejected as unnecessary verbosity for the same reason. Downstream SSE serialisation (which JSON-serialises the event anyway) would have made a live reference safe in practice, but the shallow-copy convention is consistent with themessagesprecedent and makes the event contract independent of the serialisation path. -
useChatresetsusageat the start ofrunStream, not at thesend/sendMessages/respondWithDecisioncall sites. All three entry-points flow throughrunStream, so the reset is centralised there. This avoids three separate call-site edits and ensures the reset fires unconditionally for every new request — including decision-resume requests viarespondWithDecision. Mirrors the agent-side rule from decision #6 (reset on each new request, not oncontinue()). -
Live provider eval suites are opt-in,
.env-loaded, excluded from defaultpnpm testviatestPathIgnorePatterns, and never run in CI. Three suites land:packages/openai/__tests__/openai.live.test.ts,packages/ollama/__tests__/ollama.live.test.ts(extended with a newOllama live token-usage auditblock), andpackages/agent/__tests__/agent.live.test.ts. Each suite is gated by<NAMESPACE>_LIVE_SUITE=smoke|extended(e.g.OPENAI_LIVE_SUITE); thepnpm test:live:<provider>{,:smoke,:extended}runners set*_LIVE_READY=1which both un-ignores the file in Jest config and disables theglobal.fetch = jest.fn()mock inopenai/jest.setup.js. A sharedtools/test/load-env.jswalks up to find a workspace.envand is silent if absent, so CI is unaffected. Why: empirical wire-shape verification is the only way to confirm load-bearing claims like "completion_tokensalready includesreasoning_tokens" — but live suites are expensive (real tokens) and require secrets, so they must stay out of the default loop. How to apply: when changing usage extraction, header construction, or any wire-shape detail, run the matchingpnpm test:live:*:extendedlocally before merging. The.gitignorewas updated to cover.env/.env.localto close a secrets-leak gap. -
Adapter-default
compatmust be the base layer ofcreateModel's merge, not the override layer. The original spread order was{ ...builtIn.compat, ...this.compat, ...overrides.compat }, which silently clobbered model-specific settings (notablymaxTokensField: 'max_completion_tokens'for reasoning-capable models) with the adapter's generic default ('max_tokens'). OpenAI returned 400 (Unsupported parameter: 'max_tokens') forgpt-5.4-nano. The mock-mode unit tests didn't catch it because the mockedfetchnever validated the body. The live smoke test caught it on the very first real call. Why: model-specific knowledge in the built-in catalog is more authoritative than weak adapter defaults; user-provided overrides are most authoritative of all. How to apply: spread order is now{ ...this.compat, ...builtIn.compat, ...overrides.compat }— same rule forheaders. Same precedence rule should be applied any time a new merge of compat-like fields is introduced.