|
| 1 | +# LLM Call Metadata Decisions |
| 2 | + |
| 3 | +Date: 2026-05-21 |
| 4 | + |
| 5 | +Companion to [`REDESIGN_DECISIONS.md`](./REDESIGN_DECISIONS.md). Records the design |
| 6 | +choices behind the token-usage / cost / call-metadata surface that billing and |
| 7 | +metering consumers depend on. Append-only — each new decision gets the next |
| 8 | +number; older entries stay as-is even when superseded (superseding entries |
| 9 | +explicitly reference the entry they replace). |
| 10 | + |
| 11 | +Tracking issue: [constructive-planning #907](https://github.com/constructive-io/constructive-planning/issues/907). |
| 12 | + |
| 13 | +## Usage shape |
| 14 | + |
| 15 | +1. **Reasoning is a subset of `output`, not a sibling.** `output` keeps the |
| 16 | + `completion_tokens` value the provider reports (which already includes |
| 17 | + reasoning per OpenAI's wire contract), and `reasoning` is exposed as a |
| 18 | + separate read-only count. The `totalTokens` invariant remains |
| 19 | + `input + output + cacheRead + cacheWrite` — adding `reasoning` to the total |
| 20 | + would double-count, since the provider already folded it into |
| 21 | + `completion_tokens` upstream. Billing derives pure-completion tokens as |
| 22 | + `output - reasoning` when it needs a separate rate. |
| 23 | + |
| 24 | +2. **Anthropic `reasoning` stays zero.** The Anthropic Messages API does not |
| 25 | + expose a reasoning-token count even when extended thinking is on; the cost |
| 26 | + of thinking blocks is folded server-side into `output_tokens`. We do not |
| 27 | + fabricate a value or estimate from thinking-content character counts. |
| 28 | + |
| 29 | +3. **Ollama `reasoning` stays zero.** Ollama's native API reports only |
| 30 | + `prompt_eval_count` and `eval_count`; there is no reasoning breakdown. |
| 31 | + Same policy as Anthropic — leave the field at zero rather than guess. |
| 32 | + |
| 33 | +4. **No OpenAI-named alias fields on `Usage`.** The canonical shape stays |
| 34 | + `input` / `output` / `reasoning` / `cacheRead` / `cacheWrite` / |
| 35 | + `totalTokens`. Billing and downstream consumers translate at the boundary |
| 36 | + (`prompt_tokens → input`, `completion_tokens → output`, |
| 37 | + `reasoning_tokens → reasoning`, `total_tokens → totalTokens`, plus the |
| 38 | + cache fields). Adding aliases would either duplicate state or invite drift. |
| 39 | + |
| 40 | +5. **No separate cost rate for reasoning tokens.** Reasoning cost is folded |
| 41 | + into the output rate via `model.cost.output`. Every model we currently ship |
| 42 | + prices reasoning at the same rate as output. Add a `model.cost.reasoning` |
| 43 | + schedule field only when we onboard a model that prices reasoning |
| 44 | + separately. |
| 45 | + |
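A minimal sketch of the usage shape in decisions 1 to 5, under stated assumptions. `OpenAIWireUsage`, `fromOpenAI`, and `pureCompletionTokens` are hypothetical names used only for illustration, the cache fields are left at zero for brevity, and the nested `completion_tokens_details.reasoning_tokens` path is assumed from OpenAI's Chat Completions usage payload. It is not the repository's implementation.

```typescript
// Canonical field names from decision 4; `cost` is omitted to keep the sketch short.
interface Usage {
  input: number;
  output: number;      // includes reasoning tokens (decision 1)
  reasoning: number;   // read-only subset of `output`, never added to the total
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

// Hypothetical subset of an OpenAI-style usage payload.
interface OpenAIWireUsage {
  prompt_tokens: number;
  completion_tokens: number;
  total_tokens?: number;
  completion_tokens_details?: { reasoning_tokens?: number };
}

// Boundary translation: wire names on the left, canonical names on the right.
function fromOpenAI(wire: OpenAIWireUsage): Usage {
  const input = wire.prompt_tokens;
  const output = wire.completion_tokens; // already contains reasoning
  const reasoning = wire.completion_tokens_details?.reasoning_tokens ?? 0;
  const cacheRead = 0;  // cache ingestion left out of this sketch
  const cacheWrite = 0;
  return {
    input,
    output,
    reasoning,
    cacheRead,
    cacheWrite,
    // Invariant: reasoning is NOT a term here, or it would be counted twice.
    totalTokens: wire.total_tokens ?? (input + output + cacheRead + cacheWrite),
  };
}

// What billing would compute if it ever needed a separate reasoning rate.
function pureCompletionTokens(usage: Usage): number {
  return usage.output - usage.reasoning;
}

const exampleUsage = fromOpenAI({
  prompt_tokens: 100,
  completion_tokens: 60,
  completion_tokens_details: { reasoning_tokens: 40 },
});
console.log(exampleUsage.totalTokens, pureCompletionTokens(exampleUsage));
```

With these inputs the total is `100 + 60`, and the pure-completion count is `60 - 40`; the 40 reasoning tokens appear only inside `output`.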
| 46 | +## Aggregation surface |
| 47 | + |
| 48 | +6. **Cumulative usage lives on `AgentState.totalUsage` and on the |
| 49 | + `agent_end` and `turn_end` events.** Reset on `prompt()`, preserved across |
| 50 | + `continue()` — matching `stepCount` semantics. Consumers should not have |
| 51 | + to re-walk `messages[]` to derive a sum we already compute. Per-message |
| 52 | + usage remains accessible at `messages[i].usage`. |
| 53 | + |
| 54 | +7. **`useChat` exposes a single `usage` field (cumulative).** The React hook |
| 55 | + surfaces `usage: Usage | null`, populated from `turn_end`/`agent_end` |
| 56 | + events and reset to `null` on each new `prompt()`. Advanced consumers can |
| 57 | + still inspect per-message usage by walking `messages`. |
| 58 | + |
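The reset-versus-preserve rule in decision 6 can be sketched as follows. `MiniAgent` and `emptyUsage` are hypothetical, the `cost` object is reduced to a single `total` field, and the local `addUsage` only assumes what the hub helper of the same name might do; its real signature is not shown in this document.

```typescript
interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  cost: { total: number }; // simplified; the real cost object may have more fields
}

const emptyUsage = (): Usage => ({
  input: 0, output: 0, reasoning: 0, cacheRead: 0, cacheWrite: 0,
  totalTokens: 0, cost: { total: 0 },
});

// Assumed behaviour: field-wise sum that returns a new object.
function addUsage(a: Usage, b: Usage): Usage {
  return {
    input: a.input + b.input,
    output: a.output + b.output,
    reasoning: a.reasoning + b.reasoning,
    cacheRead: a.cacheRead + b.cacheRead,
    cacheWrite: a.cacheWrite + b.cacheWrite,
    totalTokens: a.totalTokens + b.totalTokens,
    cost: { total: a.cost.total + b.cost.total },
  };
}

// Hypothetical stand-in for the agent: each call receives per-message usages.
class MiniAgent {
  totalUsage: Usage = emptyUsage();

  // New request: cumulative usage starts from zero.
  prompt(messageUsages: Usage[]): Usage {
    this.totalUsage = emptyUsage();
    return this.run(messageUsages);
  }

  // Continuation: cumulative usage carries over.
  continue(messageUsages: Usage[]): Usage {
    return this.run(messageUsages);
  }

  private run(messageUsages: Usage[]): Usage {
    for (const u of messageUsages) {
      this.totalUsage = addUsage(this.totalUsage, u);
    }
    // What an `agent_end`-style event would carry: a copy, not the live state.
    return { ...this.totalUsage, cost: { ...this.totalUsage.cost } };
  }
}

const agent = new MiniAgent();
const oneMessage: Usage = { ...emptyUsage(), input: 10, output: 5, totalTokens: 15, cost: { total: 0.5 } };
const afterPrompt = agent.prompt([oneMessage]);
const afterContinue = agent.continue([oneMessage]);
console.log(afterPrompt.totalTokens, afterContinue.totalTokens);
```

A consumer reading the returned value gets the running sum without re-walking the message list.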
| 59 | +## Provider implementation |
| 60 | + |
| 61 | +8. **Each provider package is standalone — no runtime dependency on |
| 62 | + `agentic-kit` core.** `packages/anthropic`, `packages/openai`, and |
| 63 | + `packages/ollama` each inline their own copies of the shared types |
| 64 | + (`Usage`, `Message`, `ModelDescriptor`, etc.) and their own |
| 65 | + `calculateUsageCost` helper. This is deliberate: provider packages must |
| 66 | + be drop-in usable without pulling the agentic-kit hub. Sync between the |
| 67 | + canonical type in `packages/agentic-kit/src/types.ts` and the per-provider |
| 68 | + copies is a maintenance cost we accept. Any change to `Usage` must land in |
| 69 | + all four locations. Earlier plan drafts proposed lifting |
| 70 | + `calculateUsageCost` to the shared package and importing it everywhere — |
| 71 | + that proposal is rejected here. (Only `packages/agent` depends on |
| 72 | + `agentic-kit`; it imports `addUsage` from the hub for cumulative-usage |
| 73 | + accumulation.) |
| 74 | + |
| 75 | +9. **Ollama calls a local `calculateUsageCost` on the final payload.** Prior |
| 76 | + to this change, the Ollama adapter set `usage.input`/`usage.output`/ |
| 77 | + `totalTokens` but never invoked any cost calculator — so `cost.total` |
| 78 | + stayed at zero even when `model.cost` was populated. Fixed by adding a |
| 79 | + local `calculateUsageCost` helper (mirroring the ones in |
| 80 | + `packages/anthropic` and `packages/openai`) and calling it in |
| 81 | + `processPayload` after token counts are assigned. |
| 82 | + |
| 83 | +10. **OpenAI no longer double-counts `reasoning_tokens` into `output`.** |
| 84 | + Previously, `applyUsage` did |
| 85 | + `output = completion_tokens + reasoning_tokens` — but |
| 86 | + `completion_tokens` already includes reasoning per OpenAI's contract. |
| 87 | + Now: `output = completion_tokens`, `reasoning = reasoning_tokens`. |
| 88 | + |
| 89 | +11. **OpenAI `totalTokens` fallback includes `cacheWrite`.** The prior fallback |
| 90 | + was `total_tokens ?? (input + output + cacheRead)`, which omitted `cacheWrite`. |
| 91 | + The fix is currently a no-op for stock OpenAI (which doesn't emit cache writes), |
| 92 | + but the omission breaks the invariant for OpenAI-compatible endpoints |
| 93 | + (OpenRouter) that do. |
| 94 | + |
| 95 | +12. **OpenRouter `prompt_tokens_details.cache_write_tokens` ingestion is |
| 96 | + deferred.** No billing consumer currently asks for it. When a consumer |
| 97 | + materializes, we add the read in `applyUsage` and the cost rate in the |
| 98 | + relevant model descriptor — both small. Tracking under #907 follow-up. |
| 99 | + |
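To illustrate decision 9, here is a hedged sketch of a provider-local cost helper applied to an Ollama-style final payload. The rate unit (currency per million tokens), the `ModelCost` and `UsageCost` shapes, and the `processPayload` signature are assumptions made for the example; only the names `calculateUsageCost`, `processPayload`, `prompt_eval_count`, and `eval_count` come from the text above.

```typescript
// Assumed rate table: price per one million tokens, per token class.
interface ModelCost {
  input: number;
  output: number; // also covers reasoning tokens (decision 5)
  cacheRead: number;
  cacheWrite: number;
}

interface UsageCost extends ModelCost {
  total: number;
}

interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  cost: UsageCost;
}

function calculateUsageCost(rates: ModelCost, usage: Usage): UsageCost {
  const price = (tokens: number, ratePerMillion: number): number =>
    (tokens / 1e6) * ratePerMillion;
  const input = price(usage.input, rates.input);
  const output = price(usage.output, rates.output);
  const cacheRead = price(usage.cacheRead, rates.cacheRead);
  const cacheWrite = price(usage.cacheWrite, rates.cacheWrite);
  return { input, output, cacheRead, cacheWrite, total: input + output + cacheRead + cacheWrite };
}

// Hypothetical signature: assign token counts first, then compute cost.
function processPayload(
  payload: { prompt_eval_count?: number; eval_count?: number },
  rates: ModelCost,
): Usage {
  const input = payload.prompt_eval_count ?? 0;
  const output = payload.eval_count ?? 0;
  const usage: Usage = {
    input,
    output,
    reasoning: 0, // Ollama reports no reasoning breakdown (decision 3)
    cacheRead: 0,
    cacheWrite: 0,
    totalTokens: input + output,
    cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0, total: 0 },
  };
  // The step that was missing before the fix: without it, cost.total stays 0.
  usage.cost = calculateUsageCost(rates, usage);
  return usage;
}

const priced = processPayload(
  { prompt_eval_count: 1_000_000, eval_count: 500_000 },
  { input: 2, output: 8, cacheRead: 0, cacheWrite: 0 },
);
console.log(priced.cost.total);
```

Because each provider package carries its own copy of a helper like this, a change to the `Usage` or cost shape has to be repeated in every copy, which is the maintenance cost decision 8 accepts.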
| 100 | +## Streaming and abort semantics |
| 101 | + |
| 102 | +13. **Anthropic writes `usage.input` at `message_start`, and overwrites on |
| 103 | + `message_delta`.** This is intentional: it ensures input-token counts |
| 104 | + survive an early stream abort (caller has the input cost even if the |
| 105 | + completion never finishes). OpenAI providers emit usage only in the |
| 106 | + terminal chunk, so an aborted OpenAI stream yields all-zero usage; this |
| 107 | + is a provider-API limit, not something we paper over. |
| 108 | + |
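The abort behaviour in decision 13 can be sketched with a reduced event model. The `StreamEvent` union is a simplified assumption based on the Anthropic streaming event names; `applyEvent` and `consume` are hypothetical helpers, and `abortAfter` merely simulates a caller cancelling the stream.

```typescript
// Simplified event shapes; real events carry many more fields.
type StreamEvent =
  | { type: 'message_start'; message: { usage: { input_tokens: number } } }
  | { type: 'content_block_delta' }
  | { type: 'message_delta'; usage: { input_tokens?: number; output_tokens: number } };

interface PartialUsage {
  input: number;
  output: number;
}

function applyEvent(usage: PartialUsage, event: StreamEvent): void {
  switch (event.type) {
    case 'message_start':
      // Written as early as possible so it survives an abort.
      usage.input = event.message.usage.input_tokens;
      break;
    case 'message_delta':
      // Overwrite with the later value when the delta carries one.
      if (event.usage.input_tokens !== undefined) {
        usage.input = event.usage.input_tokens;
      }
      usage.output = event.usage.output_tokens;
      break;
    case 'content_block_delta':
      break; // no usage information on content events in this sketch
  }
}

// Processes only the first `abortAfter` events to simulate an early abort.
function consume(events: StreamEvent[], abortAfter: number = events.length): PartialUsage {
  const usage: PartialUsage = { input: 0, output: 0 };
  for (const event of events.slice(0, abortAfter)) {
    applyEvent(usage, event);
  }
  return usage;
}

const stream: StreamEvent[] = [
  { type: 'message_start', message: { usage: { input_tokens: 25 } } },
  { type: 'content_block_delta' },
  { type: 'message_delta', usage: { output_tokens: 12 } },
];

console.log(consume(stream, 2)); // aborted before the final delta
console.log(consume(stream));    // ran to completion
```

In the aborted case the input count is already populated while the output count is still zero. A provider that reports usage only in its last chunk would leave both fields at zero in the same situation.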
| 109 | +## Out of scope (deferred, not declined) |
| 110 | + |
| 111 | +14. **Service-tier cost multipliers (OpenAI Responses API |
| 112 | + `flex`/`priority`).** Not on the agentic-kit roadmap until we add the |
| 113 | + Responses-API adapter. Pi-mono applies these as a post-hoc multiplier |
| 114 | + on `usage.cost.*`; we'll follow the same pattern when needed. |
| 115 | + |
| 116 | +15. **Audio-token counts.** No consumer; add when speech I/O lands. |
| 117 | + |
| 118 | +16. **Per-session persistence / write-through to a database.** Billing's |
| 119 | + consumer pulls from the event stream; storage is downstream of this |
| 120 | + package's concern. |
| 121 | + |
| 122 | +17. **`totalUsage` on emitted events is a snapshot copy, not a live reference.** |
| 123 | + The `turn_end` and `agent_end` events attach |
| 124 | + `{ ...this._state.totalUsage, cost: { ...this._state.totalUsage.cost } }` |
| 125 | + rather than the mutable state object directly. Why: `agent_end` already |
| 126 | + does `[...this._state.messages]` (a shallow array copy) for the same |
| 127 | + reason — listeners receive a stable value that won't change if the agent |
| 128 | + continues running. `Usage` is a two-level object (`cost` is a nested |
| 129 | + object literal), so the copy must be two levels deep. A full deep clone |
| 130 | + (`JSON.parse(JSON.stringify(...))`) was rejected as overkill for a flat |
| 131 | + numeric object; `structuredClone` was rejected as unnecessary verbosity |
| 132 | + for the same reason. Downstream SSE serialisation (which JSON-serialises |
| 133 | + the event anyway) would have made a live reference safe in practice, but |
| 134 | + the shallow-copy convention is consistent with the `messages` precedent |
| 135 | + and makes the event contract independent of the serialisation path. |
| 136 | + |
| 137 | +18. **`useChat` resets `usage` at the start of `runStream`, not at the |
| 138 | + `send` / `sendMessages` / `respondWithDecision` call sites.** All three |
| 139 | + entry-points flow through `runStream`, so the reset is centralised there. |
| 140 | + This avoids three separate call-site edits and ensures the reset fires |
| 141 | + unconditionally for every new request — including decision-resume |
| 142 | + requests via `respondWithDecision`. Mirrors the agent-side rule from |
| 143 | + decision #6 (reset on each new request, not on `continue()`). |
| 144 | + |
| 145 | +19. **Live provider eval suites are opt-in, `.env`-loaded, excluded from |
| 146 | + default `pnpm test` via `testPathIgnorePatterns`, and never run in CI.** |
| 147 | + Three suites land: `packages/openai/__tests__/openai.live.test.ts`, |
| 148 | + `packages/ollama/__tests__/ollama.live.test.ts` (extended with a new |
| 149 | + `Ollama live token-usage audit` block), and |
| 150 | + `packages/agent/__tests__/agent.live.test.ts`. Each suite is gated by |
| 151 | + `<NAMESPACE>_LIVE_SUITE=smoke|extended` (e.g. `OPENAI_LIVE_SUITE`); the |
| 152 | + `pnpm test:live:<provider>{,:smoke,:extended}` runners set |
| 153 | + `*_LIVE_READY=1`, which both un-ignores the file in the Jest config and |
| 154 | + disables the `global.fetch = jest.fn()` mock in `openai/jest.setup.js`. |
| 155 | + A shared `tools/test/load-env.js` walks up to find a workspace `.env` |
| 156 | + and is silent if absent, so CI is unaffected. Why: empirical wire-shape |
| 157 | + verification is the only way to confirm load-bearing claims like |
| 158 | + "`completion_tokens` already includes `reasoning_tokens`" — but live |
| 159 | + suites are expensive (real tokens) and require secrets, so they must |
| 160 | + stay out of the default loop. How to apply: when changing usage |
| 161 | + extraction, header construction, or any wire-shape detail, run the |
| 162 | + matching `pnpm test:live:*:extended` locally before merging. The |
| 163 | + `.gitignore` was updated to cover `.env` / `.env.local` to close a |
| 164 | + secrets-leak gap. |
| 165 | + |
| 166 | +20. **Adapter-default `compat` must be the base layer of `createModel`'s |
| 167 | + merge, not the override layer.** The original spread order was |
| 168 | + `{ ...builtIn.compat, ...this.compat, ...overrides.compat }`, which |
| 169 | + silently clobbered model-specific settings (notably |
| 170 | + `maxTokensField: 'max_completion_tokens'` for reasoning-capable models) |
| 171 | + with the adapter's generic default (`'max_tokens'`). OpenAI returned |
| 172 | + 400 (`Unsupported parameter: 'max_tokens'`) for `gpt-5.4-nano`. The |
| 173 | + mock-mode unit tests didn't catch it because the mocked `fetch` never |
| 174 | + validated the body. The live smoke test caught it on the very first |
| 175 | + real call. Why: model-specific knowledge in the built-in catalog is |
| 176 | + more authoritative than weak adapter defaults; user-provided overrides |
| 177 | + are most authoritative of all. How to apply: spread order is now |
| 178 | + `{ ...this.compat, ...builtIn.compat, ...overrides.compat }` — same |
| 179 | + rule for `headers`. Apply the same precedence whenever a new merge of |
| 180 | + compat-like fields is introduced. |
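The precedence rule in decision 20 comes down to object-spread order, where later spreads win. A minimal sketch, assuming a `Compat` type with only the `maxTokensField` setting; `mergeCompat` and `mergeCompatOld` are hypothetical helpers, not the actual `createModel` code.

```typescript
interface Compat {
  maxTokensField?: 'max_tokens' | 'max_completion_tokens';
}

// Fixed order: adapter default is the base, catalog refines it, user overrides win.
function mergeCompat(adapterDefault?: Compat, builtIn?: Compat, overrides?: Compat): Compat {
  return { ...adapterDefault, ...builtIn, ...overrides };
}

// Original order, kept only to show the failure mode: the adapter default
// is spread after the catalog entry and clobbers it.
function mergeCompatOld(adapterDefault?: Compat, builtIn?: Compat, overrides?: Compat): Compat {
  return { ...builtIn, ...adapterDefault, ...overrides };
}

const adapterDefault: Compat = { maxTokensField: 'max_tokens' };
const catalogEntry: Compat = { maxTokensField: 'max_completion_tokens' };

console.log(mergeCompat(adapterDefault, catalogEntry).maxTokensField);
console.log(mergeCompatOld(adapterDefault, catalogEntry).maxTokensField);
console.log(
  mergeCompat(adapterDefault, catalogEntry, { maxTokensField: 'max_tokens' }).maxTokensField,
);
```

The first call keeps the catalog's `max_completion_tokens`, the second reproduces the clobbering that caused the 400 response, and the third shows an explicit user override still taking priority. Spreading `undefined` is a no-op, so absent layers need no special handling.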