Commit 79067c6

Merge pull request #10 from constructive-io/worktree-issue-907-llm-call-metadata
feat: surface token usage metadata for billing (#907)
2 parents d4f3f66 + 5118c06 commit 79067c6

37 files changed

Lines changed: 6935 additions & 9856 deletions

.gitignore

Lines changed: 5 additions & 1 deletion
@@ -4,4 +4,8 @@
 **/yarn-error.log
 lerna-debug.log
 **/src/*.js
-**/src/*.d.ts
+**/src/*.d.ts
+.env
+.env.local
+**/.env
+**/.env.local

LLM_METADATA_DECISIONS.md

Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
# LLM Call Metadata Decisions

Date: 2026-05-21

Companion to [`REDESIGN_DECISIONS.md`](./REDESIGN_DECISIONS.md). Records the design
choices behind the token-usage / cost / call-metadata surface that billing and
metering consumers depend on. Append-only: each new decision gets the next
number; older entries stay as-is even when superseded (superseding entries
explicitly reference the entry they replace).

Tracking issue: [constructive-planning #907](https://github.com/constructive-io/constructive-planning/issues/907).

## Usage shape

1. **Reasoning is a subset of `output`, not a sibling.** `output` keeps the
   `completion_tokens` value the provider reports (which already includes
   reasoning per OpenAI's wire contract), and `reasoning` is exposed as a
   separate read-only count. The `totalTokens` invariant remains
   `input + output + cacheRead + cacheWrite`; adding `reasoning` to the total
   would double-count, since the provider already folded it into
   `completion_tokens` upstream. Billing derives pure-completion tokens as
   `output - reasoning` when it needs a separate rate.

2. **Anthropic `reasoning` stays zero.** The Anthropic Messages API does not
   expose a reasoning-token count even when extended thinking is on; the cost
   of thinking blocks is folded into `output_tokens` server-side. We do not
   fabricate a value or estimate one from thinking-content character counts.

3. **Ollama `reasoning` stays zero.** Ollama's native API reports only
   `prompt_eval_count` and `eval_count`; there is no reasoning breakdown.
   Same policy as Anthropic: leave the field at zero rather than guess.

4. **No OpenAI-named alias fields on `Usage`.** The canonical shape stays
   `input` / `output` / `reasoning` / `cacheRead` / `cacheWrite` /
   `totalTokens`. Billing and downstream consumers translate at the boundary
   (`prompt_tokens → input`, `completion_tokens → output`,
   `reasoning_tokens → reasoning`, `total_tokens → totalTokens`, plus the
   cache fields). Adding aliases would either duplicate state or invite drift.

5. **No separate cost rate for reasoning tokens.** Reasoning cost is folded
   into the output rate via `model.cost.output`. Every model we currently ship
   prices reasoning at the same rate as output. Add a `model.cost.reasoning`
   schedule field only when we onboard a model that prices reasoning
   separately.
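
The shape rules above can be illustrated with a minimal sketch. `UsageCounts`, `WireUsage`, and the three helper functions are hypothetical names introduced only for this example; the flat `WireUsage` layout is an assumption and is not the provider's exact payload. Cache fields stay at zero because their wire names are not spelled out in the decisions above.

```typescript
// Token-count half of the canonical shape (the `cost` object is omitted here).
interface UsageCounts {
  input: number;
  output: number; // provider's completion_tokens, reasoning already included
  reasoning: number; // subset of `output`, reported separately
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
}

// Hypothetical flat wire payload using the OpenAI-style names from decision 4.
interface WireUsage {
  prompt_tokens: number;
  completion_tokens: number;
  reasoning_tokens?: number;
  total_tokens?: number;
}

// The invariant from decision 1: `reasoning` is deliberately absent.
function expectedTotal(u: UsageCounts): number {
  return u.input + u.output + u.cacheRead + u.cacheWrite;
}

// What billing would compute when it needs a separate reasoning rate.
function pureCompletionTokens(u: UsageCounts): number {
  return u.output - u.reasoning;
}

// Translate at the boundary instead of adding alias fields to the shape.
function toUsageCounts(wire: WireUsage): UsageCounts {
  const translated: UsageCounts = {
    input: wire.prompt_tokens,
    output: wire.completion_tokens,
    reasoning: wire.reasoning_tokens ?? 0,
    cacheRead: 0,
    cacheWrite: 0,
    totalTokens: 0,
  };
  translated.totalTokens = wire.total_tokens ?? expectedTotal(translated);
  return translated;
}

const sampleCounts = toUsageCounts({
  prompt_tokens: 120,
  completion_tokens: 80,
  reasoning_tokens: 30,
  total_tokens: 200,
});
console.log(expectedTotal(sampleCounts) === sampleCounts.totalTokens); // true
console.log(pureCompletionTokens(sampleCounts)); // 50
```

Adding `reasoning` into `expectedTotal` here would yield 230 instead of the reported 200, which is the double-count decision 1 warns about.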

## Aggregation surface

6. **Cumulative usage lives on `AgentState.totalUsage` and on the
   `agent_end` and `turn_end` events.** Reset on `prompt()`, preserved across
   `continue()`, matching `stepCount` semantics. Consumers should not have
   to re-walk `messages[]` to derive a sum we already compute. Per-message
   usage remains accessible at `messages[i].usage`.

7. **`useChat` exposes a single `usage` field (cumulative).** The React hook
   surfaces `usage: Usage | null`, populated from `turn_end`/`agent_end`
   events and reset to `null` on each new `prompt()`. Advanced consumers can
   still inspect per-message usage by walking `messages`.
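
The reset and preserve semantics can be sketched as follows. `UsageLedger`, `emptyUsage`, and `usageOf` are hypothetical stand-ins for the agent's internal bookkeeping, and the `addUsage` shown here only assumes a field-wise sum; the real helper's signature may differ.

```typescript
interface UsageCost {
  input: number;
  output: number;
  total: number;
}

interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  cost: UsageCost;
}

function emptyUsage(): Usage {
  return {
    input: 0, output: 0, reasoning: 0, cacheRead: 0, cacheWrite: 0, totalTokens: 0,
    cost: { input: 0, output: 0, total: 0 },
  };
}

// Field-wise sum that returns a fresh object (assumed behaviour).
function addUsage(a: Usage, b: Usage): Usage {
  return {
    input: a.input + b.input,
    output: a.output + b.output,
    reasoning: a.reasoning + b.reasoning,
    cacheRead: a.cacheRead + b.cacheRead,
    cacheWrite: a.cacheWrite + b.cacheWrite,
    totalTokens: a.totalTokens + b.totalTokens,
    cost: {
      input: a.cost.input + b.cost.input,
      output: a.cost.output + b.cost.output,
      total: a.cost.total + b.cost.total,
    },
  };
}

// Hypothetical stand-in for the agent's cumulative-usage bookkeeping.
class UsageLedger {
  totalUsage: Usage = emptyUsage();

  // prompt(): a new request starts from zero.
  beginPrompt(): void {
    this.totalUsage = emptyUsage();
  }

  // Called once per finished assistant message, in prompt() and continue() runs alike.
  record(messageUsage: Usage): void {
    this.totalUsage = addUsage(this.totalUsage, messageUsage);
  }
}

// Demo helper: a per-message usage with only input/output populated.
function usageOf(input: number, output: number): Usage {
  return { ...emptyUsage(), input, output, totalTokens: input + output };
}

const ledger = new UsageLedger();
ledger.beginPrompt();
ledger.record(usageOf(100, 20)); // first turn via prompt()
ledger.record(usageOf(130, 10)); // second turn via continue(): no reset
console.log(ledger.totalUsage.totalTokens); // 260

ledger.beginPrompt(); // a fresh prompt() resets the running total
ledger.record(usageOf(50, 5));
console.log(ledger.totalUsage.totalTokens); // 55
```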

## Provider implementation

8. **Each provider package is standalone, with no runtime dependency on
   `agentic-kit` core.** `packages/anthropic`, `packages/openai`, and
   `packages/ollama` each inline their own copies of the shared types
   (`Usage`, `Message`, `ModelDescriptor`, etc.) and their own
   `calculateUsageCost` helper. This is deliberate: provider packages must
   be drop-in usable without pulling in the agentic-kit hub. Sync between the
   canonical type in `packages/agentic-kit/src/types.ts` and the per-provider
   copies is a maintenance cost we accept. Any change to `Usage` must land in
   all four locations. Earlier plan drafts proposed lifting
   `calculateUsageCost` to the shared package and importing it everywhere;
   that proposal is rejected here. (Only `packages/agent` depends on
   `agentic-kit`; it imports `addUsage` from the hub for cumulative-usage
   accumulation.)

9. **Ollama calls a local `calculateUsageCost` on the final payload.** Prior
   to this change, the Ollama adapter set `usage.input`/`usage.output`/
   `totalTokens` but never invoked any cost calculator, so `cost.total`
   stayed at zero even when `model.cost` was populated. Fixed by adding a
   local `calculateUsageCost` helper (mirroring the ones in
   `packages/anthropic` and `packages/openai`) and calling it in
   `processPayload` after token counts are assigned.

10. **OpenAI no longer double-counts `reasoning_tokens` into `output`.**
    Previously, `applyUsage` did
    `output = completion_tokens + reasoning_tokens`, but
    `completion_tokens` already includes reasoning per OpenAI's contract.
    Now: `output = completion_tokens`, `reasoning = reasoning_tokens`.

11. **OpenAI `totalTokens` fallback includes `cacheWrite`.** Prior fallback
    was `total_tokens ?? (input + output + cacheRead)`, which omitted
    `cacheWrite`. The change is currently a no-op for stock OpenAI (which
    doesn't emit cache writes), but the old fallback broke the invariant for
    OpenAI-compatible endpoints (e.g. OpenRouter) that do.

12. **OpenRouter `prompt_tokens_details.cache_write_tokens` ingestion is
    deferred.** No billing consumer currently asks for it. When a consumer
    materializes, we add the read in `applyUsage` and the cost rate in the
    relevant model descriptor; both are small changes. Tracked as a #907
    follow-up.
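
As a rough illustration of decisions 9 and 11, the sketch below shows a per-provider cost helper and a `totalTokens` fallback that includes `cacheWrite`. The signature of `calculateUsageCost`, the `CostRates` type, the `resolveTotalTokens` helper, and the rate unit (USD per one million tokens) are all assumptions made for this example; the source does not state them.

```typescript
// Assumed rate schedule: USD per one million tokens (unit not stated in the source).
interface CostRates {
  input: number;
  output: number;
  cacheRead: number;
  cacheWrite: number;
}

interface Usage {
  input: number;
  output: number;
  reasoning: number;
  cacheRead: number;
  cacheWrite: number;
  totalTokens: number;
  cost: { input: number; output: number; cacheRead: number; cacheWrite: number; total: number };
}

const PER_MILLION = 1_000_000;

// Fills `usage.cost` in place. Reasoning has no rate of its own: it is already
// part of `usage.output`, so it is billed at the output rate (decision 5).
function calculateUsageCost(rates: CostRates, usage: Usage): void {
  const c = usage.cost;
  c.input = (usage.input * rates.input) / PER_MILLION;
  c.output = (usage.output * rates.output) / PER_MILLION;
  c.cacheRead = (usage.cacheRead * rates.cacheRead) / PER_MILLION;
  c.cacheWrite = (usage.cacheWrite * rates.cacheWrite) / PER_MILLION;
  c.total = c.input + c.output + c.cacheRead + c.cacheWrite;
}

// Prefer the provider-reported total; otherwise rebuild it from all four
// counted fields, including `cacheWrite`.
function resolveTotalTokens(usage: Usage, reportedTotal?: number): number {
  return reportedTotal ?? (usage.input + usage.output + usage.cacheRead + usage.cacheWrite);
}

const sampleUsage: Usage = {
  input: 1_000_000,
  output: 500_000,
  reasoning: 200_000,
  cacheRead: 2_000_000,
  cacheWrite: 100_000,
  totalTokens: 0,
  cost: { input: 0, output: 0, cacheRead: 0, cacheWrite: 0, total: 0 },
};

sampleUsage.totalTokens = resolveTotalTokens(sampleUsage); // provider sent no total
calculateUsageCost({ input: 2, output: 8, cacheRead: 0.5, cacheWrite: 2.5 }, sampleUsage);

console.log(sampleUsage.totalTokens); // 3600000
console.log(sampleUsage.cost.total); // 7.25
```

Skipping the `calculateUsageCost` call leaves `cost.total` at zero even though the rates are populated, which matches the Ollama symptom described in decision 9.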

## Streaming and abort semantics

13. **Anthropic writes `usage.input` at `message_start`, and overwrites on
    `message_delta`.** This is intentional: it ensures input-token counts
    survive an early stream abort (caller has the input cost even if the
    completion never finishes). OpenAI providers only emit usage at the
    terminal chunk, so an aborted OpenAI stream yields all-zero usage; this
    is a provider-API limit, not something we paper over.
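
A minimal sketch of why the early write matters, assuming a simplified event shape. `AnthropicUsageEvent` keeps only the usage-bearing fields of the `message_start` and `message_delta` stream events; `StreamUsage`, `applyUsageEvent`, and `replay` are hypothetical names for this example, and cache counts are left out.

```typescript
interface StreamUsage {
  input: number;
  output: number;
  totalTokens: number;
}

// Simplified view of the two usage-bearing stream events.
type AnthropicUsageEvent =
  | { type: 'message_start'; message: { usage: { input_tokens: number; output_tokens: number } } }
  | { type: 'message_delta'; usage: { output_tokens: number; input_tokens?: number } };

// Write usage as soon as it is known, then overwrite with later values.
function applyUsageEvent(usage: StreamUsage, ev: AnthropicUsageEvent): void {
  if (ev.type === 'message_start') {
    usage.input = ev.message.usage.input_tokens;
    usage.output = ev.message.usage.output_tokens;
  } else {
    usage.input = ev.usage.input_tokens ?? usage.input;
    usage.output = ev.usage.output_tokens;
  }
  usage.totalTokens = usage.input + usage.output;
}

// Applies only the first `abortAfter` events, simulating an aborted stream.
function replay(events: AnthropicUsageEvent[], abortAfter: number): StreamUsage {
  const usage: StreamUsage = { input: 0, output: 0, totalTokens: 0 };
  for (const ev of events.slice(0, abortAfter)) {
    applyUsageEvent(usage, ev);
  }
  return usage;
}

const sampleEvents: AnthropicUsageEvent[] = [
  { type: 'message_start', message: { usage: { input_tokens: 42, output_tokens: 1 } } },
  { type: 'message_delta', usage: { output_tokens: 17 } },
];

console.log(replay(sampleEvents, 1)); // aborted early: input is already 42
console.log(replay(sampleEvents, 2)); // completed: output overwritten to 17
```

A provider that reports usage only on its terminal chunk would have nothing to apply before the abort, so the same replay would return all zeros.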

## Out of scope (deferred, not declined)

14. **Service-tier cost multipliers (OpenAI Responses API
    `flex`/`priority`).** Not on the agentic-kit roadmap until we add the
    Responses-API adapter. Pi-mono applies these as a post-hoc multiplier
    on `usage.cost.*`; we'll follow the same pattern when needed.

15. **Audio-token counts.** No consumer; add when speech I/O lands.

16. **Per-session persistence / write-through to a database.** Billing's
    consumer pulls from the event stream; storage is downstream of this
    package's concern.

17. **`totalUsage` on event emits is a shallow snapshot, not a live reference.**
    The `turn_end` and `agent_end` events attach
    `{ ...this._state.totalUsage, cost: { ...this._state.totalUsage.cost } }`
    rather than the mutable state object directly. Why: `agent_end` already
    does `[...this._state.messages]` (a shallow array copy) for the same
    reason, so that listeners receive a stable value that won't change if the
    agent continues running. `Usage` is a two-level object (`cost` is a nested
    object literal), so the copy must be two levels deep. A full deep clone
    (`JSON.parse(JSON.stringify(...))`) was rejected as overkill for a small,
    purely numeric object; `structuredClone` was rejected as unnecessary
    verbosity for the same reason. Downstream SSE serialisation (which
    JSON-serialises the event anyway) would have made a live reference safe
    in practice, but the shallow-copy convention is consistent with the
    `messages` precedent and makes the event contract independent of the
    serialisation path.

18. **`useChat` resets `usage` at the start of `runStream`, not at the
    `send` / `sendMessages` / `respondWithDecision` call sites.** All three
    entry-points flow through `runStream`, so the reset is centralised there.
    This avoids three separate call-site edits and ensures the reset fires
    unconditionally for every new request, including decision-resume
    requests via `respondWithDecision`. Mirrors the agent-side rule from
    decision #6 (reset on each new request, not on `continue()`).

19. **Live provider eval suites are opt-in, `.env`-loaded, excluded from
    default `pnpm test` via `testPathIgnorePatterns`, and never run in CI.**
    Three suites land: `packages/openai/__tests__/openai.live.test.ts`,
    `packages/ollama/__tests__/ollama.live.test.ts` (extended with a new
    `Ollama live token-usage audit` block), and
    `packages/agent/__tests__/agent.live.test.ts`. Each suite is gated by
    `<NAMESPACE>_LIVE_SUITE=smoke|extended` (e.g. `OPENAI_LIVE_SUITE`); the
    `pnpm test:live:<provider>{,:smoke,:extended}` runners set
    `*_LIVE_READY=1`, which both un-ignores the file in Jest config and
    disables the `global.fetch = jest.fn()` mock in `openai/jest.setup.js`.
    A shared `tools/test/load-env.js` walks up to find a workspace `.env`
    and is silent if absent, so CI is unaffected. Why: empirical wire-shape
    verification is the only way to confirm load-bearing claims like
    "`completion_tokens` already includes `reasoning_tokens`", but live
    suites are expensive (real tokens) and require secrets, so they must
    stay out of the default loop. How to apply: when changing usage
    extraction, header construction, or any wire-shape detail, run the
    matching `pnpm test:live:*:extended` locally before merging. The
    `.gitignore` was updated to cover `.env` / `.env.local` to close a
    secrets-leak gap.

20. **Adapter-default `compat` must be the base layer of `createModel`'s
    merge, not the override layer.** The original spread order was
    `{ ...builtIn.compat, ...this.compat, ...overrides.compat }`, which
    silently clobbered model-specific settings (notably
    `maxTokensField: 'max_completion_tokens'` for reasoning-capable models)
    with the adapter's generic default (`'max_tokens'`). OpenAI returned
    400 (`Unsupported parameter: 'max_tokens'`) for `gpt-5.4-nano`. The
    mock-mode unit tests didn't catch it because the mocked `fetch` never
    validated the body. The live smoke test caught it on the very first
    real call. Why: model-specific knowledge in the built-in catalog is
    more authoritative than weak adapter defaults; user-provided overrides
    are most authoritative of all. How to apply: spread order is now
    `{ ...this.compat, ...builtIn.compat, ...overrides.compat }`, with the
    same rule for `headers`. The same precedence rule should be applied any
    time a new merge of compat-like fields is introduced.
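
Decisions 17 and 20 both come down to object-spread semantics, which the following sketch tries to make concrete. `snapshotUsage`, `mergeCompat`, and `mergeCompatOldOrder` are hypothetical helpers written for this example, and `Compat` is reduced to the single `maxTokensField` setting named above.

```typescript
interface Usage {
  input: number;
  output: number;
  totalTokens: number;
  cost: { input: number; output: number; total: number };
}

// Two-level copy: the outer spread copies the counters, the inner spread
// copies `cost`, so later mutation of the live object cannot leak through.
function snapshotUsage(live: Usage): Usage {
  return { ...live, cost: { ...live.cost } };
}

interface Compat {
  maxTokensField?: 'max_tokens' | 'max_completion_tokens';
}

// Later spreads win: adapter defaults are the base layer, the built-in
// catalog overrides them, and caller overrides beat both.
function mergeCompat(adapterDefaults: Compat, builtIn: Compat, overrides: Compat = {}): Compat {
  return { ...adapterDefaults, ...builtIn, ...overrides };
}

// The original order, kept only to show the clobbering.
function mergeCompatOldOrder(adapterDefaults: Compat, builtIn: Compat, overrides: Compat = {}): Compat {
  return { ...builtIn, ...adapterDefaults, ...overrides };
}

const liveUsage: Usage = {
  input: 10,
  output: 5,
  totalTokens: 15,
  cost: { input: 0.25, output: 0.5, total: 0.75 },
};
const emitted = snapshotUsage(liveUsage);
liveUsage.input += 100; // the agent keeps running after the event fired
liveUsage.cost.total += 1;
console.log(emitted.input, emitted.cost.total); // 10 0.75

const adapterDefaults: Compat = { maxTokensField: 'max_tokens' };
const catalogEntry: Compat = { maxTokensField: 'max_completion_tokens' };
console.log(mergeCompat(adapterDefaults, catalogEntry).maxTokensField); // max_completion_tokens
console.log(mergeCompatOldOrder(adapterDefaults, catalogEntry).maxTokensField); // max_tokens
```

A one-level spread (`{ ...live }`) would share the nested `cost` object with the live state, which is why the copy goes two levels deep.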

apps/tanstack-chat-demo/src/lib/use-chat.ts

Lines changed: 1 addition & 0 deletions
@@ -76,6 +76,7 @@ export function useChat() {
     usage: {
       input: 0,
       output: 0,
+      reasoning: 0,
       cacheRead: 0,
       cacheWrite: 0,
       totalTokens: 0,

package.json

Lines changed: 7 additions & 0 deletions
@@ -22,6 +22,12 @@
     "typecheck": "node ./scripts/typecheck.js",
     "test:live:ollama": "pnpm --filter @agentic-kit/ollama run test:live:smoke",
     "test:live:ollama:extended": "pnpm --filter @agentic-kit/ollama run test:live:extended",
+    "test:live:openai": "pnpm --filter @agentic-kit/openai run test:live:smoke",
+    "test:live:openai:smoke": "pnpm --filter @agentic-kit/openai run test:live:smoke",
+    "test:live:openai:extended": "pnpm --filter @agentic-kit/openai run test:live:extended",
+    "test:live:agent": "pnpm --filter @agentic-kit/agent run test:live:smoke",
+    "test:live:agent:smoke": "pnpm --filter @agentic-kit/agent run test:live:smoke",
+    "test:live:agent:extended": "pnpm --filter @agentic-kit/agent run test:live:extended",
     "lint": "pnpm -r run lint",
     "internal:deps": "makage update-workspace",
     "deps": "pnpm up -r -i -L"
@@ -32,6 +38,7 @@
     "@types/node": "^20.12.7",
     "@typescript-eslint/eslint-plugin": "^8.58.2",
     "@typescript-eslint/parser": "^8.58.2",
+    "dotenv": "^16.4.5",
     "eslint": "^9.39.2",
     "eslint-config-prettier": "^10.1.8",
     "eslint-plugin-simple-import-sort": "^12.1.0",
Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
import { OpenAIAdapter } from '@agentic-kit/openai';
import { createUserMessage, type AssistantMessage } from 'agentic-kit';

import { Agent } from '../src';

const modelId = process.env.OPENAI_LIVE_MODEL ?? 'gpt-5.4-nano';
const apiKey = process.env.OPENAI_API_KEY;

if (!apiKey) {
  throw new Error('Missing required env var: OPENAI_API_KEY');
}

// AGENT_LIVE_SUITE picks the tier: 'smoke' (the default) runs only the smoke
// block, 'extended' runs both, and any other value skips everything.
const liveSuite = process.env.AGENT_LIVE_SUITE ?? 'smoke';
const runSmoke = liveSuite === 'smoke' || liveSuite === 'extended';
const runExtended = liveSuite === 'extended';
const describeSmoke = runSmoke ? describe : describe.skip;
const describeExtended = runExtended ? describe : describe.skip;

describeSmoke('Agent live smoke', () => {
  jest.setTimeout(60_000);

  it('single turn populates state.totalUsage from the assistant message', async () => {
    const adapter = new OpenAIAdapter({ apiKey });
    const model = adapter.createModel(modelId);
    const agent = new Agent({ initialState: { model }, streamFn: adapter.stream.bind(adapter) });

    await agent.prompt('Reply with the single word PONG.');

    expect(agent.state.totalUsage.input).toBeGreaterThan(0);
    expect(agent.state.totalUsage.output).toBeGreaterThan(0);
    expect(agent.state.totalUsage.totalTokens).toBeGreaterThan(0);
    expect(agent.state.totalUsage.cost.total).toBeGreaterThan(0);

    const lastAssistant = agent.state.messages
      .filter((m): m is AssistantMessage => m.role === 'assistant')
      .at(-1)!;

    // Single turn: the per-message usage IS the cumulative total.
    expect(agent.state.totalUsage.input).toBe(lastAssistant.usage.input);
    expect(agent.state.totalUsage.output).toBe(lastAssistant.usage.output);
    expect(agent.state.totalUsage.reasoning).toBe(lastAssistant.usage.reasoning);
    expect(agent.state.totalUsage.cacheRead).toBe(lastAssistant.usage.cacheRead);
    expect(agent.state.totalUsage.cacheWrite).toBe(lastAssistant.usage.cacheWrite);
    expect(agent.state.totalUsage.totalTokens).toBe(lastAssistant.usage.totalTokens);
  });
});

describeExtended('Agent live extended', () => {
  jest.setTimeout(120_000);

  it('state.totalUsage equals field-wise sum across two turns', async () => {
    const adapter = new OpenAIAdapter({ apiKey });
    const model = adapter.createModel(modelId);
    const agent = new Agent({ initialState: { model }, streamFn: adapter.stream.bind(adapter) });

    await agent.prompt('What is 2 + 2? Reply with just the number.');

    const t1Usage = {
      ...agent.state.totalUsage,
      cost: { ...agent.state.totalUsage.cost },
    };

    // continue() does not accept text; append the follow-up user message first.
    agent.appendMessage(createUserMessage('Now what is that doubled? Reply with just the number.'));
    await agent.continue();

    const lastAssistant = agent.state.messages
      .filter((m): m is AssistantMessage => m.role === 'assistant')
      .at(-1)!;

    expect(agent.state.totalUsage.input).toBe(t1Usage.input + lastAssistant.usage.input);
    expect(agent.state.totalUsage.output).toBe(t1Usage.output + lastAssistant.usage.output);
    expect(agent.state.totalUsage.reasoning).toBe(t1Usage.reasoning + lastAssistant.usage.reasoning);
    expect(agent.state.totalUsage.cacheRead).toBe(t1Usage.cacheRead + lastAssistant.usage.cacheRead);
    expect(agent.state.totalUsage.cacheWrite).toBe(t1Usage.cacheWrite + lastAssistant.usage.cacheWrite);
    expect(agent.state.totalUsage.totalTokens).toBe(t1Usage.totalTokens + lastAssistant.usage.totalTokens);
    expect(agent.state.totalUsage.cost.input).toBeCloseTo(
      t1Usage.cost.input + lastAssistant.usage.cost.input,
      10
    );
    expect(agent.state.totalUsage.cost.output).toBeCloseTo(
      t1Usage.cost.output + lastAssistant.usage.cost.output,
      10
    );
    expect(agent.state.totalUsage.cost.total).toBeCloseTo(
      t1Usage.cost.total + lastAssistant.usage.cost.total,
      10
    );
  });

  it('prompt() resets totalUsage; continue() preserves it', async () => {
    const adapter = new OpenAIAdapter({ apiKey });
    const model = adapter.createModel(modelId);
    const agent = new Agent({ initialState: { model }, streamFn: adapter.stream.bind(adapter) });

    await agent.prompt('Reply with the single word A.');
    const firstTotals = { ...agent.state.totalUsage, cost: { ...agent.state.totalUsage.cost } };

    agent.appendMessage(createUserMessage('Reply with the single word B.'));
    await agent.continue();
    const secondTotals = { ...agent.state.totalUsage, cost: { ...agent.state.totalUsage.cost } };

    // continue() must not reset: totals should have grown.
    expect(secondTotals.input).toBeGreaterThanOrEqual(firstTotals.input);
    expect(secondTotals.totalTokens).toBeGreaterThanOrEqual(firstTotals.totalTokens);
    expect(agent.state.totalUsage.input).toBeGreaterThanOrEqual(firstTotals.input);

    await agent.prompt('Reply with the single word C.');

    const thirdAssistant = agent.state.messages
      .filter((m): m is AssistantMessage => m.role === 'assistant')
      .at(-1)!;

    // prompt() resets: the new total should be one turn's worth, not cumulative
    // across all three. We use < rather than === because token counts vary and
    // we cannot pin the exact value, only that it did not carry over the prior
    // two turns' worth of input tokens.
    expect(agent.state.totalUsage.input).toBeLessThan(secondTotals.input + 100);
    expect(agent.state.totalUsage.totalTokens).toBe(thirdAssistant.usage.totalTokens);
    expect(agent.state.totalUsage.input).toBe(thirdAssistant.usage.input);
    expect(agent.state.totalUsage.output).toBe(thirdAssistant.usage.output);
  });
});
