feat: include usage in each message #1379
Conversation
…g (Phase 1)

Phase 1 of multiturn turn-level usage tracking. Pure refactor and field additions; no behavior change yet.

- Move `Usage` from `task_run.py` into new `libs/core/kiln_ai/datamodel/usage.py`, re-exported from `task_run` for backward compatibility.
- Add `Usage.from_trace` static helper for summing per-message usage (see the sketch below).
- Add `cumulative_usage` field to `TaskRun` (defaults to `None`).
- Add `usage` field to `ChatCompletionAssistantMessageParamWrapper` and include it in `KILN_ONLY_MESSAGE_FIELDS` so it is stripped before sending messages to providers.
- New unit tests for `Usage.from_trace`, datamodel round-trip, and per-message usage sanitization.
- Spec docs and Phase 1 phase plan checked in under `specs/projects/multiturn_turn_usage/`.
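A minimal sketch of what that summing helper might look like, assuming plain-dict trace messages and the usage field names visible in this PR's example trace (the repo's actual signature may differ; `_add_optional_int` does appear in the coverage excerpt further down):

```python
from pydantic import BaseModel


def _add_optional_int(a: int | None, b: int | None) -> int | None:
    # Both sides None means "no data" and stays None; otherwise None counts as 0.
    if a is None and b is None:
        return None
    return (a or 0) + (b or 0)


class Usage(BaseModel):
    input_tokens: int | None = None
    output_tokens: int | None = None
    total_tokens: int | None = None
    cost: float | None = None
    cached_tokens: int | None = None

    @staticmethod
    def from_trace(trace: list[dict]) -> "Usage":
        # Sum per-message usage; messages without usage (user/tool turns, or
        # messages hand-injected into a prior_trace) contribute nothing.
        total = Usage()
        for message in trace:
            usage = message.get("usage") or {}
            total.input_tokens = _add_optional_int(total.input_tokens, usage.get("input_tokens"))
            total.output_tokens = _add_optional_int(total.output_tokens, usage.get("output_tokens"))
            total.total_tokens = _add_optional_int(total.total_tokens, usage.get("total_tokens"))
            total.cached_tokens = _add_optional_int(total.cached_tokens, usage.get("cached_tokens"))
            if usage.get("cost") is not None:
                total.cost = (total.cost or 0.0) + usage["cost"]
        return total
```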
Wire per-message usage capture through the non-streaming LiteLLM adapter so each assistant trace message carries its own `Usage`, and compute `TaskRun.cumulative_usage` in `BaseAdapter.generate_run` by summing turn usages on top of any seeded `TaskRun` usage (a sketch of that aggregation follows this list).

- `LiteLlmAdapter`: thread usage through `_run_model_turn`, `_run`, `litellm_message_to_trace_message`, `all_messages_to_trace`, and `ModelTurnResult` so per-turn `Usage` is attached to assistant messages.
- `BaseAdapter.generate_run`: aggregate per-message usage into `cumulative_usage`, respecting fresh vs seeded `TaskRun` behavior.
- Tests: `Usage.from_trace`, message sanitization, non-streaming round-trip, fresh vs seeded `TaskRun` cumulative usage.
- `api_schema.d.ts`: regenerated to expose `cumulative_usage`.
- Phase 2 plan added; `implementation_plan.md` checkbox marked complete.
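Roughly how the fresh-vs-seeded aggregation could work, building on the `Usage` sketch above (`compute_cumulative_usage` is a hypothetical name, not the repo's):

```python
def compute_cumulative_usage(seeded: Usage | None, trace: list[dict]) -> Usage:
    # Fresh TaskRun: cumulative usage is just the sum over this run's trace.
    # Seeded TaskRun (e.g. a resumed conversation): stack the new turn usage
    # on top of whatever usage the run was seeded with.
    turns = Usage.from_trace(trace)
    if seeded is None:
        return turns
    return Usage(
        input_tokens=_add_optional_int(seeded.input_tokens, turns.input_tokens),
        output_tokens=_add_optional_int(seeded.output_tokens, turns.output_tokens),
        total_tokens=_add_optional_int(seeded.total_tokens, turns.total_tokens),
        cached_tokens=_add_optional_int(seeded.cached_tokens, turns.cached_tokens),
        cost=None
        if seeded.cost is None and turns.cost is None
        else (seeded.cost or 0.0) + (turns.cost or 0.0),
    )
```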
Mirror Phase 2's non-streaming usage tracking in the streaming
orchestrator (AdapterStream) so streaming runs persist the same
per-message usage data as non-streaming runs.
- AdapterStream gains _message_usage: dict[int, Usage] state, populated
alongside _message_latency in _stream_model_turn after each LLM call.
- __aiter__ passes _message_usage through to all_messages_to_trace at
finalization, so each assistant message in RunOutput.trace carries
per-call usage. cumulative_usage on TaskRun is then populated
automatically by Usage.from_trace in BaseAdapter.generate_run.
- StreamingCompletion now forces stream_options={"include_usage": True}
so LiteLLM's final assembled ModelResponse includes token counts and
cost; caller-provided stream_options are merged without clobbering (sketched below).
- Tests cover single-call, tool-call loop, tool-call interruption,
empty-usage cases, and stream_options merging behavior.
Marks Phase 3 complete in implementation_plan.md and flips
phase_plans/phase_3.md status from draft to complete.
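A sketch of the stream_options merge described in that commit note. The function name and call site are assumptions; the behavior (preserve caller keys, force `include_usage`) is what the note states:

```python
def merged_stream_options(caller_options: dict | None) -> dict:
    # Keep whatever the caller passed, but force include_usage so LiteLLM's
    # final assembled ModelResponse carries token counts and cost.
    options = dict(caller_options or {})
    options["include_usage"] = True
    return options


# Caller keys survive; only include_usage is overridden (illustrative key):
assert merged_stream_options({"some_caller_key": 1}) == {
    "some_caller_key": 1,
    "include_usage": True,
}
```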
Walkthrough

This PR implements per-message `MessageUsage`, splits `Usage` into `MessageUsage` plus a `Usage` subclass that adds latency, threads per-message usage through the non-streaming and streaming adapters (forcing LiteLLM streaming `include_usage`), attaches per-message usage to trace messages, and computes `TaskRun.cumulative_usage` by summing per-message usage across full traces; schemas, sanitization, tests, and docs are updated.

Changes: Multiturn Turn-Level Usage Tracking
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Code Review
This pull request implements turn-level usage tracking for multiturn conversations by capturing per-LLM-call token usage and cost on assistant messages. It introduces a `cumulative_usage` field on `TaskRun` representing the sum of usage across the entire trace. Key changes include moving the `Usage` model to its own module, updating adapters to track per-message usage, and adding a `from_trace` helper. Feedback suggests that `Usage.from_trace` should also aggregate `latency_ms` from the trace into `total_llm_latency_ms` to ensure consistency with the `TaskRun.usage` field.
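If that suggestion were adopted, the latency roll-up could be as small as the sketch below; it mirrors the per-message `latency_ms` field visible in the example trace further down, and follows the same "no data stays None" convention as the token fields:

```python
def total_llm_latency_ms(trace: list[dict]) -> int | None:
    # Sum latency_ms over messages that report it; None if no message does.
    latencies = [m["latency_ms"] for m in trace if m.get("latency_ms") is not None]
    return sum(latencies) if latencies else None
```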
📊 Coverage Report

Overall Coverage: 92% | Diff: origin/main...HEAD

Line-by-line diff coverage, `libs/core/kiln_ai/datamodel/usage.py`, lines 2-10 (`!` marks an uncovered line):

```
  2
  3  from pydantic import BaseModel, Field
  4
  5  if TYPE_CHECKING:
! 6      from kiln_ai.utils.open_ai_types import ChatCompletionMessageParam
  7
  8
  9  def _add_optional_int(a: int | None, b: int | None) -> int | None:
 10      if a is None and b is None:
```
🧹 Nitpick comments (1)

libs/core/kiln_ai/adapters/model_adapters/test_adapter_stream.py (1)

Lines 569-746: 💤 Low value. The new `TestAdapterStreamPerMessageUsage` tests look correct and well-structured. Coverage spans the four key streaming scenarios: single-call usage capture, per-turn isolation across tool-call loops, usage preservation under `return_on_tool_call` interruption, and robustness against empty `Usage()` returns. The use of `pytest.approx` at line 674 for the multi-call cost sum is the right approach for float equality.

One consistency note: line 598 guards `call_args.args[2]` with `if len(call_args.args) >= 3 else None`, while lines 664, 712, and 742 access `.args[2]` directly. All four should use the same pattern; if the implementation ever moves to keyword-arg passing for `message_usage`, the guarded form gives a clearer failure than `IndexError`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/core/kiln_ai/adapters/model_adapters/test_adapter_stream.py` around lines 569 - 746, Tests access mock_adapter.all_messages_to_trace.call_args.args[2] inconsistently; make them all use the guarded extraction pattern to avoid IndexError if the implementation switches to keyword args. Update the occurrences in TestAdapterStreamPerMessageUsage (tests test_per_message_usage_distinct_per_tool_call_loop, test_per_message_usage_on_tool_call_interruption, test_per_message_usage_handles_empty_usage) to mirror the earlier pattern (inspect call_args = mock_adapter.all_messages_to_trace.call_args and set message_usage_arg = call_args.args[2] if len(call_args.args) >= 3 else None), then assert message_usage_arg is not None before using it.
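For reference, the guarded pattern the nitpick asks for, in a self-contained form (`mock_adapter` here is a stand-in for the fixture in `test_adapter_stream.py`, and the call arguments are placeholders):

```python
from unittest.mock import MagicMock

mock_adapter = MagicMock()
mock_adapter.all_messages_to_trace("messages", "latencies", {"0": "usage"})

# Guarded extraction: a clear assertion failure instead of an IndexError if
# the third positional argument is ever missing.
call_args = mock_adapter.all_messages_to_trace.call_args
message_usage_arg = call_args.args[2] if len(call_args.args) >= 3 else None
assert message_usage_arg is not None
```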
📒 Files selected for processing (22)
app/web_ui/src/lib/api_schema.d.ts
libs/core/kiln_ai/adapters/litellm_utils/litellm_streaming.py
libs/core/kiln_ai/adapters/litellm_utils/test_litellm_streaming.py
libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
libs/core/kiln_ai/adapters/model_adapters/base_adapter.py
libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
libs/core/kiln_ai/adapters/model_adapters/test_adapter_stream.py
libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
libs/core/kiln_ai/adapters/model_adapters/test_saving_adapter_results.py
libs/core/kiln_ai/datamodel/task_run.py
libs/core/kiln_ai/datamodel/test_example_models.py
libs/core/kiln_ai/datamodel/test_usage.py
libs/core/kiln_ai/datamodel/usage.py
libs/core/kiln_ai/utils/open_ai_types.py
libs/core/kiln_ai/utils/test_open_ai_types.py
specs/projects/multiturn_turn_usage/architecture.md
specs/projects/multiturn_turn_usage/functional_spec.md
specs/projects/multiturn_turn_usage/implementation_plan.md
specs/projects/multiturn_turn_usage/phase_plans/phase_1.md
specs/projects/multiturn_turn_usage/phase_plans/phase_2.md
specs/projects/multiturn_turn_usage/phase_plans/phase_3.md
specs/projects/multiturn_turn_usage/project_overview.md
…s (Phase 4)

Introduces `MessageUsage` with the five aggregatable fields and reshapes `Usage` as a subclass that adds `total_llm_latency_ms`. Per-message usage and `TaskRun.cumulative_usage` are re-typed to `MessageUsage` so the aggregated latency field is dropped from sums where it has no meaning. `usage_from_response` now returns `MessageUsage`. Tests cover the new add semantics and loading legacy JSON that still carries `total_llm_latency_ms`. (A sketch of the split follows.)
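A sketch of that split. Only the field names and the subclass relationship come from the summary above; the Pydantic models and the None-preserving add semantics are assumptions:

```python
from pydantic import BaseModel


class MessageUsage(BaseModel):
    # The five aggregatable fields: summing these across trace messages is
    # well-defined, so per-message usage and cumulative_usage use this type.
    input_tokens: int | None = None
    output_tokens: int | None = None
    total_tokens: int | None = None
    cost: float | None = None
    cached_tokens: int | None = None

    def __add__(self, other: "MessageUsage") -> "MessageUsage":
        def add(a, b):
            # None + None stays None ("no data"); otherwise None counts as 0.
            return None if a is None and b is None else (a or 0) + (b or 0)

        return MessageUsage(
            input_tokens=add(self.input_tokens, other.input_tokens),
            output_tokens=add(self.output_tokens, other.output_tokens),
            total_tokens=add(self.total_tokens, other.total_tokens),
            cost=add(self.cost, other.cost),
            cached_tokens=add(self.cached_tokens, other.cached_tokens),
        )


class Usage(MessageUsage):
    # Latency is meaningful for a whole TaskRun but not as an elementwise sum
    # over messages, so it lives only on the subclass and drops out of sums.
    total_llm_latency_ms: int | None = None
```

Keeping `total_llm_latency_ms` off `MessageUsage` means a naive elementwise sum can never produce a misleading aggregated latency, which matches the example trace below: `cumulative_usage` carries no `total_llm_latency_ms` field.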
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@libs/core/kiln_ai/adapters/model_adapters/test_multiturn_usage_paid.py`:
- Around line 559-595: Remove the brittle conversation-shape thresholds by
deleting the two assertions that check pending_rounds_total >= 4 and
plain_user_messages >= 1; keep the existing per-message sanity checks (the
assert on len(chain) >= 1 and assert not chain[-1].is_toolcall_pending) and let
the later trace/chain assertions validate resume/path behavior using full_chain
and per_message_chain_lens instead of enforcing global counts (i.e., remove the
blocks referencing pending_rounds_total and plain_user_messages).
📒 Files selected for processing (1)
libs/core/kiln_ai/adapters/model_adapters/test_multiturn_usage_paid.py
…ix-usage-cost-summing-in-chat-multi-turn-conversations
tawnymanticore
left a comment
Did a quick diff against my WIP branch which does the same thing, all looks solid. Claude said yours is better than mine XD
What does this PR do?
Adds usage info to each message for multiturn task runs, plus a cumulative usage total.
Adds new usage info blocks into the `TaskRun` to support the multiturn case:

- `trace[x].usage` -> contains the `usage` info for the LLM call that produced this output (noting that in multiturn, not every message is necessarily produced by an LLM; it could be manually injected into the `prior_trace`)
- `cumulative_usage` -> the total usage info across all the turns included in the trace; this is the same as `sum(map(task_run.trace => (message) => message.usage))`
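A quick check of those sum semantics against the example below: the ten assistant messages report `input_tokens` of 5429 + 5612 + 5674 + 5756 + 8573 + 8941 + 9746 + 9988 + 11587 + 12834 = 84140 and `output_tokens` of 263 + 90 + 120 + 77 + 125 + 50 + 348 + 830 + 231 + 449 = 2583, matching `cumulative_usage` exactly (with 84140 + 2583 = 86723 `total_tokens`); `cost` stays `null` because no per-message cost was reported.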
Produces a `TaskRun` -> `trace` like this:

```json
{
  "id": "8221839524_[uuid-redacted]",
  "task_run": {
    "v": 1,
    "id": "8221839524_[uuid-redacted]",
    "path": null,
    "created_at": "2026-05-07T21:27:55.900923+08:00",
    "created_by": "xxx",
    "input": "xxx",
    "input_source": { "type": "human", "properties": { "created_by": "xxx" }, "run_config": null },
    "output": { "rating": null, "model_type": "task_output" },
    "repair_instructions": null,
    "repaired_output": null,
    "intermediate_outputs": { "reasoning": "xxx" },
    "tags": [],
    "usage": { "input_tokens": 12834, "output_tokens": 449, "total_tokens": 13283, "cost": null, "cached_tokens": 11586, "total_llm_latency_ms": 17008 },
    "cumulative_usage": { "input_tokens": 84140, "output_tokens": 2583, "total_tokens": 86723, "cost": null, "cached_tokens": 61552 },
    "trace": [
      { "content": "xxx", "role": "system" },
      { "content": "xxx", "role": "user" },
      { "role": "assistant", "content": "xxx", "reasoning_content": "xxx", "latency_ms": 11733, "usage": { "input_tokens": 5429, "output_tokens": 263, "total_tokens": 5692, "cost": null, "cached_tokens": 0 } },
      { "content": "xxx", "role": "user" },
      { "role": "assistant", "content": "xxx", "reasoning_content": "xxx", "latency_ms": 4251, "usage": { "input_tokens": 5612, "output_tokens": 90, "total_tokens": 5702, "cost": null, "cached_tokens": 5428 } },
      { "content": "xxx", "role": "user" },
      { "role": "assistant", "content": "xxx", "reasoning_content": "xxx", "latency_ms": 7653, "usage": { "input_tokens": 5674, "output_tokens": 120, "total_tokens": 5794, "cost": null, "cached_tokens": 5611 } },
      { "content": "xxx", "role": "user" },
      { "role": "assistant", "content": null, "reasoning_content": "xxx", "tool_calls": [ { "id": "tool-call-001-redacted", "function": { "arguments": "xxx", "name": "some_tool" }, "type": "function" } ], "latency_ms": 2726, "usage": { "input_tokens": 5756, "output_tokens": 77, "total_tokens": 5833, "cost": null, "cached_tokens": 5673 } },
      { "content": "xxx", "role": "tool", "tool_call_id": "tool-call-001-redacted" },
      { "role": "assistant", "content": null, "reasoning_content": "xxx", "tool_calls": [ { "id": "tool-call-002-redacted", "function": { "arguments": "xxx", "name": "some_tool" }, "type": "function" } ], "latency_ms": 5750, "usage": { "input_tokens": 8573, "output_tokens": 125, "total_tokens": 8698, "cost": null, "cached_tokens": 5755 } },
      { "content": "xxx", "role": "tool", "tool_call_id": "tool-call-002-redacted" },
      { "role": "assistant", "content": null, "reasoning_content": "xxx", "tool_calls": [ { "id": "tool-call-003-redacted", "function": { "arguments": "xxx", "name": "some_tool" }, "type": "function" } ], "latency_ms": 2284, "usage": { "input_tokens": 8941, "output_tokens": 50, "total_tokens": 8991, "cost": null, "cached_tokens": 8572 } },
      { "content": "xxx", "role": "tool", "tool_call_id": "tool-call-003-redacted" },
      { "role": "assistant", "content": "xxx", "reasoning_content": "xxx", "latency_ms": 8166, "usage": { "input_tokens": 9746, "output_tokens": 348, "total_tokens": 10094, "cost": null, "cached_tokens": 8940 } },
      { "content": "xxx", "role": "user" },
      { "role": "assistant", "content": "xxx", "reasoning_content": "xxx", "tool_calls": [ { "id": "tool-call-004-redacted", "function": { "arguments": "xxx", "name": "some_tool" }, "type": "function" }, { "id": "tool-call-005-redacted", "function": { "arguments": "xxx", "name": "some_tool" }, "type": "function" } ], "latency_ms": 30962, "usage": { "input_tokens": 9988, "output_tokens": 830, "total_tokens": 10818, "cost": null, "cached_tokens": 0 } },
      { "content": "xxx", "role": "tool", "tool_call_id": "tool-call-004-redacted" },
      { "content": "xxx", "role": "tool", "tool_call_id": "tool-call-005-redacted" },
      { "role": "assistant", "content": "xxx", "reasoning_content": "xxx", "tool_calls": [ { "id": "tool-call-006-redacted", "function": { "arguments": "xxx", "name": "some_tool" }, "type": "function" }, { "id": "tool-call-007-redacted", "function": { "arguments": "xxx", "name": "some_tool" }, "type": "function" }, { "id": "tool-call-008-redacted", "function": { "arguments": "xxx", "name": "some_tool" }, "type": "function" } ], "latency_ms": 5243, "usage": { "input_tokens": 11587, "output_tokens": 231, "total_tokens": 11818, "cost": null, "cached_tokens": 9987 } },
      { "content": "xxx", "role": "tool", "tool_call_id": "tool-call-006-redacted" },
      { "content": "xxx", "role": "tool", "tool_call_id": "tool-call-007-redacted" },
      { "content": "xxx", "role": "tool", "tool_call_id": "tool-call-008-redacted" },
      { "role": "assistant", "content": "xxx", "reasoning_content": "xxx", "latency_ms": 17008, "usage": { "input_tokens": 12834, "output_tokens": 449, "total_tokens": 13283, "cost": null, "cached_tokens": 11586 } }
    ],
    "parent_task_run_id": null,
    "model_type": "task_run"
  }
}
```

Checklists