Add per-call usage to every assistant trace message #1381
tawnymanticore wants to merge 3 commits into
When a single ``call_model`` invocation runs an internal tool-use loop (model → tool → model → tool → final reply, with ``return_on_tool_call=False``), each iteration is a separate provider inference billed independently. Today only the LAST inference's usage shows up on the saved snapshot's ``task_run.usage`` — the inner-loop inferences are billed but invisible to trace consumers, so per-case token totals computed by walking the snapshot chain undercount the provider's actual bill (kintsugi observed a ~30-40% gap on driver runs).

Per-call ``latency_ms`` was already attached to every assistant message via ``message_latency``. This change mirrors that pattern for usage:

- ``ChatCompletionAssistantMessageParamWrapper`` gains an optional ``usage`` field, listed in ``KILN_ONLY_MESSAGE_FIELDS`` so it gets sanitized before being sent back to the provider.
- ``Usage`` is extracted to its own module (``kiln_ai.datamodel.usage``) so ``open_ai_types`` can import it without creating a cycle with ``task_run``. Re-exported from ``task_run`` for backwards compat.
- ``LiteLlmAdapter._run_model_turn`` and ``AdapterStream._stream_model_turn`` capture ``call_usage = usage_from_response(response)`` per inference and stamp it (with that call's ``latency_ms``) onto a ``message_usage: dict[int, Usage]`` keyed by the appended message's index — same shape as ``message_latency``.
- ``ModelTurnResult`` carries the dict; ``_run`` merges across turns; ``all_messages_to_trace`` threads it into ``litellm_message_to_trace_message``, which attaches ``usage`` to the emitted trace dict (sketched below).

Sums of per-call ``usage`` across the trace recover provider-true totals. Existing ``task_run.usage`` (turn-summed) and ``Usage.__add__`` are unchanged.

Tests:

- ``test_run_model_turn_records_per_call_usage_for_each_tool_loop_inference`` drives a 2-call tool loop with distinct usage shapes per call and asserts both per-call entries land in ``message_usage``.
- ``test_all_messages_to_trace_attaches_per_message_usage`` verifies the end-to-end flow onto the trace messages.
- ``test_litellm_message_to_trace_message_includes_usage`` / ``_no_usage`` cover the leaf attach.
- ``test_open_ai_types`` updated for the new wrapper field + ``KILN_ONLY_MESSAGE_FIELDS`` membership.

3847/3847 ``libs/core`` tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
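A hedged sketch of the attachment end of that flow. The function names come from the commit message; the signatures and bodies here are simplified assumptions, not the repo's actual code:

```python
from kiln_ai.datamodel.usage import Usage  # re-exported from task_run for back-compat


def all_messages_to_trace(
    messages: list[dict], message_usage: dict[int, Usage]
) -> list[dict]:
    # message_usage mirrors message_latency: keyed by the index of the
    # assistant message that each inner-loop inference appended.
    return [
        litellm_message_to_trace_message(message, usage=message_usage.get(index))
        for index, message in enumerate(messages)
    ]


def litellm_message_to_trace_message(message: dict, usage: Usage | None) -> dict:
    trace_message = dict(message)
    if usage is not None:
        # "usage" is listed in KILN_ONLY_MESSAGE_FIELDS, so it is stripped
        # again before any message is sent back to the provider.
        trace_message["usage"] = usage
    return trace_message
```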
The previous commit declared the field as ``Optional[Usage]``, which required eagerly importing ``Usage`` at module top. That triggered a circular import: ``open_ai_types`` would import from ``kiln_ai.datamodel.usage``, but loading the parent ``kiln_ai.datamodel`` package eagerly imports ``task_run``, which itself imports ``ChatCompletionMessageParam`` from this module. Cold imports through the cycle hit a "TaskRun is not fully defined" error during the Pydantic schema build because the forward-ref ``Usage`` couldn't resolve in this module's namespace.

Type the field as ``Optional[Any]`` instead. Runtime behavior is unchanged — the value is still a ``Usage`` instance (or its dict serialization on a deserialized trace) — and the docstring spells out the shape. This costs a level of OpenAPI/TS schema precision (the generated ``api_schema.d.ts`` now emits ``unknown | null`` instead of ``Usage | null`` for the field), but that's a strictly cosmetic loss: consumers that need typed access already pass through the Pydantic ``TaskRun`` model.

3847/3847 ``libs/core`` tests pass; cold import succeeds end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
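For reference, the cycle as described, laid out in import order (a restatement of the commit message above, not tool output):

```python
# open_ai_types
#   -> kiln_ai.datamodel.usage          (eager import, needed for Optional[Usage])
#   -> kiln_ai.datamodel (__init__.py)  (the parent package loads first)
#   -> task_run                         (eagerly imported by the package)
#   -> open_ai_types                    (for ChatCompletionMessageParam)  # cycle closes
```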
Walkthrough (CodeRabbit): Per-message token usage is recorded and threaded through adapters and traces.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
📊 Coverage Report: overall coverage 92% (diff: origin/main...HEAD)
Code Review
This pull request implements per-message token usage tracking to ensure accurate accounting during multi-inference tool loops. The `Usage` model was moved to a standalone module to resolve circular dependencies, and both the `LiteLlmAdapter` and `adapter_stream` were updated to capture and propagate usage data within assistant message traces. Review feedback recommends refactoring the duplicated usage-processing logic into a shared helper method and improving type safety by using a forward reference for the `usage` field instead of `Any`, to sharpen the generated OpenAPI schema.
```python
# count the usage (both summed for the turn-level total and
# captured per-message so the trace can show inner-loop calls)
call_usage = self.usage_from_response(model_response)
usage += call_usage
usage.total_llm_latency_ms = (
    usage.total_llm_latency_ms or 0
) + call_latency_ms
```
Done in 2341488 — extracted into `record_per_call_usage_and_latency` in `kiln_ai.datamodel.usage` (a stateless free function, so the existing `mock_adapter` fixture keeps working without modification). Both `_run_model_turn` and `_stream_model_turn` now call it.
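A plausible shape for that helper, reconstructed from the duplicated block quoted above. The actual signature in 2341488 may differ, and this assumes `call_usage` arrives with no latency stamped yet:

```python
from kiln_ai.datamodel.usage import Usage


def record_per_call_usage_and_latency(
    usage: Usage,
    message_usage: dict[int, Usage],
    message_index: int,
    call_usage: Usage,
    call_latency_ms: int,
) -> Usage:
    # Stamp this inference's latency onto its own usage record so the trace
    # can show the inner-loop call...
    call_usage.total_llm_latency_ms = call_latency_ms
    message_usage[message_index] = call_usage
    # ...then fold it into the turn-level total. Usage.__add__ sums the
    # optional fields (including total_llm_latency_ms), matching the old
    # inline block's arithmetic.
    return usage + call_usage
```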
```python
latency_ms: Optional[int]
"""Time spent waiting on this specific LLM API call in milliseconds."""

usage: Optional[Any]
```
While using `Any` avoids the circular import as noted in the docstring, it results in a less-than-ideal `unknown` type in the generated OpenAPI schema for the frontend. This reduces type safety on the client side.

You can provide a concrete type hint and still avoid the import cycle by using a forward reference (string literal). This will allow the type to be resolved correctly by type checkers and the OpenAPI generator.

To do this, you would also need to add a `TYPE_CHECKING` block at the top of the file:
```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from kiln_ai.datamodel.usage import Usage
```

Suggested change:

```diff
-usage: Optional[Any]
+usage: Optional["Usage"]
```
Tried `Optional["Usage"]` per the suggestion, but Pydantic resolves the forward ref against `open_ai_types`'s globals during the `TaskRun` schema build — and at that point the import cycle (`open_ai_types` → `kiln_ai.datamodel.usage` → `kiln_ai.datamodel.__init__` → `task_run` → `open_ai_types` again) hasn't completed, so it hits "TaskRun is not fully defined". `defer_build=True` would work but ripples to other models.

Landed a different fix in 2341488: `Annotated[Optional[Any], WithJsonSchema({...})]`. The Python type stays `Any` (no cycle) while the OpenAPI generator emits `$ref: Usage`. The regenerated `api_schema.d.ts` now reads `usage?: components["schemas"]["Usage"] | null` again — the same as a concrete annotation would have produced.
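A minimal, runnable sketch of that pattern. The model class here is a hypothetical stand-in (the real field lives on `ChatCompletionAssistantMessageParamWrapper`), and the exact `$ref` path depends on the schema generator (Pydantic emits `#/$defs/...` natively; FastAPI rewrites refs under `components/schemas`):

```python
from typing import Annotated, Any, Optional

from pydantic import BaseModel, WithJsonSchema


class TraceMessage(BaseModel):  # hypothetical stand-in for the wrapper type
    # The Python type stays Any (no Usage import, so no cycle), while the
    # emitted JSON schema is pinned to a Usage reference for codegen.
    usage: Annotated[
        Optional[Any],
        WithJsonSchema({"anyOf": [{"$ref": "#/$defs/Usage"}, {"type": "null"}]}),
    ] = None


# The override replaces the field's generated schema with the pinned ref:
print(TraceMessage.model_json_schema()["properties"]["usage"])
```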
🧹 Nitpick comments (2)
libs/core/kiln_ai/datamodel/usage.py (2)
**46-55**: 💤 Low value. Consider adding `__radd__` to support `sum()` without an explicit start.

`sum([u1, u2])` tries `0 + u1` first (i.e., `int.__add__(0, u1)`), which returns `NotImplemented`, then falls back to `u1.__radd__(0)` — which doesn't exist, raising `TypeError`. The current code always uses explicit `+=` in loops, so this isn't a bug today, but it's a footgun for callers who reach for `sum()`.

```python
def __radd__(self, other: "Usage | int") -> "Usage":
    if other == 0:  # identity for sum()
        return self
    return self.__add__(other)  # type: ignore[arg-type]
```

🤖 Prompt for AI Agents

> Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/core/kiln_ai/datamodel/usage.py` around lines 46-55: the `Usage` class currently implements `__add__` but not `__radd__`, so using `sum([u1, u2])` fails when Python tries `0 + u1`; add a `__radd__` method on `Usage` that treats `other == 0` as the additive identity returning `self`, and otherwise delegates to `__add__` (e.g., call `self.__add__(other)` or raise a `TypeError` consistently), so `sum()` works without an explicit start while preserving the existing None-handling in `__add__`.
**57-73**: 💤 Low value. `_add_optional_int` and `_add_optional_float` are identical — consolidate into one helper.

Both inner functions share the same body; only the annotations differ (and Python ignores those at runtime). A single `_add_optional`, typed with a `TypeVar` or just the union, covers both cases:

♻️ Proposed refactor

```diff
-    def _add_optional_int(a: int | None, b: int | None) -> int | None:
-        if a is None and b is None:
-            return None
-        if a is None:
-            return b
-        if b is None:
-            return a
-        return a + b
-
-    def _add_optional_float(a: float | None, b: float | None) -> float | None:
-        if a is None and b is None:
-            return None
-        if a is None:
-            return b
-        if b is None:
-            return a
-        return a + b
+    from typing import TypeVar
+
+    _N = TypeVar("_N", int, float)
+
+    def _add_optional(a: _N | None, b: _N | None) -> _N | None:
+        if a is None and b is None:
+            return None
+        return (a or 0) + (b or 0)  # type: ignore[operator]

     return Usage(
-        input_tokens=_add_optional_int(self.input_tokens, other.input_tokens),
-        output_tokens=_add_optional_int(self.output_tokens, other.output_tokens),
-        total_tokens=_add_optional_int(self.total_tokens, other.total_tokens),
-        cost=_add_optional_float(self.cost, other.cost),
-        cached_tokens=_add_optional_int(self.cached_tokens, other.cached_tokens),
-        total_llm_latency_ms=_add_optional_int(
-            self.total_llm_latency_ms, other.total_llm_latency_ms
-        ),
+        input_tokens=_add_optional(self.input_tokens, other.input_tokens),
+        output_tokens=_add_optional(self.output_tokens, other.output_tokens),
+        total_tokens=_add_optional(self.total_tokens, other.total_tokens),
+        cost=_add_optional(self.cost, other.cost),
+        cached_tokens=_add_optional(self.cached_tokens, other.cached_tokens),
+        total_llm_latency_ms=_add_optional(
+            self.total_llm_latency_ms, other.total_llm_latency_ms
+        ),
     )
```

🤖 Prompt for AI Agents

> Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@libs/core/kiln_ai/datamodel/usage.py` around lines 57-73: consolidate the duplicate functions `_add_optional_int` and `_add_optional_float` into a single helper (e.g., `_add_optional`) that handles optional numeric values; implement it using a `TypeVar` bound to numbers or simply a `Union[int, float]` signature, preserve the same logic body, and replace calls to `_add_optional_int`/`_add_optional_float` with the new `_add_optional` so callers in this module continue to work with both ints and floats.
📒 Files selected for processing (8)

- app/web_ui/src/lib/api_schema.d.ts
- libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
- libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
- libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
- libs/core/kiln_ai/datamodel/task_run.py
- libs/core/kiln_ai/datamodel/usage.py
- libs/core/kiln_ai/utils/open_ai_types.py
- libs/core/kiln_ai/utils/test_open_ai_types.py
Two review comments addressed:

1. **Extract shared helper.** ``record_per_call_usage_and_latency`` in ``kiln_ai.datamodel.usage`` now owns the "aggregate per-call usage onto the turn total + stamp per-message dicts" logic. Both ``_run_model_turn`` (litellm_adapter.py) and ``_stream_model_turn`` (adapter_stream.py) call it, removing the duplicated 6-line block. It is a stateless free function (no ``self`` needed), so the existing ``mock_adapter`` fixture in test_adapter_stream.py keeps working without extending it.

2. **Concrete OpenAPI typing.** Tried ``Optional["Usage"]`` forward-ref per the suggestion, but Pydantic's schema build for ``TaskRun.trace`` eagerly resolves the forward ref against ``open_ai_types``'s globals during ``TaskRun`` class definition — at which point the cycle isn't yet broken (we're still inside ``open_ai_types``'s import) and we hit "TaskRun is not fully defined". ``defer_build=True`` would fix it but ripples to many models. Instead: ``Annotated[Optional[Any], WithJsonSchema(...)]`` keeps the Python type as ``Any`` (no cycle) while pinning the OpenAPI / TS schema to ``$ref: Usage``. The generated ``api_schema.d.ts`` now reads ``usage?: components["schemas"]["Usage"] | null`` again, identical to what a direct ``Optional[Usage]`` annotation would emit.

3847/3847 ``libs/core`` tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both nitpicks from CodeRabbit (…)
## Why

When a single `call_model` runs an internal tool-use loop (model → tool → model → tool → final reply), each iteration is a separate billed inference, but the saved snapshot's `task_run.usage` only carries the last inference's tokens. Inner-loop inferences are billed but invisible to anything reading the trace, so consumers that sum across snapshots under-count tokens by ~20–50% per case.
## What

Per-message `latency_ms` was already attached to every assistant turn via a `message_latency` dict in the inference loop. This change mirrors that pattern for `usage`: capture `usage_from_response(...)` per call, store it under `message_usage[len(messages) - 1]`, and thread it through `ModelTurnResult` → `_run` → `all_messages_to_trace` → `litellm_message_to_trace_message`. Sanitized via `KILN_ONLY_MESSAGE_FIELDS` before being sent back to providers.

`Usage` is extracted to its own module (`kiln_ai.datamodel.usage`) so `open_ai_types` can reference it without an import cycle. Re-exported from `task_run` for back-compat.

## Example (real driven case, same trace)
Comparing summed `task_run.usage` (8 snapshots × last-inference each) against summed per-message `usage` on the leaf trace (all 12 inferences): 12 provider-billed inferences fit into 8 saved snapshots; 4 inner-loop calls were invisible before.
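A consumer-side recovery might look like the sketch below, assuming (per the description) that each trace message carries `usage` as either a `Usage` instance or its dict serialization, and that `Usage`'s fields default to `None`:

```python
from kiln_ai.datamodel.usage import Usage


def provider_true_total(trace: list[dict]) -> Usage:
    # Sum per-call usage across every trace message that carries one.
    # Unlike task_run.usage (last inference per snapshot), this counts
    # the inner-loop inferences too.
    total = Usage()
    for message in trace:
        u = message.get("usage")
        if u is not None:
            total = total + (Usage(**u) if isinstance(u, dict) else u)
    return total
```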
## Tests

3847/3847 `libs/core` tests pass. New coverage:

- `test_run_model_turn_records_per_call_usage_for_each_tool_loop_inference` — distinct usage shapes across a 2-call tool loop both land in `message_usage`.
- `test_all_messages_to_trace_attaches_per_message_usage` — end-to-end onto trace dicts.
- `test_litellm_message_to_trace_message_includes_usage` / `_no_usage` — leaf attach.
- `test_open_ai_types` — wrapper field + `KILN_ONLY_MESSAGE_FIELDS` membership.