
Add per-call usage to every assistant trace message #1381

Closed

tawnymanticore wants to merge 3 commits into main from mike/per-message-usage-on-trace


Conversation

@tawnymanticore
Collaborator

Why

When a single call_model runs an internal tool-use loop (model → tool → model → tool → final reply), each iteration is a separate billed inference, but the saved snapshot's task_run.usage only carries the last inference's tokens. Inner-loop inferences are billed but invisible to anything reading the trace, so consumers that sum across snapshots under-count tokens by ~20–50% per case.

What

Per-message latency_ms was already attached to every assistant turn via a message_latency dict in the inference loop. This change mirrors that pattern for usage: capture usage_from_response(...) per call, store it under message_usage[len(messages) - 1], and thread it through ModelTurnResult → _run → all_messages_to_trace → litellm_message_to_trace_message. The new field is listed in KILN_ONLY_MESSAGE_FIELDS, so it is sanitized away before messages are sent back to providers.

Usage extracted to its own module (kiln_ai.datamodel.usage) so open_ai_types can reference it without an import cycle. Re-exported from task_run for back-compat.
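A minimal, runnable sketch of the capture pattern (the Usage stand-in and fake_inference stub below are illustrative, not Kiln code; only the message_usage[len(messages) - 1] keying comes from this PR):

```python
from dataclasses import dataclass


@dataclass
class Usage:  # stand-in for kiln_ai.datamodel.usage.Usage
    input_tokens: int = 0
    output_tokens: int = 0


def fake_inference(messages: list[dict]) -> tuple[dict, Usage]:
    """Stand-in for one billed provider call: returns (assistant reply, usage)."""
    return {"role": "assistant", "content": "ok"}, Usage(12, 3)


messages: list[dict] = [{"role": "user", "content": "hi"}]
message_usage: dict[int, Usage] = {}  # same shape as message_latency

for i in range(2):  # a 2-iteration tool loop: model -> tool -> model
    reply, call_usage = fake_inference(messages)
    messages.append(reply)
    message_usage[len(messages) - 1] = call_usage  # key = appended index
    if i == 0:
        messages.append({"role": "tool", "content": "tool result"})

assert list(message_usage) == [1, 3]  # both billed calls are recorded
```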

Example (real driver case, same trace)

| Method | Tokens |
| --- | --- |
| Σ chain-summed task_run.usage (8 snapshots × last-inference each) | 140,590 |
| Σ per-message usage on leaf trace (all 12 inferences) | 181,239 |

12 provider-billed inferences fit into 8 saved snapshots; 4 inner-loop calls were invisible before.
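For a trace consumer, recovering the provider-true total then reduces to summing the per-message entries. A hedged sketch (the helper name and the per-call split of the token figures are illustrative; the trace shape, assistant dicts with an optional usage key, follows the description above):

```python
def sum_trace_tokens(trace: list[dict]) -> int:
    """Sum total_tokens across every billed inference recorded on a trace."""
    total = 0
    for message in trace:
        usage = message.get("usage")  # only assistant messages carry one
        if usage:
            total += usage.get("total_tokens") or 0
    return total


# Illustrative split of the 181,239-token example across two calls.
trace = [
    {"role": "user", "content": "question"},
    {"role": "assistant", "usage": {"total_tokens": 90_000}},  # inner-loop call
    {"role": "tool", "content": "tool result"},
    {"role": "assistant", "usage": {"total_tokens": 91_239}},  # final reply
]
assert sum_trace_tokens(trace) == 181_239
```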

Tests

3847/3847 libs/core tests pass. New coverage:

  • test_run_model_turn_records_per_call_usage_for_each_tool_loop_inference — distinct usage shapes across a 2-call tool loop both land in message_usage.
  • test_all_messages_to_trace_attaches_per_message_usage — end-to-end onto trace dicts.
  • test_litellm_message_to_trace_message_includes_usage / _no_usage — leaf attach.
  • test_open_ai_types — wrapper field + KILN_ONLY_MESSAGE_FIELDS membership.

tawnymanticore and others added 2 commits May 7, 2026 16:35
When a single ``call_model`` invocation runs an internal tool-use loop
(model → tool → model → tool → final reply, with
``return_on_tool_call=False``), each iteration is a separate provider
inference billed independently. Today only the LAST inference's usage
shows up on the saved snapshot's ``task_run.usage`` — the inner-loop
inferences are billed but invisible to trace consumers, so per-case
token totals computed by walking the snapshot chain undercount the
provider's actual bill (kintsugi observed ~30-40% gap on driver runs).

Per-call ``latency_ms`` was already attached to every assistant message
via ``message_latency``. This change mirrors that pattern for usage:

- ``ChatCompletionAssistantMessageParamWrapper`` gains an optional
  ``usage`` field, listed in ``KILN_ONLY_MESSAGE_FIELDS`` so it gets
  sanitized before being sent back to the provider.
- ``Usage`` is extracted to its own module
  (``kiln_ai.datamodel.usage``) so ``open_ai_types`` can import it
  without creating a cycle with ``task_run``. Re-exported from
  ``task_run`` for backwards compat.
- ``LiteLlmAdapter._run_model_turn`` and ``AdapterStream._stream_model_turn``
  capture ``call_usage = usage_from_response(response)`` per inference
  and stamp it (with that call's ``latency_ms``) onto a
  ``message_usage: dict[int, Usage]`` keyed by the appended message's
  index — same shape as ``message_latency``.
- ``ModelTurnResult`` carries the dict; ``_run`` merges across turns;
  ``all_messages_to_trace`` threads it into
  ``litellm_message_to_trace_message`` which attaches ``usage`` to the
  emitted trace dict.

Sums of per-call ``usage`` across the trace recover provider-true
totals. Existing ``task_run.usage`` (turn-summed) and ``Usage.__add__``
are unchanged.

Tests:

- ``test_run_model_turn_records_per_call_usage_for_each_tool_loop_inference``
  drives a 2-call tool loop with distinct usage shapes per call and
  asserts both per-call entries land in ``message_usage``.
- ``test_all_messages_to_trace_attaches_per_message_usage`` verifies
  end-to-end flow onto the trace messages.
- ``test_litellm_message_to_trace_message_includes_usage`` /
  ``_no_usage`` cover the leaf attach.
- ``test_open_ai_types`` updated for the new wrapper field +
  ``KILN_ONLY_MESSAGE_FIELDS`` membership.

3847/3847 ``libs/core`` tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit declared the field as ``Optional[Usage]``, which
required eagerly importing ``Usage`` at module top. That triggered a
circular import: ``open_ai_types`` would import from
``kiln_ai.datamodel.usage``, but loading the parent
``kiln_ai.datamodel`` package eagerly imports ``task_run``, which
itself imports ``ChatCompletionMessageParam`` from this module. Cold
imports through the cycle hit a "TaskRun is not fully defined" error
during Pydantic schema build because the forward-ref ``Usage`` couldn't
resolve in this module's namespace.

Type the field as ``Optional[Any]`` instead. Runtime behavior is
unchanged — the value is still a ``Usage`` instance (or its dict
serialization on a deserialized trace) — and the docstring spells out
the shape. Costs a level of OpenAPI/TS schema precision (the generated
``api_schema.d.ts`` now emits ``unknown | null`` instead of
``Usage | null`` for the field), but that's a strictly cosmetic loss:
consumers that need typed access already pass through the Pydantic
``TaskRun`` model.

3847/3847 ``libs/core`` tests pass; cold import succeeds end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 01c6e709-a57d-469b-a1e4-8fc47bbfbf36

📥 Commits

Reviewing files that changed from the base of the PR and between 8b715f7 and 2341488.

📒 Files selected for processing (5)
  • app/web_ui/src/lib/api_schema.d.ts
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
  • libs/core/kiln_ai/datamodel/usage.py
  • libs/core/kiln_ai/utils/open_ai_types.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
  • libs/core/kiln_ai/datamodel/usage.py
  • app/web_ui/src/lib/api_schema.d.ts
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py

Walkthrough

Per-message token usage is recorded and threaded through adapters and traces: new Usage model and helper, usage added to assistant message wrappers and OpenAPI schema, adapters and streaming code stamp per-call usage into per-message maps, and tests updated.

Changes

Per-Message Usage Tracking Through LLM Adapters

| Layer / File(s) | Summary |
| --- | --- |
| Data Models and Contracts: libs/core/kiln_ai/datamodel/usage.py, libs/core/kiln_ai/datamodel/task_run.py | Adds standalone Usage model and record_per_call_usage_and_latency; TaskRun imports/re-exports Usage. |
| OpenAI Message Types and API Schema: libs/core/kiln_ai/utils/open_ai_types.py, app/web_ui/src/lib/api_schema.d.ts | ChatCompletionAssistantMessageParamWrapper adds optional usage; KILN_ONLY_MESSAGE_FIELDS includes usage; generated OpenAPI/TS schema adds usage to input/output wrappers. |
| LiteLLM Adapter Core: libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py | ModelTurnResult gains message_usage map; per-call Usage captured and stored at message index; outer run loop accumulates message_usage and threads it into trace builders via all_messages_to_trace. |
| Streaming Adapter Integration: libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py | AdapterStream adds _message_usage dict, records per-call usage/latency via helper during streaming turns, and passes per-message usage into trace generation. |
| Trace Message Conversion: libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py | litellm_message_to_trace_message accepts optional usage and includes it in assistant trace messages; all_messages_to_trace threads per-index usage through conversion. |
| Tests and Verification: libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py, libs/core/kiln_ai/utils/test_open_ai_types.py | Unit tests for conversion with/without usage; async tests validate per-message usage accumulation and propagation; compatibility tests updated for sanitization and wrapper properties. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • Kiln-AI/Kiln#1340: Related adapter and trace plumbing that propagates per-message metadata through adapters and trace conversion.
  • Kiln-AI/Kiln#509: Prior changes to messaging/trace pipeline and OpenAI-typed trace messages touching the same paths.
  • Kiln-AI/Kiln#1361: Earlier work introducing Kiln-only message fields and sanitization logic that this change extends with usage.

Suggested reviewers

  • chiang-daniel
  • leonardmq
  • scosman

Poem

🐰 I hopped through token streams so bright,

Per-message usage captured in light,
Latency and cost beside each call,
Traces now show them, one and all.
A rabbit cheers — metrics delight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding per-call usage tracking to every assistant trace message during model inference, which is the core objective of this PR. |
| Description check | ✅ Passed | The description comprehensively covers the motivation (under-counting inner-loop inference tokens), the solution (per-message usage tracking), implementation details, and test coverage. All required template sections are present and filled out. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions

github-actions Bot commented May 7, 2026

📊 Coverage Report

Overall Coverage: 92%

Diff: origin/main...HEAD

  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py (100%)
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (100%)
  • libs/core/kiln_ai/datamodel/task_run.py (100%)
  • libs/core/kiln_ai/datamodel/usage.py (100%)
  • libs/core/kiln_ai/utils/open_ai_types.py (100%)

Summary

  • Total: 59 lines
  • Missing: 0 lines
  • Coverage: 100%

Contributor

@gemini-code-assist Bot left a comment


Code Review

This pull request implements per-message token usage tracking to ensure accurate accounting during multi-inference tool loops. The Usage model was moved to a standalone module to resolve circular dependencies, and both the LiteLlmAdapter and adapter_stream were updated to capture and propagate usage data within assistant message traces. Review feedback recommends refactoring duplicated usage processing logic into a shared helper method and improving type safety by using forward references for the usage field instead of Any to enhance the generated OpenAPI schema.

Comment on lines 159 to 165
```python
# count the usage (both summed for the turn-level total and
# captured per-message so the trace can show inner-loop calls)
call_usage = self.usage_from_response(model_response)
usage += call_usage
usage.total_llm_latency_ms = (
    usage.total_llm_latency_ms or 0
) + call_latency_ms
```
Contributor


medium

This logic for processing usage and latency (lines 159-165 and 177-184) is nearly identical to the logic in _stream_model_turn in adapter_stream.py. To improve maintainability and reduce code duplication, consider extracting this logic into a shared helper method on LiteLlmAdapter.

Collaborator Author


Done in 2341488 — extracted into record_per_call_usage_and_latency in kiln_ai.datamodel.usage (stateless free function so the existing mock_adapter fixture keeps working without modification). Both _run_model_turn and _stream_model_turn now call it.
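A hedged sketch of what such a stateless helper might look like (the thread only names the function; the signature, the Usage fields shown, and the return-the-new-total design are assumptions):

```python
from typing import Optional

from pydantic import BaseModel


class Usage(BaseModel):
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    total_llm_latency_ms: Optional[int] = None

    def __add__(self, other: "Usage") -> "Usage":
        def add(a: Optional[int], b: Optional[int]) -> Optional[int]:
            return None if a is None and b is None else (a or 0) + (b or 0)

        return Usage(
            input_tokens=add(self.input_tokens, other.input_tokens),
            output_tokens=add(self.output_tokens, other.output_tokens),
            total_llm_latency_ms=add(
                self.total_llm_latency_ms, other.total_llm_latency_ms
            ),
        )


def record_per_call_usage_and_latency(
    turn_usage: Usage,
    call_usage: Usage,
    call_latency_ms: int,
    message_usage: dict[int, Usage],
    message_index: int,
) -> Usage:
    """Stamp one inference's usage (plus its latency) onto the per-message
    map and return the updated turn-level total."""
    stamped = call_usage.model_copy(update={"total_llm_latency_ms": call_latency_ms})
    message_usage[message_index] = stamped
    return turn_usage + stamped
```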

```python
latency_ms: Optional[int]
"""Time spent waiting on this specific LLM API call in milliseconds."""

usage: Optional[Any]
```
Contributor


medium

While using Any avoids the circular import as noted in the docstring, it results in a less-than-ideal unknown type in the generated OpenAPI schema for the frontend. This reduces type safety on the client side.

You can provide a concrete type hint and still avoid the import cycle by using a forward reference (string literal). This will allow the type to be resolved correctly by type checkers and the OpenAPI generator.

To do this, you would also need to add a TYPE_CHECKING block at the top of the file:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from kiln_ai.datamodel.usage import Usage
```

Suggested change:

```diff
-usage: Optional[Any]
+usage: Optional["Usage"]
```

Collaborator Author


Tried Optional["Usage"] per the suggestion, but Pydantic resolves the forward ref against open_ai_typess globals during TaskRun schema build — and at that point the import cycle (open_ai_typeskiln_ai.datamodel.usagekiln_ai.datamodel.__init__task_runopen_ai_types again) hasn't completed, so it hits "TaskRun is not fully defined". defer_build=True would work but ripples to other models.

Landed a different fix in 2341488: Annotated[Optional[Any], WithJsonSchema({...})]. Python type stays Any (no cycle) while the OpenAPI generator emits $ref: Usage. Regenerated api_schema.d.ts now reads usage?: components["schemas"]["Usage"] | null again — same as a concrete annotation would have produced.
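A minimal sketch of that Annotated pattern (the wrapper class name and exact JSON Schema payload are illustrative; note FastAPI typically rewrites Pydantic's #/$defs/... refs to #/components/schemas/... when generating the OpenAPI document):

```python
from typing import Annotated, Any, Optional

from pydantic import TypeAdapter, WithJsonSchema
from typing_extensions import TypedDict


class AssistantMessageWrapper(TypedDict, total=False):
    """Illustrative stand-in for ChatCompletionAssistantMessageParamWrapper."""

    # Python type stays Any (no import of Usage at module load, so no cycle);
    # the schema generator emits the pinned schema instead of "unknown".
    usage: Annotated[
        Optional[Any],
        WithJsonSchema({"anyOf": [{"$ref": "#/$defs/Usage"}, {"type": "null"}]}),
    ]


print(TypeAdapter(AssistantMessageWrapper).json_schema()["properties"]["usage"])
# -> {'anyOf': [{'$ref': '#/$defs/Usage'}, {'type': 'null'}]}
```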

Contributor

@coderabbitai Bot left a comment


🧹 Nitpick comments (2)
libs/core/kiln_ai/datamodel/usage.py (2)

46-55: 💤 Low value

Consider adding __radd__ to support sum() without an explicit start.

sum([u1, u2]) tries 0 + u1 first (i.e., int.__add__(u1)), which returns NotImplemented, then falls back to u1.__radd__(0) — which doesn't exist, raising TypeError. The current code always uses explicit += in loops, so this isn't a bug today, but it's a footgun for callers who reach for sum().

```python
def __radd__(self, other: "Usage | int") -> "Usage":
    if other == 0:  # identity for sum()
        return self
    return self.__add__(other)  # type: ignore[arg-type]
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/datamodel/usage.py` around lines 46 - 55, The Usage class
currently implements __add__ but not __radd__, so using sum([u1, u2]) fails when
Python tries 0 + u1; add a __radd__ method on Usage that treats other==0 as the
additive identity returning self, and otherwise delegates to __add__ (e.g., call
self.__add__(other) or raise a TypeError consistently), so sum() works without
an explicit start while preserving the existing None-handling in __add__.

57-73: 💤 Low value

_add_optional_int and _add_optional_float are identical — consolidate into one helper.

Both inner functions share the same body; only the annotations differ (and Python ignores those at runtime). A single _add_optional that's typed with a TypeVar or just uses the union covers both cases:

♻️ Proposed refactor
```diff
-        def _add_optional_int(a: int | None, b: int | None) -> int | None:
-            if a is None and b is None:
-                return None
-            if a is None:
-                return b
-            if b is None:
-                return a
-            return a + b
-
-        def _add_optional_float(a: float | None, b: float | None) -> float | None:
-            if a is None and b is None:
-                return None
-            if a is None:
-                return b
-            if b is None:
-                return a
-            return a + b
+        from typing import TypeVar
+        _N = TypeVar("_N", int, float)
+
+        def _add_optional(a: _N | None, b: _N | None) -> _N | None:
+            if a is None and b is None:
+                return None
+            return (a or 0) + (b or 0)  # type: ignore[operator]

         return Usage(
-            input_tokens=_add_optional_int(self.input_tokens, other.input_tokens),
-            output_tokens=_add_optional_int(self.output_tokens, other.output_tokens),
-            total_tokens=_add_optional_int(self.total_tokens, other.total_tokens),
-            cost=_add_optional_float(self.cost, other.cost),
-            cached_tokens=_add_optional_int(self.cached_tokens, other.cached_tokens),
-            total_llm_latency_ms=_add_optional_int(
-                self.total_llm_latency_ms, other.total_llm_latency_ms
-            ),
+            input_tokens=_add_optional(self.input_tokens, other.input_tokens),
+            output_tokens=_add_optional(self.output_tokens, other.output_tokens),
+            total_tokens=_add_optional(self.total_tokens, other.total_tokens),
+            cost=_add_optional(self.cost, other.cost),
+            cached_tokens=_add_optional(self.cached_tokens, other.cached_tokens),
+            total_llm_latency_ms=_add_optional(
+                self.total_llm_latency_ms, other.total_llm_latency_ms
+            ),
         )
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/datamodel/usage.py` around lines 57 - 73, Consolidate the
duplicate functions _add_optional_int and _add_optional_float into a single
helper (e.g., _add_optional) that handles optional numeric values; implement it
using a TypeVar bound to numbers or simply use Union[int, float] ->
Optional[...] for the signature, preserve the same logic body, and
replace/internalize calls to _add_optional_int/_add_optional_float to call the
new _add_optional so callers like any usage in this module continue to work with
both ints and floats.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4976f561-b2d1-44bb-bfcf-67cd804c9609

📥 Commits

Reviewing files that changed from the base of the PR and between d4bc730 and 8b715f7.

📒 Files selected for processing (8)
  • app/web_ui/src/lib/api_schema.d.ts
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
  • libs/core/kiln_ai/datamodel/task_run.py
  • libs/core/kiln_ai/datamodel/usage.py
  • libs/core/kiln_ai/utils/open_ai_types.py
  • libs/core/kiln_ai/utils/test_open_ai_types.py

Two review comments addressed:

1. **Extract shared helper.** ``record_per_call_usage_and_latency`` in
   ``kiln_ai.datamodel.usage`` now owns the "aggregate per-call usage
   onto the turn total + stamp per-message dicts" logic. Both
   ``_run_model_turn`` (litellm_adapter.py) and ``_stream_model_turn``
   (adapter_stream.py) call it, removing the duplicated 6-line block.
   Stateless free function (no ``self`` needed) so the existing
   ``mock_adapter`` fixture in test_adapter_stream.py keeps working
   without extending it.

2. **Concrete OpenAPI typing.** Tried ``Optional["Usage"]`` forward-ref
   per the suggestion, but Pydantic's schema build for ``TaskRun.trace``
   eagerly resolves the forward ref against ``open_ai_types``'s globals
   during ``TaskRun`` class definition — at which point the cycle isn't
   yet broken (we're still inside ``open_ai_types``'s import) and we
   hit "TaskRun is not fully defined". ``defer_build=True`` would fix
   it but ripples to many models.

   Instead: ``Annotated[Optional[Any], WithJsonSchema(...)]`` keeps the
   Python type as ``Any`` (no cycle) while pinning the OpenAPI / TS
   schema to ``$ref: Usage``. Generated ``api_schema.d.ts`` now reads
   ``usage?: components["schemas"]["Usage"] | null`` again, identical
   to what a direct ``Optional[Usage]`` annotation would emit.

3847/3847 ``libs/core`` tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tawnymanticore
Collaborator Author

Both nitpicks from CodeRabbit (__radd__ for sum(), and consolidating _add_optional_int / _add_optional_float) are on Usage.__add__ code that I moved unchanged from task_run.py — they pre-date this PR. CodeRabbit also flagged them as "💤 Low value" itself. Leaving them for a follow-up cleanup PR to keep this diff focused on the per-message-usage fix.
