
Add per-call usage to every assistant trace message #1381

Closed

tawnymanticore wants to merge 3 commits into main from mike/per-message-usage-on-trace


Conversation

@tawnymanticore
Collaborator

Why

When a single call_model runs an internal tool-use loop (model → tool → model → tool → final reply), each iteration is a separate billed inference, but the saved snapshot's task_run.usage only carries the last inference's tokens. Inner-loop inferences are billed but invisible to anything reading the trace, so consumers that sum across snapshots under-count tokens by ~20–50% per case.

What

Per-message latency_ms was already attached to every assistant turn via a message_latency dict in the inference loop. This change mirrors that pattern for usage: capture usage_from_response(...) per call, store it under message_usage[len(messages) - 1], and thread it through ModelTurnResult → _run → all_messages_to_trace → litellm_message_to_trace_message. The new field is listed in KILN_ONLY_MESSAGE_FIELDS, so it is sanitized away before messages are sent back to providers.

Usage extracted to its own module (kiln_ai.datamodel.usage) so open_ai_types can reference it without an import cycle. Re-exported from task_run for back-compat.
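A minimal, runnable sketch of the capture pattern (the Usage stand-in and fake_inference stub below are illustrative, not Kiln code; only the message_usage[len(messages) - 1] keying comes from this PR):

```python
from dataclasses import dataclass


@dataclass
class Usage:  # stand-in for kiln_ai.datamodel.usage.Usage
    input_tokens: int = 0
    output_tokens: int = 0


def fake_inference(messages: list[dict]) -> tuple[dict, Usage]:
    """Stand-in for one billed provider call: returns (assistant reply, usage)."""
    return {"role": "assistant", "content": "ok"}, Usage(12, 3)


messages: list[dict] = [{"role": "user", "content": "hi"}]
message_usage: dict[int, Usage] = {}  # same shape as message_latency

for i in range(2):  # a 2-iteration tool loop: model -> tool -> model
    reply, call_usage = fake_inference(messages)
    messages.append(reply)
    message_usage[len(messages) - 1] = call_usage  # key = appended index
    if i == 0:
        messages.append({"role": "tool", "content": "tool result"})

assert list(message_usage) == [1, 3]  # both billed calls are recorded
```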

Example (real driver case, same trace)

| Method | Tokens |
| --- | --- |
| Σ chain-summed task_run.usage (8 snapshots × last-inference each) | 140,590 |
| Σ per-message usage on leaf trace (all 12 inferences) | 181,239 |

12 provider-billed inferences fit into 8 saved snapshots; 4 inner-loop calls were invisible before.
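For a trace consumer, recovering the provider-true total then reduces to summing the per-message entries. A hedged sketch (the helper name and the per-call split of the token figures are illustrative; the trace shape, assistant dicts with an optional usage key, follows the description above):

```python
def sum_trace_tokens(trace: list[dict]) -> int:
    """Sum total_tokens across every billed inference recorded on a trace."""
    total = 0
    for message in trace:
        usage = message.get("usage")  # only assistant messages carry one
        if usage:
            total += usage.get("total_tokens") or 0
    return total


# Illustrative split of the 181,239-token example across two calls.
trace = [
    {"role": "user", "content": "question"},
    {"role": "assistant", "usage": {"total_tokens": 90_000}},  # inner-loop call
    {"role": "tool", "content": "tool result"},
    {"role": "assistant", "usage": {"total_tokens": 91_239}},  # final reply
]
assert sum_trace_tokens(trace) == 181_239
```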

Tests

3847/3847 libs/core tests pass. New coverage:

  • test_run_model_turn_records_per_call_usage_for_each_tool_loop_inference — distinct usage shapes across a 2-call tool loop both land in message_usage.
  • test_all_messages_to_trace_attaches_per_message_usage — end-to-end onto trace dicts.
  • test_litellm_message_to_trace_message_includes_usage / _no_usage — leaf attach.
  • test_open_ai_types — wrapper field + KILN_ONLY_MESSAGE_FIELDS membership.

tawnymanticore and others added 2 commits May 7, 2026 16:35
When a single ``call_model`` invocation runs an internal tool-use loop
(model → tool → model → tool → final reply, with
``return_on_tool_call=False``), each iteration is a separate provider
inference billed independently. Today only the LAST inference's usage
shows up on the saved snapshot's ``task_run.usage`` — the inner-loop
inferences are billed but invisible to trace consumers, so per-case
token totals computed by walking the snapshot chain undercount the
provider's actual bill (kintsugi observed ~30-40% gap on driver runs).

Per-call ``latency_ms`` was already attached to every assistant message
via ``message_latency``. This change mirrors that pattern for usage:

- ``ChatCompletionAssistantMessageParamWrapper`` gains an optional
  ``usage`` field, listed in ``KILN_ONLY_MESSAGE_FIELDS`` so it gets
  sanitized before being sent back to the provider.
- ``Usage`` is extracted to its own module
  (``kiln_ai.datamodel.usage``) so ``open_ai_types`` can import it
  without creating a cycle with ``task_run``. Re-exported from
  ``task_run`` for backwards compat.
- ``LiteLlmAdapter._run_model_turn`` and ``AdapterStream._stream_model_turn``
  capture ``call_usage = usage_from_response(response)`` per inference
  and stamp it (with that call's ``latency_ms``) onto a
  ``message_usage: dict[int, Usage]`` keyed by the appended message's
  index — same shape as ``message_latency``.
- ``ModelTurnResult`` carries the dict; ``_run`` merges across turns;
  ``all_messages_to_trace`` threads it into
  ``litellm_message_to_trace_message`` which attaches ``usage`` to the
  emitted trace dict.

Sums of per-call ``usage`` across the trace recover provider-true
totals. Existing ``task_run.usage`` (turn-summed) and ``Usage.__add__``
are unchanged.

Tests:

- ``test_run_model_turn_records_per_call_usage_for_each_tool_loop_inference``
  drives a 2-call tool loop with distinct usage shapes per call and
  asserts both per-call entries land in ``message_usage``.
- ``test_all_messages_to_trace_attaches_per_message_usage`` verifies
  end-to-end flow onto the trace messages.
- ``test_litellm_message_to_trace_message_includes_usage`` /
  ``_no_usage`` cover the leaf attach.
- ``test_open_ai_types`` updated for the new wrapper field +
  ``KILN_ONLY_MESSAGE_FIELDS`` membership.

3847/3847 ``libs/core`` tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit declared the field as ``Optional[Usage]``, which
required eagerly importing ``Usage`` at module top. That triggered a
circular import: ``open_ai_types`` would import from
``kiln_ai.datamodel.usage``, but loading the parent
``kiln_ai.datamodel`` package eagerly imports ``task_run``, which
itself imports ``ChatCompletionMessageParam`` from this module. Cold
imports through the cycle hit a "TaskRun is not fully defined" error
during Pydantic schema build because the forward-ref ``Usage`` couldn't
resolve in this module's namespace.

Type the field as ``Optional[Any]`` instead. Runtime behavior is
unchanged — the value is still a ``Usage`` instance (or its dict
serialization on a deserialized trace) — and the docstring spells out
the shape. Costs a level of OpenAPI/TS schema precision (the generated
``api_schema.d.ts`` now emits ``unknown | null`` instead of
``Usage | null`` for the field), but that's a strictly cosmetic loss:
consumers that need typed access already pass through the Pydantic
``TaskRun`` model.

3847/3847 ``libs/core`` tests pass; cold import succeeds end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 01c6e709-a57d-469b-a1e4-8fc47bbfbf36

📥 Commits

Reviewing files that changed from the base of the PR and between 8b715f7 and 2341488.

📒 Files selected for processing (5)
  • app/web_ui/src/lib/api_schema.d.ts
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
  • libs/core/kiln_ai/datamodel/usage.py
  • libs/core/kiln_ai/utils/open_ai_types.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
  • libs/core/kiln_ai/datamodel/usage.py
  • app/web_ui/src/lib/api_schema.d.ts
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py

Walkthrough

Per-message token usage is recorded and threaded through adapters and traces: new Usage model and helper, usage added to assistant message wrappers and OpenAPI schema, adapters and streaming code stamp per-call usage into per-message maps, and tests updated.

Changes

Per-Message Usage Tracking Through LLM Adapters

| Layer / File(s) | Summary |
| --- | --- |
| Data Models and Contracts: libs/core/kiln_ai/datamodel/usage.py, libs/core/kiln_ai/datamodel/task_run.py | Adds standalone Usage model and record_per_call_usage_and_latency; TaskRun imports/re-exports Usage. |
| OpenAI Message Types and API Schema: libs/core/kiln_ai/utils/open_ai_types.py, app/web_ui/src/lib/api_schema.d.ts | ChatCompletionAssistantMessageParamWrapper adds optional usage; KILN_ONLY_MESSAGE_FIELDS includes usage; generated OpenAPI/TS schema adds usage to input/output wrappers. |
| LiteLLM Adapter Core: libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py | ModelTurnResult gains message_usage map; per-call Usage captured and stored at message index; outer run loop accumulates message_usage and threads it into trace builders via all_messages_to_trace. |
| Streaming Adapter Integration: libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py | AdapterStream adds _message_usage dict, records per-call usage/latency via helper during streaming turns, and passes per-message usage into trace generation. |
| Trace Message Conversion: libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py | litellm_message_to_trace_message accepts optional usage and includes it in assistant trace messages; all_messages_to_trace threads per-index usage through conversion. |
| Tests and Verification: libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py, libs/core/kiln_ai/utils/test_open_ai_types.py | Unit tests for conversion with/without usage; async tests validate per-message usage accumulation and propagation; compatibility tests updated for sanitization and wrapper properties. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

  • Kiln-AI/Kiln#1340: Related adapter and trace plumbing that propagates per-message metadata through adapters and trace conversion.
  • Kiln-AI/Kiln#509: Prior changes to messaging/trace pipeline and OpenAI-typed trace messages touching the same paths.
  • Kiln-AI/Kiln#1361: Earlier work introducing Kiln-only message fields and sanitization logic that this change extends with usage.

Suggested reviewers

  • chiang-daniel
  • leonardmq
  • scosman

Poem

🐰 I hopped through token streams so bright,

Per-message usage captured in light,
Latency and cost beside each call,
Traces now show them, one and all.
A rabbit cheers — metrics delight!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding per-call usage tracking to every assistant trace message during model inference, which is the core objective of this PR. |
| Description check | ✅ Passed | The description comprehensively covers the motivation (under-counting inner-loop inference tokens), the solution (per-message usage tracking), implementation details, and test coverage. All required template sections are present and filled out. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@github-actions

github-actions Bot commented May 7, 2026

📊 Coverage Report

Overall Coverage: 92%

Diff: origin/main...HEAD

  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py (100%)
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (100%)
  • libs/core/kiln_ai/datamodel/task_run.py (100%)
  • libs/core/kiln_ai/datamodel/usage.py (100%)
  • libs/core/kiln_ai/utils/open_ai_types.py (100%)

Summary

  • Total: 59 lines
  • Missing: 0 lines
  • Coverage: 100%

Contributor

@gemini-code-assist Bot left a comment


Code Review

This pull request implements per-message token usage tracking to ensure accurate accounting during multi-inference tool loops. The Usage model was moved to a standalone module to resolve circular dependencies, and both the LiteLlmAdapter and adapter_stream were updated to capture and propagate usage data within assistant message traces. Review feedback recommends refactoring duplicated usage processing logic into a shared helper method and improving type safety by using forward references for the usage field instead of Any to enhance the generated OpenAPI schema.

Comment on lines 159 to 165
```python
# count the usage (both summed for the turn-level total and
# captured per-message so the trace can show inner-loop calls)
call_usage = self.usage_from_response(model_response)
usage += call_usage
usage.total_llm_latency_ms = (
    usage.total_llm_latency_ms or 0
) + call_latency_ms
```
Contributor


medium

This logic for processing usage and latency (lines 159-165 and 177-184) is nearly identical to the logic in _stream_model_turn in adapter_stream.py. To improve maintainability and reduce code duplication, consider extracting this logic into a shared helper method on LiteLlmAdapter.

Collaborator Author


Done in 2341488 — extracted into record_per_call_usage_and_latency in kiln_ai.datamodel.usage (stateless free function so the existing mock_adapter fixture keeps working without modification). Both _run_model_turn and _stream_model_turn now call it.
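A hedged sketch of what such a stateless helper might look like (the thread only names the function; the signature, the Usage fields shown, and the return-the-new-total design are assumptions):

```python
from typing import Optional

from pydantic import BaseModel


class Usage(BaseModel):
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    total_llm_latency_ms: Optional[int] = None

    def __add__(self, other: "Usage") -> "Usage":
        def add(a: Optional[int], b: Optional[int]) -> Optional[int]:
            return None if a is None and b is None else (a or 0) + (b or 0)

        return Usage(
            input_tokens=add(self.input_tokens, other.input_tokens),
            output_tokens=add(self.output_tokens, other.output_tokens),
            total_llm_latency_ms=add(
                self.total_llm_latency_ms, other.total_llm_latency_ms
            ),
        )


def record_per_call_usage_and_latency(
    turn_usage: Usage,
    call_usage: Usage,
    call_latency_ms: int,
    message_usage: dict[int, Usage],
    message_index: int,
) -> Usage:
    """Stamp one inference's usage (plus its latency) onto the per-message
    map and return the updated turn-level total."""
    stamped = call_usage.model_copy(update={"total_llm_latency_ms": call_latency_ms})
    message_usage[message_index] = stamped
    return turn_usage + stamped
```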

```python
latency_ms: Optional[int]
"""Time spent waiting on this specific LLM API call in milliseconds."""

usage: Optional[Any]
```
Contributor


medium

While using Any avoids the circular import as noted in the docstring, it results in a less-than-ideal unknown type in the generated OpenAPI schema for the frontend. This reduces type safety on the client side.

You can provide a concrete type hint and still avoid the import cycle by using a forward reference (string literal). This will allow the type to be resolved correctly by type checkers and the OpenAPI generator.

To do this, you would also need to add a TYPE_CHECKING block at the top of the file:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from kiln_ai.datamodel.usage import Usage
```

Suggested change:

```diff
-usage: Optional[Any]
+usage: Optional["Usage"]
```

Collaborator Author


Tried Optional["Usage"] per the suggestion, but Pydantic resolves the forward ref against open_ai_typess globals during TaskRun schema build — and at that point the import cycle (open_ai_typeskiln_ai.datamodel.usagekiln_ai.datamodel.__init__task_runopen_ai_types again) hasn't completed, so it hits "TaskRun is not fully defined". defer_build=True would work but ripples to other models.

Landed a different fix in 2341488: Annotated[Optional[Any], WithJsonSchema({...})]. Python type stays Any (no cycle) while the OpenAPI generator emits $ref: Usage. Regenerated api_schema.d.ts now reads usage?: components["schemas"]["Usage"] | null again — same as a concrete annotation would have produced.
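A minimal sketch of that Annotated pattern (the wrapper class name and exact JSON Schema payload are illustrative; note FastAPI typically rewrites Pydantic's #/$defs/... refs to #/components/schemas/... when generating the OpenAPI document):

```python
from typing import Annotated, Any, Optional

from pydantic import TypeAdapter, WithJsonSchema
from typing_extensions import TypedDict


class AssistantMessageWrapper(TypedDict, total=False):
    """Illustrative stand-in for ChatCompletionAssistantMessageParamWrapper."""

    # Python type stays Any (no import of Usage at module load, so no cycle);
    # the schema generator emits the pinned schema instead of "unknown".
    usage: Annotated[
        Optional[Any],
        WithJsonSchema({"anyOf": [{"$ref": "#/$defs/Usage"}, {"type": "null"}]}),
    ]


print(TypeAdapter(AssistantMessageWrapper).json_schema()["properties"]["usage"])
# -> {'anyOf': [{'$ref': '#/$defs/Usage'}, {'type': 'null'}]}
```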

Contributor

@coderabbitai Bot left a comment


🧹 Nitpick comments (2)
libs/core/kiln_ai/datamodel/usage.py (2)

46-55: 💤 Low value

Consider adding __radd__ to support sum() without an explicit start.

sum([u1, u2]) tries 0 + u1 first (i.e., int.__add__(u1)), which returns NotImplemented, then falls back to u1.__radd__(0) — which doesn't exist, raising TypeError. The current code always uses explicit += in loops, so this isn't a bug today, but it's a footgun for callers who reach for sum().

```python
def __radd__(self, other: "Usage | int") -> "Usage":
    if other == 0:  # identity for sum()
        return self
    return self.__add__(other)  # type: ignore[arg-type]
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/datamodel/usage.py` around lines 46 - 55, The Usage class
currently implements __add__ but not __radd__, so using sum([u1, u2]) fails when
Python tries 0 + u1; add a __radd__ method on Usage that treats other==0 as the
additive identity returning self, and otherwise delegates to __add__ (e.g., call
self.__add__(other) or raise a TypeError consistently), so sum() works without
an explicit start while preserving the existing None-handling in __add__.

57-73: 💤 Low value

_add_optional_int and _add_optional_float are identical — consolidate into one helper.

Both inner functions share the same body; only the annotations differ (and Python ignores those at runtime). A single _add_optional that's typed with a TypeVar or just uses the union covers both cases:

♻️ Proposed refactor
```diff
-        def _add_optional_int(a: int | None, b: int | None) -> int | None:
-            if a is None and b is None:
-                return None
-            if a is None:
-                return b
-            if b is None:
-                return a
-            return a + b
-
-        def _add_optional_float(a: float | None, b: float | None) -> float | None:
-            if a is None and b is None:
-                return None
-            if a is None:
-                return b
-            if b is None:
-                return a
-            return a + b
+        from typing import TypeVar
+        _N = TypeVar("_N", int, float)
+
+        def _add_optional(a: _N | None, b: _N | None) -> _N | None:
+            if a is None and b is None:
+                return None
+            return (a or 0) + (b or 0)  # type: ignore[operator]

         return Usage(
-            input_tokens=_add_optional_int(self.input_tokens, other.input_tokens),
-            output_tokens=_add_optional_int(self.output_tokens, other.output_tokens),
-            total_tokens=_add_optional_int(self.total_tokens, other.total_tokens),
-            cost=_add_optional_float(self.cost, other.cost),
-            cached_tokens=_add_optional_int(self.cached_tokens, other.cached_tokens),
-            total_llm_latency_ms=_add_optional_int(
-                self.total_llm_latency_ms, other.total_llm_latency_ms
-            ),
+            input_tokens=_add_optional(self.input_tokens, other.input_tokens),
+            output_tokens=_add_optional(self.output_tokens, other.output_tokens),
+            total_tokens=_add_optional(self.total_tokens, other.total_tokens),
+            cost=_add_optional(self.cost, other.cost),
+            cached_tokens=_add_optional(self.cached_tokens, other.cached_tokens),
+            total_llm_latency_ms=_add_optional(
+                self.total_llm_latency_ms, other.total_llm_latency_ms
+            ),
         )
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@libs/core/kiln_ai/datamodel/usage.py` around lines 57 - 73, Consolidate the
duplicate functions _add_optional_int and _add_optional_float into a single
helper (e.g., _add_optional) that handles optional numeric values; implement it
using a TypeVar bound to numbers or simply use Union[int, float] ->
Optional[...] for the signature, preserve the same logic body, and
replace/internalize calls to _add_optional_int/_add_optional_float to call the
new _add_optional so callers like any usage in this module continue to work with
both ints and floats.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 4976f561-b2d1-44bb-bfcf-67cd804c9609

📥 Commits

Reviewing files that changed from the base of the PR and between d4bc730 and 8b715f7.

📒 Files selected for processing (8)
  • app/web_ui/src/lib/api_schema.d.ts
  • libs/core/kiln_ai/adapters/model_adapters/adapter_stream.py
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
  • libs/core/kiln_ai/datamodel/task_run.py
  • libs/core/kiln_ai/datamodel/usage.py
  • libs/core/kiln_ai/utils/open_ai_types.py
  • libs/core/kiln_ai/utils/test_open_ai_types.py

Two review comments addressed:

1. **Extract shared helper.** ``record_per_call_usage_and_latency`` in
   ``kiln_ai.datamodel.usage`` now owns the "aggregate per-call usage
   onto the turn total + stamp per-message dicts" logic. Both
   ``_run_model_turn`` (litellm_adapter.py) and ``_stream_model_turn``
   (adapter_stream.py) call it, removing the duplicated 6-line block.
   Stateless free function (no ``self`` needed) so the existing
   ``mock_adapter`` fixture in test_adapter_stream.py keeps working
   without extending it.

2. **Concrete OpenAPI typing.** Tried ``Optional["Usage"]`` forward-ref
   per the suggestion, but Pydantic's schema build for ``TaskRun.trace``
   eagerly resolves the forward ref against ``open_ai_types``'s globals
   during ``TaskRun`` class definition — at which point the cycle isn't
   yet broken (we're still inside ``open_ai_types``'s import) and we
   hit "TaskRun is not fully defined". ``defer_build=True`` would fix
   it but ripples to many models.

   Instead: ``Annotated[Optional[Any], WithJsonSchema(...)]`` keeps the
   Python type as ``Any`` (no cycle) while pinning the OpenAPI / TS
   schema to ``$ref: Usage``. Generated ``api_schema.d.ts`` now reads
   ``usage?: components["schemas"]["Usage"] | null`` again, identical
   to what a direct ``Optional[Usage]`` annotation would emit.

3847/3847 ``libs/core`` tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tawnymanticore
Collaborator Author

Both nitpicks from CodeRabbit (__radd__ for sum(), and consolidating _add_optional_int / _add_optional_float) are on Usage.__add__ code that I moved unchanged from task_run.py — they pre-date this PR. CodeRabbit also flagged them as "💤 Low value" itself. Leaving them for a follow-up cleanup PR to keep this diff focused on the per-message-usage fix.
