Add cache-stat fields to Response.usage (#136)

chris-colinsky · web-flow · commit dd804e2cb4dc · 2026-06-06T11:49:35.000-07:00
* Add cache-stat fields to Response.usage

Extend the Usage model with two optional non-negative integer fields:
cached_tokens (count of input tokens that hit a prefix cache) and
cache_creation_tokens (count of input tokens written to the cache).
Both default to None to preserve the absent-vs-reported-zero
distinction the spec mandates.

The OpenAI adapter sources cached_tokens from the nested response
path usage.prompt_tokens_details.cached_tokens; vLLM and other
OpenAI-compatible servers that surface implicit-cache stats follow
the same nesting. cache_creation_tokens stays None for this mapping
per the OpenAI-compat wire surface. Malformed values surface as
ProviderInvalidResponse via the existing Pydantic validation path.
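
As a rough illustration of the nested sourcing described above, the sketch below pulls cached_tokens out of a raw usage dict. The helper name extract_cached_tokens is hypothetical and is not part of the adapter; the real parser inlines this logic in _parse_response and hands the result to Pydantic for validation.

```python
from typing import Any


def extract_cached_tokens(usage_wire: dict[str, Any]) -> dict[str, Any]:
    """Return {"cached_tokens": value} only when the wire reported it.

    Hypothetical helper for illustration; an empty dict means "absent".
    """
    details = usage_wire.get("prompt_tokens_details")
    # Non-dict shapes (string, list, None) are treated as absent.
    if isinstance(details, dict) and "cached_tokens" in details:
        return {"cached_tokens": details["cached_tokens"]}
    return {}


reported = extract_cached_tokens(
    {"prompt_tokens": 100, "prompt_tokens_details": {"cached_tokens": 75}}
)
reported_zero = extract_cached_tokens(
    {"prompt_tokens": 100, "prompt_tokens_details": {"cached_tokens": 0}}
)
absent = extract_cached_tokens({"prompt_tokens": 100})
other_details = extract_cached_tokens(
    {"prompt_tokens": 100, "prompt_tokens_details": {"audio_tokens": 0}}
)
malformed = extract_cached_tokens({"prompt_tokens_details": "unexpected_shape"})
```

A reported zero yields a key with value 0, while the three absent shapes yield no key at all, which is what keeps the two states apart.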

Foundation for OTel cache-attribute emission and the typed LLM
completion event work landing in subsequent PRs of the v0.13.0
LLM hardening cycle.

* Subset-compare usage assertions in conformance harness

Conformance fixtures pre-dating proposal 0047 assert the three base
token-count fields only and don't mention the new cache-stat fields.
Adding cached_tokens and cache_creation_tokens to Usage made the
exact-equality comparison fail against any fixture whose expected
usage shape doesn't include the cache fields.

Treat the fixture's expected usage as the floor: only compare keys
the fixture explicitly asserts about. A spec-required field absent
from actual still fails the comparison (the filtered dict won't
contain it). Impl-extension fields that the fixture is silent about
no longer trip exact-equality, so additive Usage extensions don't
break pre-existing fixtures.
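
A minimal sketch of the floor semantics, using plain dicts in place of the harness types. The function name filter_to_expected and the sample values are illustrative assumptions, not names from the harness.

```python
from typing import Any


def filter_to_expected(actual: dict[str, Any], expected: dict[str, Any]) -> dict[str, Any]:
    """Keep only the keys of `actual` that the fixture asserts about."""
    return {k: v for k, v in actual.items() if k in expected}


# Fixture that pre-dates the cache-stat fields: three base counts only.
legacy_fixture = {"prompt_tokens": 100, "completion_tokens": 20, "total_tokens": 120}

# Full dump from an implementation that has adopted the new fields.
actual_full = {
    "prompt_tokens": 100,
    "completion_tokens": 20,
    "total_tokens": 120,
    "cached_tokens": None,
    "cache_creation_tokens": None,
}

# Extension fields the fixture is silent about no longer break the match.
legacy_matches = filter_to_expected(actual_full, legacy_fixture) == legacy_fixture

# A field the fixture requires but actual lacks still fails the comparison.
actual_missing_total = {"prompt_tokens": 100, "completion_tokens": 20}
missing_detected = filter_to_expected(actual_missing_total, legacy_fixture) != legacy_fixture
```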

* Set cached_tokens conditionally for cleaner Pydantic semantics

Address PR review feedback: passing cached_tokens=None
unconditionally marks the Pydantic field as "set", so
model_dump(exclude_unset=True) includes it as None, indistinguishable
from a wire response that explicitly carried a null cached_tokens.

Switch the if-branch to a kwargs-dict pattern that only sets
cached_tokens when prompt_tokens_details is a dict AND the nested
cached_tokens key is present. Attribute access (usage.cached_tokens)
still returns None when the wire didn't report; exclude_unset dumps
now reflect the wire shape exactly. The else-branch's explicit
cached_tokens=None (added in the previous commit for if/else symmetry)
is dropped — symmetry now sits at "reflect wire reality", not "pass
identical args".

Two regression tests pin the bidirectional projection: exclude_unset
omits cached_tokens when the wire didn't report it; includes it with
the int value when it did.

* Guard subset filter against non-mapping expected_usage

Address PR review feedback: the conformance fixture's typed model
allows usage: null. When that happens, expected_usage is None and
the subset filter's k in expected_usage raises TypeError instead of
producing a clean assertion failure.

Wrap the subset filter in an isinstance(dict) guard. The mapping path
keeps the same subset semantics; the non-mapping path falls back to
direct comparison so the assert fires with a clear shape mismatch.
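
A small sketch of the failure mode and the guard, again with plain dicts. The function name project_actual is hypothetical; the harness performs this inline inside _assert_response_matches.

```python
from typing import Any


def project_actual(actual_full: dict[str, Any], expected: Any) -> dict[str, Any]:
    """Subset-filter when `expected` is a dict; otherwise return the full dump."""
    if isinstance(expected, dict):
        return {k: v for k, v in actual_full.items() if k in expected}
    return actual_full


actual_full = {"prompt_tokens": 100, "completion_tokens": 20, "total_tokens": 120}
expected_null = None  # models a fixture that sets `usage: null`

# Unguarded filter: the membership test against None raises TypeError.
try:
    {k: v for k, v in actual_full.items() if k in expected_null}
    unguarded_error = None
except TypeError as exc:
    unguarded_error = type(exc).__name__

# Guarded path: the full dump is compared directly, so the assertion
# fails with a readable shape mismatch instead of crashing.
guarded = project_actual(actual_full, expected_null)
shape_mismatch = guarded != expected_null
```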
diff --git a/src/openarmature/llm/providers/openai.py b/src/openarmature/llm/providers/openai.py
@@ -646,13 +646,47 @@ def _parse_response(
         try:
             if isinstance(usage_wire_raw, dict):
                 usage_wire = cast("dict[str, Any]", usage_wire_raw)
-                usage = Usage(
-                    prompt_tokens=usage_wire.get("prompt_tokens"),
-                    completion_tokens=usage_wire.get("completion_tokens"),
-                    total_tokens=usage_wire.get("total_tokens"),
+                # cached_tokens sources from
+                # usage.prompt_tokens_details.cached_tokens per spec
+                # §8.1.2; vLLM and other OpenAI-compatible servers that
+                # surface implicit-cache stats follow the same nesting.
+                # Defaults to None when prompt_tokens_details is absent
+                # or when the nested cached_tokens key is missing
+                # (preserves the absent-vs-reported-zero distinction).
+                # cache_creation_tokens stays None — OpenAI-compatible
+                # providers do not report a discrete cache-creation
+                # count under this mapping.
+                prompt_tokens_details_raw = usage_wire.get("prompt_tokens_details")
+                prompt_tokens_details: dict[str, Any] | None = (
+                    cast("dict[str, Any]", prompt_tokens_details_raw)
+                    if isinstance(prompt_tokens_details_raw, dict)
+                    else None
                 )
+                # Conditional set: only pass cached_tokens to Usage(...)
+                # when the wire actually reports the nested key. Pydantic
+                # tracks the field as "set" only when explicitly passed;
+                # downstream consumers using model_dump(exclude_unset=True)
+                # then get a clean wire-shape projection (cached_tokens
+                # omitted entirely when the provider didn't report it).
+                # Attribute access (usage.cached_tokens) still returns
+                # None when absent per the spec's absent-vs-reported
+                # distinction. Malformed values surface as
+                # ProviderInvalidResponse via the same Pydantic validation
+                # path the other token-count fields take.
+                usage_kwargs: dict[str, Any] = {
+                    "prompt_tokens": usage_wire.get("prompt_tokens"),
+                    "completion_tokens": usage_wire.get("completion_tokens"),
+                    "total_tokens": usage_wire.get("total_tokens"),
+                }
+                if prompt_tokens_details is not None and "cached_tokens" in prompt_tokens_details:
+                    usage_kwargs["cached_tokens"] = prompt_tokens_details["cached_tokens"]
+                usage = Usage(**usage_kwargs)
             else:
-                usage = Usage(prompt_tokens=None, completion_tokens=None, total_tokens=None)
+                usage = Usage(
+                    prompt_tokens=None,
+                    completion_tokens=None,
+                    total_tokens=None,
+                )
         except ValidationError as exc:
             raise ProviderInvalidResponse(f"invalid usage record: {exc}") from exc
 
diff --git a/src/openarmature/llm/response.py b/src/openarmature/llm/response.py
@@ -41,18 +41,38 @@
 FinishReason = Literal["stop", "length", "tool_calls", "content_filter", "error"]
 
 
+# Cache-stat fields (cached_tokens / cache_creation_tokens) are
+# optional and default to None. The absent-vs-reported-zero distinction
+# is observable: None means the provider did not report the field; 0
+# means the provider reported the field with value zero (a "reported
+# miss"). Each per-provider wire-format mapping documents which fields
+# it sources.
 class Usage(BaseModel):
     """Token-accounting record.
 
     Each field is a non-negative integer or ``None``. If the provider
-    does not report usage, all three MUST be ``None``.
+    does not report token counts, ``prompt_tokens`` / ``completion_tokens``
+    / ``total_tokens`` MUST be ``None``.
     """
 
     model_config = ConfigDict(extra="forbid")
 
     prompt_tokens: int | None = Field(ge=0)
     completion_tokens: int | None = Field(ge=0)
     total_tokens: int | None = Field(ge=0)
+    # The count of input tokens that hit a prefix cache, sourced from
+    # the provider's response. Absent (None) when the provider does
+    # not report cache statistics; set to 0 when the provider reports
+    # zero cache-hit tokens. Each wire-format mapping documents the
+    # provider response field this value is sourced from.
+    cached_tokens: int | None = Field(default=None, ge=0)
+    # The count of input tokens written to the cache during the call.
+    # Populated primarily by providers with explicit cache-control
+    # surfaces that report a discrete cache-creation count alongside
+    # cache reads. Absent (None) for providers that only report
+    # implicit cache reads (the §8.1 OpenAI-compat mapping leaves this
+    # field absent).
+    cache_creation_tokens: int | None = Field(default=None, ge=0)
 
 
 class Response(BaseModel):
diff --git a/tests/conformance/test_llm_provider.py b/tests/conformance/test_llm_provider.py
@@ -430,7 +430,24 @@ def _assert_response_matches(actual: Response, expected: Mapping[str, Any]) -> N
         assert actual.finish_reason == expected["finish_reason"]
     if "usage" in expected:
         expected_usage = expected["usage"]
-        actual_usage = actual.usage.model_dump()
+        actual_usage_full = actual.usage.model_dump()
+        # Subset comparison when the fixture asserts about specific
+        # usage fields: spec fixtures pin which fields MUST be present
+        # with what values. Impl-extension fields outside the fixture's
+        # expected set (e.g., the 0047 cache-stat fields on impls that
+        # have adopted them but against fixtures that pre-date the
+        # proposal) are ignored when the fixture doesn't assert about
+        # them. A fixture key that's absent from actual surfaces as a
+        # missing key in the filtered dict and fails the comparison;
+        # the impl can't silently drop a field the spec requires.
+        #
+        # Non-mapping expected_usage (e.g., a fixture sets usage: null)
+        # falls back to direct comparison so the assertion fires with a
+        # clean shape mismatch rather than crashing on the subset filter.
+        if isinstance(expected_usage, dict):
+            actual_usage = {k: v for k, v in actual_usage_full.items() if k in expected_usage}
+        else:
+            actual_usage = actual_usage_full
         assert actual_usage == expected_usage, (
             f"usage mismatch: actual={actual_usage}, expected={expected_usage}"
         )
diff --git a/tests/unit/test_llm_provider.py b/tests/unit/test_llm_provider.py
@@ -534,6 +534,223 @@ def _bad(_req: httpx.Request) -> httpx.Response:
         await provider.aclose()
 
 
+# ---------------------------------------------------------------------------
+# Usage cache-stat fields (proposal 0047 — llm-provider §6 extension)
+# ---------------------------------------------------------------------------
+
+
+def test_usage_cache_fields_default_to_none() -> None:
+    # Backwards-compat: existing Usage constructions that don't pass
+    # cache fields produce instances with cached_tokens = None and
+    # cache_creation_tokens = None (the "not reported" state, distinct
+    # from a "reported zero" value of 0).
+    usage = Usage(prompt_tokens=1, completion_tokens=2, total_tokens=3)
+    assert usage.cached_tokens is None
+    assert usage.cache_creation_tokens is None
+
+
+def test_usage_negative_cached_tokens_rejected_at_construction() -> None:
+    with pytest.raises(ValidationError):
+        Usage(prompt_tokens=0, completion_tokens=0, total_tokens=0, cached_tokens=-1)
+
+
+def test_usage_negative_cache_creation_tokens_rejected_at_construction() -> None:
+    with pytest.raises(ValidationError):
+        Usage(prompt_tokens=0, completion_tokens=0, total_tokens=0, cache_creation_tokens=-1)
+
+
+def _make_openai_response_with_usage(usage_body: dict[str, object]) -> httpx.MockTransport:
+    """Build a MockTransport returning a minimal Chat Completions
+    response with the given ``usage`` body. Helper for the cache-stat
+    end-to-end tests below.
+    """
+
+    def _handler(_req: httpx.Request) -> httpx.Response:
+        return httpx.Response(
+            200,
+            json={
+                "id": "x",
+                "object": "chat.completion",
+                "created": 0,
+                "model": "m",
+                "choices": [
+                    {
+                        "index": 0,
+                        "message": {"role": "assistant", "content": "ok"},
+                        "finish_reason": "stop",
+                    }
+                ],
+                "usage": usage_body,
+            },
+        )
+
+    return httpx.MockTransport(_handler)
+
+
+async def test_complete_sources_cached_tokens_from_nested_prompt_tokens_details() -> None:
+    # Cache-hit reported with a positive value. Spec §8.1.2: the
+    # OpenAI-compat mapping sources cached_tokens from
+    # usage.prompt_tokens_details.cached_tokens.
+    transport = _make_openai_response_with_usage(
+        {
+            "prompt_tokens": 100,
+            "completion_tokens": 20,
+            "total_tokens": 120,
+            "prompt_tokens_details": {"cached_tokens": 75},
+        }
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        assert response.usage.cached_tokens == 75
+        # cache_creation_tokens is not sourced by the OpenAI-compat
+        # mapping per spec §8.1.2.
+        assert response.usage.cache_creation_tokens is None
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_reports_zero_cached_tokens_distinct_from_absent() -> None:
+    # The spec mandates the absent-vs-reported-zero distinction:
+    # absent (None) means the provider didn't report; 0 means the
+    # provider reported zero hits. Locks down the distinction.
+    transport = _make_openai_response_with_usage(
+        {
+            "prompt_tokens": 100,
+            "completion_tokens": 20,
+            "total_tokens": 120,
+            "prompt_tokens_details": {"cached_tokens": 0},
+        }
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        assert response.usage.cached_tokens == 0
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_cached_tokens_absent_when_prompt_tokens_details_missing() -> None:
+    # Common pre-cache path: vLLM without --enable-prompt-tokens-details,
+    # OpenAI responses pre-cache-support, etc. No prompt_tokens_details
+    # nesting at all → cached_tokens stays None.
+    transport = _make_openai_response_with_usage(
+        {"prompt_tokens": 100, "completion_tokens": 20, "total_tokens": 120}
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        assert response.usage.cached_tokens is None
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_cached_tokens_absent_when_nested_key_missing() -> None:
+    # Defensive: prompt_tokens_details dict exists (provider may report
+    # other details there, e.g., audio_tokens) but cached_tokens is
+    # absent within it. Sourcing path stays defensive — no KeyError,
+    # cached_tokens stays None.
+    transport = _make_openai_response_with_usage(
+        {
+            "prompt_tokens": 100,
+            "completion_tokens": 20,
+            "total_tokens": 120,
+            "prompt_tokens_details": {"audio_tokens": 0},
+        }
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        assert response.usage.cached_tokens is None
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_excludes_unset_cached_tokens_when_wire_did_not_report() -> None:
+    # When the wire response doesn't carry prompt_tokens_details (or
+    # carries it without a cached_tokens key), the parser leaves the
+    # Pydantic field unset. model_dump(exclude_unset=True) then omits
+    # cached_tokens entirely, giving downstream consumers a clean
+    # wire-shape projection. Attribute access still returns None per
+    # the spec's absent-vs-reported distinction.
+    transport = _make_openai_response_with_usage(
+        {"prompt_tokens": 100, "completion_tokens": 20, "total_tokens": 120}
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        assert response.usage.cached_tokens is None
+        dumped = response.usage.model_dump(exclude_unset=True)
+        assert "cached_tokens" not in dumped
+        # Conversely, when the wire DID report (covered by the companion
+        # test below), the field IS set and appears in the projection.
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_includes_cached_tokens_in_exclude_unset_dump_when_wire_reported() -> None:
+    # Companion to the no-wire-report case above: when the wire reports
+    # prompt_tokens_details.cached_tokens, the field IS marked set and
+    # appears in model_dump(exclude_unset=True). Locks down the
+    # bidirectional projection for downstream consumers.
+    transport = _make_openai_response_with_usage(
+        {
+            "prompt_tokens": 100,
+            "completion_tokens": 20,
+            "total_tokens": 120,
+            "prompt_tokens_details": {"cached_tokens": 75},
+        }
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        dumped = response.usage.model_dump(exclude_unset=True)
+        assert dumped.get("cached_tokens") == 75
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_cached_tokens_absent_when_prompt_tokens_details_not_a_dict() -> None:
+    # Defensive against malformed wire responses: if prompt_tokens_details
+    # is a non-dict scalar / string / list, the isinstance guard in the
+    # parser treats it as absent rather than crashing. cached_tokens
+    # stays None.
+    transport = _make_openai_response_with_usage(
+        {
+            "prompt_tokens": 100,
+            "completion_tokens": 20,
+            "total_tokens": 120,
+            "prompt_tokens_details": "unexpected_shape",
+        }
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        assert response.usage.cached_tokens is None
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_negative_cached_tokens_surfaces_as_invalid_response() -> None:
+    # Same invariant the existing test pins for prompt_tokens — a
+    # wire response carrying a negative cache count MUST surface as
+    # ``provider_invalid_response`` rather than silently passing through.
+    transport = _make_openai_response_with_usage(
+        {
+            "prompt_tokens": 100,
+            "completion_tokens": 20,
+            "total_tokens": 120,
+            "prompt_tokens_details": {"cached_tokens": -1},
+        }
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        with pytest.raises(ProviderInvalidResponse, match="invalid usage record"):
+            await provider.complete([UserMessage(content="hi")])
+    finally:
+        await provider.aclose()
+
+
 # RuntimeConfig.from_partial — Python ergonomic introduced alongside
 # proposal 0032. Wire-layer null-skip already drops Nones; this just
 # lets callers splat a partial dict without filtering at the call site.