Commit dd804e2

Add cache-stat fields to Response.usage (#136)
* Add cache-stat fields to Response.usage

  Extend the Usage model with two optional non-negative integer fields: cached_tokens (count of input tokens that hit a prefix cache) and cache_creation_tokens (count of input tokens written to the cache). Both default to None to preserve the absent-vs-reported-zero distinction the spec mandates.

  The OpenAI adapter sources cached_tokens from the nested response path usage.prompt_tokens_details.cached_tokens; vLLM and other OpenAI-compatible servers that surface implicit-cache stats follow the same nesting. cache_creation_tokens stays None for this mapping per the OpenAI-compat wire surface. Malformed values surface as ProviderInvalidResponse via the existing Pydantic validation path.

  Foundation for OTel cache-attribute emission and the typed LLM completion event work landing in subsequent PRs of the v0.13.0 LLM hardening cycle.

* Subset-compare usage assertions in conformance harness

  Conformance fixtures pre-dating proposal 0047 assert the three base token-count fields only and don't mention the new cache-stat fields. Adding cached_tokens and cache_creation_tokens to Usage made the exact-equality comparison fail against any fixture whose expected usage shape doesn't include the cache fields.

  Treat the fixture's expected usage as the floor: only compare keys the fixture explicitly asserts about. A spec-required field absent from actual still fails the comparison (the filtered dict won't contain it). Impl-extension fields that the fixture is silent about no longer trip exact-equality, so additive Usage extensions don't break pre-existing fixtures.

* Set cached_tokens conditionally for cleaner Pydantic semantics

  Address PR review feedback: passing cached_tokens=None unconditionally marks the Pydantic field as "set", so model_dump(exclude_unset=True) includes it as None, indistinguishable from a wire response that explicitly carried a null cached_tokens.

  Switch the if-branch to a kwargs-dict pattern that only sets cached_tokens when prompt_tokens_details is a dict AND the nested cached_tokens key is present. Attribute access (usage.cached_tokens) still returns None when the wire didn't report; exclude_unset dumps now reflect the wire shape exactly. The else-branch's explicit cached_tokens=None (added in the previous commit for if/else symmetry) is dropped; symmetry now sits at "reflect wire reality", not "pass identical args".

  Two regression tests pin the bidirectional projection: exclude_unset omits cached_tokens when the wire didn't report it, and includes it with the int value when it did.

* Guard subset filter against non-mapping expected_usage

  Address PR review feedback: the conformance fixture's typed model allows usage: null. When that happens, expected_usage is None and the subset filter's k in expected_usage raises TypeError instead of producing a clean assertion failure.

  Wrap the subset filter in an isinstance(dict) guard. The mapping path keeps the same subset semantics; the non-mapping path falls back to direct comparison so the assert fires with a clear shape mismatch.
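
The set-versus-unset behavior that motivated the third bullet can be reproduced in isolation. What follows is a minimal sketch, assuming Pydantic v2 is installed; `UsageSketch` is a hypothetical stand-in that mirrors the field declarations in this commit (written with `Optional[int]` for wider Python compatibility), not the project's actual `Usage` class.

```python
from typing import Any, Optional

from pydantic import BaseModel, ConfigDict, Field


class UsageSketch(BaseModel):
    """Hypothetical stand-in mirroring the Usage fields shown in this commit."""

    model_config = ConfigDict(extra="forbid")

    prompt_tokens: Optional[int] = Field(ge=0)
    completion_tokens: Optional[int] = Field(ge=0)
    total_tokens: Optional[int] = Field(ge=0)
    cached_tokens: Optional[int] = Field(default=None, ge=0)


base: dict[str, Any] = {"prompt_tokens": 100, "completion_tokens": 20, "total_tokens": 120}

# Passing None explicitly marks the field as "set", so it survives exclude_unset.
explicit_none = UsageSketch(**base, cached_tokens=None)

# Omitting the keyword leaves the field unset; attribute access still yields None.
omitted = UsageSketch(**base)

explicit_dump = explicit_none.model_dump(exclude_unset=True)
omitted_dump = omitted.model_dump(exclude_unset=True)
```

Both instances compare equal on attribute access, so the difference is only visible through `model_fields_set` and `exclude_unset` dumps, which is why the commit builds a kwargs dict rather than always passing the keyword.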
1 parent 20a078a commit dd804e2

4 files changed

Lines changed: 295 additions & 7 deletions

src/openarmature/llm/providers/openai.py

Lines changed: 39 additions & 5 deletions
@@ -646,13 +646,47 @@ def _parse_response(
         try:
             if isinstance(usage_wire_raw, dict):
                 usage_wire = cast("dict[str, Any]", usage_wire_raw)
-                usage = Usage(
-                    prompt_tokens=usage_wire.get("prompt_tokens"),
-                    completion_tokens=usage_wire.get("completion_tokens"),
-                    total_tokens=usage_wire.get("total_tokens"),
+                # cached_tokens sources from
+                # usage.prompt_tokens_details.cached_tokens per spec
+                # §8.1.2; vLLM and other OpenAI-compatible servers that
+                # surface implicit-cache stats follow the same nesting.
+                # Defaults to None when prompt_tokens_details is absent
+                # or when the nested cached_tokens key is missing
+                # (preserves the absent-vs-reported-zero distinction).
+                # cache_creation_tokens stays None — OpenAI-compatible
+                # providers do not report a discrete cache-creation
+                # count under this mapping.
+                prompt_tokens_details_raw = usage_wire.get("prompt_tokens_details")
+                prompt_tokens_details: dict[str, Any] | None = (
+                    cast("dict[str, Any]", prompt_tokens_details_raw)
+                    if isinstance(prompt_tokens_details_raw, dict)
+                    else None
                 )
+                # Conditional set: only pass cached_tokens to Usage(...)
+                # when the wire actually reports the nested key. Pydantic
+                # tracks the field as "set" only when explicitly passed;
+                # downstream consumers using model_dump(exclude_unset=True)
+                # then get a clean wire-shape projection (cached_tokens
+                # omitted entirely when the provider didn't report it).
+                # Attribute access (usage.cached_tokens) still returns
+                # None when absent per the spec's absent-vs-reported
+                # distinction. Malformed values surface as
+                # ProviderInvalidResponse via the same Pydantic validation
+                # path the other token-count fields take.
+                usage_kwargs: dict[str, Any] = {
+                    "prompt_tokens": usage_wire.get("prompt_tokens"),
+                    "completion_tokens": usage_wire.get("completion_tokens"),
+                    "total_tokens": usage_wire.get("total_tokens"),
+                }
+                if prompt_tokens_details is not None and "cached_tokens" in prompt_tokens_details:
+                    usage_kwargs["cached_tokens"] = prompt_tokens_details["cached_tokens"]
+                usage = Usage(**usage_kwargs)
             else:
-                usage = Usage(prompt_tokens=None, completion_tokens=None, total_tokens=None)
+                usage = Usage(
+                    prompt_tokens=None,
+                    completion_tokens=None,
+                    total_tokens=None,
+                )
         except ValidationError as exc:
             raise ProviderInvalidResponse(f"invalid usage record: {exc}") from exc
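
The extraction logic in this hunk can be exercised without a provider or a Pydantic model. The sketch below uses only the standard library; `build_usage_kwargs` is a hypothetical helper name introduced for illustration and does not exist in the repository. It folds the `isinstance` guard and the key-presence check into one condition, which should be equivalent to the two-step form in the diff.

```python
from typing import Any


def build_usage_kwargs(usage_wire: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: map a wire `usage` object to Usage keyword arguments."""
    kwargs: dict[str, Any] = {
        "prompt_tokens": usage_wire.get("prompt_tokens"),
        "completion_tokens": usage_wire.get("completion_tokens"),
        "total_tokens": usage_wire.get("total_tokens"),
    }
    details = usage_wire.get("prompt_tokens_details")
    # Only forward cached_tokens when the nested key is really on the wire;
    # a non-dict details value is treated the same as a missing one.
    if isinstance(details, dict) and "cached_tokens" in details:
        kwargs["cached_tokens"] = details["cached_tokens"]
    return kwargs


base = {"prompt_tokens": 100, "completion_tokens": 20, "total_tokens": 120}

reported_hit = build_usage_kwargs({**base, "prompt_tokens_details": {"cached_tokens": 75}})
reported_zero = build_usage_kwargs({**base, "prompt_tokens_details": {"cached_tokens": 0}})
other_details = build_usage_kwargs({**base, "prompt_tokens_details": {"audio_tokens": 0}})
malformed = build_usage_kwargs({**base, "prompt_tokens_details": "unexpected_shape"})
not_reported = build_usage_kwargs(base)
```

A reported zero is forwarded as `0`, while the three "not reported" shapes leave the key out entirely, so the model constructor never sees a `cached_tokens` argument for them.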

src/openarmature/llm/response.py

Lines changed: 21 additions & 1 deletion
@@ -41,18 +41,38 @@
 FinishReason = Literal["stop", "length", "tool_calls", "content_filter", "error"]


+# Cache-stat fields (cached_tokens / cache_creation_tokens) are
+# optional and default to None. The absent-vs-reported-zero distinction
+# is observable: None means the provider did not report the field; 0
+# means the provider reported the field with value zero (a "reported
+# miss"). Each per-provider wire-format mapping documents which fields
+# it sources.
 class Usage(BaseModel):
     """Token-accounting record.

     Each field is a non-negative integer or ``None``. If the provider
-    does not report usage, all three MUST be ``None``.
+    does not report token counts, ``prompt_tokens`` / ``completion_tokens``
+    / ``total_tokens`` MUST be ``None``.
     """

     model_config = ConfigDict(extra="forbid")

     prompt_tokens: int | None = Field(ge=0)
     completion_tokens: int | None = Field(ge=0)
     total_tokens: int | None = Field(ge=0)
+    # The count of input tokens that hit a prefix cache, sourced from
+    # the provider's response. Absent (None) when the provider does
+    # not report cache statistics; set to 0 when the provider reports
+    # zero cache-hit tokens. Each wire-format mapping documents the
+    # provider response field this value is sourced from.
+    cached_tokens: int | None = Field(default=None, ge=0)
+    # The count of input tokens written to the cache during the call.
+    # Populated primarily by providers with explicit cache-control
+    # surfaces that report a discrete cache-creation count alongside
+    # cache reads. Absent (None) for providers that only report
+    # implicit cache reads (the §8.1 OpenAI-compat mapping leaves this
+    # field absent).
+    cache_creation_tokens: int | None = Field(default=None, ge=0)


 class Response(BaseModel):
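
The absent-vs-reported-zero distinction matters mostly on the consumer side. As an illustration only, here is a hypothetical helper (`cache_hit_ratio` is not part of this commit) that keeps the two states apart. It assumes `cached_tokens` is counted within `prompt_tokens`, which is how the OpenAI-style usage object is generally understood; other mappings may differ.

```python
from typing import Optional


def cache_hit_ratio(prompt_tokens: Optional[int], cached_tokens: Optional[int]) -> Optional[float]:
    """Hypothetical consumer helper: fraction of prompt tokens served from cache."""
    # None means "not reported": there is no ratio to compute.
    if cached_tokens is None or prompt_tokens is None:
        return None
    # Avoid dividing by zero on an empty prompt.
    if prompt_tokens == 0:
        return 0.0
    # A reported 0 is a real measurement and yields 0.0, not None.
    return cached_tokens / prompt_tokens


hit = cache_hit_ratio(100, 75)
reported_miss = cache_hit_ratio(100, 0)
not_reported = cache_hit_ratio(100, None)
```

Testing with `is None` rather than truthiness is the point of the sketch: a check such as `if not cached_tokens` would collapse a reported miss into the not-reported case.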

tests/conformance/test_llm_provider.py

Lines changed: 18 additions & 1 deletion
@@ -430,7 +430,24 @@ def _assert_response_matches(actual: Response, expected: Mapping[str, Any]) -> N
     assert actual.finish_reason == expected["finish_reason"]
     if "usage" in expected:
         expected_usage = expected["usage"]
-        actual_usage = actual.usage.model_dump()
+        actual_usage_full = actual.usage.model_dump()
+        # Subset comparison when the fixture asserts about specific
+        # usage fields: spec fixtures pin which fields MUST be present
+        # with what values. Impl-extension fields outside the fixture's
+        # expected set (e.g., the 0047 cache-stat fields on impls that
+        # have adopted them but against fixtures that pre-date the
+        # proposal) are ignored when the fixture doesn't assert about
+        # them. A fixture key that's absent from actual surfaces as a
+        # missing key in the filtered dict and fails the comparison;
+        # the impl can't silently drop a field the spec requires.
+        #
+        # Non-mapping expected_usage (e.g., a fixture sets usage: null)
+        # falls back to direct comparison so the assertion fires with a
+        # clean shape mismatch rather than crashing on the subset filter.
+        if isinstance(expected_usage, dict):
+            actual_usage = {k: v for k, v in actual_usage_full.items() if k in expected_usage}
+        else:
+            actual_usage = actual_usage_full
         assert actual_usage == expected_usage, (
             f"usage mismatch: actual={actual_usage}, expected={expected_usage}"
         )
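
The subset comparison in this hunk can be checked against the cases the comments describe. The following standard-library sketch wraps the same filter in a hypothetical function, `comparable_usage`, purely for illustration; the sample dicts are made up and are not real conformance fixtures.

```python
from typing import Any


def comparable_usage(actual_full: dict[str, Any], expected: Any) -> Any:
    """Hypothetical helper: project actual usage onto the keys the fixture asserts."""
    if isinstance(expected, dict):
        return {k: v for k, v in actual_full.items() if k in expected}
    # Non-mapping expectation (for example None): compare the full dict directly.
    return actual_full


actual = {
    "prompt_tokens": 1,
    "completion_tokens": 2,
    "total_tokens": 3,
    "cached_tokens": None,
    "cache_creation_tokens": None,
}
old_fixture = {"prompt_tokens": 1, "completion_tokens": 2, "total_tokens": 3}

# A fixture that is silent about the cache fields still matches.
matches_old = comparable_usage(actual, old_fixture) == old_fixture

# A fixture that asserts a cache value the implementation did not produce fails.
matches_wrong_value = comparable_usage(actual, {**old_fixture, "cached_tokens": 5}) == {
    **old_fixture,
    "cached_tokens": 5,
}

# A fixture key missing from actual leaves a gap in the filtered dict and fails.
matches_missing_key = comparable_usage({"prompt_tokens": 1}, old_fixture) == old_fixture

# usage: null compares unequal instead of raising TypeError in the filter.
matches_null = comparable_usage(actual, None) == None  # noqa: E711
```

Filtering on the fixture's keys makes the fixture a floor rather than an exact shape, so additive fields pass while wrong values and dropped fields still fail.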

tests/unit/test_llm_provider.py

Lines changed: 217 additions & 0 deletions
@@ -534,6 +534,223 @@ def _bad(_req: httpx.Request) -> httpx.Response:
         await provider.aclose()


+# ---------------------------------------------------------------------------
+# Usage cache-stat fields (proposal 0047 — llm-provider §6 extension)
+# ---------------------------------------------------------------------------
+
+
+def test_usage_cache_fields_default_to_none() -> None:
+    # Backwards-compat: existing Usage constructions that don't pass
+    # cache fields produce instances with cached_tokens = None and
+    # cache_creation_tokens = None (the "not reported" state, distinct
+    # from a "reported zero" value of 0).
+    usage = Usage(prompt_tokens=1, completion_tokens=2, total_tokens=3)
+    assert usage.cached_tokens is None
+    assert usage.cache_creation_tokens is None
+
+
+def test_usage_negative_cached_tokens_rejected_at_construction() -> None:
+    with pytest.raises(ValidationError):
+        Usage(prompt_tokens=0, completion_tokens=0, total_tokens=0, cached_tokens=-1)
+
+
+def test_usage_negative_cache_creation_tokens_rejected_at_construction() -> None:
+    with pytest.raises(ValidationError):
+        Usage(prompt_tokens=0, completion_tokens=0, total_tokens=0, cache_creation_tokens=-1)
+
+
+def _make_openai_response_with_usage(usage_body: dict[str, object]) -> httpx.MockTransport:
+    """Build a MockTransport returning a minimal Chat Completions
+    response with the given ``usage`` body. Helper for the cache-stat
+    end-to-end tests below.
+    """
+
+    def _handler(_req: httpx.Request) -> httpx.Response:
+        return httpx.Response(
+            200,
+            json={
+                "id": "x",
+                "object": "chat.completion",
+                "created": 0,
+                "model": "m",
+                "choices": [
+                    {
+                        "index": 0,
+                        "message": {"role": "assistant", "content": "ok"},
+                        "finish_reason": "stop",
+                    }
+                ],
+                "usage": usage_body,
+            },
+        )
+
+    return httpx.MockTransport(_handler)
+
+
+async def test_complete_sources_cached_tokens_from_nested_prompt_tokens_details() -> None:
+    # Cache-hit reported with a positive value. Spec §8.1.2: the
+    # OpenAI-compat mapping sources cached_tokens from
+    # usage.prompt_tokens_details.cached_tokens.
+    transport = _make_openai_response_with_usage(
+        {
+            "prompt_tokens": 100,
+            "completion_tokens": 20,
+            "total_tokens": 120,
+            "prompt_tokens_details": {"cached_tokens": 75},
+        }
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        assert response.usage.cached_tokens == 75
+        # cache_creation_tokens is not sourced by the OpenAI-compat
+        # mapping per spec §8.1.2.
+        assert response.usage.cache_creation_tokens is None
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_reports_zero_cached_tokens_distinct_from_absent() -> None:
+    # The spec mandates the absent-vs-reported-zero distinction:
+    # absent (None) means the provider didn't report; 0 means the
+    # provider reported zero hits. Locks down the distinction.
+    transport = _make_openai_response_with_usage(
+        {
+            "prompt_tokens": 100,
+            "completion_tokens": 20,
+            "total_tokens": 120,
+            "prompt_tokens_details": {"cached_tokens": 0},
+        }
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        assert response.usage.cached_tokens == 0
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_cached_tokens_absent_when_prompt_tokens_details_missing() -> None:
+    # Common pre-cache path: vLLM without --enable-prompt-tokens-details,
+    # OpenAI responses pre-cache-support, etc. No prompt_tokens_details
+    # nesting at all → cached_tokens stays None.
+    transport = _make_openai_response_with_usage(
+        {"prompt_tokens": 100, "completion_tokens": 20, "total_tokens": 120}
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        assert response.usage.cached_tokens is None
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_cached_tokens_absent_when_nested_key_missing() -> None:
+    # Defensive: prompt_tokens_details dict exists (provider may report
+    # other details there, e.g., audio_tokens) but cached_tokens is
+    # absent within it. Sourcing path stays defensive — no KeyError,
+    # cached_tokens stays None.
+    transport = _make_openai_response_with_usage(
+        {
+            "prompt_tokens": 100,
+            "completion_tokens": 20,
+            "total_tokens": 120,
+            "prompt_tokens_details": {"audio_tokens": 0},
+        }
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        assert response.usage.cached_tokens is None
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_excludes_unset_cached_tokens_when_wire_did_not_report() -> None:
+    # When the wire response doesn't carry prompt_tokens_details (or
+    # carries it without a cached_tokens key), the parser leaves the
+    # Pydantic field unset. model_dump(exclude_unset=True) then omits
+    # cached_tokens entirely, giving downstream consumers a clean
+    # wire-shape projection. Attribute access still returns None per
+    # the spec's absent-vs-reported distinction.
+    transport = _make_openai_response_with_usage(
+        {"prompt_tokens": 100, "completion_tokens": 20, "total_tokens": 120}
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        assert response.usage.cached_tokens is None
+        dumped = response.usage.model_dump(exclude_unset=True)
+        assert "cached_tokens" not in dumped
+        # Conversely, when the wire DID report (covered separately
+        # above), the field IS set and appears in the projection.
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_includes_cached_tokens_in_exclude_unset_dump_when_wire_reported() -> None:
+    # Companion to the no-wire-report case above: when the wire reports
+    # prompt_tokens_details.cached_tokens, the field IS marked set and
+    # appears in model_dump(exclude_unset=True). Locks down the
+    # bidirectional projection for downstream consumers.
+    transport = _make_openai_response_with_usage(
+        {
+            "prompt_tokens": 100,
+            "completion_tokens": 20,
+            "total_tokens": 120,
+            "prompt_tokens_details": {"cached_tokens": 75},
+        }
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        dumped = response.usage.model_dump(exclude_unset=True)
+        assert dumped.get("cached_tokens") == 75
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_cached_tokens_absent_when_prompt_tokens_details_not_a_dict() -> None:
+    # Defensive against malformed wire responses: if prompt_tokens_details
+    # is a non-dict scalar / string / list, the isinstance guard in the
+    # parser treats it as absent rather than crashing. cached_tokens
+    # stays None.
+    transport = _make_openai_response_with_usage(
+        {
+            "prompt_tokens": 100,
+            "completion_tokens": 20,
+            "total_tokens": 120,
+            "prompt_tokens_details": "unexpected_shape",
+        }
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        response = await provider.complete([UserMessage(content="hi")])
+        assert response.usage.cached_tokens is None
+    finally:
+        await provider.aclose()
+
+
+async def test_complete_negative_cached_tokens_surfaces_as_invalid_response() -> None:
+    # Same invariant the existing test pins for prompt_tokens — a
+    # wire response carrying a negative cache count MUST surface as
+    # ``provider_invalid_response`` rather than silently passing through.
+    transport = _make_openai_response_with_usage(
+        {
+            "prompt_tokens": 100,
+            "completion_tokens": 20,
+            "total_tokens": 120,
+            "prompt_tokens_details": {"cached_tokens": -1},
+        }
+    )
+    provider = OpenAIProvider(base_url="http://test", model="m", api_key="k", transport=transport)
+    try:
+        with pytest.raises(ProviderInvalidResponse, match="invalid usage record"):
+            await provider.complete([UserMessage(content="hi")])
+    finally:
+        await provider.aclose()
+
+
 # RuntimeConfig.from_partial — Python ergonomic introduced alongside
 # proposal 0032. Wire-layer null-skip already drops Nones; this just
 # lets callers splat a partial dict without filtering at the call site.
