feat(usage): surface prefix-cache hits in OpenAI + Anthropic usage blocks#478
Conversation
…ocks The engine already tracks per-request prefix-cache hits in ``Request.cached_tokens`` (set by every prefix-cache lookup path in ``scheduler.py``, visible in the server log line "… N cached"), but the OpenAI-compatible response only carries ``prompt_tokens`` / ``completion_tokens`` / ``total_tokens`` — cost-tracking clients that read OpenAI's ``prompt_tokens_details.cached_tokens`` field silently get nothing. The Anthropic adapter already declares ``cache_read_input_tokens`` and ``cache_creation_input_tokens`` on ``AnthropicUsage`` but never populates them (dead wire). Same gap is filed against LM Studio (lmstudio-ai/lmstudio-bug-tracker#778) — ecosystem pattern. This change wires the existing count through: Request.cached_tokens (scheduler) → RequestOutput.cached_tokens (new field, end of dataclass) → GenerationOutput.cached_tokens (new field, end of dataclass per v0.6.65 invariant) → Usage.prompt_tokens_details.cached_tokens (new field) OpenAI side: emitted only when the engine reported a hit (matches OpenAI's own omit-below-threshold behavior). Anthropic side: ``cache_read_input_tokens`` = the hit count; ``input_tokens`` = ``prompt_tokens - cached_tokens`` so Anthropic's wire-spec identity (``total_input = input + cache_read + cache_creation``) holds. ``cache_creation_input_tokens`` is intentionally unset — Anthropic uses it for tokens written between explicit ``cache_control`` breakpoints (billed 1.25x), which has no analog on a local KV-cache engine with no billing dimension. Propagation also threaded through ``output_collector._merge_outputs`` (producer-ahead aggregation) and ``EngineCore._merge_stream_buffer`` (extracted from inline code for testability; ``stream_interval > 1`` buffer-rebuild path) — both rebuild ``RequestOutput`` positionally and silently drop new fields otherwise. MLLM and dflash paths leave the field at 0 (don't run through the prefix-cache scheduler) — wired and ready when those grow reporting. Tests: 17 new test methods + 1 updated, covering helper builders, both Anthropic surfaces (non-streaming adapter + streaming ``message_delta``), both ``RequestOutput`` rebuild paths, all three OpenAI route surfaces (chat non-streaming, chat streaming with and without ``include_usage``, completions non-streaming), and the scheduler's ``_process_batch_responses`` source line. Every newly-pinned production line mutation-verified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
raullenchai
left a comment
There was a problem hiding this comment.
Review — pr_validate (full pipeline) + manual structural audit
Verdict: APPROVE w/ NITs — all 8 technical gates pass. The only pr_validate failure is the unchecked test-plan checkbox (author admin, see below).
Thanks @MetricorTechnologies — this is one of the most thoroughly executed external PRs we've received. The mutation-tested propagation paths, the deliberate handling of the v0.6.65 positional-bind invariant (codex round-1 of v0.6.66 hit that exact regression), and the Anthropic-spec identity (total_input = input + cache_read + cache_creation) are all correct. The disclosure of Claude assistance + your three-round self-review is appreciated and exactly the right shape.
pr_validate scorecard
| step | result | notes |
|---|---|---|
fetch |
✅ PASS | 933+/40-, 17 files, blast=high |
test_plan_check |
❌ FAIL | 1/5 unchecked: "Self-validated with pr_validate" |
cl_description_quality |
✅ PASS | title + body have rationale (Google eng-practices) |
deepseek_review |
✅ PASS | no blocking findings, 2 NITs (291.3s) |
supply_chain |
✅ PASS | no hooks, no suspicious patterns, deps clean |
lint |
✅ PASS | ruff check + format clean (16 files) |
targeted_tests |
✅ PASS | 510 passed, 1 skipped (25.5s) |
full_unit |
✅ PASS | 4030 passed, 19 skipped, 1 xfailed (53.9s) |
stress_e2e_bench |
✅ PASS | 3×3 matrix — Qwen3.5-35B / Qwen3.6-27B / gpt-oss-20b, full stress + integration + bench (592.7s) |
Correctness checks I ran manually
Read the diff end-to-end against the gotcha trail from prior usage-shape PRs. The high-risk propagation paths all looked correct:
- ✅
GenerationOutput+RequestOutputnew field appended at end of dataclass (v0.6.65 positional-bind invariant, enforced byTestGenerationOutputFieldOrder) - ✅
_make_routed_output+_routed_finish_sentinelthreadcached_tokens=source.cached_tokens— OutputRouter models (Gemma 4 / harmony / gpt-oss) won't silently drop to 0, the exact failure shape of #456 and #454 - ✅
_merge_outputs+_merge_stream_bufferboth threaded — the twoRequestOutputrebuild paths that the description correctly calls out as silent-drop hazards - ✅ Anthropic
input_tokens = max(0, prompt - cached)+cache_read_input_tokens = cached or Nonerespects the wire-spec identity - ✅
cache_creation_input_tokensintentionally unset — correct rationale (Anthropic'scache_controlbreakpoints have no local-engine analog) - ✅
_build_usage/get_usageusegetattr(output, "cached_tokens", 0)for the streaming_UsageOutputnamespace fallback
NITs (4 total — none blocking)
[NIT] 1 — stream_chat_completion_guided reads cached_tokens once, not per chunk (DeepSeek V4 Pro)
vllm_mlx/routes/chat.py around line +1371: cached_tokens = getattr(output, "cached_tokens", 0) or 0 is set once before the streaming loop, whereas the main stream_chat_completion path re-reads it inside the loop with the hasattr guard. If a future engine variant doesn't carry the field on the first chunk, the hit gets lost on the guided path only. Mirror the main path's per-chunk re-read, or pin the "first chunk always carries the value" invariant explicitly.
[NIT] 2 — Silent wire-format change on streaming completions (DeepSeek V4 Pro)
vllm_mlx/routes/completions.py:187: model_dump() → model_dump(exclude_none=True) drops null completion_tokens_details / prompt_tokens_details from the SSE stream. It's the right cleanup (matches what chat.py already does for the trailing usage chunk), but it's a wire change not called out in the PR description. Please add a regression test pinning "streaming completion usage omits null detail fields" so the shape is locked.
[NIT] 3 — OutputRouter streaming path lacks a per-test cache_tokens assertion (manual)
Your _CacheReportingChatEngine mock yields channel=None, so the chat streaming tests in test_routes_cached_tokens.py don't actually flow through _stream_with_output_router. The production patch in engine/batched.py:_make_routed_output + _routed_finish_sentinel is correct, but if a future refactor breaks routed-output propagation, the helper tests stay green. Add one test that drives a fake router producing channel-routed outputs and asserts cached_tokens survives the routed flush — similar shape to the TestStreamingPostProcessorChannelRouted pattern from PR #454.
[NIT] 4 — Test-plan checkbox still unchecked (this is the one pr_validate failure)
You wrote "will run after the PR number is assigned" — please re-run python3.12 -m scripts.pr_validate.pr_validate 478 locally and tick the box in the PR body. This is gated by test_plan_check because PR #435 / #427 landed an incomplete fix from leaving an [ ] E2E verification item unchecked at merge.
Supply-chain audit (PR Merge SOP §2.5)
Yellow flag noted but cleared:
- Account age:
MetricorTechnologiescreated 2026-02-12 (3 months), 1 public repo. New-ish account, flagged. But the diff itself is clean: - ✅ No new dependencies
- ✅ No
.github/workflowschanges - ✅ No install hooks, post-install scripts, or setup.py shims
- ✅ No new network calls (
requests,urllib, etc.) - ✅ AI-assistance disclosure honest and follows project pattern
Recommendation
Approve and merge after:
- Tick the
pr_validatecheckbox in the PR body (NIT 4 — flipstest_plan_checkto PASS, thenpr_validateverdict goes to MERGE). - (Optional but appreciated) Address NIT 1 + 3 in a follow-up commit — both are small and would tighten the test floor against future regressions.
Once the checkbox is ticked I can land this.
— Reviewed via pr_validate (full pipeline, 16 min wall-time) + manual diff audit against the project gotcha trail. See PR Merge SOP.
NIT 1 — Pin "single buffered output" intent in ``stream_chat_completion_guided``: Add a comment explaining why ``cached_tokens`` (and the other counts) are read once rather than per-chunk like in the main ``stream_chat_completion`` path — outlines integration produces a single buffered ``GenerationOutput`` so there's no stream loop to re-read inside. Makes the asymmetry intentional rather than subtle. NIT 2 — Regression-test the streaming completions wire shape: ``routes/completions.py:190`` already changed ``model_dump()`` → ``model_dump(exclude_none=True)`` to drop the ``null`` ``prompt_tokens_details`` / ``completion_tokens_details`` keys (matching what ``routes/chat.py`` does for the trailing usage chunk). Add two tests in ``test_routes_cached_tokens.py``: - ``test_completions_streaming_final_chunk_omits_null_detail_fields`` — asserts the JSON payload has no ``null`` detail keys on the no-cache path. - ``test_completions_streaming_final_chunk_includes_cached_tokens_when_hit`` — sibling: ``exclude_none=True`` only drops nulls, not populated fields. Both mutation-verified: reverting ``exclude_none=True`` makes the first test fail with the expected null-keys assertion. NIT 3 — Test OutputRouter routed-flush propagation of cached_tokens: ``_make_routed_output`` + ``_routed_finish_sentinel`` already thread ``source.cached_tokens`` through, but the existing route-level mocks yield ``channel=None`` and never flow through ``_stream_with_output_router``. Add ``test_stream_chat_routed_outputs_preserve_cached_tokens`` in ``test_batched_engine_output_router.py`` (harmony-vocab stream, ``cached_tokens=128`` on every source chunk, assert every emitted routed output carries the value). Mutation-verified: removing ``cached_tokens=source.cached_tokens`` from ``_make_routed_output`` makes the assertion fail with ``[0, 0, 0]`` vs ``[128, 128, 128]``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three test files keep hand-rolled stubs that mimic the ``GenerationOutput`` shape (``_StubOutput`` in ``test_prefix_boundary_path_parity.py``, ``_FakeOutput`` in ``test_anthropic_stream_reasoning.py``, ``MockDeltaOutput`` in ``test_anthropic_streaming_reasoning.py``). The new ``cached_tokens`` field on ``GenerationOutput`` is now read directly (not via ``getattr``) on the LLM non-stream + stream construction sites in ``engine/batched.py``, so the stubs must declare the field too — otherwise ``BatchedEngine.generate``/``stream_chat`` raise ``AttributeError`` when fed a stub output. ``test_prefix_boundary_path_parity.py`` was the surfacing site (4 failures, all caused by stubbed ``_StubOutput`` lacking ``cached_tokens``). The two anthropic stubs don't currently traverse the failing path but mimic the same conceptual type — adding the field to all three keeps the stub family consistent and prevents a future refactor from silently reintroducing the same gap. Backfill is a zero default; tests that care about the value pass it explicitly via the ``cached_tokens`` keyword in ``MockDeltaOutput``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thank you for the thorough review @raullenchai — much appreciated, and the gotcha-trail cross-reference (PR #454 / #456 silent-drop shapes, v0.6.65 positional-bind regression) was exactly the framing I needed. All four NITs addressed in two follow-up commits ( NIT 1 — guided streaming per-chunk re-read NIT 2 — streaming completions null-detail wire shape
NIT 3 — OutputRouter routed-flush cached_tokens propagation NIT 4 — Bonus — silent regression caught while running Let me know if NIT 1's comment-instead-of-restructure satisfies the concern, or if you'd prefer the loop-shaped version. Otherwise I think this is ready. |
What does this PR do?
Surfaces the engine's existing prefix-cache hit count
(
Request.cached_tokens) through the OpenAI and Anthropic responseusage blocks:
OpenAI (
/v1/chat/completions,/v1/completions): addsusage.prompt_tokens_details.cached_tokens(the OpenAI-spec fieldfor cached input tokens). Emitted only when the engine reported a
hit — absent below the threshold, matching OpenAI's own behavior
below their 1024-token cache boundary.
Anthropic (
/v1/messages, streaming + non-streaming): populatesthe already-declared-but-dead
cache_read_input_tokenswith thehit count, and adjusts
input_tokenstoprompt_tokens - cached_tokensso Anthropic's spec identity(
total_input = input + cache_read + cache_creation) holds.cache_creation_input_tokensis intentionally left unset:Anthropic uses it for tokens written between explicit
cache_controlbreakpoints (billed 1.25x base), which has noanalog on a local KV-cache engine with no billing dimension.
Cache field stays
Nonewhen the engine doesn't report.Internally:
Request.cached_tokens→ newRequestOutput.cached_tokens→ new
GenerationOutput.cached_tokens(appended at end per the v0.6.65positional-invariant comment) →
Usage.prompt_tokens_details. Thefield is also propagated through
output_collector._merge_outputs(producer-ahead aggregation under default
aggregate=True) andengine_core._stream_buffersrebuild (whenstream_interval > 1) —both rebuild
RequestOutputpositionally and would silently zero outany new field that isn't threaded explicitly; a regression test in
tests/test_server.py::TestRequestOutputCollectorThreadSafety::test_merge_outputs_preserves_cached_tokenspins this.
17 files, +933 / -40 (12 production files + 5 test files, including
one small helper extraction in
engine_core.pyfor testability and6 new regression-test classes / 17 new test methods covering every
layer end-to-end). No new dependencies. MLLM and dflash paths leave
the field at 0 (they don't run through the prefix-cache scheduler) —
the field is wired and ready when those backends grow reporting.
Why is this needed?
Rapid-MLX advertises 0.08s cached TTFT on the README. The engine
genuinely tracks the per-request hit count —
Request.cached_tokensis set by every prefix-cache lookup path in
scheduler.pyand showsup in the server log line (
"… N cached"). But that count neverreaches the OpenAI-compatible response: clients only see
prompt_tokens/completion_tokens/total_tokens.Three concrete consequences this PR fixes:
Clients that read OpenAI's documented
prompt_tokens_details.cached_tokensfield for costattribution (the field is in the OpenAI cookbook example for
prompt-cache savings tracking) silently get nothing back. The
Metricor project I work on currently runs a tokenizer-side LCP
estimator (a
PrefixCacheProbeclass) as a stopgap because ofexactly this gap; its docstring names this PR as the deletion
trigger. The same gap is filed against LM Studio
(lmstudio-ai/lmstudio-bug-tracker#778)
— it's an ecosystem pattern.
The Anthropic adapter already declares
cache_creation_input_tokensandcache_read_input_tokensonAnthropicUsage(api/anthropic_models.py:78-79),so the response shape promises Anthropic-style cache reporting,
but the adapter never populates them — dead wire today. Anthropic
SDK clients pointed at the
/v1/messagesroute getNones andhave to guess.
Observability dashboards that aggregate cache hit ratio
across local + cloud LLM traffic (a common multi-provider
pattern) have a blind spot on the Rapid-MLX share.
AI assistance disclosure
Claude (Anthropic) wrote the implementation and tests end-to-end.
I (human) drove the scope decisions, reviewed every change before
opening the PR, and verified the test results locally on macOS
(M3 Ultra, Python 3.14,
.venvinstall of the patched tree).Verification I performed:
Request.cached_tokensflow through the scheduler.
ruff check+ruff format --checkon all 17 touched files (clean).changes and re-running a sample on
main(same failures, identicalerrors).
one Anthropic-spec-identity bug (`total_input = input + cache_read
), two silent field-drop paths inoutput_collector._merge_outputsandengine_core._stream_buffers, and three OpenAI-route coverage gaps (scheduler.py:_process_batch_responses,routes/chat.pystreaming usage chunk,routes/completions.py`accumulator) — all now have mutation-verified tests.
production line each test claims to cover, confirmed the test
fails, restored the line. All five teeth-checked.
Test plan
pytest tests/test_routes_cached_tokens.py tests/test_server_utils.py tests/test_anthropic_adapter.py tests/test_api_models.py tests/test_server.py::TestRequestOutputCollectorThreadSafety tests/test_server.py::TestEngineCoreStreamBufferMerge tests/test_anthropic_route_streaming.py tests/test_prefix_cache.py tests/test_request.py tests/test_chat_streaming_spec.py tests/test_chat_streaming_guided.py tests/test_postprocessor.py tests/test_anthropic_route_auth.py tests/test_anthropic_models.py→ 454 passed (17 new tests covering the new field surface end-to-end: helper-layer builders, both Anthropic paths, bothRequestOutputrebuild paths, all three OpenAI route surfaces, and the scheduler'sRequestOutputsource line)ruff check+ruff format --checkon touched files → cleanmainwith my diff stashed — same 4 fail (pre-existing: missingpytest-asyncio,aiohttp, and live-model requirements).Checklist
ruff check && ruff format --checkon touched files)pr_validatelocally (see PR comment for full scorecard — all substantive gates pass; the 10full_unitfailures are pre-existing environment gaps unrelated to this diff, verified by re-running them onmainwith the same result)prompt_tokens_detailsfield onUsage; newcached_tokensattribute appended at the end of bothGenerationOutput(per the v0.6.65 positional-invariant comment inengine/base.py, enforced bytests/test_server_utils.py::TestGenerationOutputFieldOrder) andRequestOutput(no existing invariant, but kept at end for symmetry — all production and test call sites use kwargs).