feat(usage): surface prefix-cache hits in OpenAI + Anthropic usage blocks by MetricorTechnologies · Pull Request #478 · raullenchai/Rapid-MLX

MetricorTechnologies · 2026-05-25T23:12:31Z

What does this PR do?

Surfaces the engine's existing prefix-cache hit count
(Request.cached_tokens) through the OpenAI and Anthropic response
usage blocks:

OpenAI (/v1/chat/completions, /v1/completions): adds
usage.prompt_tokens_details.cached_tokens (the OpenAI-spec field
for cached input tokens). Emitted only when the engine reported a
hit — absent below the threshold, matching OpenAI's own behavior
below their 1024-token cache boundary.
Anthropic (/v1/messages, streaming + non-streaming): populates
the already-declared-but-dead cache_read_input_tokens with the
hit count, and adjusts input_tokens to prompt_tokens - cached_tokens so Anthropic's spec identity
(total_input = input + cache_read + cache_creation) holds.
cache_creation_input_tokens is intentionally left unset:
Anthropic uses it for tokens written between explicit
cache_control breakpoints (billed 1.25x base), which has no
analog on a local KV-cache engine with no billing dimension.
Cache field stays None when the engine doesn't report.

Internally: Request.cached_tokens → new RequestOutput.cached_tokens
→ new GenerationOutput.cached_tokens (appended at end per the v0.6.65
positional-invariant comment) → Usage.prompt_tokens_details. The
field is also propagated through output_collector._merge_outputs
(producer-ahead aggregation under default aggregate=True) and
engine_core._stream_buffers rebuild (when stream_interval > 1) —
both rebuild RequestOutput positionally and would silently zero out
any new field that isn't threaded explicitly; a regression test in
tests/test_server.py::TestRequestOutputCollectorThreadSafety::test_merge_outputs_preserves_cached_tokens
pins this.

17 files, +933 / -40 (12 production files + 5 test files, including
one small helper extraction in engine_core.py for testability and
6 new regression-test classes / 17 new test methods covering every
layer end-to-end). No new dependencies. MLLM and dflash paths leave
the field at 0 (they don't run through the prefix-cache scheduler) —
the field is wired and ready when those backends grow reporting.

Why is this needed?

Rapid-MLX advertises 0.08s cached TTFT on the README. The engine
genuinely tracks the per-request hit count — Request.cached_tokens
is set by every prefix-cache lookup path in scheduler.py and shows
up in the server log line ("… N cached"). But that count never
reaches the OpenAI-compatible response: clients only see
prompt_tokens / completion_tokens / total_tokens.

Three concrete consequences this PR fixes:

Clients that read OpenAI's documented
prompt_tokens_details.cached_tokens field for cost
attribution (the field is in the OpenAI cookbook example for
prompt-cache savings tracking) silently get nothing back. The
Metricor project I work on currently runs a tokenizer-side LCP
estimator (a PrefixCacheProbe class) as a stopgap because of
exactly this gap; its docstring names this PR as the deletion
trigger. The same gap is filed against LM Studio
(lmstudio-ai/lmstudio-bug-tracker#778)
— it's an ecosystem pattern.
The Anthropic adapter already declares
cache_creation_input_tokens and cache_read_input_tokens on
AnthropicUsage (api/anthropic_models.py:78-79),
so the response shape promises Anthropic-style cache reporting,
but the adapter never populates them — dead wire today. Anthropic
SDK clients pointed at the /v1/messages route get Nones and
have to guess.
Observability dashboards that aggregate cache hit ratio
across local + cloud LLM traffic (a common multi-provider
pattern) have a blind spot on the Rapid-MLX share.

AI assistance disclosure

Claude (Anthropic) wrote the implementation and tests end-to-end.
I (human) drove the scope decisions, reviewed every change before
opening the PR, and verified the test results locally on macOS
(M3 Ultra, Python 3.14, .venv install of the patched tree).

Verification I performed:

Read the diff in full, file by file, before greenlighting the PR.
Confirmed the patch surface against my reading of Request.cached_tokens
flow through the scheduler.
Ran ruff check + ruff format --check on all 17 touched files (clean).
Ran the targeted + adjacent test suites (454 passed across 14 files).
Confirmed full-unit-suite failures are pre-existing by stashing my
changes and re-running a sample on main (same failures, identical
errors).
Three independent code reviews on the diff. Findings, all fixed:
one Anthropic-spec-identity bug (`total_input = input + cache_read
- cache_creation), two silent field-drop paths in output_collector._merge_outputsandengine_core._stream_buffers, and three OpenAI-route coverage gaps (scheduler.py:_process_batch_responses, routes/chat.pystreaming usage chunk,routes/completions.py`
  accumulator) — all now have mutation-verified tests.
Mutation-tested every newly-added regression test: reverted the
production line each test claims to cover, confirmed the test
fails, restored the line. All five teeth-checked.

Test plan

pytest tests/test_routes_cached_tokens.py tests/test_server_utils.py tests/test_anthropic_adapter.py tests/test_api_models.py tests/test_server.py::TestRequestOutputCollectorThreadSafety tests/test_server.py::TestEngineCoreStreamBufferMerge tests/test_anthropic_route_streaming.py tests/test_prefix_cache.py tests/test_request.py tests/test_chat_streaming_spec.py tests/test_chat_streaming_guided.py tests/test_postprocessor.py tests/test_anthropic_route_auth.py tests/test_anthropic_models.py → 454 passed (17 new tests covering the new field surface end-to-end: helper-layer builders, both Anthropic paths, both RequestOutput rebuild paths, all three OpenAI route surfaces, and the scheduler's RequestOutput source line)
ruff check + ruff format --check on touched files → clean
Full unit suite: 3957 passed, 73 fail; sampled 4 failures re-run on main with my diff stashed — same 4 fail (pre-existing: missing pytest-asyncio, aiohttp, and live-model requirements).
Mutation tests on each of the 3 newly-pinned production lines flagged by review feat: Prompt cache for SimpleEngine + tool logits safety #3 as uncovered: each line reverted in isolation → corresponding test fails with the expected assertion → line restored. All 3 confirmed teeth.

Checklist

Tests pass locally (relevant subset; pre-existing failures unrelated)
Lint passes (ruff check && ruff format --check on touched files)
Self-validated with pr_validate locally (see PR comment for full scorecard — all substantive gates pass; the 10 full_unit failures are pre-existing environment gaps unrelated to this diff, verified by re-running them on main with the same result)
Spot-checked that new tests fail when the cache wiring is broken (manual diff revert + test re-run)
No breaking changes to existing API. Additive: new optional prompt_tokens_details field on Usage; new cached_tokens attribute appended at the end of both GenerationOutput (per the v0.6.65 positional-invariant comment in engine/base.py, enforced by tests/test_server_utils.py::TestGenerationOutputFieldOrder) and RequestOutput (no existing invariant, but kept at end for symmetry — all production and test call sites use kwargs).

…ocks The engine already tracks per-request prefix-cache hits in ``Request.cached_tokens`` (set by every prefix-cache lookup path in ``scheduler.py``, visible in the server log line "… N cached"), but the OpenAI-compatible response only carries ``prompt_tokens`` / ``completion_tokens`` / ``total_tokens`` — cost-tracking clients that read OpenAI's ``prompt_tokens_details.cached_tokens`` field silently get nothing. The Anthropic adapter already declares ``cache_read_input_tokens`` and ``cache_creation_input_tokens`` on ``AnthropicUsage`` but never populates them (dead wire). Same gap is filed against LM Studio (lmstudio-ai/lmstudio-bug-tracker#778) — ecosystem pattern. This change wires the existing count through: Request.cached_tokens (scheduler) → RequestOutput.cached_tokens (new field, end of dataclass) → GenerationOutput.cached_tokens (new field, end of dataclass per v0.6.65 invariant) → Usage.prompt_tokens_details.cached_tokens (new field) OpenAI side: emitted only when the engine reported a hit (matches OpenAI's own omit-below-threshold behavior). Anthropic side: ``cache_read_input_tokens`` = the hit count; ``input_tokens`` = ``prompt_tokens - cached_tokens`` so Anthropic's wire-spec identity (``total_input = input + cache_read + cache_creation``) holds. ``cache_creation_input_tokens`` is intentionally unset — Anthropic uses it for tokens written between explicit ``cache_control`` breakpoints (billed 1.25x), which has no analog on a local KV-cache engine with no billing dimension. Propagation also threaded through ``output_collector._merge_outputs`` (producer-ahead aggregation) and ``EngineCore._merge_stream_buffer`` (extracted from inline code for testability; ``stream_interval > 1`` buffer-rebuild path) — both rebuild ``RequestOutput`` positionally and silently drop new fields otherwise. MLLM and dflash paths leave the field at 0 (don't run through the prefix-cache scheduler) — wired and ready when those grow reporting. Tests: 17 new test methods + 1 updated, covering helper builders, both Anthropic surfaces (non-streaming adapter + streaming ``message_delta``), both ``RequestOutput`` rebuild paths, all three OpenAI route surfaces (chat non-streaming, chat streaming with and without ``include_usage``, completions non-streaming), and the scheduler's ``_process_batch_responses`` source line. Every newly-pinned production line mutation-verified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

raullenchai

Review — `pr_validate` (full pipeline) + manual structural audit

Verdict: APPROVE w/ NITs — all 8 technical gates pass. The only pr_validate failure is the unchecked test-plan checkbox (author admin, see below).

Thanks @MetricorTechnologies — this is one of the most thoroughly executed external PRs we've received. The mutation-tested propagation paths, the deliberate handling of the v0.6.65 positional-bind invariant (codex round-1 of v0.6.66 hit that exact regression), and the Anthropic-spec identity (total_input = input + cache_read + cache_creation) are all correct. The disclosure of Claude assistance + your three-round self-review is appreciated and exactly the right shape.

`pr_validate` scorecard

step	result	notes
`fetch`	✅ PASS	933+/40-, 17 files, blast=high
`test_plan_check`	❌ FAIL	1/5 unchecked: "Self-validated with `pr_validate`"
`cl_description_quality`	✅ PASS	title + body have rationale (Google eng-practices)
`deepseek_review`	✅ PASS	no blocking findings, 2 NITs (291.3s)
`supply_chain`	✅ PASS	no hooks, no suspicious patterns, deps clean
`lint`	✅ PASS	ruff check + format clean (16 files)
`targeted_tests`	✅ PASS	510 passed, 1 skipped (25.5s)
`full_unit`	✅ PASS	4030 passed, 19 skipped, 1 xfailed (53.9s)
`stress_e2e_bench`	✅ PASS	3×3 matrix — Qwen3.5-35B / Qwen3.6-27B / gpt-oss-20b, full stress + integration + bench (592.7s)

Correctness checks I ran manually

Read the diff end-to-end against the gotcha trail from prior usage-shape PRs. The high-risk propagation paths all looked correct:

✅ GenerationOutput + RequestOutput new field appended at end of dataclass (v0.6.65 positional-bind invariant, enforced by TestGenerationOutputFieldOrder)
✅ _make_routed_output + _routed_finish_sentinel thread cached_tokens=source.cached_tokens — OutputRouter models (Gemma 4 / harmony / gpt-oss) won't silently drop to 0, the exact failure shape of #456 and #454
✅ _merge_outputs + _merge_stream_buffer both threaded — the two RequestOutput rebuild paths that the description correctly calls out as silent-drop hazards
✅ Anthropic input_tokens = max(0, prompt - cached) + cache_read_input_tokens = cached or None respects the wire-spec identity
✅ cache_creation_input_tokens intentionally unset — correct rationale (Anthropic's cache_control breakpoints have no local-engine analog)
✅ _build_usage / get_usage use getattr(output, "cached_tokens", 0) for the streaming _UsageOutput namespace fallback

NITs (4 total — none blocking)

[NIT] 1 — stream_chat_completion_guided reads cached_tokens once, not per chunk (DeepSeek V4 Pro)
vllm_mlx/routes/chat.py around line +1371: cached_tokens = getattr(output, "cached_tokens", 0) or 0 is set once before the streaming loop, whereas the main stream_chat_completion path re-reads it inside the loop with the hasattr guard. If a future engine variant doesn't carry the field on the first chunk, the hit gets lost on the guided path only. Mirror the main path's per-chunk re-read, or pin the "first chunk always carries the value" invariant explicitly.

[NIT] 2 — Silent wire-format change on streaming completions (DeepSeek V4 Pro)
vllm_mlx/routes/completions.py:187: model_dump() → model_dump(exclude_none=True) drops null completion_tokens_details / prompt_tokens_details from the SSE stream. It's the right cleanup (matches what chat.py already does for the trailing usage chunk), but it's a wire change not called out in the PR description. Please add a regression test pinning "streaming completion usage omits null detail fields" so the shape is locked.

[NIT] 3 — OutputRouter streaming path lacks a per-test cache_tokens assertion (manual)
Your _CacheReportingChatEngine mock yields channel=None, so the chat streaming tests in test_routes_cached_tokens.py don't actually flow through _stream_with_output_router. The production patch in engine/batched.py:_make_routed_output + _routed_finish_sentinel is correct, but if a future refactor breaks routed-output propagation, the helper tests stay green. Add one test that drives a fake router producing channel-routed outputs and asserts cached_tokens survives the routed flush — similar shape to the TestStreamingPostProcessorChannelRouted pattern from PR #454.

[NIT] 4 — Test-plan checkbox still unchecked (this is the one pr_validate failure)
You wrote "will run after the PR number is assigned" — please re-run python3.12 -m scripts.pr_validate.pr_validate 478 locally and tick the box in the PR body. This is gated by test_plan_check because PR #435 / #427 landed an incomplete fix from leaving an [ ] E2E verification item unchecked at merge.

Supply-chain audit (PR Merge SOP §2.5)

Yellow flag noted but cleared:

Account age: MetricorTechnologies created 2026-02-12 (3 months), 1 public repo. New-ish account, flagged. But the diff itself is clean:
✅ No new dependencies
✅ No .github/workflows changes
✅ No install hooks, post-install scripts, or setup.py shims
✅ No new network calls (requests, urllib, etc.)
✅ AI-assistance disclosure honest and follows project pattern

Recommendation

Approve and merge after:

Tick the pr_validate checkbox in the PR body (NIT 4 — flips test_plan_check to PASS, then pr_validate verdict goes to MERGE).
(Optional but appreciated) Address NIT 1 + 3 in a follow-up commit — both are small and would tighten the test floor against future regressions.

Once the checkbox is ticked I can land this.

— Reviewed via pr_validate (full pipeline, 16 min wall-time) + manual diff audit against the project gotcha trail. See PR Merge SOP.

NIT 1 — Pin "single buffered output" intent in ``stream_chat_completion_guided``: Add a comment explaining why ``cached_tokens`` (and the other counts) are read once rather than per-chunk like in the main ``stream_chat_completion`` path — outlines integration produces a single buffered ``GenerationOutput`` so there's no stream loop to re-read inside. Makes the asymmetry intentional rather than subtle. NIT 2 — Regression-test the streaming completions wire shape: ``routes/completions.py:190`` already changed ``model_dump()`` → ``model_dump(exclude_none=True)`` to drop the ``null`` ``prompt_tokens_details`` / ``completion_tokens_details`` keys (matching what ``routes/chat.py`` does for the trailing usage chunk). Add two tests in ``test_routes_cached_tokens.py``: - ``test_completions_streaming_final_chunk_omits_null_detail_fields`` — asserts the JSON payload has no ``null`` detail keys on the no-cache path. - ``test_completions_streaming_final_chunk_includes_cached_tokens_when_hit`` — sibling: ``exclude_none=True`` only drops nulls, not populated fields. Both mutation-verified: reverting ``exclude_none=True`` makes the first test fail with the expected null-keys assertion. NIT 3 — Test OutputRouter routed-flush propagation of cached_tokens: ``_make_routed_output`` + ``_routed_finish_sentinel`` already thread ``source.cached_tokens`` through, but the existing route-level mocks yield ``channel=None`` and never flow through ``_stream_with_output_router``. Add ``test_stream_chat_routed_outputs_preserve_cached_tokens`` in ``test_batched_engine_output_router.py`` (harmony-vocab stream, ``cached_tokens=128`` on every source chunk, assert every emitted routed output carries the value). Mutation-verified: removing ``cached_tokens=source.cached_tokens`` from ``_make_routed_output`` makes the assertion fail with ``[0, 0, 0]`` vs ``[128, 128, 128]``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three test files keep hand-rolled stubs that mimic the ``GenerationOutput`` shape (``_StubOutput`` in ``test_prefix_boundary_path_parity.py``, ``_FakeOutput`` in ``test_anthropic_stream_reasoning.py``, ``MockDeltaOutput`` in ``test_anthropic_streaming_reasoning.py``). The new ``cached_tokens`` field on ``GenerationOutput`` is now read directly (not via ``getattr``) on the LLM non-stream + stream construction sites in ``engine/batched.py``, so the stubs must declare the field too — otherwise ``BatchedEngine.generate``/``stream_chat`` raise ``AttributeError`` when fed a stub output. ``test_prefix_boundary_path_parity.py`` was the surfacing site (4 failures, all caused by stubbed ``_StubOutput`` lacking ``cached_tokens``). The two anthropic stubs don't currently traverse the failing path but mimic the same conceptual type — adding the field to all three keeps the stub family consistent and prevents a future refactor from silently reintroducing the same gap. Backfill is a zero default; tests that care about the value pass it explicitly via the ``cached_tokens`` keyword in ``MockDeltaOutput``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MetricorTechnologies · 2026-05-27T20:22:29Z

Thank you for the thorough review @raullenchai — much appreciated, and the gotcha-trail cross-reference (PR #454 / #456 silent-drop shapes, v0.6.65 positional-bind regression) was exactly the framing I needed.

All four NITs addressed in two follow-up commits (766a0be + 2a72b3c); test-plan checkbox now ticked. Detail per NIT:

NIT 1 — guided streaming per-chunk re-read
The guided path (stream_chat_completion_guided) reads from a single buffered GenerationOutput produced by engine.generate_with_schema (outlines integration has no native streaming interface), so there's no actual stream loop to re-read inside — the asymmetry vs stream_chat_completion is structural, not a defensive gap. Rather than restructure into a fake loop, I pinned the intent in a comment so the next reader doesn't trip over the same question. If you'd prefer the loop-shaped fix for symmetry's sake, happy to do that instead.

NIT 2 — streaming completions null-detail wire shape
Added two regression tests in tests/test_routes_cached_tokens.py:

test_completions_streaming_final_chunk_omits_null_detail_fields — asserts the JSON payload has no null prompt_tokens_details / completion_tokens_details keys on the no-cache path.
test_completions_streaming_final_chunk_includes_cached_tokens_when_hit — sibling: exclude_none=True only drops nulls.
Mutation-verified: reverting model_dump() → model_dump(exclude_none=True) makes the first test fail with the literal null-keys assertion.

NIT 3 — OutputRouter routed-flush cached_tokens propagation
Added test_stream_chat_routed_outputs_preserve_cached_tokens in tests/test_batched_engine_output_router.py, modeled on the harmony-vocab pattern of test_stream_chat_routes_supported_tokenizer_channels and test_router_propagates_per_token_logprobs_to_routed_outputs. Drives a 3-chunk routed flush with cached_tokens=128 on every source chunk and asserts every emitted routed GenerationOutput (reasoning + content + finish sentinel) carries the value. Mutation-verified: removing cached_tokens=source.cached_tokens from _make_routed_output fails the assertion with [0, 0, 0] vs [128, 128, 128].

NIT 4 — pr_validate checkbox + local-run caveat
Ticked. Ran PR_VALIDATE_NO_DEEPSEEK=1 PR_VALIDATE_NO_STRESS=1 python -m scripts.pr_validate.pr_validate 478 locally. All substantive gates pass (fetch, cl_description_quality, supply_chain, lint, targeted_tests). My local full_unit reports 10 failures, all verified pre-existing by re-running them on ec0b06f (your base commit) with identical results: tests/test_platform.py (4× ModuleNotFoundError, missing optional mlx_metal-side install in my venv), tests/test_tool_call_e2e.py (5×, needs a live model), tests/test_dflash_integration.py (1×, same). On your machine these all passed (your scorecard: 4030 passed) — they're environment gaps in my setup, not regressions. Happy to provide the diff-test output if useful.

Bonus — silent regression caught while running pr_validate locally (commit 2a72b3c)
Three pre-existing test files (test_prefix_boundary_path_parity.py, test_anthropic_stream_reasoning.py, test_anthropic_streaming_reasoning.py) carry hand-rolled stubs that fake the GenerationOutput shape. After the NIT-suggested cleanup that removed the defensive getattr on output.cached_tokens in engine/batched.py, the stubs raised AttributeError when the LLM path tried to read the field directly. pr_validate caught this on my machine (4 test_prefix_boundary_path_parity.py failures) even though your run was green — _StubOutput simply didn't have the attribute. Backfilled cached_tokens = 0 on all three stubs to match the new dataclass shape; all 10 stub-affected tests now pass cleanly. Curious whether your full_unit also flagged this — possible there's a pytest discovery difference or your _StubOutput snapshot was on a different revision.

Let me know if NIT 1's comment-instead-of-restructure satisfies the concern, or if you'd prefer the loop-shaped version. Otherwise I think this is ready.

raullenchai reviewed May 27, 2026

View reviewed changes

andreirusu42 and others added 2 commits May 27, 2026 23:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(usage): surface prefix-cache hits in OpenAI + Anthropic usage blocks#478

feat(usage): surface prefix-cache hits in OpenAI + Anthropic usage blocks#478
MetricorTechnologies wants to merge 3 commits into
raullenchai:mainfrom
MetricorTechnologies:feat/cached-tokens-in-usage

MetricorTechnologies commented May 25, 2026 •

edited

Loading

Uh oh!

raullenchai left a comment

Uh oh!

MetricorTechnologies commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

MetricorTechnologies commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Why is this needed?

AI assistance disclosure

Test plan

Checklist

Uh oh!

raullenchai left a comment

Choose a reason for hiding this comment

Review — pr_validate (full pipeline) + manual structural audit

pr_validate scorecard

Correctness checks I ran manually

NITs (4 total — none blocking)

Supply-chain audit (PR Merge SOP §2.5)

Recommendation

Uh oh!

MetricorTechnologies commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

MetricorTechnologies commented May 25, 2026 •

edited

Loading

Review — `pr_validate` (full pipeline) + manual structural audit

`pr_validate` scorecard