Expose per-run cached prompt tokens metric #43

Merged
JannikSt merged 1 commit into main from improvement/cache-hit-metrics
Jun 12, 2026
Conversation

@JannikSt JannikSt commented Jun 11, 2026

Parses usage.prompt_tokens_details.cached_tokens from vLLM responses and emits it as a new per-run counter so the platform can price cached input separately.

  • New metric: vllm_router_run_cached_prompt_tokens_total{run_id}
  • Existing vllm_router_run_prompt_tokens_total is unchanged (it still counts total input tokens, cached included)
  • Field is optional in the parser, so non-vLLM upstreams continue to work

Note

Low Risk
Additive metrics and optional JSON parsing on existing usage extraction; no change to prompt-token totals or request counting semantics.

Overview
Adds per-run billing visibility for KV/prefix-cached prompt tokens from vLLM without changing how total prompt tokens are counted.

Upstream usage.prompt_tokens_details.cached_tokens is parsed (optional; defaults to 0 when missing) in usage_metrics and passed into RouterMetrics::record_run_usage. A new counter vllm_router_run_cached_prompt_tokens_total{run_id} records that subset only when it is > 0; vllm_router_run_prompt_tokens_total still reflects full input tokens. Tests cover present, absent, and empty prompt_tokens_details shapes.
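The behaviour described above can be sketched in plain Rust. This is an illustrative sketch, not the code from this PR: the `Usage`, `PromptTokensDetails`, and `RunCounters` types and the `cached_prompt_tokens` helper are hypothetical stand-ins, JSON deserialization is omitted, and in-memory maps take the place of the real Prometheus counters behind `RouterMetrics`.

```rust
use std::collections::HashMap;

/// Hypothetical mirror of `usage.prompt_tokens_details`; the real parser's types may differ.
#[derive(Debug, Default, Clone)]
pub struct PromptTokensDetails {
    pub cached_tokens: Option<u64>,
}

/// Hypothetical mirror of the upstream `usage` object.
#[derive(Debug, Default, Clone)]
pub struct Usage {
    pub prompt_tokens: u64,
    pub prompt_tokens_details: Option<PromptTokensDetails>,
}

/// A missing `prompt_tokens_details` object and a missing `cached_tokens`
/// field are both treated as 0 cached tokens.
pub fn cached_prompt_tokens(usage: &Usage) -> u64 {
    usage
        .prompt_tokens_details
        .as_ref()
        .and_then(|details| details.cached_tokens)
        .unwrap_or(0)
}

/// Stand-in for the per-run counters, keyed by `run_id` (assumption: not the real RouterMetrics).
#[derive(Debug, Default)]
pub struct RunCounters {
    pub prompt_tokens_total: HashMap<String, u64>,
    pub cached_prompt_tokens_total: HashMap<String, u64>,
}

impl RunCounters {
    pub fn record_run_usage(&mut self, run_id: &str, usage: &Usage) {
        // Total prompt tokens always include the cached portion.
        *self
            .prompt_tokens_total
            .entry(run_id.to_string())
            .or_insert(0) += usage.prompt_tokens;

        // The cached counter is only touched when there is something to add,
        // so runs without cache hits never create a series for it.
        let cached = cached_prompt_tokens(usage);
        if cached > 0 {
            *self
                .cached_prompt_tokens_total
                .entry(run_id.to_string())
                .or_insert(0) += cached;
        }
    }
}
```

Under these assumptions, the three shapes the tests cover (details present, details absent, details present but empty) all go through `cached_prompt_tokens`, and only the first one increments the cached counter.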

Reviewed by Cursor Bugbot for commit 33c2e41.

Note

Expose per-run cached prompt tokens metric from upstream KV/prefix cache

  • Adds a new vllm_router_run_cached_prompt_tokens_total counter in metrics.rs, incremented per run_id when cached tokens are present.
  • Parses prompt_tokens_details.cached_tokens from upstream usage responses in usage_metrics.rs and passes the value to RouterMetrics::record_run_usage.
  • Adds three unit tests covering presence, absence, and empty prompt_tokens_details cases.

Macroscope summarized 33c2e41.

Parse usage.prompt_tokens_details.cached_tokens from upstream vLLM
responses and emit it as vllm_router_run_cached_prompt_tokens_total
labeled by run_id. The existing prompt_tokens counter is unchanged
(still reports total input tokens including cached); the new counter
is purely additive so downstream billing can apply a separate price
to the cached portion.
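Because the prompt-token counter includes the cached portion, a downstream consumer would presumably subtract the cached count before pricing the remainder. A minimal sketch of that arithmetic follows; the function names and the per-token prices in nano-dollars are hypothetical and do not come from this PR or from any real price list.

```rust
/// Splits the total prompt tokens into (uncached, cached).
/// The cached count is clamped to the total so a malformed report cannot underflow.
pub fn split_prompt_tokens(prompt_total: u64, cached: u64) -> (u64, u64) {
    let cached = cached.min(prompt_total);
    (prompt_total - cached, cached)
}

/// Input cost in nano-dollars, given hypothetical per-token prices in nano-dollars.
pub fn input_cost_nanos(
    prompt_total: u64,
    cached: u64,
    price_per_token: u64,
    cached_price_per_token: u64,
) -> u64 {
    let (uncached, cached) = split_prompt_tokens(prompt_total, cached);
    uncached * price_per_token + cached * cached_price_per_token
}
```

For example, with 100 prompt tokens of which 40 were cached, a full price of 1,000 and a cached price of 250 per token, the cost is 60 × 1,000 + 40 × 250 = 70,000 nano-dollars.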

macroscopeapp Bot commented Jun 11, 2026

Approvability

Verdict: Approved

Additive metrics change that exposes a new counter for cached prompt tokens. The implementation is backward-compatible (handles missing fields gracefully), well-tested, and doesn't alter existing billing or processing logic.

@JannikSt

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. 🎉

@JannikSt JannikSt merged commit 045cba2 into main Jun 12, 2026
9 checks passed
@JannikSt JannikSt mentioned this pull request Jun 12, 2026