Expose per-run cached prompt tokens metric #43
Merged
Parse usage.prompt_tokens_details.cached_tokens from upstream vLLM responses and emit it as vllm_router_run_cached_prompt_tokens_total labeled by run_id. The existing prompt_tokens counter is unchanged (still reports total input tokens including cached); the new counter is purely additive so downstream billing can apply a separate price to the cached portion.
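Because the existing prompt-token total already includes the cached portion, a downstream consumer would presumably derive the uncached share by subtraction before pricing the two parts separately. The following is a minimal sketch of that arithmetic, not code from this PR; the `Prices` struct, the function names, and the price units are hypothetical.

```rust
/// Hypothetical per-token prices (arbitrary integer units); not taken from the PR.
pub struct Prices {
    pub uncached_per_token: u64,
    pub cached_per_token: u64,
}

/// Splits a total prompt-token count into (uncached, cached).
/// The total already includes cached tokens, so the uncached share is the
/// difference. The cached count is clamped to the total as a defensive
/// measure in case the two counters are ever read out of sync.
pub fn split_prompt_tokens(prompt_tokens_total: u64, cached_total: u64) -> (u64, u64) {
    let cached = cached_total.min(prompt_tokens_total);
    (prompt_tokens_total - cached, cached)
}

/// Prices the uncached and cached shares separately and sums them.
pub fn prompt_cost(prompt_tokens_total: u64, cached_total: u64, prices: &Prices) -> u64 {
    let (uncached, cached) = split_prompt_tokens(prompt_tokens_total, cached_total);
    uncached * prices.uncached_per_token + cached * prices.cached_per_token
}
```

With 1000 total prompt tokens of which 400 were cached, the split is 600 uncached and 400 cached; a run with no cached tokens is priced entirely at the uncached rate.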
Approvability
Verdict: Approved
Additive metrics change that exposes a new counter for cached prompt tokens. The implementation is backward-compatible (handles missing fields gracefully), well-tested, and doesn't alter existing billing or processing logic.
@codex review
Codex Review: Didn't find any major issues. 🎉
Parses usage.prompt_tokens_details.cached_tokens from vLLM responses and emits it as a new per-run counter so the platform can price cached input separately.
- Adds vllm_router_run_cached_prompt_tokens_total{run_id}
- prompt_tokens_total is unchanged (still total input including cached)
Note
Low Risk
Additive metrics and optional JSON parsing on existing usage extraction; no change to prompt-token totals or request counting semantics.
Overview
Adds per-run billing visibility for KV/prefix-cached prompt tokens from vLLM without changing how total prompt tokens are counted.
Upstream usage.prompt_tokens_details.cached_tokens is parsed (optional; defaults to 0 when missing) in usage_metrics and passed into RouterMetrics::record_run_usage. A new counter vllm_router_run_cached_prompt_tokens_total{run_id} records that subset only when it is > 0; vllm_router_run_prompt_tokens_total still reflects full input tokens. Tests cover present, absent, and empty prompt_tokens_details shapes.
Reviewed by Cursor Bugbot for commit 33c2e41.
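The behavior described above (optional field, default of 0, increment only when positive) can be modeled with the standard library alone. This is a sketch under stated assumptions rather than the code in metrics.rs or usage_metrics.rs: the `Usage`, `PromptTokensDetails`, and `RunCounters` types are hypothetical stand-ins, plain `HashMap`s replace the real metrics registry, and the `record_run_usage` signature shown here is assumed.

```rust
use std::collections::HashMap;

/// Hypothetical mirror of the upstream `prompt_tokens_details` object.
#[derive(Default)]
pub struct PromptTokensDetails {
    pub cached_tokens: Option<u64>,
}

/// Hypothetical mirror of the upstream `usage` object.
#[derive(Default)]
pub struct Usage {
    pub prompt_tokens: u64,
    pub prompt_tokens_details: Option<PromptTokensDetails>,
}

/// Returns the cached prompt tokens, defaulting to 0 when either
/// `prompt_tokens_details` or `cached_tokens` is absent.
pub fn cached_prompt_tokens(usage: &Usage) -> u64 {
    usage
        .prompt_tokens_details
        .as_ref()
        .and_then(|details| details.cached_tokens)
        .unwrap_or(0)
}

/// Stand-in for the router's per-run counters (assumption: the real
/// implementation uses a metrics library, not plain maps).
#[derive(Default)]
pub struct RunCounters {
    pub prompt_tokens_total: HashMap<String, u64>,
    pub cached_prompt_tokens_total: HashMap<String, u64>,
}

impl RunCounters {
    pub fn record_run_usage(&mut self, run_id: &str, usage: &Usage) {
        // The total counter always reflects full input tokens, cached included.
        *self.prompt_tokens_total.entry(run_id.to_string()).or_insert(0) +=
            usage.prompt_tokens;

        // The cached counter is touched only when there is something to add,
        // so runs without cache hits never create a series.
        let cached = cached_prompt_tokens(usage);
        if cached > 0 {
            *self.cached_prompt_tokens_total.entry(run_id.to_string()).or_insert(0) += cached;
        }
    }
}
```

The three input shapes mentioned in the tests (present, absent, and empty prompt_tokens_details) correspond here to `Some` with a value, `None`, and `Some(PromptTokensDetails::default())`.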
Note
Expose per-run cached prompt tokens metric from upstream KV/prefix cache
- Adds the vllm_router_run_cached_prompt_tokens_total counter in metrics.rs, incremented per run_id when cached tokens are present.
- Parses prompt_tokens_details.cached_tokens from upstream usage responses in usage_metrics.rs and passes the value to RouterMetrics::record_run_usage.
- Adds tests covering the present, absent, and empty prompt_tokens_details cases.
Macroscope summarized 33c2e41.