Log per-server inference metrics#2650
Conversation
78dd289 to
6e1758e
Compare
|
Can we move this to a separate wandb section? This pollutes the metrics seen in |
Per S1ro1 review feedback: keeps the inference/ W&B section uncluttered by moving per-server scopes (and the .../up liveness gauge) under inference_debug/server/<name>/... instead. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 852015d. Configure here.
| metrics[f"{prefix}cpu_kv_cache_left_perc_mean"] = cls.cache_left(value) | ||
| elif key.endswith("/cpu_kv_cache_usage_max"): | ||
| prefix = key.removesuffix("cpu_kv_cache_usage_max") | ||
| metrics[f"{prefix}cpu_kv_cache_left_perc_min"] = cls.cache_left(value) |
There was a problem hiding this comment.
Unreachable elif branches in cache alias method
Medium Severity
Three elif branches in add_cache_alias_metrics are unreachable dead code. The cpu_-prefixed metric keys (e.g. inference/agg/cpu_prefix_cache_hit_rate) are always caught by earlier branches because "/cpu_prefix_cache_hit_rate".endswith("/prefix_cache_hit_rate") is True. Branches checking for /cpu_prefix_cache_hit_rate, /cpu_kv_cache_usage_mean, and /cpu_kv_cache_usage_max can never execute. The output happens to be correct today because removesuffix with the shorter suffix preserves the cpu_ prefix, but this is coincidental and fragile.
Reviewed by Cursor Bugbot for commit 852015d. Configure here.
| """Do not keep logging stale per-server metrics when a server fails to respond.""" | ||
| for key in list(smoothed_metrics): | ||
| if key.startswith("inference_debug/server/") and key not in current_metrics: | ||
| del smoothed_metrics[key] |
There was a problem hiding this comment.
Stale history contaminates metrics after server recovery
Low Severity
drop_stale_server_metrics removes stale per-server keys from smoothed_metrics but does not clear the corresponding deques in self.metric_history. When a downed server recovers, smooth_metrics averages the first new value with up to 19 stale pre-outage values still sitting in the deque, producing inaccurate smoothed output for up to 100 seconds. For example, a server that had 100 running requests before an outage and 5 after recovery would initially report ~95.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 852015d. Configure here.
Drop the redundant `server/` segment so per-server metrics are logged under `inference_debug/<server>/...` instead of `inference_debug/server/<server>/...`.
|
|
||
| async def start(self): | ||
| wandb.define_metric("inference/*", step_metric="_timestamp") | ||
| wandb.define_metric("inference_debug/*", step_metric="_timestamp") |


Summary
inference/server/<server>/....../upTesting
/shared/research-prod/prime-rl/.venv/bin/ruff check src/prime_rl/orchestrator/inference_metrics.py tests/unit/orchestrator/test_inference_metrics.pyPYTHONPATH=src:packages/prime-rl-configs/src /shared/research-prod/prime-rl/.venv/bin/python -m pytest tests/unit/orchestrator/test_inference_metrics.pyNote:
uv runwas not usable in the standalone/tmpworktree because workspace submodules were not initialized there, so I used the existing repo venv against the/tmpworktree.Note
Low Risk
Observability-only changes to metrics naming and W&B logging; no changes to training, auth, or request handling paths.
Overview
Extends the orchestrator’s W&B inference metrics so you can see each vLLM admin endpoint separately, without changing existing aggregate or prefill/decode role scopes.
Per-server scopes are emitted under
inference_debug/server_XX_<host>_<port>/...(stable names from client URL + index).inference_debug/<server>/uptracks liveness for every configured server; failed polls still logupand drop smoothed per-server series so stale values don’t linger.build_scope_metricsnow accepts anamespace(defaultinference) so the same metric builder drives both global and per-server keys.Adds KV cache aliases on the logged dict: prefix hit rates mirror as
kv_cache_hit_rate/cpu_kv_cache_hit_rate, and usage mean/max map to remaining cache metrics (kv_cache_left_perc_*,cpu_kv_cache_left_perc_*).Reviewed by Cursor Bugbot for commit 62512a1. Bugbot is set up for automated code reviews on this repo. Configure here.