Skip to content

Log per-server inference metrics#2650

Open
samsja wants to merge 4 commits into
mainfrom
codex/per-server-inference-metrics
Open

Log per-server inference metrics#2650
samsja wants to merge 4 commits into
mainfrom
codex/per-server-inference-metrics

Conversation

@samsja
Copy link
Copy Markdown
Member

@samsja samsja commented May 27, 2026

Summary

  • add per-server inference metric scopes under inference/server/<server>/...
  • add direct server liveness metrics via .../up
  • add KV cache hit aliases and remaining KV cache metrics
  • keep aggregate and P/D role-aware metrics intact

Testing

  • /shared/research-prod/prime-rl/.venv/bin/ruff check src/prime_rl/orchestrator/inference_metrics.py tests/unit/orchestrator/test_inference_metrics.py
  • PYTHONPATH=src:packages/prime-rl-configs/src /shared/research-prod/prime-rl/.venv/bin/python -m pytest tests/unit/orchestrator/test_inference_metrics.py

Note: uv run was not usable in the standalone /tmp worktree because workspace submodules were not initialized there, so I used the existing repo venv against the /tmp worktree.


Note

Low Risk
Observability-only changes to metrics naming and W&B logging; no changes to training, auth, or request handling paths.

Overview
Extends the orchestrator’s W&B inference metrics so you can see each vLLM admin endpoint separately, without changing existing aggregate or prefill/decode role scopes.

Per-server scopes are emitted under inference_debug/server_XX_<host>_<port>/... (stable names from client URL + index). inference_debug/<server>/up tracks liveness for every configured server; failed polls still log up and drop smoothed per-server series so stale values don’t linger. build_scope_metrics now accepts a namespace (default inference) so the same metric builder drives both global and per-server keys.

Adds KV cache aliases on the logged dict: prefix hit rates mirror as kv_cache_hit_rate / cpu_kv_cache_hit_rate, and usage mean/max map to remaining cache metrics (kv_cache_left_perc_*, cpu_kv_cache_left_perc_*).

Reviewed by Cursor Bugbot for commit 62512a1. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread src/prime_rl/orchestrator/inference_metrics.py
@samsja samsja force-pushed the codex/per-server-inference-metrics branch from 78dd289 to 6e1758e Compare May 27, 2026 05:09
@S1ro1
Copy link
Copy Markdown
Collaborator

S1ro1 commented May 31, 2026

Can we move this to a separate wandb section? This pollutes the metrics seen in inference quite a lot, would rather make this a separate debug section or something like taht

Per S1ro1 review feedback: keeps the inference/ W&B section uncluttered
by moving per-server scopes (and the .../up liveness gauge) under
inference_debug/server/<name>/... instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 852015d. Configure here.

metrics[f"{prefix}cpu_kv_cache_left_perc_mean"] = cls.cache_left(value)
elif key.endswith("/cpu_kv_cache_usage_max"):
prefix = key.removesuffix("cpu_kv_cache_usage_max")
metrics[f"{prefix}cpu_kv_cache_left_perc_min"] = cls.cache_left(value)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unreachable elif branches in cache alias method

Medium Severity

Three elif branches in add_cache_alias_metrics are unreachable dead code. The cpu_-prefixed metric keys (e.g. inference/agg/cpu_prefix_cache_hit_rate) are always caught by earlier branches because "/cpu_prefix_cache_hit_rate".endswith("/prefix_cache_hit_rate") is True. Branches checking for /cpu_prefix_cache_hit_rate, /cpu_kv_cache_usage_mean, and /cpu_kv_cache_usage_max can never execute. The output happens to be correct today because removesuffix with the shorter suffix preserves the cpu_ prefix, but this is coincidental and fragile.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 852015d. Configure here.

"""Do not keep logging stale per-server metrics when a server fails to respond."""
for key in list(smoothed_metrics):
if key.startswith("inference_debug/server/") and key not in current_metrics:
del smoothed_metrics[key]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale history contaminates metrics after server recovery

Low Severity

drop_stale_server_metrics removes stale per-server keys from smoothed_metrics but does not clear the corresponding deques in self.metric_history. When a downed server recovers, smooth_metrics averages the first new value with up to 19 stale pre-outage values still sitting in the deque, producing inaccurate smoothed output for up to 100 seconds. For example, a server that had 100 running requests before an outage and 5 after recovery would initially report ~95.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 852015d. Configure here.

S1ro1
S1ro1 previously approved these changes Jun 1, 2026
Copy link
Copy Markdown
Collaborator

@S1ro1 S1ro1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Drop the redundant `server/` segment so per-server metrics are logged
under `inference_debug/<server>/...` instead of
`inference_debug/server/<server>/...`.

async def start(self):
wandb.define_metric("inference/*", step_metric="_timestamp")
wandb.define_metric("inference_debug/*", step_metric="_timestamp")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inference debug?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants