feat: add Prometheus metrics collection for gRPC server mode #12760

ConnorLi96 wants to merge 4 commits into NVIDIA:main from
Conversation
📝 Walkthrough

A new metrics collection feature is introduced for the gRPC server. A background async loop periodically captures iteration statistics from the LLM, while the request manager logs per-request Prometheus metrics. Prometheus multiprocess support is initialized during server startup.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Server as GrpcServer
    participant StatsLoop as Stats Loop
    participant LLM as LLM Engine
    participant MetricsCollector as Metrics Collector
    participant ReqMgr as RequestManager
    participant Prometheus as Prometheus
    Server->>MetricsCollector: Initialize with model & engine labels
    Server->>ReqMgr: Create with metrics_collector
    Server->>StatsLoop: Start background task
    Note over StatsLoop: Every ~1 second
    StatsLoop->>LLM: get_stats_async(timeout=0.5)
    LLM-->>StatsLoop: Latest iteration stats
    StatsLoop->>MetricsCollector: log_iteration_stats(stat)
    MetricsCollector->>Prometheus: Record metrics
    Server->>ReqMgr: Process request (generate)
    ReqMgr->>LLM: Stream results
    LLM-->>ReqMgr: GenerationResult
    alt Result finished
        ReqMgr->>MetricsCollector: log_request_metrics_dict(result.metrics_dict)
        MetricsCollector->>Prometheus: Record request metrics
    end
    Note over Server: On shutdown
    Server->>StatsLoop: Cancel task
    StatsLoop-->>Server: Task cancelled
    Server->>ReqMgr: Stop
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/grpc/grpc_request_manager.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor

Update copyright year to include 2026.
The file is being meaningfully modified but the copyright header still shows 2024. As per coding guidelines, the copyright year should reflect the latest meaningful modification.
```diff
-# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/grpc/grpc_request_manager.py` at line 1, Update the copyright header in the file by changing the year range from "2024" to include 2026 (e.g., "2024-2026" or "2026" per project convention); modify the top-of-file SPDX/comment block (the existing copyright text line) to reflect the new year range so the header matches the current meaningful modification.
🧹 Nitpick comments (3)
tensorrt_llm/grpc/grpc_request_manager.py (1)
56-64: Consider adding type hints for the metrics_collector parameter.

The parameter lacks a type annotation. Adding Optional[MetricsCollector] would improve IDE support and documentation, consistent with the docstring mentioning it's optional.

```diff
+from typing import Optional
+from tensorrt_llm.metrics.collector import MetricsCollector as MetricsCollectorType
+
 class GrpcRequestManager:
-    def __init__(self, llm: Any, metrics_collector=None):
+    def __init__(self, llm: Any, metrics_collector: Optional[MetricsCollectorType] = None):
```

Alternatively, a forward reference string "MetricsCollector" could be used to avoid circular imports if needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/grpc/grpc_request_manager.py` around lines 56-64, add a type annotation for the metrics_collector parameter on the __init__ method of the gRPC request manager so it reads as Optional[MetricsCollector] (or "MetricsCollector" as a forward-reference string to avoid circular imports) and import typing.Optional at top; update the signature def __init__(self, llm: Any, metrics_collector: Optional["MetricsCollector"] = None) -> None and ensure the attribute self._metrics_collector retains the same name; this improves IDE help and matches the docstring.

tensorrt_llm/commands/serve.py (2)
321-327: Consider adding type hints for function parameters.

The function parameters lack type annotations, which would improve code clarity and IDE support.

```diff
-async def _grpc_iteration_stats_loop(llm, metrics_collector) -> None:
+async def _grpc_iteration_stats_loop(llm: "LLM | PyTorchLLM", metrics_collector: "MetricsCollector") -> None:
```

Using string literals for forward references avoids import-ordering issues.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/commands/serve.py` around lines 321 - 327, The function _grpc_iteration_stats_loop is missing parameter type annotations; add type hints for llm and metrics_collector (e.g., llm: "TensorRTLLM" or a suitable LLM interface and metrics_collector: "MetricsCollector" or typing.Any if types are not available) and keep the return type None as-is; use string literals for forward references to avoid import-order issues and/or import typing.Any from typing as a safe fallback so IDEs and linters get proper signatures without causing circular imports.
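The forward-reference approach both comments suggest is commonly done with a `typing.TYPE_CHECKING` guard; a minimal sketch, where the `tensorrt_llm.metrics.collector` import path is taken from the diff above and the class body is a stand-in:

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Any, Optional

if TYPE_CHECKING:
    # Seen only by type checkers, never executed at runtime, so there is
    # no circular-import risk even if the module is unavailable.
    from tensorrt_llm.metrics.collector import MetricsCollector

class GrpcRequestManager:
    def __init__(self, llm: Any,
                 metrics_collector: Optional["MetricsCollector"] = None) -> None:
        self._metrics_collector = metrics_collector

mgr = GrpcRequestManager(llm=object())
print(mgr._metrics_collector)  # -> None
```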
335-339: Consider logging at warning level instead of debug for unexpected exceptions.

The broad except Exception catch is acceptable for a resilient background loop, but logging at debug level may hide important errors that operators need to see during troubleshooting. Unexpected exceptions in stats collection should be visible without enabling debug logging.

♻️ Proposed fix

```diff
 except asyncio.CancelledError:
     raise
 except Exception as e:
-    logger.debug(f"Iteration stats collection error: {e}")
+    logger.warning(f"Iteration stats collection error: {e}")
     await asyncio.sleep(1.0)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/commands/serve.py` around lines 335 - 339, The except Exception handler inside the background loop in serve.py currently logs unexpected errors with logger.debug which can hide problems; change that call to logger.warning and include the exception context (e.g., pass exc_info=True or format the exception) so unexpected iteration stats collection errors are visible to operators while preserving the asyncio.CancelledError re-raise behavior in the surrounding try/except.
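The suggested handler shape can be sketched generically; `logger`, `poll_once`, and `boom` are placeholders, and `exc_info=True` is the prompt's suggestion for preserving the traceback:

```python
import asyncio
import logging

logger = logging.getLogger("stats_loop")

async def poll_once(collect):
    """One guarded iteration of a resilient background loop."""
    try:
        collect()
    except asyncio.CancelledError:
        raise  # shutdown must still be able to cancel the task
    except Exception:
        # warning (not debug) so operators see failures without enabling
        # debug logging; exc_info=True preserves the traceback.
        logger.warning("Iteration stats collection error", exc_info=True)

def boom():
    raise RuntimeError("transient engine failure")

asyncio.run(poll_once(boom))  # the error is logged; the loop would continue
```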
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2e09fd71-ed00-4d98-94a2-83c53d322980
📒 Files selected for processing (2)
tensorrt_llm/commands/serve.py
tensorrt_llm/grpc/grpc_request_manager.py
afd335b to 6d4eb8f (compare)
gRPC mode previously had no Prometheus metrics instrumentation, unlike the OpenAI-compatible HTTP server. This adds a MetricsCollector to the gRPC launch path and a background iteration-stats loop that mirrors the HTTP server's _iteration_stats_collector_loop, exposing KV-cache utilization, hit rate, and per-request latency/throughput metrics. Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
/bot run --disable-fast

PR_Github #42013 Bot args parsing error: usage: /bot [-h]

/bot run --disable-fail-fast

PR_Github #42189 [ run ] triggered by Bot. Commit:

PR_Github #42189 [ run ] completed with state
result.metrics_dict is an empty dict when return_perf_metrics is off (the default), so `if result.metrics_dict` was always False and log_request_metrics_dict() was never called. Populate finished_reason from result.outputs[0].finish_reason directly so the MetricsCollector can record request success counters. Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
Without return_perf_metrics, the C++ executor does not collect timing data (E2E latency, TTFT, TPOT, queue time), so prometheus histograms remain empty in gRPC mode while they work in HTTP mode. Set return_perf_metrics=True after LLM initialization so all gRPC requests populate metrics_dict with timing data, matching HTTP behavior. Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
The OpenAI-compatible HTTP server (OpenAIServer) instruments every request and engine iteration with Prometheus counters/histograms/gauges via MetricsCollector. The gRPC server path (launch_grpc_server) had no equivalent instrumentation, so operators running trtllm-serve in gRPC mode had zero visibility into request latencies, throughput, or KV-cache utilization. This PR closes that gap by wiring the same MetricsCollector into the gRPC launch path:

- GrpcRequestManager calls log_request_metrics_dict() when a GenerationResult is finished, mirroring OpenAIServer._finish_request()
- A _grpc_iteration_stats_loop background task polls llm.get_stats_async() every 1 s, mirroring OpenAIServer._iteration_stats_collector_loop
Description
Test Coverage
Run trtllm-serve ... --grpc, send requests, and scrape /metrics (or PROMETHEUS_MULTIPROC_DIR) to confirm trtllm_e2e_request_latency_seconds, trtllm_kv_cache_utilization, etc. are populated.

MetricsCollector unit tests already cover log_request_metrics_dict and log_iteration_stats; no new paths are introduced in the collector itself.
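The PROMETHEUS_MULTIPROC_DIR scrape mentioned above uses prometheus_client's multiprocess mode; a minimal sketch of a scrape, where the metric name and directory are illustrative (the PR's real metrics live in MetricsCollector), and the env var must be set before prometheus_client creates any metric values:

```python
import os
import tempfile

# Multiprocess mode is selected via this env var, so set it before the import.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", tempfile.mkdtemp())

from prometheus_client import (CollectorRegistry, Counter, generate_latest,
                               multiprocess)

# Illustrative metric recorded by a worker process.
requests_total = Counter("trtllm_demo_requests", "Finished demo requests")
requests_total.inc()

# At scrape time, aggregate every worker's .db files into one registry.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
exposition = generate_latest(registry).decode()
print("trtllm_demo_requests" in exposition)  # -> True
```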
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.