
feat: add Prometheus metrics collection for gRPC server mode #12760

Open

ConnorLi96 wants to merge 4 commits into NVIDIA:main from ConnorLi96:feature/grpc-prometheus-metrics

Conversation

ConnorLi96 commented Apr 4, 2026

The OpenAI-compatible HTTP server (OpenAIServer) instruments every request and engine iteration with Prometheus counters/histograms/gauges via MetricsCollector. The gRPC server path (launch_grpc_server) had no equivalent instrumentation, so operators running trtllm-serve in gRPC mode had zero visibility into request latencies, throughput, or KV-cache utilization. This PR closes that gap by wiring the same MetricsCollector into the gRPC launch path:

| Metric category | How it's collected | Parity with HTTP server |
|---|---|---|
| Per-request (E2E latency, TTFT, TPOT, queue time, finish reason) | `GrpcRequestManager` calls `log_request_metrics_dict()` when a `GenerationResult` finishes | Same as `OpenAIServer._finish_request()` |
| Iteration-level (KV-cache utilization, hit rate, reused/missed blocks) | New `_grpc_iteration_stats_loop` background task polls `llm.get_stats_async()` every 1 s | Mirrors `OpenAIServer._iteration_stats_collector_loop` |
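The iteration-level row above describes a 1 s polling loop. A minimal sketch of such a loop, assuming the names from the description (`get_stats_async`, `log_iteration_stats`) and adding a `max_iterations` parameter purely for testability — this is not the merged implementation:

```python
import asyncio
from typing import Optional


async def iteration_stats_loop(llm, metrics_collector,
                               interval_s: float = 1.0,
                               max_iterations: Optional[int] = None) -> None:
    # Hedged sketch of the background task described above; method names
    # (get_stats_async, log_iteration_stats) follow the PR description.
    done = 0
    while max_iterations is None or done < max_iterations:
        try:
            stats = await llm.get_stats_async(timeout=0.5)
            for stat in stats:
                metrics_collector.log_iteration_stats(stat)
        except asyncio.CancelledError:
            raise  # let server shutdown cancel the task cleanly
        except Exception:
            pass  # keep the background loop alive on transient engine errors
        done += 1
        await asyncio.sleep(interval_s)
```

Re-raising `CancelledError` is what lets the server cancel the task cleanly on shutdown, while all other exceptions are swallowed so a single failed poll does not kill the collector.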

Summary by CodeRabbit

  • New Features
    • Added Prometheus metrics collection to the gRPC server with automatic tracking of iteration statistics and per-request performance metrics for improved monitoring and observability.

Description

Test Coverage

  • Manual verification: launch gRPC server with trtllm-serve ... --grpc,
    send requests, and scrape /metrics (or PROMETHEUS_MULTIPROC_DIR) to
    confirm trtllm_e2e_request_latency_seconds, trtllm_kv_cache_utilization,
    etc. are populated.
  • Existing MetricsCollector unit tests already cover
    log_request_metrics_dict and log_iteration_stats; no new paths are
    introduced in the collector itself.
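For the manual /metrics verification above, a stdlib-only helper can confirm that the expected metric families appear in a scraped payload. The sample text below is illustrative, not actual trtllm-serve output:

```python
# Parse "# TYPE <name> <kind>" lines from Prometheus text exposition
# output; each metric family is declared exactly once this way.
def metric_families(text: str) -> set:
    return {line.split()[2] for line in text.splitlines()
            if line.startswith("# TYPE ")}


# Illustrative payload shaped like what a /metrics scrape might return.
sample = """\
# TYPE trtllm_e2e_request_latency_seconds histogram
trtllm_e2e_request_latency_seconds_count 3.0
# TYPE trtllm_kv_cache_utilization gauge
trtllm_kv_cache_utilization 0.42
"""

assert "trtllm_e2e_request_latency_seconds" in metric_families(sample)
assert "trtllm_kv_cache_utilization" in metric_families(sample)
```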

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@ConnorLi96 ConnorLi96 requested a review from a team as a code owner April 4, 2026 04:22
@ConnorLi96 ConnorLi96 requested a review from nv-guomingz April 4, 2026 04:22

coderabbitai bot commented Apr 4, 2026

Walkthrough

A new metrics collection feature is introduced for the gRPC server. A background async loop periodically captures iteration statistics from the LLM, while the request manager logs per-request Prometheus metrics. Prometheus multiprocess support is initialized during server startup.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Metrics Collection Integration: `tensorrt_llm/commands/serve.py`, `tensorrt_llm/grpc/grpc_request_manager.py` | Added background `_grpc_iteration_stats_loop` to continuously collect LLM stats; initialized Prometheus multiprocess mode and MetricsCollector with model/engine labels in server startup; passed metrics collector to GrpcRequestManager to log per-request metrics when results finish; included cleanup on server shutdown. |

Sequence Diagram

sequenceDiagram
    participant Server as GrpcServer
    participant StatsLoop as Stats Loop
    participant LLM as LLM Engine
    participant MetricsCollector as Metrics Collector
    participant ReqMgr as RequestManager
    participant Prometheus as Prometheus

    Server->>MetricsCollector: Initialize with model & engine labels
    Server->>ReqMgr: Create with metrics_collector
    Server->>StatsLoop: Start background task
    Note over StatsLoop: Every ~1 second
    StatsLoop->>LLM: get_stats_async(timeout=0.5)
    LLM-->>StatsLoop: Latest iteration stats
    StatsLoop->>MetricsCollector: log_iteration_stats(stat)
    MetricsCollector->>Prometheus: Record metrics

    Server->>ReqMgr: Process request (generate)
    ReqMgr->>LLM: Stream results
    LLM-->>ReqMgr: GenerationResult
    alt Result finished
        ReqMgr->>MetricsCollector: log_request_metrics_dict(result.metrics_dict)
        MetricsCollector->>Prometheus: Record request metrics
    end

    Note over Server: On shutdown
    Server->>StatsLoop: Cancel task
    StatsLoop-->>Server: Task cancelled
    Server->>ReqMgr: Stop

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding Prometheus metrics collection to the gRPC server mode, matching the PR's core objective. |
| Description check | ✅ Passed | The description explains the issue (gRPC server lacked metrics visibility) and solution (wiring MetricsCollector), includes a comparison table, test coverage details, and addresses the template requirements. |



Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/grpc/grpc_request_manager.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Update copyright year to include 2026.

The file is being meaningfully modified but the copyright header still shows 2024. As per coding guidelines, the copyright year should reflect the latest meaningful modification.

-# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🧹 Nitpick comments (3)
tensorrt_llm/grpc/grpc_request_manager.py (1)

56-64: Consider adding type hints for metrics_collector parameter.

The parameter lacks type annotation. Adding Optional[MetricsCollector] would improve IDE support and documentation, consistent with the docstring mentioning it's optional.

+from typing import Optional
+from tensorrt_llm.metrics.collector import MetricsCollector as MetricsCollectorType
+
 class GrpcRequestManager:
-    def __init__(self, llm: Any, metrics_collector=None):
+    def __init__(self, llm: Any, metrics_collector: Optional[MetricsCollectorType] = None):

Alternatively, a forward reference string "MetricsCollector" could be used to avoid circular imports if needed.

tensorrt_llm/commands/serve.py (2)

321-327: Consider adding type hints for function parameters.

The function parameters lack type annotations, which would improve code clarity and IDE support.

-async def _grpc_iteration_stats_loop(llm, metrics_collector) -> None:
+async def _grpc_iteration_stats_loop(llm: "LLM | PyTorchLLM", metrics_collector: "MetricsCollector") -> None:

Using string literals for forward references avoids import ordering issues.


335-339: Consider logging at warning level instead of debug for unexpected exceptions.

The broad except Exception catch is acceptable for a resilient background loop, but logging at debug level may hide important errors that operators need to see during troubleshooting. Unexpected exceptions in stats collection should be visible without enabling debug logging.

♻️ Proposed fix
         except asyncio.CancelledError:
             raise
         except Exception as e:
-            logger.debug(f"Iteration stats collection error: {e}")
+            logger.warning(f"Iteration stats collection error: {e}")
         await asyncio.sleep(1.0)

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2e09fd71-ed00-4d98-94a2-83c53d322980

📥 Commits

Reviewing files that changed from the base of the PR and between b6c5a71 and afd335b.

📒 Files selected for processing (2)
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/grpc/grpc_request_manager.py

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Apr 4, 2026
@ConnorLi96 force-pushed the feature/grpc-prometheus-metrics branch from afd335b to 6d4eb8f on April 7, 2026 00:54
gRPC mode previously had no Prometheus metrics instrumentation, unlike the
OpenAI-compatible HTTP server. This adds a MetricsCollector to the gRPC
launch path and a background iteration-stats loop that mirrors the HTTP
server's _iteration_stats_collector_loop, exposing KV-cache utilization,
hit rate, and per-request latency/throughput metrics.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
@juney-nvidia (Collaborator) commented:

/bot run --disable-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #42013 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --disable-fast

Link to invocation

@karljang (Collaborator) commented Apr 7, 2026:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #42189 [ run ] triggered by Bot. Commit: 6d4eb8f Link to invocation

@tensorrt-cicd (Collaborator) commented:

PR_Github #42189 [ run ] completed with state SUCCESS. Commit: 6d4eb8f
/LLM/main/L0_MergeRequest_PR pipeline #33013 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

ConnorLi96 and others added 3 commits April 8, 2026 14:08
result.metrics_dict is an empty dict when return_perf_metrics is off
(the default), so `if result.metrics_dict` was always False and
log_request_metrics_dict() was never called.

Populate finished_reason from result.outputs[0].finish_reason directly
so the MetricsCollector can record request success counters.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
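The fix described in this commit could be sketched as follows. All names here (`metrics_dict`, `outputs[0].finish_reason`, `log_request_metrics_dict`, and the `finished_reason` key) are assumptions drawn from the commit message and PR description, not the verified diff:

```python
from types import SimpleNamespace


def on_result_finished(result, metrics_collector):
    # Log unconditionally rather than gating on `if result.metrics_dict`,
    # which is falsy when return_perf_metrics is off, and derive the
    # finish reason from the first output so success counters still tick.
    metrics = dict(result.metrics_dict or {})
    if result.outputs:
        metrics.setdefault("finished_reason", result.outputs[0].finish_reason)
    metrics_collector.log_request_metrics_dict(metrics)
```

The key change is that an empty `metrics_dict` no longer short-circuits the call, so the finish-reason counter is recorded even when timing metrics are unavailable.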
Without return_perf_metrics, the C++ executor does not collect timing
data (E2E latency, TTFT, TPOT, queue time), so prometheus histograms
remain empty in gRPC mode while they work in HTTP mode.

Set return_perf_metrics=True after LLM initialization so all gRPC
requests populate metrics_dict with timing data, matching HTTP behavior.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
