
feat: add Prometheus metrics collection for gRPC server mode #12760

Open

ConnorLi96 wants to merge 4 commits into NVIDIA:main from ConnorLi96:feature/grpc-prometheus-metrics

Conversation

ConnorLi96 commented Apr 4, 2026

The OpenAI-compatible HTTP server (OpenAIServer) instruments every request and engine iteration with Prometheus counters/histograms/gauges via MetricsCollector. The gRPC server path (launch_grpc_server) had no equivalent instrumentation, so operators running trtllm-serve in gRPC mode had zero visibility into request latencies, throughput, or KV-cache utilization. This PR closes that gap by wiring the same MetricsCollector into the gRPC launch path:

| Metric category | How it's collected | Parity with HTTP server |
|---|---|---|
| Per-request (E2E latency, TTFT, TPOT, queue time, finish reason) | `GrpcRequestManager` calls `log_request_metrics_dict()` when a `GenerationResult` finishes | Same as `OpenAIServer._finish_request()` |
| Iteration-level (KV-cache utilization, hit rate, reused/missed blocks) | New `_grpc_iteration_stats_loop` background task polls `llm.get_stats_async()` every 1 s | Mirrors `OpenAIServer._iteration_stats_collector_loop` |
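The iteration-level row above describes a 1 s polling loop. A minimal sketch of such a loop, assuming the names from the description (`get_stats_async`, `log_iteration_stats`) and adding a `max_iterations` parameter purely for testability — this is not the merged implementation:

```python
import asyncio
from typing import Optional


async def iteration_stats_loop(llm, metrics_collector,
                               interval_s: float = 1.0,
                               max_iterations: Optional[int] = None) -> None:
    # Hedged sketch of the background task described above; method names
    # (get_stats_async, log_iteration_stats) follow the PR description.
    done = 0
    while max_iterations is None or done < max_iterations:
        try:
            stats = await llm.get_stats_async(timeout=0.5)
            for stat in stats:
                metrics_collector.log_iteration_stats(stat)
        except asyncio.CancelledError:
            raise  # let server shutdown cancel the task cleanly
        except Exception:
            pass  # keep the background loop alive on transient engine errors
        done += 1
        await asyncio.sleep(interval_s)
```

Re-raising `CancelledError` is what lets the server cancel the task cleanly on shutdown, while all other exceptions are swallowed so a single failed poll does not kill the collector.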

Summary by CodeRabbit

  • New Features
    • Added Prometheus metrics collection to the gRPC server with automatic tracking of iteration statistics and per-request performance metrics for improved monitoring and observability.

Description

Test Coverage

  • Manual verification: launch gRPC server with trtllm-serve ... --grpc,
    send requests, and scrape /metrics (or PROMETHEUS_MULTIPROC_DIR) to
    confirm trtllm_e2e_request_latency_seconds, trtllm_kv_cache_utilization,
    etc. are populated.
  • Existing MetricsCollector unit tests already cover
    log_request_metrics_dict and log_iteration_stats; no new paths are
    introduced in the collector itself.
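For the manual /metrics verification above, a stdlib-only helper can confirm that the expected metric families appear in a scraped payload. The sample text below is illustrative, not actual trtllm-serve output:

```python
# Parse "# TYPE <name> <kind>" lines from Prometheus text exposition
# output; each metric family is declared exactly once this way.
def metric_families(text: str) -> set:
    return {line.split()[2] for line in text.splitlines()
            if line.startswith("# TYPE ")}


# Illustrative payload shaped like what a /metrics scrape might return.
sample = """\
# TYPE trtllm_e2e_request_latency_seconds histogram
trtllm_e2e_request_latency_seconds_count 3.0
# TYPE trtllm_kv_cache_utilization gauge
trtllm_kv_cache_utilization 0.42
"""

assert "trtllm_e2e_request_latency_seconds" in metric_families(sample)
assert "trtllm_kv_cache_utilization" in metric_families(sample)
```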

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@ConnorLi96 ConnorLi96 requested a review from a team as a code owner April 4, 2026 04:22
@ConnorLi96 ConnorLi96 requested a review from nv-guomingz April 4, 2026 04:22

coderabbitai bot commented Apr 4, 2026

Walkthrough

A new metrics collection feature is introduced for the gRPC server. A background async loop periodically captures iteration statistics from the LLM, while the request manager logs per-request Prometheus metrics. Prometheus multiprocess support is initialized during server startup.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Metrics Collection Integration: `tensorrt_llm/commands/serve.py`, `tensorrt_llm/grpc/grpc_request_manager.py` | Added background `_grpc_iteration_stats_loop` to continuously collect LLM stats; initialized Prometheus multiprocess mode and MetricsCollector with model/engine labels in server startup; passed metrics collector to GrpcRequestManager to log per-request metrics when results finish; included cleanup on server shutdown. |

Sequence Diagram

sequenceDiagram
    participant Server as GrpcServer
    participant StatsLoop as Stats Loop
    participant LLM as LLM Engine
    participant MetricsCollector as Metrics Collector
    participant ReqMgr as RequestManager
    participant Prometheus as Prometheus

    Server->>MetricsCollector: Initialize with model & engine labels
    Server->>ReqMgr: Create with metrics_collector
    Server->>StatsLoop: Start background task
    Note over StatsLoop: Every ~1 second
    StatsLoop->>LLM: get_stats_async(timeout=0.5)
    LLM-->>StatsLoop: Latest iteration stats
    StatsLoop->>MetricsCollector: log_iteration_stats(stat)
    MetricsCollector->>Prometheus: Record metrics

    Server->>ReqMgr: Process request (generate)
    ReqMgr->>LLM: Stream results
    LLM-->>ReqMgr: GenerationResult
    alt Result finished
        ReqMgr->>MetricsCollector: log_request_metrics_dict(result.metrics_dict)
        MetricsCollector->>Prometheus: Record request metrics
    end

    Note over Server: On shutdown
    Server->>StatsLoop: Cancel task
    StatsLoop-->>Server: Task cancelled
    Server->>ReqMgr: Stop

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding Prometheus metrics collection to the gRPC server mode, matching the PR's core objective. |
| Description check | ✅ Passed | The description explains the issue (gRPC server lacked metrics visibility) and solution (wiring MetricsCollector), includes a comparison table, test coverage details, and addresses the template requirements. |



Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/grpc/grpc_request_manager.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Update copyright year to include 2026.

The file is being meaningfully modified but the copyright header still shows 2024. As per coding guidelines, the copyright year should reflect the latest meaningful modification.

-# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🧹 Nitpick comments (3)
tensorrt_llm/grpc/grpc_request_manager.py (1)

56-64: Consider adding type hints for metrics_collector parameter.

The parameter lacks type annotation. Adding Optional[MetricsCollector] would improve IDE support and documentation, consistent with the docstring mentioning it's optional.

+from typing import Optional
+from tensorrt_llm.metrics.collector import MetricsCollector as MetricsCollectorType
+
 class GrpcRequestManager:
-    def __init__(self, llm: Any, metrics_collector=None):
+    def __init__(self, llm: Any, metrics_collector: Optional[MetricsCollectorType] = None):

Alternatively, a forward reference string "MetricsCollector" could be used to avoid circular imports if needed.

tensorrt_llm/commands/serve.py (2)

321-327: Consider adding type hints for function parameters.

The function parameters lack type annotations, which would improve code clarity and IDE support.

-async def _grpc_iteration_stats_loop(llm, metrics_collector) -> None:
+async def _grpc_iteration_stats_loop(llm: "LLM | PyTorchLLM", metrics_collector: "MetricsCollector") -> None:

Using string literals for forward references avoids import ordering issues.


335-339: Consider logging at warning level instead of debug for unexpected exceptions.

The broad except Exception catch is acceptable for a resilient background loop, but logging at debug level may hide important errors that operators need to see during troubleshooting. Unexpected exceptions in stats collection should be visible without enabling debug logging.

♻️ Proposed fix
         except asyncio.CancelledError:
             raise
         except Exception as e:
-            logger.debug(f"Iteration stats collection error: {e}")
+            logger.warning(f"Iteration stats collection error: {e}")
         await asyncio.sleep(1.0)

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2e09fd71-ed00-4d98-94a2-83c53d322980

📥 Commits

Reviewing files that changed from the base of the PR and between b6c5a71 and afd335b.

📒 Files selected for processing (2)
  • tensorrt_llm/commands/serve.py
  • tensorrt_llm/grpc/grpc_request_manager.py

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Apr 4, 2026
@ConnorLi96 force-pushed the feature/grpc-prometheus-metrics branch from afd335b to 6d4eb8f on April 7, 2026 00:54
gRPC mode previously had no Prometheus metrics instrumentation, unlike the
OpenAI-compatible HTTP server. This adds a MetricsCollector to the gRPC
launch path and a background iteration-stats loop that mirrors the HTTP
server's _iteration_stats_collector_loop, exposing KV-cache utilization,
hit rate, and per-request latency/throughput metrics.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
@juney-nvidia (Collaborator) commented:

/bot run --disable-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #42013 Bot args parsing error: usage: /bot [-h]
{run,kill,skip,submit,reviewers,reuse-pipeline,reuse-review} ...
/bot: error: unrecognized arguments: --disable-fast

Link to invocation

@karljang (Collaborator) commented Apr 7, 2026:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #42189 [ run ] triggered by Bot. Commit: 6d4eb8f Link to invocation

@tensorrt-cicd (Collaborator) commented:

PR_Github #42189 [ run ] completed with state SUCCESS. Commit: 6d4eb8f
/LLM/main/L0_MergeRequest_PR pipeline #33013 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

ConnorLi96 and others added 3 commits April 8, 2026 14:08
result.metrics_dict is an empty dict when return_perf_metrics is off
(the default), so `if result.metrics_dict` was always False and
log_request_metrics_dict() was never called.

Populate finished_reason from result.outputs[0].finish_reason directly
so the MetricsCollector can record request success counters.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
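The fix described in this commit could be sketched as follows. All names here (`metrics_dict`, `outputs[0].finish_reason`, `log_request_metrics_dict`, and the `finished_reason` key) are assumptions drawn from the commit message and PR description, not the verified diff:

```python
from types import SimpleNamespace


def on_result_finished(result, metrics_collector):
    # Log unconditionally rather than gating on `if result.metrics_dict`,
    # which is falsy when return_perf_metrics is off, and derive the
    # finish reason from the first output so success counters still tick.
    metrics = dict(result.metrics_dict or {})
    if result.outputs:
        metrics.setdefault("finished_reason", result.outputs[0].finish_reason)
    metrics_collector.log_request_metrics_dict(metrics)
```

The key change is that an empty `metrics_dict` no longer short-circuits the call, so the finish-reason counter is recorded even when timing metrics are unavailable.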
Without return_perf_metrics, the C++ executor does not collect timing
data (E2E latency, TTFT, TPOT, queue time), so prometheus histograms
remain empty in gRPC mode while they work in HTTP mode.

Set return_perf_metrics=True after LLM initialization so all gRPC
requests populate metrics_dict with timing data, matching HTTP behavior.

Signed-off-by: ConnorLi96 <ConnorLi96@users.noreply.github.com>
