From 7adbe8c507f3b840520c75ca51d20392fb8853ad Mon Sep 17 00:00:00 2001
From: Ludwig Schneider <lschneider@nvidia.com>
Date: Mon, 27 Apr 2026 08:40:22 -0500
Subject: [PATCH 1/2] [None][perf] set ncclConfig graphUsageMode=1 on
 communicator init

Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
---
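Note on the version guard in this patch: the sketch below is illustrative only and is not part of the change. It assumes that the `NCCL_VERSION` macro in `nccl.h` encodes versions as reproduced in `ncclVersionCode`; the helper names `ncclVersionCode`, `selectInitCall`, and `selfCheck` are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <string>

// Assumption: mirrors the encoding used by the NCCL_VERSION(X, Y, Z) macro in nccl.h.
// Releases up to 2.8 use a four-digit code, later releases a five-digit code.
constexpr int ncclVersionCode(int major, int minor, int patch)
{
    return (major <= 2 && minor <= 8) ? major * 1000 + minor * 100 + patch
                                      : major * 10000 + minor * 100 + patch;
}

// Hypothetical helper: names the init call that the preprocessor guard in
// getComm() would compile in for a given header version code.
inline std::string selectInitCall(int headerVersionCode)
{
    return headerVersionCode >= ncclVersionCode(2, 29, 0) ? "ncclCommInitRankConfig" : "ncclCommInitRank";
}

// Compile-time check of the threshold used by the guard.
static_assert(ncclVersionCode(2, 29, 0) == 22900, "unexpected encoding for 2.29.0");

// Runtime sanity check covering both sides of the guard.
inline void selfCheck()
{
    assert(selectInitCall(ncclVersionCode(2, 28, 3)) == "ncclCommInitRank");
    assert(selectInitCall(ncclVersionCode(2, 29, 0)) == "ncclCommInitRankConfig");
}
```

Because the real guard is a preprocessor condition, it is evaluated against the `nccl.h` headers at build time. A binary built against older headers keeps calling `ncclCommInitRank` regardless of the NCCL library loaded at run time.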
 cpp/tensorrt_llm/common/opUtils.cpp | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/cpp/tensorrt_llm/common/opUtils.cpp b/cpp/tensorrt_llm/common/opUtils.cpp
index a738e28377a1..5c2908350c69 100644
--- a/cpp/tensorrt_llm/common/opUtils.cpp
+++ b/cpp/tensorrt_llm/common/opUtils.cpp
@@ -1,5 +1,5 @@
 /*
- * SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
  * SPDX-License-Identifier: Apache-2.0
  *
  * Licensed under the Apache License, Version 2.0 (the "License");
@@ -161,7 +161,13 @@ std::shared_ptr<ncclComm_t> getComm(std::set<int> const& group)
     setenv("NCCL_RUNTIME_CONNECT", "0", 0);
     setenv("NCCL_GRAPH_REGISTER", "0", 0);
 #endif // _WIN32
+#if NCCL_VERSION_CODE >= NCCL_VERSION(2, 29, 0)
+    ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
+    config.graphUsageMode = 1; // Turn graph mixing support off (see NCCL_GRAPH_MIXING_SUPPORT).
+    NCCLCHECK_THROW(ncclCommInitRankConfig(ncclComm.get(), group.size(), id, groupRank, &config));
+#else
     NCCLCHECK_THROW(ncclCommInitRank(ncclComm.get(), group.size(), id, groupRank));
+#endif // NCCL_VERSION_CODE >= NCCL_VERSION(2, 29, 0)
     commMap[group] = ncclComm;
     TLLM_LOG_TRACE("%s stop for rank %d", __PRETTY_FUNCTION__, rank);
     return ncclComm;

From c2afce65aa65c3dca59b80d41f5c1c5d8b598986 Mon Sep 17 00:00:00 2001
From: Ludwig Schneider <lschneider@nvidia.com>
Date: Wed, 24 Jun 2026 13:00:33 -0500
Subject: [PATCH 2/2] [None][doc] document NCCL graph mixing default

Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
---
 docs/source/features/disagg-serving.md              | 13 ++++++++++++-
 .../source/legacy/advanced/disaggregated-service.md | 13 ++++++++++++-
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/docs/source/features/disagg-serving.md b/docs/source/features/disagg-serving.md
index 77987773dc0b..f38e3c818f57 100644
--- a/docs/source/features/disagg-serving.md
+++ b/docs/source/features/disagg-serving.md
@@ -258,7 +258,12 @@ TRT-LLM uses some environment variables to control the behavior of disaggregated
 
 There are some other useful environment variables that may help when encountering failures or performance issues.
 
-* `NCCL_GRAPH_MIXING_SUPPORT`: With the default value `1`, the CUDA driver may create too many CUDA streams while working with one CUDA graph, leading to performance drop. Setting it to `0` will reduce the number of CUDA streams, but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise it's unsafe.
+* `NCCL_GRAPH_MIXING_SUPPORT`: When built against NCCL 2.29 or later, TensorRT-LLM initializes its
+  common NCCL communicators with graph mixing support off to reduce launch overhead for NCCL
+  operations captured in CUDA graphs. This is safe only if a communicator is never used by parallel
+  graph launches or by uncaptured NCCL calls while a graph launch is outstanding. If your workload
+  needs either pattern, set `NCCL_GRAPH_MIXING_SUPPORT=1` to restore NCCL's default behavior. See the
+  [NCCL_GRAPH_MIXING_SUPPORT documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-graph-mixing-support).
 
 * `UCX_MAX_RNDV_RAILS`: With the default value 2, UCX attempts to use two InfiniBand (IB) NIC devices per GPU for Rendezvous (RNDV) transfers. When both the context and generation instances enable tensor- and expert-parallel (TEP), multiple TP ranks may transfer KV cache concurrently. Because each TP rank can use up to two NIC devices, some NIC devices can be shared across GPUs, causing contention and reduced throughput. Setting UCX_MAX_RNDV_RAILS=1 can reduce contention in this case.
 
@@ -309,6 +314,12 @@ executorConfig.setCacheTransceiverConfig(texec::CacheTransceiverConfig(BackendTy
 
 A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.
 
+*Q. How do I debug a suspected hang from overlapping NCCL graph operations?*
+
+A. TensorRT-LLM turns graph mixing support off by default for common NCCL communicators. To check
+whether a hang is related to this setting, set `NCCL_GRAPH_MIXING_SUPPORT=1` to restore NCCL's default
+graph mixing behavior and rerun. If the hang disappears, the workload likely overlaps NCCL graph operations.
+
 *Q. What causes the substantial bandwidth fluctuations in kvCache transfers, especially during the first few requests following service initialization?*
 
 A. The communication for kvCache transfer between executors are established dynamically. The connection establishment process incurs significant overhead, which explains the apparently lower kvCache transfer bandwidth observed during the initial requests after service startup. This lower bandwidth reflects the inclusion of connection establishment overhead. When conducting benchmarks, it is recommended to perform a warm-up phase to ensure accurate performance measurements.
diff --git a/docs/source/legacy/advanced/disaggregated-service.md b/docs/source/legacy/advanced/disaggregated-service.md
index ac6cbad472e9..4e23ad5179a8 100644
--- a/docs/source/legacy/advanced/disaggregated-service.md
+++ b/docs/source/legacy/advanced/disaggregated-service.md
@@ -33,7 +33,12 @@ TRT-LLM uses some environment variables to control the behavior of disaggregated
 
 There are some other useful environment variables that may help when encountering failures or performance issues.
 
-* `NCCL_GRAPH_MIXING_SUPPORT`: With the default value `1`, the CUDA driver may create too many CUDA streams while working with one CUDA graph, leading to performance drop. Setting it to `0` will reduce the number of CUDA streams, but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise it's unsafe.
+* `NCCL_GRAPH_MIXING_SUPPORT`: When built against NCCL 2.29 or later, TensorRT-LLM initializes its
+  common NCCL communicators with graph mixing support off to reduce launch overhead for NCCL
+  operations captured in CUDA graphs. This is safe only if a communicator is never used by parallel
+  graph launches or by uncaptured NCCL calls while a graph launch is outstanding. If your workload
+  needs either pattern, set `NCCL_GRAPH_MIXING_SUPPORT=1` to restore NCCL's default behavior. See the
+  [NCCL_GRAPH_MIXING_SUPPORT documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-graph-mixing-support).
 
 * `UCX_MAX_RNDV_RAILS`: With the default value `2`, UCX attempts to use two InfiniBand (IB) NIC devices per GPU for Rendezvous (RNDV) transfers. When both the context and generation instances enable tensor- and expert-parallel (TEP), multiple TP ranks may transfer KV cache concurrently. Because each TP rank can use up to two NIC devices, some NIC devices can be shared across GPUs, causing contention and reduced throughput. Setting `UCX_MAX_RNDV_RAILS=1` can reduce contention in this case.
 
@@ -71,6 +76,12 @@ A. Yes, it's recommended that different executor use different GPUs. We support
 
 A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.
 
+*Q. How do I debug a suspected hang from overlapping NCCL graph operations?*
+
+A. TensorRT-LLM turns graph mixing support off by default for common NCCL communicators. To check
+whether a hang is related to this setting, set `NCCL_GRAPH_MIXING_SUPPORT=1` to restore NCCL's default
+graph mixing behavior and rerun. If the hang disappears, the workload likely overlaps NCCL graph operations.
+
 *Q. What causes the substantial bandwidth fluctuations in kvCache transfers, especially during the first few requests following service initialization?*
 
 A. The communication for kvCache transfer between executors are established dynamically. The connection establishment process incurs significant overhead, which explains the apparently lower kvCache transfer bandwidth observed during the initial requests after service startup. This lower bandwidth reflects the inclusion of connection establishment overhead. When conducting benchmarks, it is recommended to perform a warm-up phase to ensure accurate performance measurements.