
Commit c2afce6

[None][doc] document NCCL graph mixing default
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
1 parent 7adbe8c commit c2afce6

2 files changed

Lines changed: 24 additions & 2 deletions

docs/source/features/disagg-serving.md

Lines changed: 12 additions & 1 deletion
@@ -258,7 +258,12 @@ TRT-LLM uses some environment variables to control the behavior of disaggregated
 
 There are some other useful environment variables that may help when encountering failures or performance issues.
 
-* `NCCL_GRAPH_MIXING_SUPPORT`: With the default value `1`, the CUDA driver may create too many CUDA streams while working with one CUDA graph, leading to performance drop. Setting it to `0` will reduce the number of CUDA streams, but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise it's unsafe.
+* `NCCL_GRAPH_MIXING_SUPPORT`: TensorRT-LLM now initializes common NCCL communicators with graph
+mixing support off by default to reduce launch overhead for CUDA graph-captured NCCL operations.
+This assumes the communicator is not used by parallel graph launches or by uncaptured NCCL calls
+while a graph launch is outstanding. Set `NCCL_GRAPH_MIXING_SUPPORT=1` to restore NCCL's default
+graph mixing behavior if your workload needs it. For more details, see the
+[NCCL_GRAPH_MIXING_SUPPORT documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-graph-mixing-support).
 
 * `UCX_MAX_RNDV_RAILS`: With the default value 2, UCX attempts to use two InfiniBand (IB) NIC devices per GPU for Rendezvous (RNDV) transfers. When both the context and generation instances enable tensor- and expert-parallel (TEP), multiple TP ranks may transfer KV cache concurrently. Because each TP rank can use up to two NIC devices, some NIC devices can be shared across GPUs, causing contention and reduced throughput. Setting UCX_MAX_RNDV_RAILS=1 can reduce contention in this case.
 
@@ -309,6 +314,12 @@ executorConfig.setCacheTransceiverConfig(texec::CacheTransceiverConfig(BackendTy
 
 A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.
 
+*Q. How do I debug a suspected hang from overlapping NCCL graph operations?*
+
+A. TensorRT-LLM turns graph mixing support off by default for common NCCL communicators. To check if
+a hang might be related to NCCL graph mixing support, set `NCCL_GRAPH_MIXING_SUPPORT=1` to restore
+NCCL's default graph mixing behavior.
+
 *Q. What causes the substantial bandwidth fluctuations in kvCache transfers, especially during the first few requests following service initialization?*
 
 A. The communication for kvCache transfer between executors are established dynamically. The connection establishment process incurs significant overhead, which explains the apparently lower kvCache transfer bandwidth observed during the initial requests after service startup. This lower bandwidth reflects the inclusion of connection establishment overhead. When conducting benchmarks, it is recommended to perform a warm-up phase to ensure accurate performance measurements.
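
The override described in this file's `NCCL_GRAPH_MIXING_SUPPORT` entry can be sketched as a small launcher helper. This is a minimal illustration, not part of the commit: it assumes the variable is read from the process environment when the communicators are initialized, so it is set before the child process starts. The names `launch_env` and `run_with_graph_mixing` are hypothetical, and the probe command is a stand-in for a real launch command.

```python
import os
import subprocess
import sys

ENV_VAR = "NCCL_GRAPH_MIXING_SUPPORT"


def launch_env(graph_mixing, base_env=None):
    """Return a copy of the environment with graph mixing forced on or off.

    The caller's mapping (or os.environ) is never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env[ENV_VAR] = "1" if graph_mixing else "0"
    return env


def run_with_graph_mixing(cmd, graph_mixing=True):
    """Run cmd (a hypothetical launch command) with the variable already set."""
    return subprocess.run(
        cmd, env=launch_env(graph_mixing), capture_output=True, text=True
    )


if __name__ == "__main__":
    # Stand-in child process that only echoes the variable it received.
    probe = [sys.executable, "-c", f"import os; print(os.environ['{ENV_VAR}'])"]
    print(run_with_graph_mixing(probe).stdout.strip())
```

Building a fresh environment per launch keeps the override scoped to one process tree, which is convenient when only some workloads need NCCL's default graph mixing behavior.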

docs/source/legacy/advanced/disaggregated-service.md

Lines changed: 12 additions & 1 deletion
@@ -33,7 +33,12 @@ TRT-LLM uses some environment variables to control the behavior of disaggregated
 
 There are some other useful environment variables that may help when encountering failures or performance issues.
 
-* `NCCL_GRAPH_MIXING_SUPPORT`: With the default value `1`, the CUDA driver may create too many CUDA streams while working with one CUDA graph, leading to performance drop. Setting it to `0` will reduce the number of CUDA streams, but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise it's unsafe.
+* `NCCL_GRAPH_MIXING_SUPPORT`: TensorRT-LLM now initializes common NCCL communicators with graph
+mixing support off by default to reduce launch overhead for CUDA graph-captured NCCL operations.
+This assumes the communicator is not used by parallel graph launches or by uncaptured NCCL calls
+while a graph launch is outstanding. Set `NCCL_GRAPH_MIXING_SUPPORT=1` to restore NCCL's default
+graph mixing behavior if your workload needs it. For more details, see the
+[NCCL_GRAPH_MIXING_SUPPORT documentation](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html#nccl-graph-mixing-support).
 
 * `UCX_MAX_RNDV_RAILS`: With the default value `2`, UCX attempts to use two InfiniBand (IB) NIC devices per GPU for Rendezvous (RNDV) transfers. When both the context and generation instances enable tensor- and expert-parallel (TEP), multiple TP ranks may transfer KV cache concurrently. Because each TP rank can use up to two NIC devices, some NIC devices can be shared across GPUs, causing contention and reduced throughput. Setting `UCX_MAX_RNDV_RAILS=1` can reduce contention in this case.
 
@@ -71,6 +76,12 @@ A. Yes, it's recommended that different executor use different GPUs. We support
 
 A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.
 
+*Q. How do I debug a suspected hang from overlapping NCCL graph operations?*
+
+A. TensorRT-LLM turns graph mixing support off by default for common NCCL communicators. To check if
+a hang might be related to NCCL graph mixing support, set `NCCL_GRAPH_MIXING_SUPPORT=1` to restore
+NCCL's default graph mixing behavior.
+
 *Q. What causes the substantial bandwidth fluctuations in kvCache transfers, especially during the first few requests following service initialization?*
 
 A. The communication for kvCache transfer between executors are established dynamically. The connection establishment process incurs significant overhead, which explains the apparently lower kvCache transfer bandwidth observed during the initial requests after service startup. This lower bandwidth reflects the inclusion of connection establishment overhead. When conducting benchmarks, it is recommended to perform a warm-up phase to ensure accurate performance measurements.
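
The hang check from the new FAQ entry amounts to an A/B comparison: run the same workload once with the variable unset, so TensorRT-LLM applies its own default, and once with `NCCL_GRAPH_MIXING_SUPPORT=1`, each under a timeout. The sketch below is one hedged way to script that. The helper names `completes` and `triage` are hypothetical, and treating a timeout as a hang is a simplifying assumption of this example.

```python
import os
import subprocess
import sys

ENV_VAR = "NCCL_GRAPH_MIXING_SUPPORT"


def completes(cmd, timeout_s, graph_mixing=None):
    """Return True if cmd exits within timeout_s seconds, False if it times out.

    graph_mixing=None leaves the variable unset; any other value is exported.
    """
    env = dict(os.environ)
    if graph_mixing is None:
        env.pop(ENV_VAR, None)
    else:
        env[ENV_VAR] = graph_mixing
    try:
        subprocess.run(
            cmd,
            env=env,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising.
        return False
    return True


def triage(cmd, timeout_s):
    """Run cmd under both settings and report which runs finished in time."""
    return {
        "default": completes(cmd, timeout_s),
        "graph_mixing=1": completes(cmd, timeout_s, "1"),
    }
```

A result such as `{"default": False, "graph_mixing=1": True}` suggests the hang is related to graph mixing being off. If both runs time out, the cause is probably elsewhere. Only the direct child is killed on timeout, so multi-process launchers may need their own cleanup.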
