docs/source/features/disagg-serving.md
+12 -1 (12 additions & 1 deletion)
@@ -258,7 +258,12 @@ TRT-LLM uses some environment variables to control the behavior of disaggregated
There are some other useful environment variables that may help when encountering failures or performance issues.

-* `NCCL_GRAPH_MIXING_SUPPORT`: With the default value `1`, the CUDA driver may create too many CUDA streams while working with one CUDA graph, leading to performance drop. Setting it to `0` will reduce the number of CUDA streams, but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise it's unsafe.
+* `NCCL_GRAPH_MIXING_SUPPORT`: TensorRT-LLM now initializes common NCCL communicators with graph
+mixing support off by default to reduce launch overhead for CUDA graph-captured NCCL operations.
+This assumes the communicator is not used by parallel graph launches or by uncaptured NCCL calls
+while a graph launch is outstanding. Set `NCCL_GRAPH_MIXING_SUPPORT=1` to restore NCCL's default
+graph mixing behavior if your workload needs it. For more details, see the
* `UCX_MAX_RNDV_RAILS`: With the default value 2, UCX attempts to use two InfiniBand (IB) NIC devices per GPU for Rendezvous (RNDV) transfers. When both the context and generation instances enable tensor- and expert-parallel (TEP), multiple TP ranks may transfer KV cache concurrently. Because each TP rank can use up to two NIC devices, some NIC devices can be shared across GPUs, causing contention and reduced throughput. Setting UCX_MAX_RNDV_RAILS=1 can reduce contention in this case.
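
As a minimal sketch of how the two variables above could be applied, the snippet below builds an environment mapping for a launcher process. The helper name `build_launch_env` is hypothetical and not part of TRT-LLM. The values are the ones discussed in the bullets, and the sketch assumes the variables need to be in place before the process that creates the NCCL communicators and UCX connections starts.

```python
import os


def build_launch_env(overrides, base=None):
    """Return a copy of the environment with `overrides` applied as strings.

    `build_launch_env` is a hypothetical helper for illustration only.
    """
    env = dict(os.environ if base is None else base)
    env.update({name: str(value) for name, value in overrides.items()})
    return env


# Restore NCCL's default graph mixing behavior and limit UCX to one
# rendezvous rail, as described in the bullets above.
env = build_launch_env({
    "NCCL_GRAPH_MIXING_SUPPORT": 1,
    "UCX_MAX_RNDV_RAILS": 1,
})
```

The resulting mapping can be passed as the `env` argument of `subprocess.Popen` or `subprocess.run` when starting the serving processes, which avoids mutating the environment of the parent process.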
A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.

+*Q. How do I debug a suspected hang from overlapping NCCL graph operations?*
+
+A. TensorRT-LLM turns graph mixing support off by default for common NCCL communicators. To check if
+a hang might be related to NCCL graph mixing support, set `NCCL_GRAPH_MIXING_SUPPORT=1` to restore
+NCCL's default graph mixing behavior.
+
*Q. What causes the substantial bandwidth fluctuations in kvCache transfers, especially during the first few requests following service initialization?*

A. The connections used for kvCache transfer between executors are established dynamically. Connection establishment incurs significant overhead, which explains the lower kvCache transfer bandwidth observed during the first requests after service startup: the measured bandwidth includes the connection setup time. When benchmarking, perform a warm-up phase first to ensure accurate performance measurements.
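
To illustrate the warm-up recommendation, here is a small sketch that computes transfer bandwidth while skipping the first few requests. The function name, the `(bytes, seconds)` sample format, and the numbers are assumptions made for this example. They are not taken from TRT-LLM's benchmarking tools.

```python
def steady_state_bandwidth(samples, warmup_requests):
    """Return bytes per second over all requests after the warm-up phase.

    `samples` is a list of (bytes_transferred, seconds) tuples in request
    order. The leading `warmup_requests` entries are dropped because they
    include connection establishment overhead.
    """
    measured = samples[warmup_requests:]
    if not measured:
        raise ValueError("no samples left after discarding warm-up requests")
    total_bytes = sum(nbytes for nbytes, _ in measured)
    total_seconds = sum(seconds for _, seconds in measured)
    if total_seconds <= 0:
        raise ValueError("total transfer time must be positive")
    return total_bytes / total_seconds


# Illustrative numbers only: the first request is slow because it also
# pays for connection setup.
samples = [(100, 10.0), (100, 1.0), (100, 1.0)]
with_warmup_skipped = steady_state_bandwidth(samples, warmup_requests=1)
without_warmup = steady_state_bandwidth(samples, warmup_requests=0)
```

With these made-up samples, including the first request pulls the average down considerably, which is the effect the answer above describes.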
docs/source/legacy/advanced/disaggregated-service.md
+12 -1 (12 additions & 1 deletion)
@@ -33,7 +33,12 @@ TRT-LLM uses some environment variables to control the behavior of disaggregated
There are some other useful environment variables that may help when encountering failures or performance issues.

-*`NCCL_GRAPH_MIXING_SUPPORT`: With the default value `1`, the CUDA driver may create too many CUDA streams while working with one CUDA graph, leading to performance drop. Setting it to `0` will reduce the number of CUDA streams, but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise it's unsafe.
+*`NCCL_GRAPH_MIXING_SUPPORT`: TensorRT-LLM now initializes common NCCL communicators with graph
+mixing support off by default to reduce launch overhead for CUDA graph-captured NCCL operations.
+This assumes the communicator is not used by parallel graph launches or by uncaptured NCCL calls
+while a graph launch is outstanding. Set `NCCL_GRAPH_MIXING_SUPPORT=1` to restore NCCL's default
+graph mixing behavior if your workload needs it. For more details, see the
*`UCX_MAX_RNDV_RAILS`: With the default value `2`, UCX attempts to use two InfiniBand (IB) NIC devices per GPU for Rendezvous (RNDV) transfers. When both the context and generation instances enable tensor- and expert-parallel (TEP), multiple TP ranks may transfer KV cache concurrently. Because each TP rank can use up to two NIC devices, some NIC devices can be shared across GPUs, causing contention and reduced throughput. Setting `UCX_MAX_RNDV_RAILS=1` can reduce contention in this case.
@@ -71,6 +76,12 @@ A. Yes, it's recommended that different executor use different GPUs. We support
A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.

+*Q. How do I debug a suspected hang from overlapping NCCL graph operations?*
+
+A. TensorRT-LLM turns graph mixing support off by default for common NCCL communicators. To check if
+a hang might be related to NCCL graph mixing support, set `NCCL_GRAPH_MIXING_SUPPORT=1` to restore
+NCCL's default graph mixing behavior.
+
*Q. What causes the substantial bandwidth fluctuations in kvCache transfers, especially during the first few requests following service initialization?*

A. The connections used for kvCache transfer between executors are established dynamically. Connection establishment incurs significant overhead, which explains the lower kvCache transfer bandwidth observed during the first requests after service startup: the measured bandwidth includes the connection setup time. When benchmarking, perform a warm-up phase first to ensure accurate performance measurements.