Commit df78025
Enable P2P transport for AMD systems with >2 GPUs at PHB level
On AMD multi-socket systems (e.g., EPYC Turin), GPUs on the same NUMA
node are connected through separate PCIe root complexes under the same
PCIe Host Bridge, resulting in PATH_PHB topology. The default P2P level
(PATH_PXB) disables P2P for these paths, forcing NCCL to fall back to
shared memory (SHM) transport.
This patch extends the existing AMD P2P exception to allow PHB-level P2P
for configurations with more than 2 GPUs. The original SYS-level P2P for
≤2 GPU configurations is preserved.
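
The selection rule described above can be sketched as follows. This is a minimal illustration, not the actual NCCL source: the enums are hypothetical, simplified stand-ins (NCCL defines more path types), and the function names `maxP2pLevel` and `p2pAllowed` are made up for this example. Only the relative ordering of path types and the GPU-count thresholds are taken from this commit message.

```cpp
#include <cassert>

// Hypothetical, simplified path types ordered by increasing distance.
// NCCL's real list has more entries; only the relative order matters here.
enum PathType { PATH_PIX = 0, PATH_PXB = 1, PATH_PHB = 2, PATH_SYS = 3 };

// Hypothetical vendor tag used only for this sketch.
enum CpuVendor { VENDOR_INTEL, VENDOR_AMD };

// Returns the farthest path type over which P2P stays enabled.
// Mirrors the rule described in the commit message, not the real code.
PathType maxP2pLevel(CpuVendor vendor, int nGpus) {
  assert(nGpus > 0);
  PathType level = PATH_PXB;  // default level: P2P stops below the host bridge
  if (vendor == VENDOR_AMD) {
    if (nGpus <= 2) {
      level = PATH_SYS;       // pre-existing AMD exception, preserved
    } else {
      level = PATH_PHB;       // new: allow P2P through the PCIe Host Bridge
    }
  }
  return level;
}

// P2P is usable when the GPU-to-GPU path is no farther than the allowed level.
bool p2pAllowed(PathType path, CpuVendor vendor, int nGpus) {
  return path <= maxP2pLevel(vendor, nGpus);
}
```

Under these assumptions, a PATH_PHB pair on a 4-GPU AMD system passes the check, while a PATH_SYS pair on the same system still does not.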
Benchmarked on dual-socket AMD EPYC 9575F (Turin) with 4x RTX PRO 6000
(Blackwell) on the same socket:
Transport change: SHM/direct/direct → P2P/direct pointer
Size | Stock (SHM) | Patched (P2P) | Improvement
-----|-------------|---------------|------------
32K  | 3.44        | 3.78          | +10%
128K | 9.22        | 11.56         | +25%
256K | 11.44       | 15.58         | +36%
512K | 13.16       | 18.69         | +42%
1M   | 19.47       | 27.98         | +44%
2M   | 24.21       | 34.81         | +44%
4M   | 30.56       | 39.66         | +30%
16M  | 36.04       | 44.93         | +25%
128M | 37.60       | 46.65         | +24%
(bus bandwidth in GB/s, all_reduce_perf -g 4 -n 500, Ring algorithm)
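
The Improvement column appears to be the relative gain in bus bandwidth, rounded to a whole percent. A small sketch of that arithmetic, where the helper name `improvementPercent` is hypothetical and the inputs are rows from the table above:

```cpp
#include <cassert>
#include <cmath>

// Relative bus-bandwidth gain of the patched run over the stock run,
// rounded to the nearest whole percent.
int improvementPercent(double stockGBs, double patchedGBs) {
  assert(stockGBs > 0.0);
  return static_cast<int>(std::lround((patchedGBs / stockGBs - 1.0) * 100.0));
}
```

For the 1M row, `improvementPercent(19.47, 27.98)` evaluates to 44, matching the table.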
Without this patch, the only workaround is setting NCCL_P2P_LEVEL=SYS,
which most users are unaware of. The result is a significant performance
loss, especially for latency-sensitive workloads such as LLM inference.
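
For an unpatched build, the environment-variable workaround can also be applied from inside an application. The sketch below assumes a POSIX system and assumes that NCCL reads NCCL_P2P_LEVEL during initialization, so the helper would have to run before the first NCCL call. The function name `ensureP2pLevel` is hypothetical.

```cpp
#include <cassert>
#include <stdlib.h>
#include <string>

// Sets NCCL_P2P_LEVEL unless the user already exported a value, then
// returns the value that is in effect. Assumption: this is called before
// any NCCL initialization in the process.
std::string ensureP2pLevel(const char* level) {
  assert(level != nullptr);
  // Third argument 0: keep an existing value instead of overwriting it.
  setenv("NCCL_P2P_LEVEL", level, 0);
  const char* effective = getenv("NCCL_P2P_LEVEL");
  return effective ? std::string(effective) : std::string();
}
```

Not overwriting an existing value keeps an explicit user choice authoritative, which matters when the same binary runs on systems with different topologies.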
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 3619159 commit df78025
1 file changed
Lines changed: 7 additions & 2 deletions
@@ -324,8 +324,13 @@ (hunk body not preserved: old lines 327-328 removed, new lines 327-333 added)