Commit df78025
voipmonitor and claude committed
Enable P2P transport for AMD systems with >2 GPUs at PHB level
On AMD multi-socket systems (e.g., EPYC Turin), GPUs on the same NUMA node are connected through separate PCIe root complexes under the same PCIe Host Bridge, resulting in PATH_PHB topology. The default P2P level (PATH_PXB) disables P2P for these paths, forcing NCCL to fall back to shared memory (SHM) transport.

This patch extends the existing AMD P2P exception to allow PHB-level P2P for configurations with more than 2 GPUs. The original SYS-level P2P for ≤2 GPU configurations is preserved.

Benchmarked on dual-socket AMD EPYC 9575F (Turin) with 4x RTX PRO 6000 (Blackwell) on the same socket:

Transport change: SHM/direct/direct → P2P/direct pointer

Size  | Stock (SHM) | Patched (P2P) | Improvement
------|-------------|---------------|------------
32K   | 3.44        | 3.78          | +10%
128K  | 9.22        | 11.56         | +25%
256K  | 11.44       | 15.58         | +36%
512K  | 13.16       | 18.69         | +42%
1M    | 19.47       | 27.98         | +44%
2M    | 24.21       | 34.81         | +44%
4M    | 30.56       | 39.66         | +30%
16M   | 36.04       | 44.93         | +25%
128M  | 37.60       | 46.65         | +24%

(bus bandwidth in GB/s, all_reduce_perf -g 4 -n 500, Ring algorithm)

The only workaround is NCCL_P2P_LEVEL=SYS, which most users are not aware of, resulting in significant performance loss especially for latency-sensitive workloads like LLM inference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
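The Improvement column can be recomputed from the two bandwidth columns as (patched / stock - 1) × 100, rounded to the nearest whole percent. The sketch below does that for the nine rows quoted above. The type and function names (`BenchRow`, `improvementPercent`, `tableIsConsistent`) are made up for this illustration and are not part of NCCL or nccl-tests.

```cpp
#include <array>
#include <cassert>

// One row of the benchmark table from the commit message (bus bandwidth, GB/s).
// Hypothetical helper type, used only for this illustration.
struct BenchRow {
  const char* size;
  double stockShm;
  double patchedP2p;
  int reportedPercent;  // value from the "Improvement" column
};

// Relative gain of patched over stock, rounded to the nearest whole percent.
constexpr int improvementPercent(double stock, double patched) {
  const double pct = (patched / stock - 1.0) * 100.0;
  return static_cast<int>(pct >= 0 ? pct + 0.5 : pct - 0.5);
}

// Figures copied from the table in the commit message.
constexpr std::array<BenchRow, 9> kRows = {{
    {"32K", 3.44, 3.78, 10},
    {"128K", 9.22, 11.56, 25},
    {"256K", 11.44, 15.58, 36},
    {"512K", 13.16, 18.69, 42},
    {"1M", 19.47, 27.98, 44},
    {"2M", 24.21, 34.81, 44},
    {"4M", 30.56, 39.66, 30},
    {"16M", 36.04, 44.93, 25},
    {"128M", 37.60, 46.65, 24},
}};

// Returns true when every reported percentage matches the recomputed one.
constexpr bool tableIsConsistent() {
  for (const BenchRow& row : kRows) {
    if (improvementPercent(row.stockShm, row.patchedP2p) != row.reportedPercent) {
      return false;
    }
  }
  return true;
}
```

With these inputs the recomputed gains range from +10% at 32K to +44% at 1M and 2M, matching the column as reported.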
1 parent 3619159 commit df78025

1 file changed

Lines changed: 7 additions & 2 deletions

src/graph/paths.cc

@@ -324,8 +324,13 @@ ncclResult_t ncclTopoCheckP2p(struct ncclComm* comm, struct ncclTopoSystem* syst
 
   int arch, vendor, model;
   NCCLCHECK(ncclTopoCpuType(system, &arch, &vendor, &model));
-  // Allow P2P between pairs of GPUs on AMD systems
-  if ((arch == NCCL_TOPO_CPU_ARCH_X86 && vendor == NCCL_TOPO_CPU_VENDOR_AMD) && system->nodes[GPU].count <= 2) p2pLevel = PATH_SYS;
+  // Allow P2P on AMD systems: SYS level for ≤2 GPUs (original behavior),
+  // PHB level for >2 GPUs to enable same-socket P2P through the PCIe Host Bridge.
+  // Without this, GPUs under separate root complexes on the same NUMA node
+  // (PATH_PHB) fall back to shared memory transport, losing 24-46% bandwidth.
+  if (arch == NCCL_TOPO_CPU_ARCH_X86 && vendor == NCCL_TOPO_CPU_VENDOR_AMD) {
+    p2pLevel = (system->nodes[GPU].count <= 2) ? PATH_SYS : PATH_PHB;
+  }
 
   // User override
   NCCLCHECK(ncclGetUserP2pLevel(&p2pLevel));
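To make the effect of the hunk easier to see in isolation, here is a minimal standalone sketch of the level selection before and after the patch. It is not NCCL code. The enum is a simplified stand-in that keeps only the relative order PIX < PXB < PHB < SYS implied by the commit message, the `x86Amd` flag collapses the arch and vendor checks into one boolean, and the `path <= level` comparison is an assumption about how the chosen level is applied. The user override via `ncclGetUserP2pLevel` is left out.

```cpp
#include <cassert>

// Simplified stand-in for NCCL's PATH_* constants (assumption: only the
// relative order matters here; the real values in NCCL differ).
enum PathType { PATH_PIX = 0, PATH_PXB = 1, PATH_PHB = 2, PATH_SYS = 3 };

// Pre-patch rule: AMD x86 systems get SYS-level P2P only with <= 2 GPUs;
// everything else keeps the default PATH_PXB.
constexpr PathType stockP2pLevel(bool x86Amd, int gpuCount) {
  return (x86Amd && gpuCount <= 2) ? PATH_SYS : PATH_PXB;
}

// Patched rule: AMD x86 systems with > 2 GPUs are raised to PHB level,
// while the <= 2 GPU case keeps SYS level.
constexpr PathType patchedP2pLevel(bool x86Amd, int gpuCount) {
  if (x86Amd) {
    return gpuCount <= 2 ? PATH_SYS : PATH_PHB;
  }
  return PATH_PXB;
}

// Assumed check: P2P is allowed when the GPU-to-GPU path type is no
// "farther" than the selected level.
constexpr bool p2pAllowed(PathType path, PathType level) {
  return path <= level;
}
```

Under these assumptions, a 4-GPU AMD system with PATH_PHB links is denied P2P by the stock rule and allowed by the patched one, while PATH_SYS (cross-socket) paths on that system stay on the non-P2P fallback in both cases.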
