[Superseded] Early PCIe custom-allreduce enablement by voipmonitor · Pull Request #39040 · vllm-project/vllm

voipmonitor · 2026-04-05T19:47:22Z

Superseded by #39633 and tracked in #37113.

The final change is a smaller correctness fix for the PCIe custom-allreduce path.

gemini-code-assist

Code Review

This pull request enables custom all-reduce for PCIe topologies with more than two GPUs by removing the previous restriction and early return. A suggestion was made to use logger.info_once to prevent redundant logging across multiple ranks in distributed environments.

gemini-code-assist · 2026-04-05T19:49:39Z

+            logger.info(
+                "PCIe topology detected with >2 GPUs. Custom allreduce "
+                "will use P2P cross-device reduce kernels."
            )


This log message will be printed by every rank in the process group, which can be very noisy in distributed settings (e.g., 8 times for an 8-GPU setup). It is better to use logger.info_once to ensure the message is only logged once per node/engine initialization, consistent with other distributed initialization logs in vLLM.

Suggested change

-            logger.info(
-                "PCIe topology detected with >2 GPUs. Custom allreduce "
-                "will use P2P cross-device reduce kernels."
-            )
+            logger.info_once(
+                "PCIe topology detected with >2 GPUs. Custom allreduce "
+                "will use P2P cross-device reduce kernels."
+            )
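To illustrate what log-once semantics buy here, the following is a minimal sketch using only the standard library. It is not vLLM's implementation of `logger.info_once`; the `info_once` helper, the `ListHandler` class, and the logger name are hypothetical, and the deduplication is assumed to be scoped to a single process.

```python
import functools
import logging

logger = logging.getLogger("pcie_allreduce_demo")  # hypothetical logger name


@functools.lru_cache(maxsize=None)
def info_once(message: str) -> None:
    """Log `message` at INFO level only the first time it is seen in this process."""
    logger.info(message)


class ListHandler(logging.Handler):
    """Collects log messages in memory so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

MESSAGE = (
    "PCIe topology detected with >2 GPUs. Custom allreduce "
    "will use P2P cross-device reduce kernels."
)

# Simulate the same code path being hit eight times within one process.
for _ in range(8):
    info_once(MESSAGE)

print(f"records emitted: {len(handler.messages)}")
```

Because the cache lives in process memory, this sketch only suppresses repeats inside one process; whether a given deployment runs one process per rank affects how many copies of the message appear overall.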

Add opt-in support for custom allreduce on PCIe-only multi-GPU topologies.

Previously hard-disabled for >2 GPUs without NVLink. The existing cross_device_reduce kernels work over PCIe P2P and give ~9% decode throughput improvement over NCCL.

Set VLLM_ENABLE_PCIE_ALLREDUCE=1 to enable. Requires PCIe P2P capable driver (RTX GPUs need ForceP2P modprobe option).

Benchmark (GLM-5 NVFP4, 8xRTX PRO 6000 Blackwell, TP=8, C=1):
NCCL: 52.7 / 47.7 / 45.7 tok/s (ctx 0/16k/32k)
Custom AR: 57.6 / 51.7 / 48.7 tok/s (+9%/+8%/+7%)

Signed-off-by: Martin Vit <martin@voipmonitor.org>
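The opt-in behaviour described in this commit message can be sketched roughly as follows. This is an illustration only, not the code from the PR: the function names `pcie_allreduce_opted_in` and `custom_allreduce_allowed` and the `has_nvlink` parameter are hypothetical, and only the environment variable name comes from the commit message.

```python
import os
from typing import Mapping

ENV_FLAG = "VLLM_ENABLE_PCIE_ALLREDUCE"  # name taken from the commit message


def pcie_allreduce_opted_in(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the flag is explicitly set to "1"."""
    return env.get(ENV_FLAG, "0") == "1"


def custom_allreduce_allowed(
    world_size: int,
    has_nvlink: bool,
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Hypothetical gate: NVLink or <=2 GPUs is always allowed; PCIe-only >2 GPUs needs opt-in."""
    if has_nvlink or world_size <= 2:
        return True
    return pcie_allreduce_opted_in(env)


if __name__ == "__main__":
    # PCIe-only, 8 GPUs: allowed only when the flag is set.
    print(custom_allreduce_allowed(8, has_nvlink=False, env={}))
    print(custom_allreduce_allowed(8, has_nvlink=False, env={ENV_FLAG: "1"}))
```

Passing the environment in as a mapping keeps the gate easy to test without mutating `os.environ`.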

brandonmmusic-max · 2026-04-06T05:09:12Z

Can confirm this works on 4×RTX PRO 6000 Blackwell (TP=4). The ForceP2P modprobe config is critical — without it, custom AR silently falls back to NCCL because the auto-crossover benchmark detects P2P is slower (goes through SysMem staging at ~242μs vs ~17μs with BAR1 P2P).

Setup: Qwen3.5-397B-A17B-NVFP4, 4×RTX PRO 6000 Blackwell, TP=4, MTP=3, FP8 KV, vLLM 0.19.0, driver 595.45.04

I ran three sequential benchmarks to isolate the ForceP2P contribution. The only change between runs 2 and 3 was rebooting with the modprobe config — same image, same code, same launch command.

Config	C=1	C=2	C=4	C=8
Baseline (NCCL SHM, custom AR disabled)	146.0	194.5	266.3	323.1
+ TMA kernel optimization (no ForceP2P)	144.2	215.7	301.1	364.8
+ ForceP2P (custom AR enabled)	149.2	215.5	300.4	378.7

ForceP2P isolated delta (run 2 → run 3, only change was the modprobe config + reboot):

Concurrency	Before	After	Delta
C=1	144.2	149.2	+3.5%
C=2	215.7	215.5	flat
C=4	301.1	300.4	flat
C=8	364.8	378.7	+3.8%
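The deltas in the table above can be reproduced with a few lines of arithmetic. The sketch below uses only the before/after figures reported in the table; the variable and function names are illustrative.

```python
# Before/after throughput per concurrency level, copied from the table above.
runs = {
    "C=1": (144.2, 149.2),
    "C=2": (215.7, 215.5),
    "C=4": (301.1, 300.4),
    "C=8": (364.8, 378.7),
}


def percent_delta(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


deltas = {name: round(percent_delta(*pair), 1) for name, pair in runs.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

The C=2 and C=4 rows come out within roughly 0.2% of zero, which is why the table reports them as flat.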

The ~3-4% improvement from custom AR on TP=4 is consistent with AllReduce being a smaller fraction of total decode time at TP=4 vs TP=8 (smaller messages per call). Your TP=8 numbers showing +9% make sense — more AllReduce traffic means more headroom for the faster P2P path.

Key gotcha: nvidia-smi topo -m showing NODE (not PIX/PXB) means direct-attach without a PCIe switch, and in that case the driver does NOT enable BAR1 P2P by default. Verify with cat /proc/driver/nvidia/params | grep RegistryDwords: if it shows "", P2P isn't active and custom AR is silently doing nothing.

The benchmark I used was the LLM decode bench from voipmonitor, and this is the config I ran:

Hardware

4× NVIDIA RTX PRO 6000 Blackwell (96GB GDDR7 each, SM 12.0, PCIe Gen5 x16)
Threadripper 24C/48T, Pop!_OS (Ubuntu 24.04 base)
Driver 595.45.04, CUDA 13.2

Driver Config (required for PCIe oneshot AllReduce)

/etc/modprobe.d/nvidia-p2p-override.conf
options nvidia NVreg_RegistryDwords="ForceP2P=0x11;RMForceP2PType=1;RMPcieP2PType=2;GrdmaPciTopoCheckOverride=1;EnableResizableBar=1"

Verify: cat /proc/driver/nvidia/params | grep RegistryDwords should show ForceP2P=0x11
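The same verification can be scripted. The sketch below parses the text of /proc/driver/nvidia/params and reports whether ForceP2P appears in RegistryDwords. It assumes the file contains a line of the form `RegistryDwords: "key=value;key=value"`; the function names are hypothetical and the parsing has not been checked against every driver version.

```python
import re
from pathlib import Path

PARAMS_PATH = Path("/proc/driver/nvidia/params")


def registry_dwords(params_text: str) -> dict:
    """Parse the RegistryDwords line into a dict; empty if the value is "" or missing."""
    # Assumed line format: RegistryDwords: "ForceP2P=0x11;RMForceP2PType=1;..."
    match = re.search(r'^RegistryDwords:\s*"(.*)"\s*$', params_text, re.MULTILINE)
    if match is None or not match.group(1):
        return {}
    pairs = (item.split("=", 1) for item in match.group(1).split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def force_p2p_active(params_text: str) -> bool:
    """True when a ForceP2P entry is present in RegistryDwords."""
    return "ForceP2P" in registry_dwords(params_text)


def check_host(path: Path = PARAMS_PATH) -> bool:
    """Read the params file on this machine and report the ForceP2P state."""
    return force_p2p_active(path.read_text())
```

On a machine with the NVIDIA driver loaded, `check_host()` would be expected to return `True` after the modprobe override and a reboot, and `False` when RegistryDwords is empty.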

Docker Image

vllm-019-verdict:latest (42GB, vLLM 0.19.0)

Baked-in patches:

PCIe OneShot AllReduce (custom_all_reduce for >2 PCIe GPUs)
SM120 AllReduce entries (12.0 in CUSTOM + SYMM_MEM size tables)
BFD fix (build_for_drafting PREFILL/DECODE mismatch, Sprint 20)
VerdictMoE oracle (VERDICT_MOE backend enum + auto-selection)
VerdictMoE assertion fix (flashinfer_fp4_moe whitelist)
Shared experts aux stream fix
Pre-compiled CUDA kernels (verdict_moe_ext.so + verdict_fused_cooperative_ext.so)

Model

lukealonso-qwen35-nvfp4 (nvidia/Qwen3.5-397B-A17B-NVFP4)

Pre-Launch GPU Setup

sudo nvidia-smi -i 0,1,2,3 --lock-gpu-clocks=2100,3090
sudo nvidia-smi -i 1 -pl 600

Launch Command

docker run -d \
  --name vllm-qwen35 \
  --gpus all --ipc=host --privileged \
  --entrypoint bash \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -e VLLM_USE_VERDICT_MOE=1 \
  -e VLLM_VERDICT_MMA=1 \
  -e VERDICT_USE_TMA=1 \
  -p 9200:8000 \
  -v /home/brandonmusic/models/lukealonso-qwen35-nvfp4:/models/qwen35-nvfp4 \
  -v /home/brandonmusic/sm120-moe-bench/fused-moe/verdict_moe.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/verdict_moe.py:ro \
  -v /home/brandonmusic/sm120-moe-bench/fused-moe/csrc:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/csrc:ro \
  vllm-019-verdict:latest \
  -c "python3 -m vllm.entrypoints.openai.api_server \
  --model /models/qwen35-nvfp4 \
  --served-model-name qwen3.5-397b-nvfp4 \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8_e4m3 \
  --speculative-config '{\"method\":\"mtp\",\"num_speculative_tokens\":3}' \
  --trust-remote-code"

(I have my own version of the kernels.)
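Once a server launched this way is up, a quick smoke test could look like the sketch below. It assumes the server exposes the OpenAI-compatible `/v1/completions` route on the host port mapped above (9200) and uses the served model name from the launch command; the helper names and default parameters are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # host port from the -p 9200:8000 mapping
MODEL_NAME = "qwen3.5-397b-nvfp4"   # value of --served-model-name


def build_completion_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a POST request for the completions endpoint."""
    payload = {"model": MODEL_NAME, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 64, timeout: float = 120.0) -> str:
    """Send the request and return the first completion's text."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete("Hello")` requires the container to be running; `build_completion_request` can be inspected offline to confirm the URL and payload.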

voipmonitor mentioned this pull request Apr 5, 2026

[SM120][GLM-5.1] NVFP4 DCP/MTP stack tracker #37113

Open

gemini-code-assist Bot reviewed Apr 5, 2026


voipmonitor force-pushed the fix-pcie-custom-ar branch from 55aa284 to 9be4f13 Compare April 5, 2026 19:53

voipmonitor mentioned this pull request Apr 5, 2026

[superseded] b12x MoE and dense FP4 GEMM backends #39042

Closed

mrigasiyer mentioned this pull request Apr 8, 2026

[Perf] Async GPU P2P access cache precomputation to reduce startup time #39249

Open


voipmonitor changed the title ~~[Perf] Enable custom allreduce on PCIe-only multi-GPU topologies~~ [Superseded] [Perf] Enable custom allreduce on PCIe-only multi-GPU topologies Apr 12, 2026

voipmonitor closed this Apr 12, 2026

voipmonitor deleted the fix-pcie-custom-ar branch April 12, 2026 15:52

voipmonitor changed the title ~~[Superseded] [Perf] Enable custom allreduce on PCIe-only multi-GPU topologies~~ [Superseded] Early PCIe custom-allreduce enablement Apr 12, 2026

voipmonitor wants to merge 1 commit into vllm-project:main from voipmonitor:fix-pcie-custom-ar
