[Superseded] Early PCIe custom-allreduce enablement#39040

Closed
voipmonitor wants to merge 1 commit into vllm-project:main from voipmonitor:fix-pcie-custom-ar

@voipmonitor voipmonitor commented Apr 5, 2026

Superseded by #39633 and tracked in #37113.

The final change is a smaller correctness fix for the PCIe custom-allreduce path.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enables custom all-reduce for PCIe topologies with more than two GPUs by removing the previous restriction and early return. A suggestion was made to use logger.info_once to prevent redundant logging across multiple ranks in distributed environments.

Comment on lines +153 to 156
logger.info(
"PCIe topology detected with >2 GPUs. Custom allreduce "
"will use P2P cross-device reduce kernels."
)

high

This log message will be printed by every rank in the process group, which can be very noisy in distributed settings (e.g., 8 times for an 8-GPU setup). It is better to use logger.info_once to ensure the message is only logged once per node/engine initialization, consistent with other distributed initialization logs in vLLM.

Suggested change
- logger.info(
-     "PCIe topology detected with >2 GPUs. Custom allreduce "
-     "will use P2P cross-device reduce kernels."
- )
+ logger.info_once(
+     "PCIe topology detected with >2 GPUs. Custom allreduce "
+     "will use P2P cross-device reduce kernels."
+ )
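
For readers unfamiliar with the once-only logging pattern the review refers to, here is a minimal standard-library sketch of how such a helper could behave. It is not vLLM's implementation of `logger.info_once`; the names `info_once`, `announce`, and `ListHandler`, and the explicit rank check, are assumptions made for illustration. A cache like this only deduplicates within one process, so the sketch also gates on the rank to show how per-rank repetition could be suppressed.

```python
import logging
from functools import lru_cache

logger = logging.getLogger("pcie_ar_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)


class ListHandler(logging.Handler):
    """Collects formatted messages in memory so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)

MSG = (
    "PCIe topology detected with >2 GPUs. Custom allreduce "
    "will use P2P cross-device reduce kernels."
)


@lru_cache(maxsize=None)
def info_once(msg: str) -> None:
    """Log msg the first time it is seen in this process; later calls hit the cache."""
    logger.info(msg)


def announce(rank: int) -> None:
    """Assumed policy: only rank 0 announces, and only once."""
    if rank == 0:
        info_once(MSG)


# Simulate 8 ranks, each reaching the log call twice.
for rank in range(8):
    announce(rank)
    announce(rank)

print(len(handler.messages))
```

In a real multi-process launch each rank has its own cache, which is why the rank check (or an equivalent scope option in the logging helper) matters in addition to the deduplication.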

Add opt-in support for custom allreduce on PCIe-only multi-GPU topologies.
Previously hard-disabled for >2 GPUs without NVLink. The existing
cross_device_reduce kernels work over PCIe P2P and give ~9% decode
throughput improvement over NCCL.

Set VLLM_ENABLE_PCIE_ALLREDUCE=1 to enable. Requires PCIe P2P capable
driver (RTX GPUs need ForceP2P modprobe option).

Benchmark (GLM-5 NVFP4, 8xRTX PRO 6000 Blackwell, TP=8, C=1):
  NCCL:      52.7 / 47.7 / 45.7 tok/s (ctx 0/16k/32k)
  Custom AR:  57.6 / 51.7 / 48.7 tok/s (+9%/+8%/+7%)

Signed-off-by: Martin Vit <martin@voipmonitor.org>
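
The opt-in described in the commit message above can be sketched as a small decision helper. This is a hedged illustration only, not the code from this PR or from vLLM: the function names, the `has_nvlink` flag, and the rule that a single-GPU run never uses custom allreduce are assumptions. Only the `VLLM_ENABLE_PCIE_ALLREDUCE` variable and the ">2 GPUs without NVLink" condition come from the commit message.

```python
import os
from typing import Mapping, Optional


def pcie_allreduce_opted_in(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when VLLM_ENABLE_PCIE_ALLREDUCE is set to "1"."""
    env = os.environ if env is None else env
    return env.get("VLLM_ENABLE_PCIE_ALLREDUCE", "0") == "1"


def use_custom_allreduce(
    world_size: int,
    has_nvlink: bool,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Hypothetical gate mirroring the commit message.

    - fewer than 2 GPUs: nothing to reduce across (assumption)
    - NVLink, or exactly 2 GPUs: allowed, as before this change
    - more than 2 PCIe-only GPUs: allowed only with the explicit opt-in
    """
    if world_size < 2:
        return False
    if has_nvlink or world_size == 2:
        return True
    return pcie_allreduce_opted_in(env)


if __name__ == "__main__":
    # 8 PCIe-only GPUs, with and without the opt-in.
    print(use_custom_allreduce(8, False, {}))
    print(use_custom_allreduce(8, False, {"VLLM_ENABLE_PCIE_ALLREDUCE": "1"}))
```

Passing the environment as a mapping keeps the gate easy to exercise without touching the real process environment.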
@brandonmmusic-max

Can confirm this works on 4×RTX PRO 6000 Blackwell (TP=4). The ForceP2P modprobe config is critical — without it, custom AR silently falls back to NCCL because the auto-crossover benchmark detects P2P is slower (goes through SysMem staging at ~242μs vs ~17μs with BAR1 P2P).

Setup: Qwen3.5-397B-A17B-NVFP4, 4×RTX PRO 6000 Blackwell, TP=4, MTP=3, FP8 KV, vLLM 0.19.0, driver 595.45.04

I ran three sequential benchmarks to isolate the ForceP2P contribution. The only change between runs 2 and 3 was rebooting with the modprobe config — same image, same code, same launch command.

| Config | C=1 | C=2 | C=4 | C=8 |
|---|---|---|---|---|
| Baseline (NCCL SHM, custom AR disabled) | 146.0 | 194.5 | 266.3 | 323.1 |
| + TMA kernel optimization (no ForceP2P) | 144.2 | 215.7 | 301.1 | 364.8 |
| + ForceP2P (custom AR enabled) | 149.2 | 215.5 | 300.4 | 378.7 |

ForceP2P isolated delta (run 2 → run 3, only change was the modprobe config + reboot):

| Concurrency | Before | After | Delta |
|---|---|---|---|
| C=1 | 144.2 | 149.2 | +3.5% |
| C=2 | 215.7 | 215.5 | flat |
| C=4 | 301.1 | 300.4 | flat |
| C=8 | 364.8 | 378.7 | +3.8% |
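
For anyone who wants to re-derive the deltas, the arithmetic is just a relative change between run 2 and run 3. The sketch below uses the throughput figures quoted in this comment; the helper name `percent_delta` is made up for illustration.

```python
def percent_delta(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# (run 2, run 3) throughput per concurrency level, copied from the comment.
runs = {
    "C=1": (144.2, 149.2),
    "C=2": (215.7, 215.5),
    "C=4": (301.1, 300.4),
    "C=8": (364.8, 378.7),
}

deltas = {c: round(percent_delta(b, a), 1) for c, (b, a) in runs.items()}

for concurrency, delta in deltas.items():
    print(f"{concurrency}: {delta:+.1f}%")
```

The C=2 and C=4 rows come out within a few tenths of a percent of zero, which is what "flat" stands for here.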

The ~3-4% improvement from custom AR on TP=4 is consistent with AllReduce being a smaller fraction of total decode time at TP=4 vs TP=8 (smaller messages per call). Your TP=8 numbers showing +9% make sense — more AllReduce traffic means more headroom for the faster P2P path.
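
One way to make the "smaller fraction of decode time" argument concrete is an Amdahl-style estimate. The sketch below is illustrative only: the kernel-level speedup of 2x is a placeholder assumption, not a measured value, and the two observed gains come from different models and setups, so the resulting fractions should not be read as measurements.

```python
def overall_speedup(allreduce_fraction: float, allreduce_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only the allreduce part gets faster."""
    f, s = allreduce_fraction, allreduce_speedup
    return 1.0 / ((1.0 - f) + f / s)


def implied_fraction(observed_gain: float, allreduce_speedup: float) -> float:
    """Invert Amdahl's law: share of time spent in allreduce implied by a gain."""
    g, s = observed_gain, allreduce_speedup
    return (1.0 - 1.0 / g) / (1.0 - 1.0 / s)


ASSUMED_KERNEL_SPEEDUP = 2.0  # hypothetical: custom AR twice as fast as NCCL

tp8_fraction = implied_fraction(1.09, ASSUMED_KERNEL_SPEEDUP)   # +9% reported at TP=8
tp4_fraction = implied_fraction(1.035, ASSUMED_KERNEL_SPEEDUP)  # +3.5% reported at TP=4

print(f"TP=8 implied allreduce share: {tp8_fraction:.1%}")
print(f"TP=4 implied allreduce share: {tp4_fraction:.1%}")
```

Whatever kernel speedup is assumed, a larger end-to-end gain maps to a larger allreduce share, which matches the direction of the argument above.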

Key gotcha: nvidia-smi topo -m showing NODE (not PIX/PXB) means direct-attach without a PCIe switch, and in that case the driver does NOT enable BAR1 P2P by default. Verify with cat /proc/driver/nvidia/params | grep RegistryDwords: if it shows "", P2P isn't active and custom AR is silently doing nothing.

The benchmark I used was the llm decode bench from VOIPMonitor, and this is the config I ran.

Hardware

  • 4× NVIDIA RTX PRO 6000 Blackwell (96GB GDDR7 each, SM 12.0, PCIe Gen5 x16)
  • Threadripper 24C/48T, Pop!_OS (Ubuntu 24.04 base)
  • Driver 595.45.04, CUDA 13.2

Driver Config (required for PCIe oneshot AllReduce)

/etc/modprobe.d/nvidia-p2p-override.conf
options nvidia NVreg_RegistryDwords="ForceP2P=0x11;RMForceP2PType=1;RMPcieP2PType=2;GrdmaPciTopoCheckOverride=1;EnableResizableBar=1"

Verify: cat /proc/driver/nvidia/params | grep RegistryDwords should show ForceP2P=0x11
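
The same check can be scripted. The sketch below assumes the params file consists of `Key: value` lines with the RegistryDwords value in double quotes; that layout and the helper names are assumptions, so adjust the parsing if your driver prints something different.

```python
def parse_registry_dwords(params_text: str) -> dict[str, str]:
    """Extract the RegistryDwords entries from the text of the params file."""
    for line in params_text.splitlines():
        key, sep, value = line.partition(":")
        # Exact match so that RegistryDwordsPerDevice is not picked up.
        if sep and key.strip() == "RegistryDwords":
            raw = value.strip().strip('"')
            entries: dict[str, str] = {}
            for item in raw.split(";"):
                name, eq, val = item.partition("=")
                if eq:
                    entries[name.strip()] = val.strip()
            return entries
    return {}


def force_p2p_active(params_text: str) -> bool:
    """True when a ForceP2P entry is present in RegistryDwords."""
    return "ForceP2P" in parse_registry_dwords(params_text)


sample = 'RegistryDwords: "ForceP2P=0x11;RMForceP2PType=1;RMPcieP2PType=2"\n'
print(parse_registry_dwords(sample))
print(force_p2p_active('RegistryDwords: ""\n'))
```

On a real machine the input would be the contents of /proc/driver/nvidia/params, for example read with `pathlib.Path(...).read_text()`.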

Docker Image

vllm-019-verdict:latest (42GB, vLLM 0.19.0)

Baked-in patches:

  • PCIe OneShot AllReduce (custom_all_reduce for >2 PCIe GPUs)
  • SM120 AllReduce entries (12.0 in CUSTOM + SYMM_MEM size tables)
  • BFD fix (build_for_drafting PREFILL/DECODE mismatch, Sprint 20)
  • VerdictMoE oracle (VERDICT_MOE backend enum + auto-selection)
  • VerdictMoE assertion fix (flashinfer_fp4_moe whitelist)
  • Shared experts aux stream fix
  • Pre-compiled CUDA kernels (verdict_moe_ext.so + verdict_fused_cooperative_ext.so)

Model

lukealonso-qwen35-nvfp4 (nvidia/Qwen3.5-397B-A17B-NVFP4)

Pre-Launch GPU Setup

sudo nvidia-smi -i 0,1,2,3 --lock-gpu-clocks=2100,3090
sudo nvidia-smi -i 1 -pl 600

Launch Command

docker run -d \
  --name vllm-qwen35 \
  --gpus all --ipc=host --privileged \
  --entrypoint bash \
  -e CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -e VLLM_USE_VERDICT_MOE=1 \
  -e VLLM_VERDICT_MMA=1 \
  -e VERDICT_USE_TMA=1 \
  -p 9200:8000 \
  -v /home/brandonmusic/models/lukealonso-qwen35-nvfp4:/models/qwen35-nvfp4 \
  -v /home/brandonmusic/sm120-moe-bench/fused-moe/verdict_moe.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/verdict_moe.py:ro \
  -v /home/brandonmusic/sm120-moe-bench/fused-moe/csrc:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/csrc:ro \
  vllm-019-verdict:latest \
  -c "python3 -m vllm.entrypoints.openai.api_server \
  --model /models/qwen35-nvfp4 \
  --served-model-name qwen3.5-397b-nvfp4 \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8_e4m3 \
  --speculative-config '{\"method\":\"mtp\",\"num_speculative_tokens\":3}' \
  --trust-remote-code"

Note: I run my own version of the kernels.
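
The nested quoting around --speculative-config in the command above is easy to get wrong when editing by hand. As a hedged alternative, the inner server command can be generated with `json.dumps` and `shlex.join` so the escaping is handled mechanically. The variable names are illustrative, and only a subset of the flags from the launch command is shown.

```python
import json
import shlex

# Speculative decoding settings from the launch command.
spec_config = {"method": "mtp", "num_speculative_tokens": 3}

server_args = [
    "python3", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "/models/qwen35-nvfp4",
    "--served-model-name", "qwen3.5-397b-nvfp4",
    "--tensor-parallel-size", "4",
    "--kv-cache-dtype", "fp8_e4m3",
    # Compact JSON, no spaces, so it stays a single readable argument.
    "--speculative-config", json.dumps(spec_config, separators=(",", ":")),
    "--trust-remote-code",
]

# shlex.join quotes each argument so the string survives `bash -c`.
inner_command = shlex.join(server_args)
print(inner_command)
```

The resulting string can be passed as the single argument after `-c`, and `shlex.split` recovers the original argument list if you want to double-check the quoting.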

@voipmonitor voipmonitor changed the title from "[Perf] Enable custom allreduce on PCIe-only multi-GPU topologies" to "[Superseded] [Perf] Enable custom allreduce on PCIe-only multi-GPU topologies" Apr 12, 2026
@voipmonitor voipmonitor deleted the fix-pcie-custom-ar branch April 12, 2026 15:52
@voipmonitor voipmonitor changed the title from "[Superseded] [Perf] Enable custom allreduce on PCIe-only multi-GPU topologies" to "[Superseded] Early PCIe custom-allreduce enablement" Apr 12, 2026