[Superseded] Early PCIe custom-allreduce enablement #39040
Code Review
This pull request enables custom all-reduce for PCIe topologies with more than two GPUs by removing the previous restriction and early return. A suggestion was made to use logger.info_once to prevent redundant logging across multiple ranks in distributed environments.
logger.info(
    "PCIe topology detected with >2 GPUs. Custom allreduce "
    "will use P2P cross-device reduce kernels."
)
This log message will be printed by every rank in the process group, which can be very noisy in distributed settings (e.g., 8 times for an 8-GPU setup). It is better to use logger.info_once to ensure the message is only logged once per node/engine initialization, consistent with other distributed initialization logs in vLLM.
- logger.info(
-     "PCIe topology detected with >2 GPUs. Custom allreduce "
-     "will use P2P cross-device reduce kernels."
- )
+ logger.info_once(
+     "PCIe topology detected with >2 GPUs. Custom allreduce "
+     "will use P2P cross-device reduce kernels."
+ )
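For readers unfamiliar with the log-once pattern the review suggests, here is a minimal sketch using only the Python standard library. The helper name `info_once` mirrors the suggestion, but this is an illustrative stand-in and not vLLM's implementation; the logger name is hypothetical. Note that this sketch deduplicates only within a single process, so suppressing repeats across ranks that run as separate processes would need an additional rank check.

```python
import functools
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("pcie_allreduce_demo")


@functools.lru_cache(maxsize=None)
def info_once(message: str) -> None:
    """Log `message` at INFO level the first time it is seen in this process.

    lru_cache remembers the argument, so repeated calls with the same
    message return the cached result without calling logger.info again.
    """
    logger.info(message)


logging.basicConfig(level=logging.INFO)

# Eight repeated calls in one process emit a single log line.
for _ in range(8):
    info_once(
        "PCIe topology detected with >2 GPUs. Custom allreduce "
        "will use P2P cross-device reduce kernels."
    )
```

Caching on the message string keeps the call site identical to a plain `logger.info` call, which is why the suggested change is a one-word edit.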
Add opt-in support for custom allreduce on PCIe-only multi-GPU topologies.

Previously hard-disabled for >2 GPUs without NVLink. The existing cross_device_reduce kernels work over PCIe P2P and give ~9% decode throughput improvement over NCCL.

Set VLLM_ENABLE_PCIE_ALLREDUCE=1 to enable. Requires PCIe P2P capable driver (RTX GPUs need ForceP2P modprobe option).

Benchmark (GLM-5 NVFP4, 8xRTX PRO 6000 Blackwell, TP=8, C=1):
NCCL: 52.7 / 47.7 / 45.7 tok/s (ctx 0/16k/32k)
Custom AR: 57.6 / 51.7 / 48.7 tok/s (+9%/+8%/+7%)

Signed-off-by: Martin Vit <martin@voipmonitor.org>
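The opt-in behavior described in the commit message can be sketched as a small gate. This is a rough illustration under stated assumptions, not the code in the PR: the function names and the `has_nvlink` parameter are hypothetical, and only the environment variable name and the ">2 GPUs without NVLink" rule come from the message above. All other eligibility checks for custom allreduce are omitted.

```python
import os
from typing import Mapping, Optional


def pcie_allreduce_opted_in(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when VLLM_ENABLE_PCIE_ALLREDUCE is set to "1"."""
    env = os.environ if env is None else env
    return env.get("VLLM_ENABLE_PCIE_ALLREDUCE", "0") == "1"


def pcie_restriction_allows(
    world_size: int,
    has_nvlink: bool,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Model only the PCIe restriction from the commit message.

    Hypothetical helper: NVLink topologies and groups of at most two GPUs
    were never subject to the restriction, so they pass unconditionally.
    PCIe-only groups with more than two GPUs pass only on explicit opt-in.
    """
    if has_nvlink or world_size <= 2:
        return True
    return pcie_allreduce_opted_in(env)
```

Passing the environment as a mapping keeps the gate easy to exercise without mutating `os.environ`, for example `pcie_restriction_allows(8, False, {"VLLM_ENABLE_PCIE_ALLREDUCE": "1"})`.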
Can confirm this works on 4×RTX PRO 6000 Blackwell (TP=4). The ForceP2P modprobe config is critical: without it, custom AR silently falls back to NCCL because the auto-crossover benchmark detects that P2P is slower (it goes through SysMem staging at ~242μs vs ~17μs with BAR1 P2P).

Setup: Qwen3.5-397B-A17B-NVFP4, 4×RTX PRO 6000 Blackwell, TP=4, MTP=3, FP8 KV, vLLM 0.19.0, driver 595.45.04

I ran three sequential benchmarks to isolate the ForceP2P contribution. The only change between runs 2 and 3 was rebooting with the modprobe config; the image, code, and launch command were the same.
ForceP2P isolated delta (run 2 → run 3, only change was the modprobe config + reboot):
The ~3-4% improvement from custom AR on TP=4 is consistent with AllReduce being a smaller fraction of total decode time at TP=4 vs TP=8 (smaller messages per call). Your TP=8 numbers showing +9% make sense: more AllReduce traffic means more headroom for the faster P2P path.

Key gotcha:
Driver Config (required for PCIe oneshot AllReduce)
/etc/modprobe.d/nvidia-p2p-override.conf
Verify:
Docker Image
Baked-in patches:
Model
Pre-Launch GPU Setup
sudo nvidia-smi -i 0,1,2,3 --lock-gpu-clocks=2100,3090
sudo nvidia-smi -i 1 -pl 600
Launch Command
docker run -d \
--name vllm-qwen35 \
--gpus all --ipc=host --privileged \
--entrypoint bash \
-e CUDA_DEVICE_ORDER=PCI_BUS_ID \
-e VLLM_USE_VERDICT_MOE=1 \
-e VLLM_VERDICT_MMA=1 \
-e VERDICT_USE_TMA=1 \
-p 9200:8000 \
-v /home/brandonmusic/models/lukealonso-qwen35-nvfp4:/models/qwen35-nvfp4 \
-v /home/brandonmusic/sm120-moe-bench/fused-moe/verdict_moe.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/verdict_moe.py:ro \
-v /home/brandonmusic/sm120-moe-bench/fused-moe/csrc:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/csrc:ro \
vllm-019-verdict:latest \
-c "python3 -m vllm.entrypoints.openai.api_server \
--model /models/qwen35-nvfp4 \
--served-model-name qwen3.5-397b-nvfp4 \
--tensor-parallel-size 4 \
--max-model-len 65536 \
--gpu-memory-utilization 0.90 \
--kv-cache-dtype fp8_e4m3 \
--speculative-config '{\"method\":\"mtp\",\"num_speculative_tokens\":3}' \
--trust-remote-code"
(I have my own version of the kernels.)
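Once a server launched this way is up, a quick smoke test can confirm that it responds. The sketch below uses only the Python standard library and assumes the container is reachable on `localhost`; the port (9200, from `-p 9200:8000`) and the model name (from `--served-model-name`) are taken from the command above, while the helper names are hypothetical.

```python
import json
import urllib.request

# Host port from `-p 9200:8000`; "localhost" is an assumption.
BASE_URL = "http://localhost:9200"
# Value passed to --served-model-name in the launch command.
MODEL = "qwen3.5-397b-nvfp4"


def build_completion_request(prompt: str, max_tokens: int = 32) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 32, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first completion choice."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("PCIe peer-to-peer transfers are")` returns the generated continuation as a string; a connection error at this point usually means the container is still loading the model or the port mapping differs.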
Superseded by #39633 and tracked in #37113.
The final change is a smaller correctness fix for the PCIe custom-allreduce path.