feat: fix the vLLM DP path#2517
Conversation
Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
|
/ok to test efc6fc2 |
Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
|
/ok to test 9f381f2 |
|
fast CI is failing for |
|
/ok to test d03ceee |
|
@terrykong could you help to take a review as well? |
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
|
/ok to test 2272d47 |
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
|
/ok to test 0ea633c |
@terrykong please review |
Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
|
/ok to test 05db6ab |
|
Thanks @guyueh1 . I think there is some patching for fp8 (fp8.py) that patches the raydistributedexecutor. Does this mean that functionality is broken by this PR unless we fix? Or does mp backend natively solve this? |
terrykong
left a comment
There was a problem hiding this comment.
Review: PR #2517 — feat: fix the vLLM DP path
Nice work enabling vLLM native DP (EP > TP) for both sync and async engines. The convergence curves and W&B links are helpful — thanks for the thorough testing.
Backend Selection Flow
For future readers, the distributed_executor_backend is now chosen as:
| Condition | Backend | Example |
|---|---|---|
model_parallel_size > 1 (TP>1 or PP>1) |
"ray" |
TP=4, EP=8 |
elif expert_parallel_size > 1 (EP>1, TP=1, PP=1) |
"mp" |
TP=1, EP=8 |
else |
None |
TP=1, EP=1 |
Then for vLLM-internal DP (when EP > TP):
- Sync engine: env vars
VLLM_DP_SIZE,VLLM_DP_RANK, etc. - Async engine: kwargs
data_parallel_size,data_parallel_rank, etc.
Regarding data_parallel_backend — this is a vLLM serving/online path parameter (used in Gym's local_vllm_model). The offline LLM() path used by NeMo-RL uses distributed_executor_backend + the data_parallel_* kwargs, which is correct.
Generated by Claude Code
| elif self.expert_parallel_size > 1: | ||
| # when there is data parallelism but no model-parallelism, we need to use | ||
| # the mp backend, otherwise it will default to ray and cause the worker to hang. | ||
| vllm_kwargs["distributed_executor_backend"] = "mp" |
There was a problem hiding this comment.
FP8 + "mp" backend concern: When model_parallel_size == 1 and EP > 1, the backend is set to "mp". In fp8.py:82-107, the model_parallel_size > 1 branch patches RayDistributedExecutor.collective_rpc to propagate FP8 patches to remote workers, while the else branch applies patches only in the current process. With "mp" backend, vLLM spawns child processes for DP workers — those child processes may not inherit the monkey-patches depending on fork vs spawn semantics.
Could you confirm that FP8 + EP>TP with TP=1 (i.e., the "mp" backend path) is tested or explicitly unsupported? If unsupported, consider adding an assertion to catch this early.
| "Please update your configuration to set `policy.generation.vllm_cfg.async_engine=false`. " | ||
| "See https://github.com/NVIDIA-NeMo/RL/issues/1101 for more details." | ||
| ) | ||
| self.vllm_dp_size = self.ep_size // self.tp_size if self.ep_size > 1 else 1 |
There was a problem hiding this comment.
Consider adding a guard for configurations where dp_size is not a multiple of vllm_dp_size:
| self.vllm_dp_size = self.ep_size // self.tp_size if self.ep_size > 1 else 1 | |
| self.vllm_dp_size = self.ep_size // self.tp_size if self.ep_size > 1 else 1 | |
| assert self.dp_size % self.vllm_dp_size == 0, ( | |
| f"dp_size ({self.dp_size}) must be a multiple of vllm_dp_size ({self.vllm_dp_size}). " | |
| f"This means world_size / model_parallel_size must be divisible by ep_size / tp_size. " | |
| "Please check your cluster and parallelism configuration." | |
| ) |
Without this, the rank prefix loop at line 437-439 would produce incorrect mappings for a partial vLLM instance.
| 'median(data["train/token_mult_prob_error"]) < 1.1' \ | ||
| 'data["train/token_mult_prob_error"]["10"] < 1.1' \ | ||
| 'mean(data["train/reward"]) > 0.45' \ | ||
| 'mean(data["timing/train/total_step_time"], -11, -1) < 70' |
There was a problem hiding this comment.
Minor: with MAX_STEPS=10, the mean(timing/train/total_step_time, -11, -1) < 70 window covers the entire run including warmup steps. Baseline recipes typically use 30 steps so the timing window skips warmup. Could you confirm this threshold passes reliably, or should it be loosened slightly to account for early compilation/profiling overhead?
|
|
||
| # ===== BEGIN CONFIG ===== | ||
| NUM_NODES=4 | ||
| GPUS_PER_NODE=4 |
There was a problem hiding this comment.
grpo-moonlight-vllm-tp2ep16.sh:6-7
GPU count mismatch: This script sets NUM_NODES=4 and GPUS_PER_NODE=4 (16 GPUs total), but the recipe YAML inherits cluster: {gpus_per_node: 8, num_nodes: 4} from grpo-moonlight-16ba3b-4n8g-megatron.yaml without overriding it. tools/launch uses GPUS_PER_NODE only for SLURM allocation, not to override cluster.gpus_per_node in the config — so the Ray cluster will request 32 GPUs from a 16-GPU allocation, likely causing a resource starvation hang.
Either add cluster.gpus_per_node=4 to the recipe YAML (and rename to 4n4g), or remove the GPUS_PER_NODE=4 override to use the default 8.
Also note: if fixed to 8 GPUs/node (32 total), the nightly GPU-hour budget (currently bumped to 1380) will need to increase further.

What does this PR do ?
Previously nemo-rl doesn't work for vllm's native DP (EP>TP), this PR wants to support this case.
The following basic tests have passed, now trying the nightly test
New nightly test figures:

H100 with EP=8 async engine
https://wandb.ai/nvidia/nemo-rl/runs/4mcplb63
H100 with TP=2 EP=16 sync engine

https://wandb.ai/nvidia/nemo-rl/runs/b3mon8zg
Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information