feat: fix the vLLM DP path by guyueh1 · Pull Request #2517 · NVIDIA-NeMo/RL

guyueh1 · 2026-05-18T04:46:25Z

What does this PR do ?

Previously, NeMo-RL did not work with vLLM's native DP (EP > TP). This PR adds support for that case.

The following basic tests have passed; the nightly test is now being run.

# eval test
uv run examples/run_eval.py \
generation.model_name=Qwen/Qwen3-30B-A3B \
cluster.num_nodes=2 \
cluster.gpus_per_node=4 \
generation.vllm_cfg.tensor_parallel_size=4 \
generation.vllm_cfg.expert_parallel_size=8 \
generation.vllm_cfg.async_engine=true

# grpo test
uv run examples/run_grpo.py \
--config examples/configs/grpo_math_1B_megatron.yaml \
policy.model_name=Qwen/Qwen3-30B-A3B \
cluster.num_nodes=4 \
cluster.gpus_per_node=4 \
policy.generation.colocated.enabled=false \
policy.generation.colocated.resources.num_nodes=2 \
policy.generation.colocated.resources.gpus_per_node=4 \
policy.generation.vllm_cfg.tensor_parallel_size=4 \
policy.generation.vllm_cfg.expert_parallel_size=8 \
policy.generation.vllm_cfg.async_engine=true \
policy.megatron_cfg.expert_model_parallel_size=8 \
policy.sequence_packing.enabled=false

New nightly test figures:
H100 with EP=8 async engine

https://wandb.ai/nvidia/nemo-rl/runs/4mcplb63

H100 with TP=2 EP=16 sync engine

https://wandb.ai/nvidia/nemo-rl/runs/b3mon8zg

Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>

copy-pr-bot · 2026-05-18T04:46:29Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

guyueh1 · 2026-05-18T04:52:40Z

/ok to test efc6fc2

Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>

guyueh1 · 2026-05-18T16:45:07Z

/ok to test 9f381f2

guyueh1 · 2026-05-19T17:33:47Z

Fast CI is failing with `uuidgen: command not found`; trying the full test set to see if it helps.

guyueh1 · 2026-05-22T23:51:01Z

Ran a test on llama3-8B with vLLM TP=1 and PP=1 to study the impact of changing `distributed_executor_backend` from `None` to `mp`. There is no observable performance difference.

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

guyueh1 · 2026-05-24T00:53:38Z

/ok to test d03ceee

yuki-97 · 2026-05-25T07:40:38Z

@terrykong could you help to take a review as well?

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

guyueh1 · 2026-05-25T23:17:54Z

/ok to test 2272d47

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

guyueh1 · 2026-05-26T01:34:51Z

/ok to test 0ea633c

guyueh1 · 2026-05-30T02:29:53Z

@terrykong could you help to take a review as well?

@terrykong please review

Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>

guyueh1 · 2026-05-30T02:31:15Z

/ok to test 05db6ab

terrykong · 2026-05-30T07:29:14Z

Thanks @guyueh1. I think there is some patching for FP8 (fp8.py) that patches `RayDistributedExecutor`. Does this mean that functionality is broken by this PR unless we fix it? Or does the mp backend natively solve this?

terrykong

Review: PR #2517 — feat: fix the vLLM DP path

Nice work enabling vLLM native DP (EP > TP) for both sync and async engines. The convergence curves and W&B links are helpful — thanks for the thorough testing.

Backend Selection Flow

For future readers, the distributed_executor_backend is now chosen as:

| Condition | Backend | Example |
| --- | --- | --- |
| `model_parallel_size > 1` (TP>1 or PP>1) | `"ray"` | TP=4, EP=8 |
| `elif expert_parallel_size > 1` (EP>1, TP=1, PP=1) | `"mp"` | TP=1, EP=8 |
| `else` | `None` | TP=1, EP=1 |

Then for vLLM-internal DP (when EP > TP):

- Sync engine: env vars `VLLM_DP_SIZE`, `VLLM_DP_RANK`, etc.
- Async engine: kwargs `data_parallel_size`, `data_parallel_rank`, etc.

Regarding data_parallel_backend — this is a vLLM serving/online path parameter (used in Gym's local_vllm_model). The offline LLM() path used by NeMo-RL uses distributed_executor_backend + the data_parallel_* kwargs, which is correct.
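The selection flow above can be sketched in a few lines of Python. This is only an illustration of the table, not the code in `vllm_worker.py`: the function names are hypothetical, `model_parallel_size` is assumed to be the product of TP and PP, and only the two environment variables and two keyword arguments named above are shown.

```python
from typing import Optional


def select_executor_backend(tp: int, pp: int, ep: int) -> Optional[str]:
    """Hypothetical helper mirroring the backend table in this review."""
    model_parallel_size = tp * pp  # assumption: TP x PP
    if model_parallel_size > 1:
        return "ray"
    elif ep > 1:
        # DP without model parallelism: pick "mp" explicitly instead of
        # letting the backend default to ray.
        return "mp"
    return None


def dp_settings(async_engine: bool, dp_size: int, dp_rank: int) -> dict:
    """Hypothetical helper: where the vLLM-internal DP settings are placed."""
    if async_engine:
        # Async engine: passed as keyword arguments.
        return {"kwargs": {"data_parallel_size": dp_size,
                           "data_parallel_rank": dp_rank}}
    # Sync engine: passed as environment variables (strings).
    return {"env": {"VLLM_DP_SIZE": str(dp_size),
                    "VLLM_DP_RANK": str(dp_rank)}}


if __name__ == "__main__":
    for tp, pp, ep in [(4, 1, 8), (1, 1, 8), (1, 1, 1)]:
        print(f"TP={tp} PP={pp} EP={ep} ->", select_executor_backend(tp, pp, ep))
```

The three rows of the table correspond to the three tuples in the loop.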

Generated by Claude Code

terrykong · 2026-06-01T20:42:55Z

+        elif self.expert_parallel_size > 1:
+            # when there is data parallelism but no model-parallelism, we need to use
+            # the mp backend, otherwise it will default to ray and cause the worker to hang.
+            vllm_kwargs["distributed_executor_backend"] = "mp"


vllm_worker.py:457-460

FP8 + "mp" backend concern: When model_parallel_size == 1 and EP > 1, the backend is set to "mp". In fp8.py:82-107, the model_parallel_size > 1 branch patches RayDistributedExecutor.collective_rpc to propagate FP8 patches to remote workers, while the else branch applies patches only in the current process. With "mp" backend, vLLM spawns child processes for DP workers — those child processes may not inherit the monkey-patches depending on fork vs spawn semantics.

Could you confirm that FP8 + EP>TP with TP=1 (i.e., the "mp" backend path) is tested or explicitly unsupported? If unsupported, consider adding an assertion to catch this early.
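If the combination turns out to be unsupported, the early assertion could look roughly like the sketch below. Whether FP8 actually breaks under the "mp" backend is still the open question in this comment; the function name and the `precision` argument are assumptions for illustration and are not taken from the PR.

```python
from typing import Optional


def check_fp8_backend(precision: str, executor_backend: Optional[str]) -> None:
    """Hypothetical guard: fail fast on FP8 with the "mp" executor backend.

    Assumes FP8 is signalled by precision == "fp8"; adjust to the real config.
    """
    if precision == "fp8" and executor_backend == "mp":
        raise ValueError(
            'FP8 generation with distributed_executor_backend="mp" '
            "(EP > 1 with TP=1, PP=1) is not validated; use TP > 1 or disable FP8."
        )
```

Calling such a guard right after the backend is chosen would surface the problem at startup rather than as silently unpatched workers.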

terrykong · 2026-06-01T20:42:55Z

-                    "Please update your configuration to set `policy.generation.vllm_cfg.async_engine=false`. "
-                    "See https://github.com/NVIDIA-NeMo/RL/issues/1101 for more details."
-                )
+        self.vllm_dp_size = self.ep_size // self.tp_size if self.ep_size > 1 else 1


vllm_generation.py:80

Consider adding a guard for configurations where dp_size is not a multiple of vllm_dp_size:

Suggested change

-        self.vllm_dp_size = self.ep_size // self.tp_size if self.ep_size > 1 else 1
+        self.vllm_dp_size = self.ep_size // self.tp_size if self.ep_size > 1 else 1
+        assert self.dp_size % self.vllm_dp_size == 0, (
+            f"dp_size ({self.dp_size}) must be a multiple of vllm_dp_size ({self.vllm_dp_size}). "
+            f"This means world_size / model_parallel_size must be divisible by ep_size / tp_size. "
+            "Please check your cluster and parallelism configuration."
+        )

Without this, the rank prefix loop at lines 437-439 would produce incorrect mappings for a partial vLLM instance.
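As a worked example of the arithmetic behind this guard, the standalone sketch below reuses the `vllm_dp_size` expression from the diff and takes `dp_size = world_size / model_parallel_size` from the assertion message. The function names, and the reading of `dp_size // vllm_dp_size` as the number of vLLM instances, are assumptions made for illustration.

```python
def vllm_dp_size(ep_size: int, tp_size: int) -> int:
    # Same expression as in the diff above.
    return ep_size // tp_size if ep_size > 1 else 1


def num_vllm_instances(world_size: int, model_parallel_size: int,
                       ep_size: int, tp_size: int) -> int:
    """Hypothetical helper: check divisibility, then count vLLM instances."""
    dp_size = world_size // model_parallel_size
    inner_dp = vllm_dp_size(ep_size, tp_size)
    assert dp_size % inner_dp == 0, (
        f"dp_size ({dp_size}) must be a multiple of vllm_dp_size ({inner_dp})."
    )
    # Assumption: each vLLM instance spans inner_dp data-parallel ranks.
    return dp_size // inner_dp


if __name__ == "__main__":
    # Eval test from the PR description: 2 nodes x 4 GPUs, TP=4, EP=8 (PP=1 assumed).
    print(num_vllm_instances(world_size=8, model_parallel_size=4, ep_size=8, tp_size=4))
```

With 12 GPUs instead of 8, `dp_size` would be 3 while `vllm_dp_size` stays 2, so the assertion would fire instead of leaving a partial vLLM instance.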

terrykong · 2026-06-01T20:42:55Z

+        'median(data["train/token_mult_prob_error"]) < 1.1' \
+        'data["train/token_mult_prob_error"]["10"] < 1.1' \
+        'mean(data["train/reward"]) > 0.45' \
+        'mean(data["timing/train/total_step_time"], -11, -1) < 70'


grpo-moonlight-vllm-dp8.sh:40

Minor: with MAX_STEPS=10, the mean(timing/train/total_step_time, -11, -1) < 70 window covers the entire run including warmup steps. Baseline recipes typically use 30 steps so the timing window skips warmup. Could you confirm this threshold passes reliably, or should it be loosened slightly to account for early compilation/profiling overhead?

terrykong · 2026-06-01T21:41:54Z

+
+# ===== BEGIN CONFIG =====
+NUM_NODES=4
+GPUS_PER_NODE=4


grpo-moonlight-vllm-tp2ep16.sh:6-7

GPU count mismatch: This script sets NUM_NODES=4 and GPUS_PER_NODE=4 (16 GPUs total), but the recipe YAML inherits cluster: {gpus_per_node: 8, num_nodes: 4} from grpo-moonlight-16ba3b-4n8g-megatron.yaml without overriding it. tools/launch uses GPUS_PER_NODE only for SLURM allocation, not to override cluster.gpus_per_node in the config — so the Ray cluster will request 32 GPUs from a 16-GPU allocation, likely causing a resource starvation hang.

Either add cluster.gpus_per_node=4 to the recipe YAML (and rename to 4n4g), or remove the GPUS_PER_NODE=4 override to use the default 8.

Also note: if fixed to 8 GPUs/node (32 total), the nightly GPU-hour budget (currently bumped to 1380) will need to increase further.
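The mismatch described above is plain arithmetic, sketched here with a hypothetical helper. The numbers come from this comment: the script allocates 4 nodes with 4 GPUs each, while the inherited YAML asks for 4 nodes with 8 GPUs each.

```python
def gpu_totals(slurm_nodes: int, slurm_gpus_per_node: int,
               cfg_nodes: int, cfg_gpus_per_node: int) -> tuple:
    """Return (GPUs allocated by SLURM, GPUs requested by the cluster config)."""
    allocated = slurm_nodes * slurm_gpus_per_node
    requested = cfg_nodes * cfg_gpus_per_node
    return allocated, requested


allocated, requested = gpu_totals(4, 4, 4, 8)
print(allocated, requested, requested <= allocated)  # prints: 16 32 False
```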

Support vllm dp

efc6fc2

Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>

guyueh1 requested review from a team as code owners May 18, 2026 04:46

guyueh1 requested review from yuki-97 and removed request for a team May 18, 2026 04:46

guyueh1 added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 18, 2026

Guyue Huang added 2 commits May 18, 2026 09:36

Add functional test to test suite

436f5ad

Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>

Fix lint

9f381f2

Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>

guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) labels May 19, 2026

guyueh1 added 2 commits May 22, 2026 16:57

Merge branch 'save' into support_vllm_dp

71c1c70

Fix

d03ceee

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

guyueh1 force-pushed the support_vllm_dp branch from d0e091a to d03ceee Compare May 24, 2026 00:53

yuki-97 reviewed May 25, 2026

yuki-97 requested a review from terrykong May 25, 2026 07:40

guyueh1 added 2 commits May 25, 2026 11:06

Small fixes

1f7c46a

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

add nightly test, review comments

2272d47

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

Add nightly test under H100 because Gb200 is broken

0ea633c

Signed-off-by: Guyue Huang <guyueh@nvidia.com>

yuki-97 previously approved these changes May 29, 2026

Merge branch 'main' into support_vllm_dp

05db6ab

Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>

terrykong reviewed Jun 1, 2026
