Skip to content

feat: fix the vLLM DP path#2517

Open
guyueh1 wants to merge 22 commits into
NVIDIA-NeMo:mainfrom
guyueh1:support_vllm_dp
Open

feat: fix the vLLM DP path#2517
guyueh1 wants to merge 22 commits into
NVIDIA-NeMo:mainfrom
guyueh1:support_vllm_dp

Conversation

@guyueh1

@guyueh1 guyueh1 commented May 18, 2026

Copy link
Copy Markdown
Contributor

What does this PR do ?

Previously nemo-rl doesn't work for vllm's native DP (EP>TP), this PR wants to support this case.

The following basic tests have passed, now trying the nightly test

# eval test
uv run examples/run_eval.py \
generation.model_name=Qwen/Qwen3-30B-A3B \
cluster.num_nodes=2 \
cluster.gpus_per_node=4 \
generation.vllm_cfg.tensor_parallel_size=4 \
generation.vllm_cfg.expert_parallel_size=8 \
generation.vllm_cfg.async_engine=true \

# grpo test
uv run examples/run_grpo.py \
--config examples/configs/grpo_math_1B_megatron.yaml \
policy.model_name=Qwen/Qwen3-30B-A3B \
cluster.num_nodes=4 \
cluster.gpus_per_node=4 \
policy.generation.colocated.enabled=false \
policy.generation.colocated.resources.num_nodes=2 \
policy.generation.colocated.resources.gpus_per_node=4 \
policy.generation.vllm_cfg.tensor_parallel_size=4 \
policy.generation.vllm_cfg.expert_parallel_size=8 \
policy.generation.vllm_cfg.async_engine=true \
policy.megatron_cfg.expert_model_parallel_size=8 \
policy.sequence_packing.enabled=false \

New nightly test figures:
H100 with EP=8 async engine
image
https://wandb.ai/nvidia/nemo-rl/runs/4mcplb63

H100 with TP=2 EP=16 sync engine
image
https://wandb.ai/nvidia/nemo-rl/runs/b3mon8zg

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
@guyueh1 guyueh1 requested review from a team as code owners May 18, 2026 04:46
@copy-pr-bot

copy-pr-bot Bot commented May 18, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@guyueh1 guyueh1 requested review from yuki-97 and removed request for a team May 18, 2026 04:46
@guyueh1 guyueh1 added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 18, 2026
@guyueh1

guyueh1 commented May 18, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test efc6fc2

Guyue Huang added 2 commits May 18, 2026 09:36
Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
Signed-off-by: Guyue Huang <guyueh@login-lyris01.lyris.clusters.nvidia.com>
@guyueh1

guyueh1 commented May 18, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 9f381f2

@guyueh1

guyueh1 commented May 19, 2026

Copy link
Copy Markdown
Contributor Author

fast CI is failing for uuidgen: command not found, trying the full set to see if it helps

@guyueh1 guyueh1 added CI:L2 Run doctests, unit tests, functional tests, and convergence tests and removed CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) labels May 19, 2026
@guyueh1

guyueh1 commented May 22, 2026

Copy link
Copy Markdown
Contributor Author

Ran a test on llama3-8B with vLLM TP=1 & PP=1 to study the impact of changing distributed_executor_backend from None to mp , there is no observable performance difference.
W B Chart 5_22_2026, 4_49_37 PM

guyueh1 added 2 commits May 22, 2026 16:57
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 force-pushed the support_vllm_dp branch from d0e091a to d03ceee Compare May 24, 2026 00:53
@guyueh1

guyueh1 commented May 24, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test d03ceee

Comment thread nemo_rl/models/generation/vllm/vllm_generation.py Outdated
Comment thread nemo_rl/models/generation/vllm/vllm_worker.py Outdated
Comment thread nemo_rl/models/generation/vllm/vllm_worker.py
Comment thread nemo_rl/models/generation/vllm/vllm_worker.py
Comment thread tests/test_suites/nightly.txt Outdated
@yuki-97

yuki-97 commented May 25, 2026

Copy link
Copy Markdown
Contributor

@terrykong could you help to take a review as well?

@yuki-97 yuki-97 requested a review from terrykong May 25, 2026 07:40
guyueh1 added 2 commits May 25, 2026 11:06
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1

guyueh1 commented May 25, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 2272d47

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1

guyueh1 commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 0ea633c

yuki-97
yuki-97 previously approved these changes May 29, 2026
@guyueh1

guyueh1 commented May 30, 2026

Copy link
Copy Markdown
Contributor Author

@terrykong could you help to take a review as well?

@terrykong please review

Signed-off-by: Guyue Huang <140554423+guyueh1@users.noreply.github.com>
@guyueh1

guyueh1 commented May 30, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 05db6ab

@terrykong

Copy link
Copy Markdown
Collaborator

Thanks @guyueh1 . I think there is some patching for fp8 (fp8.py) that patches the raydistributedexecutor. Does this mean that functionality is broken by this PR unless we fix? Or does mp backend natively solve this?

@terrykong terrykong left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: PR #2517 — feat: fix the vLLM DP path

Nice work enabling vLLM native DP (EP > TP) for both sync and async engines. The convergence curves and W&B links are helpful — thanks for the thorough testing.

Backend Selection Flow

For future readers, the distributed_executor_backend is now chosen as:

Condition Backend Example
model_parallel_size > 1 (TP>1 or PP>1) "ray" TP=4, EP=8
elif expert_parallel_size > 1 (EP>1, TP=1, PP=1) "mp" TP=1, EP=8
else None TP=1, EP=1

Then for vLLM-internal DP (when EP > TP):

  • Sync engine: env vars VLLM_DP_SIZE, VLLM_DP_RANK, etc.
  • Async engine: kwargs data_parallel_size, data_parallel_rank, etc.

Regarding data_parallel_backend — this is a vLLM serving/online path parameter (used in Gym's local_vllm_model). The offline LLM() path used by NeMo-RL uses distributed_executor_backend + the data_parallel_* kwargs, which is correct.

Generated by Claude Code

elif self.expert_parallel_size > 1:
# when there is data parallelism but no model-parallelism, we need to use
# the mp backend, otherwise it will default to ray and cause the worker to hang.
vllm_kwargs["distributed_executor_backend"] = "mp"

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vllm_worker.py:457-460

FP8 + "mp" backend concern: When model_parallel_size == 1 and EP > 1, the backend is set to "mp". In fp8.py:82-107, the model_parallel_size > 1 branch patches RayDistributedExecutor.collective_rpc to propagate FP8 patches to remote workers, while the else branch applies patches only in the current process. With "mp" backend, vLLM spawns child processes for DP workers — those child processes may not inherit the monkey-patches depending on fork vs spawn semantics.

Could you confirm that FP8 + EP>TP with TP=1 (i.e., the "mp" backend path) is tested or explicitly unsupported? If unsupported, consider adding an assertion to catch this early.

"Please update your configuration to set `policy.generation.vllm_cfg.async_engine=false`. "
"See https://github.com/NVIDIA-NeMo/RL/issues/1101 for more details."
)
self.vllm_dp_size = self.ep_size // self.tp_size if self.ep_size > 1 else 1

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vllm_generation.py:80

Consider adding a guard for configurations where dp_size is not a multiple of vllm_dp_size:

Suggested change
self.vllm_dp_size = self.ep_size // self.tp_size if self.ep_size > 1 else 1
self.vllm_dp_size = self.ep_size // self.tp_size if self.ep_size > 1 else 1
assert self.dp_size % self.vllm_dp_size == 0, (
f"dp_size ({self.dp_size}) must be a multiple of vllm_dp_size ({self.vllm_dp_size}). "
f"This means world_size / model_parallel_size must be divisible by ep_size / tp_size. "
"Please check your cluster and parallelism configuration."
)

Without this, the rank prefix loop at line 437-439 would produce incorrect mappings for a partial vLLM instance.

'median(data["train/token_mult_prob_error"]) < 1.1' \
'data["train/token_mult_prob_error"]["10"] < 1.1' \
'mean(data["train/reward"]) > 0.45' \
'mean(data["timing/train/total_step_time"], -11, -1) < 70'

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grpo-moonlight-vllm-dp8.sh:40

Minor: with MAX_STEPS=10, the mean(timing/train/total_step_time, -11, -1) < 70 window covers the entire run including warmup steps. Baseline recipes typically use 30 steps so the timing window skips warmup. Could you confirm this threshold passes reliably, or should it be loosened slightly to account for early compilation/profiling overhead?


# ===== BEGIN CONFIG =====
NUM_NODES=4
GPUS_PER_NODE=4

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grpo-moonlight-vllm-tp2ep16.sh:6-7

GPU count mismatch: This script sets NUM_NODES=4 and GPUS_PER_NODE=4 (16 GPUs total), but the recipe YAML inherits cluster: {gpus_per_node: 8, num_nodes: 4} from grpo-moonlight-16ba3b-4n8g-megatron.yaml without overriding it. tools/launch uses GPUS_PER_NODE only for SLURM allocation, not to override cluster.gpus_per_node in the config — so the Ray cluster will request 32 GPUs from a 16-GPU allocation, likely causing a resource starvation hang.

Either add cluster.gpus_per_node=4 to the recipe YAML (and rename to 4n4g), or remove the GPUS_PER_NODE=4 override to use the default 8.

Also note: if fixed to 8 GPUs/node (32 total), the nightly GPU-hour budget (currently bumped to 1380) will need to increase further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L2 Run doctests, unit tests, functional tests, and convergence tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support use vLLM DP + EP in async engine

3 participants