feat: Add HybridEP support for MoE expert parallelism (#1942)
Walkthrough: The changes add two conditional configuration hooks to the MoE setup function for optional dispatcher and HybridEP parameters, and update the DeepEP dependency reference from a specific commit to the hybrid-ep branch across multiple dependency groups in the project configuration.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pyproject.toml (1)
324-327: ⚠️ Potential issue | 🟠 Major

Stale `dependency-metadata` version for `deep_ep`. The dependency is pinned to the `hybrid-ep` branch (which dynamically generates its version from the current commit hash via `git rev-parse --short HEAD`), but the `dependency-metadata` version is statically set to `v1.2.1+bfded34`. This means the metadata version will become stale whenever the branch advances, potentially causing uv resolver failures. Either:

- Update to pin to a specific commit hash instead of a branch, or
- Update the metadata version to match the current HEAD of `hybrid-ep` and regenerate it whenever the dependency updates.
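For illustration, a hedged sketch of the first option in `pyproject.toml`; the commit hash and version string here are assumptions taken from this PR's own commit notes (7febc6e2 on hybrid-ep), not verified against the branch:

```toml
[tool.uv.sources]
# Assumption: pin deep_ep to the specific hybrid-ep commit this PR builds on,
# instead of the moving branch ref
deep_ep = { git = "https://github.com/deepseek-ai/DeepEP.git", rev = "7febc6e2" }

[[tool.uv.dependency-metadata]]
name = "deep_ep"
# Keep in sync with the pinned commit above (the branch derives its version
# from `git rev-parse --short HEAD`)
version = "1.2.1+7febc6e"
```

Pinning the `rev` and regenerating the static metadata together keeps the resolver's view consistent with what actually gets cloned.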
🤖 Fix all issues with AI agents
In `@nemo_rl/models/megatron/setup.py`:
- Around line 405-412: The new runtime keys moe_flex_dispatcher_backend and
moe_hybridep_num_sms are missing from the MegatronConfig TypedDict and from
example configs; add both to the MegatronConfig definition in
nemo_rl/models/policy/__init__.py as NotRequired entries (use the exact symbol
name MegatronConfig) with short docstrings: "Backend type for MoE flex
dispatcher (HybridEP)" for moe_flex_dispatcher_backend and "Number of SMs for
HybridEP" for moe_hybridep_num_sms, and then update at least one exemplar YAML
in examples/configs (e.g., a megatron MoE config) to include these keys with
sensible defaults (recommended defaults) so they are documented and visible to
users.
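As a sketch of the suggested change (the real `MegatronConfig` in `nemo_rl/models/policy/__init__.py` has many more keys; only the two new optional entries are shown, and the surrounding keys are elided):

```python
try:
    from typing import TypedDict, NotRequired  # Python >= 3.11
except ImportError:
    from typing_extensions import TypedDict, NotRequired


class MegatronConfig(TypedDict):
    # ...existing required/optional keys elided...

    moe_flex_dispatcher_backend: NotRequired[str]
    """Backend type for MoE flex dispatcher (HybridEP)."""

    moe_hybridep_num_sms: NotRequired[int]
    """Number of SMs for HybridEP."""
```

`NotRequired` keeps the keys optional at type-check time, so existing configs that omit them still validate.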
🧹 Nitpick comments (1)
pyproject.toml (1)
70-72: Branch ref instead of pinned commit reduces build reproducibility. All three dependency groups now point to `@hybrid-ep` (a moving branch) instead of a fixed commit hash. This means builds are not reproducible: a force-push or new commit on that branch silently changes what gets installed. Consider pinning to a specific commit on the `hybrid-ep` branch once it stabilizes.
9acbdab to 7cc2e65
- Add deep_ep aarch64 dependency (7febc6e2, hybrid-ep branch)
- Add HybridEP setup in megatron setup.py (IMEX env vars, NVLink domain config)
- Add Qwen3-30B-A3B 4n4g config with moe_flex_dispatcher_backend=hybridep
- Add EP=4, EP=8, sms16 config variants for ablation testing
- Add test scripts for EP variants
- Update Dockerfile for aarch64 deep_ep build support
- Add HybridEP settings to 235B and DeepSeek-V3 performance configs

Signed-off-by: sna <sna@nvidia.com>
7cc2e65 to 6c0cd7e
Resolve conflicts: keep aarch64 deep_ep split, adopt main's flashinfer/cutlass/emerging-optimizers updates. Signed-off-by: sna <sna@nvidia.com>
❌ Submodule Fast-Forward Check Failed

Check based on commit: 0349e83 (PR #1942)

Submodules that need attention:
- Megatron-Bridge: ❌ PR branch is BEHIND main branch
- Megatron-LM: ❌ PR branch is BEHIND main branch

Please ensure all submodule commits are fast-forwards of the main branch before merging.
Signed-off-by: sna <sna@nvidia.com>
❌ Submodule Fast-Forward Check Failed

Check based on commit: ec53cb6 (PR #1942)

Submodules that need attention:
- Megatron-Bridge: ❌ PR branch is BEHIND main branch
- Megatron-LM: ❌ PR branch is BEHIND main branch

Please ensure all submodule commits are fast-forwards of the main branch before merging.
Signed-off-by: sna <sna@nvidia.com>
Signed-off-by: sna <sna@nvidia.com>

# Conflicts:
#   docker/Dockerfile
#   pyproject.toml
#   uv.lock
/ok to test ed68bbc
Wrap bullet field names in double backticks so Napoleon/autodoc2 renders them as literals instead of implicit cross-references. Resolves Sphinx 'more than one target found for cross-reference logprobs' warning that fails docs build under --fail-on-warning (duplicate targets: GenerationOutputSpec.logprobs vs LogprobOutputSpec.logprobs). Signed-off-by: sna <sna@nvidia.com>
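The fix can be illustrated with a minimal docstring; the class name and field description below are hypothetical stand-ins for the real spec classes:

```python
class GenerationOutputSpec:
    """Hypothetical sketch of the backtick fix.

    Attributes:
        * ``logprobs``: wrapping the field name in double backticks makes
          Napoleon emit an inline literal, so Sphinx no longer tries to
          resolve it as a cross-reference that is ambiguous between
          GenerationOutputSpec.logprobs and LogprobOutputSpec.logprobs.
    """
```

With a bare `logprobs` in the bullet, autodoc2 would instead generate an implicit xref with two candidate targets, which trips `--fail-on-warning`.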
/ok to test 18f3318
Napoleon treats the first line after 'Returns:' as the return type description; plain identifiers there still get resolved as cross-refs. Wrap field names in literals in the tuple summary too, not just the bullet list, to suppress the ambiguous 'logprobs' xref warning. Signed-off-by: sna <sna@nvidia.com>
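Applied to a Returns section, the same literal-wrapping looks like this (the function and tuple contents are hypothetical):

```python
def generate():
    """Hypothetical sketch of the Returns-section fix.

    Returns:
        tuple of (``output_ids``, ``logprobs``): Napoleon treats this
        first line as the return-type description, so identifiers here
        must also be wrapped as literals, not just in the bullet list
        below it.
    """
    return None
```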
/ok to test 9bb1778
/ok to test a630e9e
Usage in config:

```
policy.megatron_cfg.moe_token_dispatcher_type=flex
policy.megatron_cfg.moe_flex_dispatcher_backend=hybridep
policy.megatron_cfg.moe_hybridep_num_sms=32
```
See: https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep