
feat: Add HybridEP support for MoE expert parallelism#1942

Open
seonjinn wants to merge 33 commits into main from sj/hybridep-support

Conversation

@seonjinn
Contributor

@seonjinn seonjinn commented Feb 13, 2026

  • Update DeepEP dependency to hybrid-ep branch for HybridEP support
    • automodel, vllm, mcore dependency groups updated
  • Add HybridEP configuration options in _apply_moe_config():
    • moe_flex_dispatcher_backend: Flex dispatcher backend (e.g., 'hybridep')
    • moe_hybridep_num_sms: Number of SMs for HybridEP operations

Usage in config:
policy.megatron_cfg.moe_token_dispatcher_type=flex
policy.megatron_cfg.moe_flex_dispatcher_backend=hybridep
policy.megatron_cfg.moe_hybridep_num_sms=32

See: https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep
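The conditional wiring described in the bullets above can be sketched as follows. This is an illustrative assumption of how `_apply_moe_config` might thread the new keys through, not the actual `nemo_rl/models/megatron/setup.py` code; only the helper's name and the two key names come from this PR, and the dict shapes are hypothetical.

```python
# Hedged sketch: optional HybridEP keys are only applied when present in
# the user's megatron_cfg, so existing configs that never mention
# HybridEP keep their current behavior unchanged.

def _apply_moe_config(megatron_cfg: dict, model_cfg: dict) -> None:
    """Copy optional MoE flex-dispatcher settings onto the model config."""
    for key in ("moe_flex_dispatcher_backend", "moe_hybridep_num_sms"):
        if megatron_cfg.get(key) is not None:
            model_cfg[key] = megatron_cfg[key]

# Example mirroring the CLI overrides from the PR description:
cfg = {
    "moe_token_dispatcher_type": "flex",
    "moe_flex_dispatcher_backend": "hybridep",
    "moe_hybridep_num_sms": 32,
}
model_cfg = {}
_apply_moe_config(cfg, model_cfg)
# model_cfg now carries only the two optional keys that were set.
```

The guard on `is not None` is the design point: omitting either key from a YAML config leaves the Megatron model config untouched.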

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • New Features

    • Added configuration support for additional Mixture of Experts (MoE) model parameters, including dispatcher backend and HybridEP settings.
  • Dependencies

    • Updated DeepEP dependency reference to use the hybrid-ep branch across multiple dependency groups.

@seonjinn seonjinn requested review from a team as code owners February 13, 2026 19:37
@coderabbitai
Contributor

coderabbitai Bot commented Feb 13, 2026

Walkthrough

The changes add two conditional configuration hooks to the MoE setup function for optional dispatcher and HybridEP parameters, and update the DeepEP dependency reference from a specific commit to the hybrid-ep branch across multiple dependency groups in the project configuration.

Changes

Cohort / File(s) Summary
MoE Configuration Hooks
nemo_rl/models/megatron/setup.py
Adds conditional assignments for moe_flex_dispatcher_backend and moe_hybridep_num_sms configuration parameters in the _apply_moe_config function.
Dependency Updates
pyproject.toml
Updates DeepEP git dependency reference from commit bfded34800dfec415b71503f8205181de90b2480 to the hybrid-ep branch across automodel, vllm, and mcore dependency groups with explanatory comments.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pyproject.toml (1)

324-327: ⚠️ Potential issue | 🟠 Major

Stale dependency-metadata version for deep_ep.

The dependency is pinned to the hybrid-ep branch (which dynamically generates its version from the current commit hash via git rev-parse --short HEAD), but the dependency-metadata version is statically set to v1.2.1+bfded34. This means the metadata version will become stale whenever the branch advances, potentially causing uv resolver failures.

Either:

  1. Update to pin to a specific commit hash instead of a branch, or
  2. Update the metadata version to match the current HEAD of hybrid-ep and regenerate it whenever the dependency updates
🤖 Fix all issues with AI agents
In `@nemo_rl/models/megatron/setup.py`:
- Around line 405-412: The new runtime keys moe_flex_dispatcher_backend and
moe_hybridep_num_sms are missing from the MegatronConfig TypedDict and from
example configs; add both to the MegatronConfig definition in
nemo_rl/models/policy/__init__.py as NotRequired entries (use the exact symbol
name MegatronConfig) with short docstrings: "Backend type for MoE flex
dispatcher (HybridEP)" for moe_flex_dispatcher_backend and "Number of SMs for
HybridEP" for moe_hybridep_num_sms, and then update at least one exemplar YAML
in examples/configs (e.g., a megatron MoE config) to include these keys with
sensible defaults (recommended defaults) so they are documented and visible to
users.
🧹 Nitpick comments (1)
pyproject.toml (1)

70-72: Branch ref instead of pinned commit reduces build reproducibility.

All three dependency groups now point to @hybrid-ep (a moving branch) instead of a fixed commit hash. This means builds are not reproducible — a force-push or new commit on that branch silently changes what gets installed. Consider pinning to a specific commit on the hybrid-ep branch once it stabilizes.
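The reviewer's two options can be illustrated with a TOML fragment. This is a hedged sketch in uv's `[tool.uv.sources]` style, not a copy of the repository's actual `pyproject.toml`; the table name and `<commit-sha>` placeholder are assumptions, and only the `hybrid-ep` branch name comes from the PR.

```toml
[tool.uv.sources]
# Option 1 (current PR): track a moving branch -- not reproducible,
# since new commits on hybrid-ep silently change what gets installed.
# deep_ep = { git = "https://github.com/deepseek-ai/DeepEP", branch = "hybrid-ep" }

# Option 2 (reviewer suggestion): pin a specific commit on that branch,
# and bump the dependency-metadata version whenever this rev is updated.
deep_ep = { git = "https://github.com/deepseek-ai/DeepEP", rev = "<commit-sha>" }
```

Pinning a `rev` also keeps the statically declared dependency-metadata version from drifting out of sync with the installed source.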

Comment thread nemo_rl/models/megatron/setup.py
@seonjinn seonjinn requested a review from guyueh1 February 13, 2026 19:42
@seonjinn seonjinn requested review from a team as code owners February 13, 2026 23:26
Comment thread pyproject.toml Outdated
Comment thread pyproject.toml Outdated
Comment thread pyproject.toml Outdated
@seonjinn seonjinn added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Mar 4, 2026
@seonjinn seonjinn requested a review from terrykong March 4, 2026 01:59
@guyueh1 guyueh1 added the Performance Related to improving performance label Mar 5, 2026
@seonjinn seonjinn changed the title feat: Add HybridEP support for MoE expert parallelism feat: Add HybridEP/Partial CudaGraph support for MoE expert parallelism Mar 5, 2026
Comment thread ray.sub Outdated
Collaborator

@terrykong terrykong left a comment


mostly lgtm @seonjinn, but the main thing i'm concerned about is the introduction of conda; we can continue discussing on the thread i started

@anwithk anwithk added this to the v0.6 Release milestone Mar 20, 2026
@copy-pr-bot

copy-pr-bot Bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread ray.sub Outdated
Comment thread ray.sub Outdated
@guyueh1 guyueh1 changed the title feat: Add HybridEP/Partial CudaGraph support for MoE expert parallelism feat: Add HybridEP support for MoE expert parallelism Apr 8, 2026
@seonjinn seonjinn requested a review from a team as a code owner April 9, 2026 20:43
@seonjinn seonjinn force-pushed the sj/hybridep-support branch 2 times, most recently from 9acbdab to 7cc2e65 on April 13, 2026 01:23
- Add deep_ep aarch64 dependency (7febc6e2, hybrid-ep branch)
- Add HybridEP setup in megatron setup.py (IMEX env vars, NVLink domain config)
- Add Qwen3-30B-A3B 4n4g config with moe_flex_dispatcher_backend=hybridep
- Add EP=4, EP=8, sms16 config variants for ablation testing
- Add test scripts for EP variants
- Update Dockerfile for aarch64 deep_ep build support
- Add HybridEP settings to 235B and DeepSeek-V3 performance configs

Signed-off-by: sna <sna@nvidia.com>
@seonjinn seonjinn force-pushed the sj/hybridep-support branch from 7cc2e65 to 6c0cd7e on April 13, 2026 01:26
Resolve conflicts: keep aarch64 deep_ep split, adopt main's
flashinfer/cutlass/emerging-optimizers updates.

Signed-off-by: sna <sna@nvidia.com>
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: 0349e83 (PR #1942 from sj/hybridep-support)

❌ Submodules that need attention:

Megatron-Bridge: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/7110a964272a5c74dcb6b680b691087e190c220c/
CURRENT (PR #1942 from sj/hybridep-support): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/a2bb70b91b827bd6b085a77442c7cf60cfdb59fe/

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/17a67b9a97fb11a75933fd7f76ad76e1ac98a53d/
CURRENT (PR #1942 from sj/hybridep-support): https://github.com/NVIDIA/Megatron-LM/commits/9e2810417315a7ee93b41d4e234454abd3c16af5/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: sna <sna@nvidia.com>
@github-actions

❌ Submodule Fast-Forward Check Failed

Check based on commit: ec53cb6 (PR #1942 from sj/hybridep-support)

❌ Submodules that need attention:

Megatron-Bridge: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/7110a964272a5c74dcb6b680b691087e190c220c/
CURRENT (PR #1942 from sj/hybridep-support): https://github.com/NVIDIA-NeMo/Megatron-Bridge/commits/a2bb70b91b827bd6b085a77442c7cf60cfdb59fe/

Megatron-LM: ❌ PR branch is BEHIND main branch
TARGET (main branch): https://github.com/NVIDIA/Megatron-LM/commits/17a67b9a97fb11a75933fd7f76ad76e1ac98a53d/
CURRENT (PR #1942 from sj/hybridep-support): https://github.com/NVIDIA/Megatron-LM/commits/9e2810417315a7ee93b41d4e234454abd3c16af5/

Please ensure all submodule commits are fast-forwards of the main branch before merging.

Signed-off-by: sna <sna@nvidia.com>
Signed-off-by: sna <sna@nvidia.com>

# Conflicts:
#	docker/Dockerfile
#	pyproject.toml
#	uv.lock
@seonjinn
Contributor Author

/ok to test ed68bbc

Wrap bullet field names in double backticks so Napoleon/autodoc2 renders
them as literals instead of implicit cross-references. Resolves Sphinx
'more than one target found for cross-reference logprobs' warning that
fails docs build under --fail-on-warning (duplicate targets:
GenerationOutputSpec.logprobs vs LogprobOutputSpec.logprobs).

Signed-off-by: sna <sna@nvidia.com>
@seonjinn seonjinn requested a review from a team as a code owner May 14, 2026 18:36
@seonjinn
Contributor Author

/ok to test 18f3318

Napoleon treats the first line after 'Returns:' as the return type
description; plain identifiers there still get resolved as cross-refs.
Wrap field names in literals in the tuple summary too, not just the
bullet list, to suppress the ambiguous 'logprobs' xref warning.

Signed-off-by: sna <sna@nvidia.com>
@seonjinn
Contributor Author

/ok to test 9bb1778

@seonjinn
Contributor Author

/ok to test a630e9e


Labels

  • CI:L2: Run doctests, unit tests, functional tests, and convergence tests
  • deepseek: Related to deepseek 671b
  • Performance: Related to improving performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[mcore] HybridEP support for GB200 DeepSeek GRPO and SFT

4 participants