feat: Add HybridEP support for MoE expert parallelism (#1942)
Walkthrough: The changes add two conditional configuration hooks to the MoE setup function for optional dispatcher and HybridEP parameters, and update the DeepEP dependency reference from a specific commit to the hybrid-ep branch across multiple dependency groups in the project configuration.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pyproject.toml (1)
324-327: ⚠️ Potential issue | 🟠 Major

Stale `dependency-metadata` version for `deep_ep`. The dependency is pinned to the `hybrid-ep` branch (which dynamically generates its version from the current commit hash via `git rev-parse --short HEAD`), but the `dependency-metadata` version is statically set to `v1.2.1+bfded34`. This means the metadata version will become stale whenever the branch advances, potentially causing uv resolver failures. Either:

- Update to pin to a specific commit hash instead of a branch, or
- Update the metadata version to match the current HEAD of `hybrid-ep` and regenerate it whenever the dependency updates.
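For illustration, a hedged sketch of the first option in `pyproject.toml`; the commit hash and version string here are assumptions taken from this PR's own commit notes (7febc6e2 on hybrid-ep), not verified against the branch:

```toml
[tool.uv.sources]
# Assumption: pin deep_ep to the specific hybrid-ep commit this PR builds on,
# instead of the moving branch ref
deep_ep = { git = "https://github.com/deepseek-ai/DeepEP.git", rev = "7febc6e2" }

[[tool.uv.dependency-metadata]]
name = "deep_ep"
# Keep in sync with the pinned commit above (the branch derives its version
# from `git rev-parse --short HEAD`)
version = "1.2.1+7febc6e"
```

Pinning the `rev` and regenerating the static metadata together keeps the resolver's view consistent with what actually gets cloned.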
🤖 Fix all issues with AI agents
In `@nemo_rl/models/megatron/setup.py`:
- Around line 405-412: The new runtime keys moe_flex_dispatcher_backend and
moe_hybridep_num_sms are missing from the MegatronConfig TypedDict and from
example configs; add both to the MegatronConfig definition in
nemo_rl/models/policy/__init__.py as NotRequired entries (use the exact symbol
name MegatronConfig) with short docstrings: "Backend type for MoE flex
dispatcher (HybridEP)" for moe_flex_dispatcher_backend and "Number of SMs for
HybridEP" for moe_hybridep_num_sms, and then update at least one exemplar YAML
in examples/configs (e.g., a megatron MoE config) to include these keys with
sensible defaults (recommended defaults) so they are documented and visible to
users.
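As a sketch of the suggested change (the real `MegatronConfig` in `nemo_rl/models/policy/__init__.py` has many more keys; only the two new optional entries are shown, and the surrounding keys are elided):

```python
try:
    from typing import TypedDict, NotRequired  # Python >= 3.11
except ImportError:
    from typing_extensions import TypedDict, NotRequired


class MegatronConfig(TypedDict):
    # ...existing required/optional keys elided...

    moe_flex_dispatcher_backend: NotRequired[str]
    """Backend type for MoE flex dispatcher (HybridEP)."""

    moe_hybridep_num_sms: NotRequired[int]
    """Number of SMs for HybridEP."""
```

`NotRequired` keeps the keys optional at type-check time, so existing configs that omit them still validate.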
🧹 Nitpick comments (1)
pyproject.toml (1)
70-72: Branch ref instead of pinned commit reduces build reproducibility. All three dependency groups now point to `@hybrid-ep` (a moving branch) instead of a fixed commit hash. This means builds are not reproducible: a force-push or new commit on that branch silently changes what gets installed. Consider pinning to a specific commit on the `hybrid-ep` branch once it stabilizes.
9acbdab to 7cc2e65
- Add deep_ep aarch64 dependency (7febc6e2, hybrid-ep branch)
- Add HybridEP setup in megatron setup.py (IMEX env vars, NVLink domain config)
- Add Qwen3-30B-A3B 4n4g config with moe_flex_dispatcher_backend=hybridep
- Add EP=4, EP=8, sms16 config variants for ablation testing
- Add test scripts for EP variants
- Update Dockerfile for aarch64 deep_ep build support
- Add HybridEP settings to 235B and DeepSeek-V3 performance configs

Signed-off-by: sna <sna@nvidia.com>
7cc2e65 to 6c0cd7e
Resolve conflicts: keep aarch64 deep_ep split, adopt main's flashinfer/cutlass/emerging-optimizers updates. Signed-off-by: sna <sna@nvidia.com>
❌ Submodule Fast-Forward Check Failed

Check based on commit: 0349e83 (PR #1942)

Submodules that need attention:
- Megatron-Bridge: ❌ PR branch is BEHIND main branch
- Megatron-LM: ❌ PR branch is BEHIND main branch

Please ensure all submodule commits are fast-forwards of the main branch before merging.
Signed-off-by: sna <sna@nvidia.com>
❌ Submodule Fast-Forward Check Failed

Check based on commit: ec53cb6 (PR #1942)

Submodules that need attention:
- Megatron-Bridge: ❌ PR branch is BEHIND main branch
- Megatron-LM: ❌ PR branch is BEHIND main branch

Please ensure all submodule commits are fast-forwards of the main branch before merging.
Signed-off-by: sna <sna@nvidia.com>
Signed-off-by: sna <sna@nvidia.com>

# Conflicts:
#   docker/Dockerfile
#   pyproject.toml
#   uv.lock
/ok to test ed68bbc
Wrap bullet field names in double backticks so Napoleon/autodoc2 renders them as literals instead of implicit cross-references. Resolves Sphinx 'more than one target found for cross-reference logprobs' warning that fails docs build under --fail-on-warning (duplicate targets: GenerationOutputSpec.logprobs vs LogprobOutputSpec.logprobs). Signed-off-by: sna <sna@nvidia.com>
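The fix can be illustrated with a minimal docstring; the class name and field description below are hypothetical stand-ins for the real spec classes:

```python
class GenerationOutputSpec:
    """Hypothetical sketch of the backtick fix.

    Attributes:
        * ``logprobs``: wrapping the field name in double backticks makes
          Napoleon emit an inline literal, so Sphinx no longer tries to
          resolve it as a cross-reference that is ambiguous between
          GenerationOutputSpec.logprobs and LogprobOutputSpec.logprobs.
    """
```

With a bare `logprobs` in the bullet, autodoc2 would instead generate an implicit xref with two candidate targets, which trips `--fail-on-warning`.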
/ok to test 18f3318
Napoleon treats the first line after 'Returns:' as the return type description; plain identifiers there still get resolved as cross-refs. Wrap field names in literals in the tuple summary too, not just the bullet list, to suppress the ambiguous 'logprobs' xref warning. Signed-off-by: sna <sna@nvidia.com>
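Applied to a Returns section, the same literal-wrapping looks like this (the function and tuple contents are hypothetical):

```python
def generate():
    """Hypothetical sketch of the Returns-section fix.

    Returns:
        tuple of (``output_ids``, ``logprobs``): Napoleon treats this
        first line as the return-type description, so identifiers here
        must also be wrapped as literals, not just in the bullet list
        below it.
    """
    return None
```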
/ok to test 9bb1778
/ok to test a630e9e
Usage in config:

```
policy.megatron_cfg.moe_token_dispatcher_type=flex
policy.megatron_cfg.moe_flex_dispatcher_backend=hybridep
policy.megatron_cfg.moe_hybridep_num_sms=32
```
See: https://github.com/deepseek-ai/DeepEP/tree/hybrid-ep