[https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support#15597
[https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support#15597nv-lschneider wants to merge 4 commits into
Conversation
Squashes GitLab MR 10228 into one commit. Co-authored-by: Bo Li <lbo@nvidia.com> Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
|
This PR is blocked from running CI and merging until the NCCL version is updated to 2.30.4 with PR: #15087 Please take the opportunity to review and bring up concerns, so we can merge as soon as dependencies are met. |
📝 WalkthroughWalkthroughAdds ChangesNcclEP MoE Expert Parallelism Communication Backend
Sequence Diagram(s)sequenceDiagram
participant MoEPipeline
participant CommunicationFactory
participant NcclEP
participant NcclEpContext
participant nccl_ep_lib as nccl.ep
MoEPipeline->>CommunicationFactory: create_strategy(act_dtype=bfloat16, ...)
CommunicationFactory->>NcclEP: __init__(mapping, num_slots, hidden_size, ...)
NcclEP->>nccl_ep_lib: is_nccl_ep_installed() check
NcclEP->>NcclEpContext: get_nccl_ep_context(mapping, experts, tokens, hidden, top_k)
NcclEpContext->>nccl_ep_lib: MPI split → Communicator → Group.create
NcclEpContext->>NcclEpContext: allocate persistent buffers (output_tokens, recv_topk_idx, recv_topk_weights, ...)
NcclEpContext-->>NcclEP: context
NcclEP-->>CommunicationFactory: NcclEP instance
CommunicationFactory-->>MoEPipeline: NcclEP strategy
MoEPipeline->>NcclEP: dispatch(hidden_states, token_slots, token_weights, all_rank_num_tokens)
NcclEP->>NcclEP: _setup_handle (create or rebind topk_nd)
NcclEP->>nccl_ep_lib: handle.dispatch(DispatchInputs, DispatchOutputs, stream)
nccl_ep_lib-->>NcclEP: dispatched tokens in output_tokens_buf
NcclEP-->>MoEPipeline: (dispatched_hidden, scales, recv_topk_idx, recv_weights)
MoEPipeline->>NcclEP: combine(final_hidden_states)
NcclEP->>nccl_ep_lib: handle.combine(inputs, output_buf, stream)
nccl_ep_lib-->>NcclEP: combined [num_tokens, hidden]
NcclEP-->>MoEPipeline: combined hidden states
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@requirements.txt`:
- Around line 29-33: The NCCL wheel constraint is too low for the new NCCL EP
path, so standard installs keep the backend disabled. Update the NCCL dependency
floor in requirements.txt by raising the nvidia-nccl-cu13 version range to a
release at or above the minimum enforced in nccl_ep_utils.py, and keep the
nccl4py dependency aligned with that backend requirement. Use the existing
checks in tensorrt_llm/_torch/modules/fused_moe/nccl_ep_utils.py as the source
of truth when choosing the new minimum.
In
`@tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py`:
- Around line 186-202: Validate the NCCL EP preconditions before instantiating
NcclEP in the communication selection logic. In communication_factory.py, update
the auto-selection path and the forced "NCCL_EP" branch to explicitly check that
act_dtype is torch.bfloat16 and that moe_max_num_tokens is provided before
constructing NcclEP; if the prerequisites are not met, skip to the fallback
strategy or raise a clear validation error. Use the existing NcclEP selection
points and the strategy dispatch in the factory to keep the behavior consistent
across both paths.
In `@tensorrt_llm/_torch/modules/fused_moe/communication/nccl_ep.py`:
- Around line 79-80: Replace the assert-based guards in the NCCL EP path with
explicit runtime exceptions so they cannot be stripped under optimization.
Update the checks in __init__(), dispatch(), and combine() in nccl_ep.py to
raise ValueError or RuntimeError for invalid moe_max_num_tokens,
all_rank_num_tokens, top_k, and the tensor-size validation in combine(), keeping
the existing validation logic but using exceptions instead of assert.
In `@tensorrt_llm/_torch/modules/fused_moe/nccl_ep_utils.py`:
- Around line 273-276: The FP8 hidden-size check in `nccl_ep_utils.py` currently
uses an `assert`, which can be removed under optimized Python and allow invalid
shapes to reach `torch.empty(...)`. Replace the `assert` inside the `use_fp8`
branch with an explicit `ValueError` in the same validation path so
`get_dispatch_layout` (or the surrounding FP8 dispatch logic) fails fast before
buffer allocation.
- Around line 74-78: Narrow the broad NCCL EP error handling in the probing and
teardown paths so unrelated bugs are not swallowed. In nccl_ep_utils, replace
each except Exception block with the specific nccl4py/MPI/runtime exception
types actually raised by the corresponding calls, keeping the existing fallback
behavior only for those expected failures. Apply the same tightening in
communication/nccl_ep.py around the NCCL EP setup/cleanup logic, and preserve
the current logging/return flow in the affected helper functions.
In `@tests/microbenchmarks/bench_moe_comm.py`:
- Around line 837-838: The benchmark workload in bench_moe_comm is capped by the
default local_num_tokens, which can produce fewer routes than ep_size and make
--verify fail even when the setup is valid. Update the workload sizing logic in
the benchmark entry points around the functions that build the MoE communication
patterns so the sentinel work scales up to cover every receiver when verify is
enabled, instead of relying on a fixed 16-token default. Keep the change
consistent across the affected call sites in the benchmark helpers so
verification and timing use the same scaled token count.
- Around line 895-897: Apply the Ruff-related cleanup in bench_moe_comm by
updating the affected comprehensions and list construction: add strict=True to
each zip() call referenced in the benchmark helpers, including the histogram
comprehension in the unique/counts mapping, and replace the [-1] +
list(range(ep_size)) pattern with [-1, *range(ep_size)] in the relevant routing
setup. Use the nearby symbols and benchmark functions in bench_moe_comm to
locate the exact zip() calls and the ep_size list expression.
In `@tests/unittest/_torch/modules/moe/test_moe_comm.py`:
- Around line 1306-1308: The representative-row filtering in the DeepEPLL
reference is only checking recv_slots[i, 0], which can drop tokens routed to
this rank when their first slot is outside the local range. Update the logic in
the reference path around the recv_slots/slot_start/slot_end check to inspect
all top-k slots for each row and keep the row if any slot falls within the
rank’s slot range, so ref is not undercounted.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5b9d7635-e151-4cbc-992b-652f735f3268
📒 Files selected for processing (8)
requirements.txttensorrt_llm/_torch/modules/fused_moe/communication/__init__.pytensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.pytensorrt_llm/_torch/modules/fused_moe/communication/nccl_ep.pytensorrt_llm/_torch/modules/fused_moe/interface.pytensorrt_llm/_torch/modules/fused_moe/nccl_ep_utils.pytests/microbenchmarks/bench_moe_comm.pytests/unittest/_torch/modules/moe/test_moe_comm.py
Tighten NCCL-EP validation, teardown handling, benchmark verification sizing, and DeepEPLL reference checks based on CodeRabbit review. Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
Summary by CodeRabbit
New Features
Bug Fixes
Description
This PR adds NCCL-EP v0.1 as a supported MoE communication option through
nccl4py.The scope is limited to enabling and validating NCCL-EP v0.1 support. Updates for NCCL-EP v0.2 are intentionally left out and will be handled in separate follow-up PRs.
Test Coverage
Relevant coverage:
tests/unittest/_torch/modules/moe/test_moe_comm.pytest_moe_commtest_moe_comm_boundarytest_moe_comm_postquanttest_moe_comm_non_divisible_epThe MoE communication validation path also supports NCCL-EP selection through
tests/microbenchmarks/bench_moe_comm.py --verify.GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.