
[https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support#15597

Draft
nv-lschneider wants to merge 4 commits into NVIDIA:main from nv-lschneider:tekit-mr-10228-squashed

Conversation

@nv-lschneider nv-lschneider commented Jun 24, 2026

Collaborator

Summary by CodeRabbit

  • New Features

    • Added a new NCCL-based expert-parallel communication option for MoE workloads.
    • Expanded automatic backend selection to use the new option when supported, with fallback to existing methods.
    • Added support for the new backend in benchmark and test coverage.
  • Bug Fixes

    • Improved handling of token dispatch and combine flows for this communication path.
    • Added a verification mode in the benchmark to help catch backend routing issues before timing runs.

Description

This PR adds NCCL-EP v0.1 as a supported MoE communication option through nccl4py.

The scope is limited to enabling and validating NCCL-EP v0.1 support. Updates for NCCL-EP v0.2 are intentionally left out and will be handled in separate follow-up PRs.

Test Coverage

Relevant coverage:

  • tests/unittest/_torch/modules/moe/test_moe_comm.py
    • test_moe_comm
    • test_moe_comm_boundary
    • test_moe_comm_postquant
    • test_moe_comm_non_divisible_ep

The MoE communication validation path also supports NCCL-EP selection through tests/microbenchmarks/bench_moe_comm.py --verify.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Squashes GitLab MR 10228 into one commit.

Co-authored-by: Bo Li <lbo@nvidia.com>
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
@nv-lschneider nv-lschneider requested review from a team as code owners June 24, 2026 16:02
@nv-lschneider nv-lschneider requested a review from xxi-nv June 24, 2026 16:02
@nv-lschneider nv-lschneider marked this pull request as draft June 24, 2026 16:06
@nv-lschneider

Collaborator Author

This PR is blocked from running CI and from merging until PR #15087 updates the NCCL version to 2.30.4.

Please take the opportunity to review and raise any concerns now, so that we can merge as soon as the dependencies are met.

@nv-lschneider nv-lschneider changed the title [NVBUG-6020038][feat] Use nccl4py for NCCL EP integration [NVBUG-6020038][feat] Add NCCL-EP v0.1 MoE communication support Jun 24, 2026
@nv-lschneider nv-lschneider changed the title [NVBUG-6020038][feat] Add NCCL-EP v0.1 MoE communication support [https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support Jun 24, 2026
@nv-lschneider nv-lschneider changed the title [https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support [https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support Jun 24, 2026

coderabbitai Bot commented Jun 24, 2026

Contributor

📝 Walkthrough

Adds NcclEP, a new NCCL-backed expert-parallelism dispatch/combine communication strategy for fused MoE. The implementation spans a persistent context utility module (nccl_ep_utils.py), a NcclEP class implementing rank-major dispatch/combine, factory auto-selection and forced-method support, a new AlltoallMethodType.NcclEP = 5 enum variant, the nccl4py>=0.3 dependency, and corresponding unit test and microbenchmark additions.

Changes

NcclEP MoE Expert Parallelism Communication Backend

| Layer | File(s) | Summary |
| --- | --- | --- |
| Interface enum, dependency, and public API exports | tensorrt_llm/_torch/modules/fused_moe/interface.py, requirements.txt, tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py | Adds AlltoallMethodType.NcclEP = 5, pins nccl4py>=0.3, and exports NcclEP from the communication package. |
| NcclEpContext persistent context and singleton cache | tensorrt_llm/_torch/modules/fused_moe/nccl_ep_utils.py | Implements availability/version checks, NcclEpContext with MPI sub-communicator, NCCL communicator, EP group, and all persistent receive buffers (including optional VMM/zero-copy path), plus singleton cache (get_nccl_ep_context) and process teardown (destroy_all_nccl_ep_contexts). |
| NcclEP dispatch/combine class | tensorrt_llm/_torch/modules/fused_moe/communication/nccl_ep.py | Implements NcclEP(Communication): constructor, _setup_handle (persistent NCCL EP handle with CUDA-graph-safe update), dispatch (validation, sentinel fill, buffer assembly, rank-major output with local→global expert-id translation), combine (3D reshape, handle.combine invocation), and destroy. |
| CommunicationFactory auto-selection and forced-method support | tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py | Adds NcclEP to the documented priority order, inserts a bfloat16 auto-selection step with RuntimeError fallthrough, and handles "NCCL_EP" in _create_forced_method(). |
| Unit tests and microbenchmark coverage | tests/unittest/_torch/modules/moe/test_moe_comm.py, tests/microbenchmarks/bench_moe_comm.py | Unit tests add COMM_NCCL_EP to the test matrix with platform/feasibility gating and rework the COMM_DEEP_EP_LL combine reference. The microbenchmark adds NCCL_EP to defaults, a --verify flag, a sentinel-based _verify_dispatch_sentinel helper, and all-gather verification logic. |
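
The singleton cache and process teardown described for nccl_ep_utils.py can be pictured with a minimal sketch. Everything here is hypothetical: `DemoContext`, `get_context`, `destroy_all_contexts`, and the choice of cache key are illustrative stand-ins, not the PR's `NcclEpContext`, `get_nccl_ep_context`, or `destroy_all_nccl_ep_contexts`.

```python
# Hypothetical sketch of a keyed singleton cache with a teardown hook.
# The key fields are an assumption; the real cache key is not shown in this thread.
_CONTEXTS = {}


class DemoContext:
    """Stand-in for an expensive, persistent communication context."""

    def __init__(self, key):
        self.key = key
        self.destroyed = False

    def destroy(self):
        # A real context would release communicators and buffers here.
        self.destroyed = True


def get_context(ep_size, num_experts, max_tokens, hidden_size, top_k):
    """Return the cached context for this configuration, creating it once."""
    key = (ep_size, num_experts, max_tokens, hidden_size, top_k)
    ctx = _CONTEXTS.get(key)
    if ctx is None:
        ctx = DemoContext(key)
        _CONTEXTS[key] = ctx
    return ctx


def destroy_all_contexts():
    """Destroy every cached context and empty the cache."""
    for ctx in _CONTEXTS.values():
        ctx.destroy()
    _CONTEXTS.clear()
```

Keying on the full configuration means two layers with identical shapes share one context, while a different token budget or hidden size gets its own.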

Sequence Diagram(s)

sequenceDiagram
  participant MoEPipeline
  participant CommunicationFactory
  participant NcclEP
  participant NcclEpContext
  participant nccl_ep_lib as nccl.ep

  MoEPipeline->>CommunicationFactory: create_strategy(act_dtype=bfloat16, ...)
  CommunicationFactory->>NcclEP: __init__(mapping, num_slots, hidden_size, ...)
  NcclEP->>nccl_ep_lib: is_nccl_ep_installed() check
  NcclEP->>NcclEpContext: get_nccl_ep_context(mapping, experts, tokens, hidden, top_k)
  NcclEpContext->>nccl_ep_lib: MPI split → Communicator → Group.create
  NcclEpContext->>NcclEpContext: allocate persistent buffers (output_tokens, recv_topk_idx, recv_topk_weights, ...)
  NcclEpContext-->>NcclEP: context
  NcclEP-->>CommunicationFactory: NcclEP instance
  CommunicationFactory-->>MoEPipeline: NcclEP strategy

  MoEPipeline->>NcclEP: dispatch(hidden_states, token_slots, token_weights, all_rank_num_tokens)
  NcclEP->>NcclEP: _setup_handle (create or rebind topk_nd)
  NcclEP->>nccl_ep_lib: handle.dispatch(DispatchInputs, DispatchOutputs, stream)
  nccl_ep_lib-->>NcclEP: dispatched tokens in output_tokens_buf
  NcclEP-->>MoEPipeline: (dispatched_hidden, scales, recv_topk_idx, recv_weights)

  MoEPipeline->>NcclEP: combine(final_hidden_states)
  NcclEP->>nccl_ep_lib: handle.combine(inputs, output_buf, stream)
  nccl_ep_lib-->>NcclEP: combined [num_tokens, hidden]
  NcclEP-->>MoEPipeline: combined hidden states

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • xxi-nv
  • hyukn
  • yuxianq
  • longcheng-nv
  • pengbowang-nv

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title matches the main change and follows the requested ticket/type/summary format. |
| Description check | ✅ Passed | The PR description includes the required Description and Test Coverage sections and the checklist confirmation, with only the optional title left out. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@requirements.txt`:
- Around line 29-33: The NCCL wheel constraint is too low for the new NCCL EP
path, so standard installs keep the backend disabled. Update the NCCL dependency
floor in requirements.txt by raising the nvidia-nccl-cu13 version range to a
release at or above the minimum enforced in nccl_ep_utils.py, and keep the
nccl4py dependency aligned with that backend requirement. Use the existing
checks in tensorrt_llm/_torch/modules/fused_moe/nccl_ep_utils.py as the source
of truth when choosing the new minimum.
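
A minimal sketch of the version-floor comparison this comment relies on. The helper names are hypothetical, and 2.30.4 is used only because the author's comment in this thread names it as the NCCL version the PR is waiting on; whether it equals the minimum enforced in nccl_ep_utils.py is not shown here.

```python
def parse_version(text):
    """Parse a dotted numeric version such as '2.30.4' into a tuple of ints.

    Assumption: purely numeric components; suffixes like 'rc1' are not handled.
    """
    return tuple(int(part) for part in text.split("."))


def meets_floor(installed, floor):
    """Return True when the installed version is at or above the floor."""
    return parse_version(installed) >= parse_version(floor)
```

Comparing integer tuples avoids the string-comparison trap where "2.9.0" would sort above "2.30.4".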

In
`@tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py`:
- Around line 186-202: Validate the NCCL EP preconditions before instantiating
NcclEP in the communication selection logic. In communication_factory.py, update
the auto-selection path and the forced "NCCL_EP" branch to explicitly check that
act_dtype is torch.bfloat16 and that moe_max_num_tokens is provided before
constructing NcclEP; if the prerequisites are not met, skip to the fallback
strategy or raise a clear validation error. Use the existing NcclEP selection
points and the strategy dispatch in the factory to keep the behavior consistent
across both paths.
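
A minimal sketch of the suggested precondition check, under stated assumptions: dtypes are plain strings rather than torch dtypes, `FALLBACK` is a placeholder for whichever strategy the factory would try next, and `select_strategy` is a hypothetical function that ignores the rest of the factory's priority order.

```python
from typing import Optional

NCCL_EP = "NCCL_EP"    # forced-method string named in the review comment
FALLBACK = "FALLBACK"  # placeholder for the next strategy in the factory


def nccl_ep_preconditions_met(act_dtype: str, moe_max_num_tokens: Optional[int]) -> bool:
    """Both conditions named in the review comment must hold."""
    return act_dtype == "bfloat16" and moe_max_num_tokens is not None


def select_strategy(act_dtype: str,
                    moe_max_num_tokens: Optional[int],
                    forced_method: Optional[str] = None) -> str:
    """Auto-selection falls back quietly; a forced method fails loudly."""
    ok = nccl_ep_preconditions_met(act_dtype, moe_max_num_tokens)
    if forced_method == NCCL_EP:
        if not ok:
            raise ValueError(
                "NCCL_EP requires act_dtype=bfloat16 and a moe_max_num_tokens value; "
                f"got act_dtype={act_dtype!r}, moe_max_num_tokens={moe_max_num_tokens!r}")
        return NCCL_EP
    return NCCL_EP if ok else FALLBACK
```

Sharing one predicate between the two paths keeps the auto-selected and forced behavior from drifting apart.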

In `@tensorrt_llm/_torch/modules/fused_moe/communication/nccl_ep.py`:
- Around line 79-80: Replace the assert-based guards in the NCCL EP path with
explicit runtime exceptions so they cannot be stripped under optimization.
Update the checks in __init__(), dispatch(), and combine() in nccl_ep.py to
raise ValueError or RuntimeError for invalid moe_max_num_tokens,
all_rank_num_tokens, top_k, and the tensor-size validation in combine(), keeping
the existing validation logic but using exceptions instead of assert.
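
A minimal sketch of the assert-to-exception change. `validate_dispatch_args` is a hypothetical helper, and the specific conditions are assumptions chosen for illustration; the PR's actual checks are not shown in this thread.

```python
from typing import Optional, Sequence


def validate_dispatch_args(moe_max_num_tokens: Optional[int],
                           all_rank_num_tokens: Optional[Sequence[int]],
                           top_k: int) -> None:
    """Explicit checks that still run under `python -O`, unlike `assert`."""
    if moe_max_num_tokens is None or moe_max_num_tokens <= 0:
        raise ValueError(
            f"moe_max_num_tokens must be a positive int, got {moe_max_num_tokens!r}")
    if all_rank_num_tokens is None:
        raise ValueError("all_rank_num_tokens is required")
    # Assumed rule: no rank may exceed the configured token budget.
    if max(all_rank_num_tokens, default=0) > moe_max_num_tokens:
        raise ValueError(
            f"a rank exceeds moe_max_num_tokens={moe_max_num_tokens}: "
            f"{list(all_rank_num_tokens)}")
    if top_k <= 0:
        raise ValueError(f"top_k must be positive, got {top_k}")
```

Running Python with `-O` strips `assert` statements, so any guard that protects buffer sizes or indexing should raise explicitly.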

In `@tensorrt_llm/_torch/modules/fused_moe/nccl_ep_utils.py`:
- Around line 273-276: The FP8 hidden-size check in `nccl_ep_utils.py` currently
uses an `assert`, which can be removed under optimized Python and allow invalid
shapes to reach `torch.empty(...)`. Replace the `assert` inside the `use_fp8`
branch with an explicit `ValueError` in the same validation path so
`get_dispatch_layout` (or the surrounding FP8 dispatch logic) fails fast before
buffer allocation.
- Around line 74-78: Narrow the broad NCCL EP error handling in the probing and
teardown paths so unrelated bugs are not swallowed. In nccl_ep_utils, replace
each except Exception block with the specific nccl4py/MPI/runtime exception
types actually raised by the corresponding calls, keeping the existing fallback
behavior only for those expected failures. Apply the same tightening in
communication/nccl_ep.py around the NCCL EP setup/cleanup logic, and preserve
the current logging/return flow in the affected helper functions.
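
A minimal sketch of narrowed error handling in an availability probe. `is_module_available` is a hypothetical helper, and `ImportError` is an assumption that fits an import probe only; the exception types actually raised by nccl4py or MPI calls are not shown in this thread.

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def is_module_available(module_name: str) -> bool:
    """Return False only for the expected failure: the module cannot be imported.

    Any other exception (for example a bug inside the module's import-time
    code) propagates instead of being silently swallowed.
    """
    try:
        importlib.import_module(module_name)
    except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
        logger.debug("%s is unavailable: %s", module_name, exc)
        return False
    return True
```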

In `@tests/microbenchmarks/bench_moe_comm.py`:
- Around line 837-838: The benchmark workload in bench_moe_comm is capped by the
default local_num_tokens, which can produce fewer routes than ep_size and make
--verify fail even when the setup is valid. Update the workload sizing logic in
the benchmark entry points around the functions that build the MoE communication
patterns so the sentinel work scales up to cover every receiver when verify is
enabled, instead of relying on a fixed 16-token default. Keep the change
consistent across the affected call sites in the benchmark helpers so
verification and timing use the same scaled token count.
- Around line 895-897: Apply the Ruff-related cleanup in bench_moe_comm by
updating the affected comprehensions and list construction: add strict=True to
each zip() call referenced in the benchmark helpers, including the histogram
comprehension in the unique/counts mapping, and replace the [-1] +
list(range(ep_size)) pattern with [-1, *range(ep_size)] in the relevant routing
setup. Use the nearby symbols and benchmark functions in bench_moe_comm to
locate the exact zip() calls and the ep_size list expression.

In `@tests/unittest/_torch/modules/moe/test_moe_comm.py`:
- Around line 1306-1308: The representative-row filtering in the DeepEPLL
reference is only checking recv_slots[i, 0], which can drop tokens routed to
this rank when their first slot is outside the local range. Update the logic in
the reference path around the recv_slots/slot_start/slot_end check to inspect
all top-k slots for each row and keep the row if any slot falls within the
rank’s slot range, so ref is not undercounted.
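
A minimal sketch of the any-slot filter described in this comment, using plain Python lists instead of tensors. `rows_for_rank` is a hypothetical helper, and treating the rank's slot range as half-open, `[slot_start, slot_end)`, is an assumption.

```python
def rows_for_rank(recv_slots, slot_start, slot_end):
    """Return indices of rows where any top-k slot lies in [slot_start, slot_end)."""
    return [
        i for i, row in enumerate(recv_slots)
        if any(slot_start <= slot < slot_end for slot in row)
    ]


def rows_for_rank_first_slot_only(recv_slots, slot_start, slot_end):
    """The narrower check the comment warns about: only column 0 is inspected."""
    return [
        i for i, row in enumerate(recv_slots)
        if slot_start <= row[0] < slot_end
    ]
```

For `recv_slots = [[9, 1], [0, 5], [8, 9]]` with slots 0 through 3 local, the first-slot check keeps only row 1, while the any-slot check also keeps row 0, whose second slot is local.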

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5b9d7635-e151-4cbc-992b-652f735f3268

📥 Commits

Reviewing files that changed from the base of the PR and between eb0cbdb and ad688ac.

📒 Files selected for processing (8)
  • requirements.txt
  • tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
  • tensorrt_llm/_torch/modules/fused_moe/communication/nccl_ep.py
  • tensorrt_llm/_torch/modules/fused_moe/interface.py
  • tensorrt_llm/_torch/modules/fused_moe/nccl_ep_utils.py
  • tests/microbenchmarks/bench_moe_comm.py
  • tests/unittest/_torch/modules/moe/test_moe_comm.py

Comment thread requirements.txt
Comment thread tensorrt_llm/_torch/modules/fused_moe/communication/nccl_ep.py Outdated
Comment thread tensorrt_llm/_torch/modules/fused_moe/nccl_ep_utils.py Outdated
Comment thread tensorrt_llm/_torch/modules/fused_moe/nccl_ep_utils.py Outdated
Comment thread tests/microbenchmarks/bench_moe_comm.py Outdated
Comment thread tests/microbenchmarks/bench_moe_comm.py
Comment thread tests/unittest/_torch/modules/moe/test_moe_comm.py Outdated
@nv-lschneider nv-lschneider changed the title [https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support [https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support Jun 24, 2026
Tighten NCCL-EP validation, teardown handling, benchmark verification sizing, and DeepEPLL reference checks based on CodeRabbit review.

Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>
Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>