[https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support by nv-lschneider · Pull Request #15597 · NVIDIA/TensorRT-LLM

nv-lschneider · 2026-06-24T16:02:03Z

Summary by CodeRabbit

New Features
- Added a new NCCL-based expert-parallel communication option for MoE workloads.
- Expanded automatic backend selection to use the new option when supported, with fallback to existing methods.
- Added support for the new backend in benchmark and test coverage.
Bug Fixes
- Improved handling of token dispatch and combine flows for this communication path.
- Added a verification mode in the benchmark to help catch backend routing issues before timing runs.

Description

This PR adds NCCL-EP v0.1 as a supported MoE communication option through nccl4py.

The scope is limited to enabling and validating NCCL-EP v0.1 support. Updates for NCCL-EP v0.2 are intentionally left out and will be handled in separate follow-up PRs.

Test Coverage

Relevant coverage:

tests/unittest/_torch/modules/moe/test_moe_comm.py
- test_moe_comm
- test_moe_comm_boundary
- test_moe_comm_postquant
- test_moe_comm_non_divisible_ep

The MoE communication validation path also supports NCCL-EP selection through tests/microbenchmarks/bench_moe_comm.py --verify.

Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Squashes GitLab MR 10228 into one commit. Co-authored-by: Bo Li <lbo@nvidia.com> Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>

nv-lschneider · 2026-06-24T16:09:31Z

This PR is blocked from running CI and merging until the NCCL version is updated to 2.30.4 with PR: #15087

Please take the opportunity to review and bring up concerns, so we can merge as soon as dependencies are met.

coderabbitai · 2026-06-24T16:14:30Z

📝 Walkthrough

Walkthrough

Adds NcclEP, a new NCCL-backed expert-parallelism dispatch/combine communication strategy for fused MoE. The implementation spans a persistent context utility module (nccl_ep_utils.py), a NcclEP class implementing rank-major dispatch/combine, factory auto-selection and forced-method support, a new AlltoallMethodType.NcclEP = 5 enum variant, the nccl4py>=0.3 dependency, and corresponding unit test and microbenchmark additions.

Changes

NcclEP MoE Expert Parallelism Communication Backend

Layer / File(s)	Summary
Interface enum, dependency, and public API exports `tensorrt_llm/_torch/modules/fused_moe/interface.py`, `requirements.txt`, `tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py`	Adds `AlltoallMethodType.NcclEP = 5`, pins `nccl4py>=0.3`, and exports `NcclEP` from the communication package.
NcclEpContext persistent context and singleton cache `tensorrt_llm/_torch/modules/fused_moe/nccl_ep_utils.py`	Implements availability/version checks, `NcclEpContext` with MPI sub-communicator, NCCL communicator, EP group, and all persistent receive buffers (including optional VMM/zero-copy path), plus singleton cache (`get_nccl_ep_context`) and process teardown (`destroy_all_nccl_ep_contexts`).
NcclEP dispatch/combine class `tensorrt_llm/_torch/modules/fused_moe/communication/nccl_ep.py`	Implements `NcclEP(Communication)`: constructor, `_setup_handle` (persistent NCCL EP handle with CUDA-graph-safe update), `dispatch` (validation, sentinel fill, buffer assembly, rank-major output with local→global expert-id translation), `combine` (3D reshape, `handle.combine` invocation), and `destroy`.
CommunicationFactory auto-selection and forced-method support `tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py`	Adds `NcclEP` to the documented priority order, inserts a bfloat16 auto-selection step with `RuntimeError` fallthrough, and handles `"NCCL_EP"` in `_create_forced_method()`.
Unit tests and microbenchmark coverage `tests/unittest/_torch/modules/moe/test_moe_comm.py`, `tests/microbenchmarks/bench_moe_comm.py`	Unit tests add `COMM_NCCL_EP` to the test matrix with platform/feasibility gating and rework the `COMM_DEEP_EP_LL` combine reference. The microbenchmark adds `NCCL_EP` to defaults, a `--verify` flag, a sentinel-based `_verify_dispatch_sentinel` helper, and all-gather verification logic.

Sequence Diagram(s)

sequenceDiagram
  participant MoEPipeline
  participant CommunicationFactory
  participant NcclEP
  participant NcclEpContext
  participant nccl_ep_lib as nccl.ep

  MoEPipeline->>CommunicationFactory: create_strategy(act_dtype=bfloat16, ...)
  CommunicationFactory->>NcclEP: __init__(mapping, num_slots, hidden_size, ...)
  NcclEP->>nccl_ep_lib: is_nccl_ep_installed() check
  NcclEP->>NcclEpContext: get_nccl_ep_context(mapping, experts, tokens, hidden, top_k)
  NcclEpContext->>nccl_ep_lib: MPI split → Communicator → Group.create
  NcclEpContext->>NcclEpContext: allocate persistent buffers (output_tokens, recv_topk_idx, recv_topk_weights, ...)
  NcclEpContext-->>NcclEP: context
  NcclEP-->>CommunicationFactory: NcclEP instance
  CommunicationFactory-->>MoEPipeline: NcclEP strategy

  MoEPipeline->>NcclEP: dispatch(hidden_states, token_slots, token_weights, all_rank_num_tokens)
  NcclEP->>NcclEP: _setup_handle (create or rebind topk_nd)
  NcclEP->>nccl_ep_lib: handle.dispatch(DispatchInputs, DispatchOutputs, stream)
  nccl_ep_lib-->>NcclEP: dispatched tokens in output_tokens_buf
  NcclEP-->>MoEPipeline: (dispatched_hidden, scales, recv_topk_idx, recv_weights)

  MoEPipeline->>NcclEP: combine(final_hidden_states)
  NcclEP->>nccl_ep_lib: handle.combine(inputs, output_buf, stream)
  nccl_ep_lib-->>NcclEP: combined [num_tokens, hidden]
  NcclEP-->>MoEPipeline: combined hidden states

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

xxi-nv
hyukn
yuxianq
longcheng-nv
pengbowang-nv

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title matches the main change and follows the requested ticket/type/summary format.
Description check	✅ Passed	The PR description includes the required Description and Test Coverage sections and the checklist confirmation, with only the optional title left out.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 8

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@requirements.txt`:
- Around line 29-33: The NCCL wheel constraint is too low for the new NCCL EP
path, so standard installs keep the backend disabled. Update the NCCL dependency
floor in requirements.txt by raising the nvidia-nccl-cu13 version range to a
release at or above the minimum enforced in nccl_ep_utils.py, and keep the
nccl4py dependency aligned with that backend requirement. Use the existing
checks in tensorrt_llm/_torch/modules/fused_moe/nccl_ep_utils.py as the source
of truth when choosing the new minimum.

In
`@tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py`:
- Around line 186-202: Validate the NCCL EP preconditions before instantiating
NcclEP in the communication selection logic. In communication_factory.py, update
the auto-selection path and the forced "NCCL_EP" branch to explicitly check that
act_dtype is torch.bfloat16 and that moe_max_num_tokens is provided before
constructing NcclEP; if the prerequisites are not met, skip to the fallback
strategy or raise a clear validation error. Use the existing NcclEP selection
points and the strategy dispatch in the factory to keep the behavior consistent
across both paths.

In `@tensorrt_llm/_torch/modules/fused_moe/communication/nccl_ep.py`:
- Around line 79-80: Replace the assert-based guards in the NCCL EP path with
explicit runtime exceptions so they cannot be stripped under optimization.
Update the checks in __init__(), dispatch(), and combine() in nccl_ep.py to
raise ValueError or RuntimeError for invalid moe_max_num_tokens,
all_rank_num_tokens, top_k, and the tensor-size validation in combine(), keeping
the existing validation logic but using exceptions instead of assert.

In `@tensorrt_llm/_torch/modules/fused_moe/nccl_ep_utils.py`:
- Around line 273-276: The FP8 hidden-size check in `nccl_ep_utils.py` currently
uses an `assert`, which can be removed under optimized Python and allow invalid
shapes to reach `torch.empty(...)`. Replace the `assert` inside the `use_fp8`
branch with an explicit `ValueError` in the same validation path so
`get_dispatch_layout` (or the surrounding FP8 dispatch logic) fails fast before
buffer allocation.
- Around line 74-78: Narrow the broad NCCL EP error handling in the probing and
teardown paths so unrelated bugs are not swallowed. In nccl_ep_utils, replace
each except Exception block with the specific nccl4py/MPI/runtime exception
types actually raised by the corresponding calls, keeping the existing fallback
behavior only for those expected failures. Apply the same tightening in
communication/nccl_ep.py around the NCCL EP setup/cleanup logic, and preserve
the current logging/return flow in the affected helper functions.

In `@tests/microbenchmarks/bench_moe_comm.py`:
- Around line 837-838: The benchmark workload in bench_moe_comm is capped by the
default local_num_tokens, which can produce fewer routes than ep_size and make
--verify fail even when the setup is valid. Update the workload sizing logic in
the benchmark entry points around the functions that build the MoE communication
patterns so the sentinel work scales up to cover every receiver when verify is
enabled, instead of relying on a fixed 16-token default. Keep the change
consistent across the affected call sites in the benchmark helpers so
verification and timing use the same scaled token count.
- Around line 895-897: Apply the Ruff-related cleanup in bench_moe_comm by
updating the affected comprehensions and list construction: add strict=True to
each zip() call referenced in the benchmark helpers, including the histogram
comprehension in the unique/counts mapping, and replace the [-1] +
list(range(ep_size)) pattern with [-1, *range(ep_size)] in the relevant routing
setup. Use the nearby symbols and benchmark functions in bench_moe_comm to
locate the exact zip() calls and the ep_size list expression.

In `@tests/unittest/_torch/modules/moe/test_moe_comm.py`:
- Around line 1306-1308: The representative-row filtering in the DeepEPLL
reference is only checking recv_slots[i, 0], which can drop tokens routed to
this rank when their first slot is outside the local range. Update the logic in
the reference path around the recv_slots/slot_start/slot_end check to inspect
all top-k slots for each row and keep the row if any slot falls within the
rank’s slot range, so ref is not undercounted.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5b9d7635-e151-4cbc-992b-652f735f3268

📥 Commits

Reviewing files that changed from the base of the PR and between eb0cbdb and ad688ac.

📒 Files selected for processing (8)

requirements.txt
tensorrt_llm/_torch/modules/fused_moe/communication/__init__.py
tensorrt_llm/_torch/modules/fused_moe/communication/communication_factory.py
tensorrt_llm/_torch/modules/fused_moe/communication/nccl_ep.py
tensorrt_llm/_torch/modules/fused_moe/interface.py
tensorrt_llm/_torch/modules/fused_moe/nccl_ep_utils.py
tests/microbenchmarks/bench_moe_comm.py
tests/unittest/_torch/modules/moe/test_moe_comm.py

Tighten NCCL-EP validation, teardown handling, benchmark verification sizing, and DeepEPLL reference checks based on CodeRabbit review. Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>

Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>

[NVBUG-6020038][feat] Use nccl4py for NCCL EP integration

ad688ac

Squashes GitLab MR 10228 into one commit. Co-authored-by: Bo Li <lbo@nvidia.com> Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>

nv-lschneider requested review from a team as code owners June 24, 2026 16:02

nv-lschneider requested a review from xxi-nv June 24, 2026 16:02

github-actions Bot assigned nv-lschneider Jun 24, 2026

nv-lschneider marked this pull request as draft June 24, 2026 16:06

Merge branch 'main' into tekit-mr-10228-squashed

59f276c

nv-lschneider changed the title ~~[NVBUG-6020038][feat] Use nccl4py for NCCL EP integration~~ [NVBUG-6020038][feat] Add NCCL-EP v0.1 MoE communication support Jun 24, 2026

nv-lschneider changed the title ~~[NVBUG-6020038][feat] Add NCCL-EP v0.1 MoE communication support~~ [https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support Jun 24, 2026

nv-lschneider changed the title ~~[https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support~~ [https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support Jun 24, 2026

nv-lschneider requested review from Tabrizian and bobboli June 24, 2026 16:13

coderabbitai Bot reviewed Jun 24, 2026

View reviewed changes

nv-lschneider changed the title ~~[https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support~~ [https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support Jun 24, 2026

nv-lschneider added 2 commits June 24, 2026 14:32

[NVBUG-6020038][chore] Address CodeRabbit NCCL-EP review feedback

2c9ebd2

Tighten NCCL-EP validation, teardown handling, benchmark verification sizing, and DeepEPLL reference checks based on CodeRabbit review. Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>

[NVBUG-6020038][fix] fix NCCL EP teardown and folded weights

e6c35cb

Signed-off-by: Ludwig Schneider <lschneider@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support#15597

[https://nvbugs/6020038][feat] Add NCCL-EP v0.1 MoE communication support#15597
nv-lschneider wants to merge 4 commits into
NVIDIA:mainfrom
nv-lschneider:tekit-mr-10228-squashed

nv-lschneider commented Jun 24, 2026 •

edited

Loading

Uh oh!

nv-lschneider commented Jun 24, 2026

Uh oh!

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

nv-lschneider commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

GitHub Bot Help

Uh oh!

nv-lschneider commented Jun 24, 2026

Uh oh!

coderabbitai Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

nv-lschneider commented Jun 24, 2026 •

edited

Loading

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading