TE EP integration to MoEBlock by tdophung · Pull Request #3116 · NVIDIA/TransformerEngine

tdophung · 2026-06-10T21:02:50Z

Description

I will rebase and squash the commits on this branch shortly before merging.
I will also update the JAX APIs if needed once the TE EP JAX work merges.


Reset 33 local commits onto phuong/ep-3-jax @ c34771d (her latest, with the EpConfig + EpLayerConfig API and NCCL bumped to 808d2433) and re-applied the three deltas that are uniquely ours:

* transformer_engine/jax/moe.py: replaces upstream's multi-backend MoE block with our TE-EP-only single-custom-vjp rewrite. Adapted to her new API surface: tex.EpLayerConfig replaces tex.ep_make_handle (no more EpHandle pool/cache); 5 EP callsites rewired (cfg passed in place of handle, ep_prepare arg order swapped, top_k= dropped from ep_dispatch_bwd since it's now in cfg).
* tests/jax/test_te_ep_moe.py: TE-EP MoE test (kept), with the ep_bootstrap kwargs ep_size= and allow_handle_mem_reloc= dropped (no longer supported; ep_size is derived from mesh axes and the handle_mem reloc gating is gone).
* tests/jax/run_te_ep_moe.sh: multi-process launcher (kept).

Pre-sync state preserved at branch teddy/te_ep_integration.backup-pre-phuong-sync.


jberchtold-nvidia · 2026-06-10T21:53:31Z

    use_bias: bool = False
+    # Per-expert router bias added before the top-k. Only meaningful when
+    # score_function='sigmoid'.
+    use_expert_bias: bool = False


nit SR: can we rename use_bias -> use_ffn_bias and use_expert_bias -> use_expert_routing_bias?
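
For context on what the flag controls, here is a minimal sketch of sigmoid routing with a per-expert bias added before the top-k. The function name `route_with_expert_bias` is hypothetical, and using the unbiased scores as gate weights is an assumption for illustration; the PR's actual router may differ.

```python
import jax
import jax.numpy as jnp


def route_with_expert_bias(logits, expert_bias, top_k):
    """Hypothetical helper: pick experts from biased scores.

    Assumption: the bias only influences *which* experts are selected;
    the gate weights come from the unbiased sigmoid scores.
    """
    scores = jax.nn.sigmoid(logits.astype(jnp.float32))
    biased = scores + expert_bias.astype(jnp.float32)
    # top_k returns (values, indices) sorted by descending value.
    _, expert_idx = jax.lax.top_k(biased, top_k)
    weights = jnp.take_along_axis(scores, expert_idx, axis=-1)
    return expert_idx, weights


logits = jnp.array([[2.0, 1.0, 0.0, -1.0]])  # one token, four experts
bias = jnp.array([0.0, 0.0, 0.0, 1.0])       # boosts expert 3 only
idx, w = route_with_expert_bias(logits, bias, top_k=2)
```

With the bias applied, expert 3 is selected even though its raw score is the lowest, which is the behaviour the `use_expert_bias` comment describes.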

jberchtold-nvidia · 2026-06-10T21:54:07Z

+    # Minimum per-expert slot alignment fed to ``tex.ep_prepare``. Default 0
+    # uses the natural slot count; set to e.g. 128 to satisfy FP8 grouped-GEMM
+    # tile alignment.
+    align_size: int = 0


Placeholder comment for me to fix this so align_size is inferred automatically based on the recipe and doesn't need to be specified by the user

jberchtold-nvidia · 2026-06-10T21:55:52Z

+                nn.with_logical_partitioning(self.bias_init, ("exp",)),
                (self.num_experts,),
-                self.dtype,
+                jnp.float32,


Is the router always in fp32, so that this expert bias must be as well? If so, can we add a short comment saying so?

jberchtold-nvidia · 2026-06-10T22:19:29Z



-__all__ = ["moe", "PermutationBackend"]
+def _with_sharding_constraint_cast_bwd(x: jnp.ndarray, sharding) -> jnp.ndarray:


Why do we need this utility function? I haven't seen something like this required for our other VJPs

jberchtold-nvidia · 2026-06-10T22:22:22Z

+# is a frozen dataclass of ints); the rest are jnp.ndarray,
+# GroupedNoScaleTensor (already a pytree), or None when aux_loss_coeff == 0.
+@register_pytree_node_class
+@dataclass


I think this tree_flatten was from my patch, but looking at the diff I think it'd be better to use the @flax_struct.dataclass you were using on the permutation dataclasses since that seems to auto-populate a default pytree flatten/unflatten for us

jberchtold-nvidia · 2026-06-10T22:25:19Z

+    else:
+        d_recv_w_from_intermediate = jnp.zeros_like(recv_w_flat)
+
+    # Activation bwd. Mirror the fwd's fp32 promotion of silu+multiply


Why is this dtype casting required? I don't recall us needing it for the non-MoE LNMLP block
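
To make the question concrete, here is a sketch of what "mirroring the fwd's fp32 promotion" could look like for a silu-and-multiply activation. The function names are hypothetical and this is not the PR's code; it only illustrates computing the backward in fp32 and casting back to the input dtypes.

```python
import jax
import jax.numpy as jnp


def swiglu_fwd(gate, up):
    # Promote to fp32 for the activation, then cast back.
    g32, u32 = gate.astype(jnp.float32), up.astype(jnp.float32)
    return (jax.nn.silu(g32) * u32).astype(gate.dtype)


def swiglu_bwd(gate, up, dy):
    # Same promotion as the forward, so both passes see identical numerics.
    g32 = gate.astype(jnp.float32)
    u32 = up.astype(jnp.float32)
    dy32 = dy.astype(jnp.float32)
    sig = jax.nn.sigmoid(g32)
    dsilu = sig * (1.0 + g32 * (1.0 - sig))  # d/dg [g * sigmoid(g)]
    d_gate = dy32 * u32 * dsilu
    d_up = dy32 * g32 * sig                  # dy * silu(g)
    return d_gate.astype(gate.dtype), d_up.astype(up.dtype)


gate = jnp.array([0.5, -1.0, 2.0])
up = jnp.array([1.0, 2.0, -3.0])
dy = jnp.ones(3)
d_gate, d_up = swiglu_bwd(gate, up, dy)

# Reference gradients from autodiff of the forward.
_, vjp_fn = jax.vjp(swiglu_fwd, gate, up)
ref_d_gate, ref_d_up = vjp_fn(dy)
```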

jberchtold-nvidia · 2026-06-10T22:29:24Z

+    # local expert. We must size to that worst case or NCCL EP's HT kernel
+    # rejects the dispatch buffer with ``invalid argument``.
+    natural_spe = num_ep * max_tokens_per_rank  # = (B // dp_size) * S
+    # NCCL EP requires each expert-major output block to be at least


Do we currently have a use case for user-specified alignments beyond 128? If NCCL EP requires an alignment of at least 128, and an alignment of 128 is sufficient for all TE grouped GEMM types, would it make sense to hardcode _ALIGN_SIZE = 128 as a constant at the top of the file for now, to simplify this MoEBlock API?

We can always expand the API to support a user-specified align size in the future
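
A rough sketch of the hardcoded-constant variant, assuming the alignment is applied by rounding the natural slot count up to the next multiple. The helper names `round_up` and `slots_per_expert` are hypothetical; only `natural_spe = num_ep * max_tokens_per_rank` comes from the diff.

```python
_ALIGN_SIZE = 128  # hypothetical module-level constant, per the suggestion above


def round_up(n: int, align: int) -> int:
    """Round n up to the next multiple of align; align <= 0 means no alignment."""
    if align <= 0:
        return n
    return -(-n // align) * align  # ceiling division, then scale back


def slots_per_expert(num_ep: int, max_tokens_per_rank: int, align: int = _ALIGN_SIZE) -> int:
    # Worst case from the diff: every rank routes all of its tokens
    # to the same local expert.
    natural_spe = num_ep * max_tokens_per_rank
    return round_up(natural_spe, align)


spe = slots_per_expert(num_ep=4, max_tokens_per_rank=300)
```

Keeping `align` as a defaulted keyword in the private helper would leave room to expose it again later without touching call sites.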

jberchtold-nvidia · 2026-06-10T22:32:31Z

+        batch_pspec_axis = (*data_parallelism_axes, ep_axis)
+    ep3_spec = P(batch_pspec_axis, None, None)
+    ep2_spec = P(batch_pspec_axis, None)
+    x = jax.lax.with_sharding_constraint(x, NamedSharding(mesh, ep3_spec))


Which axis-name inputs are physical mesh axes, and which can be logical axes? I see x = with_sharding_constraint_by_logical_axes(x, input_axes) above, but here we directly use jax.lax.with_sharding_constraint, which only supports mesh axes.

No need to make any changes for now, I just want to assess which are which and then we can discuss if it makes sense to support logical on some/all or if some are required to be physical axes. Thanks!

jberchtold-nvidia · 2026-06-10T22:33:22Z

+    # `grad_pre_combine * w` sees them. Padded positions in sparse_probs
+    # are already zero (routing_map is False there); only the rare
+    # underflow path emits NaN.
+    sparse_probs = jnp.where(jnp.isnan(sparse_probs), 0, sparse_probs).astype(dtype)


Is this NaN filtering a debugging artifact or something we need in the final version?
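
For reference, a small standalone sketch of why the filter cannot be replaced by a downstream multiply-by-zero mask: NaN times zero is still NaN. The helper name `zero_nans` is hypothetical; the `jnp.where` expression matches the line in the diff.

```python
import jax.numpy as jnp


def zero_nans(x, dtype):
    # Same expression as the diff: replace NaN with 0, then cast.
    return jnp.where(jnp.isnan(x), 0, x).astype(dtype)


probs = jnp.array([0.5, jnp.nan, 0.0, 0.25], dtype=jnp.float32)
mask = jnp.array([1.0, 0.0, 1.0, 1.0], dtype=jnp.float32)

still_nan = probs * mask                # NaN survives multiplication by 0
clean = zero_nans(probs, jnp.bfloat16)  # NaN removed before the cast
```

If the filter stays, note that `jnp.nan_to_num` would also rewrite infinities by default, so `jnp.where` on `jnp.isnan` is the narrower choice.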

phu0ngng and others added 30 commits June 9, 2026 18:27

Expert Parallelism: common C API + NCCL EP v0.1 backend

ee7dfff

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

Expert Parallelism: persistent ncclEpHandle cache with allow_handle_m…

de64b7c

…em_reloc gating Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

5242ad7

for more information, see https://pre-commit.ci

Build: NCCL_HOME discovery supports Debian/Ubuntu multiarch lib paths

f90ec21

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

bump NCCL

37890dc

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

Expert Parallelism: require token_dtype in NVTEEpGroupConfig and enfo…

ac4be5d

…rce at dispatch Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

Expert Parallelism: document ep_comm lifetime, v0.1 single-GPU scope,…

5c09ef6

… static layer registration Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

Expert Parallelism: drop version label from initialize scope note

7adcd54

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

Expert Parallelism: pointer-keyed LRU handle cache; drop register_lay…

a6ad78a

…er + NVTEEpHandle struct (NVTE_EP_HANDLE_CACHE_SIZE=-1 disables eviction) Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

bump nccl to latest v0.1

7efb23c

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

tests/cpp_distributed: drop unused NCCL EP header include path

191a6f8

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/ep: fold nvte_ep_* stubs into ep_api.cpp under #if NVTE_WITH_N…

20e61f2

…CCL_EP Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/ep: dlopen libnccl_ep.so so libtransformer_engine.so loads wit…

af23caa

…hout it Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/ep: add BUILD_RPATH=NCCL_EP_LIB_DIR for in-tree dev builds

6a74916

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/ep: polish ep.h docstrings; drop unused NVTE_CHECK_NCCL from l…

9fd036d

…ogging.h Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/ep: expose zero_copy in NVTEEpGroupConfig; map to NCCL_EP_ZERO…

e6ea573

…_COPY_{ON,OFF} Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

tests/cpp_distributed: exercise zero_copy=ON in EPZeroCopyTest.Identi…

e49c6f9

…tyAllSymm Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

tests/cpp_distributed: tighten EPZeroCopyTest comments

3ba2243

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/CMakeLists: correct NCCL resolution comment (not bundled with …

f822f9c

…CUDA Toolkit) Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/CMakeLists: shorten NCCL/GIN headers comments

bd6b81e

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

setup,common: bundle libnccl_ep.so.0 next to libtransformer_engine.so…

3060b6f

… for wheel install Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

.gitmodules: drop nccl branch pin and align indentation with other su…

2ba4795

…bmodules Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

setup: gate NCCL EP build on arch >= 90 or native; drop sm_90 fallback

a3e12eb

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common,setup,tests: discover nccl.h via find_path/NCCL_INCLUDE_DIR; d…

9821ffd

…rop submodule header mirror Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/ep: simplify make_nccl_ep_tensor to take NVTETensor and option…

ad857d6

…al CommWindow Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/ep: move te_dtype_to_nccl_dtype out of EPBackend into anon nam…

3de3f89

…espace Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/ep: reword multicast check; drop NVLS framing

c343871

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common,tests: replace unicode em-dash and box-drawing chars with ASCI…

420f534

…I in EP files Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

bump nccl to latest v0.1

147c765

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

2c38a50

for more information, see https://pre-commit.ci

phu0ngng and others added 21 commits June 9, 2026 18:27

nccl commit to 2.31.0a4-1

f3b3250

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/CMakeLists: point NCCL_EP_INCLUDE_DIR at build/include staged …

50314e8

…headers Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/CMakeLists: clarify NCCL EP missing-header instructions

1f3793a

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/ep: use int64_t instead of long for handle-cache size env (cpp…

335d387

…lint runtime/int) Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

common/ep: fix dangling sizes pointer in make_nccl_ep_tensor (NVTESha…

2611724

…pe lifetime) Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

6dd5ffc

for more information, see https://pre-commit.ci

jax: add EP bindings on pointer-keyed cache with EpLayerConfig and bf…

68dd8ce

…16 max_token_dtype Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

jax/ep: drop topk_weights from ep_combine; caller must pre-multiply

1a0e0df

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

tests/jax/ep: mask uninitialized recv_tokens tail in dispatch_vjp

0e108a7

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

examples/jax/ep: add ep_bench.py + run_ep_bench.sh

3e6ac88

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

examples/jax/ep: ep_moe.py runs --iters fwd+bwd steps (default 3)

4a7c8a2

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

jax/ep: tighten sharding contract, drop helpers, route bwd through TE…

f087725

… with_sharding_constraint Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

jax/ep: derive ep_size and num_ep_groups from active mesh in ep_boots…

1e6b44c

…trap Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

examples/jax/ep: rename ep_handle to layer_cfg in ep_moe.py (matches …

512c63b

…EpLayerConfig type) Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

jax/ep: add primitive docstrings and silence missing-kwoa false posit…

8a5666e

…ives (lint 10.00) Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

jax/ep: apply black formatting (pre-commit hook output)

099185a

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

build_tools/jax: gate NCCL EP on NVTE_BUILD_WITH_NCCL_EP (default on)…

90ee6b6

…; define NVTE_WITH_NCCL_EP Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

jax/ep: collapse 5 FFI attr structs into single EpConfig

fd55ef2

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

jax/ep: dedup _ep_outer_axis, normalize _ep_spec_ok, unify outer_abst…

07f928c

…ract, drop dead helpers Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

jax/ep: apply clang-format and silence pylint unused-arg in lowering

c34771d

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

tdophung force-pushed the teddy/te_ep_integration branch from 0ff3bff to bd14fe6 Compare June 10, 2026 21:58

[pre-commit.ci] auto fixes from pre-commit.com hooks

716d615

for more information, see https://pre-commit.ci

jberchtold-nvidia reviewed Jun 10, 2026

tdophung wants to merge 52 commits into NVIDIA:main from tdophung:teddy/te_ep_integration

3 participants


