[Common/PyTorch/JAX] make offset of ClampedSwiGLU configurable by hxbai · Pull Request #2938 · NVIDIA/TransformerEngine

hxbai · 2026-04-28T08:39:23Z

Description

The previous ClampedSwiGLU follows GPT-OSS, which hard-codes the offset to 1.0.
DeepSeek-V4 uses ClampedSwiGLU without alpha and offset.
This PR makes the offset of ClampedSwiGLU configurable to support DeepSeek-V4.
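For readers unfamiliar with the activation, here is a minimal scalar sketch of what a configurable offset means. It is not the Transformer Engine kernel: the function name, the pure-Python form, and the default `limit=7.0` are assumptions for illustration, and the clamping convention (gate clamped from above, linear branch clamped on both sides) follows the GPT-OSS reference formulation as I understand it.

```python
import math


def clamped_swiglu(gate, linear, limit=7.0, alpha=1.702, glu_linear_offset=1.0):
    """Scalar reference for a clamped SwiGLU with a configurable linear offset.

    Hypothetical helper for illustration only; not the library API.
    """
    # Clamp the gate from above only and the linear branch on both sides.
    gate = min(gate, limit)
    linear = max(-limit, min(linear, limit))
    # Sigmoid-weighted gate activation, with alpha scaling the sigmoid input.
    act = gate / (1.0 + math.exp(-alpha * gate))
    # The offset is added to the linear branch after clamping.
    return act * (linear + glu_linear_offset)


# GPT-OSS style: offset of 1.0 (the previously hard-coded value).
gpt_oss_style = clamped_swiglu(1.0, 2.0)
# Offset disabled, assuming "without alpha and offset" means alpha=1.0, offset=0.0.
no_offset = clamped_swiglu(1.0, 2.0, alpha=1.0, glu_linear_offset=0.0)
```

With the offset at 1.0 the linear branch contributes `linear + 1`; with 0.0 it contributes `linear` unchanged, which is the only behavioral difference this PR exposes.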

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

greptile-apps · 2026-04-28T08:44:38Z

Greptile Summary

This PR makes the glu_linear_offset parameter of ClampedSwiGLU configurable (previously hardcoded to 1.0 following GPT-OSS), enabling support for DeepSeek-V4 which uses glu_linear_offset=0.0. The public C API is preserved via a clean _v2 versioning pattern while the original symbols are deprecated in-place.

C layer: nvte_clamped_swiglu_v2/nvte_clamped_dswiglu_v2 added to activation.h; ClampedSwiGLUParam gains glu_linear_offset; all CUDA kernels (vectorized_pointwise.h, gated_fp8.cuh, gated_mxfp8.cuh) correctly apply the offset after clamping and set the dgate boolean mask before adding offset in backward.
PyTorch/JAX bindings: ClampedSwiGLU, ScaledClampedQGeGLU, ClampedSwigluParams (JAX) all updated with default 1.0 preserving backward compatibility; pybind defaults and ONNX export path are consistent.
Fused grouped MLP (cuDNN path): linear_offset is forwarded to cuDNN only when FE ≥ 1.24.0 (_pass_geglu_runtime_params); no guard exists for non-default offsets when FE is in the [1.23, 1.24) range, which silently produces incorrect results in both forward and backward passes.

Confidence Score: 4/5

The core CUDA kernels, PyTorch/JAX bindings, and the standard ClampedSwiGLU path are all correct and backward-compatible. The fused grouped MLP cuDNN path has a gap where a non-default glu_linear_offset is silently ignored when cuDNN FE is in the [1.23, 1.24) range.

The ClampedSwiGLU standard path is mathematically correct across forward and backward for all kernel types. The cuDNN fused grouped MLP path does not pass linear_offset to the cuDNN kernel when _pass_geglu_runtime_params is False, meaning a user on cuDNN FE >= 1.23.0 but < 1.24.0 with a non-default glu_linear_offset gets silently wrong results.

transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py and backward_grouped_mlp.py need a guard or warning when non-default parameters are configured but the cuDNN FE version cannot support them.
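A guard of the kind suggested above could look roughly like the following sketch. The function names and the error message are hypothetical, and the version parsing assumes a plain `major.minor.patch` string; only the 1.24.0 threshold and the default offset of 1.0 come from this review.

```python
def parse_version(text):
    """Turn a plain 'major.minor.patch' string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def validate_offset_for_frontend(fe_version, glu_linear_offset, default_offset=1.0):
    """Raise if a non-default offset would be silently ignored by the fused path.

    Hypothetical helper: cuDNN FE >= 1.24.0 is assumed to accept the offset
    at runtime, while older versions only implement the default.
    """
    passes_runtime_params = parse_version(fe_version) >= (1, 24, 0)
    is_default = abs(glu_linear_offset - default_offset) < 1e-6
    if not passes_runtime_params and not is_default:
        raise ValueError(
            f"glu_linear_offset={glu_linear_offset} requires cuDNN FE >= 1.24.0, "
            f"found {fe_version}"
        )
```

Raising is one option; returning a flag so the caller falls back to the unfused ops would also close the gap.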

Important Files Changed

Filename	Overview
transformer_engine/common/activation/swiglu.cu	Adds nvte_clamped_swiglu_v2/nvte_clamped_dswiglu_v2 with configurable offset; old API preserved with hardcoded 1.0 offset and deprecation notice.
transformer_engine/common/include/transformer_engine/activation.h	Properly deprecates nvte_clamped_swiglu/nvte_clamped_dswiglu in favor of new v2 functions; old symbols preserved for ABI compatibility.
transformer_engine/common/util/vectorized_pointwise.h	Forward and backward CUDA kernel correctly applies glu_linear_offset after clamping; backward computes dgate_in before adding offset (correct) and uses offset-adjusted gate_in for the activation gradient (mathematically correct).
transformer_engine/pytorch/ops/basic/swiglu.py	Adds glu_linear_offset param to ClampedSwiGLU and ScaledClampedQGeGLU with correct default 1.0; helper methods properly forward the offset.
transformer_engine/pytorch/ops/fused/forward_grouped_mlp.py	Passes glu_linear_offset to cuDNN only when FE >= 1.24.0; no guard exists when FE is in [1.23, 1.24) and a non-default offset is configured, leading to silent incorrect behavior.
transformer_engine/pytorch/ops/fused/backward_grouped_mlp.py	Same gap as forward_grouped_mlp: cuDNN backward kernel is not given linear_offset when _pass_geglu_runtime_params is False, causing silent numerical mismatch for non-default offsets.
transformer_engine/jax/cpp_extensions/activation.py	Adds glu_linear_offset to ClampedSwigluParams and correctly serializes it for XLA FFI. Reference function for clamped_linear correctly applies the offset.
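To make the backward-pass remark concrete, here is a scalar sketch of the gradient under the same assumptions as the forward formula (gate clamped from above, linear branch clamped on both sides, offset added after clamping). The function and variable names are illustrative and do not match the kernel's. The point it shows is that the clamp masks are taken on the pre-offset inputs, while the gate gradient is multiplied by the offset-adjusted linear value.

```python
import math


def clamped_swiglu_fwd(gate, linear, limit, alpha, offset):
    """Forward pass: act(clamped gate) * (clamped linear + offset)."""
    g = min(gate, limit)
    lin = max(-limit, min(linear, limit))
    s = 1.0 / (1.0 + math.exp(-alpha * g))
    return g * s * (lin + offset)


def clamped_swiglu_bwd(gate, linear, grad_out, limit, alpha, offset):
    """Backward pass returning (grad_gate, grad_linear)."""
    g = min(gate, limit)
    lin = max(-limit, min(linear, limit))
    s = 1.0 / (1.0 + math.exp(-alpha * g))
    # Clamp masks are evaluated on the raw inputs, before any offset is added.
    gate_mask = 1.0 if gate <= limit else 0.0
    linear_mask = 1.0 if -limit <= linear <= limit else 0.0
    # Derivative of g * sigmoid(alpha * g) with respect to g.
    dact = s + alpha * g * s * (1.0 - s)
    # The gate gradient sees the offset-adjusted linear value.
    grad_gate = grad_out * dact * (lin + offset) * gate_mask
    # The offset is a constant, so it drops out of the linear gradient.
    grad_linear = grad_out * g * s * linear_mask
    return grad_gate, grad_linear
```

A central finite difference on `clamped_swiglu_fwd` is a quick way to sanity-check both gradients for any chosen offset.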

Reviews (18): Last reviewed commit: "Merge branch 'main' into swiglu_offset"

greptile-apps · 2026-04-28T08:44:42Z

+ *  \param[in]     glu_linear_offset  Offset added to the linear component after clamping (default 1.0).
 *  \param[in]     stream    CUDA stream used for the operation.
 */


Breaking public C API change

nvte_clamped_swiglu and nvte_clamped_dswiglu are public symbols declared in a versioned public header. Inserting glu_linear_offset before cudaStream_t is an ABI-breaking change: any external binary or shared library compiled against the old header will silently pass the stream pointer as the offset and a garbage value as the stream, leading to undefined behavior at runtime rather than a clean compile error if called via a pre-compiled library. This should be acknowledged as a breaking change in the PR checklist, and — if this library follows semantic versioning or a compatibility guarantee — a deprecation/transition path or version bump is needed.

timmoon10

The fused op for grouped MLP is hard-coded for GPT-OSS, so we should make sure not to fuse if glu_linear_offset != 1:

TransformerEngine/transformer_engine/pytorch/ops/_common.py

Lines 180 to 183 in df0025b

    
    elif isinstance(window[1], ScaledClampedQGeGLU) and (
        abs(window[1]._clamped.alpha - 1.702) > 0.001
        or not _nvidia_cudnn_frontend_supports_scaled_clamped_qgeglu()
    ):
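One way to read this suggestion is as an extra condition in the fusion eligibility check. The sketch below is a standalone illustration, not the code in `_common.py`: the function name and the `frontend_supported` flag are hypothetical, while the alpha value 1.702, the 0.001 tolerance, and the required offset of 1 come from the snippet and comment above.

```python
def can_fuse_scaled_clamped_qgeglu(alpha, glu_linear_offset, frontend_supported, tol=1e-3):
    """Return True only when the activation matches what the fused op hard-codes."""
    if not frontend_supported:
        return False
    # The fused kernel assumes the GPT-OSS alpha.
    if abs(alpha - 1.702) > tol:
        return False
    # It also assumes the GPT-OSS offset, so any other offset must not fuse.
    if abs(glu_linear_offset - 1.0) > tol:
        return False
    return True
```

For example, `can_fuse_scaled_clamped_qgeglu(1.702, 0.0, True)` returns `False`, so a zero-offset configuration would take the unfused path.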

timmoon10 · 2026-04-28T18:15:43Z

/te-ci

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

vthumbe1503 · 2026-05-06T18:36:11Z


 void nvte_clamped_swiglu(const NVTETensor input, NVTETensor output, float limit, float alpha,
-                         cudaStream_t stream) {
+                         float glu_linear_offset, cudaStream_t stream) {


Can we define new APIs named nvte_clamped_swiglu_v2 and nvte_clamped_dswiglu_v2
and deprecate this API here to not break backward compatibility?

Rewrote this part.
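The `_v2` pattern requested here can be illustrated with a Python analogue. This is only a sketch of the versioning idea, not the C API: the scalar signatures are hypothetical, and the real functions take tensors and a CUDA stream. The old entry point keeps its signature, warns, and forwards to the new one with the previously hard-coded offset.

```python
import math
import warnings


def clamped_swiglu_v2(gate, linear, limit, alpha, glu_linear_offset):
    """New entry point: the offset is an explicit argument."""
    g = min(gate, limit)
    lin = max(-limit, min(linear, limit))
    return g / (1.0 + math.exp(-alpha * g)) * (lin + glu_linear_offset)


def clamped_swiglu(gate, linear, limit, alpha):
    """Old entry point: unchanged signature, deprecated, offset fixed at 1.0."""
    warnings.warn(
        "clamped_swiglu is deprecated; use clamped_swiglu_v2",
        DeprecationWarning,
        stacklevel=2,
    )
    return clamped_swiglu_v2(gate, linear, limit, alpha, glu_linear_offset=1.0)
```

Existing callers keep working with identical results, and new callers opt into the extra parameter by switching to the `_v2` name.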

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

vthumbe1503 · 2026-05-12T06:01:20Z

/te-ci

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

vthumbe1503 · 2026-05-13T15:20:18Z

/te-ci

jberchtold-nvidia

Overall looks pretty good from the JAX side, thanks for adding the JAX changes too! Left a couple small comments

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

jberchtold-nvidia · 2026-05-15T15:52:41Z

/te-ci

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

jberchtold-nvidia

LGTM from JAX perspective! Once Tim/Varun approve for PyTorch changes and CI passes you can merge it. Thanks!

jberchtold-nvidia · 2026-05-18T15:30:03Z

/te-ci

vthumbe1503 · 2026-05-19T03:53:22Z

/te-ci pytorch

vthumbe1503 · 2026-05-20T19:32:54Z

/te-ci

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

timmoon10 · 2026-05-22T19:11:12Z

/te-ci

timmoon10

LGTM, pending CI

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

timmoon10 · 2026-05-26T16:22:14Z

Pipeline 52303026

greptile-apps Bot reviewed Apr 28, 2026

timmoon10 reviewed Apr 28, 2026

swiglu offset

86b9199

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

hxbai force-pushed the swiglu_offset branch from 1ed113b to 86b9199 Compare April 29, 2026 00:05

hxbai marked this pull request as draft April 29, 2026 00:28

fix fusion pattern check

1eab899

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

hxbai marked this pull request as ready for review April 29, 2026 01:01

vthumbe1503 reviewed May 6, 2026

vthumbe1503 and others added 3 commits May 6, 2026 11:38

Merge branch 'main' into swiglu_offset

aec7013

Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>

use swiglu_v2

2aca498

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

add default value to v1

4d7be63

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

hxbai added 2 commits May 12, 2026 15:13

Merge branch 'main' into swiglu_offset

96d99ba

fix test

9a77ebf

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

hxbai requested review from Oleg-Goncharov, jberchtold-nvidia, ksivaman and ptrendx as code owners May 13, 2026 08:02

Merge branch 'main' into swiglu_offset

72f2a57

jberchtold-nvidia reviewed May 13, 2026

Comment thread transformer_engine/jax/csrc/extensions.h

Comment thread transformer_engine/jax/cpp_extensions/activation.py

add default value to jax version

1d323d4

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

Victarry mentioned this pull request May 15, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap NVIDIA/Megatron-LM#4815

Open

71 tasks

hxbai and others added 2 commits May 16, 2026 06:38

revert the default value change

07a69a7

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

e1fec14

for more information, see https://pre-commit.ci

Merge branch 'main' into swiglu_offset

3dd37ea

jberchtold-nvidia previously approved these changes May 18, 2026

Merge branch 'main' into swiglu_offset

447f2c4

Merge branch 'main' into swiglu_offset

1e9cdac

hxbai added 2 commits May 20, 2026 23:14

update the fusion path

257253a

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

update cudnn-frontend to 1.24.0

996f24c

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

hxbai dismissed jberchtold-nvidia’s stale review via 996f24c May 21, 2026 06:17

Merge remote-tracking branch 'origin/main' into swiglu_offset

c87565e

Signed-off-by: Hongxiao Bai <hongxiaob@nvidia.com>

hxbai force-pushed the swiglu_offset branch from ef6d635 to c87565e Compare May 22, 2026 01:00

github-actions Bot added the community-contribution label (PRs from external contributor outside the core maintainers, representing community-driven work) May 22, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

84e1ef1

for more information, see https://pre-commit.ci

vthumbe1503 reviewed May 22, 2026

Comment thread transformer_engine/pytorch/ops/_common.py

Merge branch 'main' into swiglu_offset

2b2ae22

timmoon10 previously approved these changes May 22, 2026

Merge branch 'main' into swiglu_offset

508b9c8

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

timmoon10 dismissed their stale review via 508b9c8 May 22, 2026 23:23

timmoon10 approved these changes May 26, 2026

timmoon10 merged commit 7e6ffcc into NVIDIA:main May 26, 2026
12 of 13 checks passed

4 participants
