feat(CK): Add gelu_tanh_and_mul to MoE GEMM #8886

Open
jonahbernard wants to merge 4 commits into ROCm:develop from jonahbernard:gelu-tanh-moe-bf16

@jonahbernard

Motivation

The fused MoE GEMM previously supported only the gelu_and_mul (exact,
erf-based) and silu_and_mul activations. Models that use the tanh
approximation of GELU in their MoE experts (e.g. Gemma-family
architectures) had no matching activation in the CK fused MoE path. This PR
adds gelu_tanh_and_mul so those models can run their MoE layers on the CK
gridwise MoE GEMM.

Technical Details

Adds a new activation gelu_tanh_and_mul = 3 to the MoE GEMM, computed via
the existing FastGelu functor (tanh approximation), alongside the existing
exact Gelu (gelu_and_mul = 0).
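To make the difference between the two GELU variants concrete, here is a minimal host-side sketch of the math. It is not CK's FastGelu or Gelu functor; the function names are hypothetical, and the layout (gate values in the first half of the row, up values in the second half) is an assumption made only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2))).
inline float gelu_erf(float x)
{
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));
}

// Tanh approximation of GELU:
// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
inline float gelu_tanh(float x)
{
    const float sqrt_2_over_pi = 0.7978845608f;
    return 0.5f * x * (1.0f + std::tanh(sqrt_2_over_pi * (x + 0.044715f * x * x * x)));
}

// Hypothetical row-wise gelu_tanh_and_mul.
// Assumed layout: gate_up = [gate_0 .. gate_{n-1}, up_0 .. up_{n-1}].
inline std::vector<float> gelu_tanh_and_mul(const std::vector<float>& gate_up)
{
    assert(gate_up.size() % 2 == 0);
    const std::size_t n = gate_up.size() / 2;
    std::vector<float> out(n);
    for(std::size_t i = 0; i < n; ++i)
    {
        // Activation is applied to the gate only, then multiplied by up.
        out[i] = gelu_tanh(gate_up[i]) * gate_up[n + i];
    }
    return out;
}
```

The two variants agree closely but not bit-for-bit, which is why a model trained with the tanh form would want a matching activation rather than the erf-based one.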

  • Enum (gridwise_gemm_xdl_cshuffle_common.hpp): add
    gelu_tanh_and_mul = 3 to the activation enum.
  • Kernel (gridwise_moe_gemm.hpp): add gelu_tanh_and_mul branches
    next to each existing gelu_and_mul site (4 sites: scaled/unscaled ×
    gate/up), applying FastGelu to the gate before the elementwise multiply.
    Reuses the same scale and pk_i4 handling as the silu/gelu paths.
  • CPU reference (reference_moe_gemm.hpp): add an ActivationType == 3
    branch using the host FastGelu, mirroring the existing Silu/Gelu
    branches, so the new activation can be verified against an independent
    oracle. The static_assert is updated to allow {0, 1, 3}.
  • Example/test (moe_gemm1_xdl_fp8_gelu_tanh.cpp + CMakeLists): a
    self-verifying example with ActOP = 3, registered as a smoke ctest.

Device and host both use FastGelu (the tanh approximation), so the
verification compares matching math.
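The reference-side dispatch described above can be sketched roughly as follows, assuming C++17. This is not the code in reference_moe_gemm.hpp: the helper names are hypothetical, and while the ids 0 (gelu_and_mul) and 3 (gelu_tanh_and_mul) come from this description, treating 1 as silu_and_mul is an assumption.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-ins for the host functors; not CK's implementations.
inline float ref_gelu(float x) { return 0.5f * x * (1.0f + std::erf(x * 0.70710678f)); }
inline float ref_silu(float x) { return x / (1.0f + std::exp(-x)); }
inline float ref_fast_gelu(float x)
{
    return 0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}

// Compile-time dispatch on the activation id.
// 0 and 3 are taken from the PR text; 1 == silu_and_mul is an assumption.
template <int ActivationType>
float ref_act_and_mul(float gate, float up)
{
    static_assert(ActivationType == 0 || ActivationType == 1 || ActivationType == 3,
                  "unsupported ActivationType");
    if constexpr(ActivationType == 0)
        return ref_gelu(gate) * up;
    else if constexpr(ActivationType == 1)
        return ref_silu(gate) * up;
    else
        return ref_fast_gelu(gate) * up;
}

// Runtime wrapper, handy when the activation id is only known at run time.
inline float ref_act_and_mul_dyn(int activation, float gate, float up)
{
    switch(activation)
    {
    case 0: return ref_act_and_mul<0>(gate, up);
    case 1: return ref_act_and_mul<1>(gate, up);
    case 3: return ref_act_and_mul<3>(gate, up);
    default: assert(false && "unsupported activation id"); return 0.0f;
    }
}
```

With this shape, instantiating an id outside {0, 1, 3} fails at compile time instead of silently falling through to another activation.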

Test Plan

Added example_moe_gemm1_xdl_fp8_gelu_tanh, a self-verifying example
(registered as a SMOKE_TEST ctest) that runs the gelu_tanh MoE GEMM on
device and checks the result against the CPU reference.

Built and run on gfx950 (ROCm 7.2.3):

  ctest -R example_moe_gemm1_xdl_fp8_gelu_tanh

  1/1 Test #259: example_moe_gemm1_xdl_fp8_gelu_tanh ... Passed  65.92 sec
  100% tests passed, 0 tests failed out of 1

Test Result

 ctest -R example_moe_gemm1_xdl_fp8_gelu_tanh -V
UpdateCTestConfiguration  from :/app/rocm-libraries-jonah/projects/composablekernel/build/DartConfiguration.tcl
Parse Config file:/app/rocm-libraries-jonah/projects/composablekernel/build/DartConfiguration.tcl
UpdateCTestConfiguration  from :/app/rocm-libraries-jonah/projects/composablekernel/build/DartConfiguration.tcl
Parse Config file:/app/rocm-libraries-jonah/projects/composablekernel/build/DartConfiguration.tcl
Test project /app/rocm-libraries-jonah/projects/composablekernel/build
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 259
    Start 259: example_moe_gemm1_xdl_fp8_gelu_tanh

259: Test command: /app/rocm-libraries-jonah/projects/composablekernel/build/bin/example_moe_gemm1_xdl_fp8_gelu_tanh
259: Working Directory: /app/rocm-libraries-jonah/projects/composablekernel/build/example/65_gemm_multiply_multiply
259: Test timeout computed to be: 1500
259: a0_t_k: dim 2, lengths {16384, 6144}, strides {6144, 1} 
259: b0_e_n_k: dim 3, lengths {8, 6144, 8192}, strides {50331648, 1, 6144} 
259: d1_e_n: dim 2, lengths {8, 8192}, strides {8192, 1} 
259: d2_e_n: dim 2, lengths {32768, 4096}, strides {1, 0} 
259: d0_t_n: dim 2, lengths {16384, 4096}, strides {1, 16384} 
259: d2_e_n: dim 2, lengths {32768, 4096}, strides {1, 0} 
259: e_t_n: dim 3, lengths {16384, 2, 4096}, strides {8192, 4096, 1} 
1/1 Test #259: example_moe_gemm1_xdl_fp8_gelu_tanh ...   Passed   63.15 sec

The following tests passed:
        example_moe_gemm1_xdl_fp8_gelu_tanh

100% tests passed, 0 tests failed out of 1

Label Time Summary:
SMOKE_TEST    =  63.15 sec*proc (1 test)

Total Test time (real) =  63.17 sec

Submission Checklist

  • Add a gelu_tanh_and_mul (FastGelu) enumerator to the shared Activation
    enum and implement it in the no-quant gridwise MoE kernel
    (gridwise_moe_gemm.hpp) at all four activation sites, mirroring the
    existing gelu_and_mul handling but using element_wise::FastGelu (the
    tanh approximation) instead of element_wise::Gelu (exact erf).
  • Extend ReferenceMoeGemm to support gelu_tanh_and_mul
    (ActivationType==3) via FastGelu, mirroring the existing Silu/Gelu
    paths, so the new gelu_tanh activation can be verified against the same
    host oracle as silu/gelu.
  • Add moe_gemm1 example with ActOP=3 (gelu_tanh_and_mul), cloned from the
    fp8 stage1 example. Registered via add_example_executable so it runs as
    a SMOKE_TEST ctest, verifying the gelu_tanh kernel branch against the
    CPU reference (ReferenceMoeGemm ActivationType==3 / FastGelu).
@therock-pr-bot

therock-pr-bot Bot commented Jun 27, 2026

❌ PR Check — Action Required

| Check | Status | Details |
| --- | --- | --- |
| 🌿 Branch Name | ✅ Pass | |
| 📝 PR Title/Description | ❌ Fail | Error: Title does not follow Conventional Commits style. Expected: start with a valid type (feat, fix, docs, …). Desired format: type(optional-scope): short description |
| Forbidden Files | ✅ Pass | |
| 🧪 Unit Test | ❌ Fail | Error: Source/code files changed without an accompanying unit test. Expected: add at least one test file named like test_<name>.py / test_<name>.cpp (or <name>_test.*). Current: code file(s) changed: projects/composablekernel/example/65_gemm_multiply_multiply/moe_gemm1_xdl_fp8_gelu_tanh.cpp, projects/composablekernel/include/ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_common.hpp, projects/composablekernel/include/ck/tensor_operation/gpu/grid/gridwise_moe_gemm.hpp, projects/composablekernel/include/ck/tensor_operation/gpu/grid/gridwise_moe_mx_gemm.hpp, projects/composablekernel/include/ck/tensor_operation/gpu/grid/gridwise_moe_mx_gemm_bpreshuffle.hpp (+1 more); no test file found |
| 🔎 pre-commit | ✅ Pass | |
| 🚫 Draft PR | 🔜 To Be Enabled | |
| 🚩 Feature Flag | 🔜 To Be Enabled | |
| 📊 Code Coverage | 🔜 To Be Enabled | |

⚠️ 2 policy check(s) failed. Please address the issues above before this PR can be Reviewed.

🚫 Please fix the failed policies

  • ❌ PR Title/Description
  • ❌ Unit Test

The Not ready to Review label was added to this PR. Once all policies pass, the label is removed automatically.

📖 Need help? See the Policy FAQ for details on every check and how to fix failures.

@therock-pr-bot

🚫 Please fix the failed policies before requesting reviews.

The following policy checks failed:

  • ❌ PR Title/Description
  • ❌ Unit Test

The Not ready to Review label has been added to this PR.
Once all policies pass, the label will be removed automatically.

@jonahbernard changed the title from "[CK] Add gelu_tanh_and_mul to MoE GEMM" to "feat(CK): Add gelu_tanh_and_mul to MoE GEMM" on Jun 27, 2026
@aosewski

Contributor

@jonahbernard I've reached out to you internally. In short, I'd suggest using CK-Tile's MoE GEMM instead, since it offers much more flexibility for customization.

2 participants