feat(CK): Add gelu_tanh_and_mul to MoE GEMM#8886
Open
jonahbernard wants to merge 4 commits into
Add a gelu_tanh_and_mul (FastGelu) enumerator to the shared Activation enum and implement it in the no-quant gridwise MoE kernel (gridwise_moe_gemm.hpp) at all four activation sites, mirroring the existing gelu_and_mul handling but using element_wise::FastGelu (the tanh approximation) instead of element_wise::Gelu (exact erf).
Extend ReferenceMoeGemm to support gelu_tanh_and_mul (ActivationType==3) via FastGelu, mirroring the existing Silu/Gelu paths, so the new gelu_tanh activation can be verified against the same host oracle as silu/gelu.
Add moe_gemm1 example with ActOP=3 (gelu_tanh_and_mul), cloned from the fp8 stage1 example. Registered via add_example_executable so it runs as a SMOKE_TEST ctest, verifying the gelu_tanh kernel branch against the CPU reference (ReferenceMoeGemm ActivationType==3 / FastGelu).
❌ PR Check — Action Required
📖 Need help? See the Policy FAQ for details on every check and how to fix failures.
🚫 Please fix the failed policies before requesting reviews. The following policy checks failed:
The |
1 task
Contributor
@jonahbernard I've reached out to you internally. Basically I'd propose to use CK-Tile's moe gemm since it provides much more flexibility for customizations.
Motivation
The fused MoE GEMM only supported gelu_and_mul (exact, erf-based) and silu_and_mul activations. Models that use the tanh approximation of GELU in their MoE experts (e.g. Gemma-family architectures) had no matching activation in the CK fused MoE path. This PR adds gelu_tanh_and_mul so those models can run their MoE layers on the CK gridwise MoE GEMM.
Technical Details
Adds a new activation gelu_tanh_and_mul = 3 to the MoE GEMM, computed via the existing FastGelu functor (tanh approximation), alongside the existing exact Gelu (gelu_and_mul = 0).

- gridwise_gemm_xdl_cshuffle_common.hpp: add gelu_tanh_and_mul = 3 to the activation enum.
- gridwise_moe_gemm.hpp: add gelu_tanh_and_mul branches next to each existing gelu_and_mul site (4 sites: scaled/unscaled × gate/up), applying FastGelu to the gate before the elementwise multiply. Reuses the same scale and pk_i4 handling as the silu/gelu paths.
- reference_moe_gemm.hpp: add an ActivationType == 3 branch using the host FastGelu, mirroring the existing Silu/Gelu branches, so the new activation can be verified against an independent oracle. The static_assert is updated to allow {0, 1, 3}.
- moe_gemm1_xdl_fp8_gelu_tanh.cpp + CMakeLists: a self-verifying example with ActOP = 3, registered as a smoke ctest. Device and host both use FastGelu (the tanh approximation), so the verification compares matching math.
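For readers unfamiliar with the two GELU variants, here is a minimal standalone sketch of the math behind gelu_and_mul and gelu_tanh_and_mul. It is not the CK code: the function names gelu_erf, gelu_tanh and the array-based gelu_tanh_and_mul helper are hypothetical, the formulas are the standard textbook definitions, and CK's FastGelu functor may evaluate the tanh approximation in an algebraically rearranged form.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2))).
// Illustrative stand-in for what gelu_and_mul applies to the gate.
inline float gelu_erf(float x)
{
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f)); // 1/sqrt(2)
}

// Tanh approximation of GELU:
// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
// Illustrative stand-in for what gelu_tanh_and_mul applies to the gate.
inline float gelu_tanh(float x)
{
    const float k = 0.7978845608f; // sqrt(2/pi)
    const float u = k * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.0f + std::tanh(u));
}

// "and_mul" gating over n elements: activate the gate projection,
// then multiply elementwise by the up projection.
// Hypothetical helper; the real kernel fuses this into the GEMM epilogue.
inline void gelu_tanh_and_mul(const float* gate, const float* up, float* out, std::size_t n)
{
    assert(gate != nullptr && up != nullptr && out != nullptr);
    for(std::size_t i = 0; i < n; ++i)
    {
        out[i] = gelu_tanh(gate[i]) * up[i];
    }
}
```

The two variants agree to roughly three decimal places over typical activation ranges, which is close enough for inference but not close enough to pass a tight elementwise check, so a model trained with one should be run with the matching one.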
Test Plan
Added example_moe_gemm1_xdl_fp8_gelu_tanh, a self-verifying example (registered as a SMOKE_TEST ctest) that runs the gelu_tanh MoE GEMM on device and checks the result against the CPU reference.
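The sketch below illustrates why the reference has to use the same tanh approximation as the device path. It is a host-only toy, not the example's actual verification code: all_close, the tolerances, and the function names are hypothetical, and the "device-like" result is simulated with an algebraically equivalent sigmoid form of the same approximation.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Tanh approximation of GELU, written the textbook way (host-style reference).
inline float gelu_tanh_host(float x)
{
    const float u = 0.7978845608f * (x + 0.044715f * x * x * x);
    return 0.5f * x * (1.0f + std::tanh(u));
}

// Same approximation rewritten via 0.5 * (1 + tanh(u)) == 1 / (1 + exp(-2u)).
// Stands in for a device-side formulation; an assumption for illustration.
inline float gelu_tanh_sigmoid_form(float x)
{
    const float u = 0.7978845608f * (x + 0.044715f * x * x * x);
    return x / (1.0f + std::exp(-2.0f * u));
}

// Exact erf-based GELU, i.e. the "wrong" oracle for a gelu_tanh kernel.
inline float gelu_erf_host(float x)
{
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));
}

// Elementwise check: |a - b| <= atol + rtol * |b| for every element.
// Hypothetical helper with made-up tolerances, not CK's checker.
inline bool all_close(const std::vector<float>& a, const std::vector<float>& b, float rtol, float atol)
{
    if(a.size() != b.size())
    {
        return false;
    }
    for(std::size_t i = 0; i < a.size(); ++i)
    {
        if(std::fabs(a[i] - b[i]) > atol + rtol * std::fabs(b[i]))
        {
            return false;
        }
    }
    return true;
}
```

With a tight tolerance, the two forms of the tanh approximation match each other, while the erf-based GELU does not match either, which is the situation the ActivationType == 3 reference branch avoids.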
Built and run on gfx950 (ROCm 7.2.3):
Test Result
Submission Checklist