
[MoE] Align Swiglu MXFP4 fused quant paths#3123

Open
XiaobingSuper wants to merge 3 commits into ROCm:main from XiaobingSuper:xiaobing/siglu_moe_new

Conversation

Contributor

@XiaobingSuper XiaobingSuper commented May 11, 2026

Summary

  • Keep FlyDSL Swiglu MXFP4 fused quantization on the f32 activation path by removing the bf16 round-trip before FP4 quantization.
  • Preserve the requested Swiglu limit branch structure while keeping GPT-OSS Swiglu MXFP4 on the direct quantization path.
  • Align test_moe_2stage.py references with runtime Swiglu MXFP4 fused quant semantics by using an f32 stage1 reference for FP4 fused-quant cases.
  • Infer CSV gateMode from dtype/layout because tuned rows do not carry an explicit gateMode field.
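The first summary bullet hinges on the precision lost by a bf16 round-trip before FP4 quantization. A minimal sketch of that effect, simulating bfloat16 rounding in NumPy (this is an illustration of the numerical point, not code from the PR):

```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Round float32 values to bfloat16 precision (round-to-nearest-even) and back."""
    bits = x.astype(np.float32).view(np.uint32)
    lsb = (bits >> np.uint32(16)) & np.uint32(1)          # tie-breaking bit
    rounded = bits + np.uint32(0x7FFF) + lsb              # round-to-nearest-even
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

acts = np.float32([1.0000001, 0.3333333, 123.456789])
via_bf16 = to_bf16(acts)   # old path: activations round-tripped through bf16 first
direct = acts              # new path: stay in f32 until MXFP4 quantization
print(np.abs(via_bf16 - direct))  # non-zero: the round-trip discards mantissa bits
```

Quantizing from the f32 values directly therefore feeds the FP4 quantizer inputs that have not already lost mantissa precision to the intermediate bf16 format.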

Test plan

  • podman exec zxb_vllm_gptoss bash -lc 'cd /workdir/aiter_main && python3 -m py_compile op_tests/test_moe_2stage.py aiter/ops/flydsl/kernels/mixed_moe_gemm_2stage.py aiter/ops/flydsl/kernels/silu_and_mul_fq.py && git diff --check'
  • podman exec zxb_vllm_gptoss bash -lc 'cd /workdir/aiter_main && HIP_VISIBLE_DEVICES=1 FLYDSL_RUNTIME_CACHE_DIR=/tmp/flydsl_pr3123_test_fp4 AITER_CONFIG_FMOE=/workdir/aiter_main/aiter/configs/model_configs/gptoss_fp4_tuned_fmoe.csv python3 -m op_tests.test_moe_2stage --no-legacy'
  • podman exec zxb_vllm_gptoss bash -lc 'cd /workdir/aiter_main && HIP_VISIBLE_DEVICES=1 FLYDSL_RUNTIME_CACHE_DIR=/tmp/flydsl_pr3123_test_fp8fp4 AITER_CONFIG_FMOE=/workdir/aiter_main/aiter/configs/model_configs/gptoss_fp8fp4_tuned_fmoe.csv python3 -m op_tests.test_moe_2stage --no-legacy'
  • podman exec zxb_vllm_gptoss bash -lc 'cd /workdir/aiter_main && HIP_VISIBLE_DEVICES=1 FLYDSL_RUNTIME_CACHE_DIR=/tmp/flydsl_pr3123_test_legacy python3 -m op_tests.test_moe_2stage --no-flydsl-csv -t 1024 -dim 3072,3072 -e 128 -k 4 -q 4 -a swiglu -s f -p t -hip 0,0'

Test result

  • gptoss_fp4_tuned_fmoe.csv --no-legacy: passed 8 strict CSV cases, command exit code 0.
  • gptoss_fp8fp4_tuned_fmoe.csv --no-legacy: passed 7 strict CSV cases, command exit code 0.
  • Legacy Swiglu MXFP4 target case: passed, command exit code 0.

Made with Cursor

Remove the GPT-OSS Swiglu layout env switch in favor of GateMode, align the CSV test filter with runtime dtype selection, and restore FlyDSL Swiglu _fp4 fused quant accuracy by matching the non-fused bf16 stage1 semantics.

Co-authored-by: Cursor <cursoragent@cursor.com>
@XiaobingSuper XiaobingSuper requested review from a team and Copilot May 11, 2026 07:52
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

  • ci:triton-300x: Run an additional Triton test job on MI300X in PRs; the main branch always runs both MI35X and MI300X
  • ci:sglang: SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
  • ci:atom: ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
  • ci:atom_full: ATOM accuracy suite for PR and main models from ATOM models_accuracy.json
  • ci:vllm: vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
  • ci:all: All standard extended tests (excludes ci:atom_full)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3123 --add-label <label>

Contributor

Copilot AI left a comment


Pull request overview

This PR updates the Swiglu MXFP4 MoE codepaths to remove the legacy GPT-OSS layout environment switch, align runtime q_dtype_a selection with GateMode, and restore FlyDSL fused-quant numerical behavior to match the non-fused bf16 materialization/clamp semantics.

Changes:

  • Switch Swiglu MXFP4 q_dtype_a selection to be driven by GateMode.SEPARATED vs non-separated modes, and thread gate_mode through the 2-stage config path.
  • Update CSV-driven MoE 2-stage tests to skip cases whose q_dtype_a no longer matches the runtime Swiglu MXFP4 selection logic (now including gate_mode).
  • Adjust FlyDSL fused quant kernels to apply the Swiglu alpha/clamp path and bf16 round-trip prior to MXFP4 quantization to match the non-fused semantics.
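The "Swiglu alpha/clamp path" referenced in the last bullet can be sketched as follows. This is a hypothetical GPT-OSS-style formulation assumed for illustration; the alpha and limit defaults and the exact clamp placement are assumptions, not taken from the PR:

```python
import numpy as np

def swiglu_clamped(gate: np.ndarray, up: np.ndarray,
                   alpha: float = 1.702, limit: float = 7.0) -> np.ndarray:
    """Hypothetical GPT-OSS-style Swiglu with a limit clamp (illustrative only)."""
    gate = np.minimum(gate, limit)        # clamp the gate branch from above
    up = np.clip(up, -limit, limit)       # clamp the up branch symmetrically
    glu = gate * (1.0 / (1.0 + np.exp(-alpha * gate)))   # gated sigmoid
    return (up + 1.0) * glu
```

In the fused path described above, the output of an activation like this would be materialized to bf16 before MXFP4 quantization, so the fused result matches the non-fused reference bit-for-bit at the quantizer input.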

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

  • op_tests/test_moe_2stage.py: Updates CSV-case filtering to match runtime Swiglu MXFP4 q_dtype_a selection, now factoring in gateMode.
  • aiter/ops/flydsl/kernels/silu_and_mul_fq.py: Aligns fused activation/clamp behavior for Swiglu and adds a bf16 round-trip to match non-fused quant semantics.
  • aiter/ops/flydsl/kernels/mixed_moe_gemm_2stage.py: Adds bf16 materialization before MXFP4 quantization in the fused stage1 store path for Swiglu FP4.
  • aiter/fused_moe.py: Removes the GPT-OSS Swiglu MXFP4 layout env switch and keys runtime dtype selection/config dispatch off gate_mode.
Comments suppressed due to low confidence (1)

aiter/fused_moe.py:827

  • get_2stage_cfgs() now accepts gate_mode, but the tuned-config lookup keys (_INDEX_COLS / keys) do not incorporate it. If SEPARATED vs INTERLEAVE share the same q_dtype_a/q_dtype_w (e.g. Swiglu MXFP4 small-M where both may be bf16+fp4), this can cause the wrong tuned kernel to be selected or make it impossible to keep separate tuned entries. Consider threading gate_mode through the config index (and logging) so the selected kernel is unambiguous across gate layouts.
def get_2stage_cfgs(
    token,
    model_dim,
    inter_dim,
    expert,
    topk,
    dtype,
    q_dtype_a,
    q_dtype_w,
    q_type,
    use_g1u1,
    activation,
    doweight_stage1,
    hidden_pad,
    intermediate_pad,
    is_shuffled=True,
    gate_mode=GateMode.SEPARATED.value,
):
    gate_mode = GateMode(gate_mode)
    _INDEX_COLS = [
        "cu_num",
        "token",
        "model_dim",


Comment thread: aiter/fused_moe.py

        q_dtype_a = dtypes.bf16
    elif quant_type == QuantType.per_1x32:
-       if activation == ActivationType.Swiglu and _USE_GENERIC_SWIGLU_MXFP4_LAYOUT:
+       if activation == ActivationType.Swiglu and gate_mode == GateMode.SEPARATED: