
[MoE] Align Swiglu MXFP4 fused quant paths#3123

Open
XiaobingSuper wants to merge 3 commits into ROCm:main from XiaobingSuper:xiaobing/siglu_moe_new

Conversation

Contributor

@XiaobingSuper XiaobingSuper commented May 11, 2026

Summary

  • Keep FlyDSL Swiglu MXFP4 fused quantization on the f32 activation path by removing the bf16 round-trip before FP4 quantization.
  • Preserve the requested Swiglu limit branch structure while keeping GPT-OSS Swiglu MXFP4 on the direct quantization path.
  • Align test_moe_2stage.py references with runtime Swiglu MXFP4 fused quant semantics by using an f32 stage1 reference for FP4 fused-quant cases.
  • Infer CSV gateMode from dtype/layout because tuned rows do not carry an explicit gateMode field.
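The first summary bullet hinges on the precision lost by a bf16 round-trip before FP4 quantization. A minimal sketch of that effect, simulating bfloat16 rounding in NumPy (this is an illustration of the numerical point, not code from the PR):

```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Round float32 values to bfloat16 precision (round-to-nearest-even) and back."""
    bits = x.astype(np.float32).view(np.uint32)
    lsb = (bits >> np.uint32(16)) & np.uint32(1)          # tie-breaking bit
    rounded = bits + np.uint32(0x7FFF) + lsb              # round-to-nearest-even
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

acts = np.float32([1.0000001, 0.3333333, 123.456789])
via_bf16 = to_bf16(acts)   # old path: activations round-tripped through bf16 first
direct = acts              # new path: stay in f32 until MXFP4 quantization
print(np.abs(via_bf16 - direct))  # non-zero: the round-trip discards mantissa bits
```

Quantizing from the f32 values directly therefore feeds the FP4 quantizer inputs that have not already lost mantissa precision to the intermediate bf16 format.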

Test plan

  • podman exec zxb_vllm_gptoss bash -lc 'cd /workdir/aiter_main && python3 -m py_compile op_tests/test_moe_2stage.py aiter/ops/flydsl/kernels/mixed_moe_gemm_2stage.py aiter/ops/flydsl/kernels/silu_and_mul_fq.py && git diff --check'
  • podman exec zxb_vllm_gptoss bash -lc 'cd /workdir/aiter_main && HIP_VISIBLE_DEVICES=1 FLYDSL_RUNTIME_CACHE_DIR=/tmp/flydsl_pr3123_test_fp4 AITER_CONFIG_FMOE=/workdir/aiter_main/aiter/configs/model_configs/gptoss_fp4_tuned_fmoe.csv python3 -m op_tests.test_moe_2stage --no-legacy'
  • podman exec zxb_vllm_gptoss bash -lc 'cd /workdir/aiter_main && HIP_VISIBLE_DEVICES=1 FLYDSL_RUNTIME_CACHE_DIR=/tmp/flydsl_pr3123_test_fp8fp4 AITER_CONFIG_FMOE=/workdir/aiter_main/aiter/configs/model_configs/gptoss_fp8fp4_tuned_fmoe.csv python3 -m op_tests.test_moe_2stage --no-legacy'
  • podman exec zxb_vllm_gptoss bash -lc 'cd /workdir/aiter_main && HIP_VISIBLE_DEVICES=1 FLYDSL_RUNTIME_CACHE_DIR=/tmp/flydsl_pr3123_test_legacy python3 -m op_tests.test_moe_2stage --no-flydsl-csv -t 1024 -dim 3072,3072 -e 128 -k 4 -q 4 -a swiglu -s f -p t -hip 0,0'

Test result

  • gptoss_fp4_tuned_fmoe.csv --no-legacy: passed 8 strict CSV cases, command exit code 0.
  • gptoss_fp8fp4_tuned_fmoe.csv --no-legacy: passed 7 strict CSV cases, command exit code 0.
  • Legacy Swiglu MXFP4 target case: passed, command exit code 0.

Made with Cursor

Remove the GPT-OSS Swiglu layout env switch in favor of GateMode, align the CSV test filter with runtime dtype selection, and restore FlyDSL Swiglu _fp4 fused quant accuracy by matching the non-fused bf16 stage1 semantics.

Co-authored-by: Cursor <cursoragent@cursor.com>
@XiaobingSuper XiaobingSuper requested review from a team and Copilot May 11, 2026 07:52
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

  • ci:triton-300x: Run an additional Triton test job on MI300X in PRs; the main branch always runs both MI35X and MI300X
  • ci:sglang: SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
  • ci:atom: ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
  • ci:atom_full: ATOM accuracy suite for PR and main models from ATOM models_accuracy.json
  • ci:vllm: vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
  • ci:all: All standard extended tests (excludes ci:atom_full)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3123 --add-label <label>

Contributor

Copilot AI left a comment


Pull request overview

This PR updates the Swiglu MXFP4 MoE codepaths to remove the legacy GPT-OSS layout environment switch, align runtime q_dtype_a selection with GateMode, and restore FlyDSL fused-quant numerical behavior to match the non-fused bf16 materialization/clamp semantics.

Changes:

  • Switch Swiglu MXFP4 q_dtype_a selection to be driven by GateMode.SEPARATED vs non-separated modes, and thread gate_mode through the 2-stage config path.
  • Update CSV-driven MoE 2-stage tests to skip cases whose q_dtype_a no longer matches the runtime Swiglu MXFP4 selection logic (now including gate_mode).
  • Adjust FlyDSL fused quant kernels to apply the Swiglu alpha/clamp path and bf16 round-trip prior to MXFP4 quantization to match the non-fused semantics.
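The "Swiglu alpha/clamp path" referenced in the last bullet can be sketched as follows. This is a hypothetical GPT-OSS-style formulation assumed for illustration; the alpha and limit defaults and the exact clamp placement are assumptions, not taken from the PR:

```python
import numpy as np

def swiglu_clamped(gate: np.ndarray, up: np.ndarray,
                   alpha: float = 1.702, limit: float = 7.0) -> np.ndarray:
    """Hypothetical GPT-OSS-style Swiglu with a limit clamp (illustrative only)."""
    gate = np.minimum(gate, limit)        # clamp the gate branch from above
    up = np.clip(up, -limit, limit)       # clamp the up branch symmetrically
    glu = gate * (1.0 / (1.0 + np.exp(-alpha * gate)))   # gated sigmoid
    return (up + 1.0) * glu
```

In the fused path described above, the output of an activation like this would be materialized to bf16 before MXFP4 quantization, so the fused result matches the non-fused reference bit-for-bit at the quantizer input.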

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

  • op_tests/test_moe_2stage.py: Updates CSV-case filtering to match runtime Swiglu MXFP4 q_dtype_a selection, now factoring in gateMode.
  • aiter/ops/flydsl/kernels/silu_and_mul_fq.py: Aligns fused activation/clamp behavior for Swiglu and adds a bf16 round-trip to match non-fused quant semantics.
  • aiter/ops/flydsl/kernels/mixed_moe_gemm_2stage.py: Adds bf16 materialization before MXFP4 quantization in the fused stage1 store path for Swiglu FP4.
  • aiter/fused_moe.py: Removes the GPT-OSS Swiglu MXFP4 layout env switch and keys runtime dtype selection/config dispatch off gate_mode.
Comments suppressed due to low confidence (1)

aiter/fused_moe.py:827

  • get_2stage_cfgs() now accepts gate_mode, but the tuned-config lookup keys (_INDEX_COLS / keys) do not incorporate it. If SEPARATED vs INTERLEAVE share the same q_dtype_a/q_dtype_w (e.g. Swiglu MXFP4 small-M where both may be bf16+fp4), this can cause the wrong tuned kernel to be selected or make it impossible to keep separate tuned entries. Consider threading gate_mode through the config index (and logging) so the selected kernel is unambiguous across gate layouts.
def get_2stage_cfgs(
    token,
    model_dim,
    inter_dim,
    expert,
    topk,
    dtype,
    q_dtype_a,
    q_dtype_w,
    q_type,
    use_g1u1,
    activation,
    doweight_stage1,
    hidden_pad,
    intermediate_pad,
    is_shuffled=True,
    gate_mode=GateMode.SEPARATED.value,
):
    gate_mode = GateMode(gate_mode)
    _INDEX_COLS = [
        "cu_num",
        "token",
        "model_dim",


Comment thread: aiter/fused_moe.py

        q_dtype_a = dtypes.bf16
    elif quant_type == QuantType.per_1x32:
-       if activation == ActivationType.Swiglu and _USE_GENERIC_SWIGLU_MXFP4_LAYOUT:
+       if activation == ActivationType.Swiglu and gate_mode == GateMode.SEPARATED: