[MOE] Special path for moe ptpc fp8, batch 1~32 with hsaco #3103

Open
sammysun0711 wants to merge 11 commits into ROCm:Qwen3.5_dev from sammysun0711:fused_moe_ptpc_fp8
Conversation


sammysun0711 (Contributor) commented May 9, 2026

Motivation

Based on #2961.
This PR integrates Qwen3.5 397B FP8 PTPC MoE in the decoding phase for batch sizes 1~32, with FP8 weight decompression performed via hsaco kernels.

Technical Details

Major optimizations (a sketch of the dispatch scheme follows the list):

  • Batch 1: a new special kernel, moe_gemm_batch1, consumes topk_ids and topk_weight directly, with no dependency on moe_sorting.
  • Batch 2~15: stage 1 moe_gemm_batch + stage 2 moe_2stage_splitk.
  • Batch 16~32: stage 1 moe_gemm_batch + stage 2 moe_2stage_down_loopn.
  • FP8: since these shapes are memory bound, a new kernel decompresses the FP8 weights on the fly instead of running additional dynamic quantization kernels over the activations.
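As a rough illustration of the dispatch scheme above (a sketch only: the helper and its return convention are hypothetical, though the kernel names and batch ranges match this PR):

```python
import os

def select_moe_kernels(tokens: int):
    """Hypothetical helper mapping token count to the (stage-1, stage-2)
    kernel names described in this PR; not the actual aiter dispatch code."""
    # The special path is opt-in via AITER_MOE_SMALL_BATCH=1.
    if os.environ.get("AITER_MOE_SMALL_BATCH", "0") != "1":
        return None  # fall back to the default fused-MoE path
    if tokens == 1:
        # Single fused kernel: consumes topk_ids/topk_weight directly,
        # no moe_sorting pass required.
        return ("moe_gemm_batch1",)
    if 2 <= tokens <= 15:
        return ("moe_gemm_batch", "moe_2stage_splitk")
    if 16 <= tokens <= 32:
        return ("moe_gemm_batch", "moe_2stage_down_loopn")
    return None  # batch > 32: default path
```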

This path is disabled by default; set AITER_MOE_SMALL_BATCH=1 to enable it (see the gating in the sketch above).
The corresponding source file for the .co (hsaco) binaries is moe.py.
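For intuition, here is a minimal PyTorch reference for the weight-decompression idea, assuming per-output-channel FP8 weight scales as in the PTPC scheme; it mirrors what the hsaco kernel does in spirit and is not the kernel itself:

```python
import torch

def gemm_via_fp8_weight_decompress(x_bf16, w_fp8, w_scale):
    # x_bf16:  [tokens, K] bf16 activations, left unquantized.
    # w_fp8:   [N, K] torch.float8_e4m3fnuz weights.
    # w_scale: [N, 1] per-output-channel dequantization scales (assumed shape).
    # In the memory-bound small-batch regime it is cheaper to dequantize the
    # weights on the fly than to launch an extra dynamic-quantization kernel
    # over the activations.
    w_bf16 = w_fp8.to(torch.bfloat16) * w_scale.to(torch.bfloat16)
    return x_bf16 @ w_bf16.t()
```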

Test Plan

AITER_MOE_SMALL_BATCH=1 python3 op_tests/test_moe_2stage.py -q 2 -a silu -e 512 -k 10 -dim 4096,128 -p t  -t 1
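A small convenience script (hypothetical, assuming the aiter repo root as the working directory) to run the same command under both settings of the flag; the arguments are copied verbatim from the command above:

```python
import os
import subprocess

CMD = ["python3", "op_tests/test_moe_2stage.py",
       "-q", "2", "-a", "silu", "-e", "512", "-k", "10",
       "-dim", "4096,128", "-p", "t", "-t", "1"]

for small_batch in ("1", "0"):
    print(f"--- AITER_MOE_SMALL_BATCH={small_batch} ---")
    subprocess.run(CMD, env=dict(os.environ, AITER_MOE_SMALL_BATCH=small_batch),
                   check=True)
```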

Test Results

Test results on MI308X (non-PTL)
AITER_MOE_SMALL_BATCH=1 for batch sizes 1/2/4/8/10/12/16/32

AITER_MOE_SMALL_BATCH=1 python3 op_tests/test_moe_2stage.py -q 2 -a silu -e 512 -k 10 -dim 4096,128 -p t  -t 1
dtype token model_dim inter_dim E topk actType qType AQDType WQDType use_g1u1 doweight_stage1 hidden_pad intermediate_pad preshuffle us err
torch.bfloat16 1 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 20.727 0.85498
torch.bfloat16 2 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 47.237 0.866455
torch.bfloat16 4 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 62.437 0.858887
torch.bfloat16 8 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 86.897 0.865326
torch.bfloat16 10 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 95.316 0.853809
torch.bfloat16 12 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 100.756 0.859517
torch.bfloat16 16 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 115.835 0.8564
torch.bfloat16 32 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 176.555 0.856827

AITER_MOE_SMALL_BATCH=0 for batch sizes 1/2/4/8/10/12/16/32

AITER_MOE_SMALL_BATCH=0 python3 op_tests/test_moe_2stage.py -q 2 -a silu -e 512 -k 10 -dim 4096,128 -p t  -t 1
dtype token model_dim inter_dim E topk actType qType AQDType WQDType use_g1u1 doweight_stage1 hidden_pad intermediate_pad preshuffle us err
torch.bfloat16 1 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 56.085 0.193604
torch.bfloat16 2 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 71.335 0.189941
torch.bfloat16 4 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 88.3952 0.179138
torch.bfloat16 8 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 110.335 0.172913
torch.bfloat16 10 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 145.506 0.552686
torch.bfloat16 12 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 149.487 0.544291
torch.bfloat16 16 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 151.616 0.533279
torch.bfloat16 32 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 215.407 0.511322

Submission Checklist

sammysun0711 marked this pull request as ready for review on May 11, 2026 at 09:18.