[MOE] Special path for moe ptpc fp8, batch 1~32 with hsaco #3103

Open
sammysun0711 wants to merge 11 commits into ROCm:Qwen3.5_dev from sammysun0711:fused_moe_ptpc_fp8
Conversation


sammysun0711 (Contributor) commented May 9, 2026

Motivation

Based on #2961.
This PR integrates Qwen3.5 397B FP8 PTPC MoE in the decoding phase for batch sizes 1~32, with FP8 weight decompression performed via hsaco kernels.

Technical Details

Major optimizations (a sketch of the dispatch scheme follows the list):

  • Batch 1: a new special kernel, moe_gemm_batch1, consumes topk_ids and topk_weight directly, with no dependency on moe_sorting.
  • Batch 2~15: stage 1 moe_gemm_batch + stage 2 moe_2stage_splitk.
  • Batch 16~32: stage 1 moe_gemm_batch + stage 2 moe_2stage_down_loopn.
  • FP8: since these shapes are memory bound, a new kernel decompresses the FP8 weights on the fly instead of running additional dynamic quantization kernels over the activations.
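As a rough illustration of the dispatch scheme above (a sketch only: the helper and its return convention are hypothetical, though the kernel names and batch ranges match this PR):

```python
import os

def select_moe_kernels(tokens: int):
    """Hypothetical helper mapping token count to the (stage-1, stage-2)
    kernel names described in this PR; not the actual aiter dispatch code."""
    # The special path is opt-in via AITER_MOE_SMALL_BATCH=1.
    if os.environ.get("AITER_MOE_SMALL_BATCH", "0") != "1":
        return None  # fall back to the default fused-MoE path
    if tokens == 1:
        # Single fused kernel: consumes topk_ids/topk_weight directly,
        # no moe_sorting pass required.
        return ("moe_gemm_batch1",)
    if 2 <= tokens <= 15:
        return ("moe_gemm_batch", "moe_2stage_splitk")
    if 16 <= tokens <= 32:
        return ("moe_gemm_batch", "moe_2stage_down_loopn")
    return None  # batch > 32: default path
```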

This path is disabled by default; set AITER_MOE_SMALL_BATCH=1 to enable it (see the gating in the sketch above).
The corresponding source file for the .co (hsaco) binaries is moe.py.
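For intuition, here is a minimal PyTorch reference for the weight-decompression idea, assuming per-output-channel FP8 weight scales as in the PTPC scheme; it mirrors what the hsaco kernel does in spirit and is not the kernel itself:

```python
import torch

def gemm_via_fp8_weight_decompress(x_bf16, w_fp8, w_scale):
    # x_bf16:  [tokens, K] bf16 activations, left unquantized.
    # w_fp8:   [N, K] torch.float8_e4m3fnuz weights.
    # w_scale: [N, 1] per-output-channel dequantization scales (assumed shape).
    # In the memory-bound small-batch regime it is cheaper to dequantize the
    # weights on the fly than to launch an extra dynamic-quantization kernel
    # over the activations.
    w_bf16 = w_fp8.to(torch.bfloat16) * w_scale.to(torch.bfloat16)
    return x_bf16 @ w_bf16.t()
```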

Test Plan

AITER_MOE_SMALL_BATCH=1 python3 op_tests/test_moe_2stage.py -q 2 -a silu -e 512 -k 10 -dim 4096,128 -p t  -t 1
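A small convenience script (hypothetical, assuming the aiter repo root as the working directory) to run the same command under both settings of the flag; the arguments are copied verbatim from the command above:

```python
import os
import subprocess

CMD = ["python3", "op_tests/test_moe_2stage.py",
       "-q", "2", "-a", "silu", "-e", "512", "-k", "10",
       "-dim", "4096,128", "-p", "t", "-t", "1"]

for small_batch in ("1", "0"):
    print(f"--- AITER_MOE_SMALL_BATCH={small_batch} ---")
    subprocess.run(CMD, env=dict(os.environ, AITER_MOE_SMALL_BATCH=small_batch),
                   check=True)
```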

Test Results

Test results on MI308X (non-PTL)
AITER_MOE_SMALL_BATCH=1 for batch sizes 1/2/4/8/10/12/16/32

AITER_MOE_SMALL_BATCH=1 python3 op_tests/test_moe_2stage.py -q 2 -a silu -e 512 -k 10 -dim 4096,128 -p t  -t 1
dtype token model_dim inter_dim E topk actType qType AQDType WQDType use_g1u1 doweight_stage1 hidden_pad intermediate_pad preshuffle us err
torch.bfloat16 1 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 20.727 0.85498
torch.bfloat16 2 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 47.237 0.866455
torch.bfloat16 4 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 62.437 0.858887
torch.bfloat16 8 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 86.897 0.865326
torch.bfloat16 10 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 95.316 0.853809
torch.bfloat16 12 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 100.756 0.859517
torch.bfloat16 16 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 115.835 0.8564
torch.bfloat16 32 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 176.555 0.856827

AITER_MOE_SMALL_BATCH=0 for batch sizes 1/2/4/8/10/12/16/32

AITER_MOE_SMALL_BATCH=0 python3 op_tests/test_moe_2stage.py -q 2 -a silu -e 512 -k 10 -dim 4096,128 -p t  -t 1
dtype token model_dim inter_dim E topk actType qType AQDType WQDType use_g1u1 doweight_stage1 hidden_pad intermediate_pad preshuffle us err
torch.bfloat16 1 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 56.085 0.193604
torch.bfloat16 2 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 71.335 0.189941
torch.bfloat16 4 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 88.3952 0.179138
torch.bfloat16 8 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 110.335 0.172913
torch.bfloat16 10 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 145.506 0.552686
torch.bfloat16 12 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 149.487 0.544291
torch.bfloat16 16 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 151.616 0.533279
torch.bfloat16 32 4096 128 512 10 0 2 torch.float8_e4m3fnuz torch.float8_e4m3fnuz True False 0 0 False 215.407 0.511322

Submission Checklist

sammysun0711 marked this pull request as ready for review on May 11, 2026 at 09:18.