[Gluon] fused_mxfp4_quant for gfx1250 by amd-jrosas · Pull Request #3093 · ROCm/aiter

amd-jrosas · 2026-05-08T21:29:03Z

Motivation

Create a Gluon version of fused_mxfp4_quant for gfx1250.

Technical Details

Adds a new Gluon kernel that uses TDM functions and updated logic for gfx1250. The Gluon kernel is written separately and lives in the _gluon_kernels/quant/ folder. The existing Triton API was updated to default to the Gluon kernel when gfx1250 is detected. The test file was also updated to verify the Gluon kernel.
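
A minimal sketch of the dispatch described above, assuming the architecture is available as a string such as `gfx1250`. The function name `select_quant_backend` and the backend labels are hypothetical and are not taken from the aiter source; the actual API may detect the architecture and select the kernel differently.

```python
def select_quant_backend(arch: str) -> str:
    """Pick a kernel implementation label for a GPU architecture string.

    Hypothetical helper for illustration only; not the aiter API.
    """
    # Some runtimes report feature suffixes (e.g. "gfx942:sramecc+:xnack-"),
    # so compare only the base architecture name.
    base_arch = arch.split(":")[0].strip().lower()

    # Assumption: only gfx1250 defaults to the Gluon kernel; every other
    # architecture keeps the existing Triton kernel.
    if base_arch == "gfx1250":
        return "gluon"
    return "triton"


if __name__ == "__main__":
    for name in ("gfx1250", "gfx950", "gfx942:sramecc+:xnack-"):
        print(name, "->", select_quant_backend(name))
```

Keeping the selection in one small function means callers of the existing API would not need to change; only the default backend differs per architecture.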

Test Plan

Verified with the existing test_fused_mxfp4_quant.py; the tests pass across various shapes and feature sets.

Test Result

All tests were successful with various features set to true/false. More details below.

Main features

Parameter	Values Tested
scale_shuffle_padding	True, False
shuffle	True, False
dtype	float16, bfloat16
res1	True, False
inp2	True, False
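
As a rough sketch of how the feature table above could be enumerated, the snippet below builds every combination of the listed values. It assumes a full cross product, which the PR does not state; the dictionary name `FEATURES` and the helper `feature_combinations` are illustrative and not taken from test_fused_mxfp4_quant.py.

```python
from itertools import product

# Values copied from the "Main features" table; dtypes are kept as strings
# so the sketch runs without torch installed.
FEATURES = {
    "scale_shuffle_padding": (True, False),
    "shuffle": (True, False),
    "dtype": ("float16", "bfloat16"),
    "res1": (True, False),
    "inp2": (True, False),
}


def feature_combinations():
    """Return one dict per combination of the feature values."""
    names = list(FEATURES)
    return [dict(zip(names, values)) for values in product(*FEATURES.values())]


if __name__ == "__main__":
    combos = feature_combinations()
    print(len(combos))  # 2**5 = 32 combinations per shape
    print(combos[0])
```

In a pytest file, the same lists would typically be passed to stacked `pytest.mark.parametrize` decorators, which produce the same cross product.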

Shape Details

N1	M Values	Results
256	1, 4, 33, 64, 132, 256	All Pass
200	1, 4, 33, 64, 132*, 256	All Pass

*For M = 132, a single FP4 element mismatched by 0.25. This occurs with shuffle=True and dtype=float16 and appears to be a rounding-boundary artifact.
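
One way a 0.25 difference can arise, sketched under assumptions: FP4 (E2M1) magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4 and 6, and an MX block shares a power-of-two scale. If the scaled input lies near the midpoint of two neighboring values, a tiny difference in intermediate precision can flip the rounding. With a block scale of 0.5, one step between 1.0 and 1.5 dequantizes to a difference of 0.25. The helper names, the tie handling, and the chosen numbers below are illustrative; they are not the kernel's logic or the failing test's data.

```python
# Representable E2M1 magnitudes (sign is handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_e2m1(x: float) -> float:
    """Round x to the nearest E2M1 value, saturating at +/-6.

    Exact ties resolve to the smaller magnitude in this sketch; a real
    kernel may use a different tie rule.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_MAGNITUDES[-1])
    nearest = min(E2M1_MAGNITUDES, key=lambda v: abs(v - mag))
    return sign * nearest


def quant_dequant(x: float, scale: float) -> float:
    """Quantize x with a shared block scale, then dequantize it again."""
    return quantize_e2m1(x / scale) * scale


if __name__ == "__main__":
    scale = 0.5  # assumed power-of-two block scale
    below = quant_dequant(0.62, scale)  # 1.24 is just under the 1.25 midpoint
    above = quant_dequant(0.63, scale)  # 1.26 is just over the 1.25 midpoint
    print(below, above, above - below)
```

Two inputs that differ by 0.01 land on different FP4 values here, so a reference and a kernel that differ slightly in float16 intermediates could plausibly disagree on one element by exactly one scaled step.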

Submission Checklist

[✓] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

* add test gluon kernel
* add test gluon kernel
* add gluon kernel
* gluon ut pass
* comment and formatting
* comment
* move gluon kernel to new file and create tdm pipeline version
* update
* update
* update
* non-tdm async copy UT passed, tdm async copy UT not passing
* update
* tmp
* update
* update
* gfx950 compatible
* update
* always max size_per_thread x threads_per_warp along the fastest dim
* redesign tdm desc and offsets
* update
* TDM random gather UA3D UT passed

Add 2d unified attention kernel in gluon, with async. and TDM support, supports gfx950 and gfx1250

* update
* update UT
* fix async_copy bug, add kv_cache_shuffle torch
* update
* add key_cache shuffling, fix gluon async_copy bug
* tmp
* update
* tmp
* translate kernel to new style
* update baseline
* add V_BLOCKED_LAYOUT
* instr_shape update
* add load_shared_relaxed to async kernel
* update
* add make dll
* update
* gluon key shuffle
* update
* add value cache shuffle support, but only for block_size > 64
* change name
* change preshuffle load write logic
* shuffle for async kernel
* update
* update
* update
* update
* shuffle for tdm tmp
* gfx1250 async shuffle fix
* updates
* tdm gather shuffle ut pass
* skip ut if LDS requirement exceeds 320 kB
* update
* update test scripts
* clean up
* revert chip_info hack
* update
* update
* clean up

Port changes from #2282. Adds default Triton kernel tuning configs for gfx1250 covering all GEMM variants and MHA (fwd + bwd).

…#2316)
* Add gfx1250 arch enablement: fp8 support + test refactoring
  - Add gfx1250 to CDNA_ARCHS and FP8_ARCHS in flash attention utils
  - Refactor fp8 dtype selection in tests to use aiter.utility.dtypes.fp8 instead of hardcoded per-arch checks, enabling gfx1250 support
  - Note: gemm_config_utils left unchanged to preserve failure on missing gfx1250 configs (no fallback to gfx950)
  Port of #2284 with intentional deviation on gemm_config_utils.
* Address comments
* Fix pytest mark skipif
* Address comments

github-actions · 2026-05-08T21:29:19Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests
`ci:atom`	ATOM benchmark (DeepSeek-R1 + GPT-OSS)
`ci:vllm`	vLLM benchmark
`ci:all`	All of the above

Add labels via the sidebar or `gh pr edit 3093 --add-label <label>`

vgokhale and others added 28 commits March 24, 2026 02:32

Add initial config for A16W16

9d393ba

Fix for pytorch manual_seed

5f85097

test_moe_gemm_a8w4, test_fused_qkv_split_qk_rope changes to work with…

5c5eccb

… gfx1250

fixes needed for test_fused_fp8_quant: added fp8 defaultdtype for gfx…

681d10b

…1250, replaced aiter kernels with torch in reference functions

shapes tested and passing for fused_fp8_quant

c753970

Remove manual seeds. Add gfx12 to device list

e0230cf

removing the hip workaround added earlier in gemm_a8w8 kernel

9467ee3

rmsnorm and fused_mul_add

d273c20

moe_gemm_a8w8

e9e4aa9

added gfx1250 to get_fp8_dtypes() function

b250948

gluon gemm a8w8 in progress, slice layout issue

c386748

Revert "gluon gemm a8w8 in progress, slice layout issue"

d808afe

This reverts commit 149446f.

[TRITON] [GLUON] Adding 2d unified attention Gluon kernel (#2112)

6d56150

Add 2d unified attention kernel in gluon, with async. and TDM support, supports gfx950 and gfx1250

[TRITON][GLUON] add TDM Gather to 2D Attention (#2155)

e46d13d

Add gfx1250 support: GFX_MAP + default GEMM/MHA configs (#2315)

6fa3e3f

Port changes from #2282. Adds default Triton kernel tuning configs for gfx1250 covering all GEMM variants and MHA (fwd + bwd).

Add gfx1250 to GFX_MAP in chip_info.py

4e8856a

Merge branch 'main' into shared/triton-gfx12

09837f8

Merge branch 'main' into shared/triton-gfx12

1b94b36

Merge branch 'main' into shared/triton-gfx12

fe37ad7

Merge branch 'main' into shared/triton-gfx12

73c3f69

Merge branch 'main' into shared/triton-gfx12

bfcb3e7

Merge branch 'main' into shared/triton-gfx12

1c89036

Merge branch 'main' into shared/triton-gfx12

9347aad

fuse_mxfp4_quant gluon kernel for gfx1250

3cab881

Format changes and removed unused variables

cb11cb6

amd-jrosas assigned vgokhale and azaidy May 8, 2026

amd-jrosas requested a review from a team May 8, 2026 21:29

amd-jrosas wants to merge 28 commits into main from jrosas_fused_mxfp4_quant
8 participants