
[Gluon] fused_mxfp4_quant for gfx1250 #3093

Open
amd-jrosas wants to merge 28 commits into main from jrosas_fused_mxfp4_quant

Conversation

@amd-jrosas

Motivation

Create a Gluon version of fused_mxfp4_quant for gfx1250.

Technical Details

Adds a new Gluon kernel that uses TDM functions and updated logic for gfx1250. The Gluon kernel is written separately and lives in the _gluon_kernels/quant/ folder. The existing Triton API is updated to default to the Gluon kernel when gfx1250 is detected. The test file is also updated to verify the Gluon kernel.
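The dispatch described above can be sketched roughly as follows. This is only an illustration of arch-based kernel selection: `select_kernel`, `_GLUON_ARCHS`, and both stub kernels are hypothetical names, not the functions in this PR, and the architecture string is passed in explicitly rather than queried from the device.

```python
from typing import Callable

# Hypothetical stand-ins for the real kernel launchers (not the PR's code).
def _triton_fused_mxfp4_quant(*args, **kwargs) -> str:
    return "triton"

def _gluon_fused_mxfp4_quant(*args, **kwargs) -> str:
    return "gluon"

# Assumption: only gfx1250 is routed to the Gluon path.
_GLUON_ARCHS = frozenset({"gfx1250"})

def select_kernel(arch: str) -> Callable[..., str]:
    """Return the Gluon launcher on gfx1250, otherwise the Triton launcher."""
    if arch in _GLUON_ARCHS:
        return _gluon_fused_mxfp4_quant
    return _triton_fused_mxfp4_quant
```

Keeping the selection in one small function leaves the public API signature unchanged for callers on other architectures.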

Test Plan

Verified with the existing test_fused_mxfp4_quant.py; the tests pass across various shapes and feature settings.

Test Result

All tests passed with the features below set to both True and False. Details follow.

Main features

| Parameter | Values Tested |
| --- | --- |
| scale_shuffle_padding | True, False |
| shuffle | True, False |
| dtype | float16, bfloat16 |
| res1 | True, False |
| inp2 | True, False |
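For a sense of scale, the feature table above can be expanded into a test grid. The sketch below assumes a full Cartesian product of the listed values; the PR does not state whether every combination was actually run, and `feature_grid` is a hypothetical helper, not part of test_fused_mxfp4_quant.py.

```python
from itertools import product

# Values copied from the feature table; dtypes kept as strings for illustration.
FEATURES = {
    "scale_shuffle_padding": (True, False),
    "shuffle": (True, False),
    "dtype": ("float16", "bfloat16"),
    "res1": (True, False),
    "inp2": (True, False),
}

def feature_grid(features=FEATURES):
    """Expand the per-parameter value lists into one dict per combination."""
    names = list(features)
    return [dict(zip(names, combo)) for combo in product(*features.values())]

grid = feature_grid()
```

Under that assumption the grid has 2^5 = 32 feature combinations per shape.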

Shape Details

| N1 | M Values | Results |
| --- | --- | --- |
| 256 | 1, 4, 33, 64, 132, 256 | All Pass |
| 200 | 1, 4, 33, 64, 132*, 256 | All Pass |

*For M = 132, a single FP4 element mismatched by 0.25. This occurs with shuffle=True and dtype=float16 and appears to be a rounding-boundary artifact.
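A small sketch of how a rounding boundary can flip one element. It uses the E2M1 (FP4) magnitude set 0, 0.5, 1, 1.5, 2, 3, 4, 6 and a plain nearest-value rule; `round_to_e2m1` is a hypothetical helper, its tie handling is an assumption, and the source does not say which input or scale produced the 0.25 mismatch.

```python
# Representable magnitudes of the FP4 E2M1 element format.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def round_to_e2m1(x: float) -> float:
    """Round x to the nearest E2M1 value, saturating at +/-6.

    Assumption: exact ties resolve to the smaller magnitude, which may
    differ from the tie rule used by the kernels in this PR.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_VALUES[-1])
    nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
    return sign * nearest

# Two inputs on either side of the 1.25 midpoint land on adjacent codes.
below = round_to_e2m1(1.249)
above = round_to_e2m1(1.251)

# With an assumed block scale of 0.5, adjacent codes differ by 0.25 after dequantization.
dequant_gap = (above - below) * 0.5
```

If the kernel and the reference compute the pre-rounding value with slightly different intermediate precision, an input near such a midpoint can fall on opposite sides, which would be consistent with a single-element difference.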

Submission Checklist

vgokhale and others added 28 commits March 24, 2026 02:32
…1250, replaced aiter kernels with torch in reference functions
* add test gluon kernel

* add test gluon kernel

* add gluon kernel

* gluon ut pass

* comment and formatting

* comment

* move gluon kernel to new file and create tdm pipeline version

* update

* update

* update

* non-tdm async copy UT passed, tdm asycn copy UT not passing

* update

* tmp

* update

* update

* gfx950 compatible

* update

* always max size_per_thread x threads_per_warp along the fastest dim

* redesign tdm desc and offsets

* update

* TDM random gather UA3D UT passed
Add 2d unified attention kernel in gluon, with async. and TDM support, supports gfx950 and gfx1250
* update

* update UT

* fix async_copy bug, add kv_cache_shuffle torch

* update

* add key_cache shuffling, fix gluon async_copy bug

* tmp

* update

* tmp

* translate kernel to new style

* update baseline

* add V_BLOCKED_LAYOUT

* instr_shape update

* add load_shared_relaxed to async kernel

* update

* add make dll

* update

* gluon key shuffle

* update

* add value cache shuffle support, but only for block_size > 64

* change name

* change preshuffle load write logic

* shuffle for async kernel

* update

* update

* update

* update

* shuffle for tdm tmp

* gfx1250 async shuffle fix

* updates

* tdm gather shuffle ut pass

* skip ut if LDS requirement exceeds 320 kB

* update

* update test scripts

* clean up

* revert chip_info hack

* update

* update

* clean up
Port changes from #2282. Adds default Triton kernel tuning
configs for gfx1250 covering all GEMM variants and MHA (fwd + bwd).
…#2316)

* Add gfx1250 arch enablement: fp8 support + test refactoring

- Add gfx1250 to CDNA_ARCHS and FP8_ARCHS in flash attention utils
- Refactor fp8 dtype selection in tests to use aiter.utility.dtypes.fp8
  instead of hardcoded per-arch checks, enabling gfx1250 support
- Note: gemm_config_utils left unchanged to preserve failure on missing
  gfx1250 configs (no fallback to gfx950)

Port of #2284 with intentional deviation on gemm_config_utils.

* Address comments

* Fix pytest mark skipif

* Address comments
@amd-jrosas amd-jrosas requested a review from a team May 8, 2026 21:29
@github-actions
Contributor

github-actions Bot commented May 8, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or gh pr edit 3093 --add-label <label>



8 participants