[Kernel][gfx1250] Add FlyDSL MXScale FP8/A8W4 GEMM by aoli26 · Pull Request #3106 · ROCm/aiter

aoli26 · 2026-05-09T12:55:59Z

Motivation

Add MXFP8 and A8W4 dense GEMM support for gfx1250 to AITER.

Technical Details

Vendored FlyDSL compile_mxscale_gemm kernel into aiter/ops/flydsl/kernels/.
New public API:
- aiter.gemm_mxfp8 — MXFP8 (E4M3 + E8M0 1×32)
- aiter.gemm_mxa8w4 — A8W4 (FP8 act, FP4 weight, E8M0 1×32)
- aiter.flydsl_mxscale_gemm — low-level entry with all codegen knobs.
Wired into AOT (aiter/aot/flydsl/gemm.py) for CSV-driven precompilation.
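
For readers unfamiliar with the "E8M0 1×32" notation above, here is a minimal NumPy reference sketch of what a block-scaled GEMM computes. It is not the vendored kernel and does not call the new aiter entry points. The function names are hypothetical, the element values are plain floats standing in for decoded FP8/FP4 values, and the (N, K) storage of B is an assumption made for the example. The only format facts it relies on are that an E8M0 byte `e` decodes to `2**(e - 127)` (with `0xFF` reserved for NaN) and that one scale covers 32 consecutive elements along K.

```python
import numpy as np

BLOCK = 32  # one E8M0 scale per 32 consecutive K elements ("1x32")


def e8m0_to_float(scale_u8):
    """Decode E8M0 bytes: value = 2**(e - 127); the byte 0xFF encodes NaN."""
    e = np.asarray(scale_u8).astype(np.int32)
    # Compute in float64 so 2**128 (from e == 255) does not overflow before masking.
    pow2 = np.ldexp(np.ones(e.shape, dtype=np.float64), e - 127)
    return np.where(e == 255, np.nan, pow2).astype(np.float32)


def dequant_mx(elems, scales):
    """elems: (R, K) decoded element values; scales: (R, K // 32) E8M0 bytes."""
    s = e8m0_to_float(scales)
    # Broadcast each scale over its 32-element block along K.
    return elems.astype(np.float32) * np.repeat(s, BLOCK, axis=1)


def ref_gemm_mx(a, a_scale, b, b_scale):
    """Reference C[M, N] = dequant(A)[M, K] @ dequant(B)[N, K].T (layout assumed)."""
    return dequant_mx(a, a_scale) @ dequant_mx(b, b_scale).T
```

A reference like this is mainly useful as the ground truth in a correctness test; it says nothing about how the gfx1250 kernel tiles or shuffles its operands.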

Test Plan

pytest -q aiter/ops/flydsl/test_flydsl_mxscale_gemm.py

Test Result

51 tests passed on gfx1250.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

github-actions · 2026-05-09T12:56:17Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:triton-300x`	Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
`ci:sglang`	SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
`ci:atom`	ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
`ci:atom_full`	ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json`
`ci:vllm`	vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
`ci:all`	All standard extended tests (excludes `ci:atom_full`)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3106 --add-label <label>

Copilot

Pull request overview

Adds FlyDSL-backed MXScale dense GEMM support for gfx1250 (MXFP8 and A8W4) to AITER, including a public Python wrapper, host-side layout helpers, kernel code, and an AOT CSV parsing/compilation hook.

Changes:

Introduces flydsl_mxscale_gemm plus format-named wrappers gemm_mxfp8 / gemm_mxa8w4, including kernel-name encode/parse utilities.
Adds host-side padding + preshuffle utilities for B and E8M0 scales, and vendors the gfx1250 MXScale GEMM kernel implementation.
Adds a dedicated pytest suite and extends the FlyDSL GEMM AOT pipeline to recognize flydsl_mxscale_* kernels.
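
The host-side padding step mentioned above can be sketched roughly as follows. This is an illustration only, not the code in mxscale_layout.py: the function names and the tile sizes passed in are hypothetical, and the preshuffle steps are left out because their exact layout is not described here. The sketch pads the data with zeros and fills the new scale entries with the E8M0 byte 127, which decodes to 1.0, presumably so that padded blocks contribute exactly zero to the result without introducing NaN scales.

```python
import numpy as np

BLOCK = 32       # K elements covered by one E8M0 scale
E8M0_ONE = 127   # E8M0 byte that decodes to a scale of 1.0


def round_up(x, multiple):
    """Smallest multiple of `multiple` that is >= x."""
    return (x + multiple - 1) // multiple * multiple


def pad_operand_and_scale(x, scale, row_tile, k_tile):
    """Zero-pad x (R, K) and pad scale (R, K // 32) with E8M0 1.0.

    row_tile and k_tile are hypothetical tile sizes chosen by the caller.
    """
    r, k = x.shape
    assert k % BLOCK == 0 and k_tile % BLOCK == 0
    assert scale.shape == (r, k // BLOCK)
    rp, kp = round_up(r, row_tile), round_up(k, k_tile)

    x_pad = np.zeros((rp, kp), dtype=x.dtype)
    x_pad[:r, :k] = x

    s_pad = np.full((rp, kp // BLOCK), E8M0_ONE, dtype=np.uint8)
    s_pad[:r, : k // BLOCK] = scale
    return x_pad, s_pad
```

For example, a 3×32 operand padded with `row_tile=16` and `k_tile=128` becomes 16×128, and its 3×1 scale tensor becomes 16×4 with every new entry equal to 127.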

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
aiter/ops/flydsl/test_flydsl_mxscale_gemm.py	New unit + correctness tests for MXScale GEMM (gated to gfx1250 + flydsl).
aiter/ops/flydsl/mxscale_layout.py	Host-side padding and preshuffle helpers for MXScale A/B and E8M0 scales.
aiter/ops/flydsl/mxscale_gemm.py	Public wrapper API, kernel-name encode/parse, and runtime launch path.
aiter/ops/flydsl/kernels/pipeline_utils.py	Shared pipeline helper utilities used by the gfx1250 kernel.
aiter/ops/flydsl/kernels/gemm_fp8fp4_gfx1250.py	Vendored unified MXFP4/MXFP8/A8W4 gfx1250 kernel with MXScale support.
aiter/ops/flydsl/kernels/gemm_common_gfx1250.py	Shared gfx1250 GEMM helpers (LDS/pipeline/epilogue utilities).
aiter/aot/flydsl/gemm.py	Extends AOT CSV parsing/dispatch to recognize MXScale kernels and compile them.
aiter/__init__.py	Exposes the new optional FlyDSL-backed public entry points at top-level.
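
One common way to expose optional entry points like the ones in the last row is to check whether the backend package can be found before exporting anything. The sketch below shows that general pattern only; it is not the code in aiter/__init__.py, and the helper names are hypothetical. It assumes the backend is importable under the module name `flydsl`.

```python
import importlib.util

# Entry-point names taken from the PR description.
MXSCALE_EXPORTS = ["gemm_mxfp8", "gemm_mxa8w4", "flydsl_mxscale_gemm"]


def flydsl_available():
    """True if a top-level module named 'flydsl' can be located (not imported)."""
    return importlib.util.find_spec("flydsl") is not None


def optional_exports(available):
    """Names a package could add to __all__ depending on backend availability."""
    return list(MXSCALE_EXPORTS) if available else []
```

Using `find_spec` avoids importing the backend at package import time, which keeps `import aiter`-style imports cheap on machines without FlyDSL; the real import can then happen lazily on first call.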


Vendor the FlyDSL gfx1250 mxscale GEMM kernel into aiter and expose two data-format-named public entries:
- aiter.gemm_mxfp8: MXFP8 (E4M3 + E8M0 1x32)
- aiter.gemm_mxa8w4: A8W4 (FP8 act, FP4 weight, E8M0 1x32)

aiter.flydsl_mxscale_gemm remains exported as the low-level entry that pins to the FlyDSL backend and exposes all codegen knobs.

Files added:
- aiter/ops/flydsl/kernels/{gemm_fp8fp4_gfx1250,gemm_common_gfx1250,pipeline_utils}.py: vendored from FlyDSL main; only the two `from kernels.X` imports are rewritten to relative form.
- aiter/ops/flydsl/mxscale_layout.py: host helper for pad + E8M0(127) scale fill + B 16x16 preshuffle + WMMA-friendly E8M0 scale preshuffle.
- aiter/ops/flydsl/mxscale_gemm.py: public wrappers, kernelName encode/parse, format-named entries, runtime arch guard, lazy flydsl import.
- aiter/ops/flydsl/test_flydsl_mxscale_gemm.py: unit tests, gated on CUDA + flydsl + gfx1250.

Wires into AOT:
- aiter/aot/flydsl/gemm.py adds a `flydsl_mxscale_*` parser branch and `_compile_mxscale_to_cache` for CSV-driven AOT precompilation; mxscale-kind jobs hard-pin gfx1250 regardless of cu_num.

Public surface:
- aiter/__init__.py exports gemm_mxfp8, gemm_mxa8w4, and flydsl_mxscale_gemm when flydsl is importable.

This path is intentionally independent from gemm_a8w8_blockscale and gemm_a8w8_bpreshuffle: the OCP MX scale (E8M0 1x32) is not interchangeable with the existing per-1x128/128x128 FP32 or PTPC FP32 scale layouts. Future Gluon/CK MXFP8 backends can land behind the format-named entries without changing the call sites.
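
The AOT behaviour described in this commit message (a `flydsl_mxscale_*` parser branch, with mxscale jobs pinned to gfx1250 regardless of cu_num) might look roughly like the sketch below. It is an illustration under assumptions, not the code in aiter/aot/flydsl/gemm.py: the CSV column names `kernelName` and `cu_num`, the function `plan_aot_jobs`, and the caller-supplied `arch_for_cu_num` mapping are all hypothetical.

```python
import csv
import io

MXSCALE_PREFIX = "flydsl_mxscale_"
MXSCALE_ARCH = "gfx1250"  # mxscale jobs are pinned to this arch


def plan_aot_jobs(csv_text, arch_for_cu_num):
    """Turn CSV rows into job dicts, pinning mxscale kernels to gfx1250.

    arch_for_cu_num is a hypothetical callable mapping a CU count to an
    architecture string for every non-mxscale kernel.
    """
    jobs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["kernelName"].strip()
        if name.startswith(MXSCALE_PREFIX):
            # cu_num is deliberately ignored for mxscale-kind jobs.
            kind, arch = "mxscale", MXSCALE_ARCH
        else:
            kind, arch = "other", arch_for_cu_num(int(row["cu_num"]))
        jobs.append({"name": name, "kind": kind, "arch": arch})
    return jobs
```

Dispatching on the kernel-name prefix keeps the existing CSV format unchanged: rows for other kernels continue through the old path, and only names starting with `flydsl_mxscale_` take the new branch.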


aoli26 requested review from a team and Copilot May 9, 2026 12:56

Copilot started reviewing on behalf of aoli26 May 9, 2026 12:57

Copilot AI reviewed May 9, 2026


Comment thread aiter/ops/flydsl/mxscale_gemm.py Outdated

Comment thread aiter/ops/flydsl/mxscale_layout.py

Comment thread aiter/ops/flydsl/kernels/gemm_fp8fp4_gfx1250.py

Comment thread aiter/aot/flydsl/gemm.py

Comment thread aiter/aot/flydsl/gemm.py Outdated

aoli26 force-pushed the aoli/flydsl_mxscale_gfx1250 branch from 4c9c06b to 53d5b74 Compare May 11, 2026 07:38

aoli26 added 4 commits May 11, 2026 17:56

[Kernel][gfx1250] MXScale: validate num_buffers, cache compile, harden device/FP4 padding

f673aa5

[Kernel][gfx1250] Fix FlyDSL MXScale GEMM CI coverage

cd8989b

[Kernel][gfx1250] Fix FlyDSL MXScale import and arch gates

8eb18c5

aoli26 force-pushed the aoli/flydsl_mxscale_gfx1250 branch from cdfb6aa to 8eb18c5 Compare May 11, 2026 09:56

aoli26 wants to merge 4 commits into main from aoli/flydsl_mxscale_gfx1250
