[Kernel][gfx1250] Add FlyDSL MXScale FP8/A8W4 GEMM #3106

Open
aoli26 wants to merge 4 commits into main from aoli/flydsl_mxscale_gfx1250

@aoli26 aoli26 commented May 9, 2026

Motivation

Add MXFP8 and A8W4 dense GEMM support for gfx1250 to AITER.

Technical Details

  • Vendored FlyDSL compile_mxscale_gemm kernel into aiter/ops/flydsl/kernels/.
  • New public API:
    • aiter.gemm_mxfp8 — MXFP8 (E4M3 + E8M0 1×32)
    • aiter.gemm_mxa8w4 — A8W4 (FP8 act, FP4 weight, E8M0 1×32)
    • aiter.flydsl_mxscale_gemm — low-level entry with all codegen knobs.
  • Wired into AOT (aiter/aot/flydsl/gemm.py) for CSV-driven precompilation.
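
As a reading aid for the "E8M0 1×32" notation above, here is a minimal NumPy sketch of the reference math an MX-scaled GEMM is usually checked against. It does not call any aiter or FlyDSL API. The function names, the row-major (rows, K) layout, the B-as-(N, K) convention, and the use of float32 arrays as stand-ins for already-decoded FP8/FP4 elements are all assumptions made for illustration; only the E8M0 decoding rule (bias 127, 0xFF reserved for NaN) comes from the OCP MX format.

```python
import numpy as np

MX_BLOCK = 32    # elements along K that share one scale (the "1x32" above)
E8M0_BIAS = 127  # E8M0 is an 8-bit exponent-only format with bias 127


def e8m0_to_float(scale_u8):
    """Decode E8M0 bytes to float32 as 2**(e - 127); byte 0xFF encodes NaN."""
    e = scale_u8.astype(np.int32)
    # Swap the NaN code for a harmless exponent so exp2 never overflows.
    e_safe = np.where(e == 255, E8M0_BIAS, e)
    out = np.exp2((e_safe - E8M0_BIAS).astype(np.float32))
    return np.where(e == 255, np.float32(np.nan), out).astype(np.float32)


def dequant_mx(values, scales_u8):
    """Apply one scale per 32 contiguous K elements.

    values:    (rows, K) float32 stand-ins for decoded FP8/FP4 elements.
    scales_u8: (rows, K // 32) uint8 E8M0 scales (hypothetical layout).
    """
    rows, k = values.shape
    assert k % MX_BLOCK == 0
    assert scales_u8.shape == (rows, k // MX_BLOCK)
    scales = np.repeat(e8m0_to_float(scales_u8), MX_BLOCK, axis=1)
    return values * scales


def ref_mx_gemm(a, a_scale, b, b_scale):
    """Reference C = dequant(A) @ dequant(B).T with A: (M, K), B: (N, K)."""
    return dequant_mx(a, a_scale) @ dequant_mx(b, b_scale).T
```

In this sketch MXFP8 and A8W4 differ only in how the weight elements would be decoded before `dequant_mx`; the scale handling is identical.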

Test Plan

pytest -q aiter/ops/flydsl/test_flydsl_mxscale_gemm.py

Test Result

51 tests passed on gfx1250.

Submission Checklist

@aoli26 aoli26 requested review from a team and Copilot May 9, 2026 12:56

github-actions Bot commented May 9, 2026

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label           Tests
ci:triton-300x  Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X
ci:sglang       SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy
ci:atom         ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B
ci:atom_full    ATOM accuracy suite for PR and main models from ATOM models_accuracy.json
ci:vllm         vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5
ci:all          All standard extended tests (excludes ci:atom_full)

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3106 --add-label <label>

Copilot AI left a comment

Pull request overview

Adds FlyDSL-backed MXScale dense GEMM support for gfx1250 (MXFP8 and A8W4) to AITER, including a public Python wrapper, host-side layout helpers, kernel code, and an AOT CSV parsing/compilation hook.

Changes:

  • Introduces flydsl_mxscale_gemm plus format-named wrappers gemm_mxfp8 / gemm_mxa8w4, including kernel-name encode/parse utilities.
  • Adds host-side padding + preshuffle utilities for B and E8M0 scales, and vendors the gfx1250 MXScale GEMM kernel implementation.
  • Adds a dedicated pytest suite and extends the FlyDSL GEMM AOT pipeline to recognize flydsl_mxscale_* kernels.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

File                                             Description
aiter/ops/flydsl/test_flydsl_mxscale_gemm.py     New unit + correctness tests for MXScale GEMM (gated to gfx1250 + flydsl).
aiter/ops/flydsl/mxscale_layout.py               Host-side padding and preshuffle helpers for MXScale A/B and E8M0 scales.
aiter/ops/flydsl/mxscale_gemm.py                 Public wrapper API, kernel-name encode/parse, and runtime launch path.
aiter/ops/flydsl/kernels/pipeline_utils.py       Shared pipeline helper utilities used by the gfx1250 kernel.
aiter/ops/flydsl/kernels/gemm_fp8fp4_gfx1250.py  Vendored unified MXFP4/MXFP8/A8W4 gfx1250 kernel with MXScale support.
aiter/ops/flydsl/kernels/gemm_common_gfx1250.py  Shared gfx1250 GEMM helpers (LDS/pipeline/epilogue utilities).
aiter/aot/flydsl/gemm.py                         Extends AOT CSV parsing/dispatch to recognize MXScale kernels and compile them.
aiter/__init__.py                                Exposes the new optional FlyDSL-backed public entry points at top-level.

Comment thread aiter/ops/flydsl/mxscale_gemm.py Outdated
Comment thread aiter/ops/flydsl/mxscale_layout.py
Comment thread aiter/ops/flydsl/kernels/gemm_fp8fp4_gfx1250.py
Comment thread aiter/aot/flydsl/gemm.py
Comment thread aiter/aot/flydsl/gemm.py Outdated
@aoli26 aoli26 force-pushed the aoli/flydsl_mxscale_gfx1250 branch from 4c9c06b to 53d5b74 Compare May 11, 2026 07:38
aoli26 added 4 commits May 11, 2026 17:56
Vendor the FlyDSL gfx1250 mxscale GEMM kernel into aiter and expose two
data-format-named public entries:
  - aiter.gemm_mxfp8     MXFP8 (E4M3 + E8M0 1x32)
  - aiter.gemm_mxa8w4    A8W4 (FP8 act, FP4 weight, E8M0 1x32)

aiter.flydsl_mxscale_gemm remains exported as the low-level entry that
pins to the FlyDSL backend and exposes all codegen knobs.

Files added:
  - aiter/ops/flydsl/kernels/{gemm_fp8fp4_gfx1250,gemm_common_gfx1250,
    pipeline_utils}.py    vendored from FlyDSL main; only the two
                          `from kernels.X` imports are rewritten to
                          relative form.
  - aiter/ops/flydsl/mxscale_layout.py   host helper for pad +
                          E8M0(127) scale fill + B 16x16 preshuffle +
                          WMMA-friendly E8M0 scale preshuffle.
  - aiter/ops/flydsl/mxscale_gemm.py     public wrappers, kernelName
                          encode/parse, format-named entries, runtime
                          arch guard, lazy flydsl import.
  - aiter/ops/flydsl/test_flydsl_mxscale_gemm.py   unit tests, gated
                          on CUDA + flydsl + gfx1250.

Wires into AOT:
  - aiter/aot/flydsl/gemm.py adds a `flydsl_mxscale_*` parser branch
    and `_compile_mxscale_to_cache` for CSV-driven AOT precompilation;
    mxscale-kind jobs hard-pin gfx1250 regardless of cu_num.

Public surface:
  - aiter/__init__.py exports gemm_mxfp8, gemm_mxa8w4, and
    flydsl_mxscale_gemm when flydsl is importable.

This path is intentionally independent from gemm_a8w8_blockscale and
gemm_a8w8_bpreshuffle: the OCP MX scale (E8M0 1x32) is not
interchangeable with the existing per-1x128/128x128 FP32 or PTPC FP32
scale layouts. Future Gluon/CK MXFP8 backends can land behind the
format-named entries without changing the call sites.
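
The "pad + E8M0(127) scale fill" step named in the commit message can be illustrated with a short NumPy sketch. This is not the code in mxscale_layout.py: the function name, the `k_multiple` parameter, and the (rows, K) / (rows, K // 32) layouts are hypothetical. The one fixed fact it relies on is that the E8M0 byte 127 decodes to 2**0 = 1.0, so padded blocks get a neutral, non-NaN scale while the padded elements themselves are zero and add nothing to the dot product.

```python
import numpy as np

E8M0_ONE = 127  # E8M0 byte that decodes to 2**(127 - 127) == 1.0
MX_BLOCK = 32   # elements along K covered by one scale


def pad_k_with_neutral_scales(values, scales_u8, k_multiple):
    """Pad K up to a multiple of k_multiple (hypothetical tile size).

    values:    (rows, K) array, padded with zeros.
    scales_u8: (rows, K // 32) uint8 E8M0 scales, padded with 127.
    """
    assert k_multiple % MX_BLOCK == 0
    rows, k = values.shape
    assert k % MX_BLOCK == 0
    assert scales_u8.shape == (rows, k // MX_BLOCK)

    k_pad = -(-k // k_multiple) * k_multiple  # ceiling to the next multiple

    padded_values = np.zeros((rows, k_pad), dtype=values.dtype)
    padded_values[:, :k] = values

    padded_scales = np.full((rows, k_pad // MX_BLOCK), E8M0_ONE, dtype=np.uint8)
    padded_scales[:, : k // MX_BLOCK] = scales_u8
    return padded_values, padded_scales
```

For example, padding a (2, 32) operand with `k_multiple=128` yields a (2, 128) value array and a (2, 4) scale array whose last three scale columns are all 127.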
@aoli26 aoli26 force-pushed the aoli/flydsl_mxscale_gfx1250 branch from cdfb6aa to 8eb18c5 Compare May 11, 2026 09:56