[Experimental][DNM till upstream PR merges][AMD] perf: load-time block FP8 MoE for MiniMax M3 on MI300X by Oseltamivir · Pull Request #1753 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-06-14T06:40:44Z

Summary

Apply a runtime patch to the pinned ROCm image revision that converts
checkpoint MXFP8 MoE weights once at load time into 128x128 block FP8.
Normalize OCP E4M3 values to the gfx942 FNUZ representation and run the
regular Triton block-FP8 backend.
Use measured low-token TP tiles and a tuned E16 local-expert table.
Preserve the existing TP8 and TP8+EP8 parallelism and concurrency matrix.
Fail the launcher if patch application fails or the expected backend marker
is absent.
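
As a rough illustration of the load-time conversion described above, the sketch below expands MXFP8 groups to real values and then derives one scale per weight block. It is not the patch's code: the function names are hypothetical, it assumes the standard MX layout of 32 elements per E8M0 scale and an E4M3FNUZ maximum of 240, and it stops before the final rounding to FP8.

```python
from __future__ import annotations

FNUZ_MAX = 240.0  # largest finite E4M3FNUZ magnitude
MX_GROUP = 32     # elements sharing one E8M0 scale in MXFP8
E8M0_BIAS = 127


def dequantize_mx_row(elements: list[float], scale_exps: list[int]) -> list[float]:
    """Multiply each decoded element by its group's scale 2**(e - 127)."""
    return [
        value * 2.0 ** (scale_exps[i // MX_GROUP] - E8M0_BIAS)
        for i, value in enumerate(elements)
    ]


def block_scales(matrix: list[list[float]], block: int = 128) -> list[list[float]]:
    """Return amax / FNUZ_MAX for every block x block tile (hypothetical helper)."""
    rows, cols = len(matrix), len(matrix[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            amax = max(
                abs(matrix[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            # An all-zero tile gets a neutral scale to avoid dividing by zero.
            row_scales.append(amax / FNUZ_MAX if amax else 1.0)
        scales.append(row_scales)
    return scales
```

Dividing each tile by its scale keeps every magnitude at or below 240, so the values fit the FNUZ range before rounding. Because this runs once at load time, the per-request path only sees the block-FP8 weights.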

The patch is scoped to
benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh. The separate
MTP recipe remains unchanged.

Validation

Runtime patch applies cleanly to image revision
4a560dd8db67c270f5e2afb614558271b76f2294.
Patched Python files compile and both tuning JSON files parse.
Upstream changed-file pre-commit hooks pass, including mypy.
Targeted MI300X kernel module: 48 passed, 5 skipped.
python -m pytest utils/matrix_logic/ utils/test_process_changelog.py -q:
158 passed.
MI300X matrix generation includes c1 through c128 TP8, c256 TP8+EP8,
both sequence lengths, and the expected eval points.
Launcher shell validation and patch reverse-application checks pass.
Full MI300X TP8 and TP8+EP8 sweep:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27725228435
- All 18 serving points and both eval jobs passed after retrying three
  node-local Pyxis failures that occurred before model startup.
Independent accuracy run:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27725256963
- Across the two full 1,319-example GSM8K runs, strict exact match was
  95.53%-95.98% for TP8 and 95.30%-95.91% for TP8+EP8.
MI355X matched control/patched guard:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27733137495
- Both variants retained the native gfx950 MXFP8 backend; the numerical
  smoke test reported 0.037185 relative error.
- Patched aggregate throughput was within 0.43% at concurrency 4
  (453.13 vs. 455.10 tok/s) and 0.84% at concurrency 64
  (4018.45 vs. 4052.69 tok/s) of the matched control.

The E16 table was measured twice. Its independent 100-iteration rerun reduced
kernel latency by 13.5% to 26.4% versus the built-in fallback for every tested
batch size from 64 through 8192.

End-to-End Interpretation

The unofficial chart overlay is not a before/after comparison. The branch
series uses MI300X with TP8 or TP8+EP8, while the adjacent MI355X series uses
TP4 on a different GPU generation.

Against the previous MI300X result with the same 8K/1K TP/EP and concurrency
shapes
(https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27510667862),
the patched path improves total throughput per GPU:

Parallelism	Concurrency	Previous (tok/s/GPU)	Patched (tok/s/GPU)	Change
TP8	1	99.88	104.94	+5.1%
TP8	8	463.69	502.55	+8.4%
TP8	32	880.10	981.21	+11.5%
TP8	64	976.30	1236.57	+26.7%
TP8+EP8	128	1110.16	1273.92	+14.8%
TP8+EP8	256	1199.22	1469.05	+22.5%
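
The Change column can be rechecked from the Previous and Patched columns. The snippet below is only a convenience sketch for that arithmetic; the variable and function names are illustrative and the numbers are copied from the table.

```python
# (parallelism, concurrency, previous tok/s/GPU, patched tok/s/GPU)
rows = [
    ("TP8", 1, 99.88, 104.94),
    ("TP8", 8, 463.69, 502.55),
    ("TP8", 32, 880.10, 981.21),
    ("TP8", 64, 976.30, 1236.57),
    ("TP8+EP8", 128, 1110.16, 1273.92),
    ("TP8+EP8", 256, 1199.22, 1469.05),
]


def percent_change(previous: float, patched: float) -> float:
    """Relative change of patched over previous, in percent."""
    return (patched / previous - 1.0) * 100.0


changes = [f"{percent_change(prev, new):+.1f}%" for _, _, prev, new in rows]
for (parallelism, concurrency, _, _), change in zip(rows, changes):
    print(f"{parallelism} c{concurrency}: {change}")
```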

These are real same-hardware gains, but they do not close the end-to-end gap
to the MI355X TP4 curve in the throughput-oriented region. An earlier MI300X
TP4/DP2 experiment
(https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27664746568)
reached 1393.42 tok/s/GPU at 8K/1K concurrency 256, below the patched
TP8+EP8 result of 1469.05 tok/s/GPU, so changing parallelism alone does not
close that gap.

The runtime patch optimizes MoE weight representation and MoE kernel dispatch.
The chart also includes attention, sparse indexing, KV-cache handling,
scheduling, prefill/decode balance, collectives, and hardware differences.
The supported performance claim is therefore improved MI300X performance
relative to its previous path, not parity with MI355X.

MI300X end-to-end serving results, in total tokens/s/GPU:

Sequence	Parallelism	Concurrency	Total tok/s/GPU
1K/1K	TP8	1	24.37
1K/1K	TP8	2	46.85
1K/1K	TP8	4	83.22
1K/1K	TP8	8	136.60
1K/1K	TP8	16	219.91
1K/1K	TP8	32	346.28
1K/1K	TP8	64	522.52
1K/1K	TP8	128	738.25
1K/1K	TP8+EP8	256	917.31
8K/1K	TP8	1	104.94
8K/1K	TP8	2	197.59
8K/1K	TP8	4	336.79
8K/1K	TP8	8	502.55
8K/1K	TP8	16	715.23
8K/1K	TP8	32	981.21
8K/1K	TP8	64	1236.57
8K/1K	TP8+EP8	128	1273.92
8K/1K	TP8+EP8	256	1469.05

Note

Medium Risk
Runtime patching of vLLM quantization/MoE affects numerical behavior on MI300X serving paths until upstream lands; benchmark scope is limited but load-time requantization is accuracy-sensitive.

Overview
Enables the MiniMax-M3 MXFP8 MI300X vLLM benchmark by patching the pinned ROCm image at job start instead of waiting on upstream. minimaxm3_fp8_mi300x.sh locates the installed vllm tree, idempotently applies minimaxm3_mi300x_mxfp8.patch, verifies it, and fails the run if patch state is ambiguous.

The patch adds load-time conversion of checkpoint MXFP8 MoE weights to 128×128 block FP8 on gfx942 (OCP → FNUZ), routes MoE through the Triton block-FP8 backend, and ships retuned MI300X fused-MoE JSON (including a new E=16,N=3072 table). BF16 KV and the existing TP8 / TP8+EP8 sweep matrix are unchanged; amd-master.yaml comments now describe this path.

Also records the change in perf-changelog.yaml and hardens get_added_lines so append-only changelog edits (e.g. missing trailing newline) are detected correctly, with new tests in utils/test_process_changelog.py.
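
One way to see why a missing trailing newline matters: a raw diff reports the old last line as removed and re-added, which a naive deletion check rejects. The sketch below compares logical lines instead. It is an assumption about the general approach, not the repository's get_added_lines; the function name appended_lines is hypothetical.

```python
from __future__ import annotations


def appended_lines(old_text: str, new_text: str) -> list[str]:
    """Return the lines appended to old_text, or raise if anything else changed."""
    # splitlines() yields the same list with or without a final newline,
    # so a newline added to the old last line is not seen as a deletion.
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    if new_lines[: len(old_lines)] != old_lines:
        raise ValueError("changelog edit is not append-only")
    return new_lines[len(old_lines):]
```

With this check, `appended_lines("a\nb", "a\nb\nc\n")` returns `["c"]`, while rewriting an existing line still raises.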

Reviewed by Cursor Bugbot for commit 6f5a399. Bugbot is set up for automated code reviews on this repo.

github-actions · 2026-06-14T06:40:53Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-06-14T06:43:16Z

Claude finished @Oseltamivir's task in 3m 50s.

Review of PR #1753

Gather context and read changed files
Validate master config changes
Validate perf-changelog entries
Validate benchmark script
Review runtime patch (Triton kernels, normalization, oracle changes)
Post review results

LGTM — no blocking issues found.

Summary: This PR adds a fused CDNA3 MXFP8 MoE backend for MiniMax-M3 on MI300X via a runtime vLLM patch. I reviewed all four changed files and the 656-line Triton kernel patch in detail. Key findings:

E4M3FN → E4M3FNUZ normalization (normalize_mxfp8_e4m3fn_to_e4m3fnuz): The math is correct — FNUZ's exponent bias is 1 higher, so incrementing the E8M0 scale compensates; negative zero (0x80) is properly canonicalized. The overflow guard (scales.max() >= 254) correctly prevents incrementing into E8M0 NaN (255).
FNUZ kernel scale computation (tl.exp2(asc + bsc - 254.0)): Correct — two E8M0 dequants 2^(a-127) * 2^(b-127) = 2^(a+b-254).
Split-K: Occupancy-aware selection, FP32 accumulation buffer when split_k > 1, tl.atomic_add for reduction — all consistent.
Grid bound (min(sorted_token_ids.shape[0], M_routed * block_m)): Valid upper bound since active experts ≤ M_routed, and the kernel's num_post guard handles any overestimate.
Benchmark script: Expert parallelism correctly conditioned on EP_SIZE, vllm serve arguments properly formatted on separate lines, patch application is idempotent via the grep guard + --forward.
Perf-changelog: New entry correctly appended at the end of the file.
Master config: Only the comment was updated; no functional config or image changes.
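
The bias argument in the first two findings can be checked with a small pure-Python model of the formats. This is a sketch under stated assumptions, not the patch's normalize_mxfp8_e4m3fn_to_e4m3fnuz: the helper names are hypothetical, the decoders follow the usual E4M3FN (bias 7) and E4M3FNUZ (bias 8) definitions, and OCP NaN bytes (0x7F, 0xFF) are assumed absent from the weights.

```python
from __future__ import annotations

E8M0_BIAS = 127
E8M0_NAN = 255


def decode_e4m3fn(byte: int) -> float:
    """Decode an OCP E4M3FN byte (exponent bias 7, NaN at 0x7F and 0xFF)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp, man = (byte >> 3) & 0xF, byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")
    if exp == 0:  # subnormal
        return sign * (man / 8.0) * 2.0 ** (1 - 7)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


def decode_e4m3fnuz(byte: int) -> float:
    """Decode an E4M3FNUZ byte (exponent bias 8, single NaN at 0x80)."""
    if byte == 0x80:
        return float("nan")
    sign = -1.0 if byte & 0x80 else 1.0
    exp, man = (byte >> 3) & 0xF, byte & 0x7
    if exp == 0:  # subnormal
        return sign * (man / 8.0) * 2.0 ** (1 - 8)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 8)


def normalize(byte: int, scale_exp: int) -> tuple[int, int]:
    """Return (fnuz_byte, scale_exp) encoding the same real value."""
    if scale_exp >= E8M0_NAN - 1:
        raise ValueError("incrementing this E8M0 scale would reach NaN (255)")
    if byte == 0x80:  # OCP negative zero would read as NaN in FNUZ
        byte = 0x00
    # The same bits are worth half as much in FNUZ, so double the scale.
    return byte, scale_exp + 1


def dequant(value: float, scale_exp: int) -> float:
    """Apply an E8M0 scale 2**(e - 127) to a decoded element."""
    return value * 2.0 ** (scale_exp - E8M0_BIAS)


def combined_scale(a_exp: int, b_exp: int) -> float:
    """Product of two E8M0 scales: 2**(a-127) * 2**(b-127) = 2**(a+b-254)."""
    return 2.0 ** (a_exp + b_exp - 2 * E8M0_BIAS)
```

For every non-NaN byte, `dequant(decode_e4m3fnuz(b), e + 1)` equals `dequant(decode_e4m3fn(b), e)`, which is the compensation the review describes.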

functionstackx

plz create upstream PR and have it reviewed before merging this patch

github-actions · 2026-06-14T07:28:14Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27490966693
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27490966693

github-actions · 2026-06-14T07:30:33Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27491894470
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27491894470

Oseltamivir · 2026-06-14T07:31:31Z

Opened the requested upstream vLLM PR: vllm-project/vllm#45567. It is stacked on the active MiniMax M3 model branch/PR (#45381), includes the tested gfx94x MXFP8 kernel and benchmark, and passes all vLLM pre-commit hooks. The InferenceX patch has also been updated to the optimized tile selection and no longer uses split-K.

github-actions · 2026-06-14T08:05:18Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27491936967
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27491936967

github-actions · 2026-06-14T10:05:29Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27495448403
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27495448403

cursor · 2026-06-14T10:05:29Z

+MXFP8_ORACLE="$VLLM_PACKAGE_ROOT/vllm/model_executor/layers/fused_moe/oracle/mxfp8.py"
+if ! grep -q "Using fused CDNA3 (gfx94x)" "$MXFP8_ORACLE"; then
+    patch --batch --forward -d "$VLLM_PACKAGE_ROOT" -p1 < "$MXFP8_PATCH"
+fi


MTP script skips MXFP8 patch

Medium Severity

Runtime MXFP8 patching was added only to the non-MTP MI300X benchmark script. launch_mi300x-amds.sh runs minimaxm3_fp8_mi300x_mtp.sh for spec-decoding: mtp configs, so those jobs never apply minimaxm3_mi300x_mxfp8.patch despite the MTP script claiming it mirrors this recipe.

Reviewed by Cursor Bugbot for commit c3cdc37.

github-actions · 2026-06-14T10:08:11Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27495458775
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27495458775

github-actions · 2026-06-14T11:20:49Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27495662629
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27495662629

github-actions · 2026-06-14T14:55:00Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27502391149
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27502391149

github-actions · 2026-06-14T15:51:49Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27502391149
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27502391149

github-actions · 2026-06-14T18:34:54Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27506382432
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27506382432

Oseltamivir · 2026-06-14T20:41:45Z

Packed-scale follow-up is pushed in 7678b0bc (merge refresh 684b6a3d). Matched local results, with parallelism unchanged:

1K/1K TP8: 214.239 / 329.709 / 491.056 / 707.749 tok/s/GPU at concurrency 16/32/64/128
8K/1K TP8 c64: 1199.146 tok/s/GPU
8K c64 is +5.60% vs the previous hybrid sweep and +22.74% vs BF16 emulation

Full validation sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27511311644

functionstackx · 2026-06-14T20:48:02Z

@Oseltamivir 's AI agent, remember to have ur search space start at conc=1 like i am fixing it rn #1760

github-actions · 2026-06-15T03:20:33Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27519117381
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27519117381

github-actions · 2026-06-15T03:26:18Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27522139191
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27522139191

github-actions · 2026-06-15T04:52:27Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27522193211
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27522193211

github-actions · 2026-06-17T20:58:32Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27719249411
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27719249411

github-actions · 2026-06-17T21:22:01Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27720384974
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27720384974

github-actions · 2026-06-17T21:31:42Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27720541036
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27720541036

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit 95e79da.

cursor · 2026-06-17T21:33:03Z

+    - "Convert checkpoint MXFP8 MoE weights once at load time to 128x128 block FP8 on gfx942."
+    - "Normalize OCP E4M3 values to FNUZ and use the regular Triton block-FP8 backend."
+    - "Use measured low-token TP tiles and a tuned local-expert table without changing the TP8/TP8+EP8 matrix."
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1753


Changelog fix triggers deletion error

Medium Severity

This commit replaces the malformed glm5-fp4-gb300-dynamo-trt pr-link line and adds the minimaxm3-fp8-mi300x-vllm entry. process_changelog.get_added_lines treats the removed pr-link line as a non-whitespace deletion and raises, so labeled PR sweep setup can fail before benchmarks run.

Reviewed by Cursor Bugbot for commit 95e79da.

github-actions · 2026-06-17T21:42:24Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27721062705
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27721062705

github-actions · 2026-06-17T22:47:46Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27721901991
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27721901991

github-actions · 2026-06-17T23:56:30Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27725228435
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27725228435

github-actions · 2026-06-18T00:13:13Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27725228435
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27725228435

Oseltamivir · 2026-06-18T01:04:21Z

/reuse-sweep-run 27725228435

github-project-automation Bot added this to InferenceMAX Board Jun 14, 2026

Oseltamivir added the full-sweep-fail-fast label Jun 14, 2026

Oseltamivir marked this pull request as ready for review June 14, 2026 06:42

Oseltamivir requested a review from a team June 14, 2026 06:42

Oseltamivir requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners June 14, 2026 06:42

functionstackx requested changes Jun 14, 2026

cursor Bot reviewed Jun 14, 2026

Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh

Oseltamivir changed the title ~~[AMD] feat: native MXFP8 MoE for MiniMax M3 on MI300X~~ [AMD] perf: hybrid MXFP8 MoE for MiniMax M3 on MI300X Jun 14, 2026

Oseltamivir added full-sweep-enabled and removed full-sweep-fail-fast labels Jun 14, 2026

cursor Bot reviewed Jun 14, 2026

functionstackx changed the title ~~[AMD] perf: hybrid MXFP8 MoE for MiniMax M3 on MI300X~~ [Experimental][DNM till upstream PR merges][AMD] perf: hybrid MXFP8 MoE for MiniMax M3 on MI300X Jun 14, 2026

Oseltamivir closed this Jun 15, 2026

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 15, 2026

Oseltamivir reopened this Jun 15, 2026

Oseltamivir mentioned this pull request Jun 15, 2026

perf(vllm): compact MiniMax M3 EP decode routes on MI300X #1782 (Open)

Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch from d1638a0 to 465ff47 Compare June 17, 2026 20:51

Oseltamivir changed the title ~~[Experimental][DNM till upstream PR merges][AMD] perf: hybrid MXFP8 MoE for MiniMax M3 on MI300X~~ [Experimental][DNM till upstream PR merges][AMD] perf: load-time block FP8 MoE for MiniMax M3 on MI300X Jun 17, 2026

Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch 2 times, most recently from 3d35ece to 6abc0eb Compare June 17, 2026 20:56

github-code-quality Bot found potential problems Jun 17, 2026

Comment thread utils/process_changelog.py Fixed

cursor Bot reviewed Jun 17, 2026

Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh

Comment thread benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh

Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch 2 times, most recently from 9031110 to 0a01999 Compare June 17, 2026 21:20

Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch from 0a01999 to 95e79da Compare June 17, 2026 21:30

cursor Bot reviewed Jun 17, 2026

perf(mi300x): use load-time block FP8 MoE conversion

27510c4

Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>

Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch from 95e79da to 27510c4 Compare June 17, 2026 21:47

fix(mi300x): preserve M3 SwiGLU parameters in FP8 patch

7521394

Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>

Oseltamivir added 2 commits June 18, 2026 09:04

Merge remote-tracking branch 'origin/main' into feat/m3-mi300x-blockf…

6c29d32

…p8-clean # Conflicts: # perf-changelog.yaml

Merge remote-tracking branch 'origin/main' into feat/m3-mi300x-blockf…

6f5a399

…p8-clean # Conflicts: # perf-changelog.yaml
