[Experimental][DNM till upstream PR merges][AMD] perf: load-time block FP8 MoE for MiniMax M3 on MI300X #1753
Oseltamivir wants to merge 4 commits into
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and re-running the failed jobs will fix them. If a re-run is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Claude finished @Oseltamivir's task in 3m 50s. Review of PR #1753
LGTM: no blocking issues found. Summary: This PR adds a fused CDNA3 MXFP8 MoE backend for MiniMax-M3 on MI300X via a runtime vLLM patch. I reviewed all four changed files and the 656-line Triton kernel patch in detail. Key findings:
functionstackx left a comment:
plz create upstream PR and have it reviewed before merging this patch
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27490966693
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27491894470
Opened the requested upstream vLLM PR: vllm-project/vllm#45567. It is stacked on the active MiniMax M3 model branch/PR (#45381), includes the tested gfx94x MXFP8 kernel and benchmark, and passes all vLLM pre-commit hooks. The InferenceX patch has also been updated to the optimized tile selection and no longer uses split-K.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27491936967
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27495448403
MXFP8_ORACLE="$VLLM_PACKAGE_ROOT/vllm/model_executor/layers/fused_moe/oracle/mxfp8.py"
if ! grep -q "Using fused CDNA3 (gfx94x)" "$MXFP8_ORACLE"; then
    patch --batch --forward -d "$VLLM_PACKAGE_ROOT" -p1 < "$MXFP8_PATCH"
fi
MTP script skips MXFP8 patch
Medium Severity
Runtime MXFP8 patching was added only to the non-MTP MI300X benchmark script. launch_mi300x-amds.sh runs minimaxm3_fp8_mi300x_mtp.sh for "spec-decoding: mtp" configs, so those jobs never apply minimaxm3_mi300x_mxfp8.patch, even though the MTP script claims to mirror this recipe.
Reviewed by Cursor Bugbot for commit c3cdc37.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27495458775
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27495662629
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27502391149
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27506382432
Packed-scale follow-up is pushed in
Full validation sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27511311644
@Oseltamivir's AI agent, remember to have ur search space start at conc=1 like i am fixing it rn #1760
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27519117381
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27522139191
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27522193211
d1638a0 to 465ff47
3d35ece to 6abc0eb
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27719249411
9031110 to 0a01999
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27720384974
0a01999 to 95e79da
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27720541036
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 95e79da.
    - "Convert checkpoint MXFP8 MoE weights once at load time to 128x128 block FP8 on gfx942."
    - "Normalize OCP E4M3 values to FNUZ and use the regular Triton block-FP8 backend."
    - "Use measured low-token TP tiles and a tuned local-expert table without changing the TP8/TP8+EP8 matrix."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1753
Changelog fix triggers deletion error
Medium Severity
This commit replaces the malformed glm5-fp4-gb300-dynamo-trt pr-link line and adds the minimaxm3-fp8-mi300x-vllm entry. process_changelog.get_added_lines treats the removed pr-link line as a non-whitespace deletion and raises, so the sweep setup for a labeled PR can fail before any benchmarks run.
Reviewed by Cursor Bugbot for commit 95e79da.
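The append-only constraint behind this finding can be illustrated with a small sketch. It is not the repository's `get_added_lines`: the helper name `appended_lines`, the whole-file comparison (rather than diff parsing), and the example URLs are all assumptions made for illustration.

```python
def appended_lines(old_text: str, new_text: str) -> list[str]:
    """Return the lines appended to old_text (hypothetical helper).

    Raises ValueError when an existing line was changed or removed, which
    is the kind of edit the review comment says trips the changelog check.
    """
    # splitlines() drops line terminators, so an old file that lacks a
    # trailing newline still compares equal line by line.
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    if new_lines[: len(old_lines)] != old_lines:
        raise ValueError("changelog edit modifies or deletes existing lines")
    return new_lines[len(old_lines):]


# Old file has no trailing newline; the new file only appends an entry.
old = "- config: a\n  pr-link: https://example.invalid/1"
appended = old + "\n- config: b\n  pr-link: https://example.invalid/2\n"
print(appended_lines(old, appended))

# Rewriting an existing pr-link line is not append-only and is rejected.
rewritten = appended.replace("https://example.invalid/1", "https://example.invalid/9")
try:
    appended_lines(old, rewritten)
except ValueError as err:
    print("rejected:", err)
```

Comparing normalized lines instead of raw diff hunks is one way to keep a missing trailing newline from looking like a deletion; the actual fix in this PR may work differently.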
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27721062705
Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>
95e79da to 27510c4
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27721901991
Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27725228435
/reuse-sweep-run 27725228435
…p8-clean # Conflicts: # perf-changelog.yaml
…p8-clean # Conflicts: # perf-changelog.yaml


Summary
checkpoint MXFP8 MoE weights once at load time into 128x128 block FP8.
regular Triton block-FP8 backend.
is absent.
The patch is scoped to benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh. The separate MTP recipe remains unchanged.
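For readers unfamiliar with the OCP-to-FNUZ step in the summary above, here is a minimal sketch of how the two E4M3 encodings relate. It is not the patch's conversion code. The function names are hypothetical, and the single per-value scale is a simplification: in block FP8 one scale would cover a whole 128x128 block.

```python
import math


def decode_e4m3(byte: int, *, fnuz: bool) -> float:
    """Decode one 8-bit E4M3 pattern as OCP (bias 7) or FNUZ (bias 8)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if fnuz:
        if byte == 0x80:          # FNUZ has no negative zero; 0x80 is NaN
            return math.nan
        bias = 8
    else:
        if exp == 0xF and man == 0x7:   # OCP E4M3 NaN (0x7F and 0xFF)
            return math.nan
        bias = 7
    if exp == 0:                  # zero and subnormals
        return sign * (man / 8.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - bias)


def normalize_ocp_to_fnuz(byte: int, scale: float) -> tuple[int, float]:
    """Hypothetical helper: keep the bits, remap OCP -0, double the scale.

    The same bits decode to half the value under FNUZ, so doubling the
    scale preserves byte * scale. OCP NaN patterns are not handled here;
    quantized weights are assumed to contain none.
    """
    return (0x00 if byte == 0x80 else byte), scale * 2.0


print(decode_e4m3(0x7E, fnuz=False))  # largest finite OCP value
print(decode_e4m3(0x7F, fnuz=True))   # largest finite FNUZ value

byte, scale = 0x45, 0.5
new_byte, new_scale = normalize_ocp_to_fnuz(byte, scale)
print(decode_e4m3(byte, fnuz=False) * scale,
      decode_e4m3(new_byte, fnuz=True) * new_scale)
```

Because the FNUZ maximum is smaller than the OCP maximum, reinterpreting bits and adjusting the scale keeps values representable without rounding again; whether the patch takes this route or requantizes from higher precision is not stated in this conversation.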
Validation
4a560dd8db67c270f5e2afb614558271b76f2294.
python -m pytest utils/matrix_logic/ utils/test_process_changelog.py -q: 158 passed.
both sequence lengths, and the expected eval points.
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27725228435
node-local Pyxis failures that occurred before model startup.
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27725256963
95.53%-95.98% for TP8 and 95.30%-95.91% for TP8+EP8.
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27733137495
smoke test reported 0.037185 relative error.
(453.13 vs. 455.10 tok/s) and 0.84% at concurrency 64
(4018.45 vs. 4052.69 tok/s) of the matched control.
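The percentage quoted for concurrency 64 can be reproduced from the two throughput figures. The sketch below assumes the first number in each pair is the patched branch and the second is the matched control; the helper name `shortfall_pct` is hypothetical.

```python
def shortfall_pct(candidate: float, control: float) -> float:
    """Percentage by which candidate throughput trails the control."""
    return (control - candidate) / control * 100.0


# Concurrency 64 pair from the validation notes (tok/s).
c64 = shortfall_pct(4018.45, 4052.69)
print(f"{c64:.2f}%")  # prints "0.84%"

# The other quoted pair can be checked the same way.
other = shortfall_pct(453.13, 455.10)
print(f"{other:.2f}%")
```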
The E16 table was measured twice. Its independent 100-iteration rerun reduced
kernel latency by 13.5% to 26.4% versus the built-in fallback for every tested
batch size from 64 through 8192.
End-to-End Interpretation
The unofficial chart overlay is not a before/after comparison. The branch
series uses MI300X with TP8 or TP8+EP8, while the adjacent MI355X series uses
TP4 on a different GPU generation.
Against the previous MI300X result with the same 8K/1K TP/EP and concurrency
shapes
(https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27510667862),
the patched path improves total throughput per GPU:
These are real same-hardware gains, but they do not close the end-to-end gap
to the MI355X TP4 curve in the throughput-oriented region. An earlier MI300X
TP4/DP2 experiment
(https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27664746568)
reached 1393.42 tok/s/GPU at 8K/1K concurrency 256, below the patched
TP8+EP8 result of 1469.05 tok/s/GPU, so changing parallelism alone does not
close that gap.
The runtime patch optimizes MoE weight representation and MoE kernel dispatch.
The chart also includes attention, sparse indexing, KV-cache handling,
scheduling, prefill/decode balance, collectives, and hardware differences.
The supported performance claim is therefore improved MI300X performance
relative to its previous path, not parity with MI355X.
MI300X end-to-end serving results, in total tokens/s/GPU:
Note
Medium Risk
Runtime patching of vLLM quantization/MoE affects numerical behavior on MI300X serving paths until upstream lands; benchmark scope is limited but load-time requantization is accuracy-sensitive.
Overview
Enables the MiniMax-M3 MXFP8 MI300X vLLM benchmark by patching the pinned ROCm image at job start instead of waiting on upstream.
minimaxm3_fp8_mi300x.sh locates the installed vllm tree, idempotently applies minimaxm3_mi300x_mxfp8.patch, verifies it, and fails the run if patch state is ambiguous.
The patch adds load-time conversion of checkpoint MXFP8 MoE weights to 128×128 block FP8 on gfx942 (OCP → FNUZ), routes MoE through the Triton block-FP8 backend, and ships retuned MI300X fused-MoE JSON (including a new E=16,N=3072 table). BF16 KV and the existing TP8 / TP8+EP8 sweep matrix are unchanged; amd-master.yaml comments now describe this path.
Also records the change in perf-changelog.yaml and hardens get_added_lines so append-only changelog edits (e.g. missing trailing newline) are detected correctly, with new tests in utils/test_process_changelog.py.
Reviewed by Cursor Bugbot for commit 6f5a399.
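The "apply idempotently, verify, fail on ambiguity" behavior described in this overview could look roughly like the following. This is a sketch under stated assumptions, not the benchmark script's logic: the function names and the two-signal decision (a marker string plus a forward dry run) are hypothetical, and it assumes GNU patch, whose `--batch`, `--forward`, `--dry-run`, `-p1`, and `-d` options are used.

```python
import subprocess
from pathlib import Path


def classify_patch_state(marker_present: bool, applies_forward: bool) -> str:
    """Combine two independent signals into a patch state (hypothetical)."""
    if marker_present and not applies_forward:
        return "already-applied"
    if not marker_present and applies_forward:
        return "needs-apply"
    # Marker present but the patch would still apply, or marker absent
    # and the patch does not apply: refuse to guess.
    return "ambiguous"


def forward_dry_run_ok(root: Path, patch_file: Path) -> bool:
    """True if GNU patch could apply patch_file cleanly to root."""
    with patch_file.open("rb") as fh:
        result = subprocess.run(
            ["patch", "--batch", "--forward", "--dry-run", "-p1", "-d", str(root)],
            stdin=fh, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        )
    return result.returncode == 0


def ensure_patched(root: Path, patch_file: Path, marker_file: Path, marker: str) -> str:
    """Apply the patch once; raise if the tree is in an unexpected state."""
    state = classify_patch_state(
        marker in marker_file.read_text(),
        forward_dry_run_ok(root, patch_file),
    )
    if state == "ambiguous":
        raise RuntimeError(f"ambiguous patch state under {root}")
    if state == "needs-apply":
        with patch_file.open("rb") as fh:
            subprocess.run(
                ["patch", "--batch", "--forward", "-p1", "-d", str(root)],
                stdin=fh, check=True,
            )
        # Verify: the marker must now be present in the patched file.
        if marker not in marker_file.read_text():
            raise RuntimeError("patch applied but marker is missing")
    return state
```

Keeping the decision in a pure function makes the ambiguous cases easy to test without touching a real vllm installation.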