perf(vllm): compact MiniMax M3 EP decode routes on MI300X by Oseltamivir · Pull Request #1782 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-06-15T19:13:24Z

Summary

compact MiniMax M3 EP8 decode assignments to locally owned experts on MI300X
use profiled 16-row BF16 grouped-GEMM tiles for the compacted route density
keep prefill and mixed batches on the existing generic TritonExperts path
remove the regressive dual-accumulator GEMM1/SwiGLU fusion
retain the existing short-context native MXFP8/BF16 policy

This PR is stacked on #1753 and contains only the incremental EP8 runtime
optimization. It does not include the profiling branch, temporary benchmark
configuration, AITER work, or perf-changelog.yaml changes.

Regression analysis

The original fused candidate improved 1k/c256 but regressed 8k/c256 against
main:

Point	Candidate	Main	Delta
1k1k c256	877.0 tok/s/GPU	782.7 tok/s/GPU	+12.1%
8k1k c256	994.1 tok/s/GPU	1199.2 tok/s/GPU	-17.1%

At 8k/c256, mean TTFT rose from 46.55 s to 57.33 s and mean TPOT rose from
185.39 ms to 223.16 ms. GPU power also fell from about 712 W to 652 W while
gfx utilization remained near 100%, indicating inefficient MFMA execution
rather than a bandwidth bottleneck.

The regression had two causes:

The custom GEMM1 held gate and up FP32 accumulators and issued their dot
products serially. Removing an activation launch did not repay the lost
matrix-core efficiency.
A decode-oriented tile was also forced during 8k prefill, bypassing the
larger prefill configuration selected by the existing generic path.

Bad run:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27569397626/attempts/7

Main comparison:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27510667862

Profile-based optimization

At c256, the profiled generation batch has about 216 active tokens and top-k
4, or 864 global routes. EP8 retains about 108 local routes across 16 experts,
roughly 6.75 rows per local expert. Once remote routes are removed, a 64-row M
tile wastes most of each active expert block. A 16-row tile better matches the
actual local occupancy.
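
The occupancy arithmetic above can be reproduced with a short sketch. It assumes routes are spread uniformly across EP ranks and local experts, which is an idealization (real expert load is uneven), and the helper names are hypothetical rather than taken from the patch.

```python
import math


def local_route_stats(active_tokens, top_k, ep_size, local_experts):
    """Average route counts, assuming perfectly uniform routing."""
    global_routes = active_tokens * top_k
    local_routes = global_routes / ep_size
    rows_per_expert = local_routes / local_experts
    return global_routes, local_routes, rows_per_expert


def tile_utilization(rows_per_expert, block_m):
    """Fraction of rows in an expert's padded M tiles that hold real routes."""
    tiles = math.ceil(rows_per_expert / block_m)
    return rows_per_expert / (tiles * block_m)


global_routes, local_routes, rows = local_route_stats(216, 4, 8, 16)
print(global_routes, local_routes, rows)  # 864 108.0 6.75
for block_m in (16, 32, 64):
    # 16 -> 0.422, 32 -> 0.211, 64 -> 0.105
    print(block_m, round(tile_utilization(rows, block_m), 3))
```

Under these assumptions a 64-row tile is only about 10% occupied on average, while a 16-row tile is about 42% occupied.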

Across the 57 sparse layers in the selected decode step:

Phase	Existing path	Local routes, BM16	Delta
Expert GEMM1	18.608 ms	17.794 ms	-4.4%
SwiGLU activation	0.441 ms	0.546 ms	+23.9%
Expert GEMM2	9.586 ms	8.746 ms	-8.8%
Token align + sort	0.832 ms	0.680 ms	-18.2%
Expert reduction	0.328 ms	0.295 ms	-10.1%

The relevant MoE path falls from about 29.80 ms to 28.06 ms, a 5.8% reduction.
BM32 and BM64 controls were slower, confirming that local-route padding, not
the generic total-token selector, should drive this specialized decode tile.
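
As a cross-check, the totals quoted above follow from summing the per-phase timings in the table. The sketch below only re-adds the published numbers; the dictionary layout is illustrative.

```python
# Per-phase timings in ms, copied from the table: (existing path, local routes BM16).
phases = {
    "Expert GEMM1": (18.608, 17.794),
    "SwiGLU activation": (0.441, 0.546),
    "Expert GEMM2": (9.586, 8.746),
    "Token align + sort": (0.832, 0.680),
    "Expert reduction": (0.328, 0.295),
}

existing_total = sum(existing for existing, _ in phases.values())
local_total = sum(local for _, local in phases.values())
reduction_pct = (existing_total - local_total) / existing_total * 100

# Roughly 29.80 ms, 28.06 ms, and a 5.8% reduction.
print(f"{existing_total:.2f} ms -> {local_total:.2f} ms ({reduction_pct:.1f}% lower)")
```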

Profile controls:

BM16: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27646862471
BM64 tuned-config control: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27647771499

Scope

The path is gated to the exact gfx94x MiniMax M3 EP8 BF16 shape and decode
batches of at most 256 tokens. Prefill and larger mixed batches use the
existing generic implementation and its established MI300X configurations.
Other models and platforms are unchanged.
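
A minimal sketch of the gate described above, assuming hypothetical field and function names. It mirrors only the conditions stated in this section and is not the predicate from the patch.

```python
from dataclasses import dataclass

MAX_COMPACT_DECODE_TOKENS = 256  # "decode batches of at most 256 tokens"


@dataclass(frozen=True)
class MoEBatch:
    # All field names are hypothetical; they restate the Scope conditions.
    gfx_arch: str
    is_minimax_m3_ep8_shape: bool
    activation_dtype: str
    is_decode_only: bool
    num_tokens: int


def use_compact_decode_path(batch: MoEBatch) -> bool:
    """Return True only for the exact profiled configuration."""
    return (
        batch.gfx_arch.startswith("gfx94")
        and batch.is_minimax_m3_ep8_shape
        and batch.activation_dtype == "bfloat16"
        and batch.is_decode_only
        and batch.num_tokens <= MAX_COMPACT_DECODE_TOKENS
    )


decode = MoEBatch("gfx942", True, "bfloat16", True, 216)
prefill = MoEBatch("gfx942", True, "bfloat16", False, 8192)
print(use_compact_decode_path(decode), use_compact_decode_path(prefill))  # True False
```

Any batch that fails one condition falls through to the generic path, so the specialized tile is never forced onto prefill.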

The prepared vLLM follow-up branch is stacked on the native gfx94x MXFP8 MoE
work in vLLM #45726 and has not yet been opened as a PR. It will be rebased
onto main after that prerequisite merges:

https://github.com/Oseltamivir/vllm/tree/codex/minimax-m3-mi300x-ep-mxfp8

Validation

python -m pytest utils/matrix_logic/ -q: 156 passed
runtime patch applies cleanly to the pinned image source
patched vLLM sources pass Ruff, formatting, compileall, and
git diff --check
correctness coverage exercises local-route GEMM1, activation, GEMM2, and
expert-map-aware reduction, including skipped remote rows

The requested MI300X serving matrix completed at c1, c16, and c256 for 1k1k
and 8k1k:

Point	This PR	Main	Regressed fusion	vs. main	vs. regressed
1k1k c1	23.1	23.3	19.0	-1.1%	+21.4%
1k1k c16	220.0	203.1	159.7	+8.3%	+37.8%
1k1k c256	912.0	782.7	877.0	+16.5%	+4.0%
8k1k c1	100.2	99.9	76.7	+0.3%	+30.7%
8k1k c16	699.1	669.2	469.3	+4.5%	+49.0%
8k1k c256	1238.6	1199.2	994.1	+3.3%	+24.6%

Values are total throughput in tok/s/GPU. At 8k1k c256, mean TTFT improves
from 46.55 s on main to 45.36 s, and mean TPOT improves from 185.39 ms to
179.37 ms. Average GPU power is unchanged at about 712 W, unlike the
inefficient regressed fusion's 652 W.

Benchmark runs:

1k1k c1/c16/c256: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27650162692
8k1k c1/c16: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27649315663
8k1k c256: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27649921579

The final runs checked out benchmark commit efe99e11, whose runtime patch
matches this PR. An earlier 8k1k c256 attempt failed before container startup
because the assigned node could not create an enroot user namespace; no
benchmark result from that attempt is included above.

Co-authored-by: OpenAI Codex <codex@openai.com>

Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com> Co-authored-by: OpenAI Codex <codex@openai.com>

github-actions · 2026-06-15T19:13:41Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-06-16T17:53:52Z

Claude finished @Oseltamivir's task in 7m 0s.

Review of PR #1782

LGTM — no blocking issues found.

Summary: This is a well-structured incremental optimization for MiniMax M3 EP8 on MI300X. The patch adds two complementary paths (short-context native/BF16 dispatch via the oracle, long-context sparse BF16 emulation with fused SwiGLU), and the gating logic correctly ensures each path is only reached for the profiled configuration. Key observations:

Correctness of the fused gated kernel: fused_moe_gated_kernel correctly computes both gate and up projections in a single pass, preserves the BF16 store/reload boundary before applying SwiGLU in FP32 (matching the unfused numerical behavior), and writes per-route rows to the activation buffer. The b_up_ptrs = b_gate_ptrs + N * stride_bn offset correctly addresses the split layout.
Local-expert allocation tightening: The num_local_experts parameter in moe_align_block_size correctly reduces the allocation from 128 * (block_size-1) to 16 * (block_size-1) padding tokens. The _max_post_padded bound is sound — min(allocation, M*block_m, M + E*(block_m-1)) takes the tightest of three independent bounds.
EP route filtering in emulation path: The _apply_sparse_mi300x_ep method correctly passes ignore_invalid_experts=True and num_local_experts=E to _prepare_expert_assignment, so remote-expert routes are dropped from sorted_token_ids. The downstream moe_fused_mul_sum with apply_weights=False and expert_map correctly zeros contributions from non-local experts (via tl.where(expert_mask, b_val, 0.0)), so unwritten rows in the workspace don't leak garbage.
Fused top-k atomic path in native MXFP8: FUSE_TOPK=True uses tl.atomic_add on a zero-initialized output with c_row = offs_token // top_k, correctly accumulating weighted per-route results into per-token output rows. The M_routed % top_k divisibility check guards against misuse.
Oracle routing: Short-context EP8 (max_model_len <= 4096) routes to Mxfp8NativeTritonExperts with mixed native/BF16 weights; long-context EP8 routes to Mxfp8EmulationTritonExperts with the sparse BF16 path. Both guards (_should_use_native_ep, _is_minimax_m3_mi300x_ep8) are scoped to the exact MiniMax M3 config.
Benchmark script: Patch application follows the existing idempotent marker pattern. EP flag is correctly conditional on EP_SIZE. Server launch uses proper line continuations.
Repo guidelines: No master config or perf-changelog changes needed (this PR only touches the benchmark script and runtime patch). No MTP or model prefix concerns.
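
The allocation and bound arithmetic in the review can be sketched as follows. The helper names and signatures are hypothetical. The formula "routes plus experts times (block_size - 1)" is assumed from the review's description of the padding term, and the final block-alignment floor is taken from the later review of the same bound.

```python
def padded_allocation(num_routes, num_experts, block_size):
    """Worst case: every expert wastes up to block_size - 1 padding rows."""
    return num_routes + num_experts * (block_size - 1)


def max_post_padded(allocation, m_routed, num_local_experts, block_m):
    """Tightest of three independent upper bounds, floored to a block multiple."""
    bound = min(
        allocation,                                    # buffer size
        m_routed * block_m,                            # one block per route
        m_routed + num_local_experts * (block_m - 1),  # per-expert padding
    )
    return (bound // block_m) * block_m


routes = 216 * 4  # profiled c256 decode step, top-k 4
global_alloc = padded_allocation(routes, 128, 16)
local_alloc = padded_allocation(routes, 16, 16)
print(global_alloc, local_alloc)  # 2784 1104
print(max_post_padded(local_alloc, routes, 16, 16))  # 1104
```

With 16 local experts instead of 128 global ones, the padding term shrinks from 1920 rows to 240 rows for a 16-row block.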

claude · 2026-06-17T01:09:54Z

Claude finished @Oseltamivir's task in 4m 57s.

Review of PR #1782

Read repository guidelines (CLAUDE.md)
Gather context and read changed files
Review benchmark shell script changes
Review runtime patch file
Validate against PR review checklist
Post review summary

LGTM — no blocking issues found.

Summary: This is a well-scoped incremental EP8 decode optimization stacked on #1753. The patch adds two complementary paths: short-context EP8 routes to Mxfp8NativeTritonExperts with the profiled native/BF16 dispatch; long-context EP8 routes to Mxfp8EmulationTritonExperts which now compacts decode batches (≤256 tokens) to locally-owned experts with BM16 tiles, bypassing the overpadded generic path. Key observations:

Local-route compaction in _apply_sparse_mi300x_ep: The _prepare_expert_assignment call with ignore_invalid_experts=True and num_local_experts=E correctly drops remote-expert routes from sorted_token_ids. Buffer management is sound — workspace2 is reused for intermediate_gate_up (GEMM1 output) and intermediate_output (GEMM2 output) with the activation fully consuming the former before the latter is written. GEMM2 passes top_k=1 (treating each route as an independent token) with mul_routed_weight=True, then moe_fused_mul_sum with apply_weights=False sums per-token without double-applying weights, zeroing non-local experts via expert_map.
_max_post_padded bound: The three-way min(allocation, M_routed * block_m, M_routed + E * (block_m - 1)) is sound — the first is the buffer size, the second the original bound, the third the tighter per-expert padding bound. The block-alignment floor is correct.
moe_align_block_size tightening: When ignore_invalid_experts, expert_map, and num_local_experts are all set, padding allocation drops from global_experts * (block_size - 1) to local_experts * (block_size - 1). The 0 < num_local_experts <= num_experts validation prevents misuse.
Fused top-k atomic in _mxfp8_grouped_gemm_*_kernel: c_row = offs_token // top_k correctly maps route-indexed rows to token-indexed output, tl.atomic_add with zero-initialized output accumulates concurrent routes, and the M_routed % top_k divisibility check is validated before launch.
Route-aware SwiGLU kernel (_swiglu_oai_quant_routed_kernel): Processes only locally-routed rows via sorted_token_ids, with proper padding/remote masking. Gate is clamped from above only (gate * sigmoid → 0 for negative gate, so lower clamp is a no-op), up is symmetrically clamped — matching the SwiGLU-OAI numeric contract.
Oracle routing: Short-context EP8 (≤4096 max_model_len) → Mxfp8NativeTritonExperts; long-context EP8 → Mxfp8EmulationTritonExperts. Both guards are scoped to the exact profiled MiniMax M3 gfx94x shape. The bf16_weights_available flag prevents using uninitialized BF16 weights in long-context EP8 where they aren't retained.
Decode gating: The use_sparse_ep predicate in Mxfp8EmulationTritonExperts.apply correctly gates on model match, BF16 dtype, ≤256 tokens, SwiGLU activation, expert_map presence, no router-weight-on-input, and no LoRA. Prefill and mixed batches fall through to the generic TritonExperts path.
Benchmark script: EP patch application follows the existing idempotent marker pattern. EP flag is correctly conditional on EP_SIZE. No master config or perf-changelog changes are included (as documented in scope).
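
The reduction step described in the review (weights applied once in GEMM2, then an unweighted sum that masks remote routes) can be illustrated in plain Python. This is a sketch with hypothetical names, not the `moe_fused_mul_sum` kernel. It assumes `expert_map` maps a global expert id to a local index, with -1 marking experts owned by another rank.

```python
import math


def combine_local_routes(route_rows, topk_ids, expert_map):
    """Sum already-weighted per-route rows into one row per token.

    Rows for remote experts were never written, so they may hold garbage.
    They are selected away (not multiplied by zero) so NaNs cannot leak.
    """
    out = []
    for rows, experts in zip(route_rows, topk_ids):
        acc = [0.0] * len(rows[0])
        for row, expert in zip(rows, experts):
            is_local = expert_map[expert] >= 0
            for i, value in enumerate(row):
                acc[i] += value if is_local else 0.0
        out.append(acc)
    return out


nan = math.nan
expert_map = [-1, 0, 1, -1]          # experts 1 and 2 are local
topk_ids = [[1, 3], [2, 1]]          # token 0 has one remote route
route_rows = [
    [[0.5, 1.0], [nan, nan]],        # remote row left unwritten
    [[1.0, 2.0], [0.25, 0.25]],
]
combined = combine_local_routes(route_rows, topk_ids, expert_map)
print(combined)  # [[0.5, 1.0], [1.25, 2.25]]
```

Selecting 0.0 for masked routes, rather than multiplying by a zero weight, matters because NaN times zero is still NaN.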

Oseltamivir and others added 22 commits June 13, 2026 23:39

feat: enable native mxfp8 moe for minimax m3 mi300x

d93e4e4

chore: trigger MiniMax M3 MI300X MXFP8 sweep

6b70497

Merge branch 'main' into feat/m3-mi300x-mxfp8

980f9c8

perf: tune MiniMax M3 gfx942 MXFP8 tiles

e9fa9b7

Merge remote-tracking branch 'origin/main' into feat/m3-mi300x-mxfp8

7f159d3

perf: recover MiniMax M3 MI300X serving curve

c3cdc37

fix: rebuild MI300X patch from pinned vLLM

60a0002

Merge remote-tracking branch 'origin/main' into feat/m3-mi300x-mxfp8

33584f9

# Conflicts: # perf-changelog.yaml

Merge remote-tracking branch 'origin/main' into feat/m3-mi300x-mxfp8

2bfc584

# Conflicts: # perf-changelog.yaml

perf: update MiniMax M3 MI300X MXFP8 patch

a38f5ab

Co-authored-by: OpenAI Codex <codex@openai.com>

perf(mi300x): pack MiniMax M3 MXFP8 scales

7678b0b

Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com> Co-authored-by: OpenAI Codex <codex@openai.com>

Merge remote-tracking branch 'origin/main' into feat/m3-mi300x-mxfp8

684b6a3

Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com> # Conflicts: # perf-changelog.yaml

perf(mi300x): tune MiniMax M3 MXFP8 refill dispatch

280c030

Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>

Merge remote-tracking branch 'origin/main' into feat/m3-mi300x-mxfp8

ba30da1

# Conflicts: # perf-changelog.yaml

perf(mi300x): tune short-k MXFP8 MoE GEMM2

23925cc

Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>

Merge remote-tracking branch 'origin/main' into feat/m3-mi300x-mxfp8

dd871ac

# Conflicts: # perf-changelog.yaml

fix(benchmarks): fail if MI300X patch is not applied

1e3bfdd

Merge remote-tracking branch 'origin/main' into feat/m3-mi300x-mxfp8

d1638a0

perf(vllm): optimize MiniMax M3 MXFP8 EP routes

28e3f75

fix(vllm): exclude tests from runtime patch

8279f50

perf(vllm): keep MiniMax M3 EP weights compressed

b25eff5

perf(vllm): fuse MiniMax M3 BF16 EP experts

16c596a

github-project-automation Bot added this to InferenceMAX Board Jun 15, 2026

Oseltamivir marked this pull request as ready for review June 16, 2026 17:53

Oseltamivir added the full-sweep-enabled label Jun 16, 2026

Oseltamivir marked this pull request as draft June 16, 2026 20:26

Oseltamivir removed the full-sweep-enabled label Jun 16, 2026

fix(vllm): restore MiniMax M3 EP performance

393962c

Oseltamivir changed the title from "perf(vllm): fuse MiniMax M3 BF16 EP experts on MI300X" to "perf(vllm): compact MiniMax M3 EP decode routes on MI300X" Jun 16, 2026

Oseltamivir marked this pull request as ready for review June 17, 2026 01:09

Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch from d1638a0 to 465ff47 Compare June 17, 2026 20:51

Oseltamivir requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners June 17, 2026 20:51

Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch 6 times, most recently from 95e79da to 27510c4 Compare June 17, 2026 21:47

Oseltamivir wants to merge 23 commits into feat/m3-mi300x-mxfp8 from codex/minimax-m3-mi300x-ep-mxfp8
