
perf(vllm): compact MiniMax M3 EP decode routes on MI300X#1782

Open
Oseltamivir wants to merge 23 commits into feat/m3-mi300x-mxfp8 from codex/minimax-m3-mi300x-ep-mxfp8

@Oseltamivir Oseltamivir commented Jun 15, 2026

Collaborator

Summary

  • compact MiniMax M3 EP8 decode assignments to locally owned experts on MI300X
  • use profiled 16-row BF16 grouped-GEMM tiles for the compacted route density
  • keep prefill and mixed batches on the existing generic TritonExperts path
  • remove the regressive dual-accumulator GEMM1/SwiGLU fusion
  • retain the existing short-context native MXFP8/BF16 policy

This PR is stacked on #1753 and contains only the incremental EP8 runtime
optimization. It does not include the profiling branch, temporary benchmark
configuration, AITER work, or perf-changelog.yaml changes.

Regression analysis

The original fused candidate improved 1k/c256 but regressed 8k/c256 against
main:

| Point | Candidate | Main | Delta |
| --- | --- | --- | --- |
| 1k1k c256 | 877.0 tok/s/GPU | 782.7 tok/s/GPU | +12.1% |
| 8k1k c256 | 994.1 tok/s/GPU | 1199.2 tok/s/GPU | -17.1% |

At 8k/c256, mean TTFT rose from 46.55 s to 57.33 s and mean TPOT rose from
185.39 ms to 223.16 ms. GPU power also fell from about 712 W to 652 W while
gfx utilization remained near 100%, indicating inefficient MFMA execution
rather than a bandwidth bottleneck.

The regression had two causes:

  1. The custom GEMM1 held gate and up FP32 accumulators and issued their dot
    products serially. Removing an activation launch did not repay the lost
    matrix-core efficiency.
  2. A decode-oriented tile was also forced during 8k prefill, bypassing the
    larger prefill configuration selected by the existing generic path.

Bad run:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27569397626/attempts/7

Main comparison:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27510667862

Profile-based optimization

At c256, the profiled generation batch has about 216 active tokens and top-k
4, or 864 global routes. EP8 retains about 108 local routes across 16 experts,
roughly 6.75 rows per local expert. Once remote routes are removed, a 64-row M
tile wastes most of each active expert block. A 16-row tile better matches the
actual local occupancy.
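
The route-density arithmetic above can be checked with a short sketch. It assumes routes are spread evenly across EP ranks and local experts, which the profile only approximates; `tile_occupancy` is an illustrative helper, not part of the patch.

```python
import math

# Figures quoted in the profile above.
active_tokens = 216
top_k = 4
ep_size = 8
local_experts = 16

global_routes = active_tokens * top_k            # 864
local_routes = global_routes // ep_size          # 108, assuming uniform routing
rows_per_expert = local_routes / local_experts   # 6.75


def tile_occupancy(rows: float, block_m: int) -> float:
    """Fraction of a padded expert block that holds real rows (hypothetical helper)."""
    padded = math.ceil(rows / block_m) * block_m
    return rows / padded


for block_m in (16, 32, 64):
    print(f"BM{block_m}: {tile_occupancy(rows_per_expert, block_m):.1%} occupied")
```

With the average of 6.75 rows per expert, this prints 42.2% for BM16, 21.1% for BM32, and 10.5% for BM64. Real per-expert counts vary around the mean, so these are indicative rather than measured occupancies.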

Across the 57 sparse layers in the selected decode step:

| Phase | Existing path | Local routes, BM16 | Delta |
| --- | --- | --- | --- |
| Expert GEMM1 | 18.608 ms | 17.794 ms | -4.4% |
| SwiGLU activation | 0.441 ms | 0.546 ms | +23.9% |
| Expert GEMM2 | 9.586 ms | 8.746 ms | -8.8% |
| Token align + sort | 0.832 ms | 0.680 ms | -18.2% |
| Expert reduction | 0.328 ms | 0.295 ms | -10.1% |

The relevant MoE path falls from about 29.80 ms to 28.06 ms, a 5.8% reduction.
BM32 and BM64 controls were slower, confirming that local-route padding, not
the generic total-token selector, should drive this specialized decode tile.
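
As a quick check, the totals quoted above follow from summing the per-phase times in the table. The snippet below only re-adds the published numbers; it does not measure anything.

```python
# Per-phase times in ms, copied from the table above: (existing path, local routes BM16).
phases = {
    "Expert GEMM1": (18.608, 17.794),
    "SwiGLU activation": (0.441, 0.546),
    "Expert GEMM2": (9.586, 8.746),
    "Token align + sort": (0.832, 0.680),
    "Expert reduction": (0.328, 0.295),
}

existing_total = sum(old for old, _ in phases.values())
compact_total = sum(new for _, new in phases.values())
reduction = (existing_total - compact_total) / existing_total

print(f"{existing_total:.3f} ms -> {compact_total:.3f} ms ({reduction:.1%} lower)")
```

The activation phase gets slower, but it is small next to the two GEMMs, so the GEMM savings dominate the total.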

Profile controls:

Scope

The path is gated to the exact gfx94x MiniMax M3 EP8 BF16 shape and decode
batches of at most 256 tokens. Prefill and larger mixed batches use the
existing generic implementation and its established MI300X configurations.
Other models and platforms are unchanged.
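
A minimal sketch of the gate described above. `MoeCallInfo` and `use_compact_decode_path` are hypothetical names for illustration, not identifiers from the patch, and using the token count as a stand-in for "decode batch" is an assumption.

```python
from dataclasses import dataclass

MAX_DECODE_TOKENS = 256  # decode-batch ceiling stated in the Scope section


@dataclass(frozen=True)
class MoeCallInfo:
    """Hypothetical summary of one MoE call; not a vLLM type."""
    is_profiled_m3_ep8_shape: bool  # exact gfx94x MiniMax M3 EP8 shape
    is_bf16: bool
    num_tokens: int


def use_compact_decode_path(info: MoeCallInfo) -> bool:
    """Return True only for the narrowly profiled case; otherwise fall back."""
    return (
        info.is_profiled_m3_ep8_shape
        and info.is_bf16
        and 0 < info.num_tokens <= MAX_DECODE_TOKENS
    )


# A 216-token decode step qualifies; an 8k prefill chunk takes the generic path.
print(use_compact_decode_path(MoeCallInfo(True, True, 216)))
print(use_compact_decode_path(MoeCallInfo(True, True, 8192)))
```

Keeping every condition in one conjunction means any unprofiled shape, dtype, or batch size falls through to the generic implementation by default.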

The prepared vLLM follow-up branch is stacked on the native gfx94x MXFP8 MoE
work in vLLM #45726 and is not opened as a PR. It will be rebased onto main
after that prerequisite merges:

https://github.com/Oseltamivir/vllm/tree/codex/minimax-m3-mi300x-ep-mxfp8

Validation

  • python -m pytest utils/matrix_logic/ -q: 156 passed
  • runtime patch applies cleanly to the pinned image source
  • patched vLLM sources pass Ruff, formatting, compileall, and
    git diff --check
  • correctness coverage exercises local-route GEMM1, activation, GEMM2, and
    expert-map-aware reduction, including skipped remote rows

The requested MI300X serving matrix completed at c1, c16, and c256 for 1k1k
and 8k1k:

| Point | This PR | Main | Regressed fusion | vs. main | vs. regressed |
| --- | --- | --- | --- | --- | --- |
| 1k1k c1 | 23.1 | 23.3 | 19.0 | -1.1% | +21.4% |
| 1k1k c16 | 220.0 | 203.1 | 159.7 | +8.3% | +37.8% |
| 1k1k c256 | 912.0 | 782.7 | 877.0 | +16.5% | +4.0% |
| 8k1k c1 | 100.2 | 99.9 | 76.7 | +0.3% | +30.7% |
| 8k1k c16 | 699.1 | 669.2 | 469.3 | +4.5% | +49.0% |
| 8k1k c256 | 1238.6 | 1199.2 | 994.1 | +3.3% | +24.6% |

Values are total throughput in tok/s/GPU. At 8k1k c256, mean TTFT improves
from 46.55 s on main to 45.36 s, and mean TPOT improves from 185.39 ms to
179.37 ms. Average GPU power is unchanged at about 712 W, unlike the
inefficient regressed fusion's 652 W.

Benchmark runs:

The final runs checked out benchmark commit efe99e11, whose runtime patch
matches this PR. An earlier 8k1k c256 attempt failed before container startup
because the assigned node could not create an enroot user namespace; no
benchmark result from that attempt is included above.

Oseltamivir and others added 22 commits June 13, 2026 23:39
Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>

# Conflicts:
#	perf-changelog.yaml
@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@Oseltamivir Oseltamivir marked this pull request as ready for review June 16, 2026 17:53
claude Bot commented Jun 16, 2026

Contributor

Claude finished @Oseltamivir's task in 7m 0s


Review of PR #1782

LGTM — no blocking issues found.

Summary: This is a well-structured incremental optimization for MiniMax M3 EP8 on MI300X. The patch adds two complementary paths (short-context native/BF16 dispatch via the oracle, long-context sparse BF16 emulation with fused SwiGLU), and the gating logic correctly ensures each path is only reached for the profiled configuration. Key observations:

  • Correctness of the fused gated kernel: fused_moe_gated_kernel correctly computes both gate and up projections in a single pass, preserves the BF16 store/reload boundary before applying SwiGLU in FP32 (matching the unfused numerical behavior), and writes per-route rows to the activation buffer. The b_up_ptrs = b_gate_ptrs + N * stride_bn offset correctly addresses the split layout.
  • Local-expert allocation tightening: The num_local_experts parameter in moe_align_block_size correctly reduces the allocation from 128 * (block_size-1) to 16 * (block_size-1) padding tokens. The _max_post_padded bound is sound — min(allocation, M*block_m, M + E*(block_m-1)) takes the tightest of three independent bounds.
  • EP route filtering in emulation path: The _apply_sparse_mi300x_ep method correctly passes ignore_invalid_experts=True and num_local_experts=E to _prepare_expert_assignment, so remote-expert routes are dropped from sorted_token_ids. The downstream moe_fused_mul_sum with apply_weights=False and expert_map correctly zeros contributions from non-local experts (via tl.where(expert_mask, b_val, 0.0)), so unwritten rows in the workspace don't leak garbage.
  • Fused top-k atomic path in native MXFP8: FUSE_TOPK=True uses tl.atomic_add on a zero-initialized output with c_row = offs_token // top_k, correctly accumulating weighted per-route results into per-token output rows. The M_routed % top_k divisibility check guards against misuse.
  • Oracle routing: Short-context EP8 (max_model_len <= 4096) routes to Mxfp8NativeTritonExperts with mixed native/BF16 weights; long-context EP8 routes to Mxfp8EmulationTritonExperts with the sparse BF16 path. Both guards (_should_use_native_ep, _is_minimax_m3_mi300x_ep8) are scoped to the exact MiniMax M3 config.
  • Benchmark script: Patch application follows the existing idempotent marker pattern. EP flag is correctly conditional on EP_SIZE. Server launch uses proper line continuations.
  • Repo guidelines: No master config or perf-changelog changes needed (this PR only touches the benchmark script and runtime patch). No MTP or model prefix concerns.
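
The masked reduction described in the EP route filtering item can be illustrated in plain Python. This is a sketch of the idea only: `local_reduce` and its list-based layout are hypothetical and do not reproduce the Triton `moe_fused_mul_sum` kernel.

```python
def local_reduce(route_outputs, topk_ids, expert_map, hidden):
    """Sum per-route rows into per-token rows, skipping remote experts.

    route_outputs[t][k] is the already-weighted expert output for route k of
    token t, or None when that row was never written (remote expert).
    expert_map[g] is the local index of global expert g, or -1 if remote.
    """
    out = []
    for t, ids in enumerate(topk_ids):
        acc = [0.0] * hidden
        for k, g in enumerate(ids):
            if expert_map[g] < 0:
                continue  # remote route contributes zero; its row is never read
            row = route_outputs[t][k]
            acc = [a + r for a, r in zip(acc, row)]
        out.append(acc)
    return out


# Four global experts, of which experts 2 and 3 are local to this rank.
expert_map = [-1, -1, 0, 1]
topk_ids = [[0, 2], [3, 2]]
route_outputs = [
    [None, [1.0, 2.0]],          # token 0: route to expert 0 is remote
    [[0.5, 0.5], [1.5, 0.5]],    # token 1: both routes are local
]
print(local_reduce(route_outputs, topk_ids, expert_map, hidden=2))
```

Because remote routes are skipped before their rows are touched, unwritten workspace entries (modelled here as `None`) cannot reach the output.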

@Oseltamivir Oseltamivir changed the title from "perf(vllm): fuse MiniMax M3 BF16 EP experts on MI300X" to "perf(vllm): compact MiniMax M3 EP decode routes on MI300X" Jun 16, 2026
@Oseltamivir Oseltamivir marked this pull request as ready for review June 17, 2026 01:09
claude Bot commented Jun 17, 2026

Contributor

Claude finished @Oseltamivir's task in 4m 57s


Review of PR #1782

  • Read repository guidelines (CLAUDE.md)
  • Gather context and read changed files
  • Review benchmark shell script changes
  • Review runtime patch file
  • Validate against PR review checklist
  • Post review summary

LGTM — no blocking issues found.

Summary: This is a well-scoped incremental EP8 decode optimization stacked on #1753. The patch adds two complementary paths: short-context EP8 routes to Mxfp8NativeTritonExperts with the profiled native/BF16 dispatch; long-context EP8 routes to Mxfp8EmulationTritonExperts which now compacts decode batches (≤256 tokens) to locally-owned experts with BM16 tiles, bypassing the overpadded generic path. Key observations:

  • Local-route compaction in _apply_sparse_mi300x_ep: The _prepare_expert_assignment call with ignore_invalid_experts=True and num_local_experts=E correctly drops remote-expert routes from sorted_token_ids. Buffer management is sound — workspace2 is reused for intermediate_gate_up (GEMM1 output) and intermediate_output (GEMM2 output) with the activation fully consuming the former before the latter is written. GEMM2 passes top_k=1 (treating each route as an independent token) with mul_routed_weight=True, then moe_fused_mul_sum with apply_weights=False sums per-token without double-applying weights, zeroing non-local experts via expert_map.
  • _max_post_padded bound: The three-way min(allocation, M_routed * block_m, M_routed + E * (block_m - 1)) is sound — the first is the buffer size, the second the original bound, the third the tighter per-expert padding bound. The block-alignment floor is correct.
  • moe_align_block_size tightening: When ignore_invalid_experts, expert_map, and num_local_experts are all set, padding allocation drops from global_experts * (block_size - 1) to local_experts * (block_size - 1). The 0 < num_local_experts <= num_experts validation prevents misuse.
  • Fused top-k atomic in _mxfp8_grouped_gemm_*_kernel: c_row = offs_token // top_k correctly maps route-indexed rows to token-indexed output, tl.atomic_add with zero-initialized output accumulates concurrent routes, and the M_routed % top_k divisibility check is validated before launch.
  • Route-aware SwiGLU kernel (_swiglu_oai_quant_routed_kernel): Processes only locally-routed rows via sorted_token_ids, with proper padding/remote masking. Gate is clamped from above only (gate * sigmoid → 0 for negative gate, so lower clamp is a no-op), up is symmetrically clamped — matching the SwiGLU-OAI numeric contract.
  • Oracle routing: Short-context EP8 (≤4096 max_model_len) → Mxfp8NativeTritonExperts; long-context EP8 → Mxfp8EmulationTritonExperts. Both guards are scoped to the exact profiled MiniMax M3 gfx94x shape. The bf16_weights_available flag prevents using uninitialized BF16 weights in long-context EP8 where they aren't retained.
  • Decode gating: The use_sparse_ep predicate in Mxfp8EmulationTritonExperts.apply correctly gates on model match, BF16 dtype, ≤256 tokens, SwiGLU activation, expert_map presence, no router-weight-on-input, and no LoRA. Prefill and mixed batches fall through to the generic TritonExperts path.
  • Benchmark script: EP patch application follows the existing idempotent marker pattern. EP flag is correctly conditional on EP_SIZE. No master config or perf-changelog changes are included (as documented in scope).
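
The padding arithmetic behind the `_max_post_padded` and `moe_align_block_size` items can be sketched as follows. The function names and signatures here are illustrative, not the patch's, and rounding the bound down to a multiple of `block_m` is an assumption about what "block-alignment floor" means.

```python
def padded_allocation(num_routes: int, num_experts: int, block_size: int) -> int:
    """Worst-case sorted-id buffer: each expert may need block_size - 1 pad rows."""
    return num_routes + num_experts * (block_size - 1)


def max_post_padded(allocation: int, m_routed: int, num_local_experts: int,
                    block_m: int) -> int:
    """Tightest of three independent upper bounds, floored to a block multiple."""
    bound = min(
        allocation,                                    # cannot exceed the buffer
        m_routed * block_m,                            # every route in its own block
        m_routed + num_local_experts * (block_m - 1),  # per-expert padding limit
    )
    return (bound // block_m) * block_m


routes, block = 864, 16
print(padded_allocation(routes, 128, block))   # padding sized for all global experts
print(padded_allocation(routes, 16, block))    # padding sized for local experts only
print(max_post_padded(padded_allocation(routes, 16, block), routes, 16, block))
```

For 864 routes and 16-row blocks, sizing padding for 16 local experts instead of 128 global experts shrinks the allocation from 2784 to 1104 entries.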

@Oseltamivir Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch from d1638a0 to 465ff47 Compare June 17, 2026 20:51
@Oseltamivir Oseltamivir force-pushed the feat/m3-mi300x-mxfp8 branch 6 times, most recently from 95e79da to 27510c4 Compare June 17, 2026 21:47