minimaxm3-fp8-mi300x-vllm: enable AITER kernels for MXFP8 on MI300X

JohnQinAMD · ZhengGong-amd · cursoragent · JohnQinAMD · commit d3a617fbb101 · 2026-06-16T15:40:24.000Z
Enable AITER on MI300X/gfx942 for MiniMax-M3 MXFP8 via the single master
toggle VLLM_ROCM_USE_AITER=1. The per-component AITER flags (_MOE, _LINEAR,
_RMSNORM, _FP8BMM) default to True and are gated behind the master flag, so
they are left at their defaults. VLLM_ROCM_USE_AITER_MHA defaults to True and
is explicitly set to 0 to keep attention on TRITON_ATTN, since the MXFP8
checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.
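
The toggle layout above can be sanity-checked before the server starts. A
minimal sketch, assuming a hypothetical helper named check_aiter_env (it is
not part of this commit) that inspects only the two variables set here:

```shell
# Hypothetical pre-launch check; check_aiter_env is an illustrative name
# and is not defined anywhere in the benchmark script.
check_aiter_env() {
    # The per-component flags are gated behind the master toggle, so it must be 1.
    if [ "${VLLM_ROCM_USE_AITER:-0}" != "1" ]; then
        echo "AITER master toggle is off" >&2
        return 1
    fi
    # MHA defaults to True, so it has to be disabled explicitly.
    if [ "${VLLM_ROCM_USE_AITER_MHA:-1}" != "0" ]; then
        echo "AITER MHA is still enabled" >&2
        return 1
    fi
    echo "AITER on, attention left on TRITON_ATTN"
}
```

Calling check_aiter_env right after the exports would fail fast if a later
edit dropped the MHA override.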

Also set three AMD-recommended, numerically inert MI300X runtime knobs:
TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS=112 (RCCL channels, raised
above the ~32-64 default for TP8), and GPU_MAX_HW_QUEUES=2 (HIP streams,
capped below the default of 4). All changes are kernel-selection/runtime
only; GSM8K exact-match holds at ~0.95.
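
In the script, NCCL_MIN_NCHANNELS and GPU_MAX_HW_QUEUES are written with the
`${VAR:-default}` expansion, so a value already present in the caller's
environment wins, while TORCH_BLAS_PREFER_HIPBLASLT is set unconditionally. A
minimal sketch of that behavior, wrapped in a hypothetical function name for
illustration:

```shell
# Illustrative only: apply_runtime_knobs is a hypothetical wrapper around the
# three exports added to the benchmark script.
apply_runtime_knobs() {
    export TORCH_BLAS_PREFER_HIPBLASLT=1                    # always forced to 1
    export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"  # caller value wins
    export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"      # caller value wins
}

# Default path: nothing preset, so the values from this commit apply.
unset NCCL_MIN_NCHANNELS GPU_MAX_HW_QUEUES
apply_runtime_knobs
echo "$NCCL_MIN_NCHANNELS $GPU_MAX_HW_QUEUES"   # prints "112 2"
```

Exporting NCCL_MIN_NCHANNELS=64 before the call would leave it at 64, which
keeps the knobs sweepable without editing the script.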

Measured uplift (8xMI300X, 1k1k, total tok/s/gpu): +5.6% to +10.8% across
concurrency 4 to 256; concurrency 1-2 unchanged (latency-bound).
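
The percentages are plain relative changes in throughput. A minimal sketch for
re-deriving them, assuming a hypothetical helper named uplift and using the
conc256 and conc4 figures recorded in the perf-changelog.yaml entry of this
commit:

```shell
# uplift BEFORE AFTER: print the relative change as a signed percentage.
# uplift is a hypothetical helper; awk does the floating-point arithmetic.
uplift() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%+.1f%%\n", (after - before) / before * 100 }'
}

uplift 782.7 856.1   # conc256, prints +9.4%
uplift 80.1 84.6     # conc4, prints +5.6%
```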

Co-authored-by: Gong Zheng <zgong@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh b/benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh
@@ -34,6 +34,13 @@ SERVER_LOG=/workspace/server.log
 export VLLM_ENGINE_READY_TIMEOUT_S=3600
 export VLLM_USE_BREAKABLE_CUDAGRAPH=0
 
+export VLLM_ROCM_USE_AITER=1
+export VLLM_ROCM_USE_AITER_MHA=0
+
+export TORCH_BLAS_PREFER_HIPBLASLT=1
+export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"
+export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"
+
 if [ "${EVAL_ONLY}" = "true" ]; then
     setup_eval_context
 fi
diff --git a/perf-changelog.yaml b/perf-changelog.yaml
@@ -3871,3 +3871,11 @@
     - "Image: vllm/vllm-openai-rocm:nightly"
     - "Add more sweep points"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1785
+
+- config-keys:
+    - minimaxm3-fp8-mi300x-vllm
+  description:
+    - "Enable AITER kernels for MiniMax-M3 MXFP8 on MI300X/gfx942 via the single master toggle VLLM_ROCM_USE_AITER=1: the stock image left it unset, so the hot decode GEMMs and fused MoE ran on the generic kernels. The per-component AITER flags (MoE, linear, RMSNorm, FP8 batched-GEMM) default to True and are gated behind the master flag, so they are left at their defaults. Keep attention on TRITON_ATTN by setting VLLM_ROCM_USE_AITER_MHA=0 (the flag defaults to True), because the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention."
+    - "Add AMD-recommended, numerically-inert MI300X runtime knobs: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS=112 (raises RCCL channels above the ~32-64 default for TP8), GPU_MAX_HW_QUEUES=2 (caps HIP streams below the default of 4)."
+    - "Measured uplift on 8xMI300X, 1k1k random sweep (total tok/s/gpu): conc256 782.7->856.1 (+9.4%), conc128 598.9->637.0 (+6.4%), conc64 365.1->392.0 (+7.4%), conc32 295.6->327.4 (+10.8%), conc16 203.1->216.5 (+6.6%), conc8 127.6->136.6 (+7.1%), conc4 80.1->84.6 (+5.6%); conc1-2 unchanged (latency-bound). GSM8K exact-match holds at ~0.95 (kernel-selection change only)."
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING