
Commit d3a617f

JohnQinAMD, ZhengGong-amd, cursoragent, and claude committed
minimaxm3-fp8-mi300x-vllm: enable AITER kernels for MXFP8 on MI300X
Enable AITER on MI300X/gfx942 for MiniMax-M3 MXFP8 via the single master toggle VLLM_ROCM_USE_AITER=1. The per-component AITER flags (_MOE, _LINEAR, _RMSNORM, _FP8BMM) default to True and are gated behind the master flag, so they are left at their defaults. VLLM_ROCM_USE_AITER_MHA defaults to True and is explicitly set to 0 to keep attention on TRITON_ATTN, since the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention.

Also set AMD-recommended numerically-inert MI300X runtime knobs: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS=112 (RCCL channels, raised above the ~32-64 default for TP8), GPU_MAX_HW_QUEUES=2 (HIP streams, capped below the default of 4).

All changes are kernel-selection/runtime only; GSM8K holds ~0.95. Measured uplift (8xMI300X, 1k1k, total tok/s/gpu): +5.6..+10.8% across conc 4..256; conc 1-2 unchanged (latency-bound).

Co-authored-by: Gong Zheng <zgong@amd.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 6462ac0 commit d3a617f

2 files changed

Lines changed: 15 additions & 0 deletions


benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh

Lines changed: 7 additions & 0 deletions
@@ -34,6 +34,13 @@ SERVER_LOG=/workspace/server.log
 export VLLM_ENGINE_READY_TIMEOUT_S=3600
 export VLLM_USE_BREAKABLE_CUDAGRAPH=0

+export VLLM_ROCM_USE_AITER=1
+export VLLM_ROCM_USE_AITER_MHA=0
+
+export TORCH_BLAS_PREFER_HIPBLASLT=1
+export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"
+export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"
+
 if [ "${EVAL_ONLY}" = "true" ]; then
     setup_eval_context
 fi
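
The two RCCL/HIP knobs in this hunk use the `${VAR:-default}` expansion, so a value already present in the caller's environment takes precedence over the value chosen here, while TORCH_BLAS_PREFER_HIPBLASLT and the AITER flags are set unconditionally. A minimal sketch of that behavior follows; `apply_runtime_knobs` is a hypothetical wrapper used only for illustration and does not exist in the benchmark script.

```shell
# Hypothetical wrapper around the three runtime-knob exports from the diff.
apply_runtime_knobs() {
    # Always forced on, regardless of the caller's environment.
    export TORCH_BLAS_PREFER_HIPBLASLT=1
    # ${VAR:-default}: keep the caller's value if set and non-empty,
    # otherwise fall back to the default chosen in this commit.
    export NCCL_MIN_NCHANNELS="${NCCL_MIN_NCHANNELS:-112}"
    export GPU_MAX_HW_QUEUES="${GPU_MAX_HW_QUEUES:-2}"
}

# Case 1: nothing set by the caller, so the commit's defaults apply.
unset NCCL_MIN_NCHANNELS GPU_MAX_HW_QUEUES
apply_runtime_knobs
echo "NCCL_MIN_NCHANNELS=${NCCL_MIN_NCHANNELS} GPU_MAX_HW_QUEUES=${GPU_MAX_HW_QUEUES}"
# prints: NCCL_MIN_NCHANNELS=112 GPU_MAX_HW_QUEUES=2

# Case 2: the caller pins GPU_MAX_HW_QUEUES, which overrides the default.
unset NCCL_MIN_NCHANNELS
GPU_MAX_HW_QUEUES=4
apply_runtime_knobs
echo "NCCL_MIN_NCHANNELS=${NCCL_MIN_NCHANNELS} GPU_MAX_HW_QUEUES=${GPU_MAX_HW_QUEUES}"
# prints: NCCL_MIN_NCHANNELS=112 GPU_MAX_HW_QUEUES=4
```

Because `:-` also treats an empty string as unset, exporting `NCCL_MIN_NCHANNELS=""` before the script runs would still yield 112.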

perf-changelog.yaml

Lines changed: 8 additions & 0 deletions
@@ -3871,3 +3871,11 @@
     - "Image: vllm/vllm-openai-rocm:nightly"
     - "Add more sweep points"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1785
+
+- config-keys:
+    - minimaxm3-fp8-mi300x-vllm
+  description:
+    - "Enable AITER kernels for MiniMax-M3 MXFP8 on MI300X/gfx942 via the single master toggle VLLM_ROCM_USE_AITER=1: the stock image left it unset, so the hot decode GEMMs and fused MoE ran on the generic kernels. The per-component AITER flags (MoE, linear, RMSNorm, FP8 batched-GEMM) default to True and are gated behind the master flag, so they are left at their defaults. Keep attention on TRITON_ATTN (VLLM_ROCM_USE_AITER_MHA=0, which defaults to True) because the MXFP8 checkpoint lacks calibrated q/prob scales for ROCm FP8 attention."
+    - "Add AMD-recommended, numerically-inert MI300X runtime knobs: TORCH_BLAS_PREFER_HIPBLASLT=1, NCCL_MIN_NCHANNELS=112 (raises RCCL channels above the ~32-64 default for TP8), GPU_MAX_HW_QUEUES=2 (caps HIP streams below the default of 4)."
+    - "Measured uplift on 8xMI300X, 1k1k random sweep (total tok/s/gpu): conc256 782.7->856.1 (+9.4%), conc128 598.9->637.0 (+6.4%), conc64 365.1->392.0 (+7.4%), conc32 295.6->327.4 (+10.8%), conc16 203.1->216.5 (+6.6%), conc8 127.6->136.6 (+7.1%), conc4 80.1->84.6 (+5.6%); conc1-2 unchanged (latency-bound). GSM8K exact-match holds at ~0.95 (kernel-selection change only)."
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING
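
The uplift percentages in the changelog entry are relative changes in total tok/s/gpu, (new - baseline) / baseline * 100. The sketch below recomputes three of them from the before/after numbers quoted in the entry; `uplift_pct` is a hypothetical helper for checking the arithmetic, not part of the repository's tooling.

```shell
# Hypothetical helper: relative uplift between two throughput measurements.
# $1 = baseline tok/s/gpu, $2 = new tok/s/gpu
uplift_pct() {
    # LC_ALL=C keeps "." as the decimal separator regardless of locale.
    LC_ALL=C awk -v b="$1" -v n="$2" \
        'BEGIN { printf "%+.1f%%\n", (n - b) / b * 100 }'
}

uplift_pct 782.7 856.1   # conc256, prints +9.4%
uplift_pct 295.6 327.4   # conc32,  prints +10.8%
uplift_pct 80.1 84.6     # conc4,   prints +5.6%
```

The smallest and largest values here, +5.6% and +10.8%, match the range given in the commit message for conc 4..256.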
