MinimaxM2.5-FP8-MI325x-vLLM: pin AITER FA attention backend

chunfangamd · chunfangamd · commit b990fb6b7288 · 2026-05-30T12:50:39.000Z
vLLM PR #36702 (between v0.18.0 and v0.21.0) flipped the dense full-attention default on ROCm from ROCM_AITER_FA to ROCM_ATTN, causing a ~38% throughput regression for MiniMax-M2.5 FP8 on MI325X (vllm-project/vllm#43029).

Align benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh with the merged upstream recipe (vllm-project/recipes#481) to restore the v0.18.0 attention path on the v0.21.0 image:

- export VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1 (asm/hip paged-attention auto-dispatch)
- pass --attention-backend ROCM_AITER_FA to vllm serve
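
For review convenience, the two settings can be checked together in a dry-run sketch that only prints the resulting command and never launches a server. The MODEL placeholder and the SERVE_ARGS array are hypothetical names used for illustration; the benchmark script passes more options than shown here, and only the environment variables and flags that appear in this commit are included.

```shell
#!/usr/bin/env bash
# Dry-run sketch: assemble the settings touched by this commit and print
# the resulting command line without starting vLLM.
set -euo pipefail

# Hypothetical placeholder; the real script supplies its own model path.
MODEL="${MODEL:-MODEL_PLACEHOLDER}"

# Environment from the benchmark script; the second export is new in this commit.
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1

# Subset of the serve flags visible in the diff; the backend pin is the new one.
SERVE_ARGS=(
  --block-size=32
  --no-enable-prefix-caching
  --attention-backend ROCM_AITER_FA
  --trust-remote-code
)

# Print instead of executing, so the sketch runs on machines without ROCm.
echo "vllm serve ${MODEL} ${SERVE_ARGS[*]}"
```

Keeping the flags in an array makes it easy to diff the pinned backend against the upstream recipe before running the full benchmark.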
diff --git a/benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh b/benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh
@@ -24,9 +24,8 @@ if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
     export HIP_VISIBLE_DEVICES="$ROCR_VISIBLE_DEVICES"
 fi
 
-# following AMD andy's recipe 
-# https://www.linkedin.com/posts/andyluo77_day-0-support-of-minimax-25-on-amd-gpu-activity-7428151527309025280-hXR8/
 export VLLM_ROCM_USE_AITER=1
+export VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1
 
 SERVER_LOG=/workspace/server.log
 PORT=${PORT:-8888}
@@ -52,6 +51,7 @@ $EP \
 --max-model-len $MAX_MODEL_LEN \
 --block-size=32 \
 --no-enable-prefix-caching \
+--attention-backend ROCM_AITER_FA \
 --trust-remote-code > $SERVER_LOG 2>&1 &
 
 SERVER_PID=$!