Commit b990fb6

MinimaxM2.5-FP8-MI325x-vLLM: pin AITER FA attention backend
vLLM PR #36702 (between v0.18.0 and v0.21.0) flipped the dense full-attention default on ROCm from ROCM_AITER_FA to ROCM_ATTN, causing a ~38% throughput regression for MiniMax-M2.5 FP8 on MI325X (vllm-project/vllm#43029).

Align benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh with the merged upstream recipe (vllm-project/recipes#481) to restore the v0.18.0 attention path on the v0.21.0 image:

- export VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1 (asm/hip paged-attention auto-dispatch)
- pass --attention-backend ROCM_AITER_FA to vllm serve

1 parent e0cd8f7 commit b990fb6

1 file changed

Lines changed: 2 additions & 2 deletions

benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh
@@ -24,9 +24,8 @@ if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
     export HIP_VISIBLE_DEVICES="$ROCR_VISIBLE_DEVICES"
 fi
 
-# following AMD andy's recipe
-# https://www.linkedin.com/posts/andyluo77_day-0-support-of-minimax-25-on-amd-gpu-activity-7428151527309025280-hXR8/
 export VLLM_ROCM_USE_AITER=1
+export VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1
 
 SERVER_LOG=/workspace/server.log
 PORT=${PORT:-8888}
@@ -52,6 +51,7 @@ $EP \
 --max-model-len $MAX_MODEL_LEN \
 --block-size=32 \
 --no-enable-prefix-caching \
+--attention-backend ROCM_AITER_FA \
 --trust-remote-code > $SERVER_LOG 2>&1 &
 
 SERVER_PID=$!
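
The two changes in this diff can be exercised side by side when checking the regression described in the commit message. The following is a minimal sketch, not part of the benchmark script: `attention_args` is a hypothetical helper introduced only for illustration, and it assumes `--attention-backend` accepts `ROCM_ATTN` in the same way it accepts `ROCM_AITER_FA`. It only prints the commands rather than launching a server.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in the commit): emit the attention-backend flag,
# defaulting to the backend this commit pins.
attention_args() {
    local backend="${1:-ROCM_AITER_FA}"
    printf -- '--attention-backend %s\n' "$backend"
}

# Environment variables set by the benchmark script after this commit.
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1

# Print the pinned invocation and the v0.21.0-default one for an A/B run.
# The "..." stands for the remaining vllm serve arguments, left elided here.
echo "pinned:   vllm serve ... $(attention_args)"
echo "baseline: vllm serve ... $(attention_args ROCM_ATTN)"
```

Keeping the backend as a single parameter makes it straightforward to rerun the same benchmark against both attention paths and compare throughput, which is the comparison the ~38% figure in the commit message refers to.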