Commit 1b71db4

MinimaxM2.5-FP8-MI325x-vLLM: pin AITER FA attention backend (#1594)
* MinimaxM2.5-FP8-MI325x-vLLM: pin AITER FA attention backend

  vLLM PR #36702 (between v0.18.0 and v0.21.0) flipped the dense full-attention default on ROCm from ROCM_AITER_FA to ROCM_ATTN, causing a ~38% throughput regression for MiniMax-M2.5 FP8 on MI325X (vllm-project/vllm#43029).

  Align benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh with the merged upstream recipe (vllm-project/recipes#481) to restore the v0.18.0 attention path on the v0.21.0 image:
  - export VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1 (asm/hip paged-attention auto-dispatch)
  - pass --attention-backend ROCM_AITER_FA to vllm serve

* Update the perf-changelog

* runners/launch_mi325x-amds.sh: propagate srun failures

* minimaxm2.5-fp8-mi325x-vllm: align with upstream MiniMax-M2.5 ROCm recipe

* runners/launch_mi325x-amds.sh: derive PORT per job; sudo -n in cleanup

  Use `40000 + (JOB_ID % 10000)` instead of a hard-coded 8888; a non-SLURM Docker workload on chi-mi325x-pod1-019 bound :8888 and made every sweep job scheduled there fail in sock.bind() with EADDRINUSE before vLLM ran. Also harden the benchmark_logs trap with `sudo -n` so it fails fast under a non-tty instead of hanging.

* minimaxm2.5-fp8-mi325x-vllm: gate SHUFFLE_KV_CACHE_LAYOUT per (TP, CONC)

  Set VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1 (recipes#481 pillar 2) only at shapes where AITER's gfx942 ASM paged-attn kernel exists: TP=2 EP=1 CONC<=16, TP=8 EP=8 CONC<=64. Above those, pa_fwd_asm hits `get_heuristic_kernel: cannot get heuristic kernel!` (gqa=6, block_size=32, qTile=0) and HTTP-500s every request. Mirrors the per-shape toggle in the mi355x sibling. vllm#43029, sweep run 26692603804.
1 parent 48c1840 commit 1b71db4
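
The per-job port described in the commit message maps a SLURM job ID into the range 40000 to 49999. The following is a minimal sketch of that arithmetic, assuming a hypothetical helper named `port_for_job`; the launcher itself computes the value inline rather than through a function.

```shell
#!/usr/bin/env bash
# port_for_job is a hypothetical helper used only for illustration;
# it mirrors the inline expression 40000 + (JOB_ID % 10000).
port_for_job() {
    local job_id=$1
    echo $(( 40000 + (job_id % 10000) ))
}

port_for_job 12345   # prints 42345
port_for_job 20000   # prints 40000
```

Two job IDs that differ by a multiple of 10000 map to the same port, and nothing here checks that the port is actually free, so the scheme lowers the chance of an EADDRINUSE collision rather than ruling it out.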

4 files changed

Lines changed: 45 additions & 11 deletions

.github/configs/amd-master.yaml

Lines changed: 1 addition & 1 deletion
@@ -1038,7 +1038,7 @@ minimaxm2.5-fp8-mi300x-vllm-agentic:
     - { tp: 4, offloading: cpu, conc-list: [16, 20, 24, 28, 32] }
 
 minimaxm2.5-fp8-mi325x-vllm:
-  image: vllm/vllm-openai-rocm:v0.21.0
+  image: vllm/vllm-openai-rocm:v0.20.2
   model: MiniMaxAI/MiniMax-M2.5
   model-prefix: minimaxm2.5
   runner: mi325x

benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh

Lines changed: 12 additions & 2 deletions
@@ -24,10 +24,18 @@ if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
     export HIP_VISIBLE_DEVICES="$ROCR_VISIBLE_DEVICES"
 fi
 
-# following AMD andy's recipe
-# https://www.linkedin.com/posts/andyluo77_day-0-support-of-minimax-25-on-amd-gpu-activity-7428151527309025280-hXR8/
 export VLLM_ROCM_USE_AITER=1
 
+ENABLE_SHUFFLE_KV_CACHE_LAYOUT=0
+if [[ "$TP" == "2" && "$EP_SIZE" == "1" ]] && (( CONC <= 16 )); then
+    ENABLE_SHUFFLE_KV_CACHE_LAYOUT=1
+elif [[ "$TP" == "8" && "$EP_SIZE" == "8" ]] && (( CONC <= 64 )); then
+    ENABLE_SHUFFLE_KV_CACHE_LAYOUT=1
+fi
+if (( ENABLE_SHUFFLE_KV_CACHE_LAYOUT )); then
+    export VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1
+fi
+
 SERVER_LOG=/workspace/server.log
 PORT=${PORT:-8888}
 
@@ -52,6 +60,8 @@ $EP \
 --max-model-len $MAX_MODEL_LEN \
 --block-size=32 \
 --no-enable-prefix-caching \
+--attention-backend ROCM_AITER_FA \
+--compilation-config '{"mode":3,"cudagraph_mode":"PIECEWISE"}' \
 --trust-remote-code > $SERVER_LOG 2>&1 &
 
 SERVER_PID=$!
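
The per-shape gate added above can be exercised in isolation. This is a minimal sketch that wraps the same conditions in a hypothetical function, `shuffle_kv_enabled`, taking TP, EP size and concurrency as positional arguments; the benchmark script inlines the checks on `$TP`, `$EP_SIZE` and `$CONC` instead.

```shell
#!/usr/bin/env bash
# shuffle_kv_enabled is a hypothetical helper used only for illustration.
# It prints 1 when the shuffled KV-cache layout would be enabled for the
# given (TP, EP, CONC) shape and 0 otherwise, using the thresholds from
# the diff above.
shuffle_kv_enabled() {
    local tp=$1 ep=$2 conc=$3
    if [[ "$tp" == "2" && "$ep" == "1" ]] && (( conc <= 16 )); then
        echo 1
    elif [[ "$tp" == "8" && "$ep" == "8" ]] && (( conc <= 64 )); then
        echo 1
    else
        echo 0
    fi
}

shuffle_kv_enabled 2 1 16    # prints 1
shuffle_kv_enabled 2 1 32    # prints 0
shuffle_kv_enabled 8 8 64    # prints 1
shuffle_kv_enabled 4 1 8     # prints 0
```

Any shape not listed explicitly, such as TP=4, falls through to 0, so the default is to leave VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT unset.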

perf-changelog.yaml

Lines changed: 11 additions & 0 deletions
@@ -3228,3 +3228,14 @@
     - "Picks up the fix for the GSM8K accuracy regression reported in sgl-project/sglang#25742 (v0.5.12-20260517 collapsed to ~0.32 at TP=2)"
     - "Local eval-only runs on MI355X recover to gsm8k strict-match 0.975 at TP=2/conc=64 and 0.974 at TP=4/conc=16, well above the 0.92 upstream gate added in sgl-project/sglang#26396"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1593
+
+- config-keys:
+    - minimaxm2.5-fp8-mi325x-vllm
+  description:
+    - "Pin AITER FA attention backend in benchmarks/single_node/minimaxm2.5_fp8_mi325x.sh to recover the ~38% MI325X throughput regression introduced when vLLM PR #36702 (between v0.18.0 and v0.21.0) flipped the dense full-attention default on ROCm from ROCM_AITER_FA to ROCM_ATTN (vllm-project/vllm#43029)"
+    - "Export VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1 to enable the AITER asm/hip paged-attention auto-dispatch"
+    - "Pass --attention-backend ROCM_AITER_FA to vllm serve, aligning with the merged upstream MiniMax ROCm recipe (vllm-project/recipes#481)"
+    - "Pin image vllm/vllm-openai-rocm:v0.20.2 — the version the upstream recipe explicitly validates (`min_vllm_version: 0.20.2`). v0.21.0 separately crashes during AITER MoE CUDA-graph capture on MiniMax-M2.5 (silent worker death, `Engine core initialization failed`) reproducible via the recipe's exact flags; v0.20.2 + recipe completes a 100-prompt vllm bench serve cleanly at 2030 tok/s total throughput on MI325X (TP=4)"
+    - "Add --compilation-config '{\"mode\":3,\"cudagraph_mode\":\"PIECEWISE\"}' to vllm serve, mirroring `model.base_args` from the upstream recipe. `pass_config.fuse_minimax_qk_norm` from the recipe is intentionally omitted — it triggers an upstream NameError on ROCm because vllm/compilation/passes/pass_manager.py imports MiniMaxQKNormPass under `is_cuda()` (NVIDIA-only) while using it unconditionally"
+    - "Conditionally enable VLLM_ROCM_SHUFFLE_KV_CACHE_LAYOUT=1 per (TP, EP, CONC) — on for shapes where the AITER ASM paged-attention kernel exists in the gfx942 heuristic table (TP=2 EP=1 CONC<=16, TP=8 EP=8 CONC<=64), off otherwise. Above the thresholds vllm/v1/attention/backends/rocm_aiter_fa.py routes decode through aiter pa_fwd_asm and crashes with `RuntimeError: get_heuristic_kernel: cannot get heuristic kernel!` for MiniMax-M2.5's attention shape (gqa=6 block_size=32 qTile=0); below them the ASM auto-dispatch is the perf win the recipe wants. Thresholds confirmed across 17 bench cells + 3 eval cells in PR #1594 sweep run 26692603804. Mirrors the per-shape toggle pattern in benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh; can collapse to unconditional SHUFFLE=1 once AITER registers the missing kernel on gfx942"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1594

runners/launch_mi325x-amds.sh

Lines changed: 21 additions & 8 deletions
@@ -1,39 +1,52 @@
 #!/usr/bin/env bash
+set -euo pipefail
 
 export HF_HUB_CACHE_MOUNT="/nfsdata/sa/gharunner/gharunners/hf-hub-cache/"
-export PORT=8888
 
 PARTITION="compute"
 SQUASH_FILE="/nfsdata/sa/gharunner/gharunners/squash/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh"
 LOCK_FILE="${SQUASH_FILE}.lock"
 
+cleanup_stale_benchmark_logs() {
+    if [[ -n "${GITHUB_WORKSPACE:-}" ]]; then
+        sudo -n rm -rf "$GITHUB_WORKSPACE/benchmark_logs" 2>/dev/null || \
+            rm -rf "$GITHUB_WORKSPACE/benchmark_logs" 2>/dev/null || true
+    fi
+}
+cleanup_stale_benchmark_logs
+
 set -x
 
 # Exclude known-broken mi325x nodes:
 # chi-mi325x-pod1-121: enroot-aufs2ovlfs setcap fails on this node's NFS-backed
 # squash dir; container image import never completes
 # (root-caused via #1467/#1468/#1469 sweep failures).
-JOB_ID=$(salloc --partition=$PARTITION --exclude=chi-mi325x-pod1-121.ord.vultr.cpe.ice.amd.com --gres=gpu:$TP --cpus-per-task=256 --time=480 --no-shell --job-name="$RUNNER_NAME" 2>&1 | tee /dev/stderr | grep -oP 'Granted job allocation \K[0-9]+')
+JOB_ID=$(set +o pipefail; salloc --partition=$PARTITION --exclude=chi-mi325x-pod1-121.ord.vultr.cpe.ice.amd.com --gres=gpu:$TP --cpus-per-task=256 --time=480 --no-shell --job-name="$RUNNER_NAME" 2>&1 | tee /dev/stderr | grep -oP 'Granted job allocation \K[0-9]+')
 
 if [ -z "$JOB_ID" ]; then
-    echo "ERROR: salloc failed to allocate a job"
+    echo "ERROR: salloc failed to allocate a job" >&2
     exit 1
 fi
 
+export PORT=$(( 40000 + (JOB_ID % 10000) ))
+
+trap 'rc=$?; scancel "$JOB_ID" 2>/dev/null || true; cleanup_stale_benchmark_logs; exit "$rc"' EXIT
+
 # Use flock to serialize concurrent imports to the same squash file
-srun --jobid=$JOB_ID --job-name="$RUNNER_NAME" bash -c "
+srun --jobid="$JOB_ID" --job-name="$RUNNER_NAME" bash -c "
+    set -euo pipefail
     exec 9>\"$LOCK_FILE\"
-    flock -w 600 9 || { echo 'Failed to acquire lock for $SQUASH_FILE'; exit 1; }
+    flock -w 600 9 || { echo 'Failed to acquire lock for $SQUASH_FILE' >&2; exit 1; }
     if unsquashfs -l \"$SQUASH_FILE\" > /dev/null 2>&1; then
         echo 'Squash file already exists and is valid, skipping import'
     else
         rm -f \"$SQUASH_FILE\"
         enroot import -o \"$SQUASH_FILE\" docker://$IMAGE
     fi
 "
-srun --jobid=$JOB_ID \
---container-image=$SQUASH_FILE \
---container-mounts=$GITHUB_WORKSPACE:/workspace/,$HF_HUB_CACHE_MOUNT:$HF_HUB_CACHE \
+srun --jobid="$JOB_ID" \
+--container-image="$SQUASH_FILE" \
+--container-mounts="$GITHUB_WORKSPACE:/workspace/,$HF_HUB_CACHE_MOUNT:$HF_HUB_CACHE" \
 --container-mount-home \
 --container-writable \
 --container-remap-root \