Commit cc78fc9

andyluo7, indianspeedster, and seungrokj authored
[Klaud Cold] Add minimaxm3-fp4-mi355x-atom (upstream branch for full-sweep validation) (#1813)
* minimaxm3-fp4-mi355x-atom: day-zero MiniMax-M3 MXFP4 MI355X atom recipe

  Smoke-tested on MI355X (mia1-p01-g07): TP4 conc-1 1k1k served and benched clean (mean TPOT 6.8ms). KV cache left at default dtype: amd/MiniMax-M3-MXFP4 has no calibrated FP8 KV scales, so --kv_cache_dtype fp8 asserts in the MSA fused_qknorm kernel.

* minimaxm3-fp4-mi355x-atom: route amd/MiniMax-M3* weights to NFS cache
* minimaxm3-fp4-mi355x-atom: fill perf-changelog pr-link
* minimaxm3-fp4-mi355x-atom: use matrix MAX_MODEL_LEN (isl+osl+256)
* trigger full-sweep validation
* perf-changelog: point minimaxm3-fp4-mi355x-atom pr-link to upstream PR #1813

---------

Co-authored-by: shekhar <shekhar.pandey@amd.com>
Co-authored-by: seungrokj <144636725+seungrokj@users.noreply.github.com>
1 parent 529a500 commit cc78fc9

4 files changed

Lines changed: 121 additions & 2 deletions

.github/configs/amd-master.yaml

Lines changed: 22 additions & 0 deletions
@@ -2851,6 +2851,28 @@ minimaxm3-fp8-mi355x-vllm-mtp:
       - { tp: 4, conc-start: 1, conc-end: 64, spec-decoding: mtp }
       - { tp: 8, ep: 8, dp-attn: true, conc-start: 128, conc-end: 256, spec-decoding: mtp }
 
+# MiniMax-M3 MXFP4 MI355X atom recipe:
+# https://github.com/ROCm/ATOM/blob/5d42d49f9e4292e5b61475917e92e7ec1b1dacb7/recipes/MiniMax-M3.md
+# block size 128 is mandatory for MSA. TP4 on a single gfx950 node, per the recipe.
+minimaxm3-fp4-mi355x-atom:
+  image: rocm/atom-dev:M3
+  model: amd/MiniMax-M3-MXFP4
+  model-prefix: minimaxm3
+  runner: mi355x
+  precision: fp4
+  framework: atom
+  multinode: false
+  scenarios:
+    fixed-seq-len:
+      - isl: 1024
+        osl: 1024
+        search-space:
+          - { tp: 4, conc-start: 1, conc-end: 128 }
+      - isl: 8192
+        osl: 1024
+        search-space:
+          - { tp: 4, conc-start: 1, conc-end: 128 }
+
 # MiniMax-M3 MXFP8 MI300X day-zero recipe. Reuse the dedicated ROCm image and
 # MI355X serving shape, but retain the default BF16 KV cache because this
 # checkpoint lacks calibrated ROCm FP8 attention scales. Use the TP8-only H100
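As a rough illustration of the window sizing named in the commit message (isl + osl + 256), the sketch below computes the value for the two fixed-seq-len scenarios in this recipe. The function name `max_model_len` is hypothetical and not part of the repository; the actual matrix generator may derive the value differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repo): context window for one scenario,
# following the "isl + osl + 256" rule quoted in the commit message.
max_model_len() {
    local isl=$1 osl=$2
    echo $(( isl + osl + 256 ))
}

max_model_len 1024 1024   # 1k1k scenario -> 2304
max_model_len 8192 1024   # 8k1k scenario -> 9472
```

Under this rule the server only reserves context for the benchmark prompt, the generated tokens, and a small margin, which is why eval-only jobs override the value.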
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+#!/usr/bin/env bash
+
+source "$(dirname "$0")/../../benchmark_lib.sh"
+
+check_env_vars \
+    MODEL \
+    TP \
+    CONC \
+    ISL \
+    OSL \
+    RANDOM_RANGE_RATIO \
+    RESULT_FILENAME \
+    EP_SIZE \
+    DP_ATTENTION \
+    MAX_MODEL_LEN
+
+if [[ -n "$SLURM_JOB_ID" ]]; then
+    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
+fi
+
+echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTION: $DP_ATTENTION"
+
+SERVER_LOG=/workspace/server.log
+
+export OMP_NUM_THREADS=1
+
+# Use the matrix-supplied MAX_MODEL_LEN (isl + osl + 256). Eval-only jobs need a
+# larger window for the eval prompts, so override it from the eval context.
+if [ "${EVAL_ONLY}" = "true" ]; then
+    setup_eval_context
+    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
+fi
+
+if [ "$EP_SIZE" -gt 1 ]; then
+    EP=" --enable-expert-parallel"
+else
+    EP=" "
+fi
+
+# Start GPU monitoring (power, temperature, clocks every second)
+start_gpu_monitor
+MEM_FRAC_STATIC=0.8
+
+set -x
+
+# Flags follow the ATOM MiniMax-M3 MXFP4 recipe (FP4 on 4xMI355 section):
+# https://github.com/ROCm/ATOM/blob/5d42d49f9e4292e5b61475917e92e7ec1b1dacb7/recipes/MiniMax-M3.md
+# --block-size 128 is mandatory for MiniMax MSA. KV cache is left at the default
+# dtype: amd/MiniMax-M3-MXFP4 ships no calibrated FP8 KV scales, so
+# --kv_cache_dtype fp8 trips an assertion (k_scale is None) in the MSA
+# fused_qknorm kernel during init.
+python3 -m atom.entrypoints.openai_server \
+    --model $MODEL \
+    --server-port $PORT \
+    -tp $TP \
+    --max-model-len $MAX_MODEL_LEN $EP \
+    --block-size 128 \
+    --gpu-memory-utilization $MEM_FRAC_STATIC \
+    --trust-remote-code \
+    > $SERVER_LOG 2>&1 &
+
+SERVER_PID=$!
+
+# Wait for server to be ready
+wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"
+
+export PYTHONDONTWRITEBYTECODE=1
+run_benchmark_serving \
+    --model "$MODEL" \
+    --port "$PORT" \
+    --backend vllm \
+    --input-len "$ISL" \
+    --output-len "$OSL" \
+    --random-range-ratio "$RANDOM_RANGE_RATIO" \
+    --num-prompts "$((CONC * 10))" \
+    --max-concurrency "$CONC" \
+    --result-filename "$RESULT_FILENAME" \
+    --result-dir /workspace/ \
+    --trust-remote-code
+
+# After throughput, run evaluation only if RUN_EVAL is true
+if [ "${RUN_EVAL}" = "true" ]; then
+    run_eval --framework lm-eval --port "$PORT"
+    append_lm_eval_summary
+fi
+
+# Stop GPU monitoring
+stop_gpu_monitor
+set +x
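
The script splices the optional expert-parallel flag into the server command through a string variable with a leading space, which relies on unquoted word splitting. A minimal sketch of the same conditional using a bash array is shown below. `build_server_args` is a hypothetical name and the port value is an arbitrary placeholder; the flags themselves are the ones the script already passes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the repo): emits one server argument per line.
# The expert-parallel flag is appended only when the EP size exceeds 1.
build_server_args() {
    local model=$1 port=$2 tp=$3 max_len=$4 ep_size=$5
    local -a args=(
        --model "$model"
        --server-port "$port"
        -tp "$tp"
        --max-model-len "$max_len"
        --block-size 128
    )
    if (( ep_size > 1 )); then
        args+=(--enable-expert-parallel)
    fi
    printf '%s\n' "${args[@]}"
}

# Example: TP4 without expert parallelism (port 8888 is a placeholder).
build_server_args amd/MiniMax-M3-MXFP4 8888 4 2304 1
```

A caller could collect the output with `mapfile -t args < <(build_server_args ...)` and pass `"${args[@]}"` to the server entrypoint, so every variable stays quoted.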

perf-changelog.yaml

Lines changed: 7 additions & 0 deletions
@@ -3918,6 +3918,13 @@
     - "This issue is now fixed in the latest TRTLLM release."
     - "Also update all configs for DSR1 TRTLLM FP8 to reflect latest released image usage"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1767
+
+- config-keys:
+    - minimaxm3-fp4-mi355x-atom
+  description:
+    - "Add day-zero MiniMax-M3 MXFP4 (amd/MiniMax-M3-MXFP4) single-node atom benchmark on MI355X, following the ROCm/ATOM MiniMax-M3 recipe (TP4, block size 128 for MSA, default KV cache dtype)."
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1813
+
 
 - config-keys:
     - glm5-fp4-gb300-dynamo-trt

runners/launch_mi355x-amds.sh

Lines changed: 3 additions & 2 deletions
@@ -242,8 +242,9 @@ else
 fi
 
 # MiniMax-M3 weights are not staged on the node-local /var/lib NVMe cache;
-# they are pre-downloaded once to the NFS share instead.
-if [[ "$MODEL" == MiniMaxAI/MiniMax-M3* ]]; then
+# they are pre-downloaded once to the NFS share instead. Covers both the
+# MiniMaxAI MXFP8 checkpoint and the amd MXFP4 atom checkpoint.
+if [[ "$MODEL" == MiniMaxAI/MiniMax-M3* || "$MODEL" == amd/MiniMax-M3* ]]; then
     export HF_HUB_CACHE_MOUNT="/it-share/hf-hub-cache/"
 fi
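
The pattern test in this hunk can be exercised on its own. The sketch below wraps the same `[[ ... ]]` glob match in a hypothetical function, `uses_nfs_cache`, to show which model IDs it would route to the NFS share. The second model ID in the loop is a made-up placeholder, not a real checkpoint.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not from the repo) around the glob match in the hunk:
# succeeds for MiniMax-M3 checkpoints from either the MiniMaxAI or amd org.
uses_nfs_cache() {
    local model=$1
    [[ "$model" == MiniMaxAI/MiniMax-M3* || "$model" == amd/MiniMax-M3* ]]
}

for m in amd/MiniMax-M3-MXFP4 example-org/other-model; do
    if uses_nfs_cache "$m"; then
        echo "$m -> NFS cache"
    else
        echo "$m -> default cache"
    fi
done
```

Because the right-hand side of `==` inside `[[ ]]` is unquoted, bash treats it as a glob pattern, so any suffix after `MiniMax-M3` matches.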
