Commit 298d8f9

[NV] update Minimax2.5 fp8 h100 vllm (#1516)
* update h100 minimax
* Update PR link in perf-changelog.yaml

---------

Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
1 parent 3c1eaed commit 298d8f9

3 files changed

Lines changed: 19 additions & 12 deletions

.github/configs/nvidia-master.yaml

Lines changed: 3 additions & 5 deletions
@@ -4515,7 +4515,7 @@ gptoss-fp4-h100-vllm:
     - { tp: 8, conc-start: 4, conc-end: 16 }
 
 minimaxm2.5-fp8-h100-vllm:
-  image: vllm/vllm-openai:v0.21.0
+  image: vllm/vllm-openai:v0.19.1-cu130
   model: MiniMaxAI/MiniMax-M2.5
   model-prefix: minimaxm2.5
   runner: h100
@@ -4527,13 +4527,11 @@ minimaxm2.5-fp8-h100-vllm:
   - isl: 1024
     osl: 1024
     search-space:
-    # - { tp: 8, ep: 8, conc-start: 4, conc-end: 64 }
-    - { tp: 4, ep: 4, conc-start: 4, conc-end: 64 }
+    - { tp: 8, ep: 8, conc-start: 4, conc-end: 128 }
   - isl: 8192
     osl: 1024
     search-space:
-    # - { tp: 8, ep: 8, conc-start: 4, conc-end: 64 }
-    - { tp: 4, ep: 4, conc-start: 4, conc-end: 64 }
+    - { tp: 8, ep: 8, conc-start: 4, conc-end: 128 }
 
 # Diverged from minimaxm2.5-fp8-h100-vllm (agentic-coding sibling). Metadata is
 # identical to origin/main's minimaxm2.5-fp8-h100-vllm; the split exists because this
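
Reviewer sketch: the new search-space entry raises `conc-end` from 64 to 128 while keeping `conc-start: 4`. The diff does not show how the harness steps between the two bounds, so the helper below (`expand_conc`, a hypothetical name) assumes a doubling sweep purely for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand a conc-start/conc-end pair into a list of
# concurrency levels. ASSUMPTION: the sweep doubles at each step; the real
# harness may use a different schedule.
expand_conc() {
  local c=$1 end=$2 out=()
  # Guard against a non-positive start, which would never reach the end bound.
  [ "$c" -gt 0 ] || return 1
  while [ "$c" -le "$end" ]; do
    out+=("$c")
    c=$((c * 2))
  done
  echo "${out[*]}"
}

expand_conc 4 64    # old bounds
expand_conc 4 128   # new bounds
```

Under that assumption the change adds exactly one extra point (C=128) per sequence-length config.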

benchmarks/single_node/minimaxm2.5_fp8_h100.sh

Lines changed: 6 additions & 7 deletions
@@ -9,7 +9,6 @@ check_env_vars \
 CONC \
 ISL \
 OSL \
-MAX_MODEL_LEN \
 RANDOM_RANGE_RATIO \
 RESULT_FILENAME
 
@@ -28,7 +27,6 @@ PORT=${PORT:-8888}
 
 if [ "${EVAL_ONLY}" = "true" ]; then
 setup_eval_context
-MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
 fi
 
 if [ "$EP_SIZE" -gt 1 ]; then
@@ -44,12 +42,13 @@ set -x
 vllm serve $MODEL --host 0.0.0.0 --port $PORT \
 --tensor-parallel-size=$TP \
 $EP \
---gpu-memory-utilization 0.90 \
---max-model-len $MAX_MODEL_LEN \
---max-num-seqs 256 \
---no-enable-prefix-caching \
 --trust-remote-code \
---compilation-config '{"cudagraph_mode":"PIECEWISE"}' > $SERVER_LOG 2>&1 &
+--enable-auto-tool-choice \
+--tool-call-parser minimax_m2 \
+--reasoning-parser minimax_m2_append_think \
+--compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
+--gpu-memory-utilization 0.9 \
+> $SERVER_LOG 2>&1 &
 
 SERVER_PID=$!
 
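
Reviewer sketch: one way to make the updated flag set easier to audit is to collect it in a bash array, which keeps the JSON `--compilation-config` value as a single argument. This is not how the script in the diff is written. `build_serve_args` is a hypothetical name, and the defaults for `MODEL`, `TP` and `EP` are placeholders taken from the config above. The function only prints the arguments and does not launch a server.

```shell
#!/usr/bin/env bash
# Placeholder defaults for illustration only; the benchmark script receives
# these from its environment.
MODEL=${MODEL:-MiniMaxAI/MiniMax-M2.5}
PORT=${PORT:-8888}
TP=${TP:-8}
EP=${EP:-}

# Hypothetical helper: print the `vllm` argument list, one argument per line.
build_serve_args() {
  local args=(
    serve "$MODEL" --host 0.0.0.0 --port "$PORT"
    "--tensor-parallel-size=$TP"
  )
  # EP holds zero or more pre-built flags, so it is split on whitespace.
  if [ -n "$EP" ]; then
    # shellcheck disable=SC2206
    args+=($EP)
  fi
  args+=(
    --trust-remote-code
    --enable-auto-tool-choice
    --tool-call-parser minimax_m2
    --reasoning-parser minimax_m2_append_think
    --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}'
    --gpu-memory-utilization 0.9
  )
  printf '%s\n' "${args[@]}"
}

build_serve_args
```

Printing one argument per line makes it easy to diff the flag set between recipe revisions before anything is launched.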

perf-changelog.yaml

Lines changed: 10 additions & 0 deletions
@@ -3113,3 +3113,13 @@
     - "1k1k and 8k1k STP low-latency and max-throughput srt-slurm recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp4/ (ported from upstream srt-slurm PR #152)"
     - "Wire glm5/fp4 model + dynamo-sglang framework branches into runners/launch_gb300-nv.sh"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1514
+
+- config-keys:
+    - minimaxm2.5-fp8-h100-vllm
+  description:
+    - "Update minimaxm2.5-fp8-h100-vllm recipe (v0.19.1)"
+    - "Image: vllm/vllm-openai:v0.21.0 -> v0.19.1-cu130"
+    - "Replace recipe flags: drop PIECEWISE/0.90 mem util/256 max-num-seqs/no-prefix-caching/explicit max-model-len; add --enable-auto-tool-choice, --tool-call-parser minimax_m2, --reasoning-parser minimax_m2_append_think, --compilation-config mode:3+fuse_minimax_qk_norm"
+    - "Search-space: tp:8 ep:8 (TEP=8), conc-end 128 chosen at saturation per local sweep"
+    - "Local bench: TEP=8 peaks at C=128 with 26923 tot tps (+178% vs TEP=4 peak at C=32 in May 6 j11600242 sweep)"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1516
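
Reviewer sketch: the changelog gives the TEP=8 peak (26923 tot tps) and the relative gain (+178%) but not the TEP=4 figure itself. The helpers below (`pct_gain` and `implied_base`, both hypothetical names) show the arithmetic. Any baseline derived this way is only an estimate, since the reported percentage is presumably rounded.

```shell
#!/usr/bin/env bash
# Percent gain of a new throughput over a baseline: 100 * (new - base) / base.
pct_gain() {
  awk -v new="$1" -v base="$2" 'BEGIN { printf "%.0f\n", 100 * (new - base) / base }'
}

# Baseline implied by a reported gain: new / (1 + gain / 100).
implied_base() {
  awk -v new="$1" -v gain="$2" 'BEGIN { printf "%.0f\n", new / (1 + gain / 100) }'
}

# Estimate the TEP=4 peak from the two numbers stated in the changelog.
implied_base 26923 178
```

With the stated numbers this puts the TEP=4 peak at roughly 9.7k tot tps. The measured value from the sweep should be treated as authoritative if it is available.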
