Commit bddbf40

kedarpotdar-nv, github-actions[bot], functionstackx, and cquil11 authored
B200 Minimax FP8 vllm upgrade (#947)
* Update nvidia-master.yaml
* vllm version bump
* add perf changelog
* update search space and configs
* fix typo in VLLM_USE_DEEP_GEMM
* Remove ISL 1024 / OSL 8192 seq-len config for minimaxm2.5-fp8-b200-vllm

  Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
* update image
* update config and remove DEEPGEMM flag
* test tep
* fix typo in ep bash script
* add max cudagraph size
* upgrade to vllm 0.19
* typo
* revert h200 change
* fix: update perf-changelog version to v0.19.0

  Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
* Remove commented-out tp:8 search-space entry

  Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com>
1 parent 800f57a commit bddbf40

3 files changed

Lines changed: 25 additions & 11 deletions

.github/configs/nvidia-master.yaml

Lines changed: 7 additions & 5 deletions
@@ -3101,7 +3101,7 @@ gptoss-fp4-b200-vllm:
     - { tp: 8, conc-start: 4, conc-end: 4 }

 minimaxm2.5-fp8-b200-vllm:
-  image: vllm/vllm-openai:v0.17.0-cu130
+  image: vllm/vllm-openai:v0.19.0-cu130
   model: MiniMaxAI/MiniMax-M2.5
   model-prefix: minimaxm2.5
   runner: b200
@@ -3112,13 +3112,15 @@ minimaxm2.5-fp8-b200-vllm:
   - isl: 1024
     osl: 1024
     search-space:
-    - { tp: 2, conc-start: 4, conc-end: 64 }
-    - { tp: 4, conc-start: 4, conc-end: 64 }
+    - { tp: 2, conc-start: 4, conc-end: 512 }
+    - { tp: 2, ep: 2, conc-start: 4, conc-end: 256 }
+    - { tp: 4, conc-start: 4, conc-end: 512 }
+    - { tp: 4, ep: 4, conc-start: 16, conc-end: 64 }
   - isl: 8192
     osl: 1024
     search-space:
-    - { tp: 2, conc-start: 4, conc-end: 64 }
-    - { tp: 4, conc-start: 4, conc-end: 64 }
+    - { tp: 2, conc-start: 4, conc-end: 256 }
+    - { tp: 4, conc-start: 4, conc-end: 256 }

 gptoss-fp4-h100-vllm:
   image: vllm/vllm-openai:v0.18.0
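
To get a feel for how much the wider `conc-start` / `conc-end` ranges grow the benchmark matrix, the sketch below lists the concurrency levels between the two bounds. It assumes the sweep doubles concurrency at each step; the diff does not say how the harness steps between the bounds, and `conc_sweep` is a hypothetical helper that does not exist in the repository.

```shell
# Hypothetical helper (not part of the repository): print the concurrency
# levels from conc-start to conc-end, ASSUMING the sweep doubles each step.
conc_sweep() {
  c=$1
  end=$2
  out=""
  while [ "$c" -le "$end" ]; do
    out="${out:+$out }$c"   # append with a single space separator
    c=$((c * 2))
  done
  echo "$out"
}

conc_sweep 4 64    # old tp: 2 entry        -> 4 8 16 32 64
conc_sweep 4 512   # new tp: 2 entry        -> 4 8 16 32 64 128 256 512
conc_sweep 16 64   # new tp: 4, ep: 4 entry -> 16 32 64
```

Under that doubling assumption, the plain `tp: 2` entry for ISL 1024 / OSL 1024 grows from five concurrency levels to eight, while the `tp: 4, ep: 4` entry covers only three.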

benchmarks/single_node/minimaxm2.5_fp8_b200.sh

Lines changed: 7 additions & 5 deletions
@@ -24,10 +24,9 @@ hf download "$MODEL"
 SERVER_LOG=/workspace/server.log
 PORT=${PORT:-8888}

-export VLLM_USE_FLASHINFER_MOE_FP8=0
-export VLLM_MOE_USE_DEEP_GEMM=0
+export VLLM_FLASHINFER_ALLREDUCE_BACKEND=mnnvl

-if [ "$EP_SIZE" -ge 1 ]; then
+if [ "$EP_SIZE" -gt 1 ]; then
   EP=" --enable-expert-parallel"
 else
   EP=" "
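
The `-ge` to `-gt` change matters when `EP_SIZE` is 1: with `-ge 1` the script would pass `--enable-expert-parallel` even though only one expert-parallel rank is requested. A minimal sketch of the corrected condition follows; `ep_flag` is a hypothetical wrapper used only for illustration, and treating an unset size as 1 is an assumption.

```shell
# Hypothetical wrapper around the corrected condition from the diff.
# Assumption: a missing argument is treated as EP size 1.
ep_flag() {
  if [ "${1:-1}" -gt 1 ]; then
    echo " --enable-expert-parallel"
  else
    echo " "
  fi
}

echo "EP_SIZE=1 ->[$(ep_flag 1)]"   # prints: EP_SIZE=1 ->[ ]
echo "EP_SIZE=4 ->[$(ep_flag 4)]"   # prints: EP_SIZE=4 ->[ --enable-expert-parallel]
```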
@@ -44,10 +43,13 @@ set -x
 vllm serve $MODEL --port $PORT \
   --tensor-parallel-size=$TP \
   $EP \
-  --gpu-memory-utilization 0.95 \
+  --gpu-memory-utilization 0.90 \
   --max-model-len $MAX_MODEL_LEN \
   --block-size=32 \
-  --no-enable-prefix-caching \
+  --kv-cache-dtype fp8 \
+  --max-cudagraph-capture-size 2048 \
+  --max-num-batched-tokens "$((ISL * 2 ))" \
+  --stream-interval 20 --no-enable-prefix-caching \
   --trust-remote-code > $SERVER_LOG 2>&1 &

 SERVER_PID=$!
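
The new `--max-num-batched-tokens "$((ISL * 2 ))"` argument is computed with shell arithmetic from the input sequence length. As a quick check of what that yields for the two ISL values in the config above, here is a small sketch; `max_batched_tokens` is a hypothetical helper name, not something defined in the script.

```shell
# Hypothetical helper mirroring the "$((ISL * 2 ))" expression in the diff.
max_batched_tokens() {
  echo $(( $1 * 2 ))
}

for ISL in 1024 8192; do
  echo "ISL=$ISL -> --max-num-batched-tokens $(max_batched_tokens "$ISL")"
done
# prints:
# ISL=1024 -> --max-num-batched-tokens 2048
# ISL=8192 -> --max-num-batched-tokens 16384
```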

perf-changelog.yaml

Lines changed: 11 additions & 1 deletion
@@ -1143,7 +1143,7 @@
   description:
     - "Disable prefix caching (--no-enable-prefix-caching) for all MiniMax benchmarks using random datasets"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/966
-
+
 - config-keys:
     # NVIDIA single-node
     - dsr1-fp4-b200-sglang
@@ -1235,3 +1235,13 @@
     - "New model support on ATOM framework"
     - "Kimi-K2.5 FP4, and MiniMax-M2.5 FP8 configs added for MI355X ATOM"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/963
+
+- config-keys:
+    - minimaxm2.5-fp8-b200-vllm
+  description:
+    - "Update vLLM image from v0.17.0 to v0.19.0 for MiniMax-M2.5 FP8 B200"
+    - "Add tp4 ep4 search-space entries (conc 32-256) for all seq-len configs"
+    - "Remove ISL 1024 / OSL 8192 seq-len config"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/947
+
+
