Commit 0618646

Updating dsv4 b200 vllm version (#1384)
* Try updating b200 dsv4
* add changelog
* Set MAX_CUDAGRAPH_CAPTURE_SIZE to 2048 unconditionally
* Update Docker image for dsv4-fp4-b200-vllm
* Update vLLM image tag in perf-changelog.yaml

  Updated the vLLM image tag to specify the nightly version.
* Update Docker image tag for dsv4-fp4-b200-vllm
* Update vLLM image tag to v0.22.0
* Update conc-end values in nvidia-master.yaml

---------

Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
1 parent c088658 commit 0618646

3 files changed

Lines changed: 12 additions & 10 deletions

.github/configs/nvidia-master.yaml

Lines changed: 3 additions & 2 deletions
@@ -1756,7 +1756,7 @@ dsv4-fp4-b200-sglang:
  - { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 512 }

  dsv4-fp4-b200-vllm:
- image: vllm/vllm-openai:v0.21.0
+ image: vllm/vllm-openai:v0.22.0
  model: deepseek-ai/DeepSeek-V4-Pro
  model-prefix: dsv4
  runner: b200-dsv4
@@ -1770,7 +1770,8 @@ dsv4-fp4-b200-vllm:
  search-space:
  - { tp: 8, conc-start: 1, conc-end: 64 }
  - { tp: 8, ep: 8, conc-start: 128, conc-end: 128 }
- - { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 4096 }
+ - { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 1024 }
+ - { tp: 8, ep: 8, dp-attn: true, conc-start: 4096, conc-end: 4096 }
  - isl: 8192
  osl: 1024
  search-space:
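
The search-space change above splits the DP-attention range 256..4096 into a range 256..1024 plus a single point at 4096. This diff does not show how the harness expands a `conc-start`/`conc-end` pair into individual runs. As a rough sketch, assuming a doubling sweep (an assumption, not something the commit confirms) and using a hypothetical helper name `expand_conc`, the split would drop only the 2048 run:

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand a conc-start/conc-end pair into a sweep.
# ASSUMPTION: concurrency doubles at each step; the real harness may differ.
expand_conc() {
    local start="$1" end="$2" c
    local out=()
    if [ "$start" -lt 1 ]; then
        echo "conc-start must be >= 1" >&2
        return 1
    fi
    for ((c = start; c <= end; c *= 2)); do
        out+=("$c")
    done
    echo "${out[*]}"
}

expand_conc 256 4096    # old entry: 256 512 1024 2048 4096
expand_conc 256 1024    # new entry 1: 256 512 1024
expand_conc 4096 4096   # new entry 2: 4096
```

If the harness steps differently, the set of dropped points changes, but the two new entries still cover both endpoints of the old range.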

benchmarks/single_node/dsv4_fp4_b200_vllm.sh

Lines changed: 3 additions & 8 deletions
@@ -42,13 +42,9 @@ if [ "${EP_SIZE:-1}" -gt 1 ]; then
  EP_ARGS=(--enable-expert-parallel)
  fi

- # Mega-MoE backend and the lower GMU only kick in on the DP-attn path,
- # per the vLLM v0.20.0 DeepSeek-V4-Pro recipe. All configs share the
- # FULL_AND_PIECEWISE compilation config.
  GMU_ARGS=()
  MOE_ARGS=()
  if [ "${DP_ATTENTION}" = "true" ]; then
- GMU_ARGS=(--gpu-memory-utilization 0.85)
  MOE_ARGS=(--moe-backend deep_gemm_mega_moe)
  fi

@@ -58,10 +54,9 @@ else
  MAX_NUM_BATCHED_TOKENS=2048
  fi

+ MAX_CUDAGRAPH_CAPTURE_SIZE=2048
+
  BENCHMARK_MAX_MODEL_LEN="$MAX_MODEL_LEN"
- if [ "$ISL" -eq 1024 ] && [ "$OSL" -eq 1024 ]; then
- BENCHMARK_MAX_MODEL_LEN=4096
- fi

  if [ "${EVAL_ONLY}" = "true" ]; then
  EVAL_MAX_MODEL_LEN=$(compute_eval_context_length "$MODEL" "$BENCHMARK_MAX_MODEL_LEN")
@@ -90,7 +85,7 @@ vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
- --max-cudagraph-capture-size 2048 \
+ --max-cudagraph-capture-size "$MAX_CUDAGRAPH_CAPTURE_SIZE" \
  --max-model-len "$SERVE_MAX_MODEL_LEN" \
  --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS" > "$SERVER_LOG" 2>&1 &
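
The script changes above rely on optional flags held in bash arrays that stay empty unless a condition enables them, plus a single `MAX_CUDAGRAPH_CAPTURE_SIZE` variable in place of a hard-coded literal. A minimal sketch of that pattern follows. The function name `build_serve_args` and its positional parameters are hypothetical; only the flags themselves come from the diff. It prints the assembled arguments rather than launching a server:

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating the conditional-argument pattern.
# It only prints the flags; it does not invoke vllm.
build_serve_args() {
    local dp_attention="$1" ep_size="$2"
    local ep_args=() moe_args=()

    # Expert parallelism is enabled only when EP size exceeds 1.
    if [ "$ep_size" -gt 1 ]; then
        ep_args=(--enable-expert-parallel)
    fi

    # The Mega-MoE backend applies only on the DP-attention path.
    if [ "$dp_attention" = "true" ]; then
        moe_args=(--moe-backend deep_gemm_mega_moe)
    fi

    # Set once, unconditionally, and referenced by name below.
    local max_cudagraph_capture_size=2048

    # Empty arrays expand to zero words, so unused flags vanish cleanly.
    local args=("${ep_args[@]}" "${moe_args[@]}"
                --max-cudagraph-capture-size "$max_cudagraph_capture_size")
    echo "${args[*]}"
}

build_serve_args true 8
build_serve_args false 1
```

One caveat with this pattern: under `set -u`, bash versions before 4.4 treat `"${arr[@]}"` on an empty array as an unbound variable, so a script that enables `set -u` on older bash would need the `${arr[@]+"${arr[@]}"}` form instead.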

perf-changelog.yaml

Lines changed: 6 additions & 0 deletions
@@ -3208,3 +3208,9 @@
  - "1k1k and 8k1k STP hightpt and lowlat srt-slurm recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/ (resolved from upstream srt-slurm PR #160 via srtctl resolve-override)"
  - "Wire glm5/fp8 model + dynamo-sglang framework branches into runners/launch_gb300-nv.sh with SA upstream defaults (SLURM_PARTITION=batch_1, SLURM_ACCOUNT=benchmark, SQUASH_FILE under /home/sa-shared/gharunners/squash/)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1557
+
+ - config-keys:
+ - dsv4-fp4-b200-vllm
+ description:
+ - "Update vLLM image tag to v0.22.0"
+ pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1384
