Commit 292726b

[Klaud Cold] Add dsr1-fp4-b200-sglang-mtp single-node MTP recipe (#1522)
* [Klaud Cold] Add dsr1-fp4-b200-sglang-mtp single-node MTP recipe

New MTP/EAGLE speculative-decoding sibling for the existing dsr1-fp4-b200-sglang recipe.

Recipe key: dsr1-fp4-b200-sglang-mtp
Model: nvidia/DeepSeek-R1-0528-FP4-V2 (same as off sibling)
Image: lmsysorg/sglang:v0.5.12-cu130 (same as off sibling)
Search: tp=8 ep=1 conc 4..512 spec-decoding=mtp on 1k1k + 8k1k

Launch script benchmarks/single_node/dsr1_fp4_b200_mtp.sh clones the off variant (dsr1_fp4_b200.sh) and overlays MTP bits from the production B200 sglang MTP template (dsr1_fp8_b200_mtp.sh):
- TP=8 enforcement check
- --cuda-graph-max-bs 512 / --max-running-requests 512
- --speculative-algorithm EAGLE with num-steps=2, draft-tokens=3, eagle-topk=1
- SGLANG_ENABLE_SPEC_V2=1 env var
- --use-chat-template on the bench client

Keeps fp4-specific bits intact (--quantization modelopt_fp4, --moe-runner-backend flashinfer_trtllm).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: fill pr-link for #1522

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 80c944e commit 292726b

3 files changed

Lines changed: 120 additions & 0 deletions

.github/configs/nvidia-master.yaml

Lines changed: 19 additions & 0 deletions
@@ -1698,6 +1698,25 @@ dsr1-fp4-b200-sglang:
       # - { tp: 4, ep: 4, offloading: none, conc-list: [1, 2, 4, 8, 12, 16, 24, 32, 48, 64, 128, 256] }
       # - { tp: 8, ep: 8, offloading: none, conc-list: [1, 2, 4, 8, 12, 16, 32, 64, 128, 256, 512] }

+dsr1-fp4-b200-sglang-mtp:
+  image: lmsysorg/sglang:v0.5.12-cu130
+  model: nvidia/DeepSeek-R1-0528-FP4-V2
+  model-prefix: dsr1
+  runner: b200
+  precision: fp4
+  framework: sglang
+  multinode: false
+  scenarios:
+    fixed-seq-len:
+    - isl: 1024
+      osl: 1024
+      search-space:
+      - { tp: 8, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
+    - isl: 8192
+      osl: 1024
+      search-space:
+      - { tp: 8, ep: 1, conc-start: 4, conc-end: 512, spec-decoding: mtp }
+
 dsv4-fp4-b200-sglang:
   image: lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc4aa9ecf59451002b49ba00cae58042de9e2a96378bbd21b404dd62c7b
   model: deepseek-ai/DeepSeek-V4-Pro
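
Note on the search space: each entry gives only conc-start: 4 and conc-end: 512, and the diff does not show how the harness expands that range. The sketch below assumes a doubling sweep, which is an assumption and not something this commit confirms. The helper name expand_conc is hypothetical. The num-prompts arithmetic (CONC * 10) is taken from the launch script added in this commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: expand conc-start..conc-end by doubling.
# The doubling rule is an assumption; the real harness may use another step.
expand_conc() {
    local c="$1" end="$2"
    while (( c <= end )); do
        echo "$c"
        c=$(( c * 2 ))
    done
}

# Per concurrency point, the launch script requests CONC * 10 prompts.
for conc in $(expand_conc 4 512); do
    echo "conc=$conc num_prompts=$(( conc * 10 ))"
done
```

Under that assumption the sweep has eight points per sequence-length scenario, from conc=4 up to conc=512.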

benchmarks/single_node/dsr1_fp4_b200_mtp.sh

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
+#!/usr/bin/env bash
+
+# DeepSeek-R1-0528 FP4 on B200 with EAGLE/MTP speculative decoding.
+# Mirrors dsr1_fp4_b200.sh and adds the speculative-* flags from
+# dsr1_fp8_b200_mtp.sh (the production B200 sglang MTP template).
+
+source "$(dirname "$0")/../benchmark_lib.sh"
+
+check_env_vars \
+    MODEL \
+    TP \
+    CONC \
+    ISL \
+    OSL \
+    RANDOM_RANGE_RATIO \
+    RESULT_FILENAME \
+    EP_SIZE
+
+if [[ -n "$SLURM_JOB_ID" ]]; then
+    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
+fi
+
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
+
+nvidia-smi
+
+# MTP only supports TP=8 for now (matching dsr1_fp8_b200_mtp.sh)
+if [[ $TP -ne 8 ]]; then
+    echo "MTP only supports TP=8, got TP=$TP!"
+    exit 1
+fi
+
+SERVER_LOG=/workspace/server.log
+PORT=${PORT:-8888}
+
+if [[ $CONC -ge 16 ]]; then
+    SCHEDULER_RECV_INTERVAL=30
+else
+    SCHEDULER_RECV_INTERVAL=10
+fi
+echo "SCHEDULER_RECV_INTERVAL: $SCHEDULER_RECV_INTERVAL, CONC: $CONC, ISL: $ISL, OSL: $OSL"
+
+# MTP (Multi-Token Prediction) Config - EAGLE speculative decoding
+SPECULATIVE_NUM_STEPS=2
+SPECULATIVE_DRAFT_TOKENS=3
+SPECULATIVE_EAGLE_TOPK=1
+
+export SGLANG_ENABLE_SPEC_V2=1
+
+EVAL_CONTEXT_ARGS=""
+if [ "${EVAL_ONLY}" = "true" ]; then
+    setup_eval_context
+    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
+fi
+start_gpu_monitor
+
+set -x
+PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path $MODEL --host 0.0.0.0 --port $PORT --trust-remote-code \
+    --tensor-parallel-size=$TP --data-parallel-size=1 \
+    --cuda-graph-max-bs 512 --max-running-requests 512 --mem-fraction-static 0.82 --kv-cache-dtype fp8_e4m3 \
+    --chunked-prefill-size 16384 --max-prefill-tokens 16384 \
+    --ep-size $EP_SIZE --quantization modelopt_fp4 --enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
+    --enable-symm-mem --disable-radix-cache --attention-backend trtllm_mla --moe-runner-backend flashinfer_trtllm --stream-interval 30 \
+    --speculative-algorithm EAGLE \
+    --speculative-num-steps $SPECULATIVE_NUM_STEPS \
+    --speculative-num-draft-tokens $SPECULATIVE_DRAFT_TOKENS \
+    --speculative-eagle-topk $SPECULATIVE_EAGLE_TOPK \
+    $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
+
+SERVER_PID=$!
+
+wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"
+
+pip install -q datasets pandas
+
+run_benchmark_serving \
+    --model "$MODEL" \
+    --port "$PORT" \
+    --backend vllm \
+    --input-len "$ISL" \
+    --output-len "$OSL" \
+    --random-range-ratio "$RANDOM_RANGE_RATIO" \
+    --num-prompts $((CONC * 10)) \
+    --max-concurrency "$CONC" \
+    --result-filename "$RESULT_FILENAME" \
+    --result-dir /workspace/ \
+    --use-chat-template
+
+if [ "${RUN_EVAL}" = "true" ]; then
+    run_eval --framework lm-eval --port "$PORT"
+    append_lm_eval_summary
+fi
+
+stop_gpu_monitor
+set +x
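
Note on the two guards in the script above: the TP=8 check and the CONC-based choice of SCHEDULER_RECV_INTERVAL can be exercised without a GPU or a running server. The sketch below wraps the same conditions in functions. The names require_tp8 and recv_interval are hypothetical and do not exist in benchmark_lib.sh as far as this diff shows. The thresholds and the error message are copied from the script.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers around two checks from dsr1_fp4_b200_mtp.sh,
# so they can be tried without launching sglang.

# Succeeds only for TP=8, mirroring the MTP guard in the script.
require_tp8() {
    local tp="$1"
    if [[ "$tp" -ne 8 ]]; then
        echo "MTP only supports TP=8, got TP=$tp!" >&2
        return 1
    fi
}

# Prints the scheduler receive interval the script picks for a concurrency.
recv_interval() {
    local conc="$1"
    if [[ "$conc" -ge 16 ]]; then
        echo 30
    else
        echo 10
    fi
}

require_tp8 8 && echo "TP ok"
echo "conc=4  -> $(recv_interval 4)"
echo "conc=16 -> $(recv_interval 16)"
```

With the recipe's search space (tp: 8, conc 4 to 512), the guard always passes, and only concurrency values below 16 get the shorter interval of 10.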

perf-changelog.yaml

Lines changed: 6 additions & 0 deletions
@@ -3022,3 +3022,9 @@
   description:
     - "Update SGLang image from nightly-dev-cu13-20260518-c67b2870 to nightly-dev-cu13-20260519-dbac4647"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1492
+
+- config-keys:
+    - dsr1-fp4-b200-sglang-mtp
+  description:
+    - "Add MTP/EAGLE speculative-decoding sibling for dsr1-fp4-b200-sglang (model: nvidia/DeepSeek-R1-0528-FP4-V2) on lmsysorg/sglang:v0.5.12-cu130"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1522
