Commit 39f914b

feat(profile): add Flash vLLM MTP3 run
1 parent e3393af commit 39f914b

2 files changed

Lines changed: 22 additions & 1 deletion

.github/configs/nvidia-master.yaml

Lines changed: 17 additions & 0 deletions
@@ -2073,6 +2073,23 @@ dsv4-flash-fp4-b300-vllm:
     search-space:
     - { tp: 4, ep: 1, conc-start: 64, conc-end: 64 }
 
+# Targeted Flash vLLM MTP profile at the same single-point profile location.
+# The shared vLLM MTP launcher selects 3 speculative tokens for this model.
+dsv4-flash-fp4-b300-vllm-mtp:
+  image: vllm/vllm-openai:v0.21.0
+  model: deepseek-ai/DeepSeek-V4-Flash
+  model-prefix: dsv4
+  runner: b300
+  precision: fp4
+  framework: vllm
+  multinode: false
+  scenarios:
+    fixed-seq-len:
+    - isl: 1024
+      osl: 1024
+      search-space:
+      - { tp: 4, ep: 1, conc-start: 64, conc-end: 64, spec-decoding: mtp }
+
 # Targeted Flash MTP profile: DEP4 at the same 1k1k conc=64 point as the
 # non-MTP Flash profile above. The shared SGLang MTP launcher selects the
 # Flash-only (steps=3, draft-tokens=3) speculative settings for this model.

benchmarks/single_node/dsv4_fp4_b300_vllm_mtp.sh

Lines changed: 5 additions & 1 deletion
@@ -62,8 +62,12 @@ else
     SERVE_MAX_MODEL_LEN="$BENCHMARK_MAX_MODEL_LEN"
 fi
 
-# use 2 speculative tokens for all configs for now
+# Keep the existing Pro MTP profile at 2 speculative tokens; Flash uses the
+# requested 3-token MTP profile.
 NUM_SPEC_TOKENS=2
+if [[ "$MODEL" == "deepseek-ai/DeepSeek-V4-Flash" ]]; then
+    NUM_SPEC_TOKENS=3
+fi
 
 start_gpu_monitor
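The model-based override in this hunk can be checked in isolation, without launching a server. The following is a minimal sketch, not the script's actual structure: the function name `select_num_spec_tokens` is hypothetical and does not appear in the commit, and `some/other-model` in the usage note is a placeholder ID. It assumes, as the diff does, that `MODEL` holds the full Hugging Face model ID.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): maps a model ID to the
# number of MTP speculative tokens, mirroring the if-block in the diff.
select_num_spec_tokens() {
    local model="$1"
    case "$model" in
        # Flash uses the 3-token MTP profile.
        deepseek-ai/DeepSeek-V4-Flash) echo 3 ;;
        # Every other model keeps the existing default of 2.
        *) echo 2 ;;
    esac
}

# MODEL may be unset when this sketch is run on its own, hence the ${MODEL:-}.
NUM_SPEC_TOKENS="$(select_num_spec_tokens "${MODEL:-}")"
echo "NUM_SPEC_TOKENS=$NUM_SPEC_TOKENS"
```

Running it with `MODEL=deepseek-ai/DeepSeek-V4-Flash` selects the Flash branch, while `MODEL=some/other-model` (or an unset `MODEL`) falls through to the default. A `case` statement is used here only because it extends cleanly if more per-model overrides are added; the committed `if [[ ... ]]` form behaves the same for the two cases covered.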
