
Commit b4df8b0

Oseltamivir and claude committed
bench: dsv4 gb300-cw sglang mtp3 5p1d-c12288 + mooncake P→D tuning
5p1d at 12288 was 9.10% zero-output without tuning. Probe the three SGLang env vars most likely to widen the P→D pipeline:

- SGLANG_DISAGGREGATION_QUEUE_SIZE=8 (default 4) on both sides: number of parallel FastQueues that shard transfer requests by session-port hash.
- SGLANG_DISAGGREGATION_THREAD_POOL_SIZE=32 (default capped at 12) on both sides: sender threads. 144 cpus-per-task means the current default caps at ~12.
- SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS=2048 (default 0) on decode only: pre-reserves req_to_token_pool slots so KV transfers overlap with decode steps. Directly targets the gap between "#running-req: 65" and the configured 3072 observed in the 5p2d-c12288 run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
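To make the first two knobs concrete, here is a minimal Python sketch of hash-sharded transfer queues and a CPU-derived thread-pool default. It is not SGLang's code: `env_int`, `default_thread_pool_size`, `shard_index`, and the sizing formula are hypothetical, chosen only to be consistent with the numbers in the commit message (4 queues by default, and a cap of 12 that a 144-CPU task already reaches).

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to `default`."""
    raw = os.environ.get(name)
    return int(raw) if raw not in (None, "") else default


def default_thread_pool_size(cpu_count: int, cap: int = 12) -> int:
    """Hypothetical sizing rule: about 3/4 of the CPUs divided by 8,
    never below 4 and never above `cap`."""
    return min(max(4, int(0.75 * cpu_count) // 8), cap)


def shard_index(session_port_key: int, num_queues: int) -> int:
    """Pick a queue for a transfer request from an integer key derived
    from its session and port (the derivation itself is assumed here)."""
    return session_port_key % num_queues


if __name__ == "__main__":
    # Assumed defaults taken from the commit message: 4 queues, cap of 12.
    queues = env_int("SGLANG_DISAGGREGATION_QUEUE_SIZE", 4)
    threads = env_int(
        "SGLANG_DISAGGREGATION_THREAD_POOL_SIZE",
        default_thread_pool_size(os.cpu_count() or 1),
    )
    print(f"queues={queues} sender_threads={threads}")
    # Eight distinct keys share four queues pairwise, but get one queue
    # each when the queue count is raised to eight.
    print([shard_index(k, 4) for k in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
    print([shard_index(k, 8) for k in range(8)])  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Under this assumed rule, 144 CPUs already hit the cap of 12, so raising the CPU allocation cannot add sender threads. Only the explicit override to 32 can.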
1 parent 1deb248 commit b4df8b0

2 files changed

Lines changed: 13 additions & 3 deletions


.github/configs/nvidia-master.yaml

Lines changed: 3 additions & 2 deletions
@@ -8820,9 +8820,10 @@ dsv4-fp4-gb300-dynamo-sglang-mtp3:
  tp: 16
  ep: 16
  dp-attn: true
- # Mid curve 5p1d-dep8-dep8. 12 nodes. Conc 9216 (~12% above the 8k clean point — probe).
+ # Mid curve 5p1d-dep8-dep8. 12 nodes. Conc 12288 with mooncake P→D tuning (queue=8, threads=32, prealloc=2048).
+ # Baseline was 9.10% zero-output at 12288 without tuning.
  - spec-decoding: mtp
- conc-list: [9216]
+ conc-list: [12288]
  prefill:
  num-worker: 5
  tp: 8

benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-mid-curve-5p1d-dep8-dep8-mtp-c24576.yaml

Lines changed: 10 additions & 1 deletion
@@ -60,6 +60,10 @@ backend:
  SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
  SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"

+ # Mooncake P→D pipeline tuning (probe).
+ SGLANG_DISAGGREGATION_QUEUE_SIZE: "8"
+ SGLANG_DISAGGREGATION_THREAD_POOL_SIZE: "32"
+
  decode_environment:
  PYTHONUNBUFFERED: "1"
  SGLANG_RADIX_DISABLE_REUSE: "1"
@@ -89,6 +93,11 @@ backend:
  SGLANG_OPT_SWA_RELEASE_LEAF_LOCK_AFTER_WINDOW: "1"
  SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: "0" # CAR_V2 is single-node only.

+ # Mooncake P→D pipeline tuning (probe).
+ SGLANG_DISAGGREGATION_QUEUE_SIZE: "8"
+ SGLANG_DISAGGREGATION_THREAD_POOL_SIZE: "32"
+ SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS: "2048"
+
  sglang_config:
  prefill:
  served-model-name: "deepseek-ai/DeepSeek-V4-Pro"
@@ -151,7 +160,7 @@ benchmark:
  isl: 8192
  osl: 256
  random_range_ratio: 1.0
- concurrencies: "9216"
+ concurrencies: "12288"
  req_rate: "inf"
  use_chat_template: true
  custom_tokenizer: "sa_bench_tokenizers.sglang_deepseek_v4.SGLangDeepseekV4Tokenizer"
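The recipe change relies on two invariants from the commit message: the queue and thread-pool settings match on prefill and decode, and the pre-allocation setting appears on decode only. A minimal sketch of a check for that follows. The helper `check_pd_tuning` and the two dictionaries are hypothetical stand-ins for the recipe's prefill and decode environment blocks, not part of the repository.

```python
from typing import Dict, List

# Values copied from the diff; the dict names are illustrative only.
PREFILL_ENV: Dict[str, str] = {
    "SGLANG_DISAGGREGATION_QUEUE_SIZE": "8",
    "SGLANG_DISAGGREGATION_THREAD_POOL_SIZE": "32",
}
DECODE_ENV: Dict[str, str] = {
    "SGLANG_DISAGGREGATION_QUEUE_SIZE": "8",
    "SGLANG_DISAGGREGATION_THREAD_POOL_SIZE": "32",
    "SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS": "2048",
}

SHARED_KEYS = (
    "SGLANG_DISAGGREGATION_QUEUE_SIZE",
    "SGLANG_DISAGGREGATION_THREAD_POOL_SIZE",
)
DECODE_ONLY_KEYS = ("SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS",)


def check_pd_tuning(prefill_env: Dict[str, str],
                    decode_env: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means consistent."""
    problems: List[str] = []
    for key in SHARED_KEYS:
        p, d = prefill_env.get(key), decode_env.get(key)
        if p is None or d is None:
            problems.append(f"{key} missing on prefill or decode")
        elif p != d:
            problems.append(f"{key} differs: prefill={p} decode={d}")
    for key in DECODE_ONLY_KEYS:
        if key in prefill_env:
            problems.append(f"{key} should not be set on prefill")
        if key not in decode_env:
            problems.append(f"{key} missing on decode")
    return problems


if __name__ == "__main__":
    for problem in check_pd_tuning(PREFILL_ENV, DECODE_ENV):
        print(problem)
```

With the values from this commit the check reports nothing. Changing only one side's queue size would produce a single "differs" message.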
