Commit b4df8b0
bench: dsv4 gb300-cw sglang mtp3 5p1d-c12288 + mooncake P→D tuning

5p1d at 12288 was 9.10% zero-output without tuning. Probe the three
SGLang env vars most likely to widen the P→D pipeline:
- SGLANG_DISAGGREGATION_QUEUE_SIZE=8 (default 4) on both sides: number
  of parallel FastQueues that shard transfer requests by session-port
  hash.
- SGLANG_DISAGGREGATION_THREAD_POOL_SIZE=32 (default capped at 12) on
  both sides: sender threads. With 144 cpus-per-task, the current
  default caps at ~12.
- SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS=2048 (default 0) on
  decode only: pre-reserves req_to_token_pool slots so KV transfers
  overlap with decode steps. Directly targets the gap between
  "#running-req: 65" and the configured 3072 observed in the
  5p2d-c12288 run.
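
The per-role settings listed above can be summarised in a short Python sketch. The variable names and values come from the commit message; the helper `disagg_env` and the role strings `"prefill"` and `"decode"` are hypothetical, chosen only for illustration, and do not reflect how the recipe files actually inject the variables.

```python
# Sketch only: which tuning variables go to which side of the P->D pipeline.
# Values are taken from the commit message; the helper itself is hypothetical.

COMMON_ENV = {
    # Applied on both prefill and decode workers.
    "SGLANG_DISAGGREGATION_QUEUE_SIZE": "8",
    "SGLANG_DISAGGREGATION_THREAD_POOL_SIZE": "32",
}

DECODE_ONLY_ENV = {
    # Applied on decode workers only.
    "SGLANG_DISAGGREGATION_NUM_PRE_ALLOCATE_REQS": "2048",
}


def disagg_env(role: str) -> dict:
    """Return the extra environment variables for a worker role."""
    if role not in ("prefill", "decode"):
        raise ValueError(f"unknown role: {role!r}")
    env = dict(COMMON_ENV)
    if role == "decode":
        env.update(DECODE_ONLY_ENV)
    return env


if __name__ == "__main__":
    for role in ("prefill", "decode"):
        print(role, disagg_env(role))
```

Keeping the decode-only variable in a separate mapping makes it harder to set the pre-allocation count on prefill workers by accident.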
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent 1deb248 commit b4df8b0
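
To make the thread-pool cap and the queue sharding from the commit message concrete, here is a minimal Python sketch. Both functions are assumptions, not confirmed by this commit. `default_thread_pool_size` assumes the default is derived from the CPU count and clamped to the range 4 to 12. `shard_index` assumes a plain modulo of a session port selects the queue. Neither is SGLang's actual API.

```python
# Sketch under stated assumptions; neither function is taken from SGLang.


def default_thread_pool_size(cpu_count: int) -> int:
    """Assumed default: about 0.75 * cpus / 8, clamped to the range [4, 12]."""
    return min(max(4, int(0.75 * cpu_count) // 8), 12)


def shard_index(session_port: int, queue_size: int) -> int:
    """Assumed sharding rule: a modulo of the session port picks the queue."""
    return session_port % queue_size


if __name__ == "__main__":
    # With 144 CPUs the assumed formula hits the upper clamp.
    print("default pool size at 144 cpus:", default_thread_pool_size(144))

    # The same set of (hypothetical) session ports spreads over more queues
    # when the queue count is raised from 4 to 8.
    ports = range(40000, 40016)
    for queue_size in (4, 8):
        used = {shard_index(p, queue_size) for p in ports}
        print(f"queue_size={queue_size}: {len(used)} queues in use")
```

Under these assumptions, setting the thread-pool variable explicitly is the only way past the clamp, and a larger queue count only helps when the session ports spread across the extra shards.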
2 files changed
Lines changed: 13 additions & 3 deletions
File tree
- .github/configs
- benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k
Hunk: original lines 8820-8828, new lines 8820-8829. Original line 8823 replaced by new lines 8823-8824; original line 8825 replaced by new line 8826. (Line contents not captured.)
Lines changed: 10 additions & 1 deletion
Hunk 1: original lines 60-65, new lines 60-69. New lines 63-66 added.
Hunk 2: original lines 89-94, new lines 93-103. New lines 96-100 added.
Hunk 3: original lines 151-157, new lines 160-166. Original line 154 replaced by new line 163.
(Line contents not captured.)