Commit e0cd8f7

glm5-fp4-gb300-dynamo-sglang: extend 8k1k low-lat sweep with 1p17d topology (#1583)
* glm5-fp4-gb300-dynamo-sglang: extend 8k1k low-lat sweep with 1p17d topology

  Mirrors NVIDIA/srt-slurm#175: adds a 5th 8k1k_stp_lowlat_4 recipe with decode_nodes/workers=17, and lowers per-zip-index decode max-running-requests / cuda-graph-max-bs from a flat 4096 to 128/64/32/16/1 across lowlat_0..4. Benchmark concurrencies follow suit: 128/64/32/16/12. nvidia-master.yaml conc-list updated to match for each of the five 1p{3,5,9,15,17}d entries.

* perf-changelog: set PR link to #1583

---------

Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
1 parent 0618646 commit e0cd8f7

7 files changed

Lines changed: 211 additions & 16 deletions

.github/configs/nvidia-master.yaml

Lines changed: 18 additions & 4 deletions
@@ -9296,7 +9296,7 @@ glm5-fp4-gb300-dynamo-sglang:
  osl: 1024
  search-space:
  # 1p3d. 4 nodes (1P + 3 D workers @ 1 node each).
- - conc-list: [1024]
+ - conc-list: [128]
  prefill:
  num-worker: 1
  tp: 4
@@ -9310,7 +9310,7 @@ glm5-fp4-gb300-dynamo-sglang:
  ep: 1
  dp-attn: false
  # 1p5d. 6 nodes.
- - conc-list: [1024]
+ - conc-list: [64]
  prefill:
  num-worker: 1
  tp: 4
@@ -9324,7 +9324,7 @@ glm5-fp4-gb300-dynamo-sglang:
  ep: 1
  dp-attn: false
  # 1p9d. 10 nodes.
- - conc-list: [1024]
+ - conc-list: [32]
  prefill:
  num-worker: 1
  tp: 4
@@ -9338,7 +9338,7 @@ glm5-fp4-gb300-dynamo-sglang:
  ep: 1
  dp-attn: false
  # 1p15d. 16 nodes.
- - conc-list: [1024]
+ - conc-list: [16]
  prefill:
  num-worker: 1
  tp: 4
@@ -9351,6 +9351,20 @@ glm5-fp4-gb300-dynamo-sglang:
  tp: 4
  ep: 1
  dp-attn: false
+ # 1p17d. 18 nodes.
+ - conc-list: [12]
+ prefill:
+ num-worker: 1
+ tp: 4
+ ep: 1
+ dp-attn: true
+ additional-settings:
+ - "CONFIG_FILE=recipes/sglang/glm5/gb300-fp4/8k1k/disagg/stp/8k1k_stp_lowlat_4.yaml"
+ decode:
+ num-worker: 17
+ tp: 4
+ ep: 1
+ dp-attn: false
  # ---------- 1k1k high-throughput (wide-EP TP=32 decode) ----------
  - isl: 1024
  osl: 1024
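
The node counts in the topology comments above (4, 6, 10, 16, 18) follow from one prefill worker plus N decode workers at one node each. A minimal Python sketch of that arithmetic is below. It assumes every worker occupies exactly one node and that each node has 4 GPUs (the `gpus_per_node: 4` value appears only in the lowlat_4 recipe; applying it to the other topologies is an assumption). The helper name `topology_size` is hypothetical and not part of the repository.

```python
# Hypothetical helper: derive node and GPU counts for a 1pNd topology.
# Assumes one node per worker and 4 GPUs per node (taken from the
# lowlat_4 recipe; assumed to hold for the other topologies too).
GPUS_PER_NODE = 4


def topology_size(prefill_workers: int, decode_workers: int,
                  nodes_per_worker: int = 1) -> tuple[int, int]:
    """Return (total nodes, total GPUs) for a disaggregated topology."""
    nodes = (prefill_workers + decode_workers) * nodes_per_worker
    return nodes, nodes * GPUS_PER_NODE


# Decode worker counts of the five low-latency entries in the sweep.
for decode in (3, 5, 9, 15, 17):
    nodes, gpus = topology_size(1, decode)
    print(f"1p{decode}d: {nodes} nodes, {gpus} GPUs")
```

Under these assumptions the new 1p17d entry occupies 18 nodes, which matches the `# 1p17d. 18 nodes.` comment in the added block.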

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp4/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml

Lines changed: 3 additions & 3 deletions
@@ -158,8 +158,8 @@ backend:
  enable-flashinfer-allreduce-fusion: true

  moe-runner-backend: "flashinfer_trtllm"
- max-running-requests: 4096
- cuda-graph-max-bs: 4096
+ max-running-requests: 128
+ cuda-graph-max-bs: 128



@@ -171,5 +171,5 @@ benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
- concurrencies: "1024"
+ concurrencies: "128"
  req_rate: "inf"

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp4/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml

Lines changed: 3 additions & 3 deletions
@@ -158,8 +158,8 @@ backend:
  enable-flashinfer-allreduce-fusion: true

  moe-runner-backend: "flashinfer_trtllm"
- max-running-requests: 4096
- cuda-graph-max-bs: 4096
+ max-running-requests: 64
+ cuda-graph-max-bs: 64



@@ -171,5 +171,5 @@ benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
- concurrencies: "1024"
+ concurrencies: "64"
  req_rate: "inf"

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp4/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml

Lines changed: 3 additions & 3 deletions
@@ -158,8 +158,8 @@ backend:
  enable-flashinfer-allreduce-fusion: true

  moe-runner-backend: "flashinfer_trtllm"
- max-running-requests: 4096
- cuda-graph-max-bs: 4096
+ max-running-requests: 32
+ cuda-graph-max-bs: 32



@@ -171,5 +171,5 @@ benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
- concurrencies: "1024"
+ concurrencies: "32"
  req_rate: "inf"

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp4/8k1k/disagg/stp/8k1k_stp_lowlat_3.yaml

Lines changed: 3 additions & 3 deletions
@@ -158,8 +158,8 @@ backend:
  enable-flashinfer-allreduce-fusion: true

  moe-runner-backend: "flashinfer_trtllm"
- max-running-requests: 4096
- cuda-graph-max-bs: 4096
+ max-running-requests: 16
+ cuda-graph-max-bs: 16



@@ -171,5 +171,5 @@ benchmark:
  type: "sa-bench"
  isl: 8192
  osl: 1024
- concurrencies: "1024"
+ concurrencies: "16"
  req_rate: "inf"
Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
+ name: "gb300-fp4-glm5_8k1k_lowlat_4"
+
+ # Ported from upstream srt-slurm recipes/gb300-fp4/glm5.yaml (PR #152).
+ # Upstream uses a single combined file with `zip_override_*` arrays
+ # expanded by srtctl across zip indices. We split into one flat yaml
+ # per concrete topology to match the InferenceX dsv4 sglang convention
+ # (see ../deepseek-v4/8k1k/*.yaml). All shared base envs and the
+ # prefill sglang_config are inlined here verbatim from the upstream
+ # `base:` block; the decode block is the upstream base plus the
+ # topology-specific override from this zip index.
+
+ model:
+ path: "glm-5-fp4"
+ container: "lmsysorg/sglang:v0.5.11-cu130"
+ precision: "fp4"
+
+ # Released dynamo wheel — upstream recipe uses dynamo.version: "1.1.0".
+ # launch_gb300-cw.sh stages /configs/dynamo-wheels for `hash:` source
+ # builds (dsv4 path); the version path uses a released wheel and does
+ # not depend on that cache.
+ dynamo:
+ version: "1.1.0"
+
+ slurm:
+ time_limit: "03:00:00"
+
+ # Mirror dsv4 sglang recipes: cpus-per-task=144 avoids the 1-CPU
+ # default that turns dynamo install + sglang weight load into a serial
+ # crawl; mem=0 grants whole-node memory.
+ sbatch_directives:
+ cpus-per-task: "144"
+ mem: "0"
+
+ resources:
+ gpu_type: "gb300"
+ gpus_per_node: 4
+ prefill_nodes: 1
+ prefill_workers: 1
+ gpus_per_prefill: 4
+ decode_nodes: 17
+ decode_workers: 17
+ gpus_per_decode: 4
+
+ frontend:
+ type: dynamo
+
+ backend:
+ type: sglang
+
+ prefill_environment:
+ TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+ PYTHONUNBUFFERED: "1"
+ DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+ SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+ SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+ SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+ MC_TE_METRIC: "true"
+ MC_FORCE_MNNVL: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+ SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+ SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+
+ decode_environment:
+ TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
+ PYTHONUNBUFFERED: "1"
+ DYN_SKIP_SGLANG_LOG_FORMATTING: "1"
+ SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: "100000"
+ SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: "100000"
+ SGLANG_DISAGGREGATION_WAITING_TIMEOUT: "100000"
+ MC_TE_METRIC: "true"
+ MC_FORCE_MNNVL: "1"
+ NCCL_MNNVL_ENABLE: "1"
+ NCCL_CUMEM_ENABLE: "1"
+ SGLANG_MOONCAKE_CUSTOM_MEM_POOL: "True"
+ SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: "0"
+ SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: "1"
+ SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: "512"
+ SGLANG_MOE_NVFP4_DISPATCH: "1"
+
+ sglang_config:
+ prefill:
+ # Model configuration
+ served-model-name: "GLM-5-FP4"
+ trust-remote-code: true
+ quantization: "modelopt_fp4"
+ kv-cache-dtype: "fp8_e4m3"
+
+ # Disaggregation mode
+ disaggregation-mode: "prefill"
+ disaggregation-transfer-backend: "nixl"
+
+ # Size limits
+ max-running-requests: 256
+ cuda-graph-max-bs: 256
+ mem-fraction-static: 0.7
+ context-length: 9600
+ chunked-prefill-size: 32768
+ max-prefill-tokens: 8192
+
+ # Parallelism
+ tensor-parallel-size: 4
+ data-parallel-size: 4
+ expert-parallel-size: 1
+ enable-dp-attention: true
+ enable-dp-lm-head: true
+ load-balance-method: "total_tokens"
+
+ # Backend
+ nsa-decode-backend: "trtllm"
+ nsa-prefill-backend: "trtllm"
+ moe-runner-backend: "flashinfer_trtllm"
+ fp4-gemm-backend: "flashinfer_cutlass"
+
+ # Other flags
+ # disable-shared-experts-fusion: true
+ enable-flashinfer-allreduce-fusion: true
+ disable-radix-cache: true
+ weight-loader-prefetch-checkpoints: true
+ model-loader-extra-config: '{"enable_multithread_load": true}'
+
+ decode:
+ # Model configuration
+ served-model-name: "GLM-5-FP4"
+ trust-remote-code: true
+
+ quantization: "modelopt_fp4"
+ kv-cache-dtype: "fp8_e4m3"
+
+ # Disaggregation mode
+ disaggregation-mode: "decode"
+ disaggregation-transfer-backend: "nixl"
+
+ # Memory and token limits
+ mem-fraction-static: 0.8
+ context-length: 9600
+
+ # Backend
+ nsa-decode-backend: "trtllm"
+ nsa-prefill-backend: "trtllm"
+ moe-runner-backend: "flashinfer_cutedsl"
+ fp4-gemm-backend: "flashinfer_cutlass"
+
+ # Detokenizer
+ skip-tokenizer-init: true
+ stream-interval: 30
+
+ # Other flags
+ # disable-shared-experts-fusion: true
+ disable-radix-cache: true
+ weight-loader-prefetch-checkpoints: true
+ model-loader-extra-config: '{"enable_multithread_load": true}'
+ # Parallelism (override from upstream zip_override_*_lowlat)
+ tensor-parallel-size: 4
+ expert-parallel-size: 1
+ data-parallel-size: 1
+ enable-flashinfer-allreduce-fusion: true
+
+ moe-runner-backend: "flashinfer_trtllm"
+ max-running-requests: 1
+ cuda-graph-max-bs: 1
+
+
+
+ health_check:
+ max_attempts: 360
+ interval_seconds: 10
+
+ benchmark:
+ type: "sa-bench"
+ isl: 8192
+ osl: 1024
+ concurrencies: "12"
+ req_rate: "inf"
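
The header comment of this recipe says upstream keeps one combined file whose `zip_override_*` arrays are expanded across zip indices, while this repository stores one flat file per index. A rough Python sketch of what such an expansion might look like follows. It is an illustration only: the function `expand_zip_overrides`, the flat key layout, and the `base` dictionary are assumptions, and srtctl's actual schema and merge logic are not shown in this commit.

```python
import copy


def expand_zip_overrides(base: dict, zip_overrides: dict) -> list[dict]:
    """Hypothetical expansion: build one flat config per zip index.

    Every override array must have the same length; index i of each
    array is written into a deep copy of the base config.
    """
    lengths = {len(values) for values in zip_overrides.values()}
    if len(lengths) != 1:
        raise ValueError("all zip_override arrays must have the same length")
    configs = []
    for i in range(lengths.pop()):
        cfg = copy.deepcopy(base)  # never mutate the shared base
        for key, values in zip_overrides.items():
            cfg[key] = values[i]
        configs.append(cfg)
    return configs


# Values taken from the five low-latency recipes (lowlat_0..4);
# the flat key names here are illustrative, not the real schema.
base = {"tensor-parallel-size": 4, "expert-parallel-size": 1}
overrides = {
    "decode_workers": [3, 5, 9, 15, 17],
    "max-running-requests": [128, 64, 32, 16, 1],
    "cuda-graph-max-bs": [128, 64, 32, 16, 1],
    "concurrencies": ["128", "64", "32", "16", "12"],
}
recipes = expand_zip_overrides(base, overrides)
```

In this sketch `recipes[4]` corresponds to the lowlat_4 file added here (17 decode workers, `max-running-requests: 1`, concurrency `"12"`). Splitting into flat files trades the compactness of the zipped form for recipes that can be read and diffed one topology at a time.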

perf-changelog.yaml

Lines changed: 6 additions & 0 deletions
@@ -3214,3 +3214,9 @@
  description:
  - "Update vLLM image tag to v0.22.0"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1384
+
+ - config-keys:
+ - glm5-fp4-gb300-dynamo-sglang
+ description:
+ - "Update GB300 FP4 GLM-5 8k1k low-latency sweep to mirror NVIDIA/srt-slurm#175: add a 5th 1p17d topology (decode_nodes/workers=17), and lower decode max-running-requests / cuda-graph-max-bs / benchmark concurrency per-zip-index from a flat 4096/1024 to 128/64/32/16/1 (mrr & cuda-graph) and 128/64/32/16/12 (concurrency)"
+ pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1583
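
As a sanity check on the numbers in this changelog entry, the sketch below compares each benchmark concurrency against the aggregate decode capacity of its topology. It assumes `max-running-requests` applies per decode worker (decode uses `data-parallel-size: 1` in the recipe shown) and that lowlat_0..3 map to the 1p3d, 1p5d, 1p9d, and 1p15d entries in that order, which is inferred from the matching concurrency values rather than stated in the diff. The names `SWEEP` and `check_sweep` are hypothetical.

```python
# (recipe suffix, decode workers, max-running-requests, benchmark concurrency)
# Worker counts for lowlat_0..3 are inferred from the 1p{3,5,9,15}d entries.
SWEEP = [
    ("lowlat_0", 3, 128, 128),
    ("lowlat_1", 5, 64, 64),
    ("lowlat_2", 9, 32, 32),
    ("lowlat_3", 15, 16, 16),
    ("lowlat_4", 17, 1, 12),
]


def check_sweep(sweep):
    """Return (name, aggregate decode slots, concurrency, fits) per recipe.

    Aggregate slots = decode workers * max-running-requests, assuming the
    limit is enforced independently by each decode worker.
    """
    rows = []
    for name, workers, mrr, concurrency in sweep:
        slots = workers * mrr
        rows.append((name, slots, concurrency, slots >= concurrency))
    return rows


for name, slots, concurrency, fits in check_sweep(SWEEP):
    print(f"{name}: {slots} decode slots for concurrency {concurrency} (fits={fits})")
```

Under these assumptions every recipe has at least as many decode slots as concurrent requests. The lowlat_4 point is the tightest: 17 workers limited to one running request each give 17 slots for a concurrency of 12.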
