Commit 4587b6e

Use TEP for Qwen H100 high concurrency
1 parent c3b92eb commit 4587b6e

3 files changed

Lines changed: 8 additions & 6 deletions

.github/configs/nvidia-master.yaml

Lines changed: 2 additions & 2 deletions
@@ -9210,13 +9210,13 @@ qwen3.5-fp8-h100-sglang:
     search-space:
     - { tp: 8, ep: 1, conc-start: 1, conc-end: 8 }
     - { tp: 8, ep: 8, conc-start: 16, conc-end: 64 }
-    - { tp: 8, ep: 8, dp-attn: true, conc-start: 128, conc-end: 256 }
+    - { tp: 8, ep: 8, conc-start: 128, conc-end: 256 }
   - isl: 8192
     osl: 1024
     search-space:
     - { tp: 8, ep: 1, conc-start: 1, conc-end: 8 }
     - { tp: 8, ep: 8, conc-start: 16, conc-end: 64 }
-    - { tp: 8, ep: 8, dp-attn: true, conc-start: 128, conc-end: 256 }
+    - { tp: 8, ep: 8, conc-start: 128, conc-end: 256 }
 
 qwen3.5-fp8-h100-sglang-mtp:
   image: lmsysorg/sglang:v0.5.12-cu130
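
After this change, the three search-space entries pick the parallelism layout from the concurrency range alone, with no DP-attention variant. A minimal sketch of that lookup, assuming a hypothetical helper named `parallelism_for_conc` (the name and the `tp=… ep=…` output format are illustrative and do not appear in the repository):

```shell
#!/usr/bin/env bash

# Hypothetical helper, for illustration only: maps a concurrency level to the
# tp/ep pair listed in the updated search-space entries.
parallelism_for_conc() {
    local conc="$1"
    if [ "$conc" -ge 1 ] && [ "$conc" -le 8 ]; then
        # conc-start: 1, conc-end: 8
        echo "tp=8 ep=1"
    elif [ "$conc" -ge 16 ] && [ "$conc" -le 256 ]; then
        # conc-start: 16 .. conc-end: 64 and conc-start: 128 .. conc-end: 256
        # now share the same layout.
        echo "tp=8 ep=8"
    else
        echo "CONC=$conc is outside the search-space ranges" >&2
        return 1
    fi
}
```

Calling `parallelism_for_conc 128` prints `tp=8 ep=8`. Values between the listed ranges, such as 12, are rejected rather than rounded.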

benchmarks/single_node/qwen3.5_fp8_h100.sh

Lines changed: 4 additions & 2 deletions
@@ -1,8 +1,7 @@
 #!/usr/bin/env bash
 
 # Qwen-3.5-397B-A17B FP8 on H100 via sglang.
-# Uses TP8/EP1 at conc 1-8, TP8/EP8 at conc 16-64,
-# and TP8/EP8 with DP attention at conc 128-256.
+# Uses TP8/EP1 at conc 1-8 and TP8/EP8 at conc 16-256.
 
 source "$(dirname "$0")/../benchmark_lib.sh"

@@ -56,6 +55,9 @@ if [ "${DP_ATTENTION}" != "true" ]; then
         64)
             SCHEDULER_RECV_INTERVAL=600
             ;;
+        128|256)
+            SCHEDULER_RECV_INTERVAL=1920
+            ;;
         *)
             echo "Unsupported CONC=$CONC for qwen3.5 FP8 H100 SGLang recipe" >&2
             exit 1
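
The hunk shows only the tail of the `case` statement. The full mapping can be sketched from the interval values listed in the perf-changelog.yaml entry (2/60/30/1200/600/1920 for conc 1-4/8/16/32/64/128-256). This is an assumption-laden reconstruction, not the script's actual code: the function name `recv_interval_for_conc` is hypothetical, and the sketch prints the value instead of assigning `SCHEDULER_RECV_INTERVAL`.

```shell
#!/usr/bin/env bash

# Hypothetical helper, for illustration only: prints the scheduler-recv-interval
# that the changelog entry lists for each concurrency level.
recv_interval_for_conc() {
    local conc="$1"
    case "$conc" in
        [1-4])   echo 2 ;;      # conc 1-4
        8)       echo 60 ;;
        16)      echo 30 ;;
        32)      echo 1200 ;;
        64)      echo 600 ;;
        128|256) echo 1920 ;;   # branch added by this commit
        *)
            echo "Unsupported CONC=$conc" >&2
            return 1
            ;;
    esac
}
```

For example, `recv_interval_for_conc 256` prints `1920`, while an unlisted value such as 12 fails with a non-zero status, mirroring the recipe's `exit 1` fallback.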

perf-changelog.yaml

Lines changed: 2 additions & 2 deletions
@@ -3205,7 +3205,7 @@
     - qwen3.5-fp8-h100-sglang
   description:
     - "Tune Qwen3.5-397B-A17B-FP8 H100 SGLang aggregate recipe for 1k/1k and 8k/1k sweeps"
-    - "Use TP8/EP1 for conc 1-8, TP8/EP8 for conc 16-64, and TP8/EP8 DP-attention for conc 128-256"
-    - "Use scheduler-recv-interval values 2/60/30/1200/600 for non-DP conc 1-4/8/16/32/64"
+    - "Use TP8/EP1 for conc 1-8 and TP8/EP8 for conc 16-256"
+    - "Use scheduler-recv-interval values 2/60/30/1200/600/1920 for conc 1-4/8/16/32/64/128-256"
     - "Set max-running-requests=256, chunked-prefill-size=16384, mem-fraction-static=0.8, cuda-graph-max-bs=CONC, and enable symm-mem"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1544
