Commit ae59ce9

[OMNIML-4788] tools/launcher: bump qualitative concurrency to 32, throughput_32k to 80 samples @ concurrency=8
Prior config (concurrency=8 on qualitative, --num_requests 20 + concurrency=4 on throughput_32k) was conservative-tuned for time-budget headroom. With tp_size=2 in place the KV budget is doubled, so we can push concurrency further:

  task_0 (qualitative): concurrency 8 -> 32 (still tp_size=2)
  task_1 (throughput_32k): concurrency 4 -> 8, --num_requests 20 -> 80 (still tp_size=2)

AL is concurrency-independent; the bump only sacrifices aa_timing fidelity. 8 * 32K = 256K tokens of in-flight KV stays within the doubled KV budget on tp_size=2.

Signed-off-by: chenhany <chenhany@nvidia.com>
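
The in-flight KV arithmetic quoted in the message can be checked with a small sketch. It assumes that "32K" means 32 * 1024 prompt tokens per request and that every concurrency slot is occupied at once; the helper name `in_flight_kv_tokens` is hypothetical and is not part of the launcher or of specdec_bench.

```python
K = 1024  # assumption: "32K" in the commit message means 32 * 1024 tokens


def in_flight_kv_tokens(concurrency: int, prompt_tokens: int, output_tokens: int = 0) -> int:
    """Upper bound on tokens held in the KV cache when all slots are busy.

    Hypothetical helper for illustration only.
    """
    return concurrency * (prompt_tokens + output_tokens)


# Prompt-only bound, as stated in the commit message: 8 * 32K.
prompt_only = in_flight_kv_tokens(concurrency=8, prompt_tokens=32 * K)

# Same bound if each request also generates the full --output_length 4096.
with_output = in_flight_kv_tokens(concurrency=8, prompt_tokens=32 * K, output_tokens=4096)

print(f"prompt only: {prompt_only // K}K tokens")
print(f"with output: {with_output // K}K tokens")
```

The prompt-only bound reproduces the 256K figure. Counting the full `--output_length 4096` per request would raise the worst case to 288K; whether either value fits depends on the actual KV capacity at tp_size=2, which the commit does not state.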
1 parent 5c24516 commit ae59ce9

2 files changed

Lines changed: 14 additions & 14 deletions

tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench.yaml

Lines changed: 7 additions & 7 deletions
@@ -32,7 +32,7 @@ pipeline:
  hf_model: /hf-local/Qwen/Qwen3.5-4B
  
  # Step 1: qualitative split — quality / acceptance-rate numbers.
- # tp_size=2 + concurrency=8 trades aa_timing fidelity for ~10x wall-clock
+ # tp_size=2 + concurrency=32 trades aa_timing fidelity for ~30x wall-clock
  # speedup; acceptance-length (AL) is concurrency-independent and is the
  # primary metric we care about for this split.
  task_0:
@@ -44,7 +44,7 @@ pipeline:
  - --speculative_algorithm NONE
  - --tp_size 2
  - --ep_size 1
- - --concurrency 8
+ - --concurrency 32
  - --output_length 4096
  - --aa_timing
  - --show_progress
@@ -60,10 +60,10 @@ pipeline:
  container: vllm/vllm-openai:qwen3_5-cu130
  
  # Step 2: throughput_32k split — long-context throughput.
- # `--num_requests 20` caps the run at 20 samples (split has 1,536) so it fits
+ # `--num_requests 80` caps the run at 80 samples (split has 1,536) so it fits
  # in the 4h Slurm time-limit; each 32K-input sample takes ~60-90s.
- # tp_size=2 doubles the KV-cache budget across 2 GPUs, making concurrency>1
- # feasible at 32K prompts.
+ # tp_size=2 doubles the KV-cache budget across 2 GPUs; concurrency=8 keeps
+ # 8 * 32K = 256K tokens of in-flight KV under that doubled budget.
  task_1:
  script: common/specdec_bench/run.sh
  args:
@@ -73,8 +73,8 @@ pipeline:
  - --speculative_algorithm NONE
  - --tp_size 2
  - --ep_size 1
- - --concurrency 4
- - --num_requests 20
+ - --concurrency 8
+ - --num_requests 80
  - --output_length 4096
  - --runtime_params modules/Model-Optimizer/tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml
  - --aa_timing

tools/launcher/examples/Qwen/Qwen3.5-4B/specdec_bench_mtp.yaml

Lines changed: 7 additions & 7 deletions
@@ -25,7 +25,7 @@ pipeline:
  hf_model: /hf-local/Qwen/Qwen3.5-4B
  
  # Step 1: qualitative split — quality / acceptance-rate numbers with MTP draft=3.
- # tp_size=2 + concurrency=8 trades aa_timing fidelity for ~10x wall-clock
+ # tp_size=2 + concurrency=32 trades aa_timing fidelity for ~30x wall-clock
  # speedup; acceptance-length (AL) is concurrency-independent and is the
  # primary metric we care about for this split.
  task_0:
@@ -38,7 +38,7 @@ pipeline:
  - --draft_length 3
  - --tp_size 2
  - --ep_size 1
- - --concurrency 8
+ - --concurrency 32
  - --output_length 4096
  - --aa_timing
  - --show_progress
@@ -69,10 +69,10 @@ pipeline:
  container: vllm/vllm-openai:qwen3_5-cu130
  
  # Step 2: throughput_32k split — long-context throughput with MTP draft=3.
- # `--num_requests 20` caps the run at 20 samples (split has 1,536) so it fits
+ # `--num_requests 80` caps the run at 80 samples (split has 1,536) so it fits
  # in the 4h Slurm time-limit; each 32K-input sample takes ~60-90s.
- # tp_size=2 doubles the KV-cache budget across 2 GPUs, making concurrency>1
- # feasible at 32K prompts.
+ # tp_size=2 doubles the KV-cache budget across 2 GPUs; concurrency=8 keeps
+ # 8 * 32K = 256K tokens of in-flight KV under that doubled budget.
  task_1:
  script: common/specdec_bench/run.sh
  args:
@@ -83,8 +83,8 @@ pipeline:
  - --draft_length 3
  - --tp_size 2
  - --ep_size 1
- - --concurrency 4
- - --num_requests 20
+ - --concurrency 8
+ - --num_requests 80
  - --output_length 4096
  - --runtime_params modules/Model-Optimizer/tools/launcher/common/specdec_bench/runtime_params_throughput_32k.yaml
  - --aa_timing
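
The YAML comments in both files say the throughput_32k run must fit in the 4h Slurm time limit, with each 32K-input sample taking roughly 60-90s. A rough sketch of that budget follows. It assumes the 60-90s figure is a per-sample latency that does not grow beyond the fully serial case under concurrency, and it ignores model load and warm-up time; `wall_clock_bounds_s` is a hypothetical helper, not part of the launcher.

```python
import math

SLURM_LIMIT_S = 4 * 3600            # 4h time limit, from the YAML comment
PER_SAMPLE_S = (60.0, 90.0)         # ~60-90s per 32K-input sample, from the YAML comment


def wall_clock_bounds_s(num_requests: int, concurrency: int,
                        per_sample_s: tuple[float, float] = PER_SAMPLE_S):
    """Return (ideal, serial) wall-clock ranges in seconds.

    ideal  : requests run in perfectly overlapping waves of `concurrency`.
    serial : no overlap at all, i.e. the pessimistic bound.
    Hypothetical helper for illustration only.
    """
    waves = math.ceil(num_requests / concurrency)
    ideal = tuple(waves * t for t in per_sample_s)
    serial = tuple(num_requests * t for t in per_sample_s)
    return ideal, serial


# New config from this commit: --num_requests 80, --concurrency 8.
ideal, serial = wall_clock_bounds_s(num_requests=80, concurrency=8)

print(f"ideal : {ideal[0] / 60:.0f}-{ideal[1] / 60:.0f} min")
print(f"serial: {serial[0] / 60:.0f}-{serial[1] / 60:.0f} min")
print(f"fits in 4h even if fully serial: {serial[1] < SLURM_LIMIT_S}")
```

Under these assumptions the 80-sample run lands between about 10 minutes (perfect overlap) and 2 hours (no overlap), so it stays inside the 4h limit either way. Real throughput under contention may differ from both bounds.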
