Commit 9ff8446

Oseltamivir and claude committed
Disable overlap scheduler for DSv4 B200 TRT (pin max_num_requests=256)
The 2dd03e6 build sizes the slot pool as max_num_requests = max_batch_size * num_micro_batches, with num_micro_batches=2 under the overlap scheduler -> 512 at --max_batch_size 256 (tensorrt_llm/_torch/pyexecutor/_util.py on feat/deepseek_v4). The older 9aa3715 build used 256.

That extra headroom pushed the conc-256 dpa=true 8k1k prefill-warmup ~0.3 GiB over B200's 178 GiB and OOM'd (run 26987679137, job 79643136619).

Setting disable_overlap_scheduler: true makes num_micro_batches=1 -> max_num_requests=256, matching the 9aa3715 footprint that fit conc-256 on B200.

Trade-off: this turns off the overlap scheduler (a throughput optimization), so these B200 numbers are not directly comparable to overlap-on configs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
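The sizing arithmetic described in the commit message can be sketched as follows. This is a minimal illustration, not the code in tensorrt_llm/_torch/pyexecutor/_util.py: the function name `slot_pool_size` and its parameters are hypothetical, and it models only the relationship stated above (two micro-batches with the overlap scheduler on, one with it off).

```python
def slot_pool_size(max_batch_size: int, overlap_scheduler_enabled: bool) -> int:
    """Return max_num_requests as described in the commit message.

    Hypothetical helper: it mirrors only the stated formula
    max_num_requests = max_batch_size * num_micro_batches.
    """
    # Per the commit message: 2 micro-batches under the overlap scheduler, else 1.
    num_micro_batches = 2 if overlap_scheduler_enabled else 1
    return max_batch_size * num_micro_batches


if __name__ == "__main__":
    # --max_batch_size 256, as used in the failing conc-256 run.
    for enabled in (True, False):
        size = slot_pool_size(256, overlap_scheduler_enabled=enabled)
        print(f"overlap_scheduler_enabled={enabled}: max_num_requests={size}")
```

With `max_batch_size=256`, the overlap-on case yields 512 slots and the overlap-off case yields 256, which is the footprint the commit pins.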
1 parent 7baa914 commit 9ff8446

3 files changed

Lines changed: 3 additions & 0 deletions

benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_trt.sh

Lines changed: 1 addition & 0 deletions
@@ -76,6 +76,7 @@ cuda_graph_config:
 max_batch_size: $CUDA_GRAPH_MAX_BATCH_SIZE
 enable_attention_dp: $DP_ATTENTION$ATTENTION_DP_CONFIG
 print_iter_log: true
+disable_overlap_scheduler: true
 kv_cache_config:
 tokens_per_block: 128
 dtype: fp8

benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_trt_mtp.sh

Lines changed: 1 addition & 0 deletions
@@ -76,6 +76,7 @@ cuda_graph_config:
 max_batch_size: $CUDA_GRAPH_MAX_BATCH_SIZE
 enable_attention_dp: $DP_ATTENTION$ATTENTION_DP_CONFIG
 print_iter_log: true
+disable_overlap_scheduler: true
 kv_cache_config:
 tokens_per_block: 128
 dtype: fp8

perf-changelog.yaml

Lines changed: 1 addition & 0 deletions
@@ -3464,6 +3464,7 @@
 - dsv4-fp4-b200-trt-mtp
 description:
 - "Update B200 DeepSeek-V4-Pro TRT image to ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-2dd03e6 (non-MTP and MTP), replacing the older 9aa3715 build."
+- "Set disable_overlap_scheduler: true so the 2dd03e6 build's slot pool is sized max_num_requests = max_batch_size x 1 (256) instead of x2 (512) under the overlap scheduler, matching the 9aa3715 footprint that fit conc-256 on B200 (avoids the conc-256 dpa=true prefill-warmup OOM)."
 pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1664

 - config-keys:
