Disable overlap scheduler for DSv4 B200 TRT (pin max_num_requests=256)

Oseltamivir · claude · Oseltamivir · commit 9ff8446dbedd · 2026-06-05T00:00:28.000-07:00
The 2dd03e6 build sizes the slot pool as max_num_requests = max_batch_size *
num_micro_batches, with num_micro_batches=2 under the overlap scheduler -> 512
at --max_batch_size 256 (tensorrt_llm/_torch/pyexecutor/_util.py on
feat/deepseek_v4). The older 9aa3715 build used 256. That extra headroom pushed
the conc-256 dpa=true 8k1k prefill-warmup ~0.3 GiB over B200's 178 GiB and OOM'd
(run 26987679137, job 79643136619).

Setting disable_overlap_scheduler: true makes num_micro_batches=1 ->
max_num_requests=256, matching the 9aa3715 footprint that fit conc-256 on B200.
Trade-off: turns off the overlap scheduler (throughput optimization), so these
B200 numbers are not directly comparable to overlap-on configs.
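
A minimal sketch of the sizing arithmetic described above, for illustration
only. The function name slot_pool_size is hypothetical and this is not the
code in _util.py; it just restates max_num_requests = max_batch_size *
num_micro_batches, assuming two micro-batches with the overlap scheduler on
and one with it off.

```python
def slot_pool_size(max_batch_size: int, disable_overlap_scheduler: bool) -> int:
    """Illustrative restatement of the slot-pool sizing rule in this message.

    Assumption: num_micro_batches is 2 when the overlap scheduler is enabled
    and 1 when it is disabled, as described in the commit message.
    """
    num_micro_batches = 1 if disable_overlap_scheduler else 2
    return max_batch_size * num_micro_batches


# Overlap scheduler on (2dd03e6 default): twice the batch size.
print(slot_pool_size(256, disable_overlap_scheduler=False))  # 512
# Overlap scheduler off (this change): equal to the batch size.
print(slot_pool_size(256, disable_overlap_scheduler=True))   # 256
```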

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_trt.sh b/benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_trt.sh
@@ -76,6 +76,7 @@ cuda_graph_config:
     max_batch_size: $CUDA_GRAPH_MAX_BATCH_SIZE
 enable_attention_dp: $DP_ATTENTION$ATTENTION_DP_CONFIG
 print_iter_log: true
+disable_overlap_scheduler: true
 kv_cache_config:
     tokens_per_block: 128
     dtype: fp8
diff --git a/benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_trt_mtp.sh b/benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_trt_mtp.sh
@@ -76,6 +76,7 @@ cuda_graph_config:
     max_batch_size: $CUDA_GRAPH_MAX_BATCH_SIZE
 enable_attention_dp: $DP_ATTENTION$ATTENTION_DP_CONFIG
 print_iter_log: true
+disable_overlap_scheduler: true
 kv_cache_config:
     tokens_per_block: 128
     dtype: fp8
diff --git a/perf-changelog.yaml b/perf-changelog.yaml
@@ -3464,6 +3464,7 @@
     - dsv4-fp4-b200-trt-mtp
   description:
     - "Update B200 DeepSeek-V4-Pro TRT image to ghcr.io/semianalysisai/trtllm-deepseek-v4:feat-deepseek_v4-2dd03e6 (non-MTP and MTP), replacing the older 9aa3715 build."
+    - "Set disable_overlap_scheduler: true so the 2dd03e6 build's slot pool is sized max_num_requests = max_batch_size x 1 (256) instead of x2 (512) under the overlap scheduler, matching the 9aa3715 footprint that fit conc-256 on B200 (avoids the conc-256 dpa=true prefill-warmup OOM)."
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1664
 
 - config-keys: