Commit f481793

[NV] llm-d: shrink DSR1 H200 prefill workspace to fit 140 GB HBM
Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
1 parent 8aa1c18 commit f481793

1 file changed

Lines changed: 10 additions & 2 deletions

benchmarks/multi_node/llm-d-recipes/dsr1-fp8-h200-1p1d-simple.yaml

@@ -76,11 +76,19 @@ dataLayer:
 # server.sh. The cross-node DP coordination flags
 # (--data-parallel-hybrid-lb, --data-parallel-size-local, etc.) are NOT
 # emitted because LWS_GROUP_SIZE = PREFILL_NODES = DECODE_NODES = 1.
+# Prefill tuning (H200, 140 GB HBM):
+# Peer 1P+1D DSv4 vLLM disagg recipes (b200-low-latency, b300-low-latency,
+# gb200-low-latency in benchmarks/multi_node/srt-slurm-recipes/vllm/
+# deepseek-v4/8k1k/) run prefill at DP=8 EP=8 with 32K batched-tokens and
+# gpu-memory-utilization 0.80, but only on >=180 GB HBM cards. On H200
+# the peer 32K shape OOMs the fused-MoE workspace allocator
+# (vllm/v1/worker/workspace.py). Drop batched-tokens to 8K and align
+# headroom to the peer 0.80 to fit the workspace inside H200 HBM.
 prefill:
   extra-args: >-
-    --gpu-memory-utilization 0.85
+    --gpu-memory-utilization 0.80
     --kv-cache-dtype fp8
-    --max-num-batched-tokens 32768
+    --max-num-batched-tokens 8192
     --max-num-seqs 16
     --block-size 256
     --no-enable-prefix-caching
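
The arithmetic behind the two flag changes can be sketched as follows. This is a rough illustration, not part of the commit: it uses the 140 GB HBM figure quoted in the comment, treats "headroom" as the fraction of HBM outside `--gpu-memory-utilization`, and assumes (not stated in the commit) that the fused-MoE workspace grows roughly linearly with `--max-num-batched-tokens`. The helper names `headroom_gb` and `token_scale` are hypothetical.

```python
# Illustrative sketch only. 140 GB is the HBM size quoted in the commit comment;
# the linear workspace-vs-tokens scaling is an assumption, not a measured result.
HBM_GB = 140.0


def headroom_gb(gpu_memory_utilization: float, hbm_gb: float = HBM_GB) -> float:
    """HBM (in GB) left outside the fraction given by --gpu-memory-utilization."""
    return hbm_gb * (1.0 - gpu_memory_utilization)


def token_scale(old_tokens: int, new_tokens: int) -> float:
    """Factor by which --max-num-batched-tokens shrinks."""
    return old_tokens / new_tokens


before = headroom_gb(0.85)  # setting removed by this commit
after = headroom_gb(0.80)   # setting added by this commit

print(f"headroom at 0.85: {before:.1f} GB")
print(f"headroom at 0.80: {after:.1f} GB")
print(f"batched-token reduction: {token_scale(32768, 8192):.0f}x")
```

Under these assumptions the change frees about 7 GB of additional headroom per GPU and cuts the per-step token batch by a factor of four. Whether the workspace is drawn from the utilization budget or from the headroom is not stated in the commit, so the numbers only bound the change and do not predict actual allocator behavior.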
