Commit 170ee9f

[NV] llm-d: raise H200 prefill gmu to 0.90 and cap max-model-len 16K
Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
1 parent f481793 commit 170ee9f

1 file changed

Lines changed: 10 additions & 3 deletions

benchmarks/multi_node/llm-d-recipes/dsr1-fp8-h200-1p1d-simple.yaml

@@ -82,14 +82,20 @@ dataLayer:
 # deepseek-v4/8k1k/) run prefill at DP=8 EP=8 with 32K batched-tokens and
 # gpu-memory-utilization 0.80, but only on >=180 GB HBM cards. On H200
 # the peer 32K shape OOMs the fused-MoE workspace allocator
-# (vllm/v1/worker/workspace.py). Drop batched-tokens to 8K and align
-# headroom to the peer 0.80 to fit the workspace inside H200 HBM.
+# (vllm/v1/worker/workspace.py). Drop batched-tokens to 8K so the workspace
+# fits, then move gpu-memory-utilization to 0.90 (matching the decode role
+# and the gb300-4p1d / gb300-1p6d peers) so KV cache has positive headroom
+# after weights+workspace+cudagraphs - at 0.80 vLLM reports
+# "Available KV cache memory: -3.23 GiB". Cap max-model-len at 16384 to
+# align with peer recipes and keep the KV manager from provisioning for
+# DSR1's 128K default context.
 prefill:
   extra-args: >-
-    --gpu-memory-utilization 0.80
+    --gpu-memory-utilization 0.90
     --kv-cache-dtype fp8
     --max-num-batched-tokens 8192
     --max-num-seqs 16
+    --max-model-len 16384
     --block-size 256
     --no-enable-prefix-caching
   env: {}
@@ -100,6 +106,7 @@ decode:
     --kv-cache-dtype fp8
     --max-num-batched-tokens 256
     --max-num-seqs 256
+    --max-model-len 16384
     --block-size 256
     --no-enable-prefix-caching
   env: {}
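
The reasoning in the recipe comment can be checked with some rough arithmetic. The sketch below is not vLLM code. It assumes a simplified model in which the KV cache budget is `gpu-memory-utilization * total HBM` minus everything else (weights, workspace, CUDA graphs), and that this non-KV footprint stays the same when the utilization changes. `TOTAL_HBM_GIB` is a hypothetical placeholder based on the nominal H200 capacity, and "128K" is read as 128 * 1024 tokens; substitute the values the driver and the model config actually report.

```python
import math

# Hypothetical placeholder: nominal H200 capacity. The figure the driver
# reports may differ, so treat this as an assumption.
TOTAL_HBM_GIB = 141.0


def non_kv_from_report(total_hbm_gib, gpu_memory_utilization, reported_kv_gib):
    """Infer the non-KV footprint (weights + workspace + CUDA graphs) from a
    reported "Available KV cache memory" figure, under the simplified model."""
    return total_hbm_gib * gpu_memory_utilization - reported_kv_gib


def kv_budget_gib(total_hbm_gib, gpu_memory_utilization, non_kv_gib):
    """KV cache budget under the simplified model: the utilization cap minus
    everything that is not KV cache."""
    return total_hbm_gib * gpu_memory_utilization - non_kv_gib


def blocks_per_sequence(max_model_len, block_size):
    """Number of KV blocks a single full-length sequence can occupy."""
    return math.ceil(max_model_len / block_size)


if __name__ == "__main__":
    # -3.23 GiB at 0.80 is the figure quoted in the recipe comment.
    non_kv = non_kv_from_report(TOTAL_HBM_GIB, 0.80, -3.23)
    print("estimated non-KV footprint (GiB):", round(non_kv, 2))
    print("estimated KV budget at 0.90 (GiB):",
          round(kv_budget_gib(TOTAL_HBM_GIB, 0.90, non_kv), 2))

    # Effect of capping max-model-len with --block-size 256.
    print("blocks per sequence at 16384:", blocks_per_sequence(16384, 256))
    print("blocks per sequence at 128K:", blocks_per_sequence(128 * 1024, 256))
```

Under these assumptions, raising the cap by 0.10 frees roughly a tenth of total HBM for KV cache, which is enough to turn the negative figure positive. The 16384 cap also reduces the worst-case blocks per sequence by a factor of eight relative to a 128K context.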
