benchmarks/multi_node/llm-d-recipes
@@ -82,14 +82,20 @@ dataLayer:
 # deepseek-v4/8k1k/) run prefill at DP=8 EP=8 with 32K batched-tokens and
 # gpu-memory-utilization 0.80, but only on >=180 GB HBM cards. On H200
 # the peer 32K shape OOMs the fused-MoE workspace allocator
-# (vllm/v1/worker/workspace.py). Drop batched-tokens to 8K and align
-# headroom to the peer 0.80 to fit the workspace inside H200 HBM.
+# (vllm/v1/worker/workspace.py). Drop batched-tokens to 8K so the workspace
+# fits, then move gpu-memory-utilization to 0.90 (matching the decode role
+# and the gb300-4p1d / gb300-1p6d peers) so KV cache has positive headroom
+# after weights+workspace+cudagraphs - at 0.80 vLLM reports
+# "Available KV cache memory: -3.23 GiB". Cap max-model-len at 16384 to
+# align with peer recipes and keep the KV manager from provisioning for
+# DSR1's 128K default context.
 prefill:
   extra-args: >-
-    --gpu-memory-utilization 0.80
+    --gpu-memory-utilization 0.90
     --kv-cache-dtype fp8
     --max-num-batched-tokens 8192
     --max-num-seqs 16
+    --max-model-len 16384
     --block-size 256
     --no-enable-prefix-caching
   env: {}
@@ -100,6 +106,7 @@ decode:
     --kv-cache-dtype fp8
     --max-num-batched-tokens 256
     --max-num-seqs 256
+    --max-model-len 16384
     --block-size 256
     --no-enable-prefix-caching
   env: {}
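
The headroom argument in the prefill comment can be checked with a little arithmetic. The sketch below uses a deliberately simplified budget model (KV cache memory = total HBM x gpu-memory-utilization, minus everything else), which is an assumption about how the reported figure is derived, not a copy of vLLM's accounting. The function names and the 141 GiB capacity are hypothetical placeholders; only the 0.80, 0.90 and -3.23 GiB figures come from the diff. It also assumes the non-KV footprint (weights, workspace, CUDA graphs) stays the same when utilization changes.

```python
def kv_cache_headroom_gib(total_gib: float, utilization: float, non_kv_gib: float) -> float:
    """Memory left for KV cache under the simplified budget model."""
    return total_gib * utilization - non_kv_gib


def non_kv_from_report(total_gib: float, utilization: float, reported_kv_gib: float) -> float:
    """Back out the non-KV footprint from a reported KV cache figure."""
    return total_gib * utilization - reported_kv_gib


# ASSUMPTION: nominal H200 capacity, treated as GiB for illustration only.
TOTAL_GIB = 141.0

# From the diff: at 0.80 the reported available KV cache memory is -3.23 GiB.
non_kv = non_kv_from_report(TOTAL_GIB, 0.80, -3.23)

# Raising utilization to 0.90 adds 10% of total HBM to the budget.
headroom_at_090 = kv_cache_headroom_gib(TOTAL_GIB, 0.90, non_kv)

print(f"non-KV footprint (inferred): {non_kv:.2f} GiB")
print(f"KV headroom at 0.90:         {headroom_at_090:.2f} GiB")
```

Under these assumptions the headroom turns positive at 0.90, because the extra 10% of HBM exceeds the 3.23 GiB shortfall. The exact margin depends on the real device capacity and on how the non-KV footprint shifts at runtime.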