benchmarks/multi_node/llm-d-recipes
@@ -76,11 +76,19 @@ dataLayer:
 # server.sh. The cross-node DP coordination flags
 # (--data-parallel-hybrid-lb, --data-parallel-size-local, etc.) are NOT
 # emitted because LWS_GROUP_SIZE = PREFILL_NODES = DECODE_NODES = 1.
+# Prefill tuning (H200, 140 GB HBM):
+# Peer 1P+1D DSv4 vLLM disagg recipes (b200-low-latency, b300-low-latency,
+# gb200-low-latency in benchmarks/multi_node/srt-slurm-recipes/vllm/
+# deepseek-v4/8k1k/) run prefill at DP=8 EP=8 with 32K batched-tokens and
+# gpu-memory-utilization 0.80, but only on >=180 GB HBM cards. On H200
+# the peer 32K shape OOMs the fused-MoE workspace allocator
+# (vllm/v1/worker/workspace.py). Drop batched-tokens to 8K and align
+# headroom to the peer 0.80 to fit the workspace inside H200 HBM.
 prefill:
   extra-args: >-
-    --gpu-memory-utilization 0.85
+    --gpu-memory-utilization 0.80
     --kv-cache-dtype fp8
-    --max-num-batched-tokens 32768
+    --max-num-batched-tokens 8192
     --max-num-seqs 16
     --block-size 256
     --no-enable-prefix-caching
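
For reviewers who want the numbers behind the comment, here is a rough sketch of the arithmetic only. It assumes the 140 GB and 180 GB capacities quoted in the diff comment, and it treats `--gpu-memory-utilization` as a plain fraction of total HBM. The helper name `memory_split` is hypothetical, and the script does not model how vLLM actually sizes the fused-MoE workspace.

```python
# Back-of-the-envelope memory split for the values in this diff.
# Assumption: capacities are the figures quoted in the diff comment,
# not measured values; the workspace allocator itself is not modeled.

H200_HBM_GB = 140.0   # "H200, 140 GB HBM" per the diff comment
PEER_HBM_GB = 180.0   # lower bound for the peer cards per the diff comment


def memory_split(hbm_gb: float, utilization: float) -> tuple[float, float]:
    """Return (budget_gb, headroom_gb) for a card at a given utilization fraction."""
    budget_gb = hbm_gb * utilization
    headroom_gb = hbm_gb - budget_gb
    return budget_gb, headroom_gb


cases = [
    ("H200 old (0.85)", H200_HBM_GB, 0.85),
    ("H200 new (0.80)", H200_HBM_GB, 0.80),
    ("peer >=180 GB (0.80)", PEER_HBM_GB, 0.80),
]

for label, hbm_gb, utilization in cases:
    budget_gb, headroom_gb = memory_split(hbm_gb, utilization)
    print(f"{label}: budget {budget_gb:.1f} GB, headroom {headroom_gb:.1f} GB")

# Reduction factor in the per-step prefill token budget (32768 -> 8192).
token_ratio = 32768 // 8192
print(f"max-num-batched-tokens reduced by {token_ratio}x")
```

Under these assumptions, the old H200 setting leaves about 21 GB outside the budget, the new setting leaves about 28 GB, and a 180 GB peer card at 0.80 leaves about 36 GB. The batched-token budget drops by 4x, and the comment credits that reduction with getting the workspace to fit.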