Commit 957d020

cquil11 and claude committed
fix: Kimi B300 cpu-offload pool 2200GB + H100 OOM fix
R2 deterministic failures:

- B300 Kimi NVFP4 cpu offload: 9/9 terminated cpu jobs failed with silent VllmWorker death ~4 min into engine init. Same shape as DSv4 D': the 3000 GB offload reservation left zero headroom for vLLM worker RSS on the 3.0 TiB host. Drop TOTAL_CPU_DRAM_GB 3000 -> 2200 (the proven-good value from DSv4 D'').
- H100 Kimi INT4: CUDA OOM in fused_marlin_moe even at conc=1. 78.5 GB out of 79.2 GB used; weights (~44 GB/GPU) + KV reservation at MAX=32K + MoE intermediate workspace can't all fit. Drop MAX_MODEL_LEN to 16K AND --gpu-memory-utilization from 0.95 to 0.85 to leave room for the ~900 MiB-per-call MoE workspace.

R2_K_H100 was cancelled (deterministic OOM, 7/14 already failed). R2_K_B300's offload=none jobs (10) are still in flight and being kept; R3 will re-dispatch just the cpu subset with the lower pool size. R2_K_H200 (mixed pattern) and R2_K_MI355X (no failures yet) left alone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 53df78a commit 957d020

2 files changed

Lines changed: 12 additions & 11 deletions

benchmarks/single_node/agentic/kimik2.5_fp4_b300.sh

Lines changed: 5 additions & 5 deletions
@@ -36,11 +36,11 @@ OFFLOAD_ARGS=""
 case "$OFFLOADING" in
     none) ;;
     cpu)
-        # B300 NV nodes have ~3.0 TiB total host DRAM. Reserve up to 3 TB for
-        # the simple CPU offload connector. If the bench hangs at high conc
-        # this likely needs to come down (workers need headroom for their own
-        # RSS + page cache + slurm cgroup overhead).
-        TOTAL_CPU_DRAM_GB=3000
+        # B300 NV nodes have ~3.0 TiB total host DRAM. 3000 GB left zero
+        # headroom for worker RSS and triggered silent VllmWorker deaths
+        # during connector init (same shape as DSv4 D' earlier). 2200 GB is
+        # the proven-good value from D''.
+        TOTAL_CPU_DRAM_GB=2200
         export VLLM_USE_SIMPLE_KV_OFFLOAD=1
         OFFLOAD_ARGS="--kv_offloading_backend native --kv_offloading_size $TOTAL_CPU_DRAM_GB --disable-hybrid-kv-cache-manager"
         ;;
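
Note on the pool size: the hard-coded 2200 could instead be derived from the host's memory minus an explicit headroom. The sketch below is not part of this commit. `offload_pool_gb` and `host_total_gib` are hypothetical helpers. The 872 figure is simply 3072 - 2200, the headroom implied if the host has exactly 3072 GiB and `--kv_offloading_size` is interpreted in GiB; both are assumptions, not measured values.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): clamp the requested KV offload pool
# so that at least <headroom_gb> of host memory stays free for worker RSS,
# page cache, and cgroup overhead.
# Usage: offload_pool_gb <host_total_gb> <headroom_gb> <requested_gb>
offload_pool_gb() {
    local host_total_gb=$1 headroom_gb=$2 requested_gb=$3
    local ceiling=$(( host_total_gb - headroom_gb ))
    # Never return a negative pool size.
    if (( ceiling < 0 )); then
        ceiling=0
    fi
    if (( requested_gb > ceiling )); then
        echo "$ceiling"
    else
        echo "$requested_gb"
    fi
}

# Hypothetical helper: host memory in whole GiB, read from /proc/meminfo
# (MemTotal is reported in kiB). Linux only.
host_total_gib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Example with assumed numbers: a 3072 GiB host, 872 GiB headroom, and the
# old 3000 request yields the 2200 used in this commit.
offload_pool_gb 3072 872 3000
```

On a real node the first argument could come from `host_total_gib`, so the same script adapts to hosts with less DRAM instead of failing during engine init.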

benchmarks/single_node/agentic/kimik2.5_int4_h100.sh

Lines changed: 7 additions & 6 deletions
@@ -16,12 +16,13 @@ DURATION=${DURATION:-1800}
 MAX_DELAY=${MAX_DELAY:-60}
 ADVANCE_MIN=${ADVANCE_MIN:-0.0}
 ADVANCE_MAX=${ADVANCE_MAX:-0.7}
-# H100 80 GB HBM is too small for Kimi K2.5 INT4's native 1M-token context;
-# vLLM bails at engine init with "No available memory for the cache blocks".
-# Cap at 32K so the engine can reserve KV blocks. Long-context trace turns
-# beyond this cap are truncated by vLLM at request time.
+# H100 80 GB HBM is barely enough for Kimi K2.5 INT4 (~44 GB/GPU weights at
+# TP=8) plus KV reservation plus MoE intermediate buffers. R2 at MAX=32K hit
+# CUDA OOM in fused_marlin_moe even at conc=1. Drop to 16K so KV reservation
+# is half, and pair with --gpu-memory-utilization 0.85 (below) to leave room
+# for the ~900 MiB-per-call MoE workspace.
 if [ -z "${MAX_MODEL_LEN:-}" ] || [ "$MAX_MODEL_LEN" = "0" ]; then
-    MAX_MODEL_LEN=32768
+    MAX_MODEL_LEN=16384
 fi
 
 if [[ -n "${SLURM_JOB_ID:-}" ]]; then
@@ -60,7 +61,7 @@ export VLLM_USE_FLASHINFER_MOE_INT4=1
 vllm serve $MODEL \
     --host 0.0.0.0 \
     --port $PORT \
-    --gpu-memory-utilization 0.95 \
+    --gpu-memory-utilization 0.85 \
     --tensor-parallel-size $TP \
     --max-model-len $MAX_MODEL_LEN \
     --max-num-seqs $CONC \
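
Note on the memory budget: the effect of the utilization change can be checked with rough arithmetic. This is a simplified sketch, not the accounting vLLM performs internally. It assumes `--gpu-memory-utilization` caps vLLM's budget at that fraction of the 79.2 GB reported in the OOM message, and that the ~44 GB/GPU of weights comes out of that budget. `gpu_budget` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): rough per-GPU memory split.
# Usage: gpu_budget <total_gb> <utilization> <weights_gb>
# Prints the vLLM budget, what remains of it after weights (KV cache plus
# activations), and the memory left outside the budget for transient
# allocations such as the MoE workspace.
gpu_budget() {
    awk -v total="$1" -v util="$2" -v weights="$3" 'BEGIN {
        budget = total * util
        printf "budget=%.2f after_weights=%.2f outside=%.2f\n",
               budget, budget - weights, total - budget
    }'
}

gpu_budget 79.2 0.95 44   # budget=75.24 after_weights=31.24 outside=3.96
gpu_budget 79.2 0.85 44   # budget=67.32 after_weights=23.32 outside=11.88
```

Under these assumptions the memory outside the budget grows from about 4 GB to about 12 GB, which is consistent with the stated goal of leaving room for a ~900 MiB-per-call workspace. Halving MAX_MODEL_LEN separately shrinks the per-sequence KV reservation inside the budget.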
