fix: Kimi B300 cpu-offload pool 2200GB + H100 OOM fix

cquil11 · claude · cquil11 · commit 957d020c5778 · 2026-05-14T10:22:29.000-05:00
R2 deterministic failures:

- B300 Kimi NVFP4 cpu offload: 9/9 terminated cpu jobs failed with silent
  VllmWorker death ~4 min into engine init. Same shape as DSv4 D' — the
  3000 GB offload reservation left zero headroom for vLLM worker RSS on
  the 3.0 TiB host. Drop TOTAL_CPU_DRAM_GB 3000 -> 2200 (the proven-good
  value from DSv4 D'').

- H100 Kimi INT4: CUDA OOM in fused_marlin_moe even at conc=1. 78.5 GB
  out of 79.2 GB used; weights (~44 GB/GPU) + KV reservation at MAX=32K
  + MoE intermediate workspace can't all fit. Drop MAX_MODEL_LEN from 32K
  to 16K and --gpu-memory-utilization from 0.95 to 0.85 to leave room for the
  ~900 MiB-per-call MoE workspace.
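
The headroom arithmetic behind both changes can be sketched roughly as
below. The helper names are hypothetical and not part of the benchmark
scripts. The sketch assumes the ~3.0 TiB host is treated as a nominal
3000 GB (as the first bullet does) and that the MoE workspace is allocated
from whatever GPU memory sits outside the --gpu-memory-utilization budget.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of the benchmark scripts.

# Host DRAM (whole GB) left for worker RSS, page cache, and cgroup overhead
# once the offload pool is reserved. Assumes both values use the same unit.
host_headroom_gb() {
    local host_gb=$1 pool_gb=$2
    echo $(( host_gb - pool_gb ))
}

# GPU memory (GB) that stays outside the --gpu-memory-utilization budget.
gpu_headroom_gb() {
    local total_gb=$1 util=$2
    awk -v t="$total_gb" -v u="$util" 'BEGIN { printf "%.2f\n", t * (1 - u) }'
}

host_headroom_gb 3000 3000   # old pool: prints 0
host_headroom_gb 3000 2200   # new pool: prints 800
gpu_headroom_gb 79.2 0.95    # old setting: prints 3.96
gpu_headroom_gb 79.2 0.85    # new setting: prints 11.88
```

Under these assumptions the new settings leave about 800 GB of host DRAM
and about 11.9 GB per GPU unreserved, versus 0 GB and about 4 GB before.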

R2_K_H100 was cancelled (deterministic OOM, 7/14 already failed).
R2_K_B300's offload=none jobs (10) are still in flight and being kept;
R3 will re-dispatch just the cpu subset with the lower pool size.
R2_K_H200 (mixed pattern) and R2_K_MI355X (no failures yet) left alone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/benchmarks/single_node/agentic/kimik2.5_fp4_b300.sh b/benchmarks/single_node/agentic/kimik2.5_fp4_b300.sh
@@ -36,11 +36,11 @@ OFFLOAD_ARGS=""
 case "$OFFLOADING" in
     none) ;;
     cpu)
-        # B300 NV nodes have ~3.0 TiB total host DRAM. Reserve up to 3 TB for
-        # the simple CPU offload connector. If the bench hangs at high conc
-        # this likely needs to come down (workers need headroom for their own
-        # RSS + page cache + slurm cgroup overhead).
-        TOTAL_CPU_DRAM_GB=3000
+        # B300 NV nodes have ~3.0 TiB total host DRAM. 3000 GB left zero
+        # headroom for worker RSS and triggered silent VllmWorker deaths
+        # during connector init (same shape as DSv4 D' earlier). 2200 GB is
+        # the proven-good value from D''.
+        TOTAL_CPU_DRAM_GB=2200
         export VLLM_USE_SIMPLE_KV_OFFLOAD=1
         OFFLOAD_ARGS="--kv_offloading_backend native --kv_offloading_size $TOTAL_CPU_DRAM_GB --disable-hybrid-kv-cache-manager"
         ;;
diff --git a/benchmarks/single_node/agentic/kimik2.5_int4_h100.sh b/benchmarks/single_node/agentic/kimik2.5_int4_h100.sh
@@ -16,12 +16,13 @@ DURATION=${DURATION:-1800}
 MAX_DELAY=${MAX_DELAY:-60}
 ADVANCE_MIN=${ADVANCE_MIN:-0.0}
 ADVANCE_MAX=${ADVANCE_MAX:-0.7}
-# H100 80 GB HBM is too small for Kimi K2.5 INT4's native 1M-token context;
-# vLLM bails at engine init with "No available memory for the cache blocks".
-# Cap at 32K so the engine can reserve KV blocks. Long-context trace turns
-# beyond this cap are truncated by vLLM at request time.
+# H100 80 GB HBM is barely enough for Kimi K2.5 INT4 (~44 GB/GPU weights at
+# TP=8) plus KV reservation plus MoE intermediate buffers. R2 at MAX=32K hit
+# CUDA OOM in fused_marlin_moe even at conc=1. Drop to 16K so KV reservation
+# is halved, and pair with --gpu-memory-utilization 0.85 (below) to leave room
+# for the ~900 MiB-per-call MoE workspace.
 if [ -z "${MAX_MODEL_LEN:-}" ] || [ "$MAX_MODEL_LEN" = "0" ]; then
-    MAX_MODEL_LEN=32768
+    MAX_MODEL_LEN=16384
 fi
 
 if [[ -n "${SLURM_JOB_ID:-}" ]]; then
@@ -60,7 +61,7 @@ export VLLM_USE_FLASHINFER_MOE_INT4=1
 vllm serve $MODEL \
 --host 0.0.0.0 \
 --port $PORT \
---gpu-memory-utilization 0.95 \
+--gpu-memory-utilization 0.85 \
 --tensor-parallel-size $TP \
 --max-model-len $MAX_MODEL_LEN \
 --max-num-seqs $CONC \