test: B300 Kimi - retry cpu sweep at 2500 GB pool

cquil11 · claude · cquil11 · commit 1e9669baba55 · 2026-05-14T10:42:38.000-05:00
R2_K_B300 showed 10/10 cpu-offload jobs failing (silent VllmWorker
death ~4 min into connector init) at TOTAL_CPU_DRAM_GB=3000. Root
cause: the B300 Slurm cgroup caps each job at AllocMem ~2.82 TiB, and
the 3 TB offload pool + vLLM worker RSS exceeded that cap.

Drop TOTAL_CPU_DRAM_GB 3000 -> 2500 (leaves ~465 GB cgroup headroom
for worker RSS + page cache; matches what worked on MI355X's similar
3 TiB host).
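
The headroom arithmetic can be sketched as below. This is only an
illustration: `headroom_gb` is a hypothetical helper, not part of the
benchmark scripts, and the ~2965 GB cap is the per-job cgroup limit
quoted in the script comment, not a value read from the node.

```shell
# Hypothetical helper (not in the repo): GB left inside the job cgroup
# after the KV offload pool is reserved.
# Usage: headroom_gb <cgroup_cap_gb> <pool_gb>
headroom_gb() {
    echo $(( $1 - $2 ))
}

# Assumed cap of ~2965 GB per job, taken from the script comment.
headroom_gb 2965 3000   # R2 pool: negative, the pool alone exceeds the cap
headroom_gb 2965 2500   # R3 pool: 465 GB left for worker RSS + page cache
```

A negative result means the pool cannot fit even before worker RSS is
counted, which is consistent with the R2 failures.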

Config temporarily trimmed to cpu-only for this retry; R2's `none`
jobs were running cleanly at the same time and don't need re-running.
Restore both none + cpu after cpu is validated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
diff --git a/.github/configs/nvidia-master.yaml b/.github/configs/nvidia-master.yaml
@@ -2638,10 +2638,12 @@ kimik2.5-fp4-b300-vllm:
       - { tp: 4, ep: 1, conc-start: 4, conc-end: 64 }
     # B300 has 288 GB HBM per GPU (vs B200's 192 GB) so the KV-cache cliff
     # sits higher than B200's. Extend the conc sweep to probe up to 64.
+    # NOTE: temporarily cpu-only for the R3 retry (in R2 the `none` jobs ran
+    # cleanly but `cpu` failed at the 3 TB pool; this run retries cpu at the
+    # corrected 2.5 TB pool size). Restore both lines once cpu is validated.
     agentic-coding:
     - duration: 1800
       search-space:
-      - { tp: 8, ep: 1, offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 40, 48, 56, 64] }
       - { tp: 8, ep: 1, offloading: cpu,  conc-list: [1, 2, 4, 8, 16, 32, 40, 48, 56, 64] }
 
 dsr1-fp8-b200-sglang-mtp:
diff --git a/benchmarks/single_node/agentic/kimik2.5_fp4_b300.sh b/benchmarks/single_node/agentic/kimik2.5_fp4_b300.sh
@@ -36,11 +36,12 @@ OFFLOAD_ARGS=""
 case "$OFFLOADING" in
     none) ;;
     cpu)
-        # B300 NV nodes have ~3.0 TiB total host DRAM. 3000 GB left zero
-        # headroom for worker RSS and triggered silent VllmWorker deaths
-        # during connector init (same shape as DSv4 D' earlier). 2200 GB is
-        # the proven-good value from D''.
-        TOTAL_CPU_DRAM_GB=2200
+        # B300 NV nodes: RealMemory ~3.02 TiB; Slurm AllocMem cgroup caps each
+        # job at ~2.82 TiB (2965 GB). At 3 TB the offload pool + worker RSS
+        # exceeded the cgroup and workers OOM-killed silently (R2 saw 10/10
+        # cpu jobs die ~4 min into init). 2.5 TB leaves ~465 GB headroom
+        # inside the cgroup for vLLM worker RSS + page cache.
+        TOTAL_CPU_DRAM_GB=2500
         export VLLM_USE_SIMPLE_KV_OFFLOAD=1
         OFFLOAD_ARGS="--kv_offloading_backend native --kv_offloading_size $TOTAL_CPU_DRAM_GB --disable-hybrid-kv-cache-manager"
         ;;