Skip to content

Commit 1e9669b

Browse files
cquil11claude
andcommitted
test: B300 Kimi - retry cpu sweep at 2500 GB pool
R2_K_B300 showed 10/10 cpu-offload jobs failing (silent VllmWorker death at ~4 min into connector init) at TOTAL_CPU_DRAM_GB=3000. Root cause: B300 slurm cgroup caps each job at AllocMem ~2.82 TiB and the 3 TB offload pool + vLLM worker RSS exceeded that. Drop TOTAL_CPU_DRAM_GB 3000 -> 2500 (leaves ~465 GB cgroup headroom for worker RSS + page cache; matches what worked on MI355X's similar 3 TiB host). Config temporarily trimmed to cpu-only for this retry; R2's `none` jobs were running cleanly at the same time and don't need re-running. Restore both none + cpu after cpu is validated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent c79bf86 commit 1e9669b

2 files changed

Lines changed: 9 additions & 6 deletions

File tree

.github/configs/nvidia-master.yaml

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -2638,10 +2638,12 @@ kimik2.5-fp4-b300-vllm:
26382638
- { tp: 4, ep: 1, conc-start: 4, conc-end: 64 }
26392639
# B300 has 288 GB HBM per GPU (vs B200's 192 GB) so the KV-cache cliff
26402640
# sits higher than B200's. Extend the conc sweep to probe up to 64.
2641+
# NOTE: temporarily cpu-only for the R3 retry (R2 had `none` jobs running
2642+
# cleanly at 3 TB offload but `cpu` failed; this run retries cpu at the
2643+
# corrected 2.5 TB pool size). Restore both lines once cpu is validated.
26412644
agentic-coding:
26422645
- duration: 1800
26432646
search-space:
2644-
- { tp: 8, ep: 1, offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 40, 48, 56, 64] }
26452647
- { tp: 8, ep: 1, offloading: cpu, conc-list: [1, 2, 4, 8, 16, 32, 40, 48, 56, 64] }
26462648

26472649
dsr1-fp8-b200-sglang-mtp:

benchmarks/single_node/agentic/kimik2.5_fp4_b300.sh

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -36,11 +36,12 @@ OFFLOAD_ARGS=""
3636
case "$OFFLOADING" in
3737
none) ;;
3838
cpu)
39-
# B300 NV nodes have ~3.0 TiB total host DRAM. 3000 GB left zero
40-
# headroom for worker RSS and triggered silent VllmWorker deaths
41-
# during connector init (same shape as DSv4 D' earlier). 2200 GB is
42-
# the proven-good value from D''.
43-
TOTAL_CPU_DRAM_GB=2200
39+
# B300 NV nodes: RealMemory ~3.02 TiB; Slurm AllocMem cgroup caps each
40+
# job at ~2.82 TiB (2965 GB). At 3 TB the offload pool + worker RSS
41+
# exceeded the cgroup and workers OOM-killed silently (R2 saw 10/10
42+
# cpu jobs die ~4 min into init). 2.5 TB leaves ~465 GB headroom
43+
# inside the cgroup for vLLM worker RSS + page cache.
44+
TOTAL_CPU_DRAM_GB=2500
4445
export VLLM_USE_SIMPLE_KV_OFFLOAD=1
4546
OFFLOAD_ARGS="--kv_offloading_backend native --kv_offloading_size $TOTAL_CPU_DRAM_GB --disable-hybrid-kv-cache-manager"
4647
;;

0 commit comments

Comments
 (0)