Commit 957d020
fix: Kimi B300 cpu-offload pool 2200GB + H100 OOM fix
R2 deterministic failures:
- B300 Kimi NVFP4 cpu offload: 9/9 terminated cpu jobs failed with silent
VllmWorker death ~4 min into engine init. Same shape as DSv4 D' — the
3000 GB offload reservation left zero headroom for vLLM worker RSS on
the 3.0 TiB host. Drop TOTAL_CPU_DRAM_GB 3000 -> 2200 (the proven-good
value from DSv4 D'').
- H100 Kimi INT4: CUDA OOM in fused_marlin_moe even at conc=1, with 78.5 GB
of 79.2 GB in use; weights (~44 GB/GPU) + KV reservation at
MAX_MODEL_LEN=32K + MoE intermediate workspace can't all fit. Drop
MAX_MODEL_LEN from 32K to 16K AND --gpu-memory-utilization from 0.95 to
0.85 to leave room for the ~900 MiB-per-call MoE workspace.
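
The headroom arithmetic behind both fixes can be sketched roughly as below. This is a back-of-the-envelope illustration, not the benchmark harness's own code: the helper names are hypothetical, the GB figures are treated as sharing one unit (with 1 GB taken as 1024 MiB), the host is assumed to be about 3000 in the same units as TOTAL_CPU_DRAM_GB, and the GPU model simply treats everything outside --gpu-memory-utilization as the space available to late allocations such as the MoE workspace.

```python
# Rough headroom arithmetic; all helper names are hypothetical.
# Assumption: every "GB" figure shares one unit, with 1 GB == 1024 MiB.

MIB_PER_GB = 1024


def host_headroom_gb(host_gb: float, offload_pool_gb: float) -> float:
    """Host DRAM left for worker RSS after the CPU offload pool is reserved."""
    return host_gb - offload_pool_gb


def gpu_headroom_gb(total_gb: float, utilization: float) -> float:
    """GPU memory outside the fraction handed to the engine (simplified)."""
    return total_gb * (1.0 - utilization)


def allocation_fits(total_gb: float, used_gb: float, request_mib: float) -> bool:
    """True if a request of request_mib fits in the currently free memory."""
    free_mib = (total_gb - used_gb) * MIB_PER_GB
    return request_mib <= free_mib


if __name__ == "__main__":
    # B300 host: assumed ~3000 GB total, pool lowered from 3000 to 2200.
    for pool in (3000, 2200):
        print(f"pool={pool} GB -> host headroom {host_headroom_gb(3000, pool):.0f} GB")

    # H100: 79.2 GB total, utilization lowered from 0.95 to 0.85.
    for util in (0.95, 0.85):
        print(f"util={util} -> GPU headroom {gpu_headroom_gb(79.2, util):.2f} GB")

    # At the point of failure: 78.5 of 79.2 GB in use, ~900 MiB requested.
    print("900 MiB workspace fits:", allocation_fits(79.2, 78.5, 900))
```

Under these assumptions the 3000 GB pool leaves nothing for worker RSS, and the roughly 0.7 GB free on the H100 at the time of the OOM is smaller than a single ~900 MiB workspace request.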
R2_K_H100 was cancelled (deterministic OOM, 7/14 already failed).
R2_K_B300's offload=none jobs (10) are still in flight and being kept;
R3 will re-dispatch just the cpu subset with the lower pool size.
R2_K_H200 (mixed pattern) and R2_K_MI355X (no failures yet) left alone.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 53df78a commit 957d020
2 files changed
Lines changed: 12 additions & 11 deletions
Hunk summary (file names and changed-line text not captured):
- File 1: original lines 39-43 replaced by new lines 39-43 (5 additions, 5 deletions).
- File 2: original lines 19-22 replaced by new lines 19-23; original line 24 replaced by new line 25; original line 63 replaced by new line 64 (7 additions, 6 deletions).