Commit 1e9669b
test: B300 Kimi - retry cpu sweep at 2500 GB pool
R2_K_B300 showed 10/10 cpu-offload jobs failing at
TOTAL_CPU_DRAM_GB=3000: the VllmWorker died silently ~4 min into
connector init. Root cause: the B300 Slurm cgroup caps each job at
AllocMem (~2.82 TiB), and the 3 TB offload pool plus vLLM worker RSS
exceeded that cap.
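
One way to confirm this kind of silent death is to look at the job cgroup's OOM counters. The commit does not say how the root cause was established, so the following is only a sketch, assuming the node uses cgroup v2 (where `memory.events` carries an `oom_kill` counter). The helper names and the sample text are hypothetical.

```python
def parse_memory_events(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 memory.events file."""
    events: dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def was_oom_killed(events: dict[str, int]) -> bool:
    # oom_kill counts processes in this cgroup killed by the OOM killer.
    return events.get("oom_kill", 0) > 0


# Made-up sample contents, not taken from the failing jobs.
sample = "low 0\nhigh 0\nmax 12\noom 1\noom_kill 1\n"
print(was_oom_killed(parse_memory_events(sample)))
```

In practice the text would come from the `memory.events` file under the job's cgroup directory; that path depends on the Slurm cgroup layout and is not given in this commit.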
Drop TOTAL_CPU_DRAM_GB 3000 -> 2500 (leaves ~465 GB cgroup headroom
for worker RSS + page cache; matches what worked on MI355X's similar
3 TiB host).
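
The sizing argument can be sketched as a pre-flight check. This is an illustration under stated assumptions, not the check used in the repo: the commit does not say whether TOTAL_CPU_DRAM_GB is decimal GB or GiB, so the unit is a parameter, and the 300 GB worker reserve below is a hypothetical figure. The exact headroom depends on the node's precise AllocMem value and on that unit choice.

```python
GB = 10**9    # decimal gigabyte
GIB = 2**30   # binary gibibyte
TIB = 2**40   # binary tebibyte


def headroom_bytes(limit_bytes: int, pool_size: int, unit_bytes: int = GB) -> int:
    """Bytes left under the cgroup limit after reserving the offload pool."""
    return limit_bytes - pool_size * unit_bytes


def pool_fits(limit_bytes: int, pool_size: int, reserve_bytes: int,
              unit_bytes: int = GB) -> bool:
    """True if the pool plus a reserve for worker RSS / page cache fits."""
    return headroom_bytes(limit_bytes, pool_size, unit_bytes) >= reserve_bytes


# ~2.82 TiB cap from the commit message; the reserve is an assumed value.
limit = int(2.82 * TIB)
reserve = 300 * GB

for pool in (3000, 2500):
    print(pool, pool_fits(limit, pool, reserve))
```

Passing `unit_bytes=GIB` shows how the margin shrinks if the pool size is interpreted in binary units.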
Config temporarily trimmed to cpu-only for this retry; R2's `none`
jobs were running cleanly at the same time and don't need re-running.
Restore both none + cpu after cpu is validated.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

1 parent c79bf86 commit 1e9669b
2 files changed
Lines changed: 9 additions & 6 deletions
File 1 (name and diff content not preserved): hunk at original lines 2638-2647; 3 additions (new lines 2641-2643), 1 deletion (original line 2644).
File 2 (name and diff content not preserved): hunk at original lines 36-46; 5 deletions (original lines 39-43), 6 additions (new lines 39-44).