Commit 4da367c
runners(mi300x): pin salloc to known-good nodes (#1462)
Three of the nine mi300x compute nodes are currently unusable:
- chi-mi300x-033, chi-mi300x-037: down (Not responding)
- chi-mi300x-049: drained for persistent /nvme_home disk-full
(kept down by a watchdog re-applying State=DOWN every 10s)
Without a nodelist filter, salloc sometimes lands a job on a node
that's about to be drained or that has a half-extracted enroot dir,
causing 'pyxis: failed to create container filesystem (No space left
on device)' / 'srun: Node failure' / 'manifest unknown'-style errors
visible in PRs #1426 and #1403.
Add an explicit --nodelist of the 6 healthy nodes (mirroring how
runners/launch_b300-nv.sh:336 pins to the known-good B300 set).
1 parent 171d0c4 commit 4da367c
1 file changed
Lines changed: 4 additions & 1 deletion
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 9 | 9 | |
| 10 | 10 | |
| 11 | 11 | |
| 12 | | - |
| | 12 | + |
| | 13 | + |
| | 14 | + |
| | 15 | + |
| 13 | 16 | |
| 14 | 17 | |
| 15 | 18 | |
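
The pinning described in the commit message could look roughly like the sketch below. It is not the actual diff. The three bad node names come from the commit message; the six `healthy-node-*` names, the `ALL_NODES` inventory, and the `healthy_nodelist` helper are hypothetical placeholders, since the message does not name the healthy nodes. Other `salloc` arguments are omitted, and the command is only printed, not run.

```shell
#!/usr/bin/env bash
# Sketch only: builds a comma-separated --nodelist value from a node
# inventory minus the nodes reported as unusable.

# Nodes the commit message lists as down or drained.
BAD_NODES=(chi-mi300x-033 chi-mi300x-037 chi-mi300x-049)

# Hypothetical inventory of all nine nodes; the six healthy names are
# placeholders, not real hostnames.
ALL_NODES=(
  chi-mi300x-033 chi-mi300x-037 chi-mi300x-049
  healthy-node-1 healthy-node-2 healthy-node-3
  healthy-node-4 healthy-node-5 healthy-node-6
)

# Print the inventory with the bad nodes filtered out, joined by commas.
healthy_nodelist() {
  local node bad skip
  local out=()
  for node in "${ALL_NODES[@]}"; do
    skip=0
    for bad in "${BAD_NODES[@]}"; do
      if [[ "$node" == "$bad" ]]; then
        skip=1
      fi
    done
    if [[ "$skip" -eq 0 ]]; then
      out+=("$node")
    fi
  done
  local IFS=,
  echo "${out[*]}"
}

# Show the salloc invocation instead of running it.
printf 'salloc --nodelist=%s\n' "$(healthy_nodelist)"
```

Slurm documents `--nodelist` as requesting that the job contain all listed hosts, so for single-node jobs `--exclude` with the three bad nodes may be the closer fit when the intent is only to avoid them.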