Commit 30add15
runners(mi325x): exclude broken enroot node chi-mi325x-pod1-121 (#1477)
Root-caused via the failed sweeps on #1467, #1468, #1469 (all three
[Klaud Cold] vLLM v0.21 bumps on different mi325x recipes): every
failure landed on chi-mi325x-pod1-121 with

    enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted

before the .sqsh import even completes; subsequent pyxis mount then
fails with "No such file or directory". The same image works cleanly
on every other up node (017/018/019/020/027) — confirmed not OOM and
not a recipe issue.
This matches the existing pattern for mi300x in #1462 (pin salloc away
from chronically-bad nodes); for mi325x there's currently only the one
node to exclude, so use --exclude rather than --nodelist so we don't
have to maintain the allow-list as nodes come and go.
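The deny-list approach described above can be sketched as follows. This is a minimal illustration, not the actual change in this commit: the `EXCLUDE_NODES` variable, the `build_salloc_cmd` helper, the `mi325x` partition name, and the `--nodes=1` setting are all assumptions made for the example. Only `salloc --exclude` is taken from the message. The helper prints the command instead of running it, so it can be tried on a machine without Slurm.

```shell
# Hypothetical sketch: variable and function names are illustrative only.

# Deny-list of known-bad nodes (comma-separated, as salloc expects).
# Healthy nodes that join or leave the pool need no change here.
EXCLUDE_NODES="chi-mi325x-pod1-121"

build_salloc_cmd() {
    # $1: partition name (assumed; use whatever the runner already passes)
    cmd="salloc --partition=$1 --nodes=1"
    # Only add --exclude when the deny-list is non-empty.
    if [ -n "$EXCLUDE_NODES" ]; then
        cmd="$cmd --exclude=$EXCLUDE_NODES"
    fi
    printf '%s\n' "$cmd"
}

build_salloc_cmd mi325x
```

With `--nodelist` the script would instead have to name every good node (017, 018, ...) and be edited whenever the pool changes, which is the maintenance cost the message is avoiding.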
pod1-121 has separately been drained on the controller with a watchdog
(per KLAUD_DEBUG.md §5.6) so it stays out of the pool until ops fix
the underlying setcap regression.
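For reference, draining a node on a Slurm controller is typically done with `scontrol update`. The sketch below only prints such a command. The `build_drain_cmd` helper and the reason string are assumptions for illustration, and the watchdog from KLAUD_DEBUG.md §5.6 is not reproduced here.

```shell
# Hypothetical helper: prints the drain command rather than executing it.
build_drain_cmd() {
    # $1: node name, $2: free-text reason (visible later via `sinfo -R`)
    printf 'scontrol update NodeName=%s State=DRAIN Reason="%s"\n' "$1" "$2"
}

build_drain_cmd chi-mi325x-pod1-121 "enroot setcap regression"
```

The drain keeps the scheduler from placing any job on the node, while the `--exclude` in the runner covers this recipe even if the node is resumed before the setcap issue is fixed.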
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 891a72c commit 30add15
1 file changed
Lines changed: 5 additions & 1 deletion
Diff hunk @@ -9,7 +9,11 @@ (line contents not captured): original line 12 removed, new lines 12-16 added.