Commit 30add15

runners(mi325x): exclude broken enroot node chi-mi325x-pod1-121 (#1477)
Root-caused via the failed sweeps on #1467, #1468, #1469 (all three [Klaud Cold] vLLM v0.21 bumps on different mi325x recipes): every failure landed on chi-mi325x-pod1-121 with "enroot-aufs2ovlfs: failed to set capabilities: Operation not permitted" before the .sqsh import even completes; the subsequent pyxis mount then fails with "No such file or directory". The same image works cleanly on every other up node (017/018/019/020/027); confirmed not OOM and not a recipe issue.

This matches the existing pattern for mi300x in #1462 (pin salloc away from chronically-bad nodes); for mi325x there's currently only the one node to exclude, so use --exclude rather than --nodelist so we don't have to maintain the allow-list as nodes come and go.

pod1-121 has separately been drained on the controller with a watchdog (per KLAUD_DEBUG.md §5.6) so it stays out of the pool until ops fix the underlying setcap regression.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
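The deny-list approach described above can be sketched so that further bad nodes are a one-line addition. This is only an illustration, not part of the commit: `EXCLUDED_NODES` and `build_exclude_flag` are hypothetical names, and the committed script hard-codes the single hostname directly on the salloc command line.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: keep known-bad nodes in one array and derive the
# salloc flag from it. EXCLUDED_NODES and build_exclude_flag are illustrative
# names and do not appear in runners/launch_mi325x-amds.sh.
EXCLUDED_NODES=(
    "chi-mi325x-pod1-121.ord.vultr.cpe.ice.amd.com"  # enroot setcap failure
)

build_exclude_flag() {
    # Print "--exclude=a,b,c", or print nothing when no nodes are given.
    if [ "$#" -eq 0 ]; then
        return 0
    fi
    local IFS=,          # "$*" joins the arguments with the first char of IFS
    printf -- '--exclude=%s\n' "$*"
}

EXCLUDE_FLAG=$(build_exclude_flag "${EXCLUDED_NODES[@]}")
echo "$EXCLUDE_FLAG"
```

Under these assumptions, the flag would be passed unquoted (`salloc --partition=$PARTITION $EXCLUDE_FLAG ...`) so that an empty list expands to no argument at all.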
1 parent 891a72c commit 30add15

1 file changed

Lines changed: 5 additions & 1 deletion

runners/launch_mi325x-amds.sh

@@ -9,7 +9,11 @@ LOCK_FILE="${SQUASH_FILE}.lock"
 
 set -x
 
-JOB_ID=$(salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=256 --time=480 --no-shell --job-name="$RUNNER_NAME" 2>&1 | tee /dev/stderr | grep -oP 'Granted job allocation \K[0-9]+')
+# Exclude known-broken mi325x nodes:
+# chi-mi325x-pod1-121: enroot-aufs2ovlfs setcap fails on this node's NFS-backed
+# squash dir; container image import never completes
+# (root-caused via #1467/#1468/#1469 sweep failures).
+JOB_ID=$(salloc --partition=$PARTITION --exclude=chi-mi325x-pod1-121.ord.vultr.cpe.ice.amd.com --gres=gpu:$TP --cpus-per-task=256 --time=480 --no-shell --job-name="$RUNNER_NAME" 2>&1 | tee /dev/stderr | grep -oP 'Granted job allocation \K[0-9]+')
 
 if [ -z "$JOB_ID" ]; then
     echo "ERROR: salloc failed to allocate a job"
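The job-ID extraction in the salloc line above relies on `grep -oP`, which needs a grep built with PCRE support. As a hedged alternative sketch, the same "Granted job allocation <id>" match can be done with bash's built-in regex. `parse_job_id` is a hypothetical helper and the sample text is illustrative stand-in output, not a captured log.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the committed script): read lines on stdin and
# print the numeric ID from the first "Granted job allocation <id>" line.
parse_job_id() {
    local line
    while IFS= read -r line; do
        if [[ "$line" =~ Granted\ job\ allocation\ ([0-9]+) ]]; then
            printf '%s\n' "${BASH_REMATCH[1]}"
            return 0
        fi
    done
    return 1   # no allocation line seen
}

# Illustrative sample standing in for real salloc output.
sample_output=$'salloc: Pending job allocation 4242\nsalloc: Granted job allocation 4242'
JOB_ID=$(printf '%s\n' "$sample_output" | parse_job_id)
echo "$JOB_ID"
```

When no matching line is present the helper prints nothing and returns non-zero, so the script's existing `[ -z "$JOB_ID" ]` check would still catch a failed allocation.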
