Commit 4da367c

runners(mi300x): pin salloc to known-good nodes (#1462)
Three of the nine mi300x compute nodes are currently unusable:
- chi-mi300x-033, chi-mi300x-037: down (Not responding)
- chi-mi300x-049: drained for persistent /nvme_home disk-full (kept down by a watchdog re-applying State=DOWN every 10s)

Without a nodelist filter, salloc sometimes lands a job on a node that's about to be drained or that has a half-extracted enroot dir, causing 'pyxis: failed to create container filesystem (No space left on device)' / 'srun: Node failure' / 'manifest unknown'-style errors visible in PRs #1426 and #1403.

Add an explicit --nodelist of the 6 healthy nodes (mirroring how runners/launch_b300-nv.sh:336 pins to the known-good B300 set).
1 parent 171d0c4 commit 4da367c

1 file changed: 4 additions & 1 deletion

runners/launch_mi300x-amds.sh
@@ -9,7 +9,10 @@ LOCK_FILE="${SQUASH_FILE}.lock"
 
 set -x
 
-JOB_ID=$(salloc --partition=$PARTITION --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell --job-name="$RUNNER_NAME" 2>&1 | tee /dev/stderr | grep -oP 'Granted job allocation \K[0-9]+')
+# Pin to the known-good mi300x nodes; others are unavailable:
+#   chi-mi300x-033, chi-mi300x-037: down (Not responding)
+#   chi-mi300x-049: drained (persistent /nvme_home disk-full)
+JOB_ID=$(salloc --partition=$PARTITION --nodelist=chi-mi300x-[034-036,054,057-058].ord.vultr.cpe.ice.amd.com --gres=gpu:$TP --cpus-per-task=256 --time=180 --no-shell --job-name="$RUNNER_NAME" 2>&1 | tee /dev/stderr | grep -oP 'Granted job allocation \K[0-9]+')
 
 if [ -z "$JOB_ID" ]; then
     echo "ERROR: salloc failed to allocate a job"
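
The --nodelist value above is a bracketed hostlist expression, so it may help to see which hosts it names. The following is a minimal sketch, not part of this commit: expand_nodelist is a hypothetical helper that assumes exactly one [...] group containing zero-padded numbers and ranges, and it only does string expansion without contacting Slurm. Under those assumptions it should print the six hosts 034, 035, 036, 054, 057 and 058, matching the "6 healthy nodes" in the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): expand a hostlist of the form
# prefix[a-b,c,...]suffix into one hostname per line.
# Assumes exactly one bracket group with zero-padded decimal numbers.
expand_nodelist() {
  local expr="$1"
  local prefix="${expr%%\[*}"   # text before '['
  local rest="${expr#*\[}"      # text after '['
  local ranges="${rest%%\]*}"   # comma-separated ranges inside the brackets
  local suffix="${rest#*\]}"    # text after ']'
  local part lo hi width i
  local IFS=','                 # split the ranges on commas
  for part in $ranges; do
    lo="${part%-*}"             # a single number yields lo == hi
    hi="${part#*-}"
    width="${#lo}"              # preserve zero padding, e.g. 034
    # 10# forces base 10 so that leading zeros are not read as octal
    for ((i = 10#$lo; i <= 10#$hi; i++)); do
      printf '%s%0*d%s\n' "$prefix" "$width" "$i" "$suffix"
    done
  done
}

expand_nodelist 'chi-mi300x-[034-036,054,057-058].ord.vultr.cpe.ice.amd.com'
```

On a host with Slurm installed, `scontrol show hostnames '<expression>'` performs the same expansion and is the more reliable check, since it follows Slurm's own hostlist rules.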
