Commit e7a4f94

arygupt and claude committed
fix(launcher): tolerate transient squeue timeouts in mi355x job poll
A full sweep floods slurmctld, so `squeue` intermittently returns "slurm_load_jobs error: Socket timed out". The old liveness check (`! squeue ... | grep -q $JOB_ID`) treated that empty or failed output as "job died" and exited with status 1, a false failure on a healthy job (observed on dsr1-fp8-mi355x-sglang-disagg conc 1024x2048).

Add job_alive(): a non-zero squeue exit is treated as "still alive", so a scheduler blip cannot cause a false failure. Only a SUCCESSFUL squeue that omits the job, re-checked once to avoid a single-sample race, counts as gone. Used by both the wait-for-log loop and the completion poll.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
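The failure mode described in the commit message can be reproduced without a cluster. The following is a minimal sketch, not part of the commit: the `squeue` function is a hypothetical stub that imitates a controller timeout, `JOB_ID=4242` is a made-up value, and `2>/dev/null` is added to the old check only to keep the demo output quiet.

```shell
# Hypothetical stub standing in for the real squeue during a controller timeout:
# it prints the error to stderr, writes nothing to stdout, and exits non-zero.
squeue() {
    echo "slurm_load_jobs error: Socket timed out" >&2
    return 1
}

JOB_ID=4242   # made-up job ID, for illustration only

# Old liveness check, as in the removed lines. The pipeline's status is grep's
# status, and grep finds nothing in the empty output, so the negated condition
# is true and the job is declared dead even though squeue never answered.
if ! squeue -u "$USER" --noheader --format='%i' 2>/dev/null | grep -q "$JOB_ID"; then
    OLD_VERDICT="dead"
else
    OLD_VERDICT="alive"
fi

# Checking squeue's own exit status first separates "the scheduler did not
# answer" from "the scheduler answered and the job is absent".
if out=$(squeue -u "$USER" --noheader --format='%i' 2>/dev/null); then
    NEW_VERDICT="answered"
else
    NEW_VERDICT="no answer, assume alive"
fi

echo "old check: $OLD_VERDICT"
echo "new check: $NEW_VERDICT"
```

With the stub in place, the old check reports the job as dead while the exit-status check reports that the scheduler gave no answer, which is the distinction job_alive() relies on.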
1 parent 475ce8a commit e7a4f94

1 file changed

Lines changed: 22 additions & 5 deletions

runners/launch_mi355x-amds.sh

@@ -90,21 +90,38 @@ if [[ "$IS_MULTINODE" == "true" ]]; then
     # Give slurm time to start the job and create log file
     sleep 10
 
+    # Whether $JOB_ID is still in the SLURM queue, resilient to transient
+    # slurmctld timeouts ("slurm_load_jobs error: Socket timed out") — common
+    # when a full sweep floods the controller. A FAILED squeue (non-zero exit)
+    # is treated as "still alive" so a scheduler blip can't be misread as job
+    # death; only a SUCCESSFUL squeue that omits the job means it's gone, and we
+    # re-check once before declaring it gone to avoid a single-sample race.
+    job_alive() {
+        local out rc
+        out=$(squeue -u "$USER" --noheader --format='%i' 2>/dev/null); rc=$?
+        [[ $rc -ne 0 ]] && return 0   # scheduler hiccup → assume alive
+        grep -qw "$JOB_ID" <<<"$out" && return 0
+        sleep 5
+        out=$(squeue -u "$USER" --noheader --format='%i' 2>/dev/null) || return 0
+        grep -qw "$JOB_ID" <<<"$out"
+    }
+
     # Wait for log file to appear (also check job is still alive)
     while ! ls "$LOG_FILE" &>/dev/null; do
-        if ! squeue -u "$USER" --noheader --format='%i' | grep -q "$JOB_ID"; then
-            echo "ERROR: Job $JOB_ID failed before creating log file"
-            scontrol show job "$JOB_ID"
+        if ! job_alive; then
+            echo "ERROR: Job $JOB_ID is no longer in the queue and never created a log file"
+            scontrol show job "$JOB_ID" 2>/dev/null || true
             exit 1
         fi
         sleep 5
     done
 
     set +x
 
-    # Poll for job completion in background
+    # Poll for job completion in background (tolerant of transient squeue
+    # timeouts via job_alive — a scheduler blip must not look like completion).
     (
-        while squeue -u $USER --noheader --format='%i' | grep -q "$JOB_ID"; do
+        while job_alive; do
             sleep 10
         done
     ) &
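The decision table that job_alive() implements can be exercised offline. This is a sketch under stated assumptions rather than the commit's own test: `squeue` and `sleep` are replaced by hypothetical stubs, `SQUEUE_MODE` and `JOB_ID=4242` are illustration-only names and values, and the function body is restated with a pipe and `[ ]` in place of the here-string and `[[ ]]` so it also runs under plain sh.

```shell
# Hypothetical stubs; SQUEUE_MODE is an illustration-only variable.
squeue() {
    case "$SQUEUE_MODE" in
        timeout) echo "slurm_load_jobs error: Socket timed out" >&2; return 1 ;;
        listed)  printf '%s\n' 1001 "$JOB_ID" 1002 ;;   # job is in the queue
        absent)  printf '%s\n' 1001 1002 ;;             # job is not in the queue
        prefix)  printf '%s\n' "${JOB_ID}9" ;;          # longer ID containing JOB_ID
    esac
}
sleep() { :; }   # skip the real 5-second re-check delay in this demo

# Restatement of job_alive from the diff ("local" omitted for portability).
job_alive() {
    out=$(squeue -u "$USER" --noheader --format='%i' 2>/dev/null); rc=$?
    [ "$rc" -ne 0 ] && return 0                        # scheduler hiccup: assume alive
    printf '%s\n' "$out" | grep -qw "$JOB_ID" && return 0
    sleep 5                                            # re-check once before giving up
    out=$(squeue -u "$USER" --noheader --format='%i' 2>/dev/null) || return 0
    printf '%s\n' "$out" | grep -qw "$JOB_ID"
}

JOB_ID=4242   # made-up job ID
RESULTS=""
for SQUEUE_MODE in timeout listed absent prefix; do
    if job_alive; then verdict=alive; else verdict=gone; fi
    RESULTS="$RESULTS $SQUEUE_MODE=$verdict"
    echo "$SQUEUE_MODE -> $verdict"
done
```

The `prefix` case illustrates the switch from `grep -q` to `grep -qw` in the diff: a different job whose ID merely contains `$JOB_ID` as a substring no longer counts as a match.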
