Commit e7a4f94
fix(launcher): tolerate transient squeue timeouts in mi355x job poll

A full sweep floods slurmctld, so `squeue` intermittently returns
"slurm_load_jobs error: Socket timed out". The old liveness check
(`! squeue ... | grep -q $JOB_ID`) treated that empty/failed output as
"job died" and exit 1'd — a false failure on a healthy job (observed on
dsr1-fp8-mi355x-sglang-disagg conc 1024x2048).

Add job_alive(): a non-zero squeue exit is treated as "still alive" (don't
false-fail on a scheduler blip); only a SUCCESSFUL squeue that omits the
job — re-checked once to avoid a single-sample race — counts as gone. Used
by both the wait-for-log loop and the completion poll.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 475ce8a commit e7a4f94
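The liveness rule described in the message can be sketched as follows. This is a minimal illustration, not the committed code: only the names `job_alive` and `JOB_ID` come from the message, while the `squeue` flags (`-h`, `-u`, `-o '%i'`), the exact-match `grep`, and the `JOB_ALIVE_RECHECK_DELAY` variable are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: squeue flags and the re-check delay are assumptions,
# not taken from the launcher script.

# Returns 0 if the job should be treated as alive, 1 if it is gone.
job_alive() {
    local job_id="$1" attempt out
    for attempt in 1 2; do
        # A failed squeue (e.g. "Socket timed out") says nothing about
        # the job itself, so treat it as "still alive".
        if ! out="$(squeue -h -u "${USER:-$(id -un)}" -o '%i' 2>/dev/null)"; then
            return 0
        fi
        # squeue succeeded and lists the job: alive.
        if printf '%s\n' "$out" | grep -qx -- "$job_id"; then
            return 0
        fi
        # squeue succeeded but omitted the job: pause, then sample once more
        # before concluding (hypothetical delay knob, default 1 second).
        [ "$attempt" -eq 1 ] && sleep "${JOB_ALIVE_RECHECK_DELAY:-1}"
    done
    # Two successful samples in a row omitted the job: it is gone.
    return 1
}
```

Under these assumptions a completion poll reduces to `while job_alive "$JOB_ID"; do sleep 10; done`, and a scheduler timeout simply costs one extra poll interval instead of failing the run.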
1 file changed
Lines changed: 22 additions & 5 deletions
Diff summary (line contents not captured):
  new lines 93-108: 16 lines added
  old lines 95-97 replaced by new lines 111-113
  old line 105 replaced by new lines 121-122
  old line 107 replaced by new line 124