
Commit a149886

sbryngelson and claude committed
Auto-requeue SLURM jobs on preemption
Add --requeue to Phoenix sbatch scripts so preempted embers-QOS jobs are automatically rescheduled. Remove PREEMPTED from the monitor's terminal state list so it keeps waiting through the requeue cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent a553a75 commit a149886

3 files changed

Lines changed: 3 additions & 1 deletion

.github/scripts/monitor_slurm_job.sh

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ get_job_state() {
 # Check if a state is terminal (job is done, for better or worse)
 is_terminal_state() {
     case "$1" in
-        COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|PREEMPTED|BOOT_FAIL|DEADLINE)
+        COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|BOOT_FAIL|DEADLINE)
             return 0 ;;
         *)
             return 1 ;;
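
To illustrate why dropping PREEMPTED matters, here is a minimal sketch of a polling loop built on the post-commit state list. The `wait_for_job` function and its arguments are hypothetical and are not taken from `monitor_slurm_job.sh`; the real monitor may poll differently. The loop only stops on a terminal state, so a job that passes through PREEMPTED and back to PENDING or RUNNING is still waited on.

```shell
#!/usr/bin/env bash
# Sketch only: wait_for_job is a hypothetical helper, not the monitor's actual loop.

# Terminal states after this commit. PREEMPTED is intentionally absent,
# so a preempted job that gets requeued is treated as still in flight.
is_terminal_state() {
    case "$1" in
        COMPLETED|FAILED|CANCELLED|CANCELLED+|TIMEOUT|OUT_OF_MEMORY|NODE_FAIL|BOOT_FAIL|DEADLINE)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Poll until the job reaches a terminal state, then print that state.
#   $1: name of a command or function that prints the current job state
#   $2: seconds to sleep between polls (defaults to 0 for quick local testing)
wait_for_job() {
    local get_state="$1" interval="${2:-0}" state
    while true; do
        state="$("$get_state")"
        if is_terminal_state "$state"; then
            echo "$state"
            return 0
        fi
        sleep "$interval"
    done
}
```

With the pre-commit list, the same loop would have returned as soon as it saw PREEMPTED, before the scheduler had a chance to requeue the job.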

.github/workflows/phoenix/submit-bench.sh

Lines changed: 1 addition & 0 deletions
@@ -44,6 +44,7 @@ sbatch <<EOT
 $sbatch_device_opts
 #SBATCH -t 04:00:00 # Duration of the job (Ex: 15 mins)
 #SBATCH -q embers # QOS Name
+#SBATCH --requeue # Auto-requeue on preemption
 #SBATCH -o$job_slug.out # Combined output and error messages file
 
 set -e

.github/workflows/phoenix/submit.sh

Lines changed: 1 addition & 0 deletions
@@ -48,6 +48,7 @@ submit_output=$(sbatch <<EOT
 $sbatch_device_opts
 #SBATCH -t 03:00:00 # Duration of the job (Ex: 15 mins)
 #SBATCH -q embers # QOS Name
+#SBATCH --requeue # Auto-requeue on preemption
 #SBATCH -o$output_file # Combined output and error messages file
 
 set -e
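
A requeued job reruns its batch script from the top, so it can be useful for the job body to log whether it is a first attempt or a rerun. The sketch below is not part of this commit. It assumes Slurm sets `SLURM_RESTART_COUNT` in the environment of a requeued job and leaves it unset on the first run; `describe_attempt` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: describe_attempt is a hypothetical helper, not part of submit.sh.

# Report whether this run is the first attempt or a requeued rerun.
# Assumes SLURM_RESTART_COUNT is unset on the first run and holds the
# number of restarts once the job has been requeued.
describe_attempt() {
    local count="${SLURM_RESTART_COUNT:-0}"
    if [ "$count" -gt 0 ]; then
        echo "requeued run (restart count: $count)"
    else
        echo "first attempt"
    fi
}
```

Calling `describe_attempt` near the top of the job body, after `set -e`, would put one line in the job's output file for each attempt.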
