Fix self-hosted CI robustness: build cache, SLURM QOS, and submit resilience #1295
Merged
Changes from 11 commits
42 commits
9163a11 Fix Frontier benchmark SLURM: use batch+1:59+normal QOS
ffe80ec Fix bench.yml: restore timeout-minutes to 480 (revert accidental 240)
cfbc023 Remove persistent build cache for self-hosted test runners (sbryngelson)
5742030 Remove build cache from benchmark jobs on Phoenix and Frontier (sbryngelson)
7edb7c3 Fix submit.sh to survive monitor SIGKILL by re-checking SLURM state (sbryngelson)
773f5ad Extract monitor SIGKILL recovery into shared run_monitored_slurm_job.sh (sbryngelson)
1311cbe Reduce benchmark steps and switch Frontier bench to batch/normal QOS (sbryngelson)
644c9e4 Cap bench script parallelism at 64 to fix GNR node failures (sbryngelson)
a02f4b2 Disable AVX-512 FP16 to fix build on Granite Rapids nodes (sbryngelson)
ba91673 Fix Rich MarkupError crash when build output contains bracket paths (sbryngelson)
438627e Merge branch 'master' into fix/ci-robustness (sbryngelson)
3e773ff Address bot review comments: sacct -X flag, dead job_type var, stale …
fae2e6a Fix bench: use PR's submit.sh for master job to get SIGKILL recovery (sbryngelson)
3224931 Fix submit_and_monitor_bench.sh: define SCRIPT_DIR before use (sbryngelson)
2887def bench: update Phoenix tmpbuild path to project storage (sbryngelson)
1e4f984 Fix bench timeout (240→480) and monitor scancel defeating sacct recovery (sbryngelson)
5886f2a Fix sacct empty-output edge case in run_monitored_slurm_job.sh (sbryngelson)
0551dea bench: dynamic Phoenix GPU partition, per-case logs, downgrade grind … (sbryngelson)
16e0f76 bench: address code review findings in GPU partition selection (sbryngelson)
b396a1c ci: add gpu-h200 partition to Phoenix test and case-optimization GPU … (sbryngelson)
7e5cabe ci: scancel orphaned SLURM jobs when GitHub Actions cancels the runner (sbryngelson)
cf4f2a6 Fix Phoenix CPU test: restore build cache to isolate concurrent jobs (sbryngelson)
7abbce7 Revert "Fix Phoenix CPU test: restore build cache to isolate concurre… (sbryngelson)
df23011 Fix Phoenix test: pass explicit GPU flag to test command (sbryngelson)
8f586ae ci: remove self-hosted runner build cache (sbryngelson)
24f25f3 ci: nuke entire build dir on attempt 3 of retry_build (sbryngelson)
0104233 ci: reduce to 2 attempts, nuke build dir on retry (sbryngelson)
ffb43f7 ci: revert case-opt to clean: false to preserve SLURM build cache (sbryngelson)
fb6101d ci: treat PREEMPTED as non-terminal so --requeue jobs keep being moni… (sbryngelson)
68592d7 ci: clean build dir before case-opt pre-build; drop retry (sbryngelson)
0775fde ci: remove dead RETRY_CLEAN_CMD from bench.sh (sbryngelson)
aa21620 ci: allow Frontier jobs to fail without blocking workflow (sbryngelson)
18311b8 ci: fix shellcheck SC2162 - use read -r in while loops (sbryngelson)
f572dcf bench: prefer rtx6000/l40s/v100 over h200/h100/a100 for GPU partition (sbryngelson)
8f298d1 ci: decouple SLURM submit from monitor for Phoenix jobs (Option 2) (sbryngelson)
0819b0e Merge upstream/master: CCE 19.0.0 workaround, cache/build improvements (sbryngelson)
38df383 ci: fix --precision flag and remove Python 3.14 step in github job (sbryngelson)
07c4ab0 ci: fix fallback partition message, remove dead RETRY_CLEAN_CMD, fix … (sbryngelson)
1c81fc0 ci: submit-job.sh always submits fresh, cancels any stale SLURM job f… (sbryngelson)
0a39803 ci: fix heredoc pwd expansion, backtick substitution, combine bench l… (sbryngelson)
e686654 ci: remove redundant slurm_job_id write, improve bench log output (sbryngelson)
b97320b ci: add explanatory comments, fix backtick in submit.sh (sbryngelson)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| #!/bin/bash | ||
| # Run monitor_slurm_job.sh and recover if the monitor is killed (e.g. SIGKILL | ||
| # from the runner OS) before the SLURM job completes. When the monitor exits | ||
| # non-zero, sacct is used to verify the job's actual final state; if the SLURM | ||
| # job succeeded we exit 0 so the CI step is not falsely marked as failed. | ||
| # | ||
| # Usage: run_monitored_slurm_job.sh <job_id> <output_file> | ||
| | ||
| set -euo pipefail | ||
| | ||
| if [ $# -ne 2 ]; then | ||
| echo "Usage: $0 <job_id> <output_file>" | ||
| exit 1 | ||
| fi | ||
| | ||
| job_id="$1" | ||
| output_file="$2" | ||
| | ||
| SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" | ||
| | ||
| monitor_exit=0 | ||
| bash "$SCRIPT_DIR/monitor_slurm_job.sh" "$job_id" "$output_file" || monitor_exit=$? | ||
| | ||
| if [ "$monitor_exit" -ne 0 ]; then | ||
| echo "Monitor exited with code $monitor_exit; re-checking SLURM job $job_id final state..." | ||
| # Give the SLURM epilog time to finalize if the job just finished | ||
| sleep 30 | ||
| final_state=$(sacct -j "$job_id" -n -X -P -o State 2>/dev/null | head -n1 | cut -d'|' -f1 | tr -d ' ' || echo "UNKNOWN") | ||
| final_exit=$(sacct -j "$job_id" --format=ExitCode --noheader --parsable2 2>/dev/null | head -n1 | tr -d ' ' || echo "") | ||
| echo "Final SLURM state=$final_state exit=$final_exit" | ||
| if [ "$final_state" = "COMPLETED" ] && [ "$final_exit" = "0:0" ]; then | ||
| echo "SLURM job $job_id completed successfully despite monitor failure — continuing." | ||
| else | ||
| echo "ERROR: SLURM job $job_id did not complete successfully (state=$final_state exit=$final_exit)" | ||
| exit 1 | ||
| fi | ||
| fi | ||
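
The recovery branch in the script above accepts only the exact pair `COMPLETED` / `0:0` and treats everything else as a failure. As a minimal sketch, and not the PR's actual implementation, the same decision could be factored into a pure function that also distinguishes two cases named in the commit messages: empty `sacct` output and `PREEMPTED` treated as non-terminal for `--requeue` jobs. The function name `classify_slurm_result` and its four result labels are hypothetical, introduced here only for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): classify a SLURM job from the
# State and ExitCode strings reported by sacct.
# Prints one of: success, failed, pending, unknown.
classify_slurm_result() {
    local state="${1:-}" exit_code="${2:-}"

    # sacct can report e.g. "CANCELLED by 1234"; keep only the first word.
    state="${state%% *}"

    case "$state" in
        "")
            # Empty sacct output: accounting has no record for the job (yet).
            echo "unknown" ;;
        COMPLETED)
            if [ "$exit_code" = "0:0" ]; then
                echo "success"
            else
                echo "failed"
            fi ;;
        PENDING|RUNNING|REQUEUED|RESIZING|SUSPENDED|PREEMPTED)
            # Assumption: PREEMPTED is non-terminal because the job was
            # submitted with --requeue and will be rescheduled.
            echo "pending" ;;
        *)
            # FAILED, TIMEOUT, CANCELLED, NODE_FAIL, OUT_OF_MEMORY, ...
            echo "failed" ;;
    esac
}

# Usage sketch (needs a real cluster, so it is left commented out):
#   IFS='|' read -r state exit_code \
#       < <(sacct -j "$job_id" -n -X -P -o State,ExitCode 2>/dev/null | head -n1) || true
#   classify_slurm_result "$state" "$exit_code"
```

Returning a label rather than exiting directly lets the caller decide whether to keep polling (`pending`), retry the `sacct` query after a delay (`unknown`), or fail the CI step (`failed`).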