Commit 0945009

Shell completion auto-install and pre-commit hook improvements (#1124)
* Add CI lint gate and local precheck command
  - Add lint-gate job to test.yml that runs fast checks (formatting, spelling, linting) before expensive test matrix and HPC jobs start
  - Add concurrency groups to test.yml, coverage.yml, cleanliness.yml, and bench.yml to cancel superseded runs on new pushes
  - Add ./mfc.sh precheck command for local CI validation before pushing
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix precheck.sh portability and usability issues
  - Add cross-platform hash function (macOS uses md5, Linux uses md5sum)
  - Validate -j/--jobs argument (require value, must be numeric)
  - Improve error messages with actionable guidance
  - Clarify that formatting has been auto-applied when check fails
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Gate benchmarks on Test Suite completion
  Add wait-for-tests job that polls GitHub API to ensure:
  - Lint Gate passes first (fast fail)
  - All Github test jobs complete successfully
  - Only then do benchmark jobs start
  This prevents wasting HPC resources on benchmarking code that fails tests, while preserving the existing maintainer approval gate.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Auto-install git pre-commit hook for precheck
  - Add .githooks/pre-commit that runs ./mfc.sh precheck before commits
  - Auto-install hook on first ./mfc.sh invocation (symlinks to .git/hooks/)
  - Hook only installs once; subsequent runs skip if already present
  - Developers can bypass with: git commit --no-verify
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Use dynamic CPU count in pre-commit hook
  Auto-detect available CPUs for parallel formatting:
  - Linux: nproc
  - macOS: sysctl -n hw.ncpu
  - Fallback: 4
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Show CPU count in pre-commit hook output
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Add precheck command to CLI and autocomplete
  Register precheck in commands.py so it appears in:
  - Shell tab completion
  - CLI documentation
  - ./mfc.sh precheck --help
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Auto-update installed shell completions on regeneration
  When completion scripts are auto-regenerated, also update the installed completions at ~/.local/share/mfc/completions/ if they exist.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Show source command when completions auto-update
  When installed shell completions are auto-updated, print a message with the appropriate source command for the user's detected shell.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Always check installed completions on every run
  Previously, installed completions only updated when source files changed and regeneration occurred. Now we also check if the installed completions are older than the generated ones (e.g., after git pull brings new pre-generated completions).
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Prevent directory completion fallback in shell completions
  - Remove -o bashdefault from bash complete command to prevent falling back to directory completion when no matches found
  - Add explicit : (no-op) for zsh commands without arguments to prevent default file/directory completion
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Auto-install completions and fix bash completion options
  - Auto-install completions on first mfc.sh run (via main.py)
  - Add -o filenames back to bash complete (needed for file completion)
  - Keep -o bashdefault removed to prevent directory fallback
  - Simplify code by having __update_installed_completions handle both install and update cases
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Auto-install completions from mfc.sh with shell rc setup
  Move completion auto-install to mfc.sh so it runs for ALL commands including help, precheck, etc. This ensures completions are always set up on first run.
  - Install completion files to ~/.local/share/mfc/completions/
  - Add source line to .bashrc or fpath to .zshrc
  - Tell user to restart shell or source the file
  - main.py now only handles updates when generated files change
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Clarify verbose, debug, and debug-log flag documentation
  - -v/-vv/-vvv: output verbosity levels
  - --debug: build with debug compiler flags (for MFC Fortran code)
  - --debug-log/-d: Python toolchain debug logging (not MFC code)
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Cap pre-commit hook parallelism at 12 jobs
  Avoid hogging resources on machines with many cores.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix completion auto-install to check for files, not just directory
  Previously checked if ~/.local/share/mfc/completions/ existed. Now checks if the actual completion file exists for the user's shell. This handles the edge case of an empty completions directory.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Auto-update outdated shell completions
  Check if installed completions are older than source files and update them automatically. Shows message with source command only on install or update, silent otherwise.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Auto-activate completions when mfc.sh is sourced
  When running 'source ./mfc.sh' instead of './mfc.sh', completions are activated immediately in the current shell. This is useful for users who want tab completion without restarting their shell.
  - './mfc.sh' - installs/updates, shows source command to run
  - 'source ./mfc.sh' - installs/updates AND activates immediately
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Use workflow_run for benchmarks and extract completion logic
  - Replace polling-based wait-for-tests job with workflow_run trigger that fires when Test Suite completes (more efficient, no wasted runner minutes)
  - Extract shell completion setup from mfc.sh to dedicated toolchain/bootstrap/completions.sh script for better maintainability
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Suppress verbose package list from uv install by default
  Filter out individual package lines (+ pkg==1.0) from uv output while keeping progress info (Resolved, Built, Installed). Use -v flag to see full package list.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Compact splash screen from 45 to 20 lines
  Remove decorative boxes and condense layout while keeping all essential information: commands with aliases, descriptions, and quick start guide.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Pass arguments to python.sh for verbose flag support
  Now ./mfc.sh init -v and similar commands respect verbosity flags during venv/package installation.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Handle ./mfc.sh -v without command (show help, not error)
  When only flags are given without a command, show help screen instead of passing flags to main.py which would error.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix splash screen to use COMMANDS as single source of truth
  Use the CLI schema from commands.py instead of hardcoded descriptions for the compact splash screen.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Address AI reviewer feedback on PR #1124
  1. Fix workflow_run.actor.login returning re-runner instead of PR author by fetching actual PR author from GitHub API
  2. Quote all $@ in mfc.sh to handle arguments with spaces correctly
  3. Add existence check for completion source files before copying (prevents errors on fresh clones before generation)
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Show full build error output instead of truncating to 40 lines
  Truncation hides important context when diagnosing build failures.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix troubleshooting tips to suggest --debug instead of --debug-log
  --debug-log only enables Python toolchain logging, while --debug enables debug compiler flags which is actually useful for diagnosing build and run failures.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix flags-before-command detection and shell detection
  - Scan all args for a non-flag to detect if a command is present, so ./mfc.sh -v build works correctly instead of dropping args
  - Use ZSH_VERSION instead of $SHELL for shell detection (detects the running shell, not the login shell)
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Raise minimum Python version to 3.12 (pyrometheus requires it)
  The pyrometheus dependency requires Python >= 3.12. The previous minimum of 3.11 would allow bootstrapping but fail at package installation.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Lower minimum Python to 3.10 by pinning pyrometheus to pre-3.12 commit
  Pin pyrometheus to commit 49833404f (before it added a Python >= 3.12 requirement) so MFC can support Python 3.10+. Verified that bootstrap, build, and chemistry test cases all pass with Python 3.10.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add -v flag to all CI build/run/test/bench commands
  Enables verbose output in CI for easier debugging of failures.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add -v flag to coverage and GitHub runner CI commands
  Missed these in the previous commit; adds verbose output to the codecov build/test and the GitHub-hosted runner build/test steps.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Revert pyrometheus pin to track git HEAD
  The Python 3.10 compatibility changes have been merged upstream, so we no longer need to pin to a specific commit.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove -v flag from HPC CI scripts to fix build stalls
  The -v flag causes verbose compiler output (-Minfo=inline -Minfo=accel) which fills pipe buffers during GPU compilation, causing the compiler to block on pipe_write and the build to hang indefinitely until SLURM kills the job at the time limit. This was the root cause of CI failures on Phoenix and Frontier GPU jobs that appeared as 3-hour timeouts but were actually build stalls.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix shell detection for zsh users sourcing under bash
  The completions.sh script runs under bash due to its shebang, so ZSH_VERSION is never set even for zsh users. Add fallback to check $SHELL environment variable which reflects the user's login shell.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Limit push-triggered CI to master branch only
  Previously, workflows triggered on both push and pull_request events, causing duplicate CI runs when pushing to PR branches.
  Now push events only trigger CI on master, while PRs are tested via pull_request event.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Fix shell detection to not use $SHELL fallback when in bash
  The previous fix could incorrectly install zsh completions when running mfc.sh in bash if the user's login shell ($SHELL) was zsh. Now we only fall back to $SHELL when BASH_VERSION is also unset.
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Change Frontier SLURM account from CFD154 to ENG160
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Remove -v flag from GitHub/coverage CI and pin pyrometheus
  - Remove verbose flag from GitHub runner and coverage CI commands to prevent build hangs from pipe buffer exhaustion
  - Pin pyrometheus to known-good commit 4983340 to fix potential chemistry build failures on Frontier
  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Harden CI against transient Frontier/Phoenix failures
  - Add retry logic (3 attempts) to Frontier and Frontier AMD build scripts to handle transient Lustre I/O errors during compilation
  - Replace fragile sbatch -W with monitor_slurm_job.sh in all submit scripts (Frontier, Frontier AMD, Phoenix) for resilient job monitoring with exponential backoff and sacct fallback
  - Add SIGHUP trap to submit scripts to survive login node session drops
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add SIGHUP protection to Frontier build scripts
  The build step runs directly on the login node (not through sbatch), so it also needs trap '' HUP to survive session drops.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Restore -v flag to all CI build and test commands
  The verbose flag was removed based on a pipe buffer theory that turned out to be wrong; the real issue was SIGHUP from login node session drops, which is now fixed.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove -v from GitHub-hosted CI, keep for self-hosted runners
  Verbose output stalls GitHub-hosted runners (pipe buffer issue) but is safe on Frontier/Phoenix where output goes to SLURM files.
  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix pipe deadlock in -v build mode and restore -v everywhere
  The streaming build mode (-v) was reading stdout and stderr sequentially, causing a classic subprocess deadlock when the compiler's stderr filled the 64KB pipe buffer. Fix by merging stderr into stdout with subprocess.STDOUT. Now safe to use -v on all CI runners (GitHub-hosted and self-hosted).

* Pass --test-all to Build step so post_process builds during dry-run
  Previously HDF5/Silo/post_process were only built during the Test step, adding build time to test execution. Now the Build step includes --test-all for MPI jobs so all targets are pre-built.

* Refactor test.yml: generic self-hosted steps via matrix.cluster
  - Replace lbl_comp with cluster field that maps to workflow directory names (phoenix, frontier, frontier_amd), making steps generic
  - Collapse 6 conditional Build/Test steps into 2 generic ones
  - Merge duplicate Archive Logs blocks into one
  - Rename OPT1/OPT2 to TEST_ALL/TEST_PCT for clarity
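
The CPU-count detection and the 12-job cap described in the message could look roughly like the sketch below. It is an illustration, not the hook's actual code: the function name `detect_jobs` and its optional cap argument are assumptions. Only the detection order (nproc, then sysctl -n hw.ncpu, then a fallback of 4) and the cap of 12 come from the message.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and argument are assumptions, not from the commit).
# Prints the number of parallel jobs to use, capped at $1 (default 12).
detect_jobs() {
    local cap="${1:-12}"
    local n=""

    if command -v nproc >/dev/null 2>&1; then
        n=$(nproc 2>/dev/null)              # Linux (GNU coreutils)
    elif command -v sysctl >/dev/null 2>&1; then
        n=$(sysctl -n hw.ncpu 2>/dev/null)  # macOS
    fi

    # Fall back to 4 when detection failed or returned something non-numeric.
    case "$n" in
        ''|*[!0-9]*) n=4 ;;
    esac
    if [ "$n" -lt 1 ]; then
        n=4
    fi

    # Cap the result so large machines are not saturated.
    if [ "$n" -gt "$cap" ]; then
        n="$cap"
    fi
    echo "$n"
}

jobs_to_use=$(detect_jobs)
echo "Using $jobs_to_use parallel jobs"
```

A hook could then pass the value on, for example as `./mfc.sh precheck -j "$jobs_to_use"`, since the message states that precheck accepts -j/--jobs.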
1 parent 7b35b59 commit 0945009

27 files changed

Lines changed: 510 additions & 275 deletions

.github/workflows/bench.yml

Lines changed: 50 additions & 51 deletions
@@ -1,87 +1,85 @@
 name: 'Benchmark'
 
 on:
-  pull_request:
-  pull_request_review:
-    types: [submitted]
+  # Trigger when Test Suite completes (no polling needed)
+  workflow_run:
+    workflows: ["Test Suite"]
+    types: [completed]
   workflow_dispatch:
 
 concurrency:
-  group: ${{ github.workflow }}-${{ github.ref }}
+  group: ${{ github.workflow }}-${{ github.event.workflow_run.head_branch || github.ref }}
   cancel-in-progress: true
 
 jobs:
   file-changes:
     name: Detect File Changes
+    # Only run if Test Suite passed (or manual dispatch)
+    if: github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success'
     runs-on: 'ubuntu-latest'
     outputs:
       checkall: ${{ steps.changes.outputs.checkall }}
+      pr_number: ${{ steps.pr-info.outputs.pr_number }}
+      pr_approved: ${{ steps.pr-info.outputs.approved }}
+      pr_author: ${{ steps.pr-info.outputs.author }}
     steps:
       - name: Clone
         uses: actions/checkout@v4
+        with:
+          ref: ${{ github.event.workflow_run.head_sha || github.sha }}
 
       - name: Detect Changes
         uses: dorny/paths-filter@v3
         id: changes
         with:
           filters: ".github/file-filter.yml"
 
-  wait-for-tests:
-    name: Wait for Test Suite
-    runs-on: ubuntu-latest
-    steps:
-      - name: Wait for Test Suite to Pass
+      - name: Get PR Info
+        id: pr-info
         env:
           GH_TOKEN: ${{ github.token }}
         run: |
-          echo "Waiting for Test Suite workflow to complete..."
-          SHA="${{ github.event.pull_request.head.sha || github.sha }}"
-
-          # Poll every 60 seconds for up to 3 hours
-          for i in $(seq 1 180); do
-            # Get the Test Suite workflow runs for this commit
-            STATUS=$(gh api repos/${{ github.repository }}/commits/$SHA/check-runs \
-              --jq '.check_runs[] | select(.name == "Lint Gate") | .conclusion' | head -1)
-
-            if [ "$STATUS" = "success" ]; then
-              echo "Lint Gate passed. Checking test jobs..."
-
-              # Check if any Github test jobs failed
-              FAILED=$(gh api repos/${{ github.repository }}/commits/$SHA/check-runs \
-                --jq '[.check_runs[] | select(.name | startswith("Github")) | select(.conclusion == "failure")] | length')
-
-              if [ "$FAILED" != "0" ]; then
-                echo "::error::Test Suite has failing jobs. Benchmarks will not run."
-                exit 1
+          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
+            echo "pr_number=" >> $GITHUB_OUTPUT
+            echo "approved=true" >> $GITHUB_OUTPUT
+            echo "author=${{ github.actor }}" >> $GITHUB_OUTPUT
+          else
+            # Get PR number from workflow_run
+            PR_NUMBER="${{ github.event.workflow_run.pull_requests[0].number }}"
+            if [ -n "$PR_NUMBER" ]; then
+              echo "pr_number=$PR_NUMBER" >> $GITHUB_OUTPUT
+
+              # Fetch actual PR author from API (workflow_run.actor is the re-runner, not PR author)
+              PR_AUTHOR=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER --jq '.user.login')
+              echo "author=$PR_AUTHOR" >> $GITHUB_OUTPUT
+
+              # Check if PR is approved
+              APPROVED=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/reviews \
+                --jq '[.[] | select(.state == "APPROVED")] | length')
+              if [ "$APPROVED" -gt 0 ]; then
+                echo "approved=true" >> $GITHUB_OUTPUT
+              else
+                echo "approved=false" >> $GITHUB_OUTPUT
               fi
-
-              # Check if Github tests are still running
-              PENDING=$(gh api repos/${{ github.repository }}/commits/$SHA/check-runs \
-                --jq '[.check_runs[] | select(.name | startswith("Github")) | select(.conclusion == null)] | length')
-
-              if [ "$PENDING" = "0" ]; then
-                echo "All Test Suite jobs completed successfully!"
-                exit 0
-              fi
-
-              echo "Tests still running ($PENDING pending)..."
-            elif [ "$STATUS" = "failure" ]; then
-              echo "::error::Lint Gate failed. Benchmarks will not run."
-              exit 1
             else
-              echo "Lint Gate status: ${STATUS:-pending}..."
+              echo "pr_number=" >> $GITHUB_OUTPUT
+              echo "approved=false" >> $GITHUB_OUTPUT
+              echo "author=" >> $GITHUB_OUTPUT
             fi
-
-            sleep 60
-          done
-
-          echo "::error::Timeout waiting for Test Suite to complete."
-          exit 1
+          fi
 
   self:
     name: "${{ matrix.name }} (${{ matrix.device }}${{ matrix.interface != 'none' && format('-{0}', matrix.interface) || '' }})"
-    if: ${{ github.repository=='MFlowCode/MFC' && needs.file-changes.outputs.checkall=='true' && ((github.event_name=='pull_request_review' && github.event.review.state=='approved') || (github.event_name=='pull_request' && (github.event.pull_request.user.login=='sbryngelson' || github.event.pull_request.user.login=='wilfonba'))) }}
-    needs: [file-changes, wait-for-tests]
+    if: >
+      github.repository == 'MFlowCode/MFC' &&
+      needs.file-changes.outputs.checkall == 'true' &&
+      (
+        github.event_name == 'workflow_dispatch' ||
+        needs.file-changes.outputs.pr_approved == 'true' ||
+        needs.file-changes.outputs.pr_author == 'sbryngelson' ||
+        needs.file-changes.outputs.pr_author == 'wilfonba'
+      )
+    needs: [file-changes]
     strategy:
       fail-fast: false
       matrix:
@@ -145,6 +143,7 @@ jobs:
       - name: Clone - PR
         uses: actions/checkout@v4
         with:
+          ref: ${{ github.event.workflow_run.head_sha || github.sha }}
           path: pr
 
       - name: Clone - Master
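
Read as plain shell, the new `if:` condition on the `self` job amounts to the predicate sketched below. This is for illustration only: `should_run_bench` and its positional arguments are hypothetical names, and the real gate is evaluated by GitHub Actions expressions, not by a script.

```shell
#!/usr/bin/env bash
# Hypothetical predicate mirroring the `self` job condition in bench.yml.
# Arguments (all assumed names): repository, checkall, event name,
# pr_approved, pr_author. Returns 0 when benchmarks should run.
should_run_bench() {
    local repo="$1" checkall="$2" event="$3" approved="$4" author="$5"

    # Both of these must hold regardless of how the run was triggered.
    [ "$repo" = "MFlowCode/MFC" ] || return 1
    [ "$checkall" = "true" ] || return 1

    # Any one of the following is sufficient.
    if [ "$event" = "workflow_dispatch" ] || [ "$approved" = "true" ]; then
        return 0
    fi
    case "$author" in
        sbryngelson|wilfonba) return 0 ;;
    esac
    return 1
}

if should_run_bench "MFlowCode/MFC" "true" "workflow_run" "false" "wilfonba"; then
    echo "benchmarks would run"
fi
```

Note that a manual dispatch satisfies the gate directly, which matches the `approved=true` output written for `workflow_dispatch` in the Get PR Info step.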

.github/workflows/cleanliness.yml

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,10 @@
 name: Cleanliness
 
-on: [push, pull_request, workflow_dispatch]
+on:
+  push:
+    branches: [master]
+  pull_request:
+  workflow_dispatch:
 
 concurrency:
   group: ${{ github.workflow }}-${{ github.ref }}

.github/workflows/coverage.yml

Lines changed: 7 additions & 3 deletions
@@ -1,6 +1,10 @@
 name: Coverage Check
 
-on: [push, pull_request, workflow_dispatch]
+on:
+  push:
+    branches: [master]
+  pull_request:
+  workflow_dispatch:
 
 concurrency:
   group: ${{ github.workflow }}-${{ github.ref }}
@@ -39,10 +43,10 @@ jobs:
             libfftw3-dev libhdf5-dev libblas-dev liblapack-dev
 
       - name: Build
-        run: /bin/bash mfc.sh build -j $(nproc) --gcov
+        run: /bin/bash mfc.sh build -v -j $(nproc) --gcov
 
       - name: Test
-        run: /bin/bash mfc.sh test -a -j $(nproc)
+        run: /bin/bash mfc.sh test -v -a -j $(nproc)
 
       - name: Upload coverage reports to Codecov
         uses: codecov/codecov-action@v4

.github/workflows/docs.yml

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ on:
     - cron: '0 0 * * *' # This runs every day at midnight UTC
   workflow_dispatch:
   push:
+    branches: [master]
   pull_request:
 
 jobs:

.github/workflows/formatting.yml

Lines changed: 5 additions & 1 deletion
@@ -1,6 +1,10 @@
 name: Pretty
 
-on: [push, pull_request, workflow_dispatch]
+on:
+  push:
+    branches: [master]
+  pull_request:
+  workflow_dispatch:
 
 jobs:
   docs:
Lines changed: 38 additions & 8 deletions
@@ -1,5 +1,8 @@
 #!/bin/bash
 
+# Ignore SIGHUP to survive login node session drops
+trap '' HUP
+
 job_device=$1
 job_interface=$2
 run_bench=$3
@@ -15,12 +18,39 @@ fi
 
 . ./mfc.sh load -c f -m g
 
-if [ "$run_bench" == "bench" ]; then
-    for dir in benchmarks/*/; do
-        dirname=$(basename "$dir")
-        ./mfc.sh run "$dir/case.py" --case-optimization -j 8 --dry-run $build_opts
-    done
-else
-    ./mfc.sh test -a --dry-run --rdma-mpi -j 8 $build_opts
-fi
+max_attempts=3
+attempt=1
+while [ $attempt -le $max_attempts ]; do
+    echo "Build attempt $attempt of $max_attempts..."
+    if [ "$run_bench" == "bench" ]; then
+        build_cmd_ok=true
+        for dir in benchmarks/*/; do
+            dirname=$(basename "$dir")
+            if ! ./mfc.sh run -v "$dir/case.py" --case-optimization -j 8 --dry-run $build_opts; then
+                build_cmd_ok=false
+                break
+            fi
+        done
+    else
+        if ./mfc.sh test -v -a --dry-run --rdma-mpi -j 8 $build_opts; then
+            build_cmd_ok=true
+        else
+            build_cmd_ok=false
+        fi
+    fi
+
+    if [ "$build_cmd_ok" = true ]; then
+        echo "Build succeeded on attempt $attempt."
+        exit 0
+    fi
+
+    if [ $attempt -lt $max_attempts ]; then
+        echo "Build failed on attempt $attempt. Cleaning and retrying in 30s..."
+        ./mfc.sh clean
+        sleep 30
+    fi
+    attempt=$((attempt + 1))
+done
 
+echo "Build failed after $max_attempts attempts."
+exit 1
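
The retry loop added above follows a common pattern that can be factored into a reusable function. The following is a minimal sketch, assuming a hypothetical `retry` helper that is not part of this commit. Unlike the script above, it does not run ./mfc.sh clean between attempts, and it returns a status instead of exiting.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the commit): run a command up to
# max_attempts times, sleeping `delay` seconds between failed attempts.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1

    while [ "$attempt" -le "$max_attempts" ]; do
        echo "Attempt $attempt of $max_attempts..." >&2
        if "$@"; then
            return 0
        fi
        # Only wait when another attempt will follow.
        if [ "$attempt" -lt "$max_attempts" ]; then
            echo "Attempt $attempt failed. Retrying in ${delay}s..." >&2
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done

    echo "Command failed after $max_attempts attempts." >&2
    return 1
}

# Example: a command that succeeds immediately needs a single attempt.
retry 3 0 true && echo "ok"
```

Because the command is passed as separate arguments and expanded with "$@", arguments containing spaces reach the command intact.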

.github/workflows/frontier/submit-bench.sh

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2-$3"
 sbatch <<EOT
 #!/bin/bash
 #SBATCH -JMFC-$job_slug # Job name
-#SBATCH -A CFD154 # charge account
+#SBATCH -A ENG160 # charge account
 #SBATCH -N 1 # Number of nodes required
 $sbatch_device_opts
 #SBATCH -t 05:59:00 # Duration of the job (Ex: 15 mins)

.github/workflows/frontier/submit.sh

Lines changed: 20 additions & 4 deletions
@@ -2,6 +2,9 @@
 
 set -e
 
+# Ignore SIGHUP to survive login node session drops
+trap '' HUP
+
 usage() {
     echo "Usage: $0 [script.sh] [cpu|gpu]"
 }
@@ -26,17 +29,17 @@ fi
 
 
 job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2-$3"
+output_file="$job_slug.out"
 
-sbatch <<EOT
+submit_output=$(sbatch <<EOT
 #!/bin/bash
 #SBATCH -J MFC-$job_slug # Job name
-#SBATCH -A CFD154 # charge account
+#SBATCH -A ENG160 # charge account
 #SBATCH -N 1 # Number of nodes required
 $sbatch_device_opts
 #SBATCH -t 05:59:00 # Duration of the job (Ex: 15 mins)
-#SBATCH -o$job_slug.out # Combined output and error messages file
+#SBATCH -o$output_file # Combined output and error messages file
 #SBATCH -p extended # Extended partition for shorter queues
-#SBATCH -W # Do not exit until the submitted job terminates.
 
 set -e
 set -x
@@ -53,4 +56,17 @@ job_interface="$3"
 $sbatch_script_contents
 
 EOT
+)
+
+job_id=$(echo "$submit_output" | grep -oE '[0-9]+')
+if [ -z "$job_id" ]; then
+    echo "ERROR: Failed to submit job. sbatch output:"
+    echo "$submit_output"
+    exit 1
+fi
+
+echo "Submitted batch job $job_id"
 
+# Use resilient monitoring instead of sbatch -W
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+bash "$SCRIPT_DIR/../../scripts/monitor_slurm_job.sh" "$job_id" "$output_file"
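
One thing to keep in mind with the job ID extraction above: `grep -oE '[0-9]+'` prints every run of digits in the sbatch output, so `job_id` would hold several values if that output ever contained another number (for example in a warning line, which is an assumption about possible output rather than something observed here). A stricter parse could anchor on the "Submitted batch job" line. The sketch below uses a hypothetical `parse_job_id` helper that is not part of this commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the commit): read sbatch output on stdin
# and print only the job ID from the "Submitted batch job <id>" line.
# Prints nothing when no such line is present.
parse_job_id() {
    sed -nE 's/^Submitted batch job ([0-9]+).*$/\1/p' | head -n 1
}

# Example with canned output instead of a real sbatch call.
sample_output="Submitted batch job 12345"
job_id=$(printf '%s\n' "$sample_output" | parse_job_id)
if [ -z "$job_id" ]; then
    echo "ERROR: could not parse job ID" >&2
else
    echo "Parsed job ID: $job_id"
fi
```

Slurm's `sbatch --parsable` option is another way to get machine-readable output (the job ID, followed by a semicolon and the cluster name when one is present); whether it suits these submit scripts is not something the commit addresses.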

.github/workflows/frontier/test.sh

Lines changed: 2 additions & 2 deletions
@@ -14,7 +14,7 @@ if [ "$job_device" = "gpu" ]; then
 fi
 
 if [ "$job_device" = "gpu" ]; then
-    ./mfc.sh test -a --rdma-mpi --max-attempts 3 -j $ngpus $device_opts -- -c frontier
+    ./mfc.sh test -v -a --rdma-mpi --max-attempts 3 -j $ngpus $device_opts -- -c frontier
 else
-    ./mfc.sh test -a --max-attempts 3 -j 32 --no-gpu -- -c frontier
+    ./mfc.sh test -v -a --max-attempts 3 -j 32 --no-gpu -- -c frontier
 fi
Lines changed: 38 additions & 8 deletions
@@ -1,5 +1,8 @@
 #!/bin/bash
 
+# Ignore SIGHUP to survive login node session drops
+trap '' HUP
+
 job_device=$1
 job_interface=$2
 run_bench=$3
@@ -15,12 +18,39 @@ fi
 
 . ./mfc.sh load -c famd -m g
 
-if [ "$run_bench" == "bench" ]; then
-    for dir in benchmarks/*/; do
-        dirname=$(basename "$dir")
-        ./mfc.sh run "$dir/case.py" --case-optimization -j 8 --dry-run $build_opts
-    done
-else
-    ./mfc.sh test -a --dry-run -j 8 $build_opts
-fi
+max_attempts=3
+attempt=1
+while [ $attempt -le $max_attempts ]; do
+    echo "Build attempt $attempt of $max_attempts..."
+    if [ "$run_bench" == "bench" ]; then
+        build_cmd_ok=true
+        for dir in benchmarks/*/; do
+            dirname=$(basename "$dir")
+            if ! ./mfc.sh run -v "$dir/case.py" --case-optimization -j 8 --dry-run $build_opts; then
+                build_cmd_ok=false
+                break
+            fi
+        done
+    else
+        if ./mfc.sh test -v -a --dry-run -j 8 $build_opts; then
+            build_cmd_ok=true
+        else
+            build_cmd_ok=false
+        fi
+    fi
+
+    if [ "$build_cmd_ok" = true ]; then
+        echo "Build succeeded on attempt $attempt."
+        exit 0
+    fi
+
+    if [ $attempt -lt $max_attempts ]; then
+        echo "Build failed on attempt $attempt. Cleaning and retrying in 30s..."
+        ./mfc.sh clean
+        sleep 30
+    fi
+    attempt=$((attempt + 1))
+done
 
+echo "Build failed after $max_attempts attempts."
+exit 1
