Commit b186af2

ci: split frontier_amd case-opt into pre-build + run to avoid 2h wall limit
AMD flang case-opt compilation takes close to the 2h hackathon wall limit, leaving no time for the run step. Split into two sequential hackathon GPU jobs:

1. Pre-Build: compiles all benchmarks via --dry-run (build only, no execution)
2. Run: skips build (binaries cached), runs and validates benchmarks

Also preserve dependency dirs in prebuild for non-Phoenix clusters (deps are already built by the Fetch Dependencies step, so only clean staging dirs).
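The two-job split described in the commit message can be sketched as follows. This is a minimal illustration, not the repository's scripts: `mfc_stub` is a hypothetical stand-in for `./mfc.sh` that only logs its arguments, the case paths are made up, and the run-phase command line is an assumption (the commit only shows the pre-build invocation).

```shell
#!/bin/bash
set -e

# Hypothetical stand-in for ./mfc.sh: logs the command instead of building.
log=$(mktemp)
mfc_stub() { echo "$*" >> "$log"; }

# Phase 1 (pre-build job): --dry-run builds and caches binaries, executes nothing.
prebuild_phase() {
    for c in "$@"; do
        mfc_stub run "$c" --case-optimization --gpu acc -j 8 --dry-run
    done
}

# Phase 2 (run job): assumed to find the cached binaries and only execute.
run_phase() {
    for c in "$@"; do
        mfc_stub run "$c" --case-optimization --gpu acc -j 8
    done
}

# Made-up case paths, for illustration only.
prebuild_phase benchmarks/example_a/case.py benchmarks/example_b/case.py
run_phase      benchmarks/example_a/case.py benchmarks/example_b/case.py
cat "$log"
```

Each phase maps to its own SLURM submission, so the compile time and the run time are charged against separate wall limits.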
1 parent dbf6aec commit b186af2

3 files changed

Lines changed: 22 additions & 13 deletions

.github/scripts/prebuild-case-optimization.sh

Lines changed: 13 additions & 9 deletions
@@ -1,10 +1,10 @@
 #!/bin/bash

-# Pre-builds all benchmark cases with --case-optimization.
-# No GPU hardware needed — compilation only.
+# Pre-builds all benchmark cases with --case-optimization using --dry-run so
+# binaries are cached before the GPU run job. No simulation is executed.
 # Can run in two modes:
 #   1. Direct (Frontier login nodes): pass cluster/device/interface as args
-#   2. Inside SLURM (Phoenix): uses $job_device/$job_interface from submit-slurm-job.sh
+#   2. Inside SLURM (Phoenix/frontier_amd): uses $job_device/$job_interface
 # Usage: bash prebuild-case-optimization.sh [<cluster> <device> <interface>]

 set -e
@@ -22,14 +22,18 @@ case "$cluster" in
     *) echo "ERROR: Unknown cluster '$cluster'"; exit 1 ;;
 esac

-source .github/scripts/clean-build.sh
-clean_build
+# Phoenix starts fresh (no prior dep build); other clusters pre-build deps via
+# build.sh first, so we must preserve them and only clean MFC target staging.
+if [ "$cluster" = "phoenix" ]; then
+    source .github/scripts/clean-build.sh
+    clean_build
+else
+    find build/staging -maxdepth 1 -regex '.*/[0-9a-f]+' -type d -exec rm -rf {} + 2>/dev/null || true
+    find build/install -maxdepth 1 -regex '.*/[0-9a-f]+' -type d -exec rm -rf {} + 2>/dev/null || true
+fi

 . ./mfc.sh load -c "$flag" -m g

-# Set GPU build flags from interface — this is always a GPU build.
-# Don't use gpu-opts.sh since $job_device may be "cpu" when submitted
-# to a CPU SLURM partition (no GPU hardware needed for compilation).
 case "$job_interface" in
     acc) gpu_opts="--gpu acc" ;;
     omp) gpu_opts="--gpu mp" ;;
@@ -38,5 +42,5 @@ esac

 for case in benchmarks/*/case.py; do
     echo "=== Pre-building: $case ==="
-    ./mfc.sh build -i "$case" --case-optimization $gpu_opts -j 8
+    ./mfc.sh run "$case" --case-optimization $gpu_opts -j 8 --dry-run
 done
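
The selective cleanup in the `else` branch relies on `find -regex` matching the whole path, so only direct children whose names consist entirely of hex characters are removed. A small sandbox demo of that behavior, assuming GNU `find` (the directory names are made up):

```shell
#!/bin/bash
set -e

# Throwaway sandbox; every directory name here is invented for the demo.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/build/staging/3fa9c01b2e" \
         "$sandbox/build/staging/fftw" \
         "$sandbox/build/staging/hipfort"
cd "$sandbox"

# Same expression as in the prebuild script: hash-named target dirs match,
# dependency dirs such as fftw or hipfort contain non-hex letters and do not.
find build/staging -maxdepth 1 -regex '.*/[0-9a-f]+' -type d -exec rm -rf {} + 2>/dev/null || true

# List what survived the cleanup.
ls build/staging
```

Two caveats follow from the pattern itself: a dependency directory whose name happens to be all hex characters would also be deleted, and BSD/macOS `find` may need `-E` for `+` to act as a quantifier.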

.github/scripts/run_case_optimization.sh

Lines changed: 3 additions & 2 deletions
@@ -23,13 +23,14 @@ benchmarks=(

 # For Frontier/Frontier AMD: deps were fetched on the login node via --deps-only;
 # build case-optimized binaries here on the compute node before running.
-# For Phoenix: prebuild-case-optimization.sh already built everything in a prior SLURM job.
+# For Phoenix and frontier_amd: prebuild-case-optimization.sh already built
+# everything in a prior SLURM job (via --dry-run), so skip the build here.
 #
 # Clean stale MFC target staging before building. On self-hosted CI runners,
 # corrupted intermediate files from a prior failed build (e.g. CCE optcg crash)
 # can persist and poison subsequent builds. Each case-opt config gets its own
 # hash-named staging dir, but install dirs and other artifacts may be stale.
-if [ "$job_cluster" != "phoenix" ]; then
+if [ "$job_cluster" != "phoenix" ] && [ "$job_cluster" != "frontier_amd" ]; then
     # Clean stale MFC target dirs (hash-named) from prior builds, but
     # preserve dependency dirs (hipfort, fftw, etc.) since the compute
     # node has no internet to re-fetch them.
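
If more clusters move to the pre-build model, the chained `[ ... ] && [ ... ]` test could be folded into a single predicate. One possible shape, where `needs_inline_build` is a hypothetical helper name and not part of this commit:

```shell
#!/bin/bash
# Returns 0 when the run job must clean and build in place, 1 when a prior
# pre-build job is assumed to have cached the binaries already.
needs_inline_build() {
    case "$1" in
        phoenix|frontier_amd) return 1 ;;
        *) return 0 ;;
    esac
}

for cluster in phoenix frontier frontier_amd; do
    if needs_inline_build "$cluster"; then
        echo "$cluster: clean staging and build on the compute node"
    else
        echo "$cluster: skip build, reuse pre-built binaries"
    fi
done
```

Adding a cluster then means extending one `case` pattern rather than appending another negated comparison.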

.github/workflows/test.yml

Lines changed: 6 additions & 2 deletions
@@ -659,12 +659,16 @@ jobs:
         if: matrix.cluster == 'phoenix'
         run: bash .github/scripts/submit-slurm-job.sh .github/scripts/prebuild-case-optimization.sh cpu ${{ matrix.interface }} ${{ matrix.cluster }}

+      - name: Pre-Build (SLURM)
+        if: matrix.cluster == 'frontier_amd'
+        run: bash .github/scripts/submit-slurm-job.sh .github/scripts/prebuild-case-optimization.sh gpu ${{ matrix.interface }} ${{ matrix.cluster }}
+
       - name: Build & Run Case-Optimization Tests
-        if: matrix.cluster != 'phoenix'
+        if: matrix.cluster != 'phoenix' && matrix.cluster != 'frontier_amd'
         run: bash .github/scripts/submit-slurm-job.sh .github/scripts/run_case_optimization.sh ${{ matrix.device }} ${{ matrix.interface }} ${{ matrix.cluster }}

       - name: Run Case-Optimization Tests
-        if: matrix.cluster == 'phoenix'
+        if: matrix.cluster == 'phoenix' || matrix.cluster == 'frontier_amd'
         run: bash .github/scripts/submit-slurm-job.sh .github/scripts/run_case_optimization.sh ${{ matrix.device }} ${{ matrix.interface }} ${{ matrix.cluster }}

       - name: Cancel SLURM Jobs
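
To check which submissions a given matrix entry triggers after this change, the `if:` conditions can be mirrored in a short script. This is only a sketch of the gating logic: `plan_jobs` is a hypothetical name, and the device and interface values passed at the bottom are example inputs rather than the workflow's actual matrix.

```shell
#!/bin/bash
# Prints, in order, the submit commands the workflow steps would issue
# for one matrix entry (cluster, device, interface).
plan_jobs() {
    cluster=$1 device=$2 interface=$3
    submit="bash .github/scripts/submit-slurm-job.sh"
    case "$cluster" in
        # Phoenix pre-builds with the cpu device argument, frontier_amd with gpu.
        phoenix)      echo "$submit .github/scripts/prebuild-case-optimization.sh cpu $interface $cluster" ;;
        frontier_amd) echo "$submit .github/scripts/prebuild-case-optimization.sh gpu $interface $cluster" ;;
    esac
    # Every cluster ends with the run script; on pre-built clusters it only runs.
    echo "$submit .github/scripts/run_case_optimization.sh $device $interface $cluster"
}

plan_jobs frontier_amd gpu omp
plan_jobs frontier gpu acc
```

Clusters with a pre-build step produce two submissions, all others produce one.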
