Merged
Changes from 11 commits
42 commits
9163a11
Fix Frontier benchmark SLURM: use batch+1:59+normal QOS
Mar 6, 2026
ffe80ec
Fix bench.yml: restore timeout-minutes to 480 (revert accidental 240)
Mar 6, 2026
cfbc023
Remove persistent build cache for self-hosted test runners
sbryngelson Mar 6, 2026
5742030
Remove build cache from benchmark jobs on Phoenix and Frontier
sbryngelson Mar 6, 2026
7edb7c3
Fix submit.sh to survive monitor SIGKILL by re-checking SLURM state
sbryngelson Mar 6, 2026
773f5ad
Extract monitor SIGKILL recovery into shared run_monitored_slurm_job.sh
sbryngelson Mar 6, 2026
1311cbe
Reduce benchmark steps and switch Frontier bench to batch/normal QOS
sbryngelson Mar 5, 2026
644c9e4
Cap bench script parallelism at 64 to fix GNR node failures
sbryngelson Mar 3, 2026
a02f4b2
Disable AVX-512 FP16 to fix build on Granite Rapids nodes
sbryngelson Mar 3, 2026
ba91673
Fix Rich MarkupError crash when build output contains bracket paths
sbryngelson Mar 2, 2026
438627e
Merge branch 'master' into fix/ci-robustness
sbryngelson Mar 6, 2026
3e773ff
Address bot review comments: sacct -X flag, dead job_type var, stale …
Mar 6, 2026
fae2e6a
Fix bench: use PR's submit.sh for master job to get SIGKILL recovery
sbryngelson Mar 6, 2026
3224931
Fix submit_and_monitor_bench.sh: define SCRIPT_DIR before use
sbryngelson Mar 6, 2026
2887def
bench: update Phoenix tmpbuild path to project storage
sbryngelson Mar 7, 2026
1e4f984
Fix bench timeout (240→480) and monitor scancel defeating sacct recovery
sbryngelson Mar 7, 2026
5886f2a
Fix sacct empty-output edge case in run_monitored_slurm_job.sh
sbryngelson Mar 7, 2026
0551dea
bench: dynamic Phoenix GPU partition, per-case logs, downgrade grind …
sbryngelson Mar 8, 2026
16e0f76
bench: address code review findings in GPU partition selection
sbryngelson Mar 8, 2026
b396a1c
ci: add gpu-h200 partition to Phoenix test and case-optimization GPU …
sbryngelson Mar 8, 2026
7e5cabe
ci: scancel orphaned SLURM jobs when GitHub Actions cancels the runner
sbryngelson Mar 8, 2026
cf4f2a6
Fix Phoenix CPU test: restore build cache to isolate concurrent jobs
sbryngelson Mar 8, 2026
7abbce7
Revert "Fix Phoenix CPU test: restore build cache to isolate concurre…
sbryngelson Mar 8, 2026
df23011
Fix Phoenix test: pass explicit GPU flag to test command
sbryngelson Mar 8, 2026
8f586ae
ci: remove self-hosted runner build cache
sbryngelson Mar 8, 2026
24f25f3
ci: nuke entire build dir on attempt 3 of retry_build
sbryngelson Mar 8, 2026
0104233
ci: reduce to 2 attempts, nuke build dir on retry
sbryngelson Mar 8, 2026
ffb43f7
ci: revert case-opt to clean: false to preserve SLURM build cache
sbryngelson Mar 8, 2026
fb6101d
ci: treat PREEMPTED as non-terminal so --requeue jobs keep being moni…
sbryngelson Mar 8, 2026
68592d7
ci: clean build dir before case-opt pre-build; drop retry
sbryngelson Mar 8, 2026
0775fde
ci: remove dead RETRY_CLEAN_CMD from bench.sh
sbryngelson Mar 8, 2026
aa21620
ci: allow Frontier jobs to fail without blocking workflow
sbryngelson Mar 8, 2026
18311b8
ci: fix shellcheck SC2162 - use read -r in while loops
sbryngelson Mar 8, 2026
f572dcf
bench: prefer rtx6000/l40s/v100 over h200/h100/a100 for GPU partition
sbryngelson Mar 9, 2026
8f298d1
ci: decouple SLURM submit from monitor for Phoenix jobs (Option 2)
sbryngelson Mar 9, 2026
0819b0e
Merge upstream/master: CCE 19.0.0 workaround, cache/build improvements
sbryngelson Mar 9, 2026
38df383
ci: fix --precision flag and remove Python 3.14 step in github job
sbryngelson Mar 9, 2026
07c4ab0
ci: fix fallback partition message, remove dead RETRY_CLEAN_CMD, fix …
sbryngelson Mar 9, 2026
1c81fc0
ci: submit-job.sh always submits fresh, cancels any stale SLURM job f…
sbryngelson Mar 9, 2026
0a39803
ci: fix heredoc pwd expansion, backtick substitution, combine bench l…
sbryngelson Mar 9, 2026
e686654
ci: remove redundant slurm_job_id write, improve bench log output
sbryngelson Mar 9, 2026
b97320b
ci: add explanatory comments, fix backtick in submit.sh
sbryngelson Mar 9, 2026
37 changes: 37 additions & 0 deletions .github/scripts/run_monitored_slurm_job.sh
@@ -0,0 +1,37 @@
#!/bin/bash
# Run monitor_slurm_job.sh and recover if the monitor is killed (e.g. SIGKILL
# from the runner OS) before the SLURM job completes. When the monitor exits
# non-zero, sacct is used to verify the job's actual final state; if the SLURM
# job succeeded we exit 0 so the CI step is not falsely marked as failed.
#
# Usage: run_monitored_slurm_job.sh <job_id> <output_file>

set -euo pipefail

if [ $# -ne 2 ]; then
    echo "Usage: $0 <job_id> <output_file>"
    exit 1
fi

job_id="$1"
output_file="$2"

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

monitor_exit=0
bash "$SCRIPT_DIR/monitor_slurm_job.sh" "$job_id" "$output_file" || monitor_exit=$?

if [ "$monitor_exit" -ne 0 ]; then
    echo "Monitor exited with code $monitor_exit; re-checking SLURM job $job_id final state..."
    # Give the SLURM epilog time to finalize if the job just finished
    sleep 30
    final_state=$(sacct -j "$job_id" -n -X -P -o State 2>/dev/null | head -n1 | cut -d'|' -f1 | tr -d ' ' || echo "UNKNOWN")
    final_exit=$(sacct -j "$job_id" --format=ExitCode --noheader --parsable2 2>/dev/null | head -n1 | tr -d ' ' || echo "")
    echo "Final SLURM state=$final_state exit=$final_exit"
    if [ "$final_state" = "COMPLETED" ] && [ "$final_exit" = "0:0" ]; then
        echo "SLURM job $job_id completed successfully despite monitor failure — continuing."
    else
        echo "ERROR: SLURM job $job_id did not complete successfully (state=$final_state exit=$final_exit)"
        exit 1
    fi
fi
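
The recovery decision in the script above comes down to two string checks on `sacct` output. A minimal Python sketch of that logic follows, assuming the `sacct` output has already been captured as text. The function names `first_field` and `job_succeeded` and the sample strings are illustrative only and are not part of the PR.

```python
def first_field(sacct_output: str) -> str:
    """Mimic `head -n1 | cut -d'|' -f1 | tr -d ' '` on captured sacct output."""
    lines = sacct_output.splitlines()
    if not lines:
        return ""  # empty output yields an empty string, as in the shell pipeline
    return lines[0].split("|")[0].replace(" ", "")


def job_succeeded(state_output: str, exit_output: str) -> bool:
    """True only when the state is COMPLETED and the exit code is 0:0."""
    # The script does not `cut` the exit code, but a single-field line has no
    # '|' delimiter, so first_field() returns the same value.
    return first_field(state_output) == "COMPLETED" and first_field(exit_output) == "0:0"


# Hypothetical inputs for illustration only (not captured from a real cluster).
print(job_succeeded("COMPLETED\n", "0:0\n"))  # True
print(job_succeeded("FAILED\n", "1:0\n"))     # False
print(job_succeeded("", ""))                  # False
```

An empty `sacct` result is treated as a failure in this sketch, which matches the `else` branch of the script.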
2 changes: 1 addition & 1 deletion .github/workflows/bench.yml
@@ -88,7 +88,7 @@ jobs:
runs-on:
group: ${{ matrix.group }}
labels: ${{ matrix.labels }}
timeout-minutes: 480
timeout-minutes: 240
steps:
- name: Clone - PR
uses: actions/checkout@v4
5 changes: 4 additions & 1 deletion .github/workflows/frontier/bench.sh
@@ -2,8 +2,11 @@

source .github/scripts/bench-preamble.sh

# Cap parallel jobs at 64 to avoid overwhelming MPI daemons on large nodes.
n_jobs=$(( $(nproc) > 64 ? 64 : $(nproc) ))

if [ "$job_device" = "gpu" ]; then
./mfc.sh bench --mem 4 -j $n_ranks -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks
else
./mfc.sh bench --mem 1 -j $(nproc) -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks
./mfc.sh bench --mem 1 -j $n_jobs -o "$job_slug.yaml" -- -c $job_cluster $device_opts -n $n_ranks
fi
5 changes: 1 addition & 4 deletions .github/workflows/frontier/build.sh
@@ -20,10 +20,7 @@ build_opts="$gpu_opts"

. ./mfc.sh load -c $compiler_flag -m $([ "$job_device" = "gpu" ] && echo "g" || echo "c")

# Only set up build cache for test suite, not benchmarks
if [ "$run_bench" != "bench" ]; then
source .github/scripts/setup-build-cache.sh "$cluster_name" "$job_device" "$job_interface"
fi
rm -rf build

source .github/scripts/retry-build.sh
if [ "$run_bench" == "bench" ]; then
18 changes: 5 additions & 13 deletions .github/workflows/frontier/submit.sh
@@ -44,17 +44,10 @@ else
fi

# Select SBATCH params based on job type
if [ "$job_type" = "bench" ]; then
sbatch_account="#SBATCH -A ENG160"
sbatch_time="#SBATCH -t 05:59:00"
sbatch_partition="#SBATCH -p extended"
sbatch_extra=""
else
sbatch_account="#SBATCH -A CFD154"
sbatch_time="#SBATCH -t 01:59:00"
sbatch_partition="#SBATCH -p batch"
sbatch_extra="#SBATCH --qos=normal"
fi
sbatch_account="#SBATCH -A CFD154"
sbatch_time="#SBATCH -t 01:59:00"
sbatch_partition="#SBATCH -p batch"
sbatch_extra="#SBATCH --qos=normal"

shard_suffix=""
if [ -n "$4" ]; then
@@ -102,5 +95,4 @@ fi

echo "Submitted batch job $job_id"

# Use resilient monitoring instead of sbatch -W
bash "$SCRIPT_DIR/../../scripts/monitor_slurm_job.sh" "$job_id" "$output_file"
bash "$SCRIPT_DIR/../../scripts/run_monitored_slurm_job.sh" "$job_id" "$output_file"
10 changes: 8 additions & 2 deletions .github/workflows/phoenix/bench.sh
@@ -2,6 +2,10 @@

source .github/scripts/bench-preamble.sh

# Cap parallel jobs at 64 to avoid overwhelming MPI daemons on large nodes
# (GNR nodes have 192 cores but nproc is too aggressive for build/bench).
n_jobs=$(( $(nproc) > 64 ? 64 : $(nproc) ))

tmpbuild=/storage/scratch1/6/sbryngelson3/mytmp_build
currentdir=$tmpbuild/run-$(( RANDOM % 900 ))
mkdir -p $tmpbuild
@@ -15,10 +19,12 @@ else
bench_opts="--mem 1"
fi

rm -rf build

source .github/scripts/retry-build.sh
RETRY_CLEAN_CMD="./mfc.sh clean" retry_build ./mfc.sh build -j $(nproc) $build_opts || exit 1
RETRY_CLEAN_CMD="./mfc.sh clean" retry_build ./mfc.sh build -j $n_jobs $build_opts || exit 1

./mfc.sh bench $bench_opts -j $(nproc) -o "$job_slug.yaml" -- -c phoenix-bench $device_opts -n $n_ranks
./mfc.sh bench $bench_opts -j $n_jobs -o "$job_slug.yaml" -- -c phoenix-bench $device_opts -n $n_ranks

sleep 10
rm -rf "$currentdir" || true
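
The shell expression `$(( $(nproc) > 64 ? 64 : $(nproc) ))` used in both `bench.sh` scripts is simply a minimum of the core count and a fixed cap. A small Python sketch of the same rule is below. The names `MAX_JOBS` and `capped_jobs` are hypothetical, and `os.cpu_count()` is only an approximation of `nproc`, which also honors CPU affinity.

```python
import os

MAX_JOBS = 64  # cap taken from the scripts above


def capped_jobs(n_cores: int, cap: int = MAX_JOBS) -> int:
    """Equivalent of $(( n > cap ? cap : n )): never use more than `cap` jobs."""
    return min(n_cores, cap)


# Job count for the current machine (os.cpu_count() may return None).
print(capped_jobs(os.cpu_count() or 1))
# A 192-core node, as mentioned in the Phoenix comment, is capped at 64.
print(capped_jobs(192))  # 64
```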
3 changes: 1 addition & 2 deletions .github/workflows/phoenix/submit.sh
@@ -94,6 +94,5 @@ fi

echo "Submitted batch job $job_id"

# Use resilient monitoring instead of sbatch -W
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
bash "$SCRIPT_DIR/../../scripts/monitor_slurm_job.sh" "$job_id" "$output_file"
bash "$SCRIPT_DIR/../../scripts/run_monitored_slurm_job.sh" "$job_id" "$output_file"
3 changes: 1 addition & 2 deletions .github/workflows/phoenix/test.sh
@@ -3,8 +3,7 @@
source .github/scripts/gpu-opts.sh
build_opts="$gpu_opts"

# Set up persistent build cache
source .github/scripts/setup-build-cache.sh phoenix "$job_device" "$job_interface"
rm -rf build

# Build with retry; smoke-test cached binaries to catch architecture mismatches
# (SIGILL from binaries compiled on a different compute node).
25 changes: 18 additions & 7 deletions CMakeLists.txt
@@ -224,13 +224,24 @@ endif()

if (CMAKE_BUILD_TYPE STREQUAL "Release")
# Processor tuning: Check if we can target the host's native CPU's ISA.
CHECK_FORTRAN_COMPILER_FLAG("-march=native" SUPPORTS_MARCH_NATIVE)
if (SUPPORTS_MARCH_NATIVE)
add_compile_options($<$<COMPILE_LANGUAGE:Fortran>:-march=native>)
else()
CHECK_FORTRAN_COMPILER_FLAG("-mcpu=native" SUPPORTS_MCPU_NATIVE)
if (SUPPORTS_MCPU_NATIVE)
add_compile_options($<$<COMPILE_LANGUAGE:Fortran>:-mcpu=native>)
# Skip for gcov builds — -march=native on newer CPUs (e.g. Granite Rapids)
# can emit instructions the system assembler doesn't support.
if (NOT MFC_GCov)
CHECK_FORTRAN_COMPILER_FLAG("-march=native" SUPPORTS_MARCH_NATIVE)
if (SUPPORTS_MARCH_NATIVE)
add_compile_options($<$<COMPILE_LANGUAGE:Fortran>:-march=native>)
# Disable AVX-512 FP16: gfortran ≥12 emits vmovw instructions on
# Granite Rapids CPUs, but binutils <2.38 cannot assemble them.
# FP16 is unused in MFC's double-precision computations.
CHECK_FORTRAN_COMPILER_FLAG("-mno-avx512fp16" SUPPORTS_MNO_AVX512FP16)
if (SUPPORTS_MNO_AVX512FP16)
add_compile_options($<$<COMPILE_LANGUAGE:Fortran>:-mno-avx512fp16>)
endif()
else()
CHECK_FORTRAN_COMPILER_FLAG("-mcpu=native" SUPPORTS_MCPU_NATIVE)
if (SUPPORTS_MCPU_NATIVE)
add_compile_options($<$<COMPILE_LANGUAGE:Fortran>:-mcpu=native>)
endif()
endif()
endif()
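
The nested checks in the new CMake block can be read as a small decision tree. The Python sketch below restates that tree under the assumption that the set of compiler-supported flags is already known. The function name `tuning_flags` and its arguments are hypothetical and only illustrate the control flow; CMake itself determines support through `CHECK_FORTRAN_COMPILER_FLAG`.

```python
from typing import List, Set


def tuning_flags(gcov: bool, supported: Set[str]) -> List[str]:
    """Return the Fortran tuning flags the Release branch would add."""
    if gcov:
        return []  # gcov builds skip native tuning entirely
    if "-march=native" in supported:
        flags = ["-march=native"]
        # AVX-512 FP16 is disabled only when the compiler accepts the flag.
        if "-mno-avx512fp16" in supported:
            flags.append("-mno-avx512fp16")
        return flags
    if "-mcpu=native" in supported:
        return ["-mcpu=native"]  # fallback when -march=native is rejected
    return []


print(tuning_flags(False, {"-march=native", "-mno-avx512fp16"}))
print(tuning_flags(True, {"-march=native"}))
```

Note that `-mno-avx512fp16` is only ever added alongside `-march=native`, never with the `-mcpu=native` fallback.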

4 changes: 2 additions & 2 deletions benchmarks/5eq_rk3_weno3_hllc/case.py
@@ -191,8 +191,8 @@
"cyl_coord": "F",
"dt": dt,
"t_step_start": 0,
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
# Simulation Algorithm Parameters
"num_patches": 3,
"model_eqns": 2,
4 changes: 2 additions & 2 deletions benchmarks/hypo_hll/case.py
@@ -44,8 +44,8 @@
"p": Nz,
"dt": 1e-8,
"t_step_start": 0,
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
# Simulation Algorithm Parameters
"num_patches": 2,
"model_eqns": 2,
4 changes: 2 additions & 2 deletions benchmarks/ibm/case.py
@@ -48,8 +48,8 @@
"p": Nz,
"dt": mydt,
"t_step_start": 0,
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
# Simulation Algorithm Parameters
"num_patches": 1,
"model_eqns": 2,
4 changes: 2 additions & 2 deletions benchmarks/igr/case.py
@@ -63,8 +63,8 @@
"cyl_coord": "F",
"dt": dt,
"t_step_start": 0,
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(7 * (5 * size + 5)),
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
# Simulation Algorithm Parameters
"num_patches": 1,
"model_eqns": 2,
4 changes: 2 additions & 2 deletions benchmarks/viscous_weno5_sgb_acoustic/case.py
@@ -94,8 +94,8 @@
"p": Nz,
"dt": dt,
"t_step_start": 0,
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(6 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(6 * (5 * size + 5)),
"t_step_stop": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
"t_step_save": ARGS["steps"] if ARGS["steps"] is not None else int(2 * (5 * size + 5)),
# Simulation Algorithm Parameters
"num_patches": 2,
"model_eqns": 2,
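
Each benchmark case above uses the same fallback: when `ARGS["steps"]` is `None`, the step count is `int(k * (5 * size + 5))`, with `k` lowered from 7 (6 in one case) to 2. The sketch below works through that arithmetic. The helper names `default_steps` and `resolve_steps` are hypothetical, and `size=4` is an arbitrary illustrative value rather than one taken from the benchmarks.

```python
from typing import Optional


def default_steps(size: int, multiplier: int = 2) -> int:
    """Fallback step count used when no explicit step count is given."""
    return int(multiplier * (5 * size + 5))


def resolve_steps(steps: Optional[int], size: int, multiplier: int = 2) -> int:
    """Mirror `ARGS["steps"] if ARGS["steps"] is not None else ...`."""
    return steps if steps is not None else default_steps(size, multiplier)


print(default_steps(4, multiplier=7))  # 175 (old default for the k=7 cases)
print(default_steps(4, multiplier=2))  # 50  (new default)
print(resolve_steps(10, size=4))       # 10  (an explicit value always wins)
```

For the cases that previously used `k = 7`, the default run shrinks to 2/7 of its former step count for any `size`.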
5 changes: 3 additions & 2 deletions toolchain/mfc/build.py
@@ -1,6 +1,7 @@
import os, typing, hashlib, dataclasses, subprocess, re, time, sys, threading, queue

from rich.panel import Panel
from rich.text import Text
from rich.progress import Progress, SpinnerColumn, BarColumn, TextColumn, TimeElapsedColumn, TaskProgressColumn

from .case import Case
@@ -273,14 +274,14 @@ def _show_build_error(result: subprocess.CompletedProcess, stage: str):
stdout_text = result.stdout if isinstance(result.stdout, str) else result.stdout.decode('utf-8', errors='replace')
stdout_text = stdout_text.strip()
if stdout_text:
cons.raw.print(Panel(stdout_text, title="Output", border_style="yellow"))
cons.raw.print(Panel(Text(stdout_text), title="Output", border_style="yellow"))

# Show stderr if available
if result.stderr:
stderr_text = result.stderr if isinstance(result.stderr, str) else result.stderr.decode('utf-8', errors='replace')
stderr_text = stderr_text.strip()
if stderr_text:
cons.raw.print(Panel(stderr_text, title="Errors", border_style="red"))
cons.raw.print(Panel(Text(stderr_text), title="Errors", border_style="red"))

cons.print()
