Commit b97320b

sbryngelson and claude committed
ci: add explanatory comments, fix backtick in submit.sh
- submit.sh: replace backtick with $() for job_slug; add comment that the sed pipeline must stay in sync with submit-job.sh
- test.yml: explain clean: false (preserves .slurm_job_id for stale job detection) and continue-on-error on Frontier (CCE compiler instability)
- run_parallel_benchmarks.sh: explain GPU partition priority order (prefer smaller/older partitions to leave large nodes for production)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent e686654 commit b97320b

3 files changed

Lines changed: 13 additions & 2 deletions

.github/scripts/run_parallel_benchmarks.sh

Lines changed: 3 additions & 0 deletions
@@ -24,6 +24,9 @@ echo "=========================================="
 # both parallel jobs so PR and master always land on the same GPU type.
 if [ "$device" = "gpu" ] && [ "$cluster" = "phoenix" ]; then
     echo "Selecting Phoenix GPU partition for benchmark consistency..."
+    # Prefer older/smaller partitions first (rtx6000, l40s, v100) to leave
+    # large modern nodes (h200, h100, a100) free for production workloads.
+    # rtx6000 has the most nodes and gives the most consistent baselines.
     BENCH_GPU_PARTITION=""
     for part in gpu-rtx6000 gpu-l40s gpu-v100 gpu-h200 gpu-h100 gpu-a100; do
         # || true: grep -c exits 1 on zero matches (or when sinfo returns no output
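
The hunk above is cut off before the loop body, so the exact `sinfo` query is not visible here. As a rough sketch of the priority-order idea the new comments describe, the following picks the first partition that reports at least one idle node. The helper names `list_idle_nodes` and `select_partition` are hypothetical, and the `sinfo` flags are an assumed invocation, not the one used in run_parallel_benchmarks.sh.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are illustrative and not part of the repository.

# Hypothetical stand-in for the real sinfo query: one idle hostname per line.
list_idle_nodes() {
    sinfo -p "$1" -h -t idle -o '%n' 2>/dev/null
}

# Print the first partition, in the order given, that has an idle node.
select_partition() {
    local part idle
    for part in "$@"; do
        # grep -c prints 0 but exits 1 on zero matches; || true keeps the
        # assignment from failing under `set -e`.
        idle=$(list_idle_nodes "$part" | grep -c . || true)
        if [ "${idle:-0}" -gt 0 ]; then
            echo "$part"
            return 0
        fi
    done
    return 1  # no partition currently has idle nodes
}
```

Under these assumptions a caller could write `BENCH_GPU_PARTITION=$(select_partition gpu-rtx6000 gpu-l40s gpu-v100 gpu-h200 gpu-h100 gpu-a100 || true)`, so the argument order carries the priority that the comments document.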

.github/workflows/phoenix/submit.sh

Lines changed: 4 additions & 2 deletions
@@ -23,8 +23,10 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # Submit (idempotent — skips resubmission if a live job already exists)
 bash "$SCRIPT_DIR/submit-job.sh" "$@"
 
-# Derive the same job slug and file paths as submit-job.sh
-job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2-$3"
+# Derive the same job slug and file paths as submit-job.sh.
+# NOTE: this sed pipeline must stay identical to the one in submit-job.sh —
+# if they diverge the id-file will not be found and the monitor will fail.
+job_slug="$(basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g')-$2-$3"
 output_file="$job_slug.out"
 id_file="${job_slug}.slurm_job_id"
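
To make the slug rule concrete, here is a minimal sketch that wraps the same `basename | sed | sed` pipeline in a function. The function name `job_slug_for` and the example script path are hypothetical. The commit itself keeps the pipeline inline in both scripts.

```shell
#!/usr/bin/env bash
# Sketch only: job_slug_for is a hypothetical helper, not part of the repository.

# Build "<sanitized script name>-<arg2>-<arg3>" exactly as the inline pipeline does.
job_slug_for() {
    local base
    # 1. drop the directory, 2. drop a trailing ".sh",
    # 3. replace every non-alphanumeric character with "-".
    base=$(basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g')
    echo "${base}-$2-$3"
}

# Hypothetical input: the underscore and the inner dot both become dashes.
job_slug_for /tmp/run_case.v2.sh gpu acc   # prints run-case-v2-gpu-acc
```

If both submit.sh and submit-job.sh sourced a helper like this from one shared file, the two copies could not drift apart. That is one possible design, not something this commit does.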

.github/workflows/test.yml

Lines changed: 6 additions & 0 deletions
@@ -155,6 +155,9 @@ jobs:
     name: "${{ matrix.cluster_name }} (${{ matrix.device }}${{ matrix.interface != 'none' && format('-{0}', matrix.interface) || '' }}${{ matrix.shard != '' && format(' [{0}]', matrix.shard) || '' }})"
     if: github.repository == 'MFlowCode/MFC' && needs.file-changes.outputs.checkall == 'true' && github.event.pull_request.draft != true
     needs: [lint-gate, file-changes]
+    # Frontier CCE compiler is periodically broken by toolchain updates (e.g.
+    # cpe/25.03 introduced an IPA SIGSEGV in CCE 19.0.0). Allow Frontier to
+    # fail without blocking PR merges; Phoenix remains a hard gate.
     continue-on-error: ${{ matrix.runner == 'frontier' }}
     timeout-minutes: 480
     strategy:
@@ -233,6 +236,8 @@ jobs:
       - name: Clone
         uses: actions/checkout@v4
         with:
+          # clean: false preserves .slurm_job_id files across reruns so
+          # submit-job.sh can detect and cancel stale SLURM jobs on retry.
           clean: false
 
       - name: Build
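
The reason `clean: false` matters is that a default checkout cleans untracked files from the workspace, which would delete any `.slurm_job_id` file left by an earlier attempt. The sketch below shows, under assumptions, how a retry might use such a file to check whether the recorded job is still queued. The helpers `query_job_state` and `job_is_live` are hypothetical, and the real logic in submit-job.sh is not shown in this commit.

```shell
#!/usr/bin/env bash
# Sketch only: these helpers are illustrative and not part of the repository.

# Print the queue state of a job, or nothing once it has left the queue.
# squeue: -h drops the header, -j selects the job, -o '%T' prints its state.
query_job_state() {
    squeue -h -j "$1" -o '%T' 2>/dev/null
}

# Succeed only if the id file names a job that SLURM still reports as queued.
job_is_live() {
    local id_file=$1 job_id state
    [ -s "$id_file" ] || return 1            # no id recorded by a previous run
    job_id=$(tr -d '[:space:]' < "$id_file")
    state=$(query_job_state "$job_id" || true)
    [ -n "$state" ]
}
```

A caller could then decide what to do with a live job, for example skip resubmission or cancel it with `scancel` before submitting again. Which of the two applies in a given case is up to submit-job.sh.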
@@ -294,6 +299,7 @@ jobs:
     name: "Case Opt | ${{ matrix.cluster_name }} (${{ matrix.device }}-${{ matrix.interface }})"
     if: github.repository == 'MFlowCode/MFC' && needs.file-changes.outputs.checkall == 'true' && github.event.pull_request.draft != true
     needs: [lint-gate, file-changes]
+    # Frontier is non-blocking for the same reason as the self job above.
     continue-on-error: ${{ matrix.runner == 'frontier' }}
     timeout-minutes: 480
     strategy:
