Commit b5dfa1f

sbryngelson and claude committed
Add test sharding for Frontier CI; switch to batch/hackathon partition
Split Frontier GPU test configs into 2 shards (~75 min each) so they fit within the batch partition's 2h wall time limit. This allows all Frontier SLURM jobs to run concurrently instead of serially on the extended partition (which has a 1-job-per-user limit), reducing total CI wall clock from ~4.5h to ~2h.

Changes:
- Add --shard CLI argument (e.g., --shard 1/2) with modulo-based round-robin distribution across shards
- Switch Frontier submit scripts from extended to batch/hackathon (CFD154 account, 1h59m wall time)
- Shard the 3 Frontier GPU matrix entries into 6 (2 shards each)
- CPU entries remain unsharded

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 7a764d5 commit b5dfa1f

7 files changed

Lines changed: 63 additions & 12 deletions

.github/workflows/frontier/submit.sh

Lines changed: 5 additions & 3 deletions
@@ -34,12 +34,13 @@ output_file="$job_slug.out"
 submit_output=$(sbatch <<EOT
 #!/bin/bash
 #SBATCH -J MFC-$job_slug # Job name
-#SBATCH -A ENG160 # charge account
+#SBATCH -A CFD154 # charge account
 #SBATCH -N 1 # Number of nodes required
 $sbatch_device_opts
-#SBATCH -t 05:59:00 # Duration of the job (Ex: 15 mins)
+#SBATCH -t 01:59:00 # Duration of the job
 #SBATCH -o$output_file # Combined output and error messages file
-#SBATCH -p extended # Extended partition for shorter queues
+#SBATCH -p batch # Batch partition (concurrent jobs)
+#SBATCH --qos=hackathon # Hackathon QOS for batch access
 
 set -e
 set -x
@@ -50,6 +51,7 @@ echo "Running in $(pwd):"
 job_slug="$job_slug"
 job_device="$2"
 job_interface="$3"
+job_shard="$4"
 
 . ./mfc.sh load -c f -m $([ "$2" = "gpu" ] && echo "g" || echo "c")
 

.github/workflows/frontier/test.sh

Lines changed: 6 additions & 1 deletion
@@ -13,8 +13,13 @@ if [ "$job_device" = "gpu" ]; then
     fi
 fi
 
+shard_opts=""
+if [ -n "$job_shard" ]; then
+    shard_opts="--shard $job_shard"
+fi
+
 if [ "$job_device" = "gpu" ]; then
-    ./mfc.sh test -v -a --rdma-mpi --max-attempts 3 -j $ngpus $device_opts -- -c frontier
+    ./mfc.sh test -v -a --rdma-mpi --max-attempts 3 -j $ngpus $device_opts $shard_opts -- -c frontier
 else
     ./mfc.sh test -v -a --max-attempts 3 -j 32 --no-gpu -- -c frontier
 fi
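
The path of the optional fourth argument, from the workflow's `run:` line to the `--shard` flag, can be traced with a small sketch. It is illustrative only: `workflow_argv` and `shard_opts` are hypothetical helper names that do not exist in the repository, and `shlex.split` is used as an approximation of the shell's word splitting.

```python
import shlex

def workflow_argv(device: str, interface: str, shard: str) -> list[str]:
    """Approximate how the shell splits the workflow's `run:` line into words."""
    # Assumption: the matrix values are substituted unquoted, as in test.yml.
    line = f"submit.sh test.sh {device} {interface} {shard}"
    return shlex.split(line)

def shard_opts(argv: list[str]) -> list[str]:
    """Mirror test.sh: add --shard only when a fourth argument is present."""
    job_shard = argv[4] if len(argv) > 4 else ""
    return ["--shard", job_shard] if job_shard else []

print(shard_opts(workflow_argv("gpu", "acc", "1/2")))
print(shard_opts(workflow_argv("cpu", "none", "")))
```

Under these assumptions, an empty `matrix.shard` produces no fourth word at all, so `job_shard` is empty and the unsharded CPU jobs run the full suite.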

.github/workflows/frontier_amd/submit.sh

Lines changed: 5 additions & 3 deletions
@@ -34,12 +34,13 @@ output_file="$job_slug.out"
 submit_output=$(sbatch <<EOT
 #!/bin/bash
 #SBATCH -J MFC-$job_slug # Job name
-#SBATCH -A ENG160 # charge account
+#SBATCH -A CFD154 # charge account
 #SBATCH -N 1 # Number of nodes required
 $sbatch_device_opts
-#SBATCH -t 05:59:00 # Duration of the job (Ex: 15 mins)
+#SBATCH -t 01:59:00 # Duration of the job
 #SBATCH -o$output_file # Combined output and error messages file
-#SBATCH -p extended # Extended partition for shorter queues
+#SBATCH -p batch # Batch partition (concurrent jobs)
+#SBATCH --qos=hackathon # Hackathon QOS for batch access
 
 set -e
 set -x
@@ -50,6 +51,7 @@ echo "Running in $(pwd):"
 job_slug="$job_slug"
 job_device="$2"
 job_interface="$3"
+job_shard="$4"
 
 . ./mfc.sh load -c famd -m $([ "$2" = "gpu" ] && echo "g" || echo "c")
 

.github/workflows/frontier_amd/test.sh

Lines changed: 6 additions & 1 deletion
@@ -13,8 +13,13 @@ if [ "$job_device" = "gpu" ]; then
     fi
 fi
 
+shard_opts=""
+if [ -n "$job_shard" ]; then
+    shard_opts="--shard $job_shard"
+fi
+
 if [ "$job_device" = "gpu" ]; then
-    ./mfc.sh test -v -a --max-attempts 3 -j $ngpus $device_opts -- -c frontier_amd
+    ./mfc.sh test -v -a --max-attempts 3 -j $ngpus $device_opts $shard_opts -- -c frontier_amd
 else
     ./mfc.sh test -v -a --max-attempts 3 -j 32 --no-gpu -- -c frontier_amd
 fi

.github/workflows/test.yml

Lines changed: 30 additions & 4 deletions
@@ -163,7 +163,7 @@ jobs:
           TEST_PCT: ${{ matrix.debug == 'debug' && '-% 20' || '' }}
 
   self:
-    name: "${{ matrix.cluster_name }} (${{ matrix.device }}${{ matrix.interface != 'none' && format('-{0}', matrix.interface) || '' }})"
+    name: "${{ matrix.cluster_name }} (${{ matrix.device }}${{ matrix.interface != 'none' && format('-{0}', matrix.interface) || '' }}${{ matrix.shard != '' && format(' [{0}]', matrix.shard) || '' }})"
     if: github.repository == 'MFlowCode/MFC' && needs.file-changes.outputs.checkall == 'true'
     needs: [lint-gate, file-changes]
     continue-on-error: false
@@ -177,43 +177,69 @@ jobs:
             cluster_name: 'Georgia Tech | Phoenix'
             device: 'gpu'
             interface: 'acc'
+            shard: ''
           - runner: 'gt'
             cluster: 'phoenix'
             cluster_name: 'Georgia Tech | Phoenix'
             device: 'gpu'
             interface: 'omp'
+            shard: ''
           - runner: 'gt'
             cluster: 'phoenix'
             cluster_name: 'Georgia Tech | Phoenix'
             device: 'cpu'
             interface: 'none'
-          # Frontier (ORNL) — build on login node, test via SLURM
+            shard: ''
+          # Frontier (ORNL) — build on login node, GPU tests sharded for batch partition
           - runner: 'frontier'
             cluster: 'frontier'
             cluster_name: 'Oak Ridge | Frontier'
             device: 'gpu'
             interface: 'acc'
+            shard: '1/2'
+          - runner: 'frontier'
+            cluster: 'frontier'
+            cluster_name: 'Oak Ridge | Frontier'
+            device: 'gpu'
+            interface: 'acc'
+            shard: '2/2'
           - runner: 'frontier'
             cluster: 'frontier'
             cluster_name: 'Oak Ridge | Frontier'
             device: 'gpu'
             interface: 'omp'
+            shard: '1/2'
+          - runner: 'frontier'
+            cluster: 'frontier'
+            cluster_name: 'Oak Ridge | Frontier'
+            device: 'gpu'
+            interface: 'omp'
+            shard: '2/2'
           - runner: 'frontier'
             cluster: 'frontier'
             cluster_name: 'Oak Ridge | Frontier'
             device: 'cpu'
             interface: 'none'
-          # Frontier AMD — build on login node, test via SLURM
+            shard: ''
+          # Frontier AMD — build on login node, GPU tests sharded for batch partition
+          - runner: 'frontier'
+            cluster: 'frontier_amd'
+            cluster_name: 'Oak Ridge | Frontier (AMD)'
+            device: 'gpu'
+            interface: 'omp'
+            shard: '1/2'
           - runner: 'frontier'
             cluster: 'frontier_amd'
             cluster_name: 'Oak Ridge | Frontier (AMD)'
             device: 'gpu'
             interface: 'omp'
+            shard: '2/2'
           - runner: 'frontier'
             cluster: 'frontier_amd'
             cluster_name: 'Oak Ridge | Frontier (AMD)'
             device: 'cpu'
             interface: 'none'
+            shard: ''
     runs-on:
       group: phoenix
       labels: ${{ matrix.runner }}
@@ -230,7 +256,7 @@ jobs:
         run: bash .github/workflows/${{ matrix.cluster }}/build.sh ${{ matrix.device }} ${{ matrix.interface }}
 
       - name: Test
-        run: bash .github/workflows/${{ matrix.cluster }}/submit.sh .github/workflows/${{ matrix.cluster }}/test.sh ${{ matrix.device }} ${{ matrix.interface }}
+        run: bash .github/workflows/${{ matrix.cluster }}/submit.sh .github/workflows/${{ matrix.cluster }}/test.sh ${{ matrix.device }} ${{ matrix.interface }} ${{ matrix.shard }}
 
       - name: Print Logs
         if: always()
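
The effect of the new `name:` expression on the job labels can be checked with a rough Python equivalent. This is a sketch, not workflow code: `job_name` and `shard_entries` are hypothetical helpers, and the workflow itself lists each sharded entry by hand rather than generating it.

```python
def job_name(entry: dict) -> str:
    """Build the display name the way the workflow's `name:` expression does."""
    label = entry["device"]
    if entry["interface"] != "none":
        label += f"-{entry['interface']}"
    if entry.get("shard", "") != "":
        label += f" [{entry['shard']}]"
    return f"{entry['cluster_name']} ({label})"

def shard_entries(entry: dict, count: int) -> list[dict]:
    """Hypothetical helper: expand one matrix entry into '1/N' .. 'N/N' copies."""
    return [{**entry, "shard": f"{i}/{count}"} for i in range(1, count + 1)]

gpu = {"cluster_name": "Oak Ridge | Frontier", "device": "gpu", "interface": "acc"}
cpu = {"cluster_name": "Oak Ridge | Frontier", "device": "cpu", "interface": "none", "shard": ""}

for entry in shard_entries(gpu, 2) + [cpu]:
    print(job_name(entry))
```

Sharded GPU jobs gain a bracketed suffix such as `[1/2]`, while entries with an empty `shard` keep their previous names.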

toolchain/mfc/cli/commands.py

Lines changed: 6 additions & 0 deletions
@@ -452,6 +452,12 @@
             default=False,
             dest="dry_run",
         ),
+        Argument(
+            name="shard",
+            help="Run only a subset of tests (e.g., '1/2' for first half, '2/2' for second half).",
+            type=str,
+            default=None,
+        ),
     ],
     mutually_exclusive=[
         MutuallyExclusiveGroup(arguments=[
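
The commit registers `--shard` as a plain string and parses it later. One possible hardening, sketched here with the standard library's `argparse` rather than MFC's own `Argument` wrapper, is to validate the value at parse time. The `parse_shard` function is a hypothetical name and is not part of this commit.

```python
import argparse

def parse_shard(text: str) -> tuple[int, int]:
    """Parse 'i/n' into (i, n), rejecting malformed or out-of-range values."""
    try:
        idx_str, count_str = text.split("/")
        idx, count = int(idx_str), int(count_str)
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected 'i/n', got {text!r}") from None
    if count < 1 or not 1 <= idx <= count:
        raise argparse.ArgumentTypeError(f"shard index out of range: {text!r}")
    return idx, count

# Stand-in parser; the real CLI is built from MFC's command definitions.
parser = argparse.ArgumentParser(prog="test")
parser.add_argument("--shard", type=parse_shard, default=None,
                    help="Run only a subset of tests, e.g. '1/2'.")

print(parser.parse_args(["--shard", "1/2"]).shard)
print(parser.parse_args([]).shard)
```

With a check like this, a value such as `3/2` or `1-2` would fail with a usage error instead of silently selecting no tests or raising deep inside the test runner.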

toolchain/mfc/test/test.py

Lines changed: 5 additions & 0 deletions
@@ -99,6 +99,11 @@ def __filter(cases_) -> typing.List[TestCase]:
         skipped_cases += example_cases
         cases = [case for case in cases if case not in example_cases]
 
+    if ARG("shard") is not None:
+        shard_idx, shard_count = (int(x) for x in ARG("shard").split("/"))
+        skipped_cases += [c for i, c in enumerate(cases) if i % shard_count != shard_idx - 1]
+        cases = [c for i, c in enumerate(cases) if i % shard_count == shard_idx - 1]
+
     if ARG("percent") == 100:
         return cases, skipped_cases
 
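
The modulo-based selection above can be exercised in isolation. The sketch below copies the two list comprehensions into a standalone function (`select_shard` is a hypothetical name) and runs it on made-up case labels to show that the shards are disjoint and together cover every case.

```python
def select_shard(cases: list, shard_idx: int, shard_count: int) -> tuple[list, list]:
    """Return (kept, skipped) for a 1-based shard index using i % shard_count."""
    kept = [c for i, c in enumerate(cases) if i % shard_count == shard_idx - 1]
    skipped = [c for i, c in enumerate(cases) if i % shard_count != shard_idx - 1]
    return kept, skipped

# Made-up case labels, for illustration only.
cases = [f"case-{n}" for n in range(7)]

first, _ = select_shard(cases, 1, 2)
second, _ = select_shard(cases, 2, 2)
print(first)
print(second)

# Together the shards cover every case exactly once.
assert sorted(first + second) == sorted(cases)
assert not set(first) & set(second)
```

The partition property relies on every shard seeing the same case list in the same order, so it assumes case generation is deterministic across the separate SLURM jobs. Round-robin assignment also interleaves neighbouring cases, which tends to balance run time when expensive tests are clustered together in the list.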
