
Commit f7d0164

sbryngelson and claude committed
Detect stale cached binaries and include install/ in retry cleanup
Phoenix compute nodes may have different CPU architectures, causing SIGILL when running binaries cached from a different node. After a successful build, smoke-test syscheck to detect stale installs and trigger a full rebuild.

Also include build/install in retry cleanup for all clusters. With per-runner caching there are no concurrent readers sharing the same cache directory, so clearing install is safe.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 1adc121 commit f7d0164
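The smoke test described above treats any non-zero exit from syscheck as a stale install. As a minimal sketch of how an illegal-instruction crash could be told apart from an ordinary failure, the snippet below classifies a command by its exit status. It assumes the usual shell convention that a process killed by signal N is reported as 128 + N, and that SIGILL is signal 4 (as on Linux). The function name `classify_run` and the stand-in commands are hypothetical and are not part of the commit.

```shell
#!/bin/sh
# Run a command quietly and print how it ended: ok, sigill, or failed.
# Assumption: SIGILL is signal 4, so the shell reports 128 + 4 = 132.
classify_run() {
    "$@" > /dev/null 2>&1
    status=$?
    if [ "$status" -eq 0 ]; then
        echo ok
    elif [ "$status" -eq 132 ]; then
        echo sigill
    else
        echo failed
    fi
}

# Stand-ins for a real binary (hypothetical); a workflow would pass "$syscheck_bin".
classify_run true
# A child shell that sends SIGILL to itself, imitating a binary built for another CPU.
classify_run sh -c 'kill -s ILL $$'
classify_run false
```

Treating every non-zero status as stale, as the commit does, is the more conservative choice; a classification like this would only matter if other failure modes needed different handling.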

3 files changed

Lines changed: 24 additions & 7 deletions


.github/workflows/frontier/build.sh

Lines changed: 2 additions & 2 deletions
@@ -50,8 +50,8 @@ while [ $attempt -le $max_attempts ]; do
     fi

     if [ $attempt -lt $max_attempts ]; then
-        echo "Build failed on attempt $attempt. Clearing staging and retrying in 30s..."
-        rm -rf build/staging build/lock.yaml
+        echo "Build failed on attempt $attempt. Clearing cache and retrying in 30s..."
+        rm -rf build/staging build/install build/lock.yaml
         sleep 30
     fi
     attempt=$((attempt + 1))

.github/workflows/frontier_amd/build.sh

Lines changed: 2 additions & 2 deletions
@@ -50,8 +50,8 @@ while [ $attempt -le $max_attempts ]; do
     fi

     if [ $attempt -lt $max_attempts ]; then
-        echo "Build failed on attempt $attempt. Clearing staging and retrying in 30s..."
-        rm -rf build/staging build/lock.yaml
+        echo "Build failed on attempt $attempt. Clearing cache and retrying in 30s..."
+        rm -rf build/staging build/install build/lock.yaml
         sleep 30
     fi
     attempt=$((attempt + 1))

.github/workflows/phoenix/test.sh

Lines changed: 20 additions & 3 deletions
@@ -19,12 +19,30 @@ while [ $attempt -le $max_attempts ]; do
     echo "Build attempt $attempt of $max_attempts..."
     if ./mfc.sh test -v --dry-run -j 8 $build_opts; then
         echo "Build succeeded on attempt $attempt."
+
+        # Smoke-test the cached binaries to catch architecture mismatches
+        # (SIGILL from binaries compiled on a different compute node).
+        syscheck_bin=$(find build/install -name syscheck -type f 2>/dev/null | head -1)
+        if [ -n "$syscheck_bin" ] && ! "$syscheck_bin" > /dev/null 2>&1; then
+            echo "WARNING: syscheck binary crashed — cached install is stale."
+            if [ $attempt -lt $max_attempts ]; then
+                echo "Clearing cache and rebuilding..."
+                rm -rf build/staging build/install build/lock.yaml
+                sleep 5
+                attempt=$((attempt + 1))
+                continue
+            else
+                echo "ERROR: syscheck still failing after $max_attempts attempts."
+                exit 1
+            fi
+        fi
+
         break
     fi

     if [ $attempt -lt $max_attempts ]; then
-        echo "Build failed on attempt $attempt. Clearing staging and retrying in 30s..."
-        rm -rf build/staging build/lock.yaml
+        echo "Build failed on attempt $attempt. Clearing cache and retrying in 30s..."
+        rm -rf build/staging build/install build/lock.yaml
         sleep 30
     else
         echo "Build failed after $max_attempts attempts."
@@ -43,4 +61,3 @@ if [ "$job_device" = "gpu" ]; then
 fi

 ./mfc.sh test -v --max-attempts 3 -a -j $n_test_threads $device_opts -- -c phoenix
-
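The control flow added to the Phoenix script (build, smoke-test, clear the cache, retry) can be tried in isolation. The sketch below is a simplified stand-in rather than the workflow itself: `fake_build` and `fake_smoke` are hypothetical stubs replacing `./mfc.sh` and the syscheck binary, the sleeps are omitted, and both failure paths share one cleanup branch. Everything runs inside a temporary directory.

```shell
#!/bin/sh
# Simplified sketch of retry-with-smoke-test, using stubs instead of ./mfc.sh.
workdir=$(mktemp -d)
mkdir -p "$workdir/build/install"
touch "$workdir/build/install/stale"   # pretend the cache came from another node

# Hypothetical stand-in for the build: always succeeds and recreates install/.
fake_build() { mkdir -p "$workdir/build/install"; }

# Hypothetical stand-in for syscheck: fails while the stale marker exists.
fake_smoke() { [ ! -e "$workdir/build/install/stale" ]; }

max_attempts=3
attempt=1
result=failed
while [ "$attempt" -le "$max_attempts" ]; do
    if fake_build && fake_smoke; then
        result="ok on attempt $attempt"
        break
    fi
    # Either step failed: drop cached artifacts so the next attempt starts clean.
    rm -rf "$workdir/build/staging" "$workdir/build/install" "$workdir/build/lock.yaml"
    attempt=$((attempt + 1))
done
echo "$result"
rm -rf "$workdir"
```

In this sketch the first attempt builds successfully but fails the smoke test because of the stale marker, the cleanup removes the install directory, and the second attempt passes. The real script additionally distinguishes the final attempt and exits with an error instead of retrying.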
