Summary
After bumping the rocm-systems submodule pin in ROCm/TheRock from 79e85e1468f96a867108043c953e9547c13b4c5e to 5b20c93 (PR ROCm/TheRock#5136), the rocsolver test step on Linux gfx94X hangs in daily_lapack/GETRF_NPVT.batched__double/9. The test prints [ RUN ] but never produces an [ OK ] / [ FAILED ] line, and the GitHub Actions Test action is killed when it hits the 60-minute job timeout. Sister tests at the same parameter index (/9) for other types finish in ~1.5–13 s. This blocks the rocm-systems bump.
Failing CI link
Linux::release / Test gfx94X-dcgpu / Test rocsolver / Test rocsolver (shard 1/3) (gfx94X-dcgpu)
Tests Failed
- Hangs while running: daily_lapack/GETRF_NPVT.batched__double/9
  Other daily_lapack/GETRF_NPVT.* cases just before it complete in seconds (e.g. batched__float/9 in 1508 ms, batched__double/3 in 215 ms, batched__double/6 in 206 ms). batched__double/9 starts and never returns, so the remaining tests in the shard never run.
- The action is killed at the GitHub Actions 60-minute timeout:
  ##[error]The action 'Test' has timed out after 60 minutes.
  No gtest [ FAILED ] line is produced because the wrapper is SIGKILLed before the binary can print its summary.
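Because the hang is only surfaced by the 60-minute job timeout, a local reproduction is easier to work with if the hanging case runs under a much shorter limit. The following is a minimal sketch, not part of TheRock's tooling: `run_with_timeout` is a hypothetical helper, and the binary path in the usage comment is an assumption about where the rocsolver test executable lives.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it ran longer than timeout_s.

    subprocess.run kills the child process when the timeout expires,
    so a hung test does not linger after this function returns.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Example usage (binary path is an assumption, adjust to the local install):
# code = run_with_timeout(
#     ["./rocsolver-test",
#      "--gtest_filter=daily_lapack/GETRF_NPVT.batched__double/9"],
#     timeout_s=120,
# )
# A result of None means the case was still running after 120 seconds.
```

The sibling cases finish in at most about 13 s, so a limit of a couple of minutes should be enough to tell a hang from a slow run.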
Environment
- Repo: ROCm/TheRock (uses rocm-systems submodule)
- Workflow: Multi-Arch CI
- Target Archs: gfx94X-dcgpu
- Platform: Linux self-hosted runner (linux-gfx942-1gpu-ossci-rocm-...)
- Commit/ref: TheRock PR head 5940efff2e9d1ae6796550d7288e400740823fb3, rocm-systems pin 5b20c93 (compare against last-known-good pin 79e85e1)
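With a last-known-good pin and a failing pin, the offending rocm-systems commit can in principle be found by bisecting the range between them. The sketch below shows only the search logic under stated assumptions: `first_bad` is a hypothetical helper, the commit list must be ordered oldest to newest with the known-good pin first and the failing pin last, and `is_bad` stands in for "build this commit and check whether the test hangs", which is not implemented here.

```python
def first_bad(commits, is_bad):
    """Return the earliest commit for which is_bad(commit) is true.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Example with placeholder intermediate commits ("c1".."c3" are hypothetical):
history = ["79e85e1", "c1", "c2", "c3", "5b20c93"]
hangs = {"c2", "c3", "5b20c93"}
print(first_bad(history, lambda commit: commit in hangs))  # prints c2
```

This needs roughly log2(N) builds for N commits in the range, which matters when each check involves a full build and test run.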
Log excerpts
Last log lines before the timeout:
[ RUN ] daily_lapack/GETRF_NPVT.batched__float/9
[ OK ] daily_lapack/GETRF_NPVT.batched__float/9 (1508 ms)
[ RUN ] daily_lapack/GETRF_NPVT.batched__float/12
[ OK ] daily_lapack/GETRF_NPVT.batched__float/12 (195 ms)
[ RUN ] daily_lapack/GETRF_NPVT.batched__double/0
[ OK ] daily_lapack/GETRF_NPVT.batched__double/0 (2 ms)
[ RUN ] daily_lapack/GETRF_NPVT.batched__double/3
[ OK ] daily_lapack/GETRF_NPVT.batched__double/3 (215 ms)
[ RUN ] daily_lapack/GETRF_NPVT.batched__double/6
[ OK ] daily_lapack/GETRF_NPVT.batched__double/6 (206 ms)
[ RUN ] daily_lapack/GETRF_NPVT.batched__double/9
##[error]The action 'Test' has timed out after 60 minutes.
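The hanging case can be picked out of a log like the one above mechanically: it is the test that printed [ RUN ] with no matching [ OK ] or [ FAILED ]. A minimal sketch, assuming standard gtest status lines (the regex tolerates the variable padding inside the brackets and any log prefix before them); `unfinished_tests` is a hypothetical helper, not part of the CI scripts.

```python
import re

# Matches gtest status lines such as "[ RUN ] Suite.Test/9"
# or "[       OK ] Suite.Test/9 (12 ms)", with or without a log prefix.
STATUS_LINE = re.compile(r"\[\s*(RUN|OK|FAILED)\s*\]\s+(\S+)")


def unfinished_tests(log_text):
    """Return tests that printed [ RUN ] but never [ OK ] / [ FAILED ]."""
    started = []
    for line in log_text.splitlines():
        match = STATUS_LINE.search(line)
        if not match:
            continue
        status, name = match.groups()
        if status == "RUN":
            started.append(name)
        elif name in started:
            started.remove(name)
    return started


sample = """\
[ RUN ] daily_lapack/GETRF_NPVT.batched__double/6
[ OK ] daily_lapack/GETRF_NPVT.batched__double/6 (206 ms)
[ RUN ] daily_lapack/GETRF_NPVT.batched__double/9
##[error]The action 'Test' has timed out after 60 minutes.
"""
print(unfinished_tests(sample))
# prints ['daily_lapack/GETRF_NPVT.batched__double/9']
```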
Reproduce command (from job output):
python build_tools/github_actions/reproduce_test_failure.py \
--run-id 25572004503 \
--repository ROCm/TheRock \
--amdgpu-family gfx94X-dcgpu \
--test-script "python build_tools/github_actions/test_executable_scripts/test_rocsolver.py" \
--total-shards 3 \
--fetch-artifact-args="--blas --tests"
Impact
Blocker for promotion of rocm-systems submodule to TheRock: ROCm/TheRock#5136 (and any other bump from 79e85e1).