[Issue] Linux gfx94X: rocsolver daily_lapack/GETRF_NPVT.batched__double/9 hangs after rocm-systems bump from 79e85e1, Test step times out at 60 min #7231

@JeniferC99

Summary

After bumping the rocm-systems submodule pin in ROCm/TheRock from 79e85e1468f96a867108043c953e9547c13b4c5e to 5b20c93 (PR ROCm/TheRock#5136), the rocsolver test step on Linux gfx94X hangs partway through daily_lapack/GETRF_NPVT.batched__double/9. The test prints [ RUN ] but never produces an [ OK ] or [ FAILED ] line, and the GitHub Actions Test action is killed when it hits the 60-minute job timeout. Sister tests at the same parameter index (/9) for other types finish in ~1.5–13 s. This blocks the rocm-systems bump.

Failing CI link

Tests Failed

  1. Hangs while running:

    • daily_lapack/GETRF_NPVT.batched__double/9

    The daily_lapack/GETRF_NPVT.* cases that run just before it complete within about 1.5 s (e.g. batched__float/9 in 1508 ms, batched__double/3 in 215 ms, batched__double/6 in 206 ms). batched__double/9 starts and never returns, so the remaining tests in the shard never run.

  2. The action is killed at the GitHub Actions 60-minute timeout:

    ##[error]The action 'Test' has timed out after 60 minutes.
    

    No specific gtest [ FAILED ] line is produced because the wrapper is SIGKILLed before the binary can summarize.

Environment

  • Repo: ROCm/TheRock (uses rocm-systems submodule)
  • Workflow: Multi-Arch CI
  • Target Archs: gfx94X-dcgpu
  • Platform: Linux self-hosted runner (linux-gfx942-1gpu-ossci-rocm-...)
  • Commit/ref: TheRock PR head 5940efff2e9d1ae6796550d7288e400740823fb3, rocm-systems pin 5b20c93 (compare against last-known-good pin 79e85e1)

Log excerpts

Last few seconds before the timeout:

[ RUN      ] daily_lapack/GETRF_NPVT.batched__float/9
[       OK ] daily_lapack/GETRF_NPVT.batched__float/9 (1508 ms)
[ RUN      ] daily_lapack/GETRF_NPVT.batched__float/12
[       OK ] daily_lapack/GETRF_NPVT.batched__float/12 (195 ms)
[ RUN      ] daily_lapack/GETRF_NPVT.batched__double/0
[       OK ] daily_lapack/GETRF_NPVT.batched__double/0 (2 ms)
[ RUN      ] daily_lapack/GETRF_NPVT.batched__double/3
[       OK ] daily_lapack/GETRF_NPVT.batched__double/3 (215 ms)
[ RUN      ] daily_lapack/GETRF_NPVT.batched__double/6
[       OK ] daily_lapack/GETRF_NPVT.batched__double/6 (206 ms)
[ RUN      ] daily_lapack/GETRF_NPVT.batched__double/9
##[error]The action 'Test' has timed out after 60 minutes.
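
Because the job is killed before gtest prints a summary, the hanging test has to be read off the raw log. A minimal sketch of that check, assuming the log has been saved as plain text: the function name `unfinished_tests` and the inline sample are illustrative and not part of TheRock's tooling.

```python
import re

# Matches gtest status lines such as "[ RUN      ] suite/name/9" or
# "[       OK ] suite/name/9 (1508 ms)".
STATUS_LINE = re.compile(r"^\[\s*(RUN|OK|FAILED)\s*\]\s+(\S+)")


def unfinished_tests(log_text):
    """Return tests that printed [ RUN ] but no later [ OK ] / [ FAILED ]."""
    started = []
    for line in log_text.splitlines():
        match = STATUS_LINE.match(line.strip())
        if not match:
            continue
        status, name = match.groups()
        if status == "RUN":
            started.append(name)
        elif name in started:
            # The test reported a result, so it is no longer pending.
            started.remove(name)
    return started


# Sample taken from the log excerpt above (shortened).
log = """\
[ RUN      ] daily_lapack/GETRF_NPVT.batched__double/6
[       OK ] daily_lapack/GETRF_NPVT.batched__double/6 (206 ms)
[ RUN      ] daily_lapack/GETRF_NPVT.batched__double/9
##[error]The action 'Test' has timed out after 60 minutes.
"""

print(unfinished_tests(log))
# prints ['daily_lapack/GETRF_NPVT.batched__double/9']
```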

Reproduce command (from job output):

python build_tools/github_actions/reproduce_test_failure.py \
  --run-id 25572004503 \
  --repository ROCm/TheRock \
  --amdgpu-family gfx94X-dcgpu \
  --test-script "python build_tools/github_actions/test_executable_scripts/test_rocsolver.py" \
  --total-shards 3 \
  --fetch-artifact-args="--blas --tests"
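
Once the artifacts are available locally, it may be quicker to run only the hanging case under a short deadline instead of waiting for the 60-minute timeout. The sketch below is one way to do that and rests on assumptions: the path to the rocsolver gtest binary is passed as the first command-line argument (its location is not given in this report), and the 300-second deadline is an arbitrary choice well above the ~1.5–13 s the sister tests take. `--gtest_filter` is the standard GoogleTest flag; the helper names are illustrative.

```python
import subprocess
import sys

HANGING_TEST = "daily_lapack/GETRF_NPVT.batched__double/9"


def single_test_command(binary, test_name):
    """Build a command that runs exactly one gtest case."""
    return [binary, f"--gtest_filter={test_name}"]


def run_with_deadline(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising.
        return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: sys.argv[1] is the path to the rocsolver test binary.
    cmd = single_test_command(sys.argv[1], HANGING_TEST)
    code = run_with_deadline(cmd, timeout_s=300)  # hypothetical deadline
    print("timed out" if code is None else f"exit code {code}")
```

A `None` result on pin 5b20c93 together with a normal exit code on 79e85e1 would confirm that the hang follows the submodule bump rather than the runner.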

Impact

Blocker for promotion of rocm-systems submodule to TheRock: ROCm/TheRock#5136 (and any other bump from 79e85e1).
