[MINOR][CI] Add hard guard that dumps stacks on stalled test forks #2492
Open
Baunsgaard wants to merge 1 commit into
Some Java test forks intermittently stall in a way that surefire's own timeouts never catch, so the job runs until the GitHub Actions cap and is cancelled with no output to diagnose. The stall does not reproduce locally. Add an outer guard in the docker test entrypoint that watches the test log for a stall (no new line within a window kept just above the per-fork surefire timeout) and enforces an absolute runtime ceiling below the job cap. On either trigger it dumps thread stacks from every JVM in the test process tree via SIGQUIT (relayed into the job log), writes a jstack file as backup, then force-kills the tree so the job fails fast with stacks instead of being cancelled empty-handed. Limits are overridable via SYSDS_TEST_STALL_LIMIT and SYSDS_TEST_MAX_RUNTIME. Also set surefire runOrder to alphabetical so a hang reproduces at a stable class boundary, making the responsible class identifiable from the dumps.
The .component.c.** Java test job still intermittently runs until the 30-minute GitHub Actions cap with no further output. A surefire fork stalls in a way that surefire's own timeouts never catch (a fork wedged around the booter handshake, or a starved Maven parent), so neither forkedProcessTimeoutInSeconds nor forkedProcessExitTimeoutInSeconds fires and the job is cancelled with nothing to diagnose. The stall does not reproduce locally, so the only place to capture evidence is CI.
Add an outer guard in the docker test entrypoint that watches the test log for a stall (no new line within a window kept just above the 600s per-fork surefire timeout) and enforces an absolute runtime ceiling below the job cap. On either trigger it dumps thread stacks from every JVM in the test process tree via SIGQUIT (relayed into the job log), writes a jstack file as backup, then force-kills the tree so the job fails fast with stacks instead of being cancelled empty-handed. Limits are overridable via SYSDS_TEST_STALL_LIMIT and SYSDS_TEST_MAX_RUNTIME.
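For reviewers who want the shape of the guard without opening the diff, here is a minimal sketch of the idea in bash. It is not the code in this PR. The function names (`process_tree`, `log_lines`, `dump_stacks`, `guard`), the default limits of 660s and 1680s, the 2s settle delay, and the `jstack-<pid>.txt` file name are all assumptions for illustration. It also assumes `pgrep` and `ps -o comm=` are available in the image and that JVM processes report the command name `java`.

```shell
#!/usr/bin/env bash
# Sketch of an outer stall guard; NOT the actual entrypoint code from this PR.
# The defaults are assumptions: the description only says "just above 600s"
# and "below the 30-minute job cap".
STALL_LIMIT="${SYSDS_TEST_STALL_LIMIT:-660}"
MAX_RUNTIME="${SYSDS_TEST_MAX_RUNTIME:-1680}"

# Print a PID followed by all of its descendants (assumes pgrep exists).
process_tree() {
  local child
  echo "$1"
  for child in $(pgrep -P "$1" 2>/dev/null); do
    process_tree "$child"
  done
}

# Number of lines currently in the log (0 if it does not exist yet).
log_lines() {
  { wc -l < "$1"; } 2>/dev/null || echo 0
}

# Send SIGQUIT to every JVM among the given PIDs; HotSpot answers by printing
# a thread dump on the JVM's stdout. jstack, if installed, writes a file backup.
dump_stacks() {
  local pid dumped=0
  for pid in "$@"; do
    [ "$(ps -o comm= -p "$pid" 2>/dev/null)" = "java" ] || continue
    kill -QUIT "$pid" 2>/dev/null || true
    if command -v jstack >/dev/null 2>&1; then
      jstack "$pid" > "jstack-$pid.txt" 2>&1 || true
    fi
    dumped=1
  done
  # Give the JVMs a moment to finish writing before they are killed.
  [ "$dumped" -eq 0 ] || sleep 2
}

# guard LOG ROOT_PID [POLL_SECONDS]
# Returns 0 if ROOT_PID exits on its own, 1 if the guard had to kill the tree.
guard() {
  local log=$1 root=$2 poll=${3:-5}
  local start now last_change lines last_lines reason="" pids
  start=$(date +%s); last_change=$start
  last_lines=$(log_lines "$log")
  while kill -0 "$root" 2>/dev/null; do
    sleep "$poll"
    now=$(date +%s)
    lines=$(log_lines "$log")
    if [ "$lines" != "$last_lines" ]; then
      last_lines=$lines; last_change=$now   # progress seen, reset stall clock
    fi
    if [ $(( now - last_change )) -ge "$STALL_LIMIT" ]; then
      reason="no new log line for ${STALL_LIMIT}s"
    elif [ $(( now - start )) -ge "$MAX_RUNTIME" ]; then
      reason="runtime exceeded ${MAX_RUNTIME}s"
    fi
    [ -z "$reason" ] || break
  done
  [ -n "$reason" ] || return 0
  echo "GUARD: $reason; dumping stacks and killing process tree" >&2
  pids=$(process_tree "$root")   # collect before killing, children get re-parented
  dump_stacks $pids
  kill -KILL $pids 2>/dev/null || true
  return 1
}
```

In an entrypoint this could be wired up roughly as `mvn ... > test.log 2>&1 & guard test.log $! || exit 1`. Counting log lines rather than reading the file's modification time keeps the sketch independent of GNU-specific `stat` flags.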
Also set surefire runOrder to alphabetical so the hang reproduces at a stable class boundary across runs, making the responsible class identifiable from the captured dumps.
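Surefire's `runOrder` parameter is also exposed as the user property `surefire.runOrder`, so the same ordering can be tried from the command line without touching the build file. The wrapper below is a hypothetical sketch for experimenting with the setting, assuming `mvn` is on the PATH; the function name is made up and it is not part of this PR.

```shell
# Hypothetical wrapper: run the tests with a deterministic class order.
# surefire.runOrder is the user property behind surefire's runOrder parameter;
# extra arguments (for example -Dtest=...) are passed through unchanged.
run_tests_alphabetical() {
  mvn -B test -Dsurefire.runOrder=alphabetical "$@"
}
```

With a fixed order, two stalled runs should stop after the same preceding test class, which narrows the search to the class that was starting when the log went quiet.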