[MINOR][CI] Add hard guard that dumps stacks on stalled test forks #2492

Open
Baunsgaard wants to merge 1 commit into apache:main from Baunsgaard:ci/component-c-hard-guard

Conversation

@Baunsgaard

Contributor

The .component.c.** Java test job still intermittently runs until the 30-minute GitHub Actions cap with no further output. A surefire fork stalls in a way that surefire's own timeouts never catch (a fork wedged around the booter handshake, or a starved Maven parent), so neither forkedProcessTimeoutInSeconds nor forkedProcessExitTimeoutInSeconds fires and the job is cancelled with nothing to diagnose. The stall does not reproduce locally, so the only place to capture evidence is CI.

Add an outer guard in the docker test entrypoint that watches the test log for a stall (no new line for a window kept just above the 600s per-fork surefire timeout) and enforces an absolute runtime ceiling below the job cap. On either trigger it force-dumps thread stacks from every JVM in the test process tree via SIGQUIT (relayed into the job log) plus a jstack file backup, then force-kills the tree so the job fails fast with stacks instead of being cancelled empty-handed. Limits are overridable via SYSDS_TEST_STALL_LIMIT and SYSDS_TEST_MAX_RUNTIME.
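For readers who want a feel for the shape of such a guard, here is a minimal sketch, not the code in this PR. Only SYSDS_TEST_STALL_LIMIT and SYSDS_TEST_MAX_RUNTIME come from the description. The function names, the default values (660s and 1500s), the `jstack-<pid>.txt` file naming, and the 10-second poll interval are all assumptions. It also assumes a Linux container with `/proc`, procps `pgrep`, coreutils `timeout`, and a JDK `jstack` on the PATH.

```bash
#!/usr/bin/env bash
# Illustrative sketch only. Everything except the two SYSDS_* variable
# names is hypothetical, including the default values below.
STALL_LIMIT="${SYSDS_TEST_STALL_LIMIT:-660}"   # assumed: just above the 600s fork timeout
MAX_RUNTIME="${SYSDS_TEST_MAX_RUNTIME:-1500}"  # assumed: below the 30-minute job cap

# Print "stall", "ceiling", or "ok" given epoch seconds for: now,
# guard start, and the last time the log gained a line.
guard_verdict() {
  if [ $(( $1 - $3 )) -ge "$STALL_LIMIT" ]; then
    echo stall
  elif [ $(( $1 - $2 )) -ge "$MAX_RUNTIME" ]; then
    echo ceiling
  else
    echo ok
  fi
}

# Print the PID of every descendant of $1, depth first.
descendants() {
  local child
  for child in $(pgrep -P "$1"); do
    echo "$child"
    descendants "$child"
  done
}

# True if PID $1 is a JVM (Linux only: reads the process name from /proc).
is_jvm() {
  [ "$(cat "/proc/$1/comm" 2>/dev/null)" = "java" ]
}

# Request a thread dump from each JVM in the tree, then kill the tree.
dump_and_kill() {
  local root=$1 pids pid
  pids="$root $(descendants "$root")"
  for pid in $pids; do
    is_jvm "$pid" || continue
    kill -s QUIT "$pid" 2>/dev/null   # HotSpot writes the dump to the JVM's own stdout
    timeout 10 jstack -l "$pid" > "jstack-$pid.txt" 2>&1   # file backup
  done
  sleep 5                             # let the JVMs finish printing
  for pid in $pids; do kill -s KILL "$pid" 2>/dev/null; done
  return 0
}

# Poll the log while the root process lives; fire on stall or ceiling.
guard_loop() {
  local log=$1 root=$2 start now lines last_lines=-1 last_growth verdict
  start=$(date +%s); last_growth=$start
  while kill -0 "$root" 2>/dev/null; do
    now=$(date +%s)
    if [ -f "$log" ]; then lines=$(wc -l < "$log"); else lines=0; fi
    if [ "$lines" -ne "$last_lines" ]; then last_lines=$lines; last_growth=$now; fi
    verdict=$(guard_verdict "$now" "$start" "$last_growth")
    if [ "$verdict" != ok ]; then
      echo "guard: $verdict detected, dumping stacks" >&2
      dump_and_kill "$root"
      return 1
    fi
    sleep 10
  done
}

# Usage (hypothetical):
#   mvn test > test.log 2>&1 &
#   guard_loop test.log $!
```

The sketch filters on the process name before sending SIGQUIT because the default action of that signal terminates a non-JVM process rather than producing a dump. Wrapping `jstack` in `timeout` keeps the backup step from hanging on a JVM that no longer answers attach requests.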

Also set surefire runOrder to alphabetical so the hang reproduces at a stable class boundary across runs, making the responsible class identifiable from the captured dumps.
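To try the same ordering locally without editing the build, one option is surefire's `surefire.runOrder` user property. This is a sketch under the assumption that the pom does not already set `runOrder` explicitly, since an explicit plugin configuration would take precedence over the property. The helper name is hypothetical.

```bash
# Hypothetical helper: run the tests with surefire's alphabetical run order.
# Extra arguments are passed through to Maven unchanged.
run_tests_alphabetical() {
  mvn -B test -Dsurefire.runOrder=alphabetical "$@"
}
```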

Some Java test forks intermittently stall in a way that surefire's own
timeouts never catch, so the job runs until the GitHub Actions cap and is
cancelled with no output to diagnose, and the stall does not reproduce
locally.

Add an outer guard in the docker test entrypoint that watches the test log
for a stall (no new line for a window kept just above the per-fork surefire
timeout) and enforces an absolute runtime ceiling below the job cap. On
either trigger it dumps thread stacks from every JVM in the test process
tree via SIGQUIT (relayed into the job log) plus a jstack file backup, then
force-kills the tree so the job fails fast with stacks instead of being
cancelled empty-handed. Limits are overridable via SYSDS_TEST_STALL_LIMIT
and SYSDS_TEST_MAX_RUNTIME.

Also set surefire runOrder to alphabetical so a hang reproduces at a stable
class boundary, making the responsible class identifiable from the dumps.