HDDS-14977. Intermittent failure in TestDeadNodeHandler.testOnMessage by chihsuan · Pull Request #10556 · apache/ozone

chihsuan · 2026-06-20T02:49:26Z

What changes were proposed in this pull request?

Test-only change.

Problem. TestDeadNodeHandler.testOnMessage is flaky because it manually drives the node handlers while a live SCM concurrently mutates the shared NetworkTopology. Two background mutations race with the test:

The NodeStateManager periodic health check (default 3s) sees the forced-DEAD node still has a fresh heartbeat and resurrects it (DEAD -> HEALTHY_READONLY), re-adding it to the topology -> expected: <false> but was: <true>.
Changing the DEAD node to IN_SERVICE fires a DEAD_NODE event, and SCM's own DeadNodeHandler asynchronously removes the node, nulling its parent between the add and getParent in HealthyReadOnlyNodeHandler -> NullPointerException: Parent == null.

The production guards from HDDS-14834 are correct; the test just does not isolate itself from these background mutations.

Fix. Two test-only changes, both matching existing patterns in TestSCMNodeManager:

Set the heartbeat process interval high in setup() so the periodic check never fires during the test.
Drain SCM's event queue (processAll) right after the operational-state change, so the async DeadNodeHandler finishes before the test drives the handlers.

The @Flaky tag is removed now that the root causes are gone.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14977

How was this patch tested?

Reproduced each failure deterministically by widening its race window on the unfixed test: shrinking the heartbeat interval to 1ms gave the expected: <false> but was: <true> signature 6/6; injecting a stall around the async removal gave the Parent == null NPE 3/3 at the exact lines reported in the Jira.
Validated each fix by re-running with the same injected stall still in place: the fixed test passes (the periodic check no longer resurrects, and the event queue is drained before the handlers run).
Ran the flaky-test-check workflow for the whole class (test-name=ALL, 10 splits × 10 iterations = 100 runs): all green. https://github.com/chihsuan/ozone/actions/runs/27890936708 An earlier single-method run missed the second window; running ALL exercises the cross-thread interaction that opens it.
Full TestDeadNodeHandler class passes locally over repeated runs; checkstyle clean.

chungen0126 · 2026-06-20T03:46:29Z

Thanks @chihsuan for the PR! We have a workflow in the Ozone GitHub Actions specifically for flaky tests. It would be great if you could share the results there. It makes it much easier for us to review and gives us more confidence in merging your changes.

chihsuan · 2026-06-20T04:29:41Z

Thanks @chihsuan for the PR! We have a workflow in the Ozone GitHub Actions specifically for flaky tests. It would be great if you could share the results there. It makes it much easier for us to review and gives us more confidence in merging your changes.

Thanks @chungen0126! I ran the flaky-test-check workflow on the branch for TestDeadNodeHandler#testOnMessage (10 splits × 10 iterations = 100 runs) and all of them passed:

https://github.com/chihsuan/ozone/actions/runs/27859737780

chungen0126

+1 LGTM

adoroszlai · 2026-06-20T20:25:32Z

Thanks @chihsuan for the patch.

I have tried TestDeadNodeHandler#ALL on the PR branch, got 4/100 failures, all due to:

NullPointerException: Parent == null
	at java.base/java.util.Objects.requireNonNull(Objects.java:259)
	at org.apache.hadoop.hdds.scm.node.HealthyReadOnlyNodeHandler.onMessage(HealthyReadOnlyNodeHandler.java:109)
	at org.apache.hadoop.hdds.scm.node.TestDeadNodeHandler.testOnMessage(TestDeadNodeHandler.java:299)

chihsuan · 2026-06-21T01:36:25Z

Thanks for the review @adoroszlai I'll look into it. 🙏

chihsuan · 2026-06-21T02:47:40Z

Hi @adoroszlai After looking into it, it turned out there was a second, separate cause of the flakiness:

In testOnMessage, setNodeOperationalState(IN_SERVICE) on a DEAD node fires a DEAD_NODE event, and SCM's own DeadNodeHandler then removes the node from the topology asynchronously, racing with the handlers under test (the Parent == null NPE). A single-method run never opened this window.

I fixed by draining SCM's event queue with processAll after the state change, so the async handler completes first, same idiom as TestSCMNodeManager.

Re-ran flaky-test-check with test-name=ALL (100 runs): all green.

https://github.com/chihsuan/ozone/actions/runs/27890936708

Please take another look, thanks!

adoroszlai · 2026-06-21T06:34:53Z

Thanks @chihsuan for the patch, @chungen0126 for the review.

HDDS-14977. Intermittent failure in TestDeadNodeHandler.testOnMessage

ab11c16

chihsuan marked this pull request as ready for review June 20, 2026 03:01

chungen0126 approved these changes Jun 20, 2026

View reviewed changes

chungen0126 requested a review from adoroszlai June 20, 2026 04:36

adoroszlai added the test label Jun 20, 2026

adoroszlai requested a review from ivandika3 June 20, 2026 12:34

HDDS-14977. Drain SCM event queue to avoid async DeadNodeHandler race

fce88e6

adoroszlai merged commit 8b93ea4 into apache:master Jun 21, 2026
32 of 33 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDDS-14977. Intermittent failure in TestDeadNodeHandler.testOnMessage#10556

HDDS-14977. Intermittent failure in TestDeadNodeHandler.testOnMessage#10556
adoroszlai merged 2 commits into
apache:masterfrom
chihsuan:HDDS-14977

chihsuan commented Jun 20, 2026 •

edited

Loading

Uh oh!

chungen0126 commented Jun 20, 2026

Uh oh!

chihsuan commented Jun 20, 2026

Uh oh!

chungen0126 left a comment

Uh oh!

adoroszlai commented Jun 20, 2026

Uh oh!

chihsuan commented Jun 21, 2026

Uh oh!

chihsuan commented Jun 21, 2026

Uh oh!

Uh oh!

adoroszlai commented Jun 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

chihsuan commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

chungen0126 commented Jun 20, 2026

Uh oh!

chihsuan commented Jun 20, 2026

Uh oh!

chungen0126 left a comment

Choose a reason for hiding this comment

Uh oh!

adoroszlai commented Jun 20, 2026

Uh oh!

chihsuan commented Jun 21, 2026

Uh oh!

chihsuan commented Jun 21, 2026

Uh oh!

Uh oh!

adoroszlai commented Jun 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

chihsuan commented Jun 20, 2026 •

edited

Loading