Skip to content

HDDS-14977. Intermittent failure in TestDeadNodeHandler.testOnMessage#10556

Merged
adoroszlai merged 2 commits into
apache:masterfrom
chihsuan:HDDS-14977
Jun 21, 2026
Merged

HDDS-14977. Intermittent failure in TestDeadNodeHandler.testOnMessage#10556
adoroszlai merged 2 commits into
apache:masterfrom
chihsuan:HDDS-14977

Conversation

@chihsuan

@chihsuan chihsuan commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Test-only change.

Problem. TestDeadNodeHandler.testOnMessage is flaky because it manually drives the node handlers while a live SCM concurrently mutates the shared NetworkTopology. Two background mutations race with the test:

  1. The NodeStateManager periodic health check (default 3s) sees the forced-DEAD node still has a fresh heartbeat and resurrects it (DEAD -> HEALTHY_READONLY), re-adding it to the topology -> expected: <false> but was: <true>.
  2. Changing the DEAD node to IN_SERVICE fires a DEAD_NODE event, and SCM's own DeadNodeHandler asynchronously removes the node, nulling its parent between the add and getParent in HealthyReadOnlyNodeHandler -> NullPointerException: Parent == null.

The production guards from HDDS-14834 are correct; the test just does not isolate itself from these background mutations.

Fix. Two test-only changes, both matching existing patterns in TestSCMNodeManager:

  1. Set the heartbeat process interval high in setup() so the periodic check never fires during the test.
  2. Drain SCM's event queue (processAll) right after the operational-state change, so the async DeadNodeHandler finishes before the test drives the handlers.

The @Flaky tag is removed now that the root causes are gone.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14977

How was this patch tested?

  • Reproduced each failure deterministically by widening its race window on the unfixed test: shrinking the heartbeat interval to 1ms gave the expected: <false> but was: <true> signature 6/6; injecting a stall around the async removal gave the Parent == null NPE 3/3 at the exact lines reported in the Jira.
  • Validated each fix by re-running with the same injected stall still in place: the fixed test passes (the periodic check no longer resurrects, and the event queue is drained before the handlers run).
  • Ran the flaky-test-check workflow for the whole class (test-name=ALL, 10 splits × 10 iterations = 100 runs): all green. https://github.com/chihsuan/ozone/actions/runs/27890936708 An earlier single-method run missed the second window; running ALL exercises the cross-thread interaction that opens it.
  • Full TestDeadNodeHandler class passes locally over repeated runs; checkstyle clean.

@chihsuan chihsuan marked this pull request as ready for review June 20, 2026 03:01
@chungen0126

Copy link
Copy Markdown
Contributor

Thanks @chihsuan for the PR! We have a workflow in the Ozone GitHub Actions specifically for flaky tests. It would be great if you could share the results there. It makes it much easier for us to review and gives us more confidence in merging your changes.

@chihsuan

Copy link
Copy Markdown
Contributor Author

Thanks @chihsuan for the PR! We have a workflow in the Ozone GitHub Actions specifically for flaky tests. It would be great if you could share the results there. It makes it much easier for us to review and gives us more confidence in merging your changes.

Thanks @chungen0126! I ran the flaky-test-check workflow on the branch for TestDeadNodeHandler#testOnMessage (10 splits × 10 iterations = 100 runs) and all of them passed:

https://github.com/chihsuan/ozone/actions/runs/27859737780

@chungen0126 chungen0126 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 LGTM

@chungen0126 chungen0126 requested a review from adoroszlai June 20, 2026 04:36
@adoroszlai adoroszlai requested a review from ivandika3 June 20, 2026 12:34
@adoroszlai

Copy link
Copy Markdown
Contributor

Thanks @chihsuan for the patch.

I have tried TestDeadNodeHandler#ALL on the PR branch, got 4/100 failures, all due to:

NullPointerException: Parent == null
	at java.base/java.util.Objects.requireNonNull(Objects.java:259)
	at org.apache.hadoop.hdds.scm.node.HealthyReadOnlyNodeHandler.onMessage(HealthyReadOnlyNodeHandler.java:109)
	at org.apache.hadoop.hdds.scm.node.TestDeadNodeHandler.testOnMessage(TestDeadNodeHandler.java:299)

@chihsuan

Copy link
Copy Markdown
Contributor Author

Thanks for the review @adoroszlai I'll look into it. 🙏

@chihsuan

Copy link
Copy Markdown
Contributor Author

Hi @adoroszlai After looking into it, it turned out there was a second, separate cause of the flakiness:

In testOnMessage, setNodeOperationalState(IN_SERVICE) on a DEAD node fires a DEAD_NODE event, and SCM's own DeadNodeHandler then removes the node from the topology asynchronously, racing with the handlers under test (the Parent == null NPE). A single-method run never opened this window.

I fixed by draining SCM's event queue with processAll after the state change, so the async handler completes first, same idiom as TestSCMNodeManager.

Re-ran flaky-test-check with test-name=ALL (100 runs): all green.

https://github.com/chihsuan/ozone/actions/runs/27890936708

Please take another look, thanks!

@adoroszlai adoroszlai merged commit 8b93ea4 into apache:master Jun 21, 2026
32 of 33 checks passed
@adoroszlai

Copy link
Copy Markdown
Contributor

Thanks @chihsuan for the patch, @chungen0126 for the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants