HDDS-14977. Intermittent failure in TestDeadNodeHandler.testOnMessage#10556
Conversation
|
Thanks @chihsuan for the PR! We have a workflow in the Ozone GitHub Actions specifically for flaky tests. It would be great if you could share the results there. It makes it much easier for us to review and gives us more confidence in merging your changes. |
Thanks @chungen0126! I ran the flaky-test-check workflow on the branch for |
|
Thanks @chihsuan for the patch. I have tried |
|
Thanks for the review @adoroszlai I'll look into it. 🙏 |
|
Hi @adoroszlai After looking into it, it turned out there was a second, separate cause of the flakiness: In I fixed by draining SCM's event queue with Re-ran flaky-test-check with https://github.com/chihsuan/ozone/actions/runs/27890936708 Please take another look, thanks! |
|
Thanks @chihsuan for the patch, @chungen0126 for the review. |
What changes were proposed in this pull request?
Test-only change.
Problem.
TestDeadNodeHandler.testOnMessageis flaky because it manually drives the node handlers while a live SCM concurrently mutates the sharedNetworkTopology. Two background mutations race with the test:NodeStateManagerperiodic health check (default 3s) sees the forced-DEADnode still has a fresh heartbeat and resurrects it (DEAD -> HEALTHY_READONLY), re-adding it to the topology ->expected: <false> but was: <true>.DEADnode toIN_SERVICEfires aDEAD_NODEevent, and SCM's ownDeadNodeHandlerasynchronously removes the node, nulling its parent between theaddandgetParentinHealthyReadOnlyNodeHandler->NullPointerException: Parent == null.The production guards from HDDS-14834 are correct; the test just does not isolate itself from these background mutations.
Fix. Two test-only changes, both matching existing patterns in
TestSCMNodeManager:setup()so the periodic check never fires during the test.processAll) right after the operational-state change, so the asyncDeadNodeHandlerfinishes before the test drives the handlers.The
@Flakytag is removed now that the root causes are gone.What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14977
How was this patch tested?
expected: <false> but was: <true>signature 6/6; injecting a stall around the async removal gave theParent == nullNPE 3/3 at the exact lines reported in the Jira.flaky-test-checkworkflow for the whole class (test-name=ALL, 10 splits × 10 iterations = 100 runs): all green. https://github.com/chihsuan/ozone/actions/runs/27890936708 An earlier single-method run missed the second window; runningALLexercises the cross-thread interaction that opens it.TestDeadNodeHandlerclass passes locally over repeated runs;checkstyleclean.