[Fix][Zeta] Converge notifyCompleted failure handling in CheckpointCoordinator#10705
Conversation
| * {@code abortCheckpointTimeoutFutureWhenIsCompleted()}, so no NPE is thrown. | ||
| */ | ||
| @Test | ||
| @DisabledOnOs(OS.WINDOWS) |
There was a problem hiding this comment.
Why is it disabled on the Windows system?
There was a problem hiding this comment.
Under the Windows system, this test method will throw a FileNotFoundException because HADOOP_HOME is not found. This bug fix only added an if condition and has nothing to do with the environment, so I disabled Windows. If necessary, I can add support for the Windows system.
DanielLeens
left a comment
There was a problem hiding this comment.
I think this patch fixes the NPE symptom, but it still leaves the failure path in an inconsistent state.
notifyCompleted() catches internal failures and calls handleCoordinatorError() (CheckpointCoordinator.java:372-393), which already:
- updates the coordinator status to
FAILED - clears
pendingCheckpoints - resets
pendingCounterto0 - shuts the coordinator down via
cleanPendingCheckpoint()(CheckpointCoordinator.java:860-893)
After control returns to completePendingCheckpoint(), the method still continues with the normal completion flow (CheckpointCoordinator.java:1004-1019). That means:
pendingCounter.decrementAndGet()now runs after the cleanup path has already reset the counter- for final checkpoints,
isCompleted()can still become true and overwrite the failure path withFINISHED/SUSPEND
So the null-check removes one crash, but the coordinator can still report a completed final checkpoint after notifyCompleted() has already failed.
I think we need an early exit once notifyCompleted() has triggered cleanup, for example by returning immediately when the coordinator is already failed/shutdown, or by making notifyCompleted() propagate a failure signal back to completePendingCheckpoint().
Thanks for the suggestion. I've looked into the code again and I'm thinking of making notifyCompleted() return a boolean to handle the flow control. One quick question though: since there are three places calling this method, I plan to add the conditional checks to the other two call sites as well. Does that sound feasible to you? |
|
@hyboll Yes, that direction makes sense to me. The important part is to propagate the failure signal consistently to all three call sites of Right now the same catch-and-cleanup path can be hit from:
So if In particular, So a boolean return value is fine, or an explicit exception / status check, as long as the contract is: once |
… clears pendingCheckpoints
DanielLeens
left a comment
There was a problem hiding this comment.
I re-checked the latest HEAD locally.
The blocker from my previous review looks addressed now:
notifyCompleted()returns a failure signal andallTaskReady()/restoreCoordinator()/completePendingCheckpoint()all bail out on that signal (CheckpointCoordinator.java:361-367,497-500,1009-1016)- the new regression tests now cover all three caller paths (
CheckpointCoordinatorTest.java:427-575)
I do not see the earlier inconsistent-success-path issue in the current revision.
Thanks for following through on the full control-flow fix.
Fixes: #10655
Purpose of this pull request
Fix avoid NPE in completePendingCheckpoint when notifyCompleted clears pendingCheckpoints
Does this PR introduce any user-facing change?
No
How was this patch tested?
Test case is added in the CheckpointCoordinatorTest.testCompletePendingCheckpointShouldNotThrowNPEWhenNotifyCompletedClearsPendingMap
Check list
New License Guide
incompatible-changes.mdto describe the incompatibility caused by this PR.