Skip to content

[Fix][Zeta] Converge notifyCompleted failure handling in CheckpointCoordinator#10705

Merged
corgy-w merged 4 commits into
apache:devfrom
hyboll:dev
Apr 19, 2026
Merged

[Fix][Zeta] Converge notifyCompleted failure handling in CheckpointCoordinator#10705
corgy-w merged 4 commits into
apache:devfrom
hyboll:dev

Conversation

@hyboll
Copy link
Copy Markdown
Contributor

@hyboll hyboll commented Apr 3, 2026

Fixes: #10655

Purpose of this pull request

Fix avoid NPE in completePendingCheckpoint when notifyCompleted clears pendingCheckpoints

Does this PR introduce any user-facing change?

No

How was this patch tested?

Test case is added in the CheckpointCoordinatorTest.testCompletePendingCheckpointShouldNotThrowNPEWhenNotifyCompletedClearsPendingMap

Check list

@github-actions github-actions Bot added the Zeta label Apr 3, 2026
* {@code abortCheckpointTimeoutFutureWhenIsCompleted()}, so no NPE is thrown.
*/
@Test
@DisabledOnOs(OS.WINDOWS)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it disabled on the Windows system?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under the Windows system, this test method will throw a FileNotFoundException because HADOOP_HOME is not found. This bug fix only added an if condition and has nothing to do with the environment, so I disabled Windows. If necessary, I can add support for the Windows system.

Copy link
Copy Markdown
Contributor

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this patch fixes the NPE symptom, but it still leaves the failure path in an inconsistent state.

notifyCompleted() catches internal failures and calls handleCoordinatorError() (CheckpointCoordinator.java:372-393), which already:

  • updates the coordinator status to FAILED
  • clears pendingCheckpoints
  • resets pendingCounter to 0
  • shuts the coordinator down via cleanPendingCheckpoint() (CheckpointCoordinator.java:860-893)

After control returns to completePendingCheckpoint(), the method still continues with the normal completion flow (CheckpointCoordinator.java:1004-1019). That means:

  • pendingCounter.decrementAndGet() now runs after the cleanup path has already reset the counter
  • for final checkpoints, isCompleted() can still become true and overwrite the failure path with FINISHED / SUSPEND

So the null-check removes one crash, but the coordinator can still report a completed final checkpoint after notifyCompleted() has already failed.

I think we need an early exit once notifyCompleted() has triggered cleanup, for example by returning immediately when the coordinator is already failed/shutdown, or by making notifyCompleted() propagate a failure signal back to completePendingCheckpoint().

@hyboll
Copy link
Copy Markdown
Contributor Author

hyboll commented Apr 5, 2026

I think this patch fixes the NPE symptom, but it still leaves the failure path in an inconsistent state.

notifyCompleted() catches internal failures and calls handleCoordinatorError() (CheckpointCoordinator.java:372-393), which already:

  • updates the coordinator status to FAILED
  • clears pendingCheckpoints
  • resets pendingCounter to 0
  • shuts the coordinator down via cleanPendingCheckpoint() (CheckpointCoordinator.java:860-893)

After control returns to completePendingCheckpoint(), the method still continues with the normal completion flow (CheckpointCoordinator.java:1004-1019). That means:

  • pendingCounter.decrementAndGet() now runs after the cleanup path has already reset the counter
  • for final checkpoints, isCompleted() can still become true and overwrite the failure path with FINISHED / SUSPEND

So the null-check removes one crash, but the coordinator can still report a completed final checkpoint after notifyCompleted() has already failed.

I think we need an early exit once notifyCompleted() has triggered cleanup, for example by returning immediately when the coordinator is already failed/shutdown, or by making notifyCompleted() propagate a failure signal back to completePendingCheckpoint().

Thanks for the suggestion. I've looked into the code again and I'm thinking of making notifyCompleted() return a boolean to handle the flow control. One quick question though: since there are three places calling this method, I plan to add the conditional checks to the other two call sites as well. Does that sound feasible to you?

@apache apache deleted a comment from DanielLeens Apr 6, 2026
@apache apache deleted a comment from DanielLeens Apr 6, 2026
@apache apache deleted a comment from DanielLeens Apr 6, 2026
@apache apache deleted a comment from DanielLeens Apr 6, 2026
@apache apache deleted a comment from DanielLeens Apr 6, 2026
@apache apache deleted a comment from DanielLeens Apr 6, 2026
@DanielLeens
Copy link
Copy Markdown
Contributor

@hyboll Yes, that direction makes sense to me. The important part is to propagate the failure signal consistently to all three call sites of notifyCompleted(), not just completePendingCheckpoint().

Right now the same catch-and-cleanup path can be hit from:

  • completePendingCheckpoint() (CheckpointCoordinator.java:1003)
  • allTaskReady() (CheckpointCoordinator.java:361)
  • restoreCoordinator() (CheckpointCoordinator.java:493)

So if notifyCompleted() fails and handleCoordinatorError() has already switched the coordinator into the failed / shutdown path, each caller needs to stop its normal follow-up logic immediately.

In particular, allTaskReady() and restoreCoordinator() should not continue into scheduling / triggering the next checkpoint after a failed notify, and completePendingCheckpoint() still needs the early exit before pendingCounter.decrementAndGet() and the final FINISHED / SUSPEND transition.

So a boolean return value is fine, or an explicit exception / status check, as long as the contract is: once notifyCompleted() has triggered cleanup, the caller must bail out and not continue the success path.

Copy link
Copy Markdown
Contributor

@DanielLeens DanielLeens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-checked the latest HEAD locally.

The blocker from my previous review looks addressed now:

  • notifyCompleted() returns a failure signal and allTaskReady() / restoreCoordinator() / completePendingCheckpoint() all bail out on that signal (CheckpointCoordinator.java:361-367, 497-500, 1009-1016)
  • the new regression tests now cover all three caller paths (CheckpointCoordinatorTest.java:427-575)

I do not see the earlier inconsistent-success-path issue in the current revision.
Thanks for following through on the full control-flow fix.

Copy link
Copy Markdown
Contributor

@dybyte dybyte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @hyboll

@corgy-w corgy-w changed the title [Fix][Zeta] Fix NPE in completePendingCheckpoint (#10655) [Fix][Zeta] Converge notifyCompleted failure handling in CheckpointCoordinator Apr 19, 2026
@corgy-w corgy-w merged commit b70e84f into apache:dev Apr 19, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] [Zeta] NPE in CheckpointCoordinator.completePendingCheckpoint()

6 participants