[Fix][Zeta] Converge notifyCompleted failure handling in CheckpointCoordinator by hyboll · Pull Request #10705 · apache/seatunnel

hyboll · 2026-04-03T08:24:29Z

Purpose of this pull request

Avoid an NPE in completePendingCheckpoint when notifyCompleted clears pendingCheckpoints.
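For readers unfamiliar with the failure mode, here is a minimal sketch of the null-guard idea. It assumes the NPE comes from dereferencing a pending-checkpoint entry after the failure path has already cleared the map. `NullGuardSketch`, `Pending` and `failNotify` are hypothetical stand-ins, not the real CheckpointCoordinator code; only the method names are borrowed from this PR.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the coordinator; bodies are illustrative only.
final class NullGuardSketch {
    /** Hypothetical pending-checkpoint stub that records whether its timeout was aborted. */
    static final class Pending {
        boolean timeoutAborted;

        void abortCheckpointTimeoutFutureWhenIsCompleted() {
            timeoutAborted = true;
        }
    }

    final Map<Long, Pending> pendingCheckpoints = new ConcurrentHashMap<>();
    boolean failNotify; // test switch: simulate a failure inside notifyCompleted

    void notifyCompleted(long checkpointId) {
        if (failNotify) {
            // Assumed failure path: cleanup drops every pending checkpoint.
            pendingCheckpoints.clear();
        }
    }

    /** Returns true if the timeout future was aborted, false if the entry was already gone. */
    boolean completePendingCheckpoint(long checkpointId) {
        notifyCompleted(checkpointId);
        Pending pending = pendingCheckpoints.remove(checkpointId);
        if (pending == null) {
            // The entry was cleared by the failure path, so there is nothing to abort.
            return false;
        }
        pending.abortCheckpointTimeoutFutureWhenIsCompleted();
        return true;
    }

    public static void main(String[] args) {
        NullGuardSketch coordinator = new NullGuardSketch();
        coordinator.pendingCheckpoints.put(1L, new Pending());
        coordinator.failNotify = true;
        // Prints "false" instead of throwing a NullPointerException.
        System.out.println(coordinator.completePendingCheckpoint(1L));
    }
}
```

The guard only removes the crash; it does not tell the caller that the coordinator has already failed.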

Does this PR introduce any user-facing change?

No

How was this patch tested?

A test case is added: CheckpointCoordinatorTest.testCompletePendingCheckpointShouldNotThrowNPEWhenNotifyCompletedClearsPendingMap

Check list

If any new Jar binary package is added in your PR, please add a License Notice according to the New License Guide
If necessary, please update the documentation to describe the new feature. https://github.com/apache/seatunnel/tree/dev/docs
If necessary, please update incompatible-changes.md to describe the incompatibility caused by this PR.
If you are contributing the connector code, please check that the following files are updated:
1. Update plugin-mapping.properties and add new connector information in it
2. Update the pom file of seatunnel-dist
3. Add ci label in label-scope-conf
4. Add e2e testcase in seatunnel-e2e
5. Update connector plugin_config

chl-wxp · 2026-04-03T15:56:59Z

+     * {@code abortCheckpointTimeoutFutureWhenIsCompleted()}, so no NPE is thrown.
+     */
+    @Test
+    @DisabledOnOs(OS.WINDOWS)


Why is it disabled on the Windows system?

On Windows, this test method throws a FileNotFoundException because HADOOP_HOME is not found. The bug fix itself only adds an if condition and does not depend on the environment, so I disabled the test on Windows. If necessary, I can add support for Windows.

DanielLeens

I think this patch fixes the NPE symptom, but it still leaves the failure path in an inconsistent state.

notifyCompleted() catches internal failures and calls handleCoordinatorError() (CheckpointCoordinator.java:372-393), which already:

updates the coordinator status to FAILED
clears pendingCheckpoints
resets pendingCounter to 0
shuts the coordinator down via cleanPendingCheckpoint() (CheckpointCoordinator.java:860-893)

After control returns to completePendingCheckpoint(), the method still continues with the normal completion flow (CheckpointCoordinator.java:1004-1019). That means:

pendingCounter.decrementAndGet() now runs after the cleanup path has already reset the counter
for final checkpoints, isCompleted() can still become true and overwrite the failure path with FINISHED / SUSPEND

So the null-check removes one crash, but the coordinator can still report a completed final checkpoint after notifyCompleted() has already failed.

I think we need an early exit once notifyCompleted() has triggered cleanup, for example by returning immediately when the coordinator is already failed/shutdown, or by making notifyCompleted() propagate a failure signal back to completePendingCheckpoint().
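The concern above can be reproduced in a small model. This is a sketch under stated assumptions, not the actual coordinator: `InconsistentStateSketch` and its `Status` enum are hypothetical, and the cleanup is reduced to setting the status and resetting the counter. It shows what happens when the caller keeps running the success path after a swallowed failure.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the pre-fix control flow; not the real CheckpointCoordinator.
final class InconsistentStateSketch {
    enum Status { RUNNING, FAILED, FINISHED }

    final AtomicInteger pendingCounter = new AtomicInteger(1);
    Status status = Status.RUNNING;

    // Assumed behaviour: the failure is handled internally and the caller is not told.
    void notifyCompleted(boolean fail) {
        if (fail) {
            status = Status.FAILED;  // coordinator marked as failed
            pendingCounter.set(0);   // cleanup resets the counter
        }
    }

    void completePendingCheckpoint(boolean notifyFails, boolean finalCheckpoint) {
        notifyCompleted(notifyFails);
        // No early exit: the normal completion flow still runs.
        pendingCounter.decrementAndGet();
        if (finalCheckpoint) {
            status = Status.FINISHED; // overwrites FAILED
        }
    }

    public static void main(String[] args) {
        InconsistentStateSketch coordinator = new InconsistentStateSketch();
        coordinator.completePendingCheckpoint(true, true);
        // Prints "-1 FINISHED": the counter went negative and the failure was masked.
        System.out.println(coordinator.pendingCounter.get() + " " + coordinator.status);
    }
}
```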

hyboll · 2026-04-05T13:13:38Z

Thanks for the suggestion. I've looked into the code again and I'm thinking of making notifyCompleted() return a boolean to handle the flow control. One quick question though: since there are three places calling this method, I plan to add the conditional checks to the other two call sites as well. Does that sound feasible to you?

DanielLeens · 2026-04-06T10:06:18Z

@hyboll Yes, that direction makes sense to me. The important part is to propagate the failure signal consistently to all three call sites of notifyCompleted(), not just completePendingCheckpoint().

Right now the same catch-and-cleanup path can be hit from:

completePendingCheckpoint() (CheckpointCoordinator.java:1003)
allTaskReady() (CheckpointCoordinator.java:361)
restoreCoordinator() (CheckpointCoordinator.java:493)

So if notifyCompleted() fails and handleCoordinatorError() has already switched the coordinator into the failed / shutdown path, each caller needs to stop its normal follow-up logic immediately.

In particular, allTaskReady() and restoreCoordinator() should not continue into scheduling / triggering the next checkpoint after a failed notify, and completePendingCheckpoint() still needs the early exit before pendingCounter.decrementAndGet() and the final FINISHED / SUSPEND transition.

So a boolean return value is fine, or an explicit exception / status check, as long as the contract is: once notifyCompleted() has triggered cleanup, the caller must bail out and not continue the success path.
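The contract described above could look roughly like the following. This is a minimal sketch, assuming the boolean-return variant; `NotifyContractSketch`, the `actions` list and the `failNotify` switch are hypothetical, and the follow-up work of each caller is reduced to a recorded string rather than the real scheduling logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the "bail out after failed notify" contract.
final class NotifyContractSketch {
    enum Status { RUNNING, FAILED, FINISHED }

    final AtomicInteger pendingCounter = new AtomicInteger(1);
    final List<String> actions = new ArrayList<>(); // records follow-up work that ran
    Status status = Status.RUNNING;
    boolean failNotify; // test switch: simulate a failure inside notifyCompleted

    /** Returns false once the failure path has already cleaned up. */
    boolean notifyCompleted() {
        if (failNotify) {
            status = Status.FAILED;
            pendingCounter.set(0);
            return false;
        }
        return true;
    }

    void allTaskReady() {
        if (!notifyCompleted()) {
            return; // do not schedule anything after a failed notify
        }
        actions.add("schedule-next-checkpoint");
    }

    void restoreCoordinator() {
        if (!notifyCompleted()) {
            return; // do not trigger a checkpoint after a failed notify
        }
        actions.add("trigger-checkpoint-after-restore");
    }

    void completePendingCheckpoint(boolean finalCheckpoint) {
        if (!notifyCompleted()) {
            return; // skip the decrement and the FINISHED transition
        }
        pendingCounter.decrementAndGet();
        if (finalCheckpoint) {
            status = Status.FINISHED;
        }
    }

    public static void main(String[] args) {
        NotifyContractSketch coordinator = new NotifyContractSketch();
        coordinator.failNotify = true;
        coordinator.allTaskReady();
        coordinator.restoreCoordinator();
        coordinator.completePendingCheckpoint(true);
        // Prints "FAILED 0 []": the failed state survives and no success-path work ran.
        System.out.println(coordinator.status + " " + coordinator.pendingCounter.get() + " " + coordinator.actions);
    }
}
```

A boolean keeps the change local to the call sites. An exception would make it harder for a caller to forget the check, at the cost of touching more of the surrounding error handling.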

DanielLeens

I re-checked the latest HEAD locally.

The blocker from my previous review looks addressed now:

notifyCompleted() returns a failure signal and allTaskReady() / restoreCoordinator() / completePendingCheckpoint() all bail out on that signal (CheckpointCoordinator.java:361-367, 497-500, 1009-1016)
the new regression tests now cover all three caller paths (CheckpointCoordinatorTest.java:427-575)

I do not see the earlier inconsistent-success-path issue in the current revision.
Thanks for following through on the full control-flow fix.

dybyte

LGTM. Thanks @hyboll

huyangbo added 2 commits April 2, 2026 15:41

[Fix][Zeta] fix NPE in completePendingCheckpoint when notifyCompleted clears pendingCheckpoints (apache#10655)

7abd77f

[Fix][Zeta] Handle FileNotFoundException in unit test (apache#10655)

be4dbe6

github-actions Bot added the Zeta label Apr 3, 2026

chl-wxp reviewed Apr 3, 2026

DanielLeens suggested changes Apr 4, 2026

apache deleted a comment from DanielLeens Apr 6, 2026

huyangbo added 2 commits April 7, 2026 10:53

[Fix][Zeta] fix NPE in completePendingCheckpoint when notifyCompleted clears pendingCheckpoints

4477b5b

[Improve][Docs] Format code style

861bfbd

DanielLeens approved these changes Apr 7, 2026

github-actions Bot added the reviewed label Apr 7, 2026

dybyte approved these changes Apr 9, 2026

github-actions Bot added the approved label Apr 9, 2026

DanielLeens mentioned this pull request Apr 10, 2026

[Bug] [Zeta] NPE in CheckpointCoordinator.completePendingCheckpoint() #10655

Closed

3 tasks

davidzollo removed the reviewed label Apr 15, 2026

corgy-w changed the title ~~[Fix][Zeta] Fix NPE in completePendingCheckpoint (#10655)~~ [Fix][Zeta] Converge notifyCompleted failure handling in CheckpointCoordinator Apr 19, 2026

corgy-w approved these changes Apr 19, 2026

corgy-w merged commit b70e84f into apache:dev Apr 19, 2026
3 checks passed

corgy-w merged 4 commits into apache:dev from hyboll:dev
