[FLINK-38543] Change the overall UC restore process, JM and task initialization by 1996fanrui · Pull Request #27862 · apache/flink

1996fanrui · 2026-03-31T12:22:02Z

This PR depends on #27782, #27783 and #27861

What is the purpose of the change

[FLINK-38543] Change the overall UC restore process, JM and task initialization

Brief change log

[FLINK-38543][checkpoint] Fix Mailbox loop interrupted before recovery finished
[FLINK-38543][checkpoint] Introduce bufferFilteringCompleteFuture for earlier RUNNING state transition
[FLINK-38543][checkpoint] Change overall UC restore process for checkpoint during recovery

Verifying this change

Tons of unit tests

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no

flinkbot · 2026-03-31T12:29:35Z

CI report:

355c85d Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

…n if it is blocked to ensure the checkpoint barrier can be handled by downstream task

Priority events (e.g. unaligned checkpoint barriers) must notify downstream even when the subpartition is blocked.

During recovery, once the upstream output channel state is fully restored, a RECOVERY_COMPLETION event (EndOfOutputChannelStateEvent) is emitted. This event blocks the subpartition to prevent the upstream from sending new data while the downstream is still consuming recovered buffers. The subpartition remains blocked until the downstream finishes consuming all recovered buffers from every channel and calls resumeConsumption() to unblock.

If a checkpoint is triggered while the downstream is still consuming recovered buffers, the upstream receives an unaligned checkpoint barrier and adds it to this blocked subpartition. The barrier must still be delivered to the downstream immediately, otherwise the checkpoint will hang until it times out.
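The rule in the commit message above (priority events notify downstream even while the subpartition is blocked) can be illustrated with a toy model. This is only a sketch: the class `BlockedSubpartitionSketch` and all of its methods are hypothetical and do not mirror Flink's actual subpartition classes. Only the name `resumeConsumption()` is taken from the commit message.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of a blockable subpartition; hypothetical, not Flink's implementation. */
public class BlockedSubpartitionSketch {
    private final Deque<String> priorityEvents = new ArrayDeque<>();
    private final Deque<String> data = new ArrayDeque<>();
    private boolean blocked;
    private int notifications; // stands in for "notify the downstream reader"

    /** Models the effect of the recovery-completion event: ordinary data must wait. */
    public void block() {
        blocked = true;
    }

    /** Models the downstream unblocking the subpartition after consuming recovered buffers. */
    public void resumeConsumption() {
        blocked = false;
        if (!data.isEmpty()) {
            notifications++;
        }
    }

    /** Ordinary data only triggers a notification while the subpartition is not blocked. */
    public void addData(String buffer) {
        data.addLast(buffer);
        if (!blocked) {
            notifications++;
        }
    }

    /** Priority events (e.g. a checkpoint barrier) always notify, blocked or not. */
    public void addPriorityEvent(String event) {
        priorityEvents.addLast(event);
        notifications++;
    }

    /** Returns the next element the downstream may read, or null if nothing is readable. */
    public String poll() {
        if (!priorityEvents.isEmpty()) {
            return priorityEvents.pollFirst();
        }
        return blocked ? null : data.pollFirst();
    }

    public int notifications() {
        return notifications;
    }
}
```

In this model a barrier added after `block()` is still readable right away, while buffered data stays hidden until `resumeConsumption()` is called.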


pnowojski · 2026-04-03T13:21:56Z

flink-runtime/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java

+        // Return allOf result instead of thenRun result.
+        // thenRun returns a NEW future that completes after the callback finishes.
+        // Since suspend() runs on the async thread and just sends a poison mail,
+        // the mailbox loop can exit before suspend() returns, causing isDone() to be false.
+        CompletableFuture<Void> allRecoveredFuture =
+                CompletableFuture.allOf(recoveredFutures.toArray(new CompletableFuture[0]));
+        allRecoveredFuture.thenRun(mailboxProcessor::suspend);
+        return allRecoveredFuture;


Please add test coverage for this issue.

Also, isn't this worthy of a separate bug fix? Couldn't this cause some critical problems?

This race condition only manifests with the new checkpointingDuringRecovery logic because bufferFilteringCompleteFuture completes much earlier (when state is written) than stateConsumedFuture (when state is consumed), creating a wider race window — so no separate JIRA is needed. The race is inherently non-deterministic and cannot be reliably reproduced in a unit test; the existing checkState(allGatesRecoveredFuture.isDone()) in restoreInternal() serves as the runtime assertion.
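The `CompletableFuture` semantics behind this fix can be shown in isolation. The sketch below is not the StreamTask code and does not reproduce the Flink race. It uses latches (an assumption made purely for the demonstration) to hold the callback open, so the difference between the `allOf` future and the future returned by `thenRun` becomes observable. The class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

/** Shows that thenRun returns a new future that completes only after its callback returns. */
public class ThenRunVersusAllOf {

    /** Returns {allOfDone, thenRunDone}, sampled while the callback is still running. */
    public static boolean[] observeWhileCallbackRuns() {
        CompletableFuture<Void> first = new CompletableFuture<>();
        CompletableFuture<Void> second = new CompletableFuture<>();
        CountDownLatch callbackStarted = new CountDownLatch(1);
        CountDownLatch releaseCallback = new CountDownLatch(1);

        CompletableFuture<Void> allRecovered = CompletableFuture.allOf(first, second);
        // The callback stands in for a slow suspend(): it signals that it has started
        // and then waits until the observer has sampled both futures.
        CompletableFuture<Void> afterCallback =
                allRecovered.thenRun(
                        () -> {
                            callbackStarted.countDown();
                            try {
                                releaseCallback.await();
                            } catch (InterruptedException e) {
                                Thread.currentThread().interrupt();
                            }
                        });

        // Completing the last input runs the callback on this "async" thread.
        Thread completer =
                new Thread(
                        () -> {
                            first.complete(null);
                            second.complete(null);
                        });
        completer.start();

        try {
            callbackStarted.await();
            boolean[] observed = {allRecovered.isDone(), afterCallback.isDone()};
            releaseCallback.countDown();
            completer.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] observed = observeWhileCallbackRuns();
        System.out.println("allOf future done while callback runs: " + observed[0]);
        System.out.println("thenRun future done while callback runs: " + observed[1]);
    }
}
```

While the callback is in flight, the `allOf` future is already complete and the `thenRun` future is not, which is why returning the `allOf` result makes a later `isDone()` check independent of when the callback finishes.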

pnowojski · 2026-04-03T13:27:52Z

.../main/java/org/apache/flink/runtime/io/network/partition/consumer/RecoveredInputChannel.java

+        if (inputGate.isCheckpointingDuringRecoveryEnabled()) {
+            Preconditions.checkState(
+                    bufferFilteringCompleteFuture.isDone(), "buffer filtering is not complete");


I think it would be a better invariant to always checkState this, regardless of whether isCheckpointingDuringRecoveryEnabled is set, and bufferFilteringCompleteFuture should always be completed.

Otherwise we might end up in an inconsistent state where bufferFilteringCompleteFuture.isDone() == false while stateConsumedFuture.isDone() == true, which doesn't make sense.
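A minimal sketch of the invariant suggested here, assuming a simplified holder for the two futures. The class `RecoveryFuturesSketch` and its methods are hypothetical; the real RecoveredInputChannel uses Flink's Preconditions.checkState, for which a plain `IllegalStateException` stands in so that the example needs no Flink dependency.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical holder enforcing "filtering completes before consumption", unconditionally. */
public class RecoveryFuturesSketch {
    private final CompletableFuture<Void> bufferFilteringCompleteFuture =
            new CompletableFuture<>();
    private final CompletableFuture<Void> stateConsumedFuture = new CompletableFuture<>();

    /** Called once all recovered buffers have been filtered and written. */
    public void finishBufferFiltering() {
        bufferFilteringCompleteFuture.complete(null);
    }

    /** Called once all recovered state has been consumed. */
    public void finishStateConsumption() {
        // Checked regardless of any feature flag, so the two futures can never
        // end up with consumption done while filtering is still pending.
        if (!bufferFilteringCompleteFuture.isDone()) {
            throw new IllegalStateException("buffer filtering is not complete");
        }
        stateConsumedFuture.complete(null);
    }

    public boolean isBufferFilteringComplete() {
        return bufferFilteringCompleteFuture.isDone();
    }

    public boolean isStateConsumed() {
        return stateConsumedFuture.isDone();
    }
}
```

With the check made unconditional, the state `bufferFilteringCompleteFuture.isDone() == false` together with `stateConsumedFuture.isDone() == true` is unreachable in this model.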

pnowojski · 2026-04-03T13:29:34Z

...-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/InputGate.java

+     * future completes before {@link #getStateConsumedFuture()}, enabling earlier RUNNING state
+     * transition when unaligned checkpoint during recovery is enabled.
+     */
+    public abstract CompletableFuture<Void> getBufferFilteringCompleteFuture();


This commit lacks test coverage. You should add a single simple test, for example for UnionInputGate; that would indirectly test RecoveredInputChannel and SingleInputGate as well.

Added UnionInputGateTest.testBufferFilteringCompleteFutureAggregation
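The kind of aggregation such a test would exercise can be sketched as follows. This assumes, without confirmation from the PR, that a union gate combines the per-gate futures with `CompletableFuture.allOf`. The class `UnionFutureAggregationSketch` and the method `aggregate` are hypothetical and are not the code of UnionInputGate or of the added test.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical aggregation of per-gate futures into one union-level future. */
public class UnionFutureAggregationSketch {

    /** Completes only after every per-gate future has completed. */
    public static CompletableFuture<Void> aggregate(
            List<CompletableFuture<Void>> perGateFutures) {
        return CompletableFuture.allOf(perGateFutures.toArray(new CompletableFuture[0]));
    }
}
```

A test in this style would complete the per-gate futures one at a time and assert that the aggregated future is done only after the last one.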

…y finished

Return allOf result instead of thenRun result. thenRun returns a NEW future that completes after the callback finishes. Since suspend() runs on the async thread and just sends a poison mail, the mailbox loop can exit before suspend() returns, causing isDone() to be false.

This race is practically only triggered when checkpointingDuringRecovery is enabled, because bufferFilteringCompleteFuture completes much earlier (when state is written) than stateConsumedFuture (when state is consumed), creating a wider race window.


1996fanrui

Thanks @pnowojski for the review, all comments are addressed.

1996fanrui force-pushed the 38543/change-overall-uc-restore-process-for-checkpointint-during-recovery branch 2 times, most recently from fe6cb30 to f7b708b Compare March 31, 2026 17:29

1996fanrui added 2 commits March 31, 2026 20:35

[hotfix][network] Fix LocalInputChannel.getBuffersInUseCount to inclu…

7ced5fb

…de toBeConsumedBuffers

[FLINK-39018][checkpoint] Support LocalInputChannel checkpoint snapsh…

7c9a38d

…ot for recovered buffers

1996fanrui force-pushed the 38543/change-overall-uc-restore-process-for-checkpointint-during-recovery branch 2 times, most recently from 907bf3a to 7f75e13 Compare March 31, 2026 19:15

1996fanrui added 2 commits April 2, 2026 17:15

[FLINK-39018][network] Fix LocalInputChannel priority event and buffe…

16fbdbf

…r availability for recovered buffers

1996fanrui force-pushed the 38543/change-overall-uc-restore-process-for-checkpointint-during-recovery branch from 7f75e13 to 3dcb525 Compare April 2, 2026 20:53

[FLINK-39018][network] Buffer migration from RecoveredInputChannel to…

bb071b2

… physical channels

pnowojski reviewed Apr 3, 2026


1996fanrui force-pushed the 38543/change-overall-uc-restore-process-for-checkpointint-during-recovery branch from 3dcb525 to e9d3b49 Compare April 3, 2026 13:58

1996fanrui added 3 commits April 3, 2026 18:41

[FLINK-38543][checkpoint] Introduce bufferFilteringCompleteFuture for…

b8d4970

… earlier RUNNING state transition

[FLINK-38543][checkpoint] Change overall UC restore process for check…

355c85d

…point during recovery

1996fanrui force-pushed the 38543/change-overall-uc-restore-process-for-checkpointint-during-recovery branch from e9d3b49 to 355c85d Compare April 3, 2026 16:51

1996fanrui commented Apr 3, 2026


1996fanrui wants to merge 8 commits into apache:master from 1996fanrui:38543/change-overall-uc-restore-process-for-checkpointint-during-recovery

3 participants
