Skip to content

[FLINK-39439] Fix savepoint history entry removed from status before dispose succeeds#1089

Merged
gaborgsomogyi merged 1 commit into
apache:mainfrom
gaborgsomogyi:FLINK-39439
Apr 15, 2026
Merged

[FLINK-39439] Fix savepoint history entry removed from status before dispose succeeds#1089
gaborgsomogyi merged 1 commit into
apache:mainfrom
gaborgsomogyi:FLINK-39439

Conversation

@gaborgsomogyi
Copy link
Copy Markdown
Contributor

What is the purpose of the change

In legacy mode, cleanupSavepointHistoryLegacy removed the savepoint entry from the deployment status before calling the Flink dispose API. If the job was down at that moment, the dispose call failed silently and the entry was already gone - leaving the savepoint files on storage permanently with no path to cleanup.

Fix: invert the order - call dispose first, only remove the status entry on success. On failure the entry is retained and retried on the next reconcile cycle. For count-based eviction a break is used on failure since entries are oldest-first and skipping the oldest would still orphan it. For age-based eviction each entry is evaluated independently so cleanup continues on failure.

Tests added (SnapshotObserverLegacyTest):

  • testCountBasedDisposeRetainsEntryOnFailure - entry kept when dispose fails (count path)
  • testAgeBasedDisposeRetainsEntryOnFailure - entry kept when dispose fails (age path)
  • testDisposeRetryOnSubsequentReconcile - entry successfully cleaned up on the next reconcile after dispose recovers

Brief change log

Fixed the issue and added tests.

Verifying this change

Existing + new automated tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@gaborgsomogyi gaborgsomogyi merged commit 04065a0 into apache:main Apr 15, 2026
120 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants