[FLINK-39270] Fix savepoint/last-state upgrade state loss under slow JM by gyfora · Pull Request #1073 · apache/flink-kubernetes-operator

gyfora · 2026-03-19T13:28:15Z

Change log

The current savepoint upgrade logic for terminal jobs assumes that the JM cleans up all HA state promptly when the job finishes. Under some circumstances such as large CPU/memory pressure this is not guaranteed and the operator mistakes this to HA meta available for last-state upgrades.

By the time the job starts that HA metadata may be gone and the job can start with an empty state leading to complete state loss. We have observed this for some jobs under CPU throttling.

This change fixes the logic for this upgrades to make sure savepoint upgrade can always proceed for terminal jobs when they are observed in their state correctly.

Verifying this change

Unit tests added

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changes to the CustomResourceDescriptors: no
Core observer or reconciler logic that is regularly executed: yes

## TL;DR Fixes three bugs in `FlinkBlueGreenDeployment` that compound to leave a transition silently restoring from a stale savepoint: | # | Bug | Fix | |---|-----|-----| | 1 | `savepointTriggerId` not cleared on abort → next deploy stuck FAILING+IGNORE | Clear it in `abortDeployment` (mirrors upstream [a00bf27](apache@a00bf27)) | | 2 | `setLastReconciledSpec` runs before `startTransition` → failed spec marked reconciled, retries no-op | Stamp only on success (mirrors upstream [apache#1072](apache#1072)) | | 3 | Suspended child carries stale `status.upgradeSavepointPath` → JM restores from old savepoint on retry | Delete the failed child in `abortDeployment` so the next transition fresh-creates it | Verified in unit test and end-to-end on `data-stg` — see [Validation](#Staging-validation-on-data-stg) below. ## Issue 1 — `savepointTriggerId` not cleared on abort `abortDeployment` left `status.savepointTriggerId` in place. The next reconcile tried to re-fetch the trigger from the active side's JM, which had TTL-evicted it (~300s), threw `Could not fetch savepoint with triggerId: X`, and left FBGD stuck in FAILING + IGNORE. Required a `restartNonce` bump to escape. ## Issue 2 — `setLastReconciledSpec` stamped before `startTransition` returns In `checkAndInitiateDeployment`, `setLastReconciledSpec` ran *before* `startTransition`. If `startTransition` threw (e.g., from Issue 1), the failed spec was already marked reconciled, so the next reconcile saw `specDiff = IGNORE` and silently did nothing until another spec change came in. ## Issue 3 — Stale `status.upgradeSavepointPath` shadows fresh `spec.initialSavepointPath` `AbstractJobReconciler.restoreJob()` reads `status.jobStatus.upgradeSavepointPath`, not `spec.initialSavepointPath`, in the `terminal && !isHaMetadataAvailable` upgrade branch. A suspended child retained its stale `upgradeSavepointPath` from the previous failed deploy. On retry, FBGD wrote a fresh `spec.initialSavepointPath`, but `createOrReplace` does not touch the status subresource — so the stale status value won and the JM restored from an arbitrarily old savepoint while the transition appeared successful. Upstream [apache#1073](apache#1073) does NOT fix this: its `!isHaMetadataAvailable || isJmAccessible` gate is satisfied with empty HA, so the operator still routes through the stale-status path. **Fix**: `abortDeployment` now deletes the failed child instead of suspending it. The next transition fresh-creates the child via `createOrReplace`'s POST path, with no status to be stale. Falls back to `suspendFlinkDeployment` if the delete call fails, to avoid an inconsistent state. ## Tests - `verifyFailedChildDeletedOnAbort` (new): forces a failed transition, asserts the failed child is gone, then asserts the retry fresh-creates GREEN with the fresh savepoint in spec and no carry-over status. - `verifyFailureDuringTransition` / `verifyFailureBeforeFirstDeployment`: updated to assert the failed child is absent (not SUSPENDED) after abort. - `verifySavepointFetchFailureRecovery`: asserts the failed spec is NOT present in `lastReconciledSpec` (Issue 2). ## Test plan - [x] `FlinkBlueGreenDeploymentControllerTest` → 60/60 pass - [x] `*ControllerTest` → 148/148 pass (no regressions in FlinkDeployment / SessionJob / StateSnapshot) - [x] Deployed to `data-stg`; forced abort + retry on `beamperfk8s1`; verified all three fixes ([details below](#Staging-validation-on-data-stg)) ## Staging validation on data-stg Forced a failed transition on `beamperfk8s1` with a bad image tag, then retried with a good image. | Time (UTC) | Event | |------------|-------| | 15:08:40 | First transition: triggerId `5d75a09efb99...`, savepoint `6966a02076a5` → new GREEN | | 15:09:42 | `Aborting deployment 'beamperfk8s1-green', rolling B/G deployment back to ACTIVE_BLUE` | | 15:09:44 | GREEN audit: `Status[Job] \| Error \| DELETED \| ErrImagePull` | | 15:11:40 | Retry: NEW triggerId `5f1d597953fe...` (different from first) | | 15:11:55 | Fresh savepoint `b0d173acab2e` → fresh GREEN | | 15:12:56 | GREEN JM log: `Starting job ... from savepoint .../savepoint-4d93e8-b0d173acab2e` | | 15:13:23 | Old BLUE deleted (normal post-transition cleanup) | - **1**: retry used a new triggerId, not the TTL-evicted first one - **2**: retry transition actually ran (would have been IGNORE'd if the failed spec had been stamped) - **3**: failed GREEN was DELETED (not suspended); fresh GREEN's JM restored from the fresh `b0d173acab2e`, not the prior attempt's `6966a02076a5` ## Production incident this resolves May 7 `demandingestion`: a transient deploy failure left stale `savepointTriggerId` and `status.upgradeSavepointPath` on a suspended GREEN; the failed spec was stamped as reconciled, freezing retries; when the trigger TTL evicted and a deploy went through, the JM silently restored from a 9-day-old savepoint. Full RCA: [Deploy Failure Investigation](https://docs.google.com/document/d/1rwN2TABe-3VtkHWBttLbLlW_P6Q-TMs0cuAs6fd-cQA/edit). --------- Co-authored-by: jennifer-xiong25 <jennifer.xiong@shopify.com>

[FLINK-39270] Fix savepoint/last-state upgrade state loss under slow JM

5a448da

gyfora force-pushed the FLINK-39270 branch from d376b5f to 5a448da Compare April 13, 2026 12:04

gaborgsomogyi approved these changes Apr 13, 2026

View reviewed changes

csviri approved these changes Apr 13, 2026

View reviewed changes

gyfora merged commit c046a80 into apache:main Apr 13, 2026
119 checks passed

jhung-lyft mentioned this pull request May 21, 2026

Fix FBGD stale-state bugs on aborted transitions lyft/flink-kubernetes-operator#3

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FLINK-39270] Fix savepoint/last-state upgrade state loss under slow JM#1073

[FLINK-39270] Fix savepoint/last-state upgrade state loss under slow JM#1073
gyfora merged 1 commit into
apache:mainfrom
gyfora:FLINK-39270

gyfora commented Mar 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

gyfora commented Mar 19, 2026

Change log

Verifying this change

Does this pull request potentially affect one of the following parts:

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants