[FLINK-39270] Fix savepoint/last-state upgrade state loss under slow JM#1073
Merged
Conversation
gaborgsomogyi
approved these changes
Apr 13, 2026
csviri
approved these changes
Apr 13, 2026
3 tasks
jhung-lyft
added a commit
to lyft/flink-kubernetes-operator
that referenced
this pull request
May 22, 2026
## TL;DR Fixes three bugs in `FlinkBlueGreenDeployment` that compound to leave a transition silently restoring from a stale savepoint: | # | Bug | Fix | |---|-----|-----| | 1 | `savepointTriggerId` not cleared on abort → next deploy stuck FAILING+IGNORE | Clear it in `abortDeployment` (mirrors upstream [a00bf27](apache@a00bf27)) | | 2 | `setLastReconciledSpec` runs before `startTransition` → failed spec marked reconciled, retries no-op | Stamp only on success (mirrors upstream [apache#1072](apache#1072)) | | 3 | Suspended child carries stale `status.upgradeSavepointPath` → JM restores from old savepoint on retry | Delete the failed child in `abortDeployment` so the next transition fresh-creates it | Verified in unit test and end-to-end on `data-stg` — see [Validation](#Staging-validation-on-data-stg) below. ## Issue 1 — `savepointTriggerId` not cleared on abort `abortDeployment` left `status.savepointTriggerId` in place. The next reconcile tried to re-fetch the trigger from the active side's JM, which had TTL-evicted it (~300s), threw `Could not fetch savepoint with triggerId: X`, and left FBGD stuck in FAILING + IGNORE. Required a `restartNonce` bump to escape. ## Issue 2 — `setLastReconciledSpec` stamped before `startTransition` returns In `checkAndInitiateDeployment`, `setLastReconciledSpec` ran *before* `startTransition`. If `startTransition` threw (e.g., from Issue 1), the failed spec was already marked reconciled, so the next reconcile saw `specDiff = IGNORE` and silently did nothing until another spec change came in. ## Issue 3 — Stale `status.upgradeSavepointPath` shadows fresh `spec.initialSavepointPath` `AbstractJobReconciler.restoreJob()` reads `status.jobStatus.upgradeSavepointPath`, not `spec.initialSavepointPath`, in the `terminal && !isHaMetadataAvailable` upgrade branch. A suspended child retained its stale `upgradeSavepointPath` from the previous failed deploy. On retry, FBGD wrote a fresh `spec.initialSavepointPath`, but `createOrReplace` does not touch the status subresource — so the stale status value won and the JM restored from an arbitrarily old savepoint while the transition appeared successful. Upstream [apache#1073](apache#1073) does NOT fix this: its `!isHaMetadataAvailable || isJmAccessible` gate is satisfied with empty HA, so the operator still routes through the stale-status path. **Fix**: `abortDeployment` now deletes the failed child instead of suspending it. The next transition fresh-creates the child via `createOrReplace`'s POST path, with no status to be stale. Falls back to `suspendFlinkDeployment` if the delete call fails, to avoid an inconsistent state. ## Tests - `verifyFailedChildDeletedOnAbort` (new): forces a failed transition, asserts the failed child is gone, then asserts the retry fresh-creates GREEN with the fresh savepoint in spec and no carry-over status. - `verifyFailureDuringTransition` / `verifyFailureBeforeFirstDeployment`: updated to assert the failed child is absent (not SUSPENDED) after abort. - `verifySavepointFetchFailureRecovery`: asserts the failed spec is NOT present in `lastReconciledSpec` (Issue 2). ## Test plan - [x] `FlinkBlueGreenDeploymentControllerTest` → 60/60 pass - [x] `*ControllerTest` → 148/148 pass (no regressions in FlinkDeployment / SessionJob / StateSnapshot) - [x] Deployed to `data-stg`; forced abort + retry on `beamperfk8s1`; verified all three fixes ([details below](#Staging-validation-on-data-stg)) ## Staging validation on data-stg Forced a failed transition on `beamperfk8s1` with a bad image tag, then retried with a good image. | Time (UTC) | Event | |------------|-------| | 15:08:40 | First transition: triggerId `5d75a09efb99...`, savepoint `6966a02076a5` → new GREEN | | 15:09:42 | `Aborting deployment 'beamperfk8s1-green', rolling B/G deployment back to ACTIVE_BLUE` | | 15:09:44 | GREEN audit: `Status[Job] \| Error \| DELETED \| ErrImagePull` | | 15:11:40 | Retry: NEW triggerId `5f1d597953fe...` (different from first) | | 15:11:55 | Fresh savepoint `b0d173acab2e` → fresh GREEN | | 15:12:56 | GREEN JM log: `Starting job ... from savepoint .../savepoint-4d93e8-b0d173acab2e` | | 15:13:23 | Old BLUE deleted (normal post-transition cleanup) | - **1**: retry used a new triggerId, not the TTL-evicted first one - **2**: retry transition actually ran (would have been IGNORE'd if the failed spec had been stamped) - **3**: failed GREEN was DELETED (not suspended); fresh GREEN's JM restored from the fresh `b0d173acab2e`, not the prior attempt's `6966a02076a5` ## Production incident this resolves May 7 `demandingestion`: a transient deploy failure left stale `savepointTriggerId` and `status.upgradeSavepointPath` on a suspended GREEN; the failed spec was stamped as reconciled, freezing retries; when the trigger TTL evicted and a deploy went through, the JM silently restored from a 9-day-old savepoint. Full RCA: [Deploy Failure Investigation](https://docs.google.com/document/d/1rwN2TABe-3VtkHWBttLbLlW_P6Q-TMs0cuAs6fd-cQA/edit). --------- Co-authored-by: jennifer-xiong25 <jennifer.xiong@shopify.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Change log
The current savepoint upgrade logic for terminal jobs assumes that the JM cleans up all HA state promptly when the job finishes. Under some circumstances such as large CPU/memory pressure this is not guaranteed and the operator mistakes this to HA meta available for last-state upgrades.
By the time the job starts that HA metadata may be gone and the job can start with an empty state leading to complete state loss. We have observed this for some jobs under CPU throttling.
This change fixes the logic for this upgrades to make sure savepoint upgrade can always proceed for terminal jobs when they are observed in their state correctly.
Verifying this change
Unit tests added
Does this pull request potentially affect one of the following parts:
CustomResourceDescriptors: no