Skip to content

[FLINK-39270] Fix savepoint/last-state upgrade state loss under slow JM#1073

Merged
gyfora merged 1 commit into
apache:mainfrom
gyfora:FLINK-39270
Apr 13, 2026
Merged

[FLINK-39270] Fix savepoint/last-state upgrade state loss under slow JM#1073
gyfora merged 1 commit into
apache:mainfrom
gyfora:FLINK-39270

Conversation

@gyfora

@gyfora gyfora commented Mar 19, 2026

Copy link
Copy Markdown
Contributor

Change log

The current savepoint upgrade logic for terminal jobs assumes that the JM cleans up all HA state promptly when the job finishes. Under some circumstances such as large CPU/memory pressure this is not guaranteed and the operator mistakes this to HA meta available for last-state upgrades.

By the time the job starts that HA metadata may be gone and the job can start with an empty state leading to complete state loss. We have observed this for some jobs under CPU throttling.

This change fixes the logic for this upgrades to make sure savepoint upgrade can always proceed for terminal jobs when they are observed in their state correctly.

Verifying this change

Unit tests added

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: yes

@gyfora gyfora merged commit c046a80 into apache:main Apr 13, 2026
119 checks passed
jhung-lyft added a commit to lyft/flink-kubernetes-operator that referenced this pull request May 22, 2026
## TL;DR
Fixes three bugs in `FlinkBlueGreenDeployment` that compound to leave a
transition silently restoring from a stale savepoint:

  | # | Bug | Fix |
  |---|-----|-----|
| 1 | `savepointTriggerId` not cleared on abort → next deploy stuck
FAILING+IGNORE | Clear it in `abortDeployment` (mirrors upstream
[a00bf27](apache@a00bf27))
|
| 2 | `setLastReconciledSpec` runs before `startTransition` → failed
spec marked reconciled, retries no-op | Stamp only on success (mirrors
upstream
[apache#1072](apache#1072))
|
| 3 | Suspended child carries stale `status.upgradeSavepointPath` → JM
restores from old savepoint on retry | Delete the failed child in
`abortDeployment` so the next transition fresh-creates it |

Verified in unit test and end-to-end on `data-stg` — see
[Validation](#Staging-validation-on-data-stg) below.

  ## Issue 1 — `savepointTriggerId` not cleared on abort

`abortDeployment` left `status.savepointTriggerId` in place. The next
reconcile tried to re-fetch the trigger from the active side's JM, which
had TTL-evicted it (~300s), threw `Could not fetch savepoint with
triggerId: X`, and left FBGD stuck in FAILING + IGNORE. Required a
`restartNonce` bump to escape.

## Issue 2 — `setLastReconciledSpec` stamped before `startTransition`
returns

In `checkAndInitiateDeployment`, `setLastReconciledSpec` ran *before*
`startTransition`. If `startTransition` threw (e.g., from Issue 1), the
failed spec was already marked reconciled, so the next reconcile saw
`specDiff = IGNORE` and silently did nothing until another spec change
came in.

## Issue 3 — Stale `status.upgradeSavepointPath` shadows fresh
`spec.initialSavepointPath`

`AbstractJobReconciler.restoreJob()` reads
`status.jobStatus.upgradeSavepointPath`, not
`spec.initialSavepointPath`, in the `terminal && !isHaMetadataAvailable`
upgrade branch. A suspended child retained its stale
`upgradeSavepointPath` from the previous failed deploy. On retry, FBGD
wrote a fresh
`spec.initialSavepointPath`, but `createOrReplace` does not touch the
status subresource — so the stale status value won and the JM restored
from an arbitrarily old savepoint while the transition appeared
successful.

Upstream
[apache#1073](apache#1073)
does NOT fix this: its `!isHaMetadataAvailable || isJmAccessible` gate
is satisfied with empty HA, so the operator still routes through the
stale-status path.

**Fix**: `abortDeployment` now deletes the failed child instead of
suspending it. The next transition fresh-creates the child via
`createOrReplace`'s POST path, with no status to be stale. Falls back to
`suspendFlinkDeployment` if the delete call fails, to avoid an
inconsistent state.

  ## Tests

- `verifyFailedChildDeletedOnAbort` (new): forces a failed transition,
asserts the failed child is gone, then asserts the retry fresh-creates
GREEN with the fresh savepoint in spec and no carry-over status.
- `verifyFailureDuringTransition` /
`verifyFailureBeforeFirstDeployment`: updated to assert the failed child
is absent (not SUSPENDED) after abort.
- `verifySavepointFetchFailureRecovery`: asserts the failed spec is NOT
present in `lastReconciledSpec` (Issue 2).

  ## Test plan

  - [x] `FlinkBlueGreenDeploymentControllerTest` → 60/60 pass
- [x] `*ControllerTest` → 148/148 pass (no regressions in
FlinkDeployment / SessionJob / StateSnapshot)
- [x] Deployed to `data-stg`; forced abort + retry on `beamperfk8s1`;
verified all three fixes ([details
below](#Staging-validation-on-data-stg))


  ## Staging validation on data-stg

Forced a failed transition on `beamperfk8s1` with a bad image tag, then
retried with a good image.

  | Time (UTC) | Event |
  |------------|-------|
| 15:08:40 | First transition: triggerId `5d75a09efb99...`, savepoint
`6966a02076a5` → new GREEN |
| 15:09:42 | `Aborting deployment 'beamperfk8s1-green', rolling B/G
deployment back to ACTIVE_BLUE` |
| 15:09:44 | GREEN audit: `Status[Job] \| Error \| DELETED \|
ErrImagePull` |
| 15:11:40 | Retry: NEW triggerId `5f1d597953fe...` (different from
first) |
  | 15:11:55 | Fresh savepoint `b0d173acab2e` → fresh GREEN |
| 15:12:56 | GREEN JM log: `Starting job ... from savepoint
.../savepoint-4d93e8-b0d173acab2e` |
  | 15:13:23 | Old BLUE deleted (normal post-transition cleanup) |

  - **1**: retry used a new triggerId, not the TTL-evicted first one
- **2**: retry transition actually ran (would have been IGNORE'd if the
failed spec had been stamped)
- **3**: failed GREEN was DELETED (not suspended); fresh GREEN's JM
restored from the fresh `b0d173acab2e`, not the prior attempt's
`6966a02076a5`

  ## Production incident this resolves

May 7 `demandingestion`: a transient deploy failure left stale
`savepointTriggerId` and `status.upgradeSavepointPath` on a suspended
GREEN; the failed spec was stamped as reconciled, freezing retries; when
the trigger TTL evicted and a deploy went through, the JM silently
restored from a 9-day-old savepoint. Full RCA: [Deploy Failure
Investigation](https://docs.google.com/document/d/1rwN2TABe-3VtkHWBttLbLlW_P6Q-TMs0cuAs6fd-cQA/edit).

---------

Co-authored-by: jennifer-xiong25 <jennifer.xiong@shopify.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants