Fix FBGD stale-state bugs on aborted transitions by jhung-lyft · Pull Request #3 · lyft/flink-kubernetes-operator

jhung-lyft · 2026-05-20T21:26:36Z

TL;DR

Fixes three bugs in FlinkBlueGreenDeployment that compound to leave a transition silently restoring from a stale savepoint:

#	Bug	Fix
1	`savepointTriggerId` not cleared on abort → next deploy stuck FAILING+IGNORE	Clear it in `abortDeployment` (mirrors upstream a00bf278)
2	`setLastReconciledSpec` runs before `startTransition` → failed spec marked reconciled, retries no-op	Stamp only on success (mirrors upstream #1072)
3	Suspended child carries stale `status.upgradeSavepointPath` → JM restores from old savepoint on retry	Delete the failed child in `abortDeployment` so the next transition fresh-creates it

Verified in unit test and end-to-end on data-stg — see Validation below.

Issue 1 — `savepointTriggerId` not cleared on abort

abortDeployment left status.savepointTriggerId in place. The next reconcile tried to re-fetch the trigger from the active side's JM, which had TTL-evicted it (~300s), threw Could not fetch savepoint with triggerId: X, and left FBGD stuck in FAILING + IGNORE. Required a restartNonce bump to escape.

Issue 2 — `setLastReconciledSpec` stamped before `startTransition` returns

In checkAndInitiateDeployment, setLastReconciledSpec ran before startTransition. If startTransition threw (e.g., from Issue 1), the failed spec was already marked reconciled, so the next reconcile saw specDiff = IGNORE and silently did nothing until another spec change came in.

Issue 3 — Stale `status.upgradeSavepointPath` shadows fresh `spec.initialSavepointPath`

AbstractJobReconciler.restoreJob() reads status.jobStatus.upgradeSavepointPath, not spec.initialSavepointPath, in the terminal && !isHaMetadataAvailable upgrade branch. A suspended child retained its stale upgradeSavepointPath from the previous failed deploy. On retry, FBGD wrote a fresh
spec.initialSavepointPath, but createOrReplace does not touch the status subresource — so the stale status value won and the JM restored from an arbitrarily old savepoint while the transition appeared successful.

Upstream #1073 does NOT fix this: its !isHaMetadataAvailable || isJmAccessible gate is satisfied with empty HA, so the operator still routes through the stale-status path.

Fix: abortDeployment now deletes the failed child instead of suspending it. The next transition fresh-creates the child via createOrReplace's POST path, with no status to be stale. Falls back to suspendFlinkDeployment if the delete call fails, to avoid an inconsistent state.

Tests

verifyFailedChildDeletedOnAbort (new): forces a failed transition, asserts the failed child is gone, then asserts the retry fresh-creates GREEN with the fresh savepoint in spec and no carry-over status.
verifyFailureDuringTransition / verifyFailureBeforeFirstDeployment: updated to assert the failed child is absent (not SUSPENDED) after abort.
verifySavepointFetchFailureRecovery: asserts the failed spec is NOT present in lastReconciledSpec (Issue 2).

Test plan

FlinkBlueGreenDeploymentControllerTest → 60/60 pass
*ControllerTest → 148/148 pass (no regressions in FlinkDeployment / SessionJob / StateSnapshot)
Deployed to data-stg; forced abort + retry on beamperfk8s1; verified all three fixes (details below)

Staging validation on data-stg

Forced a failed transition on beamperfk8s1 with a bad image tag, then retried with a good image.

Time (UTC)	Event
15:08:40	First transition: triggerId `5d75a09efb99...`, savepoint `6966a02076a5` → new GREEN
15:09:42	`Aborting deployment 'beamperfk8s1-green', rolling B/G deployment back to ACTIVE_BLUE`
15:09:44	GREEN audit: `Status[Job] \| Error \| DELETED \| ErrImagePull`
15:11:40	Retry: NEW triggerId `5f1d597953fe...` (different from first)
15:11:55	Fresh savepoint `b0d173acab2e` → fresh GREEN
15:12:56	GREEN JM log: `Starting job ... from savepoint .../savepoint-4d93e8-b0d173acab2e`
15:13:23	Old BLUE deleted (normal post-transition cleanup)

1: retry used a new triggerId, not the TTL-evicted first one
2: retry transition actually ran (would have been IGNORE'd if the failed spec had been stamped)
3: failed GREEN was DELETED (not suspended); fresh GREEN's JM restored from the fresh b0d173acab2e, not the prior attempt's 6966a02076a5

Production incident this resolves

Full RCA: Deploy Failure Investigation.

A suspended child FD left behind by an aborted transition keeps its status subresource on the cluster. AbstractJobReconciler.restoreJob() in the terminal+no-HA upgrade branch reads from status.jobStatus.upgrade- SavepointPath, not from spec.initialSavepointPath. createOrReplace on the next transition writes a fresh spec but does not touch the status subresource, so the operator silently restores the job from the stale savepoint recorded in status. abortDeployment now deletes the failed child instead of suspending it. A fresh deploy on the next transition forces the FD reconciler into the first-deployment branch, which reads spec.initialSavepointPath correctly. If the delete call fails, fall back to suspendFlinkDeployment to avoid leaving the resource in an inconsistent state. Updated tests: - verifyFailureDuringTransition / verifyFailureBeforeFirstDeployment: assert the failed child is absent (not SUSPENDED) after abort. - verifyFailedChildDeletedOnAbort (replaces the earlier sync test): asserts the failed child is gone after abort and the next transition fresh-creates GREEN with no carry-over status.upgradeSavepointPath. Reverts the post-deployCluster status sync added in the prior version of this commit; deletion makes the sync unnecessary and avoids the watch- event race where the FD reconciler could snapshot status before the sync landed.

jhung-lyft · 2026-05-22T15:47:48Z

/ptal @ardakuyumcu @maheepm-lyft @aniruddha-lyft #streaming-compute-dev-prs

maheepm-lyft

I think this is ok to merge and roll out internally. However, I am against maintaining separate forks indefinitely. I would strongly suggest we open a ticket with Apache and push this fix to the upstream operator repo as well.

aniruddha-lyft · 2026-05-22T16:51:15Z

What's the plan for submitting to upstream ? can you open ticket with Apache project ?
Aside from unit testing, is there a way to repro this deterministically and test the fix ?

jhung-lyft · 2026-05-22T17:26:14Z

What's the plan for submitting to upstream ? can you open ticket with Apache project ?

Aside from unit testing, is there a way to repro this deterministically and test the fix ?

i've reproduced them and tested on staging with beampergk8s, details in Staging validation on data-stg section of description

opened a ticket to track contributing back to upstream: https://lyft.atlassian.net/browse/STRMCMP-1989

jennifer-xiong25 added 2 commits May 20, 2026 16:00

[hotfix] Clear stale savepointTriggerId on B/G transition abort

6742249

last reconciled bug

fd36457

jhung-lyft changed the title ~~Fix FBGD stale-state bugs on aborted transitions (Issues #1, #2, #3)~~ Fix FBGD stale-state bugs on aborted transitions May 21, 2026

jhung-lyft force-pushed the lyft-fbgd-stale-state-fix branch 2 times, most recently from 2a4d082 to ff954a6 Compare May 21, 2026 16:18

jhung-lyft force-pushed the lyft-fbgd-stale-state-fix branch from ff954a6 to e23b979 Compare May 22, 2026 14:37

jhung-lyft changed the base branch from main to lyft-release May 22, 2026 15:44

maheepm-lyft approved these changes May 22, 2026

View reviewed changes

jhung-lyft merged commit 1d4b5e5 into lyft-release May 22, 2026
1 of 2 checks passed

jhung-lyft deleted the lyft-fbgd-stale-state-fix branch May 22, 2026 17:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix FBGD stale-state bugs on aborted transitions#3

Fix FBGD stale-state bugs on aborted transitions#3
jhung-lyft merged 3 commits into
lyft-releasefrom
lyft-fbgd-stale-state-fix

jhung-lyft commented May 20, 2026 •

edited

Loading

Uh oh!

jhung-lyft commented May 22, 2026

Uh oh!

maheepm-lyft left a comment

Uh oh!

aniruddha-lyft commented May 22, 2026

Uh oh!

jhung-lyft commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

jhung-lyft commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

Issue 1 — savepointTriggerId not cleared on abort

Issue 2 — setLastReconciledSpec stamped before startTransition returns

Issue 3 — Stale status.upgradeSavepointPath shadows fresh spec.initialSavepointPath

Tests

Test plan

Staging validation on data-stg

Production incident this resolves

Uh oh!

jhung-lyft commented May 22, 2026

Uh oh!

maheepm-lyft left a comment

Choose a reason for hiding this comment

Uh oh!

aniruddha-lyft commented May 22, 2026

Uh oh!

jhung-lyft commented May 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jhung-lyft commented May 20, 2026 •

edited

Loading

Issue 1 — `savepointTriggerId` not cleared on abort

Issue 2 — `setLastReconciledSpec` stamped before `startTransition` returns

Issue 3 — Stale `status.upgradeSavepointPath` shadows fresh `spec.initialSavepointPath`