Skip to content

Fix FBGD stale-state bugs on aborted transitions#3

Merged
jhung-lyft merged 3 commits into
lyft-releasefrom
lyft-fbgd-stale-state-fix
May 22, 2026
Merged

Fix FBGD stale-state bugs on aborted transitions#3
jhung-lyft merged 3 commits into
lyft-releasefrom
lyft-fbgd-stale-state-fix

Conversation

@jhung-lyft

@jhung-lyft jhung-lyft commented May 20, 2026

Copy link
Copy Markdown

TL;DR

Fixes three bugs in FlinkBlueGreenDeployment that compound to leave a transition silently restoring from a stale savepoint:

# Bug Fix
1 savepointTriggerId not cleared on abort → next deploy stuck FAILING+IGNORE Clear it in abortDeployment (mirrors upstream a00bf278)
2 setLastReconciledSpec runs before startTransition → failed spec marked reconciled, retries no-op Stamp only on success (mirrors upstream #1072)
3 Suspended child carries stale status.upgradeSavepointPath → JM restores from old savepoint on retry Delete the failed child in abortDeployment so the next transition fresh-creates it

Verified in unit test and end-to-end on data-stg — see Validation below.

Issue 1 — savepointTriggerId not cleared on abort

abortDeployment left status.savepointTriggerId in place. The next reconcile tried to re-fetch the trigger from the active side's JM, which had TTL-evicted it (~300s), threw Could not fetch savepoint with triggerId: X, and left FBGD stuck in FAILING + IGNORE. Required a restartNonce bump to escape.

Issue 2 — setLastReconciledSpec stamped before startTransition returns

In checkAndInitiateDeployment, setLastReconciledSpec ran before startTransition. If startTransition threw (e.g., from Issue 1), the failed spec was already marked reconciled, so the next reconcile saw specDiff = IGNORE and silently did nothing until another spec change came in.

Issue 3 — Stale status.upgradeSavepointPath shadows fresh spec.initialSavepointPath

AbstractJobReconciler.restoreJob() reads status.jobStatus.upgradeSavepointPath, not spec.initialSavepointPath, in the terminal && !isHaMetadataAvailable upgrade branch. A suspended child retained its stale upgradeSavepointPath from the previous failed deploy. On retry, FBGD wrote a fresh
spec.initialSavepointPath, but createOrReplace does not touch the status subresource — so the stale status value won and the JM restored from an arbitrarily old savepoint while the transition appeared successful.

Upstream #1073 does NOT fix this: its !isHaMetadataAvailable || isJmAccessible gate is satisfied with empty HA, so the operator still routes through the stale-status path.

Fix: abortDeployment now deletes the failed child instead of suspending it. The next transition fresh-creates the child via createOrReplace's POST path, with no status to be stale. Falls back to suspendFlinkDeployment if the delete call fails, to avoid an inconsistent state.

Tests

  • verifyFailedChildDeletedOnAbort (new): forces a failed transition, asserts the failed child is gone, then asserts the retry fresh-creates GREEN with the fresh savepoint in spec and no carry-over status.
  • verifyFailureDuringTransition / verifyFailureBeforeFirstDeployment: updated to assert the failed child is absent (not SUSPENDED) after abort.
  • verifySavepointFetchFailureRecovery: asserts the failed spec is NOT present in lastReconciledSpec (Issue 2).

Test plan

  • FlinkBlueGreenDeploymentControllerTest → 60/60 pass
  • *ControllerTest → 148/148 pass (no regressions in FlinkDeployment / SessionJob / StateSnapshot)
  • Deployed to data-stg; forced abort + retry on beamperfk8s1; verified all three fixes (details below)

Staging validation on data-stg

Forced a failed transition on beamperfk8s1 with a bad image tag, then retried with a good image.

Time (UTC) Event
15:08:40 First transition: triggerId 5d75a09efb99..., savepoint 6966a02076a5 → new GREEN
15:09:42 Aborting deployment 'beamperfk8s1-green', rolling B/G deployment back to ACTIVE_BLUE
15:09:44 GREEN audit: Status[Job] | Error | DELETED | ErrImagePull
15:11:40 Retry: NEW triggerId 5f1d597953fe... (different from first)
15:11:55 Fresh savepoint b0d173acab2e → fresh GREEN
15:12:56 GREEN JM log: Starting job ... from savepoint .../savepoint-4d93e8-b0d173acab2e
15:13:23 Old BLUE deleted (normal post-transition cleanup)
  • 1: retry used a new triggerId, not the TTL-evicted first one
  • 2: retry transition actually ran (would have been IGNORE'd if the failed spec had been stamped)
  • 3: failed GREEN was DELETED (not suspended); fresh GREEN's JM restored from the fresh b0d173acab2e, not the prior attempt's 6966a02076a5

Production incident this resolves

Full RCA: Deploy Failure Investigation.

@jhung-lyft jhung-lyft changed the title Fix FBGD stale-state bugs on aborted transitions (Issues #1, #2, #3) Fix FBGD stale-state bugs on aborted transitions May 21, 2026
@jhung-lyft jhung-lyft force-pushed the lyft-fbgd-stale-state-fix branch 2 times, most recently from 2a4d082 to ff954a6 Compare May 21, 2026 16:18
A suspended child FD left behind by an aborted transition keeps its
status subresource on the cluster. AbstractJobReconciler.restoreJob() in
the terminal+no-HA upgrade branch reads from status.jobStatus.upgrade-
SavepointPath, not from spec.initialSavepointPath. createOrReplace on
the next transition writes a fresh spec but does not touch the status
subresource, so the operator silently restores the job from the stale
savepoint recorded in status.

abortDeployment now deletes the failed child instead of suspending it.
A fresh deploy on the next transition forces the FD reconciler into the
first-deployment branch, which reads spec.initialSavepointPath correctly.
If the delete call fails, fall back to suspendFlinkDeployment to avoid
leaving the resource in an inconsistent state.

Updated tests:
- verifyFailureDuringTransition / verifyFailureBeforeFirstDeployment:
  assert the failed child is absent (not SUSPENDED) after abort.
- verifyFailedChildDeletedOnAbort (replaces the earlier sync test):
  asserts the failed child is gone after abort and the next transition
  fresh-creates GREEN with no carry-over status.upgradeSavepointPath.

Reverts the post-deployCluster status sync added in the prior version of
this commit; deletion makes the sync unnecessary and avoids the watch-
event race where the FD reconciler could snapshot status before the sync
landed.
@jhung-lyft jhung-lyft force-pushed the lyft-fbgd-stale-state-fix branch from ff954a6 to e23b979 Compare May 22, 2026 14:37
@jhung-lyft jhung-lyft changed the base branch from main to lyft-release May 22, 2026 15:44
@jhung-lyft

Copy link
Copy Markdown
Author

/ptal @ardakuyumcu @maheepm-lyft @aniruddha-lyft #streaming-compute-dev-prs

@maheepm-lyft maheepm-lyft left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ok to merge and roll out internally. However, I am against maintaining separate forks indefinitely. I would strongly suggest we open a ticket with Apache and push this fix to the upstream operator repo as well.

@aniruddha-lyft

Copy link
Copy Markdown
  1. What's the plan for submitting to upstream ? can you open ticket with Apache project ?
  2. Aside from unit testing, is there a way to repro this deterministically and test the fix ?

@jhung-lyft

Copy link
Copy Markdown
Author
  1. What's the plan for submitting to upstream ? can you open ticket with Apache project ?
  2. Aside from unit testing, is there a way to repro this deterministically and test the fix ?

i've reproduced them and tested on staging with beampergk8s, details in Staging validation on data-stg section of description

opened a ticket to track contributing back to upstream: https://lyft.atlassian.net/browse/STRMCMP-1989

@jhung-lyft jhung-lyft merged commit 1d4b5e5 into lyft-release May 22, 2026
1 of 2 checks passed
@jhung-lyft jhung-lyft deleted the lyft-fbgd-stale-state-fix branch May 22, 2026 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants