Fix FBGD stale-state bugs on aborted transitions#3
Merged
Conversation
2a4d082 to
ff954a6
Compare
A suspended child FD left behind by an aborted transition keeps its status subresource on the cluster. AbstractJobReconciler.restoreJob() in the terminal+no-HA upgrade branch reads from status.jobStatus.upgrade- SavepointPath, not from spec.initialSavepointPath. createOrReplace on the next transition writes a fresh spec but does not touch the status subresource, so the operator silently restores the job from the stale savepoint recorded in status. abortDeployment now deletes the failed child instead of suspending it. A fresh deploy on the next transition forces the FD reconciler into the first-deployment branch, which reads spec.initialSavepointPath correctly. If the delete call fails, fall back to suspendFlinkDeployment to avoid leaving the resource in an inconsistent state. Updated tests: - verifyFailureDuringTransition / verifyFailureBeforeFirstDeployment: assert the failed child is absent (not SUSPENDED) after abort. - verifyFailedChildDeletedOnAbort (replaces the earlier sync test): asserts the failed child is gone after abort and the next transition fresh-creates GREEN with no carry-over status.upgradeSavepointPath. Reverts the post-deployCluster status sync added in the prior version of this commit; deletion makes the sync unnecessary and avoids the watch- event race where the FD reconciler could snapshot status before the sync landed.
ff954a6 to
e23b979
Compare
Author
|
/ptal @ardakuyumcu @maheepm-lyft @aniruddha-lyft #streaming-compute-dev-prs |
maheepm-lyft
approved these changes
May 22, 2026
maheepm-lyft
left a comment
There was a problem hiding this comment.
I think this is ok to merge and roll out internally. However, I am against maintaining separate forks indefinitely. I would strongly suggest we open a ticket with Apache and push this fix to the upstream operator repo as well.
|
Author
i've reproduced them and tested on staging with beampergk8s, details in Staging validation on data-stg section of description opened a ticket to track contributing back to upstream: https://lyft.atlassian.net/browse/STRMCMP-1989 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TL;DR
Fixes three bugs in
FlinkBlueGreenDeploymentthat compound to leave a transition silently restoring from a stale savepoint:savepointTriggerIdnot cleared on abort → next deploy stuck FAILING+IGNOREabortDeployment(mirrors upstream a00bf278)setLastReconciledSpecruns beforestartTransition→ failed spec marked reconciled, retries no-opstatus.upgradeSavepointPath→ JM restores from old savepoint on retryabortDeploymentso the next transition fresh-creates itVerified in unit test and end-to-end on
data-stg— see Validation below.Issue 1 —
savepointTriggerIdnot cleared on abortabortDeploymentleftstatus.savepointTriggerIdin place. The next reconcile tried to re-fetch the trigger from the active side's JM, which had TTL-evicted it (~300s), threwCould not fetch savepoint with triggerId: X, and left FBGD stuck in FAILING + IGNORE. Required arestartNoncebump to escape.Issue 2 —
setLastReconciledSpecstamped beforestartTransitionreturnsIn
checkAndInitiateDeployment,setLastReconciledSpecran beforestartTransition. IfstartTransitionthrew (e.g., from Issue 1), the failed spec was already marked reconciled, so the next reconcile sawspecDiff = IGNOREand silently did nothing until another spec change came in.Issue 3 — Stale
status.upgradeSavepointPathshadows freshspec.initialSavepointPathAbstractJobReconciler.restoreJob()readsstatus.jobStatus.upgradeSavepointPath, notspec.initialSavepointPath, in theterminal && !isHaMetadataAvailableupgrade branch. A suspended child retained its staleupgradeSavepointPathfrom the previous failed deploy. On retry, FBGD wrote a freshspec.initialSavepointPath, butcreateOrReplacedoes not touch the status subresource — so the stale status value won and the JM restored from an arbitrarily old savepoint while the transition appeared successful.Upstream #1073 does NOT fix this: its
!isHaMetadataAvailable || isJmAccessiblegate is satisfied with empty HA, so the operator still routes through the stale-status path.Fix:
abortDeploymentnow deletes the failed child instead of suspending it. The next transition fresh-creates the child viacreateOrReplace's POST path, with no status to be stale. Falls back tosuspendFlinkDeploymentif the delete call fails, to avoid an inconsistent state.Tests
verifyFailedChildDeletedOnAbort(new): forces a failed transition, asserts the failed child is gone, then asserts the retry fresh-creates GREEN with the fresh savepoint in spec and no carry-over status.verifyFailureDuringTransition/verifyFailureBeforeFirstDeployment: updated to assert the failed child is absent (not SUSPENDED) after abort.verifySavepointFetchFailureRecovery: asserts the failed spec is NOT present inlastReconciledSpec(Issue 2).Test plan
FlinkBlueGreenDeploymentControllerTest→ 60/60 pass*ControllerTest→ 148/148 pass (no regressions in FlinkDeployment / SessionJob / StateSnapshot)data-stg; forced abort + retry onbeamperfk8s1; verified all three fixes (details below)Staging validation on data-stg
Forced a failed transition on
beamperfk8s1with a bad image tag, then retried with a good image.5d75a09efb99..., savepoint6966a02076a5→ new GREENAborting deployment 'beamperfk8s1-green', rolling B/G deployment back to ACTIVE_BLUEStatus[Job] | Error | DELETED | ErrImagePull5f1d597953fe...(different from first)b0d173acab2e→ fresh GREENStarting job ... from savepoint .../savepoint-4d93e8-b0d173acab2eb0d173acab2e, not the prior attempt's6966a02076a5Production incident this resolves
Full RCA: Deploy Failure Investigation.