
[FLINK-38915] FlinkBlueGreenDeployment in-place suspension handler #1053

Merged: gyfora merged 1 commit into apache:main from Shopify:jk.bg-in-place-restarts on Jan 30, 2026

@james-kan-shopify
Contributor

What is the purpose of the change

Tackles: https://issues.apache.org/jira/browse/FLINK-38915

Improve blue/green suspend/resume behavior: allow in-place suspension/resume without spawning new deployments, propagate spec changes while suspended, block suspend during transitions, and fix BG status sync bugs.

Brief change log

  • Add SUSPEND/RESUME diff detection and in-place handling in BlueGreenDeploymentService. If the suspension happened on blue, the pipeline is resumed on blue when the state is set back to running.
  • Block suspend requests during blue/green transitions until the transition completes; the suspend is then executed after the transition.
  • Block initial deployment when job.state=SUSPENDED.
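
The three rules in the change log can be sketched as a small decision function. This is only an illustration under stated assumptions, not the code in BlueGreenDeploymentService: the class `SuspendResumeSketch`, the `Action` enum, and the `decide` method are hypothetical names, and a `null` last-reconciled state is assumed to mean "no deployment exists yet".

```java
// Hypothetical sketch; none of these names come from the operator code base.
class SuspendResumeSketch {
    /** Desired job state on the blue/green resource (mirrors job.state). */
    enum JobState { RUNNING, SUSPENDED }

    /** Illustrative outcomes of a single reconcile pass. */
    enum Action {
        REJECT_INITIAL_SUSPENDED,
        DEFER_SUSPEND_UNTIL_TRANSITION_DONE,
        SUSPEND_IN_PLACE,
        RESUME_IN_PLACE,
        NO_STATE_CHANGE
    }

    /**
     * @param lastReconciled state from the last reconciled spec, or null if nothing was deployed yet
     * @param desired state requested in the current spec
     * @param transitionInProgress true while a blue/green transition is running
     */
    static Action decide(JobState lastReconciled, JobState desired, boolean transitionInProgress) {
        if (lastReconciled == null) {
            // Initial deployment: a SUSPENDED job is rejected, anything else proceeds normally.
            return desired == JobState.SUSPENDED
                    ? Action.REJECT_INITIAL_SUSPENDED
                    : Action.NO_STATE_CHANGE;
        }
        if (lastReconciled == JobState.RUNNING && desired == JobState.SUSPENDED) {
            // SUSPEND diff: wait for a running transition, otherwise suspend the active color.
            return transitionInProgress
                    ? Action.DEFER_SUSPEND_UNTIL_TRANSITION_DONE
                    : Action.SUSPEND_IN_PLACE;
        }
        if (lastReconciled == JobState.SUSPENDED && desired == JobState.RUNNING) {
            // RESUME diff: resume on the same color that was suspended, no new deployment.
            return Action.RESUME_IN_PLACE;
        }
        return Action.NO_STATE_CHANGE;
    }

    public static void main(String[] args) {
        System.out.println(decide(JobState.RUNNING, JobState.SUSPENDED, true));
        System.out.println(decide(JobState.SUSPENDED, JobState.RUNNING, false));
    }
}
```

Keeping the decision separate from the side effects (patching, suspending, resuming) makes each rule easy to test on its own, which is roughly what the controller and spec-diff tests listed below verify.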

Verifying this change

This change added tests and can be verified as follows:

  • FlinkBlueGreenDeploymentControllerTest: suspend/resume in-place, suspend during transition blocked, initial suspended rejection.
  • FlinkBlueGreenDeploymentSpecDiffTest: SUSPEND/RESUME diff detection.

Does this pull request potentially affect one of the following parts:

  • Dependencies: no
  • Public API (CustomResourceDescriptors): no
  • Core observer or reconciler logic: yes (blue/green suspend/resume paths, status sync)

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@james-kan-shopify james-kan-shopify marked this pull request as ready for review January 19, 2026 21:16
return patchFlinkDeployment(context, currentBlueGreenDeploymentType);
}

// Check if child is currently suspended - if so, just patch specs without restart

Is there a benefit of patching the spec if the deployment is suspended? A RESUME command can/will override these changes anyway, am I correct?

Contributor Author

Hi! Yes, a RESUME will reconcile all changes made during suspension. However, without this patch, the FlinkDeployment and FlinkBlueGreenDeployment will be out of sync during that entire time.

We think keeping them in sync provides a better user experience: users can inspect the child FlinkDeployment at any time and see the exact spec that will be executed, rather than having to track changes across resources. We're hoping to eventually make the FlinkBlueGreenDeployment the single source of truth, with the child always reflecting the parent's current desired state.

That said, we're open to other perspectives! If you think the extra patch call isn't worth it, we'd be happy to discuss!
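
To make the sync argument concrete, here is a minimal sketch of patching a suspended child so it mirrors the parent's template without being restarted. It assumes specs can be modeled as flat string maps and that the state lives under a `job.state` key; `SuspendedSpecSyncSketch` and `syncChildSpec` are hypothetical names, not operator APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; real specs are typed objects, not flat maps.
class SuspendedSpecSyncSketch {
    static final String JOB_STATE = "job.state";

    /**
     * Returns the spec the child should carry after a parent spec change.
     * If the child is suspended, it receives the parent's values but stays
     * suspended, so inspecting the child shows what a later RESUME would run.
     */
    static Map<String, String> syncChildSpec(
            Map<String, String> parentTemplate, Map<String, String> childSpec) {
        boolean childSuspended = "suspended".equals(childSpec.get(JOB_STATE));
        Map<String, String> patched = new LinkedHashMap<>(parentTemplate);
        if (childSuspended) {
            // Propagate the change, but never flip a suspended child back to running here.
            patched.put(JOB_STATE, "suspended");
        }
        return patched;
    }

    public static void main(String[] args) {
        Map<String, String> parent = new LinkedHashMap<>();
        parent.put("image", "app:v2");
        parent.put(JOB_STATE, "suspended");
        Map<String, String> child = new LinkedHashMap<>();
        child.put("image", "app:v1");
        child.put(JOB_STATE, "suspended");
        System.out.println(syncChildSpec(parent, child));
    }
}
```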

return false;
}
return deployment.getStatus().getLifecycleState() == ResourceLifecycleState.SUSPENDED
&& isChildSuspended(deployment);

As of now, isFlinkDeploymentSuspended is always called after isChildSuspended, so it seems we don't need another call to isChildSuspended on line 454.

Contributor Author

@james-kan-shopify james-kan-shopify Jan 21, 2026

Thanks for that! Ya, that's a redundant check. Updated.

@james-kan-shopify james-kan-shopify force-pushed the jk.bg-in-place-restarts branch 3 times, most recently from eb4717a to a636429 Compare January 21, 2026 03:32
@gyfora gyfora merged commit 2621c9a into apache:main Jan 30, 2026
219 of 235 checks passed