Skip to content

FLIP-504: Blue/Green Deployments for Flink on Kubernetes - Phase 2 (WIP)#1043

Draft
schongloo wants to merge 8 commits into
apache:mainfrom
schongloo:flip-504
Draft

FLIP-504: Blue/Green Deployments for Flink on Kubernetes - Phase 2 (WIP)#1043
schongloo wants to merge 8 commits into
apache:mainfrom
schongloo:flip-504

Conversation

@schongloo

@schongloo schongloo commented Dec 5, 2025

Copy link
Copy Markdown
Contributor

What is the purpose of the change

This draft PR presents the initial idea for Advanced Coordination for Blue/Green Deployments.

FLIP-503 introduced the Basic Blue/Green Deployment functionality to the Flink K8s Operator. It was very straightforward, simply transitioning to the second deployment once it's considered stable. This Advanced version brings about the notion of "record-level" coordination between the 2 deployments to have no data duplication and exactly once semantics while preserving a smooth transition.

This functionality is NOT ready for the general public as some edge cases have only simple workarounds or remain unaddressed. Some examples of this are:

  • From the client-side, the GateProcessFunction simplifies the communication/writing to the ConfigMap from only 1 subtask; however if there's no data traffic flowing through that subtask index, no communication from the pipeline will occur.

The main goals of this Draft PR are:

  • For the community to take a quick look at the current functionality already mentioned at the Flink Forward 2025 Conference
  • To get feedback and improvement suggestions

Brief change log

  • Introduced a new TransitionMode in the FlinkDeploymentTemplateSpec (BASIC/ADVANCED)
  • In this mode the operator declares a ConfigMap to drive and communicate with each "pair" of Blue/Green Flink jobs
  • New GateProcessFunction abstract class defines the overall contract to implement record level filtering.
  • Client-side pipelines will communicate with the Blue/Green Controller via the aforementioned ConfigMap via a new (concrete) WatermarkGateProcessFunction, which in turn controls the flow of the records.
  • *New module flink-kubernetes-operator-bluegreen-client, a distributable artifact with the client-side API
  • *New module flink-kubernetes-operator-bluegreen-agent, a JVM agent that transparently injects gate operators into any Flink Application-mode job — no user code changes required

Transparent Gate Injection

When transitionMode: ADVANCED is used and bluegreen.gate.strategy is set in flinkConfiguration, the operator automatically:

  1. Injects the -javaagent flag into the JobManager JVM options
  2. Adds an init container to the JobManager pod that copies the agent JAR from the operator image into a shared volume, making it available at /opt/flink/lib/bluegreen-agent.jar
  3. Sets bluegreen.gate.injection.enabled: "true" to activate injection at runtime

The following properties are set transparently by the operator and do not need to be specified by the user:

  • bluegreen.gate.injection.enabled
  • env.java.opts.jobmanager (the -javaagent flag is appended automatically)

The only user-facing configuration needed is:

  flinkConfiguration:
    bluegreen.gate.strategy: "WATERMARK"
    bluegreen.gate.watermark.extractor-class: "com.example.MyWatermarkExtractor"

Gate injection position controls where in the StreamGraph the gate operator is inserted:

  • BEFORE_SINK (default) — gate is placed immediately before each sink. Suitable for single-sink pipelines.
  • AFTER_SOURCE — gate is placed immediately after each source. Preferred for fan-out DAGs where multiple sinks share upstream operators, avoiding redundant gate instances.
  flinkConfiguration:
    bluegreen.gate.injection.position: "AFTER_SOURCE"  # optional, default is BEFORE_SINK

To opt out of automatic injection and place the Gate programmatically, only specify bluegreen.gate.injection.enabled: "false" explicitly.

Verifying this change

This change added tests and can be verified as follows:

  • Added a set of Unit Tests under WatermarkGateProcessFunctionTest

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changes to the CustomResourceDescriptors: (yes / no)
  • Core observer or reconciler logic that is regularly executed: (yes / no)

Documentation

image image

@schongloo

Copy link
Copy Markdown
Contributor Author

FYI @drossos @ryanvanhuuksloot
CC @gyfora

Additional defensive configMap keys check
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant