[FLINK-37515] Basic support for Blue/Green deployments #969

Open
wants to merge 20 commits into
base: release-1.11
from

Conversation

schongloo

What is the purpose of the change

This pull request adds basic support for Blue/Green deployments as outlined by FLIP-503 (https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337677648).

Brief change log

  • A new FlinkBlueGreenDeploymentController that manages the lifecycle of these deployments.
  • A new corresponding CRD is introduced.
  • The new FlinkBlueGreenDeploymentController manages this CRD and hides the details of the actual Blue/Green (Active/StandBy) jobs from the user.
  • The lifecycle of the actual jobs is delegated to the existing FlinkDeployment controller.

Verifying this change

  • Test to verify a single basic deployment
  • Test to verify a proper transition Blue -> Green
  • Test to verify correct behavior when encountering an error before a transition
  • Test to verify correct behavior when encountering an error during a transition

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., are there any changes to the CustomResourceDescriptors: no
  • Core observer or reconciler logic that is regularly executed: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? docs / JavaDocs. UML diagrams for all scenarios added under docs/content/docs

@davidradl

I am concerned about the number of TODOs. Things like dealing with errors and timing conditions should be in place in the initial drop of code - unless we can mark this capability as beta or the like. WDYT @gyfora ?

@schongloo schongloo requested a review from davidradl April 11, 2025 17:41
@schongloo
Author

@davidradl thanks for your review and comments. I've cleaned up all the outdated or invalid TODOs, which pointed me to one missing unit test that I have now added. At this point the few remaining TODOs do not interfere with the main functionality and are either non-critical or open for discussion.

… BASIC is supported). Tests to confirm the deployments are deleted accurately after the specified deletion delay.
@schongloo schongloo marked this pull request as draft April 15, 2025 18:06
@hansh0801

👍

@hansh0801

When is this feature going to be released?

@schongloo schongloo marked this pull request as ready for review April 30, 2025 13:22
@schongloo
Author

Hi @hansh0801, trying to release it ASAP but I also want to make sure the logic and contracts are as robust as possible to minimize changes later. Looks pretty stable in my current testing and I've reopened it for review. Thanks!

@hansh0801

Cool! I'm also waiting for Blue/Green phase 2 :)
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=337677650

@schongloo
Author

@hansh0801 indeed... Phase 2 will take a little longer due to its complexity, but I'm already working on it.

@schongloo schongloo changed the base branch from release-1.11 to main April 30, 2025 21:14
@schongloo schongloo changed the base branch from main to release-1.11 April 30, 2025 22:33
private long deploymentReadyTimestamp;

/** Information about the TaskManagers for the scale subresource. */
private TaskManagerInfo taskManager;
Contributor

Do we want to support this? I feel that the scale logic based on the TM count is not really a used feature for regular FlinkDeployments either.

Contributor

We could remove it for now and only add it if there is a need.

Comment on lines +112 to +119
if (deploymentStatus == null) {
    deploymentStatus = new FlinkBlueGreenDeploymentStatus();
    return patchStatusUpdateControl(
                    bgDeployment,
                    deploymentStatus,
                    FlinkBlueGreenDeploymentState.INITIALIZING_BLUE,
                    null)
            .rescheduleAfter(100);
Contributor

The BlueGreenDeployment could simply implement initStatus and this can be removed completely.

Comment on lines 575 to 583
private boolean hasSpecChanged(
        FlinkBlueGreenDeploymentSpec newSpec, FlinkBlueGreenDeploymentStatus deploymentStatus) {

    String lastReconciledSpec = deploymentStatus.getLastReconciledSpec();
    String newSpecSerialized = SpecUtils.serializeObject(newSpec, "spec");

    // TODO: in FLIP-504 check here the TransitionMode has not been changed

    return !lastReconciledSpec.equals(newSpecSerialized);
Contributor

We may want to use the DiffBuilder or other existing utilities to get the type of the spec change here and make sure we don't trigger deployments on non-upgrade changes.

I believe things like operator config changes should not go through the Blue/Green flow but should simply be applied to the currently active deployment.

Author

Sure, let me get familiar with the DiffBuilder and see how to incorporate it
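
One way to read the suggestion above, as a rough sketch and not the operator's actual DiffBuilder API: compare the old and new spec field by field, and only start a Blue/Green transition when a changed field is not in a set of fields assumed safe to apply in place. The names `SpecChangeSketch`, `ChangeType`, `IN_PLACE_FIELDS`, and the field keys are hypothetical, and specs are modeled as flat maps purely for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Hypothetical sketch; not the operator's real diff utilities. */
final class SpecChangeSketch {

    /** Hypothetical classification of a spec change. */
    enum ChangeType { NONE, IN_PLACE, TRANSITION }

    /** Assumption: keys whose changes can be applied to the active deployment. */
    static final Set<String> IN_PLACE_FIELDS = Set.of("operatorConfig", "logLevel");

    static ChangeType classify(Map<String, String> oldSpec, Map<String, String> newSpec) {
        // Look at every key present in either version of the spec.
        Set<String> keys = new HashSet<>(oldSpec.keySet());
        keys.addAll(newSpec.keySet());

        ChangeType result = ChangeType.NONE;
        for (String key : keys) {
            if (Objects.equals(oldSpec.get(key), newSpec.get(key))) {
                continue; // unchanged field
            }
            if (!IN_PLACE_FIELDS.contains(key)) {
                // Any change outside the in-place set requires a full transition.
                return ChangeType.TRANSITION;
            }
            result = ChangeType.IN_PLACE;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> current = Map.of("image", "job:1", "logLevel", "INFO");
        System.out.println(classify(current, current));
        System.out.println(classify(current, Map.of("image", "job:1", "logLevel", "DEBUG")));
        System.out.println(classify(current, Map.of("image", "job:2", "logLevel", "INFO")));
    }
}
```

Returning early on the first transition-worthy field keeps the strongest classification, so a mixed change (config plus image) still goes through the Blue/Green flow.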

Comment on lines +660 to +663
boolean deleted =
        deletedStatus.size() == 1
                && deletedStatus.get(0).getKind().equals("FlinkDeployment");

Contributor

Do we want to wait here until the deletion has actually finished (i.e., the object is no longer there)? That would require some additional logic here.

Author

I'd rather not introduce waiting logic. I immediately reschedule a reconciliation afterwards, which will re-execute this portion if the deployment is not yet gone.
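
The approach described in this reply can be sketched as an idempotent step: request the delete, and if the child still exists, ask to be reconciled again shortly instead of blocking. This is a simplified, self-contained model; `ChildStore`, `StepResult`, and the 500 ms delay are hypothetical stand-ins, not the operator's or the Java Operator SDK's types.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of "delete, then re-check on the next reconciliation". */
final class DeleteWithoutWaitingSketch {

    /** Stand-in for the cluster: deletion is requested first and completes later. */
    static final class ChildStore {
        private final Set<String> existing = new HashSet<>();
        private final Set<String> terminating = new HashSet<>();

        void create(String name) { existing.add(name); }

        boolean exists(String name) { return existing.contains(name); }

        void requestDelete(String name) {
            if (existing.contains(name)) {
                terminating.add(name);
            }
        }

        /** Simulates the cluster finishing all pending deletions. */
        void finishDeletions() {
            existing.removeAll(terminating);
            terminating.clear();
        }
    }

    /** Outcome of one reconciliation step. */
    static final class StepResult {
        final boolean done;
        final long rescheduleAfterMs;

        StepResult(boolean done, long rescheduleAfterMs) {
            this.done = done;
            this.rescheduleAfterMs = rescheduleAfterMs;
        }
    }

    /** One non-blocking step; safe to call repeatedly until it reports done. */
    static StepResult reconcileDeletion(ChildStore store, String child) {
        if (!store.exists(child)) {
            return new StepResult(true, 0);
        }
        store.requestDelete(child); // repeating the request is harmless
        return new StepResult(false, 500); // assumed delay before the next check
    }

    public static void main(String[] args) {
        ChildStore store = new ChildStore();
        store.create("blue");
        StepResult first = reconcileDeletion(store, "blue");
        store.finishDeletions();
        StepResult second = reconcileDeletion(store, "blue");
        System.out.println(first.done + " then " + second.done);
    }
}
```

Because each step only inspects current state, no thread is held while the deletion completes, which matches the intent of rescheduling over waiting.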

TRANSITIONING_TO_BLUE,

/** Identifies the system is transitioning from "Blue" to "Green". */
TRANSITIONING_TO_GREEN,

What state are we in during shutdown?

    return objectMapper.writeValueAsString(wrapper);
} catch (JsonProcessingException e) {
    throw new RuntimeException(
            "Could not serialize " + wrapperKey + ", this indicates a bug...", e);

nit: should we include object.toString() in the error as well?
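
A minimal sketch of what this nit could look like, using only the standard library so it runs on its own. The helper `serializeOrFail` and the `Callable` stand-in for the real serializer are hypothetical; the PR's code uses an `objectMapper` instead.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of an error message that also names the failing object. */
final class SerializeErrorSketch {

    static String serializeOrFail(String wrapperKey, Object object, Callable<String> serializer) {
        try {
            return serializer.call();
        } catch (Exception e) {
            // String concatenation uses String.valueOf, so a null object is safe here.
            throw new RuntimeException(
                    "Could not serialize " + wrapperKey + " (" + object
                            + "), this indicates a bug...",
                    e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializeOrFail("spec", "demo", () -> "{\"spec\":\"demo\"}"));
    }
}
```

One trade-off to weigh: a spec's string form can be large or contain sensitive values, so including it in the message may not always be desirable.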

@@ -101,6 +103,7 @@ rules:
- apiGroups:
  - flink.apache.org
  resources:
  - flinkbluegreendeployments/status
  - flinkdeployments/status

This may be a basic question: are flinkbluegreendeployments a type of flinkdeployments? Would we expect to see them as flinkdeployments?

Author

That was the original idea, but we shifted to a has-a relationship rather than an is-a one.
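
To illustrate the distinction in plain Java, here is a small sketch of a has-a design: the parent does not extend the child type, it holds references to two children and tracks which one is active. All names here (`BlueGreenParent`, `ChildDeployment`, the `-blue`/`-green` suffixes) are hypothetical and do not come from the PR.

```java
/** Hypothetical illustration of has-a (composition) instead of is-a (inheritance). */
final class HasARelationshipSketch {

    /** Stand-in for a regular deployment resource. */
    static final class ChildDeployment {
        final String name;

        ChildDeployment(String name) {
            this.name = name;
        }
    }

    /** The parent owns two children; it is not itself a ChildDeployment. */
    static final class BlueGreenParent {
        private final ChildDeployment blue;
        private final ChildDeployment green;
        private boolean greenActive;

        BlueGreenParent(String baseName) {
            // Assumed naming scheme, for illustration only.
            this.blue = new ChildDeployment(baseName + "-blue");
            this.green = new ChildDeployment(baseName + "-green");
        }

        ChildDeployment active() {
            return greenActive ? green : blue;
        }

        void switchToGreen() {
            greenActive = true;
        }
    }

    public static void main(String[] args) {
        BlueGreenParent parent = new BlueGreenParent("orders");
        System.out.println(parent.active().name);
        parent.switchToGreen();
        System.out.println(parent.active().name);
    }
}
```

Under this model, listing child deployments would not return the parent, which is consistent with the two resource kinds being listed separately in the RBAC rules above.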


4 participants