
OCPCLOUD-2792: Aggregate controllers conditions under ClusterOperator #246


Draft · damdo wants to merge 5 commits into main from clusteroperator-aggregated-conditions

Conversation

@damdo (Member) commented Dec 13, 2024

The current ClusterOperator status conditions—Available, Progressing, Upgradeable, and Degraded—are set by the corecluster controller independently of the status of other controllers.

This approach does not align with the intended purpose of these conditions, which are meant to reflect the overall status of the operator, considering all the controllers it manages.

To address this, we should introduce controller-level conditions similar to the top-level ones. These conditions would influence an aggregated top-level status, which a new controller (the clusteroperator controller) would then consolidate into the Available, Progressing, Upgradeable, and Degraded conditions.
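
(For illustration only: a minimal sketch of what the aggregation could look like. The helper name aggregateAvailable and the availableTypes parameter are hypothetical, not the exact API introduced by this PR.)

```go
package operatorstatus

import (
	"fmt"
	"strings"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// aggregateAvailable folds per-controller Available conditions into the single
// top-level ClusterOperator Available condition: the operator is reported
// Available only when every controller reports its Available condition as True.
func aggregateAvailable(conds []configv1.ClusterOperatorStatusCondition, availableTypes []configv1.ClusterStatusConditionType) configv1.ClusterOperatorStatusCondition {
	notAvailable := []string{}

	for _, t := range availableTypes {
		ok := false
		for _, c := range conds {
			if c.Type == t && c.Status == configv1.ConditionTrue {
				ok = true
				break
			}
		}
		if !ok {
			notAvailable = append(notAvailable, string(t))
		}
	}

	if len(notAvailable) > 0 {
		return configv1.ClusterOperatorStatusCondition{
			Type:               configv1.OperatorAvailable,
			Status:             configv1.ConditionFalse,
			Reason:             "ControllersNotAvailable",
			Message:            fmt.Sprintf("The following controllers are not available: %s", strings.Join(notAvailable, ", ")),
			LastTransitionTime: metav1.Now(),
		}
	}

	return configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionTrue,
		Reason:             "AsExpected",
		LastTransitionTime: metav1.Now(),
	}
}
```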

--

I suggest reviewing by commit

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 13, 2024
openshift-ci bot (Contributor) commented Dec 13, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot (Contributor) commented Dec 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from damdo. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@damdo damdo force-pushed the clusteroperator-aggregated-conditions branch from 4a8e8dc to e074001 on January 15, 2025 17:41
@damdo damdo force-pushed the clusteroperator-aggregated-conditions branch from c7dc13a to 1f123e2 on January 23, 2025 10:08
@damdo damdo changed the title from "Aggregate controllers conditions under ClusterOperator" to "OCPCLOUD-2792: Aggregate controllers conditions under ClusterOperator" on Jan 23, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 23, 2025
openshift-ci-robot commented Jan 23, 2025

@damdo: This pull request references OCPCLOUD-2792 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

  • repurpose unsupported controller to a broader clusteroperator one
  • rename controllers struct to align them
  • standardize setting and aggregation of controllers conditions
  • standardize setting and aggregation of controllers conditions: update testing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@damdo damdo marked this pull request as ready for review January 23, 2025 10:17
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 23, 2025
@openshift-ci openshift-ci bot requested review from racheljpg and sub-mod January 23, 2025 10:17
@damdo (Member, Author) commented Jan 23, 2025

/assign @JoelSpeed @theobarberbany @nrb

@damdo damdo force-pushed the clusteroperator-aggregated-conditions branch from 1f123e2 to 89bde08 on January 23, 2025 11:31
@damdo damdo force-pushed the clusteroperator-aggregated-conditions branch from 4d1da92 to 1736044 on January 23, 2025 16:00
@damdo (Member, Author) commented Jan 23, 2025

/test unit

} else {
// The ClusterOperator Controller must run in all cases.
Contributor:

I think this comment should be above the setupClusterOperatorController call.

@@ -298,24 +298,24 @@ func setupReconcilers(mgr manager.Manager, infra *configv1.Infrastructure, platf
Platform: platform,
Infra: infra,
}).SetupWithManager(mgr); err != nil {
klog.Error(err, "unable to create controller", "controller", "CoreCluster")
klog.Error(err, "unable to create core cluster controller", "controller", "CoreCluster")
Contributor:

I'm probably missing context, but each of the klog.Error calls here repeats the controller's name in both the message and the "controller" key, which seems redundant. Do we have a special handler installed? I don't see it in this file.

Contributor:

It's a structured logging format here, so "controller" adds a new key and the next string is its value. The logging through this function makes sense given the structured logging context.
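
(For anyone unfamiliar with the pattern, a tiny standalone illustration of klog's structured key/value form; this uses ErrorS and is not meant to suggest changing the calls in the diff above.)

```go
package main

import (
	"errors"

	"k8s.io/klog/v2"
)

func main() {
	err := errors.New("example failure")

	// The message stays constant; "controller"/"CoreCluster" are emitted as a
	// key/value pair rather than being concatenated into the message text.
	klog.ErrorS(err, "unable to create controller", "controller", "CoreCluster")
	// Prints roughly: "unable to create controller" err="example failure" controller="CoreCluster"
}
```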

sourceNamespace := flag.String(
"source-namespace",
controllers.DefaultMAPIManagedNamespace,
"The namespace where MAPI components will run.",
Contributor:

The code appears to only use it for secrets; are we actually getting other MAPI components with it? If not, the variable name could be a little more descriptive up here.


// setControllerConditionDegraded sets the InfraClusterController conditions to a degraded state.
//
//nolint:unused
Contributor:

I'm reviewing commit-wise, so this might be used later in the PR, but if not, we should delete it and add it when it's actually called.

Contributor:

Yeah, if this isn't used, let's dump it.

{LastTransitionTime: metav1.Now(), Type: operatorstatus.KubeconfigControllerAvailableCondition, Status: configv1.ConditionTrue, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.KubeconfigControllerDegradedCondition, Status: configv1.ConditionFalse, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.CapiInstallerControllerAvailableCondition, Status: configv1.ConditionTrue, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.CapiInstallerControllerDegradedCondition, Status: configv1.ConditionTrue, Reason: "Error", Message: "reconciler error"},
Contributor:

Maybe put a comment at the end of this line to indicate it's the one that's degraded, since these lines are all so similar. Otherwise it's commented 24 lines away.

{LastTransitionTime: metav1.Now(), Type: operatorstatus.KubeconfigControllerDegradedCondition, Status: configv1.ConditionFalse, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.CapiInstallerControllerAvailableCondition, Status: configv1.ConditionTrue, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.CapiInstallerControllerDegradedCondition, Status: configv1.ConditionFalse, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.SecretSyncControllerAvailableCondition, Status: configv1.ConditionFalse, Reason: "Error", Message: "persistent reconcile error"},
Contributor:

Similar here, perhaps mark the line that's expected to trigger the desired test state


// getFeatureGates is used to fetch the current feature gates from the cluster.
// We use this to check if the machine api migration is actually enabled or not.
func getFeatureGates(mgr ctrl.Manager) (featuregates.FeatureGateAccess, error) {
Contributor:

Not really a review of this code, but I'm a little surprised we don't have a helper for this in https://github.com/openshift/library-go/blob/4ea50293b28af3d28f547c670f52c175af9e4427/pkg/features/features.go.

Contributor:

Perhaps we should 🤔

Member (Author):

I thought the same when touching this function. And I agree :D

@nrb (Contributor) left a comment

Overall, it looks good. Thanks for the detailed tests!

I would strongly suggest removing any function that's marked //nolint:unused, even though it might be useful in the future. https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it

@JoelSpeed (Contributor) left a comment

Reading through this, it appears that several of our controllers can never go degraded. Is that really true?

I'm also not sure I like the way we are tying the Available and Degraded conditions together. It appears there is no way for us to ever say that we are not available. I would expect that to be true after cluster bootstrap, but if we are Degraded=True on the first reconcile, we also claim we are Available, which is likely misleading.

I would expect a flow through the controller that mimics what CAPI does to be more accurate, for example:

1. Fetch clusteroperator at top of Reconcile
2. Extract conditions owned by our SSA manager name
3. Set up defer to update (via SSA) the conditions on the clusteroperator
4. If Available condition is not set, set it to false
5. As we traverse the function, update the list of extracted conditions with individual and specific conditions
6. Only once we know we are available, set available true

With this, we would have an accurate view of the available condition, and the degraded condition could be handled separately where appropriate.

It would also make it easier in the future to add Upgradeable into the mix for the controllers if and when we need that (a rough sketch of this flow follows below).
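
(A rough sketch of the flow described above, assuming a controller-runtime reconciler; extractOwnedConditions, applyConditions, and the other helpers are hypothetical placeholders rather than the PR's actual code.)

```go
package clusteroperator

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type ClusterOperatorReconciler struct {
	client.Client
}

func (r *ClusterOperatorReconciler) Reconcile(ctx context.Context, req ctrl.Request) (result ctrl.Result, retErr error) {
	// 1. Fetch the ClusterOperator at the top of Reconcile.
	co := &configv1.ClusterOperator{}
	if err := r.Get(ctx, req.NamespacedName, co); err != nil {
		return ctrl.Result{}, fmt.Errorf("failed to get ClusterOperator: %w", err)
	}

	// 2. Start from the conditions owned by our field manager (hypothetical helper).
	conds := extractOwnedConditions(co)

	// 3. Defer the status update so every return path writes the conditions back
	//    (in the real thing this would be a server-side apply with our manager name).
	defer func() {
		if err := r.applyConditions(ctx, co, conds); err != nil && retErr == nil {
			retErr = err
		}
	}()

	// 4. Default Available to False until the reconcile proves otherwise.
	if !hasCondition(conds, configv1.OperatorAvailable) {
		conds = setCondition(conds, configv1.OperatorAvailable, configv1.ConditionFalse, "Initializing", "")
	}

	// 5. Individual steps of the reconcile would set their own specific conditions
	//    (Degraded, Progressing, ...) here as they run.

	// 6. Only once everything is known to be healthy, report Available=True.
	conds = setCondition(conds, configv1.OperatorAvailable, configv1.ConditionTrue, "AsExpected", "")

	return ctrl.Result{}, nil
}

// The helpers below are simplified stand-ins for illustration only.

func extractOwnedConditions(co *configv1.ClusterOperator) []configv1.ClusterOperatorStatusCondition {
	return append([]configv1.ClusterOperatorStatusCondition{}, co.Status.Conditions...)
}

func (r *ClusterOperatorReconciler) applyConditions(ctx context.Context, co *configv1.ClusterOperator, conds []configv1.ClusterOperatorStatusCondition) error {
	co.Status.Conditions = conds
	return r.Status().Update(ctx, co)
}

func hasCondition(conds []configv1.ClusterOperatorStatusCondition, t configv1.ClusterStatusConditionType) bool {
	for _, c := range conds {
		if c.Type == t {
			return true
		}
	}
	return false
}

func setCondition(conds []configv1.ClusterOperatorStatusCondition, t configv1.ClusterStatusConditionType, s configv1.ConditionStatus, reason, msg string) []configv1.ClusterOperatorStatusCondition {
	newCond := configv1.ClusterOperatorStatusCondition{Type: t, Status: s, Reason: reason, Message: msg, LastTransitionTime: metav1.Now()}
	for i := range conds {
		if conds[i].Type == t {
			conds[i] = newCond
			return conds
		}
	}
	return append(conds, newCond)
}
```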

setupWebhooks(mgr)
case configv1.AzurePlatformType:
azureCloudEnvironment := getAzureCloudEnvironment(infra.Status.PlatformStatus)
if azureCloudEnvironment == configv1.AzureStackCloud {
klog.Infof("Detected Azure Cloud Environment %q on platform %q is not supported, skipping capi controllers setup", azureCloudEnvironment, platform)
setupUnsupportedController(mgr, managedNamespace)
setupClusterOperatorController(mgr, managedNamespace, currentFeatureGates, true)
Contributor:

Would it make sense to set up the clusteroperator controller before the switch, and remove it from setupReconcilers? Then we don't need this else, or the default case down below?

Comment on lines +320 to +321
// The ClusterOperator Controller must run under all circumstances as it manages the ClusterOperator object for this operator.
setupClusterOperatorController(mgr, managedNamespace, currentFeatureGates, false)
Contributor:

Yeah, as with the previous comment, maybe move this up a level, before the switch, to avoid calling it in three different places if it must always run.


return nil, fmt.Errorf("failed to create config client: %w", err)
}

configInformers := configinformers.NewSharedInformerFactory(configClient, 10*time.Minute)
Contributor:

Minor: This probably doesn't need to resync so quickly


availableCondition := newAvailableCondition(anyAvailableMissing, availableMsg, releaseVersion)
degradedCondition := newDegradedCondition(anyDegradedMissing, degradedMsg)
progressingCondition := newCondition(configv1.OperatorProgressing, configv1.ConditionFalse, "", "")
Contributor:

At some point the CO should be Progressing, e.g. during an upgrade of operands. I wonder why this isn't represented here?
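
(One way Progressing could be represented, sketched under the assumption that the operator knows its current and desired release versions; newProgressingConditionFor is a hypothetical helper, not the PR's code.)

```go
package operatorstatus

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newProgressingConditionFor reports Progressing=True while the operator is
// still moving its operands to a new release version, and False once the
// rollout has settled.
func newProgressingConditionFor(currentVersion, desiredVersion string) configv1.ClusterOperatorStatusCondition {
	if currentVersion != desiredVersion {
		return configv1.ClusterOperatorStatusCondition{
			Type:               configv1.OperatorProgressing,
			Status:             configv1.ConditionTrue,
			Reason:             "OperandsProgressing",
			Message:            fmt.Sprintf("Moving operands to release version %q", desiredVersion),
			LastTransitionTime: metav1.Now(),
		}
	}

	return configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorProgressing,
		Status:             configv1.ConditionFalse,
		Reason:             "AsExpected",
		Message:            fmt.Sprintf("Operands are at release version %q", desiredVersion),
		LastTransitionTime: metav1.Now(),
	}
}
```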

Comment on lines +220 to +224
{CapiInstallerControllerAvailableCondition, CapiInstallerControllerDegradedCondition},
{CoreClusterControllerAvailableCondition, CoreClusterControllerDegradedCondition},
{InfraClusterControllerAvailableCondition, InfraClusterControllerDegradedCondition},
{KubeconfigControllerAvailableCondition, KubeconfigControllerDegradedCondition},
{SecretSyncControllerAvailableCondition, SecretSyncControllerDegradedCondition},
Contributor:

Since not all of these have any reason to be degraded, do we require them all to always set a degraded condition?

Would it be better to factor this some way so that we only actually have conditions that are likely to change over time?

Comment on lines +248 to +253
switch {
case availableCondition.Status == "True" && degradedCondition.Status == "False":
upgradeableStatus = configv1.ConditionTrue
case availableCondition.Status == "False" || degradedCondition.Status == "True":
upgradeableStatus = configv1.ConditionFalse
}
Contributor:

Is this based on some document explaining how the conditions should be working?

return newCondition(configv1.OperatorAvailable, configv1.ConditionTrue, ReasonAsExpected, fmt.Sprintf("Cluster CAPI Operator is available at %s", releaseVersion))
}

return newCondition(configv1.OperatorAvailable, configv1.ConditionFalse, "ControllersNotAvailable", fmt.Sprintf("The following controllers available conditions are not as expected: %s", strings.Join(messages, ", ")))
Contributor:

As far as I could tell from reviewing the other controllers, we will never report Available=False, which I think is probably an error, right?

Status: status,
Reason: reason,
Message: message,
LastTransitionTime: metav1.Now(),
Contributor:

Does the code applying these take care of making sure LastTransitionTime doesn't tick up on every reconcile?
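
(A hand-rolled sketch of the usual pattern for this: only move LastTransitionTime when the Status value actually changes, so steady-state reconciles keep the old timestamp. This is illustrative and may not match the helper the PR actually uses.)

```go
package operatorstatus

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setStatusCondition upserts newCond into conds, bumping LastTransitionTime
// only when the condition's Status actually transitions; a reconcile that
// re-applies an unchanged status keeps the previous timestamp.
func setStatusCondition(conds []configv1.ClusterOperatorStatusCondition, newCond configv1.ClusterOperatorStatusCondition) []configv1.ClusterOperatorStatusCondition {
	if newCond.LastTransitionTime.IsZero() {
		newCond.LastTransitionTime = metav1.Now()
	}

	for i := range conds {
		if conds[i].Type != newCond.Type {
			continue
		}
		if conds[i].Status == newCond.Status {
			// No transition: preserve the existing timestamp.
			newCond.LastTransitionTime = conds[i].LastTransitionTime
		}
		conds[i] = newCond
		return conds
	}

	return append(conds, newCond)
}
```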

@damdo (Member, Author) commented Jan 24, 2025

Reading through this, it appears that several of our controllers can never go degraded. Is that really true?
I'm also not sure I like the way we are tying the Available and Degraded conditions together. It appears there is no way for us to ever say that we are not available. I would expect that to be true after cluster bootstrap, but if we are Degraded=True on the first reconcile, we also claim we are Available, which is likely misleading.

@JoelSpeed I agree with your points here. In fact this was my goal when writing this code: to put in place the logic for setting controller-level conditions and aggregating them at the operator level (this is why, for example, I temporarily left in setControllerConditionDegraded even though it is not used ATM).

I didn't want to necessarily decide myself where we should be degraded, available, or their counterparts in each controller.
That's why, for example, I didn't set degraded or unavailable conditions in each controller, but only the happy ones.
I would like to have an open discussion on where to set which conditions in each controller.

I'm also happy to pull in folks from the CVO/upgrades team if we want to get their opinion on what's the best posture in terms of conditions at various moments of the operator/cluster lifecycle.

openshift-ci bot (Contributor) commented Apr 15, 2025

@damdo: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-serial 1736044 link true /test e2e-aws-ovn-serial
ci/prow/e2e-aws-ovn-techpreview 1736044 link true /test e2e-aws-ovn-techpreview
ci/prow/e2e-azure-ovn-techpreview 1736044 link false /test e2e-azure-ovn-techpreview
ci/prow/e2e-gcp-ovn-techpreview 1736044 link true /test e2e-gcp-ovn-techpreview
ci/prow/e2e-openstack-ovn-techpreview 1736044 link true /test e2e-openstack-ovn-techpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 15, 2025
openshift-merge-robot (Contributor) commented:

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@damdo damdo marked this pull request as draft April 15, 2025 14:06
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 15, 2025