
OCPCLOUD-2792: Aggregate controllers conditions under ClusterOperator #246


Draft · damdo wants to merge 5 commits into main from clusteroperator-aggregated-conditions

Conversation

@damdo (Member) commented Dec 13, 2024

The current ClusterOperator status conditions—Available, Progressing, Upgradeable, and Degraded—are set by the corecluster controller independently of the status of other controllers.

This approach does not align with the intended purpose of these conditions, which are meant to reflect the overall status of the operator, considering all the controllers it manages.

To address this, we should introduce controller-level conditions similar to the top-level ones. These conditions would influence an aggregated top-level status, which a new controller (the clusteroperator controller) would then consolidate into the Available, Progressing, Upgradeable, and Degraded conditions.
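
(For illustration only: a minimal sketch of what the aggregation could look like. The helper name aggregateAvailable and the availableTypes parameter are hypothetical, not the exact API introduced by this PR.)

```go
package operatorstatus

import (
	"fmt"
	"strings"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// aggregateAvailable folds per-controller Available conditions into the single
// top-level ClusterOperator Available condition: the operator is reported
// Available only when every controller reports its Available condition as True.
func aggregateAvailable(conds []configv1.ClusterOperatorStatusCondition, availableTypes []configv1.ClusterStatusConditionType) configv1.ClusterOperatorStatusCondition {
	notAvailable := []string{}

	for _, t := range availableTypes {
		ok := false
		for _, c := range conds {
			if c.Type == t && c.Status == configv1.ConditionTrue {
				ok = true
				break
			}
		}
		if !ok {
			notAvailable = append(notAvailable, string(t))
		}
	}

	if len(notAvailable) > 0 {
		return configv1.ClusterOperatorStatusCondition{
			Type:               configv1.OperatorAvailable,
			Status:             configv1.ConditionFalse,
			Reason:             "ControllersNotAvailable",
			Message:            fmt.Sprintf("The following controllers are not available: %s", strings.Join(notAvailable, ", ")),
			LastTransitionTime: metav1.Now(),
		}
	}

	return configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorAvailable,
		Status:             configv1.ConditionTrue,
		Reason:             "AsExpected",
		LastTransitionTime: metav1.Now(),
	}
}
```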

--

I suggest reviewing by commit

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 13, 2024
openshift-ci bot (Contributor) commented Dec 13, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot (Contributor) commented Dec 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from damdo. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@damdo damdo force-pushed the clusteroperator-aggregated-conditions branch from 4a8e8dc to e074001 on January 15, 2025 17:41
@damdo damdo force-pushed the clusteroperator-aggregated-conditions branch from c7dc13a to 1f123e2 on January 23, 2025 10:08
@damdo damdo changed the title from "Aggregate controllers conditions under ClusterOperator" to "OCPCLOUD-2792: Aggregate controllers conditions under ClusterOperator" on Jan 23, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 23, 2025
openshift-ci-robot commented Jan 23, 2025

@damdo: This pull request references OCPCLOUD-2792 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

  • repurpose unsupported controller to a broader clusteroperator one
  • rename controllers struct to align them
  • standardize setting and aggregation of controllers conditions
  • standardize setting and aggregation of controllers conditions: update testing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@damdo damdo marked this pull request as ready for review January 23, 2025 10:17
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 23, 2025
@openshift-ci openshift-ci bot requested review from racheljpg and sub-mod January 23, 2025 10:17
@damdo (Member, Author) commented Jan 23, 2025

/assign @JoelSpeed @theobarberbany @nrb

@damdo damdo force-pushed the clusteroperator-aggregated-conditions branch from 1f123e2 to 89bde08 on January 23, 2025 11:31
@damdo damdo force-pushed the clusteroperator-aggregated-conditions branch from 4d1da92 to 1736044 on January 23, 2025 16:00
@damdo (Member, Author) commented Jan 23, 2025

/test unit

} else {
// The ClusterOperator Controller must run in all cases.
Contributor:

I think this comment should be above the setupClusterOperatorController call.

@@ -298,24 +298,24 @@ func setupReconcilers(mgr manager.Manager, infra *configv1.Infrastructure, platf
Platform: platform,
Infra: infra,
}).SetupWithManager(mgr); err != nil {
klog.Error(err, "unable to create controller", "controller", "CoreCluster")
klog.Error(err, "unable to create core cluster controller", "controller", "CoreCluster")
Contributor:

I'm probably missing context, but each of the klog.Error calls here repeats the controller's name in both the message and the "controller" key, which seems redundant. Do we have a special handler installed? I don't see it in this file.

Contributor:

It's a structured logging format here, so "controller" adds a new key and the next string is its value. The logging through this function makes sense given the structured logging context.
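
(For anyone unfamiliar with the pattern, a tiny standalone illustration of klog's structured key/value form; this uses ErrorS and is not meant to suggest changing the calls in the diff above.)

```go
package main

import (
	"errors"

	"k8s.io/klog/v2"
)

func main() {
	err := errors.New("example failure")

	// The message stays constant; "controller"/"CoreCluster" are emitted as a
	// key/value pair rather than being concatenated into the message text.
	klog.ErrorS(err, "unable to create controller", "controller", "CoreCluster")
	// Prints roughly: "unable to create controller" err="example failure" controller="CoreCluster"
}
```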

sourceNamespace := flag.String(
"source-namespace",
controllers.DefaultMAPIManagedNamespace,
"The namespace where MAPI components will run.",
Contributor:

The code appears to only use it for secrets; are we actually getting other MAPI components with it? If not, the variable name could be a little more descriptive up here.


// setControllerConditionDegraded sets the InfraClusterController conditions to a degraded state.
//
//nolint:unused
Contributor:

I'm reviewing commit-wise, so this might be used later in the PR, but if not, we should delete it and add it when it's actually called.

Contributor:

Yeah, if this isn't used, let's dump it.

{LastTransitionTime: metav1.Now(), Type: operatorstatus.KubeconfigControllerAvailableCondition, Status: configv1.ConditionTrue, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.KubeconfigControllerDegradedCondition, Status: configv1.ConditionFalse, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.CapiInstallerControllerAvailableCondition, Status: configv1.ConditionTrue, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.CapiInstallerControllerDegradedCondition, Status: configv1.ConditionTrue, Reason: "Error", Message: "reconciler error"},
Contributor:

Maybe put a comment at the end of this line to indicate it's the one that's degraded, since these lines are all so similar. Otherwise it's commented 24 lines away.

{LastTransitionTime: metav1.Now(), Type: operatorstatus.KubeconfigControllerDegradedCondition, Status: configv1.ConditionFalse, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.CapiInstallerControllerAvailableCondition, Status: configv1.ConditionTrue, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.CapiInstallerControllerDegradedCondition, Status: configv1.ConditionFalse, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.SecretSyncControllerAvailableCondition, Status: configv1.ConditionFalse, Reason: "Error", Message: "persistent reconcile error"},
Contributor:

Similar here, perhaps mark the line that's expected to trigger the desired test state


// getFeatureGates is used to fetch the current feature gates from the cluster.
// We use this to check if the machine api migration is actually enabled or not.
func getFeatureGates(mgr ctrl.Manager) (featuregates.FeatureGateAccess, error) {
Contributor:

Not really a review of this code, but I'm a little surprised we don't have a helper for this in https://github.com/openshift/library-go/blob/4ea50293b28af3d28f547c670f52c175af9e4427/pkg/features/features.go.

Contributor:

Perhaps we should 🤔

Member (Author):

I thought the same when touching this function. And I agree :D

@nrb (Contributor) left a comment

Overall, it looks good. Thanks for the detailed tests!

I would strongly suggest removing any function that's marked //nolint:unused, even though it might be useful in the future. https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it

@JoelSpeed (Contributor) left a comment

Reading through this, it appears that several of our controllers can never go degraded. Is that really true?

I'm also not sure I like the way we are tying the Available and Degraded conditions together. It appears there is no way for us to ever say that we are not available. I would expect that to be true after cluster bootstrap, but if we are Degraded=True on the first reconcile, we also claim we are Available, which is likely misleading.

I would expect a flow through the controller that mimics what CAPI does to be more accurate, for example:

1. Fetch clusteroperator at top of Reconcile
2. Extract conditions owned by our SSA manager name
3. Set up defer to update (via SSA) the conditions on the clusteroperator
4. If Available condition is not set, set it to false
5. As we traverse the function, update the list of extracted conditions with individual and specific conditions
6. Only once we know we are available, set available true

With this, we would have an accurate view of the available condition, and the degraded condition could be handled separately where appropriate.

It would also make it easier in the future to add Upgradeable into the mix for the controllers if and when we need that (a rough sketch of this flow follows below).
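
(A rough sketch of the flow described above, assuming a controller-runtime reconciler; extractOwnedConditions, applyConditions, and the other helpers are hypothetical placeholders rather than the PR's actual code.)

```go
package clusteroperator

import (
	"context"
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type ClusterOperatorReconciler struct {
	client.Client
}

func (r *ClusterOperatorReconciler) Reconcile(ctx context.Context, req ctrl.Request) (result ctrl.Result, retErr error) {
	// 1. Fetch the ClusterOperator at the top of Reconcile.
	co := &configv1.ClusterOperator{}
	if err := r.Get(ctx, req.NamespacedName, co); err != nil {
		return ctrl.Result{}, fmt.Errorf("failed to get ClusterOperator: %w", err)
	}

	// 2. Start from the conditions owned by our field manager (hypothetical helper).
	conds := extractOwnedConditions(co)

	// 3. Defer the status update so every return path writes the conditions back
	//    (in the real thing this would be a server-side apply with our manager name).
	defer func() {
		if err := r.applyConditions(ctx, co, conds); err != nil && retErr == nil {
			retErr = err
		}
	}()

	// 4. Default Available to False until the reconcile proves otherwise.
	if !hasCondition(conds, configv1.OperatorAvailable) {
		conds = setCondition(conds, configv1.OperatorAvailable, configv1.ConditionFalse, "Initializing", "")
	}

	// 5. Individual steps of the reconcile would set their own specific conditions
	//    (Degraded, Progressing, ...) here as they run.

	// 6. Only once everything is known to be healthy, report Available=True.
	conds = setCondition(conds, configv1.OperatorAvailable, configv1.ConditionTrue, "AsExpected", "")

	return ctrl.Result{}, nil
}

// The helpers below are simplified stand-ins for illustration only.

func extractOwnedConditions(co *configv1.ClusterOperator) []configv1.ClusterOperatorStatusCondition {
	return append([]configv1.ClusterOperatorStatusCondition{}, co.Status.Conditions...)
}

func (r *ClusterOperatorReconciler) applyConditions(ctx context.Context, co *configv1.ClusterOperator, conds []configv1.ClusterOperatorStatusCondition) error {
	co.Status.Conditions = conds
	return r.Status().Update(ctx, co)
}

func hasCondition(conds []configv1.ClusterOperatorStatusCondition, t configv1.ClusterStatusConditionType) bool {
	for _, c := range conds {
		if c.Type == t {
			return true
		}
	}
	return false
}

func setCondition(conds []configv1.ClusterOperatorStatusCondition, t configv1.ClusterStatusConditionType, s configv1.ConditionStatus, reason, msg string) []configv1.ClusterOperatorStatusCondition {
	newCond := configv1.ClusterOperatorStatusCondition{Type: t, Status: s, Reason: reason, Message: msg, LastTransitionTime: metav1.Now()}
	for i := range conds {
		if conds[i].Type == t {
			conds[i] = newCond
			return conds
		}
	}
	return append(conds, newCond)
}
```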

setupWebhooks(mgr)
case configv1.AzurePlatformType:
azureCloudEnvironment := getAzureCloudEnvironment(infra.Status.PlatformStatus)
if azureCloudEnvironment == configv1.AzureStackCloud {
klog.Infof("Detected Azure Cloud Environment %q on platform %q is not supported, skipping capi controllers setup", azureCloudEnvironment, platform)
setupUnsupportedController(mgr, managedNamespace)
setupClusterOperatorController(mgr, managedNamespace, currentFeatureGates, true)
Contributor:

Would it make sense to set up the clusteroperator controller before the switch, and remove it from setupReconcilers? Then we don't need this else, or the default case down below?

Comment on lines +320 to +321
// The ClusterOperator Controller must run under all circumstances as it manages the ClusterOperator object for this operator.
setupClusterOperatorController(mgr, managedNamespace, currentFeatureGates, false)
Contributor:

Yeah, as with the previous comment, maybe move this up a level, before the switch, to avoid calling it in three different places if it must always run.


return nil, fmt.Errorf("failed to create config client: %w", err)
}

configInformers := configinformers.NewSharedInformerFactory(configClient, 10*time.Minute)
Contributor:

Minor: This probably doesn't need to resync so quickly


availableCondition := newAvailableCondition(anyAvailableMissing, availableMsg, releaseVersion)
degradedCondition := newDegradedCondition(anyDegradedMissing, degradedMsg)
progressingCondition := newCondition(configv1.OperatorProgressing, configv1.ConditionFalse, "", "")
Contributor:

At some point the CO should be Progressing, e.g. during an upgrade of operands. I wonder why this isn't represented here?
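
(One way Progressing could be represented, sketched under the assumption that the operator knows its current and desired release versions; newProgressingConditionFor is a hypothetical helper, not the PR's code.)

```go
package operatorstatus

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newProgressingConditionFor reports Progressing=True while the operator is
// still moving its operands to a new release version, and False once the
// rollout has settled.
func newProgressingConditionFor(currentVersion, desiredVersion string) configv1.ClusterOperatorStatusCondition {
	if currentVersion != desiredVersion {
		return configv1.ClusterOperatorStatusCondition{
			Type:               configv1.OperatorProgressing,
			Status:             configv1.ConditionTrue,
			Reason:             "OperandsProgressing",
			Message:            fmt.Sprintf("Moving operands to release version %q", desiredVersion),
			LastTransitionTime: metav1.Now(),
		}
	}

	return configv1.ClusterOperatorStatusCondition{
		Type:               configv1.OperatorProgressing,
		Status:             configv1.ConditionFalse,
		Reason:             "AsExpected",
		Message:            fmt.Sprintf("Operands are at release version %q", desiredVersion),
		LastTransitionTime: metav1.Now(),
	}
}
```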

Comment on lines +220 to +224
{CapiInstallerControllerAvailableCondition, CapiInstallerControllerDegradedCondition},
{CoreClusterControllerAvailableCondition, CoreClusterControllerDegradedCondition},
{InfraClusterControllerAvailableCondition, InfraClusterControllerDegradedCondition},
{KubeconfigControllerAvailableCondition, KubeconfigControllerDegradedCondition},
{SecretSyncControllerAvailableCondition, SecretSyncControllerDegradedCondition},
Contributor:

Since not all of these have any reason to be degraded, do we require them all to always set a degraded condition?

Would it be better to factor this some way so that we only actually have conditions that are likely to change over time?

Comment on lines +248 to +253
switch {
case availableCondition.Status == "True" && degradedCondition.Status == "False":
upgradeableStatus = configv1.ConditionTrue
case availableCondition.Status == "False" || degradedCondition.Status == "True":
upgradeableStatus = configv1.ConditionFalse
}
Contributor:

Is this based on some document explaining how the conditions should be working?

return newCondition(configv1.OperatorAvailable, configv1.ConditionTrue, ReasonAsExpected, fmt.Sprintf("Cluster CAPI Operator is available at %s", releaseVersion))
}

return newCondition(configv1.OperatorAvailable, configv1.ConditionFalse, "ControllersNotAvailable", fmt.Sprintf("The following controllers available conditions are not as expected: %s", strings.Join(messages, ", ")))
Contributor:

As far as I could tell from reviewing the other controllers, we will never report Available=False, which I think is probably an error, right?

Status: status,
Reason: reason,
Message: message,
LastTransitionTime: metav1.Now(),
Contributor:

Does the code applying these take care of making sure LastTransitionTime doesn't tick up on every reconcile?
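
(A hand-rolled sketch of the usual pattern for this: only move LastTransitionTime when the Status value actually changes, so steady-state reconciles keep the old timestamp. This is illustrative and may not match the helper the PR actually uses.)

```go
package operatorstatus

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setStatusCondition upserts newCond into conds, bumping LastTransitionTime
// only when the condition's Status actually transitions; a reconcile that
// re-applies an unchanged status keeps the previous timestamp.
func setStatusCondition(conds []configv1.ClusterOperatorStatusCondition, newCond configv1.ClusterOperatorStatusCondition) []configv1.ClusterOperatorStatusCondition {
	if newCond.LastTransitionTime.IsZero() {
		newCond.LastTransitionTime = metav1.Now()
	}

	for i := range conds {
		if conds[i].Type != newCond.Type {
			continue
		}
		if conds[i].Status == newCond.Status {
			// No transition: preserve the existing timestamp.
			newCond.LastTransitionTime = conds[i].LastTransitionTime
		}
		conds[i] = newCond
		return conds
	}

	return append(conds, newCond)
}
```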

@damdo (Member, Author) commented Jan 24, 2025

Reading through this, it appears that several of our controllers can never go degraded. Is that really true?
I'm also not sure I like the way we are tying the Available and Degraded conditions together. It appears there is no way for us to ever say that we are not available. I would expect that to be true after cluster bootstrap, but if we are Degraded=True on the first reconcile, we also claim we are Available, which is likely misleading.

@JoelSpeed I agree with your points here. In fact this was my goal when writing this code: to put in place the logic for setting controller-level conditions and aggregating them at the operator level (this is why, for example, I temporarily left in setControllerConditionDegraded even though it is not used ATM).

I didn't want to necessarily decide myself where we should be degraded, available, or their counterparts in each controller.
That's why, for example, I didn't set degraded or unavailable conditions in each controller, but only the happy ones.
I would like to have an open discussion on where to set which conditions in each controller.

I'm also happy to pull in folks from the CVO/upgrades team if we want to get their opinion on what's the best posture in terms of conditions at various moments of the operator/cluster lifecycle.

openshift-ci bot (Contributor) commented Apr 15, 2025

@damdo: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-serial 1736044 link true /test e2e-aws-ovn-serial
ci/prow/e2e-aws-ovn-techpreview 1736044 link true /test e2e-aws-ovn-techpreview
ci/prow/e2e-azure-ovn-techpreview 1736044 link false /test e2e-azure-ovn-techpreview
ci/prow/e2e-gcp-ovn-techpreview 1736044 link true /test e2e-gcp-ovn-techpreview
ci/prow/e2e-openstack-ovn-techpreview 1736044 link true /test e2e-openstack-ovn-techpreview

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 15, 2025
openshift-merge-robot (Contributor) commented:

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@damdo damdo marked this pull request as draft April 15, 2025 14:06
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 15, 2025