OCPCLOUD-2792: Aggregate controllers conditions under ClusterOperator #246
Conversation
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from 4a8e8dc to e074001.
Force-pushed from c7dc13a to 1f123e2.
@damdo: This pull request references OCPCLOUD-2792 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/assign @JoelSpeed @theobarberbany @nrb
Force-pushed from 1f123e2 to 89bde08.
Force-pushed from 4d1da92 to 1736044.
/test unit
} else {
	// The ClusterOperator Controller must run in all cases.
I think this comment should be above the `setupClusterOperatorController` call.
@@ -298,24 +298,24 @@ func setupReconcilers(mgr manager.Manager, infra *configv1.Infrastructure, platf
 		Platform: platform,
 		Infra:    infra,
 	}).SetupWithManager(mgr); err != nil {
-		klog.Error(err, "unable to create controller", "controller", "CoreCluster")
+		klog.Error(err, "unable to create core cluster controller", "controller", "CoreCluster")
I'm probably missing context, but each of the `klog.Error` calls here repeats the controller's name and `controller`, which seems redundant. Do we have a special handler installed? I don't see it in this file.
It's a structured logging format here, so `controller` is adding a new key, and the next string is the value. The logging throughout this function makes sense given the structured logging context.
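For illustration, a minimal sketch of that key/value pattern, assuming the structured klog variant `klog.ErrorS` (the snippet above uses `klog.Error`, so this is only an approximation of the intent):

```go
package main

import (
	"errors"

	"k8s.io/klog/v2"
)

func main() {
	err := errors.New("example failure")

	// Message first, then alternating key/value pairs: "controller" is the key and
	// "CoreCluster" its value, so aggregated logs can be filtered on controller=CoreCluster.
	klog.ErrorS(err, "unable to create core cluster controller", "controller", "CoreCluster")
}
```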
sourceNamespace := flag.String(
	"source-namespace",
	controllers.DefaultMAPIManagedNamespace,
	"The namespace where MAPI components will run.",
The code appears to only use it for secrets; are we actually getting other MAPI components with it? If not, the variable name could be a little more descriptive up here.
// setControllerConditionDegraded sets the InfraClusterController conditions to a degraded state.
//
//nolint:unused
I'm reviewing commit-wise, so this might be used later in the PR, but if not, we should delete it and add it when it's actually called.
Yeah, if this isn't used, let's dump it.
{LastTransitionTime: metav1.Now(), Type: operatorstatus.KubeconfigControllerAvailableCondition, Status: configv1.ConditionTrue, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.KubeconfigControllerDegradedCondition, Status: configv1.ConditionFalse, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.CapiInstallerControllerAvailableCondition, Status: configv1.ConditionTrue, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.CapiInstallerControllerDegradedCondition, Status: configv1.ConditionTrue, Reason: "Error", Message: "reconciler error"},
Maybe put a comment at the end of this line to indicate it's the one that's degraded, since these lines are all so similar. Otherwise it's commented 24 lines away.
{LastTransitionTime: metav1.Now(), Type: operatorstatus.KubeconfigControllerDegradedCondition, Status: configv1.ConditionFalse, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.CapiInstallerControllerAvailableCondition, Status: configv1.ConditionTrue, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.CapiInstallerControllerDegradedCondition, Status: configv1.ConditionFalse, Reason: operatorstatus.ReasonAsExpected},
{LastTransitionTime: metav1.Now(), Type: operatorstatus.SecretSyncControllerAvailableCondition, Status: configv1.ConditionFalse, Reason: "Error", Message: "persistent reconcile error"},
Similar here, perhaps mark the line that's expected to trigger the desired test state
// getFeatureGates is used to fetch the current feature gates from the cluster.
// We use this to check if the machine api migration is actually enabled or not.
func getFeatureGates(mgr ctrl.Manager) (featuregates.FeatureGateAccess, error) {
Not really a review of this code, but I'm a little surprised we don't have a helper for this in https://github.com/openshift/library-go/blob/4ea50293b28af3d28f547c670f52c175af9e4427/pkg/features/features.go.
Perhaps we should 🤔
I thought the same when touching this function. And I agree :D
Overall, it looks good. Thanks for the detailed tests!
I would strongly suggest removing any function that's got `//nolint:unused`, even though it might be useful in the future. https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it
Reading through this, it appears that several of our controllers can never go degraded; is that really true?
I'm also not sure I like the way we are tying the available and degraded conditions together. It appears that there is no way for us to ever say that we are not available? I would expect that after cluster bootstrap that's probably true, but if we are degraded=true on the first reconcile, we also claim we are available, which is likely misleading.
I would expect a flow through the controller that mimics what CAPI does to be more accurate, e.g.:
1. Fetch clusteroperator at top of Reconcile
2. Extract conditions owned by our SSA manager name
3. Set up defer to update (via SSA) the conditions on the clusteroperator
4. If Available condition is not set, set it to false
5. As we traverse the function, update the list of extracted conditions with individual and specific conditions
6. Only once we know we are available, set available true
With this, we would have an accurate view of the available condition, and the degraded condition could be handled separately where appropriate.
It would also make it easier in the future to add Upgradeable into the mix for the controllers if and when we need that.
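For illustration only, a rough Go sketch of that flow; this is not the code in this PR, and `extractOwnedConditions`, `applyConditions`, `setCondition`, `reconcileInfraCluster`, and the ClusterOperator name "cluster-api" are all assumed/hypothetical names:

```go
func (r *InfraClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the ClusterOperator at the top of Reconcile.
	co := &configv1.ClusterOperator{}
	if err := r.Get(ctx, client.ObjectKey{Name: "cluster-api"}, co); err != nil {
		return ctrl.Result{}, fmt.Errorf("failed to get ClusterOperator: %w", err)
	}

	// 2. Extract only the conditions owned by our SSA field manager.
	conds := extractOwnedConditions(co.Status.Conditions)

	// 3. Defer an SSA apply so the conditions are written back on every return path.
	defer func() {
		if err := r.applyConditions(ctx, conds); err != nil {
			klog.ErrorS(err, "unable to apply conditions", "clusteroperator", co.Name)
		}
	}()

	// 4. If the Available condition is not set yet, default it to False.
	setCondition(&conds, InfraClusterControllerAvailableCondition, configv1.ConditionFalse, "Reconciling", "")

	// 5. Update individual, specific conditions as the function traverses its steps.
	if err := r.reconcileInfraCluster(ctx); err != nil {
		setCondition(&conds, InfraClusterControllerDegradedCondition, configv1.ConditionTrue, "Error", err.Error())
		return ctrl.Result{}, err
	}
	setCondition(&conds, InfraClusterControllerDegradedCondition, configv1.ConditionFalse, ReasonAsExpected, "")

	// 6. Only once we know we are available, set Available to True.
	setCondition(&conds, InfraClusterControllerAvailableCondition, configv1.ConditionTrue, ReasonAsExpected, "")

	return ctrl.Result{}, nil
}
```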
	setupWebhooks(mgr)
case configv1.AzurePlatformType:
	azureCloudEnvironment := getAzureCloudEnvironment(infra.Status.PlatformStatus)
	if azureCloudEnvironment == configv1.AzureStackCloud {
		klog.Infof("Detected Azure Cloud Environment %q on platform %q is not supported, skipping capi controllers setup", azureCloudEnvironment, platform)
		setupUnsupportedController(mgr, managedNamespace)
		setupClusterOperatorController(mgr, managedNamespace, currentFeatureGates, true)
Would it make sense to set up the clusteroperator controller before the switch, and remove it from `setupReconcilers`? Then we don't need this else, or the default case down below?
// The ClusterOperator Controller must run under all circumstances as it manages the ClusterOperator object for this operator.
setupClusterOperatorController(mgr, managedNamespace, currentFeatureGates, false)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, as with the previous comment, maybe move this up a level, before the switch, to avoid calling it in three different places if it must always run.
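A hedged sketch of that refactor, reusing the calls visible in the snippets above (the wrapper name `setupControllers` and its parameter list are assumptions, and the AzureStackCloud special case is omitted for brevity):

```go
func setupControllers(mgr manager.Manager, infra *configv1.Infrastructure, platform configv1.PlatformType, managedNamespace string, currentFeatureGates featuregates.FeatureGateAccess) {
	unsupported := false

	switch platform {
	case configv1.AWSPlatformType, configv1.AzurePlatformType:
		// ... platform-specific controllers and webhooks are set up here ...
		setupWebhooks(mgr)
	default:
		klog.Infof("Platform %q is not supported, skipping capi controllers setup", platform)
		setupUnsupportedController(mgr, managedNamespace)
		unsupported = true
	}

	// Called exactly once after the switch, since it must run in all cases.
	setupClusterOperatorController(mgr, managedNamespace, currentFeatureGates, unsupported)
}
```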
	return nil, fmt.Errorf("failed to create config client: %w", err)
}

configInformers := configinformers.NewSharedInformerFactory(configClient, 10*time.Minute)
Minor: This probably doesn't need to resync so quickly
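For example, stretching the resync to an arbitrary 30 minutes; the resync is only a periodic re-list safety net, and watch events still arrive immediately:

```go
// Longer resync period; informer watches still deliver changes as they happen.
configInformers := configinformers.NewSharedInformerFactory(configClient, 30*time.Minute)
```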
availableCondition := newAvailableCondition(anyAvailableMissing, availableMsg, releaseVersion)
degradedCondition := newDegradedCondition(anyDegradedMissing, degradedMsg)
progressingCondition := newCondition(configv1.OperatorProgressing, configv1.ConditionFalse, "", "")
At some point the CO should be Progressing, e.g. during an upgrade of operands; I wonder why this isn't represented here?
{CapiInstallerControllerAvailableCondition, CapiInstallerControllerDegradedCondition},
{CoreClusterControllerAvailableCondition, CoreClusterControllerDegradedCondition},
{InfraClusterControllerAvailableCondition, InfraClusterControllerDegradedCondition},
{KubeconfigControllerAvailableCondition, KubeconfigControllerDegradedCondition},
{SecretSyncControllerAvailableCondition, SecretSyncControllerDegradedCondition},
Since not all of these have any reason to be degraded, do we require them all to always set a degraded condition?
Would it be better to factor this some way so that we only actually have conditions that are likely to change over time?
switch {
case availableCondition.Status == "True" && degradedCondition.Status == "False":
	upgradeableStatus = configv1.ConditionTrue
case availableCondition.Status == "False" || degradedCondition.Status == "True":
	upgradeableStatus = configv1.ConditionFalse
}
Is this based on some document explaining how the conditions should be working?
	return newCondition(configv1.OperatorAvailable, configv1.ConditionTrue, ReasonAsExpected, fmt.Sprintf("Cluster CAPI Operator is available at %s", releaseVersion))
}

return newCondition(configv1.OperatorAvailable, configv1.ConditionFalse, "ControllersNotAvailable", fmt.Sprintf("The following controllers available conditions are not as expected: %s", strings.Join(messages, ", ")))
As far as I could tell from reviewing the other controllers, we will never report Available=False, which I think is probably an error, right?
Status: status,
Reason: reason,
Message: message,
LastTransitionTime: metav1.Now(),
Does the code applying these take care of making sure `LastTransitionTime` doesn't tick up on every reconcile?
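For context, a minimal sketch of the usual guard (a hypothetical `setStatusCondition` helper, not the code in this PR): `LastTransitionTime` is only refreshed when the Status value actually changes.

```go
package operatorstatus

import (
	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setStatusCondition updates conds in place, bumping LastTransitionTime only
// when the condition's Status actually transitions.
func setStatusCondition(conds *[]configv1.ClusterOperatorStatusCondition, newCond configv1.ClusterOperatorStatusCondition) {
	for i := range *conds {
		existing := &(*conds)[i]
		if existing.Type != newCond.Type {
			continue
		}

		if existing.Status != newCond.Status {
			existing.Status = newCond.Status
			existing.LastTransitionTime = metav1.Now()
		}

		// Reason/Message updates alone do not count as a transition.
		existing.Reason = newCond.Reason
		existing.Message = newCond.Message

		return
	}

	newCond.LastTransitionTime = metav1.Now()
	*conds = append(*conds, newCond)
}
```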
@JoelSpeed I agree with your points here. In fact this was my goal when writing this code: to put in place the logic for setting controller-level conditions and aggregating them at the operator level (this is why, for example, I temporarily left in the `//nolint:unused` helpers). I didn't want to necessarily decide myself where we should be degraded, available, or their counterparts in each controller. I'm also happy to pull in folks from the CVO/upgrades team if we want to get their opinion on what's the best posture in terms of conditions at various moments of the operator/cluster lifecycle.
@damdo: The following tests failed.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The current ClusterOperator status conditions—Available, Progressing, Upgradeable, and Degraded—are set by the corecluster controller independently of the status of other controllers.
This approach does not align with the intended purpose of these conditions, which are meant to reflect the overall status of the operator, considering all the controllers it manages.
To address this, we should introduce controller-level conditions similar to the top-level ones. These conditions would influence an aggregated top-level status, which a new controller (the clusteroperator controller) would then consolidate into the Available, Progressing, Upgradeable, and Degraded conditions.
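A hedged sketch of what that aggregation could look like (not the PR's implementation; the substring matching on condition type names and the omission of Progressing/Upgradeable are simplifications for brevity):

```go
package operatorstatus

import (
	"strings"

	configv1 "github.com/openshift/api/config/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// aggregateConditions folds the per-controller conditions into top-level
// Available and Degraded conditions for the ClusterOperator.
func aggregateConditions(controllerConds []configv1.ClusterOperatorStatusCondition) []configv1.ClusterOperatorStatusCondition {
	available := configv1.ConditionTrue
	degraded := configv1.ConditionFalse

	for _, c := range controllerConds {
		// Any controller-level Available!=True pulls the aggregate Available down.
		if strings.Contains(string(c.Type), "Available") && c.Status != configv1.ConditionTrue {
			available = configv1.ConditionFalse
		}
		// Any controller-level Degraded=True makes the aggregate Degraded.
		if strings.Contains(string(c.Type), "Degraded") && c.Status == configv1.ConditionTrue {
			degraded = configv1.ConditionTrue
		}
	}

	return []configv1.ClusterOperatorStatusCondition{
		{Type: configv1.OperatorAvailable, Status: available, LastTransitionTime: metav1.Now()},
		{Type: configv1.OperatorDegraded, Status: degraded, LastTransitionTime: metav1.Now()},
	}
}
```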
--
I suggest reviewing by commit