
CAPI logs filled with error messages if no machine deployments/pools exist and ControlPlane does not implement v1beta2 conditions #11820

Open · jakefhyde opened this issue Feb 7, 2025 · 10 comments · May be fixed by #11952 or cahillsf/cluster-api#5

Labels: area/conditions · help wanted · kind/bug · priority/important-soon · triage/accepted

Comments

@jakefhyde (Contributor) commented Feb 7, 2025

What steps did you take and what happened?

Clusters that do not use machine deployments or machine pools (for example, a cluster whose machines are defined manually) will cause the capi-controller-manager to endlessly write error logs whenever the cluster status is updated. The capi-controller-manager will show the following logs:

E0207 14:31:48.116038       1 cluster_controller_status.go:838] "Failed to aggregate ControlPlane, MachinePool, MachineDeployment's RollingOut conditions" err="sourceObjs can't be empty" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="<namespace/cluster>" namespace="<namespace>" name="<name>" reconcileID="71e43209-4e27-41d9-8470-f1c1c33901c1"
E0207 14:31:48.116068       1 cluster_controller_status.go:915] "Failed to aggregate ControlPlane, MachinePool, MachineDeployment, MachineSet's ScalingUp conditions" err="sourceObjs can't be empty" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="<namespace/cluster>" namespace="<namespace>" name="<name>" reconcileID="71e43209-4e27-41d9-8470-f1c1c33901c1"
E0207 14:31:48.116081       1 cluster_controller_status.go:992] "Failed to aggregate ControlPlane, MachinePool, MachineDeployment, MachineSet's ScalingDown conditions" err="sourceObjs can't be empty" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="<namespace/cluster>" namespace="<namespace>" name="<name>" reconcileID="71e43209-4e27-41d9-8470-f1c1c33901c1"

The relevant code can be found in cluster_controller_status.go (the file and line numbers appear in the log messages above).

We are able to work around this by setting the conditions to False so that they are present.
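
For illustration, a minimal sketch of that workaround, assuming a control plane provider reconciler whose object implements CAPI's v1beta2 conditions setter interface (the helper name and the reason string are hypothetical; the Set call mirrors the one used elsewhere in this thread):

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    v1beta2conditions "sigs.k8s.io/cluster-api/util/conditions/v1beta2"
)

// setPlaceholderConditions (hypothetical helper) reports the optional
// v1beta2 conditions as False so that the cluster controller has
// something to aggregate instead of erroring on an empty source list.
func setPlaceholderConditions(controlPlane v1beta2conditions.Setter) {
    for _, conditionType := range []string{"RollingOut", "ScalingUp", "ScalingDown"} {
        v1beta2conditions.Set(controlPlane, metav1.Condition{
            Type:   conditionType,
            Status: metav1.ConditionFalse,
            Reason: "WorkloadNotManaged", // hypothetical reason
        })
    }
}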

What did you expect to happen?

My expectation is that during the v1beta1 -> v1beta2 migration, the status is set to Unknown if the aggregate conditions are not present.
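
For example, a sketch using CAPI's v1beta2conditions helpers (the reason string is hypothetical; this is not the current behavior):

// Hypothetical fallback in the cluster controller when no object reports
// the condition yet: surface Unknown instead of repeatedly logging errors.
v1beta2conditions.Set(cluster, metav1.Condition{
    Type:   clusterv1.ClusterScalingDownV1Beta2Condition,
    Status: metav1.ConditionUnknown,
    Reason: "ScalingDownNotYetReported", // hypothetical reason constant
})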

Cluster API version

v1.9.4

Kubernetes version

v1.30.2

Anything else you would like to add?

We implement bring-your-own-host style provisioning (not related to the BYOH provider) in which users can register nodes freely, leaving lifecycle management to the user. Although this is a less common provisioning model, I imagine it could also affect clusters whose control plane providers have not yet been updated and which also provision machines via manual definition.

Label(s) to be applied

/kind bug
/area conditions

@k8s-ci-robot added the kind/bug, area/conditions, needs-priority, and needs-triage labels on Feb 7, 2025
@jakefhyde changed the title from "Cluster fails to provision if no machine deployments/pools exist and ControlPlane does not implement v1beta2 conditions" to "CAPI logs filled with error messages if no machine deployments/pools exist and ControlPlane does not implement v1beta2 conditions" on Feb 7, 2025
@jakefhyde (Contributor, Author)

Apologies for the rename; there was a little confusion on my end. We have a test case where the etcd plane is scaled to 0, a new etcd machine is created, and we perform an etcd restore on top of that. These logs were being printed endlessly, so I had erroneously assumed they were related. Although we skip draining with the machine.cluster.x-k8s.io/exclude-node-draining annotation, we weren't doing the same for volume detachment. I think this previously worked purely by happy accident, and the addition of alwaysReconcile helped expose this race condition.

That all being said, would it be possible to lower the log level for those messages until v1beta2 goes live? I'm content to just leave the conditions in place for now; otherwise the errors fill the logs and make debugging quite difficult.

@chrischdi (Member)

/triage accepted
/priority important-soon

@k8s-ci-robot added the triage/accepted and priority/important-soon labels and removed the needs-triage and needs-priority labels on Feb 19, 2025
@chrischdi (Member)

/help

@k8s-ci-robot (Contributor)

@chrischdi:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the help wanted label on Feb 19, 2025
@fabriziopandini (Member)

Note: I will try to take a look at this in the context of the work I'm doing for #11474, but I'm not sure if and when I will get to it (if someone else wants to take care of this before me, feel free to do so!)

@cahillsf (Member) commented Mar 4, 2025

I'd be interested in taking a look at this. I'm not super familiar with the new conditions, but it seems like a small change to this logic should address it -- @fabriziopandini?

// In v1beta2conditions.NewAggregateCondition:
if len(sourceObjs) == 0 {
    return nil, errors.New("sourceObjs can't be empty")
}

@fabriziopandini (Member) commented Mar 4, 2025

@cahillsf I don't think it is correct to change the logic in the NewAggregateCondition func; it is the calling code that should not compute an aggregation when there are no conditions to aggregate.

The issue was initially reporting:

Clusters that do not use machine deployments or machine pools (for example, configuring a cluster with machines manually) will cause the capi-controller-manager to endlessly write error logs whenever the cluster status is updated.

The write errors were about RollingOut, ScalingUp and ScalingDown conditions at cluster level.
If I look at how those conditions are computed, we already have code that avoids calling aggregate when there are no MDs and MPs, e.g.:

if controlPlane == nil && len(machinePools.Items)+len(machineDeployments.Items) == 0 {
    v1beta2conditions.Set(cluster, metav1.Condition{
        Type:   clusterv1.ClusterScalingDownV1Beta2Condition,
        Status: metav1.ConditionFalse,
        Reason: clusterv1.ClusterNotScalingDownV1Beta2Reason,
    })
    return
}

However, it looks like this logic doesn't take into account whether the control plane object is reporting the optional ScalingUp condition (it only checks for controlPlane == nil).

This can probably be fixed by moving the existing check to right before the call to NewAggregateCondition, and replacing the if condition (currently controlPlane == nil && len(machinePools.Items)+len(machineDeployments.Items) == 0) with if len(ws) == 0.
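
Roughly, as a sketch (here ws is the slice of objects collected for aggregation in the existing code; the exact placement and surrounding code are illustrative):

// Skip aggregation entirely when nothing reports the condition, covering
// both "no CP/MD/MP objects" and "CP exists but doesn't report the
// optional condition".
if len(ws) == 0 {
    v1beta2conditions.Set(cluster, metav1.Condition{
        Type:   clusterv1.ClusterScalingDownV1Beta2Condition,
        Status: metav1.ConditionFalse,
        Reason: clusterv1.ClusterNotScalingDownV1Beta2Reason,
    })
    return
}
// ...only now compute the aggregation via v1beta2conditions.NewAggregateCondition(ws, ...)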

Another option is to report "Conditions ScalingDown not yet reported from ..." from the CP when the condition is missing, but that seems wrong given that, according to our contract, these conditions are optional.

@cahillsf @chrischdi @sbueringer opinions?

@cahillsf (Member) commented Mar 4, 2025

Ah, I see -- yes, from what I understand this approach makes sense to me:

This can probably be fixed by moving the existing check to right before the call to NewAggregateCondition, and replacing the if condition (currently controlPlane == nil && len(machinePools.Items)+len(machineDeployments.Items) == 0) with if len(ws) == 0.

@sbueringer (Member)

Sounds good

@cahillsf (Member) commented Mar 9, 2025

/assign cahillsf
