
CAPI logs filled with error messages if no machine deployments/pools exist and ControlPlane does not implement v1beta2 conditions #11820

Open · jakefhyde opened this issue Feb 7, 2025 · 10 comments · May be fixed by #11952 or cahillsf/cluster-api#5

Labels: area/conditions · help wanted · kind/bug · priority/important-soon · triage/accepted

Comments

@jakefhyde (Contributor) commented Feb 7, 2025

What steps did you take and what happened?

Clusters that do not use machine deployments or machine pools (for example, a cluster whose machines are defined manually) will cause the capi-controller-manager to endlessly write error logs whenever the cluster status is updated. The capi-controller-manager will show the following logs:

E0207 14:31:48.116038       1 cluster_controller_status.go:838] "Failed to aggregate ControlPlane, MachinePool, MachineDeployment's RollingOut conditions" err="sourceObjs can't be empty" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="<namespace/cluster>" namespace="<namespace>" name="<name>" reconcileID="71e43209-4e27-41d9-8470-f1c1c33901c1"
E0207 14:31:48.116068       1 cluster_controller_status.go:915] "Failed to aggregate ControlPlane, MachinePool, MachineDeployment, MachineSet's ScalingUp conditions" err="sourceObjs can't be empty" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="<namespace/cluster>" namespace="<namespace>" name="<name>" reconcileID="71e43209-4e27-41d9-8470-f1c1c33901c1"
E0207 14:31:48.116081       1 cluster_controller_status.go:992] "Failed to aggregate ControlPlane, MachinePool, MachineDeployment, MachineSet's ScalingDown conditions" err="sourceObjs can't be empty" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="<namespace/cluster>" namespace="<namespace>" name="<name>" reconcileID="71e43209-4e27-41d9-8470-f1c1c33901c1"

The relevant code can be found in cluster_controller_status.go (the file and line numbers appear in the log messages above).

We are able to work around this by setting the conditions to False so that they are present.
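
For illustration, a minimal sketch of that workaround, assuming a control plane provider reconciler whose object implements CAPI's v1beta2 conditions setter interface (the helper name and the reason string are hypothetical; the Set call mirrors the one used elsewhere in this thread):

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    v1beta2conditions "sigs.k8s.io/cluster-api/util/conditions/v1beta2"
)

// setPlaceholderConditions (hypothetical helper) reports the optional
// v1beta2 conditions as False so that the cluster controller has
// something to aggregate instead of erroring on an empty source list.
func setPlaceholderConditions(controlPlane v1beta2conditions.Setter) {
    for _, conditionType := range []string{"RollingOut", "ScalingUp", "ScalingDown"} {
        v1beta2conditions.Set(controlPlane, metav1.Condition{
            Type:   conditionType,
            Status: metav1.ConditionFalse,
            Reason: "WorkloadNotManaged", // hypothetical reason
        })
    }
}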

What did you expect to happen?

My expectation is that during the v1beta1 -> v1beta2 migration, the status is set to Unknown if the aggregate conditions are not present.
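
For example, a sketch using CAPI's v1beta2conditions helpers (the reason string is hypothetical; this is not the current behavior):

// Hypothetical fallback in the cluster controller when no object reports
// the condition yet: surface Unknown instead of repeatedly logging errors.
v1beta2conditions.Set(cluster, metav1.Condition{
    Type:   clusterv1.ClusterScalingDownV1Beta2Condition,
    Status: metav1.ConditionUnknown,
    Reason: "ScalingDownNotYetReported", // hypothetical reason constant
})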

Cluster API version

v1.9.4

Kubernetes version

v1.30.2

Anything else you would like to add?

We implement bring-your-own-host style provisioning (not related to the BYOH provider) in which users can register nodes freely, leaving lifecycle management to the user. Although this is a less common provisioning model, I imagine it could also affect clusters whose control plane providers have not yet been updated and which also provision machines via manual definition.

Label(s) to be applied

/kind bug
/area conditions

@k8s-ci-robot added the kind/bug, area/conditions, needs-priority, and needs-triage labels on Feb 7, 2025
@jakefhyde changed the title from "Cluster fails to provision if no machine deployments/pools exist and ControlPlane does not implement v1beta2 conditions" to "CAPI logs filled with error messages if no machine deployments/pools exist and ControlPlane does not implement v1beta2 conditions" on Feb 7, 2025
@jakefhyde (Contributor, Author)

Apologies for the rename; there was a little confusion on my end. We have a test case where the etcd plane is scaled to 0, a new etcd machine is created, and we perform an etcd restore on top of that. These logs were being printed endlessly, so I had erroneously assumed they were related. Although we skip draining with the machine.cluster.x-k8s.io/exclude-node-draining annotation, we weren't doing the same for volume detachment. I think this previously worked purely by happy accident, and the addition of alwaysReconcile helped expose this race condition.

That all being said, would it be possible to lower the log level for those messages until v1beta2 goes live? I'm content to just leave the conditions in place for now; otherwise the errors fill the logs and make debugging quite difficult.

@chrischdi (Member)

/triage accepted
/priority important-soon

@k8s-ci-robot added the triage/accepted and priority/important-soon labels and removed the needs-triage and needs-priority labels on Feb 19, 2025
@chrischdi (Member)

/help

@k8s-ci-robot (Contributor)

@chrischdi:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the help wanted label on Feb 19, 2025
@fabriziopandini (Member)

Note: I will try to take a look at this in the context of the work I'm doing for #11474, but I'm not sure if and when I will get to it (if someone else wants to take care of this before me, feel free to do so!)

@cahillsf (Member) commented Mar 4, 2025

I'd be interested in taking a look at this. I'm not super familiar with the new conditions, but it seems like a small change to this logic should address it -- @fabriziopandini?

// In v1beta2conditions.NewAggregateCondition:
if len(sourceObjs) == 0 {
    return nil, errors.New("sourceObjs can't be empty")
}

@fabriziopandini (Member) commented Mar 4, 2025

@cahillsf I don't think it is correct to change the logic in the NewAggregateCondition func; it is the calling code that should not compute an aggregation when there are no conditions to aggregate.

The issue was initially reporting:

Clusters that do not use machine deployments or machine pools (for example, configuring a cluster with machines manually) will cause the capi-controller-manager to endlessly write error logs whenever the cluster status is updated.

The write errors were about RollingOut, ScalingUp and ScalingDown conditions at cluster level.
If I look at how those conditions are computed, we already have code that avoids calling aggregate when there are no MDs and MPs, e.g.:

if controlPlane == nil && len(machinePools.Items)+len(machineDeployments.Items) == 0 {
    v1beta2conditions.Set(cluster, metav1.Condition{
        Type:   clusterv1.ClusterScalingDownV1Beta2Condition,
        Status: metav1.ConditionFalse,
        Reason: clusterv1.ClusterNotScalingDownV1Beta2Reason,
    })
    return
}

However, it looks like this logic doesn't take into account whether the control plane object is reporting the optional ScalingUp condition (it only checks for controlPlane == nil).

This can probably be fixed by moving the existing check to right before the call to NewAggregateCondition, and replacing the if condition (currently controlPlane == nil && len(machinePools.Items)+len(machineDeployments.Items) == 0) with if len(ws) == 0.
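
Roughly, as a sketch (here ws is the slice of objects collected for aggregation in the existing code; the exact placement and surrounding code are illustrative):

// Skip aggregation entirely when nothing reports the condition, covering
// both "no CP/MD/MP objects" and "CP exists but doesn't report the
// optional condition".
if len(ws) == 0 {
    v1beta2conditions.Set(cluster, metav1.Condition{
        Type:   clusterv1.ClusterScalingDownV1Beta2Condition,
        Status: metav1.ConditionFalse,
        Reason: clusterv1.ClusterNotScalingDownV1Beta2Reason,
    })
    return
}
// ...only now compute the aggregation via v1beta2conditions.NewAggregateCondition(ws, ...)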

Another option is to report "Conditions ScalingDown not yet reported from ..." from the CP when the condition is missing, but that seems wrong given that, according to our contract, these conditions are optional.

@cahillsf @chrischdi @sbueringer opinions?

@cahillsf (Member) commented Mar 4, 2025

Ah, I see -- yes, from what I understand this approach makes sense to me:

This can probably be fixed by moving the existing check to right before the call to NewAggregateCondition, and replacing the if condition (currently controlPlane == nil && len(machinePools.Items)+len(machineDeployments.Items) == 0) with if len(ws) == 0.

@sbueringer (Member)

Sounds good

@cahillsf (Member) commented Mar 9, 2025

/assign cahillsf
