Provide a way to surface arbitrary node conditions at machine level #11826

Open
sm4ll-3gg opened this issue Feb 10, 2025 · 12 comments
Labels
area/machinehealthcheck: Issues or PRs related to machinehealthchecks
kind/feature: Categorizes issue or PR as related to a new feature.
needs-priority: Indicates an issue lacks a `priority/foo` label and requires one.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@sm4ll-3gg

What would you like to be added (User Story)?

As an operator, I'd like to be able to reflect a workload cluster's Node status on the relevant Machine resources without any remediation.

Detailed Description

We're looking for a way to get more control over the MD's (and, in the future, KCP's) rolling update process. First of all, we want to reflect custom conditions that NPD sets on the Machine's Ready condition, in order to pause the rolling update until all checks succeed. It will be possible out of the box by using MachineHealthCheck once v1beta2 is released, but MHC is tightly coupled with the remediation feature, which we don't want to use.

In other words, we want custom health checks from the workload cluster to be reflected in its lifecycle management, which is controlled by CAPI.

Since the resource is named MachineHealthCheck, not MachineSelfHealing, I suppose it would be OK to opt out of remediation when we want to. It could be implemented as a single optional boolean field, remediationDisabled, which would be backward compatible.
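
A rough sketch of what this could look like (only remediationDisabled is new and does not exist in the current API; everything else is a regular MachineHealthCheck):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-health
spec:
  clusterName: my-cluster
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: my-md
  unhealthyConditions:
    # Custom condition set on the Node by NPD.
    - type: FrequentKubeletRestart
      status: "True"
      timeout: 300s
  # Proposed field (does not exist today): perform health checking only,
  # never mark machines for remediation.
  remediationDisabled: true
```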

Anything else you would like to add?

Considered alternative solutions:

  • Use the cluster.x-k8s.io/skip-remediation annotation on Machine resources (see the sketch after this list): it's implicit and error-prone, since the annotation can easily be forgotten when creating a new MD;
  • Use maxUnhealthy: 0: deprecated and has no alternatives for now (Deprecate MachineHealthCheck MaxUnhealthy and UnhealthyRange #10722);
  • Implement a custom no-op remediation template: looks more like a workaround than a solution;
  • Implement a custom controller that does the same things as MHC, but without remediation: but why not just extend the existing feature instead of duplicating it?
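
For reference, the first alternative looks roughly like the sketch below; the annotation has to be remembered on the Machine template of every MD, which is what makes it easy to forget (names are placeholders, most fields are elided):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-md
spec:
  clusterName: my-cluster
  # selector, replicas, etc. elided
  template:
    metadata:
      annotations:
        # Machines created from this template should inherit the annotation,
        # so MHC findings never result in remediation for them.
        cluster.x-k8s.io/skip-remediation: ""
    spec:
      clusterName: my-cluster
      version: v1.31.0
      # bootstrap and infrastructureRef elided
```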

Label(s) to be applied

/kind feature
/area machinehealthcheck

@k8s-ci-robot added the kind/feature, area/machinehealthcheck, and needs-priority labels on Feb 10, 2025
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the needs-triage label on Feb 10, 2025
@chrischdi
Member

The MachineHealthCheck does not start the deletion, it only sets the remediation condition.

You should be able to turn off the remediation by configuring e.g. MaxInFlight for a MachineDeployment at machinedeployment.spec.strategy.remediation.maxInFlight:

> maxInFlight determines how many in flight remediations should happen at the same time.
>
> Remediation only happens on the MachineSet with the most current revision, while
> older MachineSets (usually present during rollout operations) aren't allowed to remediate.
>
> Note: In general (independent of remediations), unhealthy machines are always
> prioritized during scale down operations over healthy ones.
>
> MaxInFlight can be set to a fixed number or a percentage.
> Example: when this is set to 20%, the MachineSet controller deletes at most 20% of
> the desired replicas.
>
> If not set, remediation is limited to all machines (bounded by replicas)
> under the active MachineSet's management.
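
For illustration, that would be set roughly like this (whether maxInFlight: 0 is accepted and fully disables remediation should be verified; other fields elided):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-md
spec:
  # other fields elided
  strategy:
    remediation:
      # 0 in-flight remediations: the MachineSet controller should never
      # delete machines in response to the MHC remediation condition.
      maxInFlight: 0
```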

@sm4ll-3gg
Author

sm4ll-3gg commented Feb 11, 2025

> The MachineHealthCheck does not start the deletion, it only sets the remediation condition.
>
> You should be able to turn off the remediation by configuring e.g. MaxInFlight for a MachineDeployment at machinedeployment.spec.strategy.remediation.maxInFlight:
>
> maxInFlight determines how many in flight remediations should happen at the same time.
>
> Remediation only happens on the MachineSet with the most current revision, while
> older MachineSets (usually present during rollout operations) aren't allowed to remediate.
>
> Note: In general (independent of remediations), unhealthy machines are always
> prioritized during scale down operations over healthy ones.
>
> MaxInFlight can be set to a fixed number or a percentage.
> Example: when this is set to 20%, the MachineSet controller deletes at most 20% of
> the desired replicas.
>
> If not set, remediation is limited to all machines (bounded by replicas)
> under the active MachineSet's management.

I see, thanks! But I guess it's still implicit and equivalent to the "use annotation" solution, since the only place to control remediation (which is initiated by an MHC probe failure) is the MD configuration. We'd like to do it centrally, by configuring the MHC, and not risk forgetting to set some MD field and getting undesired behavior.

What do you think of it?

@chrischdi
Member

Just for me to understand the use-case better:

How does v1beta2 solve your use-case? Is it due to the new Machine.spec.readinessGates?

I think if v1beta2 already solves this for you, then we should not start adding another feature that we can hardly get rid of later on.

Note: you can already use readinessGates today and observe the conditions in .status.v1beta2.conditions.
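
For illustration, a readiness gate on a MachineDeployment's machine template would look roughly like this (the condition type NodeProblemFree is purely illustrative; some controller still has to set that condition on the Machine):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-md
spec:
  # other fields elided
  template:
    spec:
      readinessGates:
        # The Machine's Ready condition stays false until a condition with
        # this type is reported true on the Machine.
        - conditionType: NodeProblemFree
```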

@sm4ll-3gg
Author

> Just for me to understand the use-case better:
>
> How does v1beta2 solve your use-case? Is it due to the new Machine.spec.readinessGates?
>
> I think if v1beta2 already solves this for you, then we should not start adding another feature that we can hardly get rid of later on.
>
> Note: you can already use readinessGates today and observe the conditions in .status.v1beta2.conditions.

Oh, let me explain. The simplified goal is: we want an MD to pause its rolling update if more than maxUnavailable machines are unhealthy. We define healthiness as readiness plus custom NPD conditions on the node.
For now (v1beta1) this is impossible because the MachineSet counts ReadyReplicas strictly as the number of ready Nodes in the workload cluster. That means the MS ignores Machine readiness, which we can control, unlike Node readiness.

v1beta2 introduces improved status calculation that fixes the issue above:

> Change the semantic of ReadyReplicas counters to use Machine's Ready condition instead of Node's Ready condition. (so everywhere Ready is used for a Machine it always means the same thing)

So, in v1beta2 we'll be able to achieve our goal by making Machines not ready, and we're considering two solutions:

  • Use MHC without remediation
  • Implement a controller to set a custom condition on Machines + use the new readiness gates feature.

To us, the second one looks like "copy MHC as is, but without remediation". We'd prefer to improve the existing feature to cover a wider range of use cases :)

Related issue: #11023

@fabriziopandini
Member

IMO the proper way forward is to implement a controller that sets a custom condition on Machines + use the new readiness gates feature.

I think that the other option, use MHC without remediation, not only will be confusing from a user POV, but it also makes it impossible to use MHC to respond to failures of those machines down the line (which is why this component exists).
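
A minimal sketch of what such a controller could look like (assuming controller-runtime and the CAPI util packages; the NodeProblemFree condition type, the FrequentKubeletRestart node condition, and the workload-cluster client wiring are illustrative assumptions, and which condition field the readiness gate evaluates should be checked against the CAPI version in use):

```go
// Sketch only: mirrors an NPD node condition onto the owning Machine so a
// readiness gate can block Machine readiness on it. Not an official CAPI API.
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
	"sigs.k8s.io/cluster-api/util/patch"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// NodeProblemFreeCondition is the illustrative condition type referenced by
// the readiness gate example above.
const NodeProblemFreeCondition clusterv1.ConditionType = "NodeProblemFree"

type NodeConditionReconciler struct {
	client.Client                // management cluster client
	WorkloadClient client.Client // workload cluster client (wiring assumed elsewhere)
}

func (r *NodeConditionReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	machine := &clusterv1.Machine{}
	if err := r.Get(ctx, req.NamespacedName, machine); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	if machine.Status.NodeRef == nil {
		return ctrl.Result{}, nil // Node not provisioned yet, nothing to mirror.
	}

	node := &corev1.Node{}
	if err := r.WorkloadClient.Get(ctx, client.ObjectKey{Name: machine.Status.NodeRef.Name}, node); err != nil {
		return ctrl.Result{}, err
	}

	patchHelper, err := patch.NewHelper(machine, r.Client)
	if err != nil {
		return ctrl.Result{}, err
	}

	// Mirror the NPD-managed condition: if NPD reports FrequentKubeletRestart=True
	// on the Node, mark the Machine-level condition false, otherwise true.
	healthy := true
	for _, c := range node.Status.Conditions {
		if c.Type == "FrequentKubeletRestart" && c.Status == corev1.ConditionTrue {
			healthy = false
			break
		}
	}
	if healthy {
		conditions.MarkTrue(machine, NodeProblemFreeCondition)
	} else {
		conditions.MarkFalse(machine, NodeProblemFreeCondition, "NodeProblemDetected",
			clusterv1.ConditionSeverityWarning, "NPD reported FrequentKubeletRestart on node %s", node.Name)
	}

	return ctrl.Result{}, patchHelper.Patch(ctx, machine)
}
```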

@sm4ll-3gg
Author

> IMO the proper way forward is to implement a controller that sets a custom condition on Machines + use the new readiness gates feature.

I understand your point; we considered that solution as well. However, we realized that we probably need exactly what MHC does, and the "custom" controller would be just a copy-paste of the original, which looks a bit weird from my perspective.

> I think that the other option, use MHC without remediation, not only will be confusing from a user POV, but it also makes it impossible to use MHC to respond to failures of those machines down the line (which is why this component exists).

Why do you think it will be confusing for other users? What we want is to be able to perform health checking, but without self-healing. I suppose it's pretty clear to any user when the MachineHealthCheck resource performs health checks only.


Let me give you a few more details :)
We have many bare-metal servers backing our machines, and we don't want the related Machines to be deleted automatically, because the operation is expensive. For some failures we want to respond with self-healing from inside the workload clusters, because not all our clusters are CAPI-provisioned and we want the same tooling for all of them. Other failures we can't heal automatically at all, and, for now, we want a human to decide what to do with them. We're moving toward automation, of course, but it takes time.

In our case, we don't need self-healing to be performed from the management cluster at all, but we do want health checking. That's why I believe pluggable remediation for the MachineHealthCheck resource deserves to exist. It's completely optional, and any user would be free to decide whether they need remediation.

@chrischdi
Member

As I mentioned above:

> The MachineHealthCheck does not start the deletion, it only sets the remediation condition.

So you would still like the MachineHealthCheck controller to set that condition, right?

So the change you would like to consider instead would go into the controller that actually performs the remediation based on this condition.
This change would need to go into the Machine controller and any control plane provider you are using, which to me seems like a non-trivial change.

Also, from a UX point of view, it would be confusing for users to have one cluster where the condition gets set and results in remediation, but another cluster where it does not.

So I'm also +1 on going with the custom controller solution, because then you can also define your own condition for this, which clearly signals what's going on.

@sm4ll-3gg
Author

> So you would still like the MachineHealthCheck controller to set that condition, right?

Not really. I'd suggest the MHC controller not set the condition (nor create an external remediation resource) if remediation is disabled in the CR. This kind of solution affects the MHC controller only.

I'm not 100% sure yet, but I suppose that exiting after healthCheckTargets in this case should solve the issue. Here:

Thus, all targets will be health-checked and the HealthCheckSucceeded condition will be updated, but no remediation logic (neither in the Machine controller nor the external one) will be triggered.

@fabriziopandini
Member

> Why do you think it will be confusing for other users?

If NPD reports a FrequentKubeletRestart condition with a message, and this is relevant for machines, surfacing that condition instead of an opaque "HealthCheckSucceeded" provides a better UX on the machine itself, and even more so at higher levels, where the system will report that x Machines are "not healthy (not to be remediated by MachineDeployment/MachineSet)".

Also, "HealthCheckSuccedeed" = false without a remediation is already used today when MHC detects that there are too many remediation in flight, and by re-using the same signal for FrequentKubeletRestart or NPD conditions will be confusing.

Not to mention External remediation, but I did not have time to check this properly.

> In our case, we don't need self-healing to be performed from the management cluster at all, but we do want health checking.

TBH, it seems to me that what you need is not health checking, but a way to surface node conditions at machine level, which is an interesting idea we should probably explore down the line.

> However, we realized that we probably need exactly what MHC does, and the "custom" controller would be just a copy-paste of the original, which looks a bit weird from my perspective.

I personally don't think we should conflate this new requirement (surfacing node conditions at machine level) with MHC, regardless of whether this would be convenient for developers.

@sm4ll-3gg
Author

> TBH, it seems to me that what you need is not health checking, but a way to surface node conditions at machine level, which is an interesting idea we should probably explore down the line.

Yes, we just call it health checking for CAPI machines. Sorry if that confused you :)
Basically, I'm open to discussing any solution (as part of MHC or not). I believe it would be useful for other users as well, which is why I'd like to see it as part of CAPI.


Personally, I think it's really close to what MHC does, and I don't see any reason why pluggable remediation would confuse any user. The UX issues you mention already exist for MHC today: the machine will report an "opaque" HealthCheckSucceeded: false condition if MHC thinks it's unhealthy. Remediation can be disabled via an annotation (which is really implicit), or short-circuited, or external remediation can take a long time. All of these cases result in the machine being unready for some time, just as if remediation were disabled at the CR level.

Why do you think this is a UX issue of the proposed solution, rather than an area of improvement for MHC as a whole?

@fabriziopandini changed the title from "Opt-out remediation for MachineHealthCheck" to "Provide a way to surface arbitrary node conditions at machine level" on Mar 5, 2025
@fabriziopandini
Member

fabriziopandini commented Mar 5, 2025

Renamed the issue to better surface the outcome of the discussion so far.

Trying to move forward, we should now do some research to find a solution for surfacing arbitrary node conditions at machine level.

My hot take is that we can leverage the existing mechanism in the Machine controller that already takes care of surfacing NodeReady and NodeHealthy (which is an aggregation of a few node conditions) on the Machine.

The missing part to be figured out is how to make the list of additional conditions configurable, which, depending on the solution we choose, could also imply changes in ClusterClass, Cluster, KCP (control plane contract), MD, MP, and MS.
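
Purely to illustrate the idea (nothing below exists today; the field name and placement are invented), such a configurable list could look something like:

```yaml
# Hypothetical sketch only: nodeConditionsToSurface is an invented field
# showing how a list of extra node conditions to mirror onto the Machine
# might be expressed on a MachineDeployment's machine template.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: my-md
spec:
  # other fields elided
  template:
    spec:
      nodeConditionsToSurface:
        - FrequentKubeletRestart   # set on the Node by NPD
        - KernelDeadlock
```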
