Provide a way to surface arbitrary node conditions at machine level #11826
Comments
This issue is currently awaiting triage. If CAPI contributors determine this is a relevant issue, they will accept it by applying the `triage/accepted` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
The MachineHealthCheck does not start the deletion, it only sets the remediation condition. You should be able to turn off the remediation by configuring this on the owning MachineDeployment, e.g.:
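A hedged sketch of what that MachineDeployment-level switch could look like, assuming the setting being referred to is the remediation strategy's `maxInFlight`; setting it to 0 should keep the owning MachineSet from acting on the remediation condition (my reading, not a statement from the thread):

```yaml
# Minimal sketch; other required MachineDeployment fields are omitted.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-0
spec:
  clusterName: my-cluster
  strategy:
    remediation:
      maxInFlight: 0   # assumption: allowing 0 in-flight remediations effectively disables remediation for this MD
```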
I see, thanks! But I guess it's still implicit and equivalent to the "use annotation" solution, since the only place to control remediation (which is initiated by an MHC probe failure) is the MD configuration. We'd like to do it centrally, by configuring MHC, and not risk forgetting to set some MD field and getting undesired behavior. What do you think?
Just for me to understand the use case better: how does v1beta2 solve it, is it due to the new status calculation? I think if v1beta2 already solves this for you, then we should not start adding another feature that we can later hardly get rid of. Note: you can already use readinessGates today and observe the resulting conditions on the Machine.
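For illustration, a minimal sketch of the readinessGates mechanism mentioned above, assuming a custom condition type `CustomNodeHealthy` that some external controller sets on each Machine (both the condition name and that controller are assumptions, not something stated in the thread):

```yaml
# Sketch of surfacing a custom Machine condition into Machine readiness.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-0
spec:
  clusterName: my-cluster
  template:
    spec:
      clusterName: my-cluster
      readinessGates:
        - conditionType: CustomNodeHealthy   # assumed condition set by an external controller
```

With this in place, the Machine's Ready condition (and therefore the MD rollout) would also take the custom condition into account.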
Oh, let me explain. The simplified goal is: we want an MD to pause its rolling update if more than a certain number of its Machines become unhealthy, which is not achievable with the current status calculation. v1beta2 introduces an improved status calculation that fixes this.
So, in v1beta2 we'll be able to achieve our goal by making Machines not ready, and we are considering two solutions:
- make remediation optional in MachineHealthCheck, so it can be used for health checking only;
- implement a custom controller that sets a condition on Machines and rely on the new readinessGates feature.

The second one looks to us like "copy MHC as is, but without remediation". We'd prefer to improve the existing feature to cover a wider range of use cases :) Related issue: #11023
IMO the proper way forward is to implement a controller that sets a custom condition on Machines and to use the new readiness gates feature. I think the other option, using MHC without remediation, will not only be confusing from a user POV, but will also make it impossible to use MHC to respond to failures of those machines down the line (which is why this component exists).
I understand your opinion; we considered such a solution as well. However, we realized that we probably need exactly what MHC does, and the "custom" controller would be just a copy-paste of the original, which looks a bit weird from my perspective.
Why do you think it will be confusing for other users? What we want is to be able to perform health checking, but without self-healing. I suppose it's pretty clear to any user when a MachineHealthCheck resource performs health checks only. Let me give you a few more details :) In our case, we don't need self-healing to be performed from the management cluster at all, but we do want health checking. That's why I believe that pluggable remediation for the MachineHealthCheck resource deserves to exist. It's completely optional, and any user would be free to decide whether they need remediation.
As I mentioned above:
So you would still like the MachineHealthCheck controller to set that condition, right? Then the change you would like to consider would instead go into the controller which is actually doing the remediation based on this condition. Also, from a UX point of view it would be confusing for users to have one cluster where the condition gets set and results in remediation, while on another cluster it does not. So I am also +1 on going with the custom controller solution, because then you can also define your own condition for this, which clearly signals what's going on.
Not really, I'd suggest that the MHC controller neither set the condition nor create an external remediation resource if remediation is disabled in the CR. This kind of solution affects the MHC controller only. I'm not 100% sure yet, but I suppose that exiting after the health check at cluster-api/internal/controllers/machinehealthcheck/machinehealthcheck_controller.go line 242 (commit 48746cb) would be enough. Thus, all targets would still be health checked, but no remediation would be triggered.
Also, "HealthCheckSucceeded" = false without a remediation is already used today when MHC detects that there are too many remediations in flight, and re-using the same signal with a different meaning would be ambiguous. Not to mention external remediation, which I did not have time to check properly.
TBH, it seems to me that what you need is not health checking, but a way to surface node conditions at machine level, which is an interesting idea we should probably explore down the line.
I personally don't think we should conflate this new requirement (surface node conditions at machine level) with MHC, no matter whether this could be convenient for developers or not.
Yes, we just call it health checking for CAPI machines. Sorry if it confused you :) Personally I think it's really close to what MHC does, and I don't see any reason why pluggable remediation would confuse any user. The UX issues you talk about exist for MHC even now: the machine will report "opaque" conditions either way. Why do you think this is a UX issue of the proposed solution, rather than an improvement point for the entire MHC?
Renamed the issue to better surface the outcome of the discussion so far. Trying to move forward, we should now do some research to find a solution for surfacing arbitrary node conditions at machine level. My hot take is that we can leverage the mechanism existing in the machine controller that already takes care of surfacing NodeReady and NodeHealthy (which is an aggregation of a few node conditions) on the Machine. The missing part to be figured out is how to make the list of additional conditions configurable, which, depending on which solution we choose, could also imply changes in ClusterClasses, Cluster, KCP (control plane contract), MD, MP, MS.
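Purely as an illustration of the configurability question above, one possible shape would be a per-MachineDeployment list; the field name `additionalNodeConditions` is hypothetical and does not exist in the current API, and the condition types are just NPD defaults used as examples:

```yaml
# Hypothetical API sketch only; "additionalNodeConditions" is not a real field.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-0
spec:
  clusterName: my-cluster
  template:
    spec:
      clusterName: my-cluster
      additionalNodeConditions:   # hypothetical field
        - KernelDeadlock          # NPD default condition, used as an example
        - ReadonlyFilesystem
```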
What would you like to be added (User Story)?
As an operator, I'd like to be able to reflect a workload cluster's Node status on the relevant Machine resources without any remediation.
Detailed Description
We're looking for a way to get more control over the MD's (and, in the future, the KCP's) rolling update process. First of all, we want to reflect custom conditions that NPD sets on the Machine's Ready condition, in order to pause the rolling update until all checks succeed. This will be possible out of the box by using MachineHealthCheck once v1beta2 is released, but MHC is tightly coupled with the remediation feature, which we don't want to use.
In this case we want the custom health checks from the workload cluster to be reflected in its lifecycle management, controlled by CAPI.
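As an illustration of the input side of this, a custom condition reported by NPD on a workload cluster Node could look like the following (the condition type shown is one of NPD's defaults; the concrete values are illustrative):

```yaml
# Illustrative only: a node condition as reported by Node Problem Detector.
# The goal described above is to get such conditions reflected on the owning
# Machine so the MD rollout can take them into account.
apiVersion: v1
kind: Node
metadata:
  name: worker-0
status:
  conditions:
    - type: KernelDeadlock        # NPD default check; custom checks work the same way
      status: "True"
      reason: DockerHung
      message: "kernel: INFO: task docker:20744 blocked for more than 120 seconds"
      lastHeartbeatTime: "2025-01-01T00:00:00Z"
      lastTransitionTime: "2025-01-01T00:00:00Z"
```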
Since the resource is named MachineHealthCheck, not MachineSelfHealing, I suppose it'd be OK to opt out of remediation when we want to. It could be implemented as a single optional boolean field, `remediationDisabled`, which would be backward compatible (a sketch follows below).
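A minimal sketch of what the proposal could look like; `remediationDisabled` is the field suggested in this issue and does not exist in the current API, and all other values are placeholders:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: md-0-healthcheck
spec:
  clusterName: my-cluster
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: md-0
  unhealthyConditions:
    - type: Ready
      status: "False"
      timeout: 300s
  remediationDisabled: true   # proposed field, not part of the current API
```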
Anything else you would like to add?
Considered alternative solutions:
- `cluster.x-k8s.io/skip-remediation` annotation on Machine resources: it's implicit and error prone, since the annotation can easily be forgotten when creating a new MD;
- `maxUnhealthy: 0`: deprecated and has no alternative for now (Deprecate MachineHealthCheck MaxUnhealthy and UnhealthyRange #10722)

Label(s) to be applied
/kind feature
/area machinehealthcheck