Improve compatibility with kstatus: avoid "resource is ready" race #9299

Closed
tmmorin opened this issue Aug 24, 2023 · 18 comments

Labels
  • area/api: Issues or PRs related to the APIs
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-priority: Indicates an issue lacks a `priority/foo` label and requires one.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@tmmorin

tmmorin commented Aug 24, 2023

What would you like to be added (User Story)?

There is a low-hanging fruit here: something easy to do that would make CAPI resources play much nicer with tools that rely on kstatus to know whether a CAPI resource is ready.

Detailed Description

CAPI custom resources can be considered "Ready" by the kstatus library before they actually are ready.

In a context where a tool relying on kstatus is used (e.g. FluxCD), this opens the door to inconsistencies: wrongly concluding that something is ready and triggering things that depend on it too early.

This typically happens very shortly after resource creation, lasts for a very short period of time, and resolves quickly on its own. But there is a race condition if the tool relying on kstatus evaluates the resource during that problematic time window.

The problematic window is when, for instance, a Cluster CR has no status yet, or has a status but no status.conditions yet, with only status.observedGeneration set to 1 (equal to metadata.generation). As soon as the resource is processed by its controller, the controller sets its status to include in status.conditions a condition of type Ready with status False, and from then on kstatus reports a correct result (InProgress, which means "not ready").

The typical solution to this issue is to ensure that the CRD defines a default of -1 for status.observedGeneration; this is sufficient for the kstatus library to ignore the rest and conclude that the resource isn't ready yet.
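
To make the race concrete, here is a rough sketch (not from the original report) that feeds a hypothetical, freshly created Cluster-like object to the kstatus library; it assumes the sigs.k8s.io/cli-utils/pkg/kstatus/status package and its Compute function:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/cli-utils/pkg/kstatus/status"
)

func main() {
	// A just-created CR: generation is 1, but the controller has not written any status yet.
	fresh := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "cluster.x-k8s.io/v1beta1",
		"kind":       "Cluster",
		"metadata": map[string]interface{}{
			"name":       "demo",
			"generation": int64(1),
		},
	}}

	res, err := status.Compute(fresh)
	if err != nil {
		panic(err)
	}
	// With no status and no conditions, kstatus falls through to Current ("ready"):
	// this is the race window described above.
	fmt.Println(res.Status)

	// Same object, but with the proposed default of -1 for status.observedGeneration.
	defaulted := fresh.DeepCopy()
	_ = unstructured.SetNestedField(defaulted.Object, int64(-1), "status", "observedGeneration")

	res, err = status.Compute(defaulted)
	if err != nil {
		panic(err)
	}
	// observedGeneration (-1) != metadata.generation (1), so kstatus reports InProgress ("not ready yet").
	fmt.Println(res.Status)
}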

Quoting @stefanprodan (FluxCD dev):

They should set this to make ClusterAPI compatible
// +kubebuilder:default={"observedGeneration":-1}

Example here (FluxCD does this on their own CRDs):
https://github.com/fluxcd/source-controller/blob/a302c71c57e370403042a2e307e3f4446b539730/api/v1/gitrepository_types.go#L328
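
For reference, a minimal sketch of what this could look like on a CAPI-style type (hypothetical, modeled on the FluxCD type linked above; not actual Cluster API code):

package v1beta1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type Cluster struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec ClusterSpec `json:"spec,omitempty"`

	// The marker asks the API server to default the whole status object on creation,
	// so a brand-new object would carry status.observedGeneration: -1 before any controller runs.
	// +kubebuilder:default={"observedGeneration":-1}
	Status ClusterStatus `json:"status,omitempty"`
}

// ClusterSpec is elided here; only the status shape matters for this sketch.
type ClusterSpec struct{}

type ClusterStatus struct {
	// ObservedGeneration is the latest generation observed by the controller.
	// +optional
	ObservedGeneration int64 `json:"observedGeneration,omitempty"`

	// +optional
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}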

Anything else you would like to add?

#5472 was opened a while ago and is strongly related, but it has a much more generic/wider scope than what I describe here, which focuses only on making the InProgress/ready state free of this resource-creation race condition.

Label(s) to be applied

/kind feature
/area api

@k8s-ci-robot k8s-ci-robot added the kind/feature and area/api labels Aug 24, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage label Aug 24, 2023
@tmmorin
Author

tmmorin commented Aug 24, 2023

This was discussed on Slack two days ago with @killianmuldoon, who suggested that I file an issue.

@sbueringer
Member

sbueringer commented Aug 28, 2023

@fabriziopandini WDYT?

Seems sort of okay to me, although I'm not a huge fan of doing something like this just for Flux. Is this some sort of standard?

(I assume observedGeneration 0 would work if we didn't use omitempty?)
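
As a small illustration of the omitempty point above (a sketch, not code from this thread): with that tag, a zero value is dropped entirely during serialization, so a consumer like kstatus never sees the field at all.

package main

import (
	"encoding/json"
	"fmt"
)

type demoStatus struct {
	ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}

func main() {
	out, _ := json.Marshal(demoStatus{ObservedGeneration: 0})
	// Prints {}: with omitempty the zero value is omitted, so "observedGeneration: 0"
	// never appears in the serialized object.
	fmt.Println(string(out))
}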

@stefanprodan

Seems sort of okay to me, although I'm not a huge fan of doing something like this just for Flux, is this some sort of standard?

I'm quite intrigued by this statement. Flux is a graduated CNCF project; why wouldn't we want better synergy between CNCF projects?

PS. As @tmmorin mentioned in this issue, this is about Kubernetes kstatus compatibility. Kstatus is a library used in kubectl and also in Flux; maybe other CNCF projects use it too. Docs here: https://github.com/kubernetes-sigs/cli-utils/tree/master/pkg/kstatus

@sbueringer
Member

sbueringer commented Aug 28, 2023

I didn't mean to say that we don't want to be compatible with Flux. I just wondered if there is some standard across the ecosystem (which includes more than just Flux). For example, at the moment Cluster API itself is using Condition types which diverge from the standard k/k conditions (afaik Cluster API and k/k conditions were introduced at roughly the same time). One interesting difference there is that observedGeneration is part of the conditions themselves in k/k (xref: https://github.com/kubernetes/apimachinery/blob/master/pkg/apis/meta/v1/types.go#L1497).

We have been thinking for a while about whether there is a way to make our status / status.conditions more standard, so we can move utils like our patch helper to controller-runtime instead of requiring our own in Cluster API.

It would just be nice if we could find a way for our conditions to align with Kubernetes-ecosystem-wide standards.

I assume the Kubernetes condition type is compatible? (I mean it basically has to be, right? :)).

So setting the observedGeneration default to -1 would be a stopgap to make Cluster API conditions compatible, but wouldn't it be another problem if we want to move to k/k conditions? (Because they don't have a default of -1, as far as I can tell.)

@stefanprodan

The issue here is not with conditions, but with just-created resources that don't have .status set, before the controller gets the chance to set conditions. The Kubernetes built-in resources are handled in kstatus as exceptions: a Deployment without status will not return ready, while a custom resource would. Using the -1 default tells kstatus that a custom resource hasn't yet been processed by a controller. Another way of handling this is by adopting the kstatus conditions and setting the default for Reconciling to True. In Flux we've adopted the kstatus Reconciling/Stalled conditions, and we also set observedGeneration for each condition, but this is a lot of work compared to setting the status observedGeneration to -1.
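
As a rough sketch of that alternative (assuming metav1.Conditions and the k8s.io/apimachinery/pkg/api/meta helpers; this is not Flux or Cluster API code), a controller could mark the object as Reconciling as soon as it picks it up and record the generation it observed on the condition itself:

package controller

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markReconciling sets the kstatus-style Reconciling condition, including the generation
// the controller has observed, so tools can tell that work is still in flight.
func markReconciling(conditions *[]metav1.Condition, observedGeneration int64) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "Reconciling",
		Status:             metav1.ConditionTrue,
		Reason:             "Progressing",
		Message:            "reconciliation in progress",
		ObservedGeneration: observedGeneration,
	})
}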

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale label Jan 27, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 26, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Mar 27, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

sbueringer commented Apr 9, 2025

/reopen
/remove lifecycle rotten

Let's revisit this given the ongoing work on #11947. Especially the work on moving to metav1.Conditions.
(with CAPI v1.11 (August 2025 release) we will be using metav1.Conditions in status.conditions)

I'm not sure if we can entirely follow what is laid out here: https://github.com/kubernetes-sigs/cli-utils/tree/master/pkg/kstatus e.g.:

The conditions defined in the library are designed to adhere to the "abnormal-true" polarity pattern (https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties) , i.e. that conditions are present and with a value of true whenever something unusual happens. So the absence of any conditions means everything is normal.

Cluster API follows https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties which states

Condition type names should make sense for humans; neither positive nor negative polarity can be recommended as a general rule

Example: We have an "Available" condition that is "True" if the Cluster is available.

What is not entirely clear to me is why kstatus considers a CR that has a status subresource with an empty status as fully reconciled. I guess they want to handle the case where a CR has a status but it's valid that even after reconcile the status is entirely empty.

@stefanprodan If we added the default value for status.observedGeneration, what would this improve on the Flux side nowadays? I think you'd still need configuration so Flux knows which condition to look for? (The full list of conditions of the Cluster v1beta2 type can be seen here: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md#cluster-newconditions.)
I'm not sure I understand how generation/observedGeneration relates to the fact that we'd still need some config on the Flux side to specify which condition is the right one to look at (i.e. whether we need one or both of these).

Related: What is the information you are looking for? The v1beta1 Cluster Ready condition doesn't cover much (especially if a Cluster is e.g. upgraded or workers are scaled up)

@JoelSpeed What is your take on setting a default of observedGeneration -1? I couldn't find anything about that in the API conventions.

cc @fabriziopandini

P.S. It was mentioned above that kstatus is used in kubectl. Is there a simple kubectl command I can use with our Cluster CR to play around with this?

@k8s-ci-robot k8s-ci-robot reopened this Apr 9, 2025
@k8s-ci-robot
Contributor

@sbueringer: Reopened this issue.

In response to this:

/reopen
/remove lifecycle rotten

Let's revisit this given the ongoing work on #11947. Especially the work on moving to metav1.Conditions.
(with CAPI v1.11 (August 2025) release we will be using metav1.Conditions in status.conditions)

I'm not sure if we can entirely follow what is laid out here: https://github.com/kubernetes-sigs/cli-utils/tree/master/pkg/kstatus e.g.:

The conditions defined in the library are designed to adhere to the "abnormal-true" polarity pattern (https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties) , i.e. that conditions are present and with a value of true whenever something unusual happens. So the absence of any conditions means everything is normal.

Cluster API follows https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#typical-status-properties which states

Condition type names should make sense for humans; neither positive nor negative polarity can be recommended as a general rule

What is not entirely clear to me is why kstatus considers a CR that has a status subresource with an empty status as fully reconciled. I guess they want to handle the case where a CR has a status but it's valid that even after reconcile the status is entirely empty.

@JoelSpeed What is your take on setting a default of observedGeneration -1? I couldn't find anything about that in the API conventions.

@stefanprodan If we would add the default value, what would this improve on the Flux side nowadays? I think you'll still need configuration so Flux knows which condition to look for? (the full list of conditions of the Cluster v1beta2 type can be seen here: https://github.com/kubernetes-sigs/cluster-api/blob/main/docs/proposals/20240916-improve-status-in-CAPI-resources.md#cluster-newconditions)
Related: What is the information you are looking for? The v1beta1 Cluster Ready condition doesn't cover much (especially if a Cluster is e.g. upgraded or workers are scaled up)

cc @JoelSpeed @fabriziopandini

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority label Apr 9, 2025
@stefanprodan

@sbueringer given that the Ready condition that kstatus looks for is being replaced with Available, we can't rely on kstatus in Flux for ClusterAPI. Flux 2.5 has added custom health checks using CEL, and we've added ClusterAPI to our docs here: https://fluxcd.io/flux/cheatsheets/cel-healthchecks/#cluster

After upgrading to v1beta2 the current health checks will fail, so Flux users will need to update their CEL expression and switch to Available, or write an expression that looks for both.

@sbueringer
Member

sbueringer commented Apr 9, 2025

I think something like this should work

healthCheckExprs:
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    failed: status.conditions.filter(e, e.type == 'Ready' || e.type == 'Available').all(e, e.status == 'False')
    current: status.conditions.filter(e, e.type == 'Ready' || e.type == 'Available').all(e, e.status == 'True')

(Note for other readers (discussed offline): this won't actually use v1beta1, it will use the preferred version, which is why we have to handle both conditions as soon as v1beta2 has been added to the CAPI CRD.)

@JoelSpeed
Contributor

(I assume observedGeneration 0 would work, if we wouldn't use omitempty?)

Why does 0 not work, even with omitempty?

The Kubernetes builtin resources are handled in kstatus as exceptions, a Deployment without status will not return ready, while a custom resource would.

Why would a custom resource without status be considered ready? A lack of status, assuming the resource has a status to set, clearly means the controller hasn't got to it yet

@JoelSpeed What is your take on setting a default of observedGeneration -1? I couldn't find anything about that in the API conventions.

Observed generation already has a sensible default when the object is created: it is omitted, just like the rest of the status. I'm not sure why we would want to set this to a non-obvious default.

From an end-user perspective, a lack of status clearly means "nothing has happened yet", whereas a partial status with a default value looks odd, and it isn't necessarily obvious why there's a partial status.

I'm currently -1 on setting any defaults on status objects.

@sbueringer
Member

sbueringer commented Apr 9, 2025

Thx for the feedback everyone.

given that the Ready condition that kstatus looks for is being replaced with Available we can't rely on kstatus in Flux for ClusterAPI.

I think, based on that, there would no longer be any benefit in setting a default of -1 on observedGeneration.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) May 9, 2025
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
