Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include Condition Reason When Merging Conditions in CAPI v1beta2 #11899

Open
LingyanCao opened this issue Feb 25, 2025 · 4 comments
Open

Include Condition Reason When Merging Conditions in CAPI v1beta2 #11899

LingyanCao opened this issue Feb 25, 2025 · 4 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@LingyanCao
Copy link

LingyanCao commented Feb 25, 2025

What would you like to be added (User Story)?

As a developer, I would like to retain condition Reason when aggregating conditions because it can help us locate the issue by checking the Reason string.

Detailed Description

In CAPI v1beta2, the top-level Cluster resource aggregates unexpected condition statuses from underlying resources. For example, the ControlPlaneMachinesReady condition consolidates the Ready condition of all control plane machines, highlighting any unready machines.

However, when merging condition messages, the Reason field is not included by default. This behavior is defined in the merge strategy implementation.

This omission presents a challenge when diagnosing issues. If an infrastructure provider reports an error via a condition in infra machine resource, only the condition message propagates to the top-level Cluster resource. Consequently, when building an operator that interacts solely with the Cluster resource, we must rely on message parsing and pattern matching to infer the root cause, which is inefficient and error-prone.

By preserving both the Reason and Message fields during condition merging, we can retain structured and concise error information. The short Reason string provides a standardized way to identify failure categories, improving debugging and automation.

Anything else you would like to add?

No response

Label(s) to be applied

/kind feature
/area conditions

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 25, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@fabriziopandini
Copy link
Member

fabriziopandini commented Mar 5, 2025

Some context:

metav1.Condition type (and API conventions) define

  • Message field as a "human readable message indicating details about the transition".
  • Reason field as a "programmatic identifier indicating the reason for the condition's last transition".

The challenge is that while merging many conditions into one like we do in the cluster, we loose some details about the underlying conditions, and this is particularly relevant for the reason field, but also the message have the same issues about loosing information even if we have a little bit more room there and we came up with solutions like multiline messages.


Going back to the request

My first reaction is that, even if we find a way to add reasons to messages, I don't think that interacting with the Cluster solely allows to automatically root cause issues, because merged condition are only designed to try to point to where the problem might be.

A few additional notes:

  • Messages have a weak contract, we keep improving them every release (and we should keep doing it). But this makes messages a poor candidate for building something any external component should rely on
  • Also messages are impacted by the loosing information issue during the merge process, another reason why any external component should not rely on them.
  • I fear that trying to make messages "machine readable" for external components will make them less "human readable", thus compromising the main reason a message exist.

But of course I would like to hear opinion from other maintainers as well...

@sbueringer
Copy link
Member

sbueringer commented Mar 7, 2025

I agree. We discussed this during the implementation of v1beta2 conditions. I don't see a good way to preserve reason when we merge multiple conditions into one. We only have one reason field in the resulting condition and message is not supposed to be machine readable

@LingyanCao
Copy link
Author

LingyanCao commented Mar 11, 2025

Thanks @fabriziopandini and @sbueringer for your reply.

In V1Beta2, the CAPI Cluster CR already reports the status of all underlying resources through conditions. This means we can potentially rely solely on CAPI cluster conditions to diagnose issues. For example, the ControlPlaneMachinesReady condition aggregates the Ready conditions of all control plane Machine CRs. If an infrastructure machine reports an error via its Ready condition, this condition is mirrored in the InfrastructureReady condition of the corresponding CAPI Machine. The CAPI Machine then merges all its conditions (including InfrastructureReady) into its own Ready condition. As a result, the CAPI Cluster’s ControlPlaneMachinesReady condition ultimately reflects the infrastructure error.

With this aggregation, we only need to check the top-level CAPI cluster resource to capture errors. However, when merging conditions, we currently retain only the condition type and message in the following format:

*ConditionType: ConditionMessage

This approach makes it difficult to pinpoint the root cause based on the message alone. If we modify the format to include the condition reason, such as:

*ConditionType-ConditionReason: ConditionMessage

We can more easily identify the underlying issue by matching the simple reason string.

Open Questions:
What is the intended role of aggregated conditions at the cluster level? Are they primarily for customer-facing readability?
If we build an operator based on CAPI, should it rely solely on top-layer CAPI cluster conditions, or is it still recommended to inspect underlying resources (ControlPlane, Machine, Infra Machine) for detailed errors?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.
Projects
None yet
Development

No branches or pull requests

4 participants