KEP-3926: Add beta graduation criteria #5739

Merged
k8s-ci-robot merged 4 commits into kubernetes:master from ibihim:ibihim/2025-12-16_kep-3926-update-beta-requirements
Feb 3, 2026

Conversation

@ibihim
Contributor

@ibihim ibihim commented Dec 16, 2025

Description

Add beta graduation criteria, design principles from SIG API Machinery, and testing requirements

Issue link

#3926

Other comments:

This PR updates KEP-3926 with:

  • Design Principles: Documents the SIG API Machinery consensus on watch cache behavior
    during corrupt object deletion (watch history cannot be preserved, performance
    degradation is acceptable, recovery priorities)

  • Watch Event Propagation and Client Recovery: Explains the deliberate recovery
    sequence when corrupt objects are deleted

  • Alternative Approaches Considered: Documents why shallow object representations
    (DeletedFinalStateUnknown, PartialObjectMetadata, newFunc Object) were
    not pursued

  • Beta Graduation Criteria:

    • Feature enabled by default
    • Dry-run support (#134037)
    • Comprehensive testing requirements including CRD support, bit-flip deserialization,
      and KAS health recovery verification
  • Integration Tests: Updated to reflect both Alpha and Beta testing requirements
    with links to open PRs (#129129, #128726)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) Dec 16, 2025
@k8s-ci-robot k8s-ci-robot added the kind/kep (categorizes KEP tracking issues and PRs modifying the KEP directory), sig/auth (categorizes an issue or PR as relevant to SIG Auth), and size/L (denotes a PR that changes 100-499 lines, ignoring generated files) labels Dec 16, 2025
@ibihim ibihim changed the title from "KEP-3926: Add beta graduation criteria and design principles" to "KEP-3926: Add beta graduation criteria" Dec 16, 2025
@ibihim
Contributor Author

ibihim commented Dec 17, 2025

/sig api-machinery

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery label (categorizes an issue or PR as relevant to SIG API Machinery) Dec 17, 2025
@enj enj added this to SIG Auth Dec 17, 2025
@enj enj moved this to Needs Triage in SIG Auth Dec 17, 2025
@ibihim ibihim force-pushed the ibihim/2025-12-16_kep-3926-update-beta-requirements branch from 9ebd69d to 22825eb December 19, 2025 12:55
@stlaz
Member

stlaz commented Jan 5, 2026

/assign

@stlaz stlaz moved this from Needs Triage to In Review in SIG Auth Jan 5, 2026
Comment thread keps/sig-auth/3926-handling-undecryptable-resources/README.md Outdated
Comment thread keps/sig-auth/3926-handling-undecryptable-resources/README.md Outdated
Comment thread keps/sig-auth/3926-handling-undecryptable-resources/README.md Outdated
Comment thread keps/sig-auth/3926-handling-undecryptable-resources/README.md Outdated
@benluddy
Contributor

- list error aggregation
- delete handler with unsafe deletion flow
- deserialization failure via bit-flip (not just encryption failure)
- deletion works for CRDs
Contributor

To be clear, is this referring to deletion of CRD-backed custom resources or deletion of CRDs themselves? If the CRDs themselves: since the CRD finalizer will be bypassed, is there anything we can/should do about "latent" CRs?

Contributor

@sttts sttts Jan 14, 2026

CRs, right.

CRDs without the finalizer being followed will just disappear and leak their CRs. Recreation will make them visible again. I guess similar behaviour applies for other finalizers.

Contributor Author

Yes, sorry. I meant CRs, not CRDs.

Good catch!

Comment thread keps/sig-auth/3926-handling-undecryptable-resources/README.md
Comment thread keps/sig-auth/3926-handling-undecryptable-resources/README.md

3. **Recovery priorities** (in order):
- Remove the corrupt data from storage
- Restore kube-apiserver to a healthy state
Member

Who or what action is supposed to restore KAS to a healthy state? Or is the expectation that it reaches this state automatically?

Contributor Author

An admin who has been enabled to delete all corrupt objects brings the kube-apiserver back into a healthy state.

Member

It would make sense to clarify this in the KEP.

Member

I'm also a little confused about what additional steps are required beyond deleting the corrupt object ... doesn't KAS return to a healthy state automatically once that is done?

Contributor Author

@ibihim ibihim Jan 15, 2026

The admin deletes corrupt objects.
Then, without any further manual intervention, the kube-apiserver and the informers come back to a healthy state.

Contributor Author

I updated the paragraph to properly describe that we want to enable the Admin to do the recovery by deleting the corrupt objects and that as a result kube-apiserver and the informer will recover.

3. **Recovery priorities** (in order):
- Remove the corrupt data from storage
- Restore kube-apiserver to a healthy state
- Allow clients to converge eventually
Member

Does that mean clients will simply retry?

Contributor Author

Converge to the new state without a corrupt object. Yes, they will retry to rebuild their cache.

Member

In my opinion, it would make sense to clarify it in the KEP.

Contributor Author

Yes, in particular the distinction between "action" and "automation". Good point!

Member

my understanding is that the current implementation of client-go informers will do this automatically, correct?

Contributor Author

Yes, they will do so automatically. Only the corrupt objects need to be deleted.
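
The recovery flow discussed in this thread (the admin deletes the corrupt objects, then clients rebuild their caches by retrying) can be illustrated with a toy simulation. This is only a sketch under stated assumptions: `fakeServer`, `list`, `remove`, and `resync` are hypothetical names invented for the example, and nothing here uses client-go or the real kube-apiserver code paths.

```go
package main

import "fmt"

// fakeServer is a hypothetical in-memory stand-in for the storage behind the
// API server. It is not a real Kubernetes type.
type fakeServer struct {
	objects map[string]string
	corrupt map[string]bool
}

// list fails as long as at least one stored object is marked corrupt,
// mimicking a LIST that cannot read one of its objects.
func (s *fakeServer) list() (map[string]string, error) {
	for key := range s.objects {
		if s.corrupt[key] {
			return nil, fmt.Errorf("cannot read object %q", key)
		}
	}
	out := make(map[string]string, len(s.objects))
	for key, value := range s.objects {
		out[key] = value
	}
	return out, nil
}

// remove deletes an object without reading it first, loosely analogous to
// the unsafe deletion performed by the admin.
func (s *fakeServer) remove(key string) {
	delete(s.objects, key)
	delete(s.corrupt, key)
}

// resync retries list up to attempts times and returns the rebuilt cache.
// onFailure runs after each failed attempt; in this sketch it stands in for
// whatever happens between retries (here, the admin action).
func resync(s *fakeServer, attempts int, onFailure func()) (map[string]string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		cache, err := s.list()
		if err == nil {
			return cache, nil
		}
		lastErr = err
		onFailure()
	}
	return nil, fmt.Errorf("gave up after %d attempts: %v", attempts, lastErr)
}

func main() {
	s := &fakeServer{
		objects: map[string]string{"a": "ok", "b": "garbage"},
		corrupt: map[string]bool{"b": true},
	}
	// The only manual step is deleting the corrupt object; the retry loop
	// then converges on the new state without further intervention.
	cache, err := resync(s, 3, func() { s.remove("b") })
	fmt.Println(cache, err)
}
```

The point of the sketch is the division of labor described above: the deletion is the "action", and the cache rebuild is the "automation".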

cannot transform or decode the object's previous value. This triggers a
deliberate recovery sequence:

1. **Error Detection**: The etcd3 watcher fails to transform/decode the deleted
Member

Just an idea: it would be fantastic to have a flow diagram for this.

Beta:
- list error aggregation
- delete handler with unsafe deletion flow
- deserialization failure via bit-flip (not just encryption failure)
Contributor

What does this mean? How is that different from an encryption failure? Do we store a checksum?

Contributor Author

@ibihim ibihim Jan 15, 2026

These test different code paths:

  • encryption failure occurs in the transformer layer
  • bit-flip corruption causes deserialization to fail after the transformer succeeds (or when no transformer is configured).

Both should produce the same StatusReasonStoreReadError ("should" because this is not tested yet).

error will be truncated

Beta:
- list error aggregation
Contributor

... to handle multiple/many objects being corrupted.

Right?

Contributor Author

Yes, to be able to have an overview of how many objects are actually affected.

Currently we return just the first corrupt object and exit.
If we are able to get a list of up to 100 objects, this will help us better understand the scope of the effort.

Contributor Author

@ibihim ibihim Jan 15, 2026

I will also rename it to avoid the word "aggregation", as that would imply an n -> 1 operation.

@ibihim ibihim force-pushed the ibihim/2025-12-16_kep-3926-update-beta-requirements branch from 7c9f844 to 8a09dbd January 15, 2026 17:40
@soltysh
Contributor

soltysh commented Jan 28, 2026

While looking at #5645 I also read this document, to ensure I have a better understanding of the changes. Although I will repeat the same thing I wrote on slack, I'd squash both of these PRs (this and #5645) into one 😉 . Other than that this part looks ok from PRR perspective, modulo the comments I left in #5645, but those are handled in the other PR.

- Rename sections for clarity (Current state → Background, etc.)
- Add Implementation Considerations section:
  - Watch event propagation and client recovery flow
  - Design principles agreed with sig-api-machinery
  - Alternative approaches considered (DeletedFinalStateUnknown,
    PartialObjectMetadata, Type Identity Object)
- Restructure integration tests section with Alpha/Beta split
@ibihim ibihim force-pushed the ibihim/2025-12-16_kep-3926-update-beta-requirements branch from 8a09dbd to 8ac9919 January 29, 2026 14:23
Adds Production Readiness Review responses for beta promotion:
- Feature enablement/rollback documentation
- Monitoring requirements with metrics
- Scalability considerations
- Troubleshooting guidance
- Test plan with integration test references
- Explain why integration tests are used instead of e2e
- Add test links with feature gate toggle line numbers
- Fill Upgrade/Downgrade Strategy section
- Fill Version Skew Strategy section
- Expand rollout/rollback failure explanation
- Answer Troubleshooting section questions
- Add Implementation History with alpha/beta milestones
- Add Drawbacks section
- Add Alternatives section
@ibihim
Contributor Author

ibihim commented Jan 29, 2026

> While looking at #5645 I also read this document, to ensure I have a better understanding of the changes. Although I will repeat the same thing I wrote on slack, I'd squash both of these PRs (this and #5645) into one 😉 . Other than that this part looks ok from PRR perspective, modulo the comments I left in #5645, but those are handled in the other PR.

Hopefully I didn't screw up 😅

Contributor

@soltysh soltysh left a comment

/approve
the prr section

@ibihim
Contributor Author

ibihim commented Jan 30, 2026

@liggitt, @sttts

can you take another look?

@stlaz stlaz mentioned this pull request Feb 2, 2026
13 tasks
@liggitt
Member

liggitt commented Feb 3, 2026

updates lgtm and match what was discussed in api-machinery in https://docs.google.com/document/d/1x9RNaaysyO0gXHIr1y50QFbiL1x8OWnk2v3XnrdkT5Y/edit?tab=t.0#bookmark=id.22khrlurmv4z

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Feb 3, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ibihim, liggitt, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Feb 3, 2026
@k8s-ci-robot k8s-ci-robot merged commit 3b6bae6 into kubernetes:master Feb 3, 2026
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Feb 3, 2026
@github-project-automation github-project-automation Bot moved this from In Review to Closed / Done in SIG Auth Feb 3, 2026
Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
kind/kep: Categorizes KEP tracking issues and PRs modifying the KEP directory.
lgtm: "Looks good to me", indicates that a PR is ready to be merged.
sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
sig/auth: Categorizes an issue or PR as relevant to SIG Auth.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.

10 participants