KEP-3926: Add beta graduation criteria (#5739)
Conversation
/sig api-machinery
9ebd69d to 22825eb
/assign
xref sig-api-machinery discussion: https://docs.google.com/document/d/1x9RNaaysyO0gXHIr1y50QFbiL1x8OWnk2v3XnrdkT5Y/edit?tab=t.0
- list error aggregation
- delete handler with unsafe deletion flow
- deserialization failure via bit-flip (not just encryption failure)
- deletion works for CRDs
To be clear, is this referring to deletion of CRD-backed custom resources or deletion of CRDs themselves? If the CRDs themselves: since the CRD finalizer will be bypassed, is there anything we can/should do about "latent" CRs?
CRs, right.
CRDs without the finalizer being followed will just disappear and leak their CRs. Recreation will make them visible again. I guess similar behaviour applies for other finalizers.
Yes, sorry. I meant CRs, not CRDs.
Good catch!
3. **Recovery priorities** (in order):
   - Remove the corrupt data from storage
   - Restore kube-apiserver to a healthy state
Who or what is supposed to restore the kube-apiserver to a healthy state? Or is the expectation that it returns to this state automatically?
An admin, once enabled to delete all corrupt objects, brings the kube-apiserver back into a healthy state.
It would make sense to clarify this in the KEP.
I'm also a little confused about what additional steps are required beyond deleting the corrupt object ... doesn't KAS return to a healthy state automatically once that is done?
The admin deletes corrupt objects.
Then, without any further manual intervention, the kube-apiserver and the informers come back to a healthy state.
I updated the paragraph to describe that we want to enable the admin to perform the recovery by deleting the corrupt objects, and that, as a result, the kube-apiserver and the informers will recover.
3. **Recovery priorities** (in order):
   - Remove the corrupt data from storage
   - Restore kube-apiserver to a healthy state
   - Allow clients to converge eventually
Does that mean clients will simply retry?
Converge to the new state without the corrupt object. Yes, they will retry to rebuild their cache.
In my opinion, it would make sense to clarify it in the KEP.
Yes, in particular the distinction between "action" and "automation". Good point!
my understanding is that the current implementation of client-go informers will do this automatically, correct?
Yes, they will do so automatically. Only the corrupt objects need to be deleted.
cannot transform or decode the object's previous value. This triggers a
deliberate recovery sequence:

1. **Error Detection**: The etcd3 watcher fails to transform/decode the deleted
Just an idea: it would be fantastic to have a flow diagram for this.
Beta:
- list error aggregation
- delete handler with unsafe deletion flow
- deserialization failure via bit-flip (not just encryption failure)
What does this mean? How is it different? Do we store a checksum?
These test different code paths:
- encryption failure occurs in the transformer layer
- bit-flip corruption causes deserialization to fail after the transformer succeeds (or when no transformer is configured).
Both "should" produce the same StatusReasonStoreReadError ("should" because this is not tested yet).
error will be truncated

Beta:
- list error aggregation
... to handle multiple/many objects being corrupted. Right?
Yes, to be able to have an overview of how many objects are actually affected.
Currently we return just the first corrupt object and exit.
If we are able to get a list of up to 100 objects, this will help us understand better the scope of the effort.
I will also rename it to avoid the term "aggregation", as that would suggest an n -> 1 operation.
7c9f844 to 8a09dbd
While looking at #5645 I also read this document, to ensure I have a better understanding of the changes. Although I will repeat the same thing I wrote on Slack, I'd squash both of these PRs (this and #5645) into one 😉. Other than that, this part looks OK from a PRR perspective, modulo the comments I left in #5645, but those are handled in the other PR.
- Rename sections for clarity (Current state → Background, etc.)
- Add Implementation Considerations section:
- Watch event propagation and client recovery flow
- Design principles agreed with sig-api-machinery
- Alternative approaches considered (DeletedFinalStateUnknown,
PartialObjectMetadata, Type Identity Object)
- Restructure integration tests section with Alpha/Beta split
8a09dbd to 8ac9919
Adds Production Readiness Review responses for beta promotion:
- Feature enablement/rollback documentation
- Monitoring requirements with metrics
- Scalability considerations
- Troubleshooting guidance
- Test plan with integration test references
- Explain why integration tests are used instead of e2e
- Add test links with feature gate toggle line numbers
- Fill Upgrade/Downgrade Strategy section
- Fill Version Skew Strategy section
- Expand rollout/rollback failure explanation
- Answer Troubleshooting section questions
- Add Implementation History with alpha/beta milestones
- Add Drawbacks section
- Add Alternatives section
Hopefully I didn't screw up 😅
soltysh left a comment
/approve
the prr section
updates lgtm and match what was discussed in api-machinery in https://docs.google.com/document/d/1x9RNaaysyO0gXHIr1y50QFbiL1x8OWnk2v3XnrdkT5Y/edit?tab=t.0#bookmark=id.22khrlurmv4z /lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ibihim, liggitt, soltysh

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Description
Add beta graduation criteria, design principles from SIG API Machinery, and testing requirements
Issue link
#3926
Other comments:
This PR updates KEP-3926 with:
Design Principles: Documents the SIG API Machinery consensus on watch cache behavior
during corrupt object deletion (watch history cannot be preserved, performance
degradation is acceptable, recovery priorities)
Watch Event Propagation and Client Recovery: Explains the deliberate recovery
sequence when corrupt objects are deleted
Alternative Approaches Considered: Documents why shallow object representations
(DeletedFinalStateUnknown, PartialObjectMetadata, newFunc object) were not pursued
Beta Graduation Criteria:
and KAS health recovery verification
Integration Tests: Updated to reflect both Alpha and Beta testing requirements
with links to open PRs (#129129, #128726)