KEP-3926: Add beta graduation criteria (#5739)
Conversation
/sig api-machinery
9ebd69d to 22825eb
/assign
xref sig-api-machinery discussion: https://docs.google.com/document/d/1x9RNaaysyO0gXHIr1y50QFbiL1x8OWnk2v3XnrdkT5Y/edit?tab=t.0
- list error aggregation
- delete handler with unsafe deletion flow
- deserialization failure via bit-flip (not just encryption failure)
- deletion works for CRDs
To be clear, is this referring to deletion of CRD-backed custom resources or deletion of CRDs themselves? If the CRDs themselves: since the CRD finalizer will be bypassed, is there anything we can/should do about "latent" CRs?
CRs, right.
CRDs without the finalizer being followed will just disappear and leak their CRs. Recreation will make them visible again. I guess similar behaviour applies for other finalizers.
Yes, sorry. I meant CRs, not CRDs.
Good catch!
3. **Recovery priorities** (in order):
   - Remove the corrupt data from storage
   - Restore kube-apiserver to a healthy state
Who or what is supposed to restore the kube-apiserver to a healthy state? Or is the expectation that it returns to this state automatically?
An admin, once enabled to delete all corrupt objects, brings the kube-apiserver back into a healthy state.
It would make sense to clarify this in the KEP.
I'm also a little confused about what additional steps are required beyond deleting the corrupt object ... doesn't KAS return to a healthy state automatically once that is done?
The admin deletes corrupt objects.
Then, without any further manual intervention, the kube-apiserver and the informers come back to a healthy state.
I updated the paragraph to describe that we want to enable the admin to perform the recovery by deleting the corrupt objects, and that, as a result, the kube-apiserver and the informers will recover.
3. **Recovery priorities** (in order):
   - Remove the corrupt data from storage
   - Restore kube-apiserver to a healthy state
   - Allow clients to converge eventually
Does that mean clients will simply retry?
Converge to the new state without the corrupt object. Yes, they will retry to rebuild their cache.
In my opinion, it would make sense to clarify it in the KEP.
Yes, in particular the distinction between "action" and "automation". Good point!
my understanding is that the current implementation of client-go informers will do this automatically, correct?
Yes, they will do so automatically. Only the corrupt objects need to be deleted.
cannot transform or decode the object's previous value. This triggers a
deliberate recovery sequence:

1. **Error Detection**: The etcd3 watcher fails to transform/decode the deleted
Just an idea: it would be fantastic to have a flow diagram for this.
Beta:
- list error aggregation
- delete handler with unsafe deletion flow
- deserialization failure via bit-flip (not just encryption failure)
What does this mean? How is it different? Do we store a checksum?
These test different code paths:
- encryption failure occurs in the transformer layer
- bit-flip corruption causes deserialization to fail after the transformer succeeds (or when no transformer is configured).
Both "should" produce the same StatusReasonStoreReadError ("should" because this is not tested yet).
error will be truncated

Beta:
- list error aggregation
... to handle multiple/many objects being corrupted. Right?
Yes, to be able to have an overview of how many objects are actually affected.
Currently we return just the first corrupt object and exit.
If we are able to get a list of up to 100 objects, this will help us understand better the scope of the effort.
I will also rename it to avoid the term "aggregation", as that would suggest an n -> 1 operation.
7c9f844 to 8a09dbd
While looking at #5645 I also read this document, to ensure I have a better understanding of the changes. Although I will repeat the same thing I wrote on Slack, I'd squash both of these PRs (this and #5645) into one 😉. Other than that, this part looks OK from a PRR perspective, modulo the comments I left in #5645, but those are handled in the other PR.
- Rename sections for clarity (Current state → Background, etc.)
- Add Implementation Considerations section:
- Watch event propagation and client recovery flow
- Design principles agreed with sig-api-machinery
- Alternative approaches considered (DeletedFinalStateUnknown,
PartialObjectMetadata, Type Identity Object)
- Restructure integration tests section with Alpha/Beta split
8a09dbd to 8ac9919
Adds Production Readiness Review responses for beta promotion:
- Feature enablement/rollback documentation
- Monitoring requirements with metrics
- Scalability considerations
- Troubleshooting guidance
- Test plan with integration test references
- Explain why integration tests are used instead of e2e
- Add test links with feature gate toggle line numbers
- Fill Upgrade/Downgrade Strategy section
- Fill Version Skew Strategy section
- Expand rollout/rollback failure explanation
- Answer Troubleshooting section questions
- Add Implementation History with alpha/beta milestones
- Add Drawbacks section
- Add Alternatives section
Hopefully I didn't screw up 😅
soltysh left a comment
/approve
the prr section
updates lgtm and match what was discussed in api-machinery in https://docs.google.com/document/d/1x9RNaaysyO0gXHIr1y50QFbiL1x8OWnk2v3XnrdkT5Y/edit?tab=t.0#bookmark=id.22khrlurmv4z /lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ibihim, liggitt, soltysh

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Description
Add beta graduation criteria, design principles from SIG API Machinery, and testing requirements
Issue link
#3926
Other comments:
This PR updates KEP-3926 with:
Design Principles: Documents the SIG API Machinery consensus on watch cache behavior
during corrupt object deletion (watch history cannot be preserved, performance
degradation is acceptable, recovery priorities)
Watch Event Propagation and Client Recovery: Explains the deliberate recovery
sequence when corrupt objects are deleted
Alternative Approaches Considered: Documents why shallow object representations
(DeletedFinalStateUnknown, PartialObjectMetadata, newFunc object) were not pursued
Beta Graduation Criteria:
and KAS health recovery verification
Integration Tests: Updated to reflect both Alpha and Beta testing requirements
with links to open PRs (#129129, #128726)