KEP-2570: update v1.36 memory protection details #5977
QiWang19 wants to merge 1 commit into kubernetes:master
Conversation
QiWang19
commented
Mar 24, 2026
- One-line PR description: update v1.36 MemoryQoS cgroup v2 memory protection details
- Issue link: Support memory qos with cgroups v2 #2570
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: QiWang19. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files; approvers can indicate their approval by writing a comment.
Hi @QiWang19. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@sohankunkerkar PTAL
```diff
 ### Future Considerations (Beta candidates)

-1. **Tiered memory protection (memory.low for Burstable/BestEffort)**: Currently, `memory.min` (hard protection) is used for all QoS classes. If Alpha v3 feedback or benchmarks show that padded requests cause excessive OOM thrash, implement a tiered approach for Beta: `memory.min` for Guaranteed pods, `memory.low` (soft protection) for Burstable/BestEffort. This allows the kernel to reclaim unused memory under pressure while still providing best-effort protection. Tracked in [kubernetes/kubernetes#131077](https://github.com/kubernetes/kubernetes/issues/131077).
+1. **Tiered memory protection follow-up (BestEffort)**: Alpha v1.36 already uses `memory.min` for Guaranteed and `memory.low` for Burstable. A potential follow-up for later phases is evaluating whether BestEffort-specific `memory.low` memory protection should be added. Tracked in [kubernetes/kubernetes#131077](https://github.com/kubernetes/kubernetes/issues/131077)
```
You can remove this part completely. BestEffort pods have no memory request (`requests.memory` is not set), so there's nothing to protect: `memory.min` and `memory.low` are set from `requests.memory`, and BestEffort has zero requests, so no protection.
Removed the whole "Future Considerations (Beta candidates)" section. The benchmark testing is already properly documented in the "Beta Graduation" section.
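For illustration, the per-QoS protection rule discussed above can be sketched in Go. This is a minimal sketch, not the actual kubelet code: `podMemoryProtection` is a hypothetical helper that maps a pod's QoS class and summed container memory requests to its pod-cgroup `memory.min`/`memory.low` values. It also shows why BestEffort needs no branch of its own: its request sum is always zero.

```go
package main

import "fmt"

// QoSClass mirrors the Kubernetes pod QoS classes.
type QoSClass string

const (
	Guaranteed QoSClass = "Guaranteed"
	Burstable  QoSClass = "Burstable"
	BestEffort QoSClass = "BestEffort"
)

// podMemoryProtection is a hypothetical helper: given a pod's QoS class and
// the sum of its container memory requests (bytes), it returns the values
// to write into the pod cgroup's memory.min and memory.low files.
func podMemoryProtection(qos QoSClass, sumRequests int64) (memMin, memLow int64) {
	switch qos {
	case Guaranteed:
		return sumRequests, 0 // hard protection via memory.min
	case Burstable:
		return 0, sumRequests // soft protection via memory.low
	default:
		// BestEffort pods have no memory requests, so sumRequests is 0:
		// there is nothing to protect.
		return 0, 0
	}
}

func main() {
	min, low := podMemoryProtection(Guaranteed, 512<<20)
	fmt.Println(min, low) // 536870912 0
	min, low = podMemoryProtection(BestEffort, 0)
	fmt.Println(min, low) // 0 0
}
```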
```
/cgroup2/kubepods/pod<UID>/memory.min=sum(pod.spec.containers[i].resources.requests[memory]) // Guaranteed
/cgroup2/kubepods/burstable/pod<UID>/memory.low=sum(pod.spec.containers[i].resources.requests[memory]) // Burstable
// QoS ancestor cgroup
/cgroup2/kubepods/burstable/memory.min=sum(pod[i].spec.containers[j].resources.requests[memory])
```
It should be:

```
// QoS ancestor cgroups
/cgroup2/kubepods/memory.min=sum(guaranteed_requests + burstable_requests) // parent covers all children
/cgroup2/kubepods/burstable/memory.low=sum(burstable_pod_requests)         // soft protection
/cgroup2/kubepods/burstable/memory.min=0                                   // no hard protection for burstable
```
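A small Go sketch can make the ancestor arithmetic concrete. The per-pod request figures here are hypothetical, and `sum` is an illustrative helper; the comments mirror the cgroup file paths from the suggestion above, where the `kubepods` parent must carry `memory.min` covering all protected children.

```go
package main

import "fmt"

// sum adds up per-pod memory request totals (bytes).
func sum(requests []int64) int64 {
	var s int64
	for _, r := range requests {
		s += r
	}
	return s
}

func main() {
	// Hypothetical request sums per pod, grouped by QoS class.
	guaranteed := []int64{1 << 30, 512 << 20}  // two Guaranteed pods
	burstable := []int64{256 << 20, 128 << 20} // two Burstable pods

	gSum, bSum := sum(guaranteed), sum(burstable)

	// /cgroup2/kubepods/memory.min: parent covers all protected children.
	fmt.Println("kubepods/memory.min =", gSum+bSum)
	// /cgroup2/kubepods/burstable/memory.low: soft protection only.
	fmt.Println("kubepods/burstable/memory.low =", bSum)
	// /cgroup2/kubepods/burstable/memory.min: no hard protection.
	fmt.Println("kubepods/burstable/memory.min =", 0)
}
```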
```diff
-- Upgrade: Enabling MemoryQoS on a running kubelet correctly sets memory.min/memory.high on new pods and updates node-level cgroups
-- Rollback: Disabling MemoryQoS resets memory.min to 0 and memory.high to max for all managed cgroups
+- Upgrade: Enabling `MemoryQoS`, `memoryReservationPolicy: TieredReservation` on a running kubelet correctly sets `memory.min`/`memory.low`/`memory.high` on new pods and updates node-level cgroups
+- Rollback: Disabling MemoryQoS reconciles QoS class level memory.min/memory.low to 0.
```
You might need to add this: pod-level memory.min/memory.low values are not reset during rollback because reconcilePodMemoryProtection is gated behind the MemoryQoS feature gate. When the gate is off, the reconciler doesn't run, so per-pod values persist. However, they become ineffective because the parent QoS cgroup has memory.min=0/memory.low=0 (kernel cgroup hierarchy rule: a child's protection can't exceed its parent's).
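The hierarchy rule behind "stale but ineffective" pod values can be illustrated with a simplified Go sketch. `effectiveMin` is a hypothetical helper modeling only the cap (the real kernel additionally distributes a parent's protection proportionally among competing children): once the QoS parent is reconciled to 0, a leftover per-pod `memory.min` provides no protection.

```go
package main

import "fmt"

// effectiveMin models, in simplified form, the cgroup v2 rule that a
// child's memory protection cannot exceed what its parent propagates.
func effectiveMin(parentMin, childMin int64) int64 {
	if childMin > parentMin {
		return parentMin
	}
	return childMin
}

func main() {
	// After rollback: QoS parent reconciled to 0, pod value left stale.
	fmt.Println(effectiveMin(0, 512<<20)) // stale pod value is inert
	// Before rollback: parent covers the pod's request, so the pod's
	// own memory.min takes full effect.
	fmt.Println(effectiveMin(1<<30, 512<<20))
}
```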
@sohankunkerkar For the review #5977 (comment), are you also suggesting we drop this line 506?
Signed-off-by: Qi Wang <qiwan@redhat.com>
Force-pushed from 1e9a94e to 6eef21a