Commit 145a141

KEP 1287: Instrumentation for in-place pod resize
1 parent 047426d commit 145a141

1 file changed

+74
-0
lines changed
  • keps/sig-node/1287-in-place-update-pod-resources

keps/sig-node/1287-in-place-update-pod-resources/README.md

Lines changed: 74 additions & 0 deletions
@@ -881,6 +881,80 @@ Other components:
* check how the change of meaning of resource requests influence other
Kubernetes components.

### Instrumentation

The kubelet will record the following metrics:

#### `kubelet_pod_resize_requests_total`

This metric tracks the total number of resize requests observed by the kubelet, counted at the pod level.
A single pod update that changes multiple containers counts as a single resize request.

Labels:
- `resource_type` - the type of resource being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types changes in the resize request,
  the counter is incremented once for each type. This means that a single pod update changing multiple
  resource types counts as multiple requests for this metric.
- `operation_type` - whether the resize is a net increase or a net decrease (taken as an aggregate across
  all containers in the pod). Possible values: `increase`, `decrease`, `add`, or `remove`.

This metric is recorded as a counter.

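The pod-level counting rule can be sketched as follows. This is a minimal illustration, not the kubelet's implementation: the `containerResources` and `labelPair` types and the `podResizeLabels` function are hypothetical, quantities are plain integers, and the sketch assumes that a zero net change is not counted. The `add` and `remove` operations are left out because the text does not define them.

```go
package main

import "fmt"

// resourceTypes lists the resource_type label values named in the text.
var resourceTypes = []string{"cpu_limits", "cpu_requests", "memory_limits", "memory_requests"}

// containerResources maps a resource_type value to a quantity in arbitrary
// integer units. Hypothetical type, not the Kubernetes API.
type containerResources map[string]int64

// labelPair is one (resource_type, operation_type) label combination.
type labelPair struct {
	ResourceType  string
	OperationType string
}

// podResizeLabels returns the label pairs whose counter would be incremented
// for one pod update. oldPod and newPod must list the same containers in the
// same order. The direction is the net change summed across all containers.
func podResizeLabels(oldPod, newPod []containerResources) []labelPair {
	var pairs []labelPair
	for _, rt := range resourceTypes {
		var net int64
		for i := range oldPod {
			net += newPod[i][rt] - oldPod[i][rt]
		}
		switch {
		case net > 0:
			pairs = append(pairs, labelPair{rt, "increase"})
		case net < 0:
			pairs = append(pairs, labelPair{rt, "decrease"})
		}
		// Assumption: a zero net change is skipped; add/remove are not modeled.
	}
	return pairs
}

func main() {
	oldPod := []containerResources{
		{"cpu_requests": 100, "memory_limits": 512},
		{"cpu_requests": 200, "memory_limits": 512},
	}
	newPod := []containerResources{
		{"cpu_requests": 300, "memory_limits": 256},
		{"cpu_requests": 100, "memory_limits": 512},
	}
	// Two containers change cpu_requests, but the pod-level metric sees one
	// net increase; memory_limits is counted separately as a decrease.
	for _, p := range podResizeLabels(oldPod, newPod) {
		fmt.Printf("resource_type=%s operation_type=%s\n", p.ResourceType, p.OperationType)
	}
}
```

In this example the update touches two containers and two resource types, so the pod-level counter is incremented twice: once per resource type, regardless of how many containers changed.
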
#### `kubelet_container_resize_requests_total`

This metric tracks the total number of resize requests observed by the kubelet, counted at the container level.
A single pod update that changes multiple containers counts as a separate resize request for each container.

Labels:
- `resource_type` - the type of resource being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types changes in the resize request,
  the counter is incremented once for each type. This means that a single pod update changing multiple
  resource types counts as multiple requests for this metric.
- `operation_type` - whether the resize is an increase or a decrease. Possible values: `increase`, `decrease`, `add`, or `remove`.

This metric is recorded as a counter.

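For contrast with the pod-level metric, here is a minimal sketch of container-level counting. The types and the `containerResizeCounts` function are hypothetical and only illustrate the rule described above; `add` and `remove` are again not modeled because the text does not define them.

```go
package main

import "fmt"

// resourceTypes lists the resource_type label values named in the text.
var resourceTypes = []string{"cpu_limits", "cpu_requests", "memory_limits", "memory_requests"}

// containerResources maps a resource_type value to a quantity in arbitrary
// integer units. Hypothetical type, not the Kubernetes API.
type containerResources map[string]int64

// labelPair is one (resource_type, operation_type) label combination.
type labelPair struct {
	ResourceType  string
	OperationType string
}

// containerResizeCounts tallies one increment per container and per changed
// resource type. oldPod and newPod must list the same containers in order.
func containerResizeCounts(oldPod, newPod []containerResources) map[labelPair]int {
	counts := make(map[labelPair]int)
	for i := range oldPod {
		for _, rt := range resourceTypes {
			switch delta := newPod[i][rt] - oldPod[i][rt]; {
			case delta > 0:
				counts[labelPair{rt, "increase"}]++
			case delta < 0:
				counts[labelPair{rt, "decrease"}]++
			}
		}
	}
	return counts
}

func main() {
	oldPod := []containerResources{
		{"cpu_requests": 100, "memory_limits": 512},
		{"cpu_requests": 100, "memory_limits": 512},
	}
	newPod := []containerResources{
		{"cpu_requests": 200, "memory_limits": 512},
		{"cpu_requests": 300, "memory_limits": 256},
	}
	counts := containerResizeCounts(oldPod, newPod)
	// Both containers raise cpu_requests, so that label pair is counted twice.
	fmt.Println(counts[labelPair{"cpu_requests", "increase"}])
	fmt.Println(counts[labelPair{"memory_limits", "decrease"}])
}
```
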
#### `kubelet_pod_resize_sli_duration_seconds`

This metric tracks the latency between when the kubelet accepts a resize request and when it finishes actuating
the request. More precisely, this metric tracks the total amount of time that the `PodResizeInProgress` condition
is present on a pod.

Labels:
- `resource_type` - the type of resource being resized. Possible values: `cpu_limits`, `cpu_requests`, `memory_limits`, or `memory_requests`. If more than one of these resource types changes in the resize request,
  the duration is recorded once for each type.
- `operation_type` - whether the resize is an increase or a decrease. Possible values: `increase`, `decrease`, `add`, or `remove`.

This metric is recorded as a gauge.

#### `kubelet_pod_infeasible_resize_total`

This metric tracks the total count of resize requests that the kubelet marks as infeasible. This will make it
easier for us to see which of the current limitations users run into most often.

Labels:
- `reason` - why the resize is infeasible. Although a more detailed "reason" will be provided in the `PodResizePending`
  condition in the pod, we limit this label to only the following possible values to keep cardinality low:
  - `guaranteed_pod_cpu_manager_static_policy` - In-place resize is not supported for Guaranteed Pods alongside CPU Manager static policy.
  - `guaranteed_pod_memory_manager_static_policy` - In-place resize is not supported for Guaranteed Pods alongside Memory Manager static policy.
  - `static_pod` - In-place resize is not supported for static pods.
  - `swap_limitation` - In-place resize is not supported for containers with swap.
  - `node_capacity` - The node doesn't have enough capacity for this resize request.

This list of possible reasons may shrink or grow depending on limitations that are added or removed in the future.

This metric is recorded as a counter.

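Keeping the `reason` label to a fixed set of values can be sketched with a typed constant set and a counter that rejects anything else. The `infeasibleReason` type, the constant names, and `infeasibleCounter` are hypothetical; only the label values themselves come from the list above.

```go
package main

import "fmt"

// infeasibleReason is a low-cardinality label value for the metric.
type infeasibleReason string

const (
	reasonCPUManagerStatic    infeasibleReason = "guaranteed_pod_cpu_manager_static_policy"
	reasonMemoryManagerStatic infeasibleReason = "guaranteed_pod_memory_manager_static_policy"
	reasonStaticPod           infeasibleReason = "static_pod"
	reasonSwapLimitation      infeasibleReason = "swap_limitation"
	reasonNodeCapacity        infeasibleReason = "node_capacity"
)

// allowedReasons is the closed set of label values listed in the text.
var allowedReasons = map[infeasibleReason]bool{
	reasonCPUManagerStatic:    true,
	reasonMemoryManagerStatic: true,
	reasonStaticPod:           true,
	reasonSwapLimitation:      true,
	reasonNodeCapacity:        true,
}

// infeasibleCounter is an in-memory stand-in for the real counter metric.
type infeasibleCounter struct {
	counts map[infeasibleReason]int
}

func newInfeasibleCounter() *infeasibleCounter {
	return &infeasibleCounter{counts: make(map[infeasibleReason]int)}
}

// inc increments the counter for a known reason and rejects any other value,
// so free-form condition messages can never become label values.
func (c *infeasibleCounter) inc(reason infeasibleReason) error {
	if !allowedReasons[reason] {
		return fmt.Errorf("unknown infeasible reason %q", reason)
	}
	c.counts[reason]++
	return nil
}

func main() {
	c := newInfeasibleCounter()
	_ = c.inc(reasonNodeCapacity)
	_ = c.inc(reasonNodeCapacity)
	err := c.inc("pod needs 3 more CPUs than the node has") // detailed text is rejected
	fmt.Println(c.counts[reasonNodeCapacity], err != nil)    // prints "2 true"
}
```
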
#### `kubelet_pod_deferred_resize_accepted_total`

This metric tracks the total number of resize requests that the kubelet originally marked as deferred but
later accepted. It exists primarily to detect deferred resizes that are accepted through the timed retry
rather than by an explicit signal, because that indicates an issue in the kubelet's logic for handling deferred
resizes that we should fix.

Labels:
- `retry_reason` - whether the resize was accepted through the timed retry or explicitly signaled. Possible values: `timed`, `signaled`.

This metric is recorded as a counter.

### Static CPU & Memory Policy

Resizing pods with static CPU & memory policy configured is out-of-scope for the beta release of
