Description
Context
We run Karpenter in production for an O(15) node EKS cluster, using four (mutually exclusive) NodePools for different classes of application.
Our primary NodePool is workload, which provisions capacity for the majority of our pods.
Observed Behavior:
Approximately daily, we experience a period of high volatility among our (Karpenter-managed) workload nodes, caused by consolidation disruptions (reason: Underutilised).
This usually means that a large proportion of workload nodes get disrupted and replaced in a short period of time.
We usually see the newly-created nodes run for about 5-10 minutes, before they too are disrupted as Underutilised.
This disruption period usually occurs for 2-3 generations of replacement nodes, before stopping abruptly. The resulting nodes then typically run without disruption for many hours.
Notably, these events typically occur outside of office hours where changes to the running pods are very unlikely (e.g. rolling upgrades) and traffic is usually very low.
The resulting node topology is usually comparable to the starting topology, if not more complex, which doesn't suggest there was any significant resource underutilisation. However, in the worst cases the pods hosted on these nodes may have been restarted up to four times in rapid succession, which is not desirable.
For example:
On 27th September at 22:45 (local time) we had 14 running workload nodeclaims:
- 7x m6a.large (or equivalent)
- 4x m7i-flex.xlarge (or equivalent)
- 3x m7i-flex.2xlarge
Between 22:45 and 23:15 (local time), 7 of these nodeclaims were disrupted and replaced with successive generations of m7i-flex.large nodeclaims (or equivalent) - a total of 15 "Underutilised" disruptions.
At the end of this process we were running 17 workload nodeclaims:
- 11x m7i-flex.large (or equivalent)
- 3x m6a.xlarge (or equivalent)
- 3x m7i-flex.2xlarge
So the net effect was replacing one xlarge node with 4 large nodes, and shuffling the instance generations slightly.
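In vCPU terms (taking large/xlarge/2xlarge as 2/4/8 vCPUs respectively), that is roughly 7x2 + 4x4 + 3x8 = 54 vCPUs before the disruptions versus 11x2 + 3x4 + 3x8 = 58 vCPUs afterwards, i.e. the total provisioned capacity did not shrink.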
Pictorially (chart omitted): each green bar represents a nodeclaim, with time along the x axis.
Expected Behavior:
- Consolidation disruption due to underutilisation should occur as a single operation, so that pods hosted on these nodes only experience one restart.
- Nodeclaims created by "Underutilised" consolidation should not themselves be provisioned in an Underutilised state, necessitating further disruption.
Reproduction Steps (Please include YAML):
Our workload NodePool config is:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    meta.helm.sh/release-namespace: karpenter
  labels:
    app.kubernetes.io/managed-by: Helm
  name: workload
spec:
  disruption:
    budgets:
      - nodes: 50%
    consolidateAfter: 5m
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: '256'
    memory: 1Ti
  template:
    metadata:
      labels:
        app: karpenter
        environment: prod
        name: karpenter
    spec:
      expireAfter: 336h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
            - r
            - m
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values:
            - '3'
        - key: karpenter.k8s.aws/instance-cpu
          operator: Lt
          values:
            - '17'
        - key: karpenter.k8s.aws/instance-cpu
          operator: Gt
          values:
            - '0'
        - key: karpenter.k8s.aws/instance-memory
          operator: Lt
          values:
            - '131073'
        - key: karpenter.k8s.aws/instance-memory
          operator: Gt
          values:
            - '2047'
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - eu-west-2a
            - eu-west-2b
            - eu-west-2c
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
      taints:
        - effect: NoSchedule
          key: karpenter.sh
The corresponding EC2NodeClass is:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  annotations:
    meta.helm.sh/release-namespace: karpenter
  labels:
    app: karpenter
    app.kubernetes.io/managed-by: Helm
    name: karpenter
  name: default
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - name: amazon-eks-node-1.29-*
      owner: amazon
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: >-
          <masked>
        volumeSize: 100Gi
        volumeType: gp3
  instanceProfile: KarpenterNodeInstanceProfile-prod-eu-west-2-eks
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1
    httpTokens: required
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-2-eks
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-2-eks
We also have the spot-to-spot consolidation feature flag enabled.
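For completeness, this is the SpotToSpotConsolidation feature gate, which we enable through our Helm values roughly as follows (assuming the chart's standard settings key):

settings:
  featureGates:
    spotToSpotConsolidation: true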
Versions:
- Chart Version: 1.0.2
- Kubernetes Version (kubectl version): 1.29
However, we have observed this behaviour as far back as chart v0.36.2 using the v1beta1 CRDs.
We've also seen this on v1.28 and earlier versions of Kubernetes.
Additional Questions:
- Can you explain what threshold the Karpenter controller uses to determine Underutilisation?
- We see 50% of nodes being affected during the consolidation disruption window (matching our disruption budget), but why do we not see similar disruption before and after this ~30-minute window?
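For context, our reading of the v1 NodePool API is that disruption budgets can also be scoped per reason and per schedule. We have not tried this ourselves yet, but the kind of stanza we understand to be possible looks roughly like the following (values are illustrative only, not our current config):

disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 5m
  budgets:
    # Illustrative: additionally cap Underutilized consolidation to one node at a time
    - nodes: "1"
      reasons:
        - Underutilized
    # Existing 50% ceiling, applying to all disruption reasons
    - nodes: 50%

Even so, we would prefer to understand the underutilisation threshold itself rather than just rate-limiting these disruptions.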