Description
Context
We run Karpenter in production for an O(15) node EKS cluster, using four (mutually exclusive) NodePools for different classes of application.
Our primary NodePool is workload, which provisions capacity for the majority of our pods.
Observed Behavior:
Approximately daily, we experience a period of high volatility among our (Karpenter-managed) workload nodes, caused by consolidation disruptions (reason: Underutilised).
This usually means that a large proportion of workload nodes get disrupted and replaced in a short period of time.
We usually see the newly-created nodes run for about 5-10 minutes, before they too are disrupted as Underutilised.
This disruption period usually occurs for 2-3 generations of replacement nodes, before stopping abruptly. The resulting nodes then typically run without disruption for many hours.
Notably, these events typically occur outside of office hours where changes to the running pods are very unlikely (e.g. rolling upgrades) and traffic is usually very low.
The resulting node topology is usually comparable to the starting topology, if not more complex, which doesn't suggest there was any significant resource underutilisation. However, in the worst cases the pods hosted on these nodes may have been restarted up to four times in rapid succession, which is not desirable.
For example:
On 27th September at 22:45 (local time) we had 14 running workload nodeclaims:
- 7x m6a.large (or equivalent)
- 4x m7i-flex.xlarge (or equivalent)
- 3x m7i-flex.2xlarge
Between 22:45 and 23:15 (local time), 7 of these nodeclaims were disrupted and replaced with successive generations of m7i-flex.large nodeclaims (or equivalent) - a total of 15 "Underutilised" disruptions.
At the end of this process we were running 17 workload nodeclaims:
- 11x m7i-flex.large (or equivalent)
- 3x m6a.xlarge (or equivalent)
- 3x m7i-flex.2xlarge
So the net effect was replacing one xlarge node with 4 large nodes, and shuffling the instance generations slightly.
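In vCPU terms (taking large/xlarge/2xlarge as 2/4/8 vCPUs respectively), that is roughly 7x2 + 4x4 + 3x8 = 54 vCPUs before the disruptions versus 11x2 + 3x4 + 3x8 = 58 vCPUs afterwards, i.e. the total provisioned capacity did not shrink.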
Pictorially (chart omitted): each green bar represents a nodeclaim, with time along the x axis.
Expected Behavior:
- Consolidation disruption due to underutilisation should occur as a single operation, so that pods hosted on these nodes only experience one restart.
- Nodeclaims created by "Underutilised" consolidation should not themselves be provisioned in an Underutilised state, necessitating further disruption.
Reproduction Steps (Please include YAML):
Our workload NodePool config is:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  annotations:
    meta.helm.sh/release-namespace: karpenter
  labels:
    app.kubernetes.io/managed-by: Helm
  name: workload
spec:
  disruption:
    budgets:
      - nodes: 50%
    consolidateAfter: 5m
    consolidationPolicy: WhenEmptyOrUnderutilized
  limits:
    cpu: '256'
    memory: 1Ti
  template:
    metadata:
      labels:
        app: karpenter
        environment: prod
        name: karpenter
    spec:
      expireAfter: 336h
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
            - r
            - m
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values:
            - '3'
        - key: karpenter.k8s.aws/instance-cpu
          operator: Lt
          values:
            - '17'
        - key: karpenter.k8s.aws/instance-cpu
          operator: Gt
          values:
            - '0'
        - key: karpenter.k8s.aws/instance-memory
          operator: Lt
          values:
            - '131073'
        - key: karpenter.k8s.aws/instance-memory
          operator: Gt
          values:
            - '2047'
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - eu-west-2a
            - eu-west-2b
            - eu-west-2c
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
      taints:
        - effect: NoSchedule
          key: karpenter.sh
The corresponding EC2NodeClass is:
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  annotations:
    meta.helm.sh/release-namespace: karpenter
  labels:
    app: karpenter
    app.kubernetes.io/managed-by: Helm
    name: karpenter
  name: default
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - name: amazon-eks-node-1.29-*
      owner: amazon
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: >-
          <masked>
        volumeSize: 100Gi
        volumeType: gp3
  instanceProfile: KarpenterNodeInstanceProfile-prod-eu-west-2-eks
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1
    httpTokens: required
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-2-eks
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: prod-eu-west-2-eks
We also have the spot-to-spot consolidation feature flag enabled.
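For completeness, this is the SpotToSpotConsolidation feature gate, which we enable through our Helm values roughly as follows (assuming the chart's standard settings key):

settings:
  featureGates:
    spotToSpotConsolidation: true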
Versions:
- Chart Version: 1.0.2
- Kubernetes Version (kubectl version): 1.29
However, we have observed this behaviour as far back as chart v0.36.2 using the v1beta1 CRDs.
We've also seen this on v1.28 and earlier versions of Kubernetes.
Additional Questions:
- Can you explain what threshold the Karpenter controller uses to determine Underutilisation?
- We see 50% of nodes being affected during the consolidation disruption window (matching our disruption budget), but why do we not see similar disruption before and after this ~30-minute window?
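For context, our reading of the v1 NodePool API is that disruption budgets can also be scoped per reason and per schedule. We have not tried this ourselves yet, but the kind of stanza we understand to be possible looks roughly like the following (values are illustrative only, not our current config):

disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
  consolidateAfter: 5m
  budgets:
    # Illustrative: additionally cap Underutilized consolidation to one node at a time
    - nodes: "1"
      reasons:
        - Underutilized
    # Existing 50% ceiling, applying to all disruption reasons
    - nodes: 50%

Even so, we would prefer to understand the underutilisation threshold itself rather than just rate-limiting these disruptions.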