Description
Observed Behavior:
We run Karpenter with a handful of nodepools: one general flex pool for a wide variety of high-churn workloads, and roughly 10 nodepools each dedicated to a particular workload. Each dedicated nodepool hosts exactly one workload (i.e., one ReplicaSet, <3 pods per node), while the flex pool runs a large mix of workloads (many ReplicaSets, >5 pods per node). Every node, regardless of nodepool, runs the same 10 daemonset pods. The total cluster size is ~80 nodes.
In our dedicated nodepool "nodepoolX", we provision either r7g.2xlarges (8 CPU, 64 GiB memory) or m7g.2xlarges (8 CPU, 32 GiB memory). Each node hosts either 1 or 2 pods, and each pod requests 3 CPU and 25 GB of memory. NodepoolX hosts ~8 pods.
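For concreteness (using the shapes above and ignoring daemonset overhead): an r7g.2xlarge (8 CPU, 64 GiB) fits two of these pods (2 × 3 = 6 CPU, 2 × 25 = 50 GB of memory), while an m7g.2xlarge (8 CPU, 32 GiB) can only fit one, since a second pod would push the memory request to ~50 GB against 32 GiB of capacity.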
When the nodepool has a handful of pending pods at once, Karpenter correctly provisions r7g.2xlarges (nice!) and fits two pods onto each. However, if the nodepool gets single pending pods spread out over time (as a result of deploys), Karpenter will provision one m7g.2xlarge and schedule one pod onto it. One m7g.2xlarge is cheaper than one r7g.2xlarge, but it ends up underutilized on CPU.
With multinode consolidation, I would expect that once the ReplicaSet hits steady state, Karpenter consolidates two m7g.2xlarges into one new r7g.2xlarge. As established above, the pods do fit, and Karpenter knows they fit (i.e., Karpenter's VM scheduling overhead isn't a factor here). However, what we actually see is that this consolidation doesn't happen (at least not within an hour; I gave up waiting after that).
I think what's happening here is that there are a couple of issues with the way multinode consolidation's code is written:
- Multinode consolidation only considers the first 100 nodes across all nodepools, i.e., if nodepoolX's nodes sort outside of [0, 100), they will never be candidates for multinode consolidation. In our case this isn't a factor because the node count is only ~80, but it seems like a real issue in large clusters (e.g., our production cluster with 800-1500 nodes), especially if the first 100 nodes host a largely stable/static workload -> https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/multinodeconsolidation.go#L87
- Multinode consolidation binary searches for and returns the smallest range [n, m] such that a consolidation is available within [n, m]. This means there may be multinode consolidations within [0, 100) that never get considered because they fall outside of [n, m]. Because the nodes in nodepoolX run fewer pods than the nodes in our flex pool, those nodes sort to the front of [0, 100), while the binary search biases against the front of the list (i.e., the search starts with [0, 100), finds a consolidation, updates min, searches [50, 100), finds a consolidation, and from that point forward never really considers [0, 50) again; see the sketch after this list). https://github.com/kubernetes-sigs/karpenter/blob/main/pkg/controllers/disruption/multinodeconsolidation.go#L163
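To illustrate the starvation pattern I'm describing, here is a toy Go sketch. This is not the actual Karpenter code; the `candidate` type, `candidateLimit`, and the `consolidatable` stub are hypothetical stand-ins for the real candidate list, the 100-node cap, and the scheduling simulation. It only shows how a cost-sorted list truncated to a fixed window, combined with a search that keeps narrowing toward the back of the window, stops revisiting the front:

```go
package main

import "fmt"

type candidate struct {
	name           string
	disruptionCost float64
}

const candidateLimit = 100 // only the first N sorted candidates are ever evaluated

// consolidatable stands in for the expensive scheduling simulation; here it
// simply pretends a valid multi-node consolidation exists in any range with
// at least two nodes.
func consolidatable(c []candidate) bool { return len(c) >= 2 }

func main() {
	// ~80 candidates, already sorted ascending by disruption cost, so the
	// lightly loaded nodepoolX nodes land at the front of the list.
	var candidates []candidate
	for i := 0; i < 80; i++ {
		candidates = append(candidates, candidate{name: fmt.Sprintf("node-%d", i), disruptionCost: float64(i)})
	}

	// Truncate to the fixed evaluation window.
	window := candidates
	if len(window) > candidateLimit {
		window = window[:candidateLimit]
	}

	// The search keeps narrowing toward the back of the window whenever a
	// consolidation is found; once it moves past the front, the nodepoolX
	// candidates near index 0 are no longer part of any attempt.
	lo, hi := 0, len(window)
	for lo < hi-1 {
		mid := (lo + hi) / 2
		if consolidatable(window[mid:hi]) {
			lo = mid
		} else {
			hi = mid
		}
	}
	fmt.Printf("final range considered: [%d, %d) of %d candidates\n", lo, hi, len(candidates))
}
```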
To test this theory, I edited each of the other nodepools to disable consolidation entirely. After doing so, Karpenter immediately consolidated two m7g.2xlarges into one r7g.2xlarge (whereas it had been stalled, not consolidating, for >1 hour). This at least demonstrates to me that the nodes provisioned by nodepoolX are never part of the solution returned by multinode consolidation, because its nodes sit in [0, 50) due to the "sort by disruption cost".
Expected Behavior:
Multinode consolidation should fairly consider all nodes, not just the first 100. It should also not bias against the first [0, 50) of the array, as that leads to starvation of consolidation opportunities.
One potential solution might be to shuffle the list of candidates instead of sorting them, and to expand the window so nodes outside of the first 100 are also considered; a rough sketch is below.
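A minimal sketch of that idea, assuming a hypothetical `candidate` type and `selectWindow` helper (this is not Karpenter's actual API): shuffle the candidate list before truncating to the window, so that over repeated disruption passes every node eventually gets evaluated.

```go
package main

import (
	"fmt"
	"math/rand"
)

type candidate struct {
	name string
}

const candidateLimit = 100

// selectWindow shuffles the candidates before truncating to the evaluation
// window, so every node has an equal chance of being considered on each
// disruption pass instead of the same cost-sorted prefix winning every time.
func selectWindow(candidates []candidate) []candidate {
	shuffled := make([]candidate, len(candidates))
	copy(shuffled, candidates)
	rand.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	if len(shuffled) > candidateLimit {
		shuffled = shuffled[:candidateLimit]
	}
	return shuffled
}

func main() {
	var candidates []candidate
	for i := 0; i < 800; i++ {
		candidates = append(candidates, candidate{name: fmt.Sprintf("node-%d", i)})
	}
	window := selectWindow(candidates)
	fmt.Printf("evaluating %d of %d candidates this pass\n", len(window), len(candidates))
}
```

The tradeoff is that shuffling gives up the heuristic of preferring cheaper-to-disrupt nodes first, but it makes the window selection fair in expectation across passes.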
Reproduction Steps (Please include YAML):
Versions:
- Karpenter Version: 1.4.0
- Kubernetes Version (`kubectl version`): 1.30
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment