diff --git a/cluster-autoscaler/proposals/granular-resource-limits.md b/cluster-autoscaler/proposals/granular-resource-limits.md
new file mode 100644
index 000000000000..dbeab463f4b7
--- /dev/null
+++ b/cluster-autoscaler/proposals/granular-resource-limits.md
@@ -0,0 +1,328 @@
# Granular Resource Limits in Node Autoscalers

## Objective

Node Autoscalers should allow setting more granular resource limits that would
apply to arbitrary subsets of nodes, beyond the existing limiting mechanisms.

## Background

Cluster Autoscaler supports cluster-wide limits on resources (like total CPU and
memory) and per-node-group node count limits. Karpenter supports setting
[resource limits on a NodePool](https://karpenter.sh/docs/concepts/nodepools/#speclimits).
Also, as mentioned in
[AWS docs](https://docs.aws.amazon.com/eks/latest/best-practices/karpenter.html),
cluster-wide limits are not supported in Karpenter. This is not flexible enough
for many use cases.

Users often need to configure more granular limits. For instance, a user might
want to limit the total resources consumed by nodes of a specific machine
family, nodes with a particular OS, or nodes with specialized hardware like
GPUs. The current resource limit implementations in both node autoscalers do not
support these scenarios.

This proposal introduces a new API to extend the Node Autoscalers’
functionality, allowing limits to be applied to arbitrary sets of nodes.

## Proposal: The AutoscalingResourceQuota API

We propose a new Kubernetes custom resource, AutoscalingResourceQuota, to define
resource limits on specific subsets of nodes. Node subsets are targeted using
standard Kubernetes label selectors, offering a flexible way to group nodes.

A node’s eligibility for a provisioning operation will be checked against all
AutoscalingResourceQuota objects that select it. The operation will only be
permitted if it does not violate any of the applicable limits. This should be
compatible with the existing limiting mechanisms, i.e. CAS’ cluster-wide limits
and Karpenter’s NodePool limits: if an operation does not violate any
AutoscalingResourceQuota but violates one of the existing mechanisms, it should
still be rejected.

### API Specification

An AutoscalingResourceQuota object would look as follows:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: example-resource-quota
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 64
      memory: 256Gi
```

* `selector`: A standard Kubernetes label selector that determines which nodes
  the limits apply to. This allows for fine-grained control based on any label
  present on the nodes, such as zone, region, OS, machine family, or custom
  user-defined labels.
* `limits`: Defines the limits on the summed-up resources of the selected nodes.

This approach is highly flexible – adding a new dimension for limits only
requires ensuring the nodes are labeled appropriately, with no code changes
needed in the autoscaler.
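
To make the enforcement semantics above concrete, the sketch below shows two
quotas that could coexist in a cluster; the object names and limit values are
only illustrative. A candidate e2 node is selected by both objects, so a
scale-up adding it would have to fit under both limits at once, in addition to
any existing cluster-wide or NodePool limits.

```yaml
# Illustrative sketch: both quotas select an e2 node, so a scale-up adding
# such a node must stay within both limits to be permitted.
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: all-nodes-quota
spec:
  # No selector: as in the cluster-wide example below, the quota applies to
  # every node in the cluster.
  limits:
    resources:
      cpu: 256
      memory: 1Ti
---
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: e2-family-quota
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 64
      memory: 256Gi
```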

### Node as a Resource

The AutoscalingResourceQuota API can be naturally extended to treat the number
of nodes itself as a limitable resource, as shown in one of the examples below.

### AutoscalingResourceQuota Status

For better observability, the AutoscalingResourceQuota resource could be
enhanced with a status field. This field, updated by a controller, would display
the current resource usage for the selected nodes, allowing users to quickly
check usage against the defined limits via kubectl describe. The controller can
run in a separate thread as a part of the node autoscaler component.

An example of the status field:

```yaml
status:
  usage:
    cpu: 32
    memory: 128Gi
    nodes: 50
```
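
Putting the spec and status together, a quota read back from the API server
might look roughly as follows; the usage numbers are illustrative and the exact
status schema would be settled during implementation.

```yaml
# Sketch of a complete object as a user might read it back (for example with
# `kubectl get -o yaml`); the status section is filled in by the controller
# described above, and the usage values shown here are made up.
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: example-resource-quota
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 64
      memory: 256Gi
status:
  usage:
    cpu: 32
    memory: 128Gi
```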

## Alternatives considered

### Minimum limits support

The initial design, besides the maximum limits, also included minimum limits.
Minimum limits were supposed to affect the node consolidation in the node
autoscalers. A consolidation would be allowed only if removing the node wouldn’t
violate any minimum limits. Cluster-wide minimum limits are implemented in CAS
together with the maximum limits, so at first, it seemed logical to include both
limit directions in the design.

Despite being conceptually similar, minimum and maximum limits cover completely
different use cases. Maximum limits can be used to control cloud provider costs,
to limit scaling of certain types of compute, or to control the distribution of
compute resources between teams working on the same cluster. Minimum limits’
main use case is ensuring a baseline capacity for users’ workloads, for example
to handle sudden spikes in traffic. However, minimum limits defined as a minimum
amount of resources in the cluster or a subset of nodes do not guarantee that
the workloads will be schedulable on those resources. For example, two nodes
with 2 CPUs each satisfy a minimum limit of 4 CPUs. If a user created a workload
requesting 2 CPUs, that workload would not fit on the existing nodes, making the
baseline capacity effectively useless. This scenario will be better handled by
the [CapacityBuffer API](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/buffers.md),
which allows the user to provide the exact shape of their workloads, including
the resource requests. In our example, the user would create a CapacityBuffer
with a pod template requesting 2 CPUs. Such a CapacityBuffer would ensure that a
pod with that shape is always schedulable on the existing nodes.

Therefore, we decided to remove minimum limits from the design of granular
limits, as CapacityBuffers are a better way to provide a baseline capacity for
user workloads.

### Kubernetes LimitRange and ResourceQuota

It has been discussed whether the same result could be accomplished by using the
standard Kubernetes resources:
[LimitRange](https://kubernetes.io/docs/concepts/policy/limit-range/) and
[ResourceQuota](https://kubernetes.io/docs/concepts/policy/resource-quotas/).

LimitRange is a resource used to configure minimum and maximum resource
constraints for a namespace. For example, it can define the default CPU and
memory requests for pods and containers within a namespace, or enforce a minimum
and maximum CPU request for a pod. However, its scope is limited to individual
objects: it does not look at all pods in the namespace together, but only checks
whether each pod’s requests and limits are within the defined bounds.

ResourceQuota allows defining and limiting the aggregate resource consumption
per namespace. This includes limiting the total CPU, memory, and storage that
all pods and persistent volume claims within a namespace can request or consume.
It also supports limiting the count of various Kubernetes objects, such as pods,
services, and replication controllers. While resource quotas can be used to
limit the resources provisioned by the CA to some degree, it’s not possible to
guarantee that CA won’t scale up above the defined limit. Since the quotas
operate on pod requests, and CA does not guarantee that bin packing will yield
the optimal result, setting the quota to e.g. 64 CPUs does not mean that CA will
stop scaling at 64 CPUs.

Moreover, both of those resources are namespaced, so their scope is limited to
the namespace in which they are defined, while nodes are cluster-scoped. We
can’t use namespaced resources to limit the creation and deletion of
cluster-scoped resources.

## User Stories

### Story 1

As a cluster administrator, I want to configure cluster-wide resource limits to
avoid excessive cloud provider costs.

**Note:** This is already supported in CAS, but not in Karpenter.

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: cluster-wide-limits
spec:
  limits:
    resources:
      cpu: 128
      memory: 256Gi
```

### Story 2

As a cluster administrator, I want to configure separate resource limits for
specific groups of nodes on top of cluster-wide limits, to avoid a situation
where one group of nodes starves others of resources.

**Note:** A specific group of nodes can be either a NodePool in Karpenter, a
ComputeClass in GKE, or simply a set of nodes grouped by a user-defined label.
This can be useful e.g. for organizations where multiple teams are running
workloads in a shared cluster, and these teams have separate sets of nodes. This
way, a cluster administrator can ensure that each team has a proper limit for
its resources and doesn’t starve other teams. This story is partly supported by
Karpenter’s NodePool limits.

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: team-a-limits
spec:
  selector:
    matchLabels:
      team: a
  limits:
    resources:
      cpu: 32
```

### Story 3

As a cluster administrator, I want to allow scaling up machines that are more
expensive or less suitable for my workloads when better machines are
unavailable, but I want to limit how many of them can be created, so that I can
control extra cloud provider costs, or limit the impact of using non-optimal
machines for my workloads.

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-e2-resources
spec:
  selector:
    matchLabels:
      example.cloud.com/machine-family: e2
  limits:
    resources:
      cpu: 32
      memory: 64Gi
```

### Story 4

As a cluster administrator, I want to limit the number of nodes in a specific
zone if my cluster is unbalanced for any reason, so that I can avoid exhausting
IP space in that zone, or enforce better balancing across zones.

**Note:** Originally requested in
[https://github.com/kubernetes/autoscaler/issues/6940](https://github.com/kubernetes/autoscaler/issues/6940).

Example AutoscalingResourceQuota:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-nodes-us-central1-b
spec:
  selector:
    matchLabels:
      topology.kubernetes.io/zone: us-central1-b
  limits:
    resources:
      nodes: 64
```

### Story 5 (obsolete)

As a cluster administrator, I want to ensure there is always a baseline capacity
in my cluster or specific parts of my cluster below which the node autoscaler
won’t consolidate the nodes, so that my workloads can quickly react to sudden
spikes in traffic.

This user story is obsolete. The CapacityBuffer API covers this use case in a
more flexible way.

## Other AutoscalingResourceQuota examples

The following examples illustrate the flexibility of the proposed API and
demonstrate other possible use cases not described in the user stories.

#### **Maximum Windows Nodes**

Limit the total number of nodes running the Windows operating system to 8.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-windows-nodes
spec:
  selector:
    matchLabels:
      kubernetes.io/os: windows
  limits:
    resources:
      nodes: 8
```

#### **Maximum NVIDIA T4 GPUs**

Limit the total number of NVIDIA T4 GPUs in the cluster to 16.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-t4-gpus
spec:
  selector:
    matchLabels:
      example.cloud.com/gpu-type: nvidia-t4
  limits:
    resources:
      nvidia.com/gpu: 16
```

#### **Cluster-wide Limits Excluding Control Plane Nodes**

Apply cluster-wide CPU and memory limits while excluding nodes with the
control-plane role.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: cluster-limits-no-control-plane
spec:
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
  limits:
    resources:
      cpu: 64
      memory: 128Gi
```
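
#### **Maximum Spot Node CPU in a Single Zone**

A selector can also combine several labels in one quota. This sketch assumes a
hypothetical `example.cloud.com/capacity-type` label (in the style of the
placeholder labels used above) and caps the total CPU of spot nodes in a single
zone at 48.

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1
kind: AutoscalingResourceQuota
metadata:
  name: max-spot-cpu-us-central1-a
spec:
  selector:
    matchLabels:
      # Hypothetical capacity-type label; topology.kubernetes.io/zone is the
      # well-known zone label also used in the examples above.
      example.cloud.com/capacity-type: spot
      topology.kubernetes.io/zone: us-central1-a
  limits:
    resources:
      cpu: 48
```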