Add proposal for CSI node awareness #8083
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gnufied
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Thanks for proposing this @gnufied, this is definitely a problem that needs to be solved. From discussions I've had, I don't think we want to go down the same path as we did for the GPU label. The alternative suggestion of using a taint is a little better, but I would love to see something on the node (for example, a condition) that advertises the metadata in a more generic way. At the least, a taint on the nodes could be detected by the autoscaler, even if the taint doesn't prevent workloads from landing there.
The problem with adopting a taint as a general solution is that it is basically not backward compatible: every workload that doesn't care about this must be updated. The other thing is, afaik, the taints being added by some cloud providers are applied through proprietary components, and the code that taints the node is not part of CCM.
As mentioned during the SIG meeting, here are some links about how DRA support was added to CA. Unfortunately there aren't really any relevant docs explaining the approach, because the solution evolved quite heavily during development. The PR and individual commit descriptions are pretty comprehensive though; I'd suggest starting there. These are CA refactors that were necessary for DRA support but aren't DRA-specific themselves:
This was the scheduler framework change widening the integration surface with CA to capture the new DRA objects: These are the DRA-specific parts on top of the refactors in CA: If my intuition is right, you should be able to utilize the refactors mostly as-is for volumes, by following the DRA example above. Note that the integration isn't production-ready yet; there are a number of issues to address first: https://github.com/kubernetes/autoscaler/issues?q=%22CA%20DRA%22%20is%3Aissue%20. In particular:
## Using CSINode allocatable limits for scaling nodes that are required for pods that use volumes

## Introduction
Currently, cluster-autoscaler doesn't take into account the volume-attach limit that a node may have when scaling nodes to support unschedulable pods.
What you are talking about is a feature of the scheduler, and yes, the scheduler is aware of CSI volume limits.
But cluster-autoscaler itself doesn't take into account how many nodes it should spin up to satisfy pending pods that need X number of volumes. For example: in a nodegroup X there are 0 nodes and, assuming all other nodegroups are at max capacity, the autoscaler must scale nodegroup X to schedule the pending pods. Now say there are 5 pending pods that each use 20 volumes, so 100 volumes in total. Cluster Autoscaler currently can't figure out that, on AWS for example, each of those nodes can support 20 volumes and hence it must spin up 5 new nodes. What will happen instead is that cluster-autoscaler will spin up exactly 1 node and, assuming CPU/memory requirements are satisfied, all 5 of those pods will get scheduled to the same node; except for 1 pod, the rest will be stuck in ContainerCreating forever until someone manually intervenes.
Part of the reason is that the scheduler allows scheduling of pods to a node that doesn't have the CSI driver installed. So an upcoming node N will have unlimited capacity until the CSI driver is installed on the node and it starts to report a capacity. But while the CSI driver is getting installed on the node, cluster-autoscaler will send all pending pods to the same node, assuming CPU/memory and other constraints are met.
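To make the gap concrete, here is a minimal Go sketch of the node-count math described above; the helper name `nodesNeededForVolumes` and the fixed per-node limit are illustrative assumptions, not existing cluster-autoscaler code.

```go
// nodesNeededForVolumes is a hypothetical helper showing the calculation the
// autoscaler currently skips: given pending pods that each need some number of
// CSI volume attachments, and a per-node attach limit reported by the CSI
// driver, how many new nodes are actually required?
func nodesNeededForVolumes(volumesPerPod []int, perNodeLimit int) int {
	total := 0
	for _, v := range volumesPerPod {
		total += v
	}
	if perNodeLimit <= 0 {
		return 1 // no limit known; roughly today's behaviour of provisioning a single node
	}
	// Round up: 5 pods x 20 volumes with a 20-volume limit needs 5 nodes,
	// not the single node the autoscaler provisions today.
	return (total + perNodeLimit - 1) / perNodeLimit
}
```

With the example above, `nodesNeededForVolumes([]int{20, 20, 20, 20, 20}, 20)` returns 5, which is the scale-up size the proposal wants the autoscaler to arrive at.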
How will this proposal work with the existing CSI volume limits? Can you help to clarify the flow?
### Scaling a node-group that already has one or more nodes.

1. We propose a label, similar to GPULabel, that is added to nodes that are supposed to come up with a CSI driver. This would ensure that nodes which are supposed to have a certain CSI driver installed aren't considered ready (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L979) until the CSI driver is installed there.
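For illustration, a minimal sketch of how that readiness gating could look, mirroring the existing GPU-label handling; the label key `csi.example.com/expected-driver` and the function name are hypothetical assumptions, while `CSINode.Spec.Drivers` is the real object the driver registers itself in.

```go
import (
	corev1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
)

// expectedCSIDriverLabel is a hypothetical label key; the real key would be
// defined by the proposal or the cloud provider, similar to how GPULabel works.
const expectedCSIDriverLabel = "csi.example.com/expected-driver"

// csiDriverReady reports whether the CSI driver named by the node's label has
// registered itself in the node's CSINode object. Until it has, the node would
// not be considered ready by the autoscaler, mirroring the GPU-label handling.
func csiDriverReady(node *corev1.Node, csiNode *storagev1.CSINode) bool {
	driverName, expected := node.Labels[expectedCSIDriverLabel]
	if !expected {
		return true // node isn't expected to run a CSI driver
	}
	if csiNode == nil {
		return false
	}
	for _, d := range csiNode.Spec.Drivers {
		if d.Name == driverName {
			return true
		}
	}
	return false
}
```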
What happens if there are multiple drivers installed?
}

type DriverAllocatable struct {
    Allocatable int32
}
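The excerpt above only shows the shape of the type; below is a sketch (assuming the `DriverAllocatable` struct from this diff) of how it might be populated from the CSINode API, where each registered driver already reports an optional `Allocatable.Count`.

```go
import storagev1 "k8s.io/api/storage/v1"

// driverAllocatableFromCSINode builds the per-driver allocatable map from a
// CSINode object. Drivers that don't report a limit are recorded with 0,
// which callers would treat as "no limit known".
func driverAllocatableFromCSINode(csiNode *storagev1.CSINode) map[string]*DriverAllocatable {
	out := make(map[string]*DriverAllocatable, len(csiNode.Spec.Drivers))
	for _, d := range csiNode.Spec.Drivers {
		da := &DriverAllocatable{}
		if d.Allocatable != nil && d.Allocatable.Count != nil {
			da.Allocatable = *d.Allocatable.Count
		}
		out[d.Name] = da
	}
	return out
}
```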
Is this extensible to multiple different attach limits in the future? Like what was discussed in the K8s Advanced Volume Attach Limits Strawman.
scaling from 1, because in the former case no CSI volume limit information will be available.

If no node exists in a NodeGroup.
5. We propose that, when deciding which pods should be considered for scaling nodes in the podListProcessor.Process function, we update the hinting_simulator to consider CSI volume limits of existing nodes. This will allow cluster-autoscaler to know exactly whether all unschedulable pods will fit on recently spun-up or currently running nodes.
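A rough sketch of the extra fit check the simulation could run; the helper name, the per-driver volume counts, and the `used` bookkeeping map are assumptions for illustration, while `DriverAllocatable` comes from the struct proposed earlier in this diff.

```go
// csiVolumesFit is a hypothetical check the simulator could run in addition to
// CPU/memory predicates: does the node have enough remaining CSI attach slots
// for the volumes this pod would add?
func csiVolumesFit(podVolumesByDriver map[string]int, used map[string]int, alloc map[string]*DriverAllocatable) bool {
	for driver, needed := range podVolumesByDriver {
		da, ok := alloc[driver]
		if !ok || da.Allocatable == 0 {
			continue // no limit known for this driver; falls back to today's behaviour
		}
		if int32(used[driver]+needed) > da.Allocatable {
			return false
		}
	}
	return true
}
```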
How to pick the existing nodes if they have very different attach limits?
type NodeInfo struct {
    ....
    ....
    csiDrivers map[string]*DriverAllocatable
}
Are these mutable?
+1. The Allocatable count can change for some storage providers (e.g. some EC2 instance types share network interfaces and EBS volume attachment limits, so if an extra ENI is added later, new stateful workloads may get stuck).
Storage providers might have to deploy a controller that watches for additional ENIs and edits this allocatable, or the result of NodeGetInfo may need to get percolated up to this DriverAllocatable struct due to kubernetes/enhancements#4876.
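One possible shape for keeping the allocatable current rather than treating it as immutable: a client-go informer sketch (the callback and function name are hypothetical) that re-reads `Allocatable.Count` whenever a CSINode object is updated, e.g. after an extra ENI changes the EBS attach limit.

```go
import (
	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchCSINodeAllocatable is a sketch of how the autoscaler (or a provider
// controller) could pick up allocatable changes instead of treating the limit
// as immutable. onChange is a hypothetical callback into the snapshot/NodeInfo
// layer; the caller is responsible for starting the informer factory.
func watchCSINodeAllocatable(client kubernetes.Interface, onChange func(nodeName, driver string, count int32)) cache.SharedIndexInformer {
	factory := informers.NewSharedInformerFactory(client, 0)
	informer := factory.Storage().V1().CSINodes().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			csiNode := newObj.(*storagev1.CSINode)
			for _, d := range csiNode.Spec.Drivers {
				if d.Allocatable != nil && d.Allocatable.Count != nil {
					onChange(csiNode.Name, d.Name, *d.Allocatable.Count)
				}
			}
		},
	})
	return informer
}
```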
### Scaling a node-group that already has one or more nodes.

1. We propose a label, similar to GPULabel, that is added to nodes that are supposed to come up with a CSI driver. This would ensure that nodes which are supposed to have a certain CSI driver installed aren't considered ready (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L979) until the CSI driver is installed there.
In cases where CSI drivers are only installed on a subset of nodes, how does the autoscaler get this info? (e.g. the ebs-csi-node DaemonSet ignores instance-type x, or is disabled on Windows nodes.)
Does the autoscaler have to parse the CSI node DaemonSet's node-affinity? Does the cluster admin have to configure whether nodes from X node-group/node-pool should have a CSI driver?
Otherwise the CSI driver will never be installed and the node will be marked as NotReady. (I guess if no driver is ever scheduled onto the node, then we can ignore this requirement?)
This proposal adds CSI node awareness to cluster-autoscaler, solving a long-standing issue of scheduling too many pods onto a node that doesn't have enough attachment slots for their volumes.