
Add proposal for CSI node awareness #8083


Open
wants to merge 1 commit into base: master

Conversation

gnufied
Member

@gnufied gnufied commented May 1, 2025

This proposal adds CSI node awareness to cluster-autoscaler, solving a long-standing issue of scheduling too many pods on a node that doesn't have enough slots for volumes.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 1, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gnufied
Once this PR has been reviewed and has the lgtm label, please assign x13n for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 1, 2025
@sunnylovestiramisu

/cc

@elmiko
Contributor

elmiko commented May 2, 2025

thanks for proposing this @gnufied, this is definitely a problem that needs to be solved. from discussions i've had, i don't think we want to go down the same path as we did for the gpu label. the alternative suggestion of using a taint is a little better, but i would love to see something on the node (for example, a condition) that advertises the metadata in a more generic way.

at the least, a taint on the nodes could be detected by the autoscaler, even if the taint doesn't prevent workloads from landing there.

@gnufied
Member Author

gnufied commented May 2, 2025

at the least, a taint on the nodes could be detected by the autoscaler, even if the taint doesn't prevent workloads from landing there.

The problem with adopting a taint as a general solution is that it is basically not backward compatible. Every workload that doesn't care about this would have to be updated.

The other thing is that, afaik, the taints being added by some CPs are applied through proprietary components, and the code that taints the node is not part of the CCM.

@towca
Collaborator

towca commented May 5, 2025

As mentioned during the SIG meeting, here are some links about how DRA support was added to CA. Unfortunately there aren't really any relevant docs explaining the approach, because the solution evolved quite heavily during development. PR and individual commit descriptions are pretty comprehensive though, I'd suggest starting there.

These are CA refactors that were necessary for DRA support but aren't DRA-specific themselves:

This was the scheduler framework change widening the integration surface with CA to capture the new DRA objects:

These are the DRA-specific parts on top of the refactors in CA:

If my intuition is right you should be able to utilize the refactors mostly as-is for volumes, by following the DRA example above.

Note that the integration isn't production-ready yet, there are a number of issues to address first: https://github.com/kubernetes/autoscaler/issues?q=%22CA%20DRA%22%20is%3Aissue%20. In particular:

## Using CSINode allocatable limits for scaling nodes that are required for pods that use volumes

## Introduction
Currently, cluster-autoscaler doesn't take into account the volume-attach limit that a node may have when scaling nodes to support unschedulable pods.
Contributor

Hm, I thought I reported this in #4517 and fixed it in #4539. 🤔

Am I missing something?

Member Author

@gnufied gnufied May 7, 2025

What you are talking about is a feature of the scheduler, and yes, the scheduler is aware of CSI volume limits.

But cluster-autoscaler itself doesn't take into account how many nodes it should spin up to satisfy pending pods that need X volumes. For example, say nodegroup X has 0 nodes and, assuming all other nodegroups are at max capacity, the autoscaler must scale nodegroup X to schedule the pending pods. Now say there are 5 pending pods, each using 20 volumes, so 100 volumes in total. Cluster-autoscaler currently can't figure out that, on AWS for example, each of those nodes can support 20 volumes and hence it must spin up 5 new nodes. What will happen instead is that cluster-autoscaler will spin up exactly 1 node and, assuming CPU/memory requirements are satisfied, all 5 of those pods will get scheduled to the same node; all but 1 pod will be stuck in ContainerCreating forever until someone manually intervenes.

Part of the reason is that the scheduler allows scheduling pods to a node that doesn't have the CSI driver installed. So an upcoming node N will have unlimited capacity until the CSI driver is installed on the node and it starts to report a capacity. But while the CSI driver is getting installed on the node, cluster-autoscaler will send all pending pods to the same node, assuming CPU/memory and other constraints are met.
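To make the arithmetic concrete, here is a minimal Go sketch of the calculation the autoscaler is currently missing; the helper name and inputs are illustrative only, not existing cluster-autoscaler code:

package sketch

// nodesNeededForVolumes is a hypothetical helper: how many new nodes are
// required when pendingPods pods each need volumesPerPod attachments and
// every node in the node group can attach at most perNodeLimit volumes.
func nodesNeededForVolumes(pendingPods, volumesPerPod, perNodeLimit int) int {
	if volumesPerPod <= 0 {
		// Pods without volumes place no demand on attach slots.
		return 0
	}
	podsPerNode := perNodeLimit / volumesPerPod
	if podsPerNode == 0 {
		// A single pod already exceeds the per-node limit; scaling cannot help.
		return 0
	}
	// Round up: 5 pods x 20 volumes each with a 20-volume limit => 5 nodes.
	return (pendingPods + podsPerNode - 1) / podsPerNode
}

In the example above, today's simulation effectively answers "1 node", because the upcoming node reports no CSINode limit yet.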


How will this proposal work with the existing CSI volume limits? Can you help to clarify the flow?


### Scaling a node-group that already has one or more nodes.

1. We propose adding a label, similar to GPULabel, to nodes that are supposed to come up with a CSI driver. This would ensure that nodes which are supposed to have a certain CSI driver installed aren't considered ready (see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L979) until the CSI driver is installed there.
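For illustration only, a rough Go sketch of such a readiness check; the label key, its value convention, and the helper name are assumptions and not part of this proposal's code:

package sketch

import (
	apiv1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
)

// csiDriverLabel is a hypothetical label whose value names the CSI driver the
// node is expected to run, analogous to how the GPU label is used today.
const csiDriverLabel = "cluster-autoscaler.kubernetes.io/expects-csi-driver"

// nodeCSIReady treats a labelled node as not ready for scale-up simulation
// until the expected driver appears in its CSINode object.
func nodeCSIReady(node *apiv1.Node, csiNode *storagev1.CSINode) bool {
	expectedDriver, ok := node.Labels[csiDriverLabel]
	if !ok {
		return true // node does not expect a CSI driver
	}
	if csiNode == nil {
		return false // driver not registered yet
	}
	for _, d := range csiNode.Spec.Drivers {
		if d.Name == expectedDriver {
			return true
		}
	}
	return false
}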


What happens if there are multiple drivers installed?

}

type DriverAllocatable struct {
	Allocatable int32
}
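As a point of reference (not part of the diff), this struct could be seeded from the existing CSINode API object, which already exposes spec.drivers[].allocatable.count; the helper below is only a sketch:

package sketch

import storagev1 "k8s.io/api/storage/v1"

// driverAllocatablesFromCSINode fills the proposed per-driver map (using the
// DriverAllocatable type from the diff above) from a real CSINode object.
func driverAllocatablesFromCSINode(csiNode *storagev1.CSINode) map[string]*DriverAllocatable {
	out := map[string]*DriverAllocatable{}
	for _, d := range csiNode.Spec.Drivers {
		if d.Allocatable != nil && d.Allocatable.Count != nil {
			out[d.Name] = &DriverAllocatable{Allocatable: *d.Allocatable.Count}
		}
	}
	return out
}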
@sunnylovestiramisu sunnylovestiramisu May 7, 2025

Is this extensible to multiple different attach limits in the future? Like what was discussed in the K8s Advanced Volume Attach Limits Strawman?

scaling from 1, because in the former case no CSI volume limit information will be available
If no node exists in a NodeGroup.

5. We propose that, when deciding which pods should be considered for scaling nodes in the podListProcessor.Process function, we update the hinting_simulator to consider the CSI volume limits of existing nodes. This will allow cluster-autoscaler to know exactly whether all unschedulable pods will fit on recently spun-up or currently running nodes.
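A simplified sketch of the bookkeeping this implies for the simulator; the types and method below are hypothetical and only mirror the idea of subtracting a pod's volume demand from a node's remaining CSI attach slots:

package sketch

// simulatedNode tracks the remaining attach slots per CSI driver for one node
// in the simulation, seeded from its CSINode allocatable counts.
type simulatedNode struct {
	remainingSlots map[string]int32
}

// tryPlaceVolumes returns true and reserves the slots if the pod's per-driver
// volume demand still fits on the node; otherwise it leaves the node untouched.
func (n *simulatedNode) tryPlaceVolumes(volumesByDriver map[string]int32) bool {
	for driver, needed := range volumesByDriver {
		if n.remainingSlots[driver] < needed {
			return false
		}
	}
	for driver, needed := range volumesByDriver {
		n.remainingSlots[driver] -= needed
	}
	return true
}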


How to pick the existing nodes if they have very different attach limits?

type NodeInfo struct {
	...
	csiDrivers map[string]*DriverAllocatable
}


Are these mutable?


+1, the Allocatable count can change for some storage providers (e.g. some EC2 instance types share network interfaces and EBS volume attachment limits). Therefore, if an extra ENI is added later, new stateful workloads may get stuck.

Storage providers might have to deploy a controller that watches for additional ENIs and edits this allocatable, or the result of NodeGetInfo may need to get percolated up to this DriverAllocatable struct due to kubernetes/enhancements#4876.
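One way to handle that mutability (sketch only; the callback and wiring are assumptions, not proposed code) is to watch CSINode updates with a standard informer and refresh the cached per-driver limits whenever the allocatable counts change:

package sketch

import (
	"time"

	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchCSINodeAllocatable refreshes cached per-driver limits on CSINode
// updates (e.g. after an extra ENI is attached). onChange is a hypothetical
// callback into the autoscaler's snapshot/NodeInfo cache.
func watchCSINodeAllocatable(client kubernetes.Interface, stopCh <-chan struct{}, onChange func(*storagev1.CSINode)) {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	informer := factory.Storage().V1().CSINodes().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			if csiNode, ok := newObj.(*storagev1.CSINode); ok {
				onChange(csiNode)
			}
		},
	})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}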


### Scaling a node-group that already has one or more nodes.

1. We propose adding a label, similar to GPULabel, to nodes that are supposed to come up with a CSI driver. This would ensure that nodes which are supposed to have a certain CSI driver installed aren't considered ready (see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L979) until the CSI driver is installed there.


In cases where CSI drivers are only installed on a subset of nodes, how does the autoscaler get this info? (e.g. the ebs-csi-node DaemonSet ignores instance-type x, or is disabled on Windows nodes)

Does the autoscaler have to parse the CSI node DaemonSet's node affinity? Does the cluster admin have to configure whether nodes from node-group/node-pool X should have a CSI driver?

Otherwise the CSI driver will never be installed and the node will be marked as NotReady. (I guess if no driver is ever scheduled onto the node, then we can ignore this requirement?)

Labels
area/cluster-autoscaler, cncf-cla: yes, size/M