
Add proposal for CSI node awareness #8083


Open
wants to merge 1 commit into base: master

Conversation

gnufied
Member

@gnufied gnufied commented May 1, 2025

This proposal adds CSI node awareness to cluster-autoscaler, solving a long-standing issue of scheduling too many pods on a node that doesn't have enough slots for volumes.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 1, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gnufied
Once this PR has been reviewed and has the lgtm label, please assign x13n for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 1, 2025
@sunnylovestiramisu

/cc

@elmiko
Contributor

elmiko commented May 2, 2025

thanks for proposing this @gnufied, this is definitely a problem that needs to be solved. from discussions i've had, i don't think we want to go down the same path as we did for the gpu label. the alternative suggestion of using a taint is a little better, but i would love to see something on the node (for example, a condition) that advertises the metadata in a more generic way.

at the least, a taint on the nodes could be detected by the autoscaler, even if the taint doesn't prevent workloads from landing there.

@gnufied
Member Author

gnufied commented May 2, 2025

at the least, a taint on the nodes could be detected by the autoscaler, even if the taint doesn't prevent workloads from landing there.

The problem with adopting a taint as a general solution is that it is basically not backward compatible. Every workload that doesn't care about this would have to be updated.

The other thing is that, afaik, the taints being added by some CPs are applied through proprietary components, and the code that taints the node is not part of the CCM.

@towca
Collaborator

towca commented May 5, 2025

As mentioned during the SIG meeting, here are some links about how DRA support was added to CA. Unfortunately there aren't really any relevant docs explaining the approach, because the solution evolved quite heavily during development. PR and individual commit descriptions are pretty comprehensive though, I'd suggest starting there.

These are CA refactors that were necessary for DRA support but aren't DRA-specific themselves:

This was the scheduler framework change widening the integration surface with CA to capture the new DRA objects:

These are the DRA-specific parts on top of the refactors in CA:

If my intuition is right you should be able to utilize the refactors mostly as-is for volumes, by following the DRA example above.

Note that the integration isn't production-ready yet, there are a number of issues to address first: https://github.com/kubernetes/autoscaler/issues?q=%22CA%20DRA%22%20is%3Aissue%20. In particular:

## Using CSINode allocatable limits for scaling nodes that are required for pods that use volumes

## Introduction
Currently, cluster-autoscaler doesn't take into account the volume-attach limit that a node may have when scaling nodes to support unschedulable pods.
Contributor

Hm, I thought I reported this in #4517 and fixed it in #4539. 🤔

Am I missing something?

Member Author

@gnufied gnufied May 7, 2025

What you are talking about is a feature of the scheduler, and yes, the scheduler is aware of CSI volume limits.

But cluster-autoscaler itself doesn't take into account how many nodes it should spin up to satisfy pending pods that need X volumes. For example, say nodegroup X has 0 nodes and, assuming all other nodegroups are at max capacity, the autoscaler must scale nodegroup X to schedule the pending pods. Now say there are 5 pending pods, each using 20 volumes, so 100 volumes in total. Cluster-autoscaler currently can't figure out that, on AWS for example, each of those nodes can support 20 volumes and hence it must spin up 5 new nodes. What will happen instead is that cluster-autoscaler will spin up exactly 1 node and, assuming CPU/memory requirements are satisfied, all 5 of those pods will get scheduled to the same node; all but 1 pod will be stuck in ContainerCreating forever until someone manually intervenes.

Part of the reason is that the scheduler allows scheduling pods to a node that doesn't have the CSI driver installed. So an upcoming node N will have unlimited capacity until the CSI driver is installed on the node and it starts to report a capacity. But while the CSI driver is getting installed on the node, cluster-autoscaler will send all pending pods to the same node, assuming CPU/memory and other constraints are met.
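To make the arithmetic concrete, here is a minimal Go sketch of the calculation the autoscaler is currently missing; the helper name and inputs are illustrative only, not existing cluster-autoscaler code:

package sketch

// nodesNeededForVolumes is a hypothetical helper: how many new nodes are
// required when pendingPods pods each need volumesPerPod attachments and
// every node in the node group can attach at most perNodeLimit volumes.
func nodesNeededForVolumes(pendingPods, volumesPerPod, perNodeLimit int) int {
	if volumesPerPod <= 0 {
		// Pods without volumes place no demand on attach slots.
		return 0
	}
	podsPerNode := perNodeLimit / volumesPerPod
	if podsPerNode == 0 {
		// A single pod already exceeds the per-node limit; scaling cannot help.
		return 0
	}
	// Round up: 5 pods x 20 volumes each with a 20-volume limit => 5 nodes.
	return (pendingPods + podsPerNode - 1) / podsPerNode
}

In the example above, today's simulation effectively answers "1 node", because the upcoming node reports no CSINode limit yet.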


How will this proposal work with the existing CSI volume limits? Can you help to clarify the flow?


### Scaling a node-group that already has one or more nodes.

1. We propose adding a label, similar to GPULabel, to nodes that are supposed to come up with a CSI driver. This would ensure that nodes which are supposed to have a certain CSI driver installed aren't considered ready (see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L979) until the CSI driver is installed there.
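For illustration only, a rough Go sketch of such a readiness check; the label key, its value convention, and the helper name are assumptions and not part of this proposal's code:

package sketch

import (
	apiv1 "k8s.io/api/core/v1"
	storagev1 "k8s.io/api/storage/v1"
)

// csiDriverLabel is a hypothetical label whose value names the CSI driver the
// node is expected to run, analogous to how the GPU label is used today.
const csiDriverLabel = "cluster-autoscaler.kubernetes.io/expects-csi-driver"

// nodeCSIReady treats a labelled node as not ready for scale-up simulation
// until the expected driver appears in its CSINode object.
func nodeCSIReady(node *apiv1.Node, csiNode *storagev1.CSINode) bool {
	expectedDriver, ok := node.Labels[csiDriverLabel]
	if !ok {
		return true // node does not expect a CSI driver
	}
	if csiNode == nil {
		return false // driver not registered yet
	}
	for _, d := range csiNode.Spec.Drivers {
		if d.Name == expectedDriver {
			return true
		}
	}
	return false
}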


What happens if there are multiple drivers installed?

}

type DriverAllocatable struct {
	Allocatable int32
}
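As a point of reference (not part of the diff), this struct could be seeded from the existing CSINode API object, which already exposes spec.drivers[].allocatable.count; the helper below is only a sketch:

package sketch

import storagev1 "k8s.io/api/storage/v1"

// driverAllocatablesFromCSINode fills the proposed per-driver map (using the
// DriverAllocatable type from the diff above) from a real CSINode object.
func driverAllocatablesFromCSINode(csiNode *storagev1.CSINode) map[string]*DriverAllocatable {
	out := map[string]*DriverAllocatable{}
	for _, d := range csiNode.Spec.Drivers {
		if d.Allocatable != nil && d.Allocatable.Count != nil {
			out[d.Name] = &DriverAllocatable{Allocatable: *d.Allocatable.Count}
		}
	}
	return out
}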
@sunnylovestiramisu sunnylovestiramisu May 7, 2025

Is this extensible to multiple different attach limits in the future? Like what was discussed in the K8s Advanced Volume Attach Limits Strawman?

scaling from 1, because in the former case no CSI volume limit information will be available
If no node exists in a NodeGroup.

5. We propose that, when deciding which pods should be considered for scaling nodes in the podListProcessor.Process function, we update the hinting_simulator to consider the CSI volume limits of existing nodes. This will allow cluster-autoscaler to know exactly whether all unschedulable pods will fit on recently spun-up or currently running nodes.
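A simplified sketch of the bookkeeping this implies for the simulator; the types and method below are hypothetical and only mirror the idea of subtracting a pod's volume demand from a node's remaining CSI attach slots:

package sketch

// simulatedNode tracks the remaining attach slots per CSI driver for one node
// in the simulation, seeded from its CSINode allocatable counts.
type simulatedNode struct {
	remainingSlots map[string]int32
}

// tryPlaceVolumes returns true and reserves the slots if the pod's per-driver
// volume demand still fits on the node; otherwise it leaves the node untouched.
func (n *simulatedNode) tryPlaceVolumes(volumesByDriver map[string]int32) bool {
	for driver, needed := range volumesByDriver {
		if n.remainingSlots[driver] < needed {
			return false
		}
	}
	for driver, needed := range volumesByDriver {
		n.remainingSlots[driver] -= needed
	}
	return true
}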


How to pick the existing nodes if they have very different attach limits?

type NodeInfo struct {
	...
	csiDrivers map[string]*DriverAllocatable
}


Are these mutable?


+1, the Allocatable count can change for some storage providers (e.g. some EC2 instance types share network interfaces and EBS volume attachment limits). Therefore, if an extra ENI is added later, new stateful workloads may get stuck.

Storage providers might have to deploy a controller that watches for additional ENIs and edits this allocatable, or the result of NodeGetInfo may need to get percolated up to this DriverAllocatable struct due to kubernetes/enhancements#4876.
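One way to handle that mutability (sketch only; the callback and wiring are assumptions, not proposed code) is to watch CSINode updates with a standard informer and refresh the cached per-driver limits whenever the allocatable counts change:

package sketch

import (
	"time"

	storagev1 "k8s.io/api/storage/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchCSINodeAllocatable refreshes cached per-driver limits on CSINode
// updates (e.g. after an extra ENI is attached). onChange is a hypothetical
// callback into the autoscaler's snapshot/NodeInfo cache.
func watchCSINodeAllocatable(client kubernetes.Interface, stopCh <-chan struct{}, onChange func(*storagev1.CSINode)) {
	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	informer := factory.Storage().V1().CSINodes().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			if csiNode, ok := newObj.(*storagev1.CSINode); ok {
				onChange(csiNode)
			}
		},
	})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}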


### Scaling a node-group that already has one or more nodes.

1. We propose adding a label, similar to GPULabel, to nodes that are supposed to come up with a CSI driver. This would ensure that nodes which are supposed to have a certain CSI driver installed aren't considered ready (see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L979) until the CSI driver is installed there.


In cases where CSI drivers are only installed on a subset of nodes, how does the autoscaler get this info? (e.g. the ebs-csi-node DaemonSet ignores instance-type x, or is disabled on Windows nodes)

Does the autoscaler have to parse the CSI node DaemonSet's node affinity? Does the cluster admin have to configure whether nodes from node-group/node-pool X should have a CSI driver?

Otherwise the CSI driver will never be installed and the node will be marked as NotReady. (I guess if no driver is ever scheduled onto the node, then we can ignore this requirement?)

Labels
area/cluster-autoscaler, cncf-cla: yes, size/M