@elmiko
Contributor

@elmiko elmiko commented Sep 29, 2025

What type of PR is this?

/kind cleanup
/kind api-change

What this PR does / why we need it:

This patch series changes an argument to the NewCloudProvider function to use an AutoscalerOptions struct instead of AutoscalingOptions. This change allows cloud providers to have more control over the core functionality of the cluster autoscaler.

Specifically, this patch series also adds a method named RegisterScaleDownNodeProcessor to the AutoscalerOptions so that cloud providers can inject a custom scale down processor.

Lastly, this change adds a custom scale down processor to the clusterapi provider to help it avoid removing the wrong instance during scale down operations that occur during a cluster upgrade.
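
As a rough illustration of the idea described above, and not the code from this PR, the following sketch shows how a provider might register an extra scale down node processor that filters out nodes that are mid-upgrade. Every name in it (`Node`, `ScaleDownNodeProcessor`, `Options`, `upgradeFilter`, and `RegisterScaleDownNodeProcessor` written as a free function) is a hypothetical simplification; the real cluster-autoscaler interfaces take different arguments.

```go
package main

// Node is a hypothetical, simplified stand-in for a Kubernetes node.
type Node struct {
	Name      string
	Upgrading bool // assumption: the provider can tell that this node is being replaced
}

// ScaleDownNodeProcessor is a hypothetical interface: given the scale down
// candidates, it returns the subset that may still be removed.
type ScaleDownNodeProcessor interface {
	GetScaleDownCandidates(nodes []Node) []Node
}

// combinedProcessor runs each registered processor in order, feeding the
// output of one into the next.
type combinedProcessor struct {
	processors []ScaleDownNodeProcessor
}

func (c *combinedProcessor) GetScaleDownCandidates(nodes []Node) []Node {
	for _, p := range c.processors {
		nodes = p.GetScaleDownCandidates(nodes)
	}
	return nodes
}

// Options is a hypothetical stand-in for the full autoscaler options that a
// cloud provider builder receives.
type Options struct {
	ScaleDownNodeProcessor ScaleDownNodeProcessor
}

// RegisterScaleDownNodeProcessor adds p to the processor chain held in opts,
// wrapping any processor that is already set so that both are applied.
func RegisterScaleDownNodeProcessor(opts *Options, p ScaleDownNodeProcessor) {
	switch existing := opts.ScaleDownNodeProcessor.(type) {
	case nil:
		opts.ScaleDownNodeProcessor = &combinedProcessor{processors: []ScaleDownNodeProcessor{p}}
	case *combinedProcessor:
		existing.processors = append(existing.processors, p)
	default:
		opts.ScaleDownNodeProcessor = &combinedProcessor{processors: []ScaleDownNodeProcessor{existing, p}}
	}
}

// upgradeFilter is a hypothetical provider-side processor that rejects nodes
// that are undergoing an upgrade.
type upgradeFilter struct{}

func (upgradeFilter) GetScaleDownCandidates(nodes []Node) []Node {
	kept := make([]Node, 0, len(nodes))
	for _, n := range nodes {
		if !n.Upgrading {
			kept = append(kept, n)
		}
	}
	return kept
}
```

In this sketch a provider builder would call `RegisterScaleDownNodeProcessor(opts, upgradeFilter{})` once during construction; the core would then only ever see candidates that passed every registered filter.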

Which issue(s) this PR fixes:

Fixes #8494

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area area/cluster-autoscaler area/provider/alicloud Issues or PRs related to the AliCloud cloud provider implementation area/provider/aws Issues or PRs related to aws provider and removed do-not-merge/needs-area labels Sep 29, 2025
@k8s-ci-robot k8s-ci-robot added the area/provider/azure Issues or PRs related to azure provider label Sep 29, 2025
@k8s-ci-robot k8s-ci-robot added area/provider/cluster-api Issues or PRs related to Cluster API provider area/provider/coreweave area/provider/digitalocean Issues or PRs related to digitalocean provider area/provider/equinixmetal Issues or PRs related to the Equinix Metal cloud provider for Cluster Autoscaler size/L Denotes a PR that changes 100-499 lines, ignoring generated files. area/provider/externalgrpc Issues or PRs related to the External gRPC provider area/provider/gce area/provider/hetzner Issues or PRs related to Hetzner provider area/provider/huaweicloud area/provider/ionoscloud area/provider/kwok Issues or PRs related to the kwok cloud provider for Cluster Autoscaler area/provider/linode Issues or PRs related to linode provider area/provider/magnum Issues or PRs related to the Magnum cloud provider for Cluster Autoscaler area/provider/oci Issues or PRs related to oci provider area/provider/rancher area/provider/utho Issues or PRs related to Utho provider labels Sep 29, 2025
@elmiko
Contributor Author

elmiko commented Sep 29, 2025

this is an alternative solution to #8531

@elmiko
Contributor Author

elmiko commented Oct 21, 2025

refactored slightly to take suggestions:

  • changed register function to util style
  • moved cloud provider init to the end of default init
  • added TODO for future work

@elmiko elmiko force-pushed the provider-options-refactor branch 2 times, most recently from 651e7ab to ea046a0 Compare October 21, 2025 20:20
This change helps to prevent circular dependencies between the core and
builder packages as we start to pass the AutoscalerOptions to the cloud
provider builder functions.

This changes the options input to the cloud provider builder function so
that the full autoscaler options are passed. This is being proposed so
that cloud providers will have new options for injecting behavior into
the core parts of the autoscaler.

Adds a util function to help cloud providers add additional combined
scale down processors.

This change adds a custom scale down node processor for cluster api to
reject nodes that are undergoing upgrade.

This change moves the cloud provider initialization to the end of the
initializeDefaultOptions function to ensure that all other options are
prepared before the cloud provider. Because the cloud provider now
receives the full AutoscalerOptions struct, we need to ensure that all
the data is available.

This change removes the import from the gce module in favor of using the
string value directly.
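
The initialization-order change in the commit messages above can be pictured with a small sketch. It is only an illustration under stated assumptions: `InitOptions`, `buildProvider`, and `initializeDefaults` are hypothetical names, and the string fields are placeholders for the real processor and provider objects.

```go
package main

import "errors"

// InitOptions is a hypothetical stand-in for the full autoscaler options.
type InitOptions struct {
	Processors    []string // placeholder for processors prepared during init
	CloudProvider string   // placeholder for the constructed cloud provider
}

// buildProvider is a hypothetical builder that receives the full options and
// extends them, so it depends on the other fields already being populated.
func buildProvider(opts *InitOptions) (string, error) {
	if opts.Processors == nil {
		return "", errors.New("processors must be initialized before the cloud provider")
	}
	opts.Processors = append(opts.Processors, "provider-scale-down-filter")
	return "example-provider", nil
}

// initializeDefaults fills in any unset fields. The cloud provider is built
// last so that the builder sees fully prepared options.
func initializeDefaults(opts *InitOptions) error {
	if opts.Processors == nil {
		opts.Processors = []string{"default"}
	}
	if opts.CloudProvider == "" {
		provider, err := buildProvider(opts)
		if err != nil {
			return err
		}
		opts.CloudProvider = provider
	}
	return nil
}
```

If the provider were built before the processors in this sketch, `buildProvider` would return an error, which mirrors why the ordering matters once the builder can read and modify the whole options struct.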
@elmiko elmiko force-pushed the provider-options-refactor branch from ea046a0 to 51a0514 Compare October 21, 2025 22:34
@elmiko
Contributor Author

elmiko commented Oct 21, 2025

updated to revert the public scoping on the combined mixed node scale down processor.

Contributor

@jackfrancis jackfrancis left a comment

/lgtm

/assign @towca

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 21, 2025
@k8s-ci-robot
Contributor

@elmiko: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-cluster-autoscaler-e2e-azure-master | 51a0514 | link | false | /test pull-cluster-autoscaler-e2e-azure-master

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@elmiko
Contributor Author

elmiko commented Oct 22, 2025

@jackfrancis is the azure test failure something that i introduced?

@towca
Collaborator

towca commented Oct 22, 2025

Thanks for incorporating my feedback, LGTM! Leaving the hold so that you can confirm the e2e test before submitting, feel free to unhold.

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, jackfrancis, towca

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis
Contributor

E2E failures are not related to this change: #8681

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 22, 2025
@k8s-ci-robot k8s-ci-robot merged commit e68a5b6 into kubernetes:master Oct 22, 2025
7 of 8 checks passed
@sbueringer
Member

Thx @elmiko and everyone contributing to this!! Really appreciate it

@elmiko elmiko deleted the provider-options-refactor branch October 22, 2025 19:10
@jackfrancis
Contributor

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 22, 2025
@jackfrancis
Contributor

/cherry-pick cluster-autoscaler-release-1.34

@k8s-infra-cherrypick-robot

@jackfrancis: #8583 failed to apply on top of branch "cluster-autoscaler-release-1.34":

Applying: refactor core.AutoscalerOptions in a new package
Using index info to reconstruct a base tree...
M	cluster-autoscaler/core/autoscaler.go
Falling back to patching base and 3-way merge...
Auto-merging cluster-autoscaler/core/autoscaler.go
CONFLICT (content): Merge conflict in cluster-autoscaler/core/autoscaler.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 refactor core.AutoscalerOptions in a new package

In response to this:

/cherry-pick cluster-autoscaler-release-1.34

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels

  • approved Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/cluster-autoscaler
  • area/provider/alicloud Issues or PRs related to the AliCloud cloud provider implementation
  • area/provider/aws Issues or PRs related to aws provider
  • area/provider/azure Issues or PRs related to azure provider
  • area/provider/cluster-api Issues or PRs related to Cluster API provider
  • area/provider/coreweave
  • area/provider/digitalocean Issues or PRs related to digitalocean provider
  • area/provider/equinixmetal Issues or PRs related to the Equinix Metal cloud provider for Cluster Autoscaler
  • area/provider/externalgrpc Issues or PRs related to the External gRPC provider
  • area/provider/gce
  • area/provider/hetzner Issues or PRs related to Hetzner provider
  • area/provider/huaweicloud
  • area/provider/ionoscloud
  • area/provider/kwok Issues or PRs related to the kwok cloud provider for Cluster Autoscaler
  • area/provider/linode Issues or PRs related to linode provider
  • area/provider/magnum Issues or PRs related to the Magnum cloud provider for Cluster Autoscaler
  • area/provider/oci Issues or PRs related to oci provider
  • area/provider/rancher
  • area/provider/utho Issues or PRs related to Utho provider
  • cncf-cla: yes Indicates the PR's author has signed the CNCF CLA.
  • kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API
  • kind/bug Categorizes issue or PR as related to a bug.
  • kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt.
  • lgtm "Looks good to me", indicates that a PR is ready to be merged.
  • release-note-none Denotes a PR that doesn't merit a release note.
  • size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

CA ClusterAPI provider can delete wrong node when scale-down occurs during MachineDeployment upgrade

7 participants