Skip to content

Conversation

@vicentefb
Copy link
Contributor

@vicentefb vicentefb commented Oct 8, 2025

towards: #59

This PR introduces disruption control for sandboxes created via sandboxclaim controller. It extends the SandboxTemplate API with a new boolean flag, enableDisruptionControl, which protects the resulting sandbox pod from voluntary disruptions (e.g., node upgrades, cluster autoscaling).

This is achieved by updating the sandboxclaim-controller to:

  1. Inject a "safe-to-evict" annotation and a PDB selector label into the Sandbox resource.
  2. Implement a finalizer to manage the lifecycle of a shared PodDisruptionBudget (PDB) in each namespace.

API Update (sandboxtemplate_types.go):
-Added a new optional field enableDisruptionControl: bool to the SandboxTemplateSpec.

Controller Logic (sandboxclaim_controller.go):
-PDB Finalizer: The controller now adds a pdb-cleanup finalizer to any SandboxClaim that uses a template with enableDisruptionControl: true.
-Resource Injection: When creating a Sandbox resource, the controller now injects two fields into its podTemplate:

  1. The annotation cluster-autoscaler.kubernetes.io/safe-to-evict: "false".
  2. The label sandbox-disruption-policy: HighlyAvailable (to be used by the shared PDB).

-PDB Reconciliation: The controller ensures a single, shared PodDisruptionBudget (named sandbox-highly-available) exists in any namespace that has at least one claim with enableDisruptionControl: true.
-Cleanup Logic: The reconcilePDBDeletion function ensures that the shared PDB is only deleted when the last participating SandboxClaim in that namespace is deleted.

Testing:
Added a new test suite, TestSandboxClaimPDBLifecycle, to sandboxclaim_controller_test.go to validate:
-The PDB is correctly created when enableDisruptionControl is true.
-The finalizer is successfully added to the SandboxClaim.
-The Sandbox resource is created with the correct PDB label and safe-to-evict annotation.
-The PDB is not deleted if other claims still require it.

The PDB is correctly deleted when the final participating claim is removed.
image

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 8, 2025
@netlify
Copy link

netlify bot commented Oct 8, 2025

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 9decdeb
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/68fc053663b5f800084c73f9

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vicentefb
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 8, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @vicentefb. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 8, 2025
@vicentefb vicentefb force-pushed the AddPDB branch 2 times, most recently from ae1d6ff to 8891f87 Compare October 8, 2025 17:56
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 8, 2025
@barney-s
Copy link
Contributor

barney-s commented Oct 9, 2025

I am inclined to keep this out of the core sandbox CRD.
Similar to NW policies.

Thoughts @justinsb @janetkuo

Related:

@barney-s
Copy link
Contributor

barney-s commented Oct 9, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 9, 2025
@vicentefb vicentefb force-pushed the AddPDB branch 3 times, most recently from 4016287 to 588db3d Compare October 9, 2025 23:16
@vicentefb
Copy link
Contributor Author

I've refactored PDB reconciler into its own controller file. PTAL @barney-s @justinsb @janetkuo

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 11, 2025
@gongmax
Copy link

gongmax commented Oct 13, 2025

I am inclined to keep this out of the core sandbox CRD. Similar to NW policies.

Thoughts @justinsb @janetkuo

Related:

Hi @barney-s , were you suggesting that we should not have this at all, and leave it to users to create and manage the PDB for their Sandbox pod?

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 13, 2025
@barney-s
Copy link
Contributor

Hi @barney-s , were you suggesting that we should not have this at all, and leave it to users to create and manage the PDB for their Sandbox pod?

Yes. Not part of the sandbox spec.

@gongmax
Copy link

gongmax commented Oct 14, 2025

Hi @barney-s , were you suggesting that we should not have this at all, and leave it to users to create and manage the PDB for their Sandbox pod?

Yes. Not part of the sandbox spec.

Sounds good, and we can provide some documentation/guidance around this so users can follow

@vicentefb
Copy link
Contributor Author

friendly ping

@vicentefb vicentefb requested a review from janetkuo October 17, 2025 00:01
Copy link
Member

@janetkuo janetkuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After spending some more time on this, I don't think using a PDB for replica=1 resources is ideal. Given that each Sandbox has only one replica, applying a PDB with minAvailable: 1 creates an usually undesirable, operational outcome that it will block all voluntary disruptions (like node drains) and effectively prevent cluster maintenance.

@barney-s
Copy link
Contributor

After spending some more time on this, I don't think using a PDB for replica=1 resources is ideal. Given that each Sandbox has only one replica, applying a PDB with minAvailable: 1 creates an usually undesirable, operational outcome that it will block all voluntary disruptions (like node drains) and effectively prevent cluster maintenance.

+1 to this. Also commented to this effect on the Issue #59
And there is no guarantee the PDB will provide a guaranteed runtimes. K8s providers may choose to ignore PDB's after a set time.

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 24, 2025
feat: added PDB to Sandbox spec

updated rbac generated file

nit

nit

refacted pdb into its own controller

nit sandbox controller test

nit

moved PDB implementation to extensions

fixed lint
@vicentefb vicentefb changed the title feat: Added Disruption control for Sandbox feat: Added Disruption control for SandboxTemplate in extensions Oct 24, 2025
@vicentefb vicentefb requested a review from janetkuo October 24, 2025 23:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants