@barney-s
Contributor

@barney-s barney-s commented Sep 14, 2025

  • Add TTL info in status

    • .status.firstReadyTime - The time the sandbox first became ready
    • .status.shutdownAt - The time the sandbox is scheduled for deletion
  • Add TTL config to Sandbox

    • .spec.ttl.seconds - How many seconds the sandbox lives
    • .spec.ttl.startPolicy - When the TTL countdown starts
      • onCreate - TTL starts from sandbox creation
      • onReady - TTL starts when the sandbox first becomes ready
      • onEnable - TTL starts when this policy is set and .status.shutdownAt is nil
      • never - TTL is disabled
  • Users can modify .spec.ttl.seconds and .spec.ttl.startPolicy

  • If .status.shutdownAt is computed to be in the past, the sandbox is
    deleted immediately

    closes Feature Request: TTL support for sandbox #18
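
For readers following the thread, here is a minimal Go sketch of how the proposed fields could combine into a shutdown time. It is not the code in this PR: the type names, `computeShutdownAt`, `expired`, and `mustParse` are hypothetical, and the `onEnable` branch (start the clock at the current time and keep an already computed deadline) is one possible reading of the description above.

```go
package main

import (
	"fmt"
	"time"
)

// StartPolicy mirrors the .spec.ttl.startPolicy values from the description.
type StartPolicy string

const (
	OnCreate StartPolicy = "onCreate"
	OnReady  StartPolicy = "onReady"
	OnEnable StartPolicy = "onEnable"
	Never    StartPolicy = "never"
)

// TTLSpec is a hypothetical Go shape for .spec.ttl.
type TTLSpec struct {
	Seconds     int64
	StartPolicy StartPolicy
}

// TTLStatus is a hypothetical Go shape for the TTL-related status fields.
type TTLStatus struct {
	FirstReadyTime *time.Time
	ShutdownAt     *time.Time
}

// computeShutdownAt returns the deletion time implied by the spec,
// or nil when no deadline can be computed yet.
func computeShutdownAt(spec TTLSpec, status TTLStatus, created, now time.Time) *time.Time {
	var start time.Time
	switch spec.StartPolicy {
	case OnCreate:
		start = created
	case OnReady:
		if status.FirstReadyTime == nil {
			return nil // not ready yet, so the clock has not started
		}
		start = *status.FirstReadyTime
	case OnEnable:
		if status.ShutdownAt != nil {
			return status.ShutdownAt // assumption: keep the existing deadline
		}
		start = now // assumption: the clock starts when the policy is first observed
	default: // Never, or an unknown policy
		return nil
	}
	t := start.Add(time.Duration(spec.Seconds) * time.Second)
	return &t
}

// expired reports whether the sandbox should be deleted immediately.
func expired(shutdownAt *time.Time, now time.Time) bool {
	return shutdownAt != nil && !shutdownAt.After(now)
}

// mustParse is a small helper for the example timestamps.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	created := mustParse("2025-09-14T10:00:00Z")
	now := mustParse("2025-09-14T10:02:00Z")
	at := computeShutdownAt(TTLSpec{Seconds: 60, StartPolicy: OnCreate}, TTLStatus{}, created, now)
	fmt.Println(at.UTC().Format(time.RFC3339), expired(at, now))
}
```

With `onCreate` and a 60 second TTL, a sandbox created two minutes ago already has a deadline in the past, which is the "immediately deleted" case in the last bullet.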

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: barney-s
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 14, 2025
@barney-s barney-s changed the title Sandbox ttl Add TTL support for sandbox Sep 14, 2025
Contributor

@vicentefb vicentefb left a comment

WDYT of a label-based approach? Instead of having spec.ttl.ttlFrom, we could have a generic label like sandbox.io/no-ttl-apply: true, which can be toggled by another controller. The label sandbox.io/no-ttl-apply would then represent the system's state.

I'm just wondering how easily someone could bypass the TTL policy (as long as they have create permissions) now that the control of the "startTimer" is at the spec level.

@barney-s
Contributor Author

> WDYT of a label-based approach? Instead of having spec.ttl.ttlFrom, we could have a generic label like sandbox.io/no-ttl-apply: true, which can be toggled by another controller. The label sandbox.io/no-ttl-apply would then represent the system's state.

I did consider an annotation instead of a spec field. My thought process was:

  1. In addition to disabling, it would be good to be able to select the start time [create|ready|some-timestamp].
  2. Having everything in one place makes it clear that there is only one setting, versus having to look at both the spec and an annotation.
  3. For multiple values, having them in the spec made sense.
  4. The some-timestamp option is necessary if we decide to re-enable after a disable. Instead of relying on a time that is too far in the past (create|ready), we can set it to the current time.
  5. Having an explicit timestamp is declarative (versus enable, where we would need to track in the status when it was toggled to enable).

Alternatively, I also considered having an absolute expiryTime/ttlExpiryAt/Cleanup time. If we have that, we don't need anything else.

> I'm just wondering how easily someone could bypass the TTL policy (as long as they have create permissions) now that the control of the "startTimer" is at the spec level.

Even with an annotation, they could still bypass the TTL. This is a larger concern around RBAC, beyond the scope of this feature.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 15, 2025
@vicentefb
Contributor

I see, the key difference is that we can use a Validating Webhook to block non-admin users from setting a label (like sandbox.io/no-ttl-apply). We can't cleanly block users from setting a specific value in the spec. This means the security flaw is in scope and is solvable with the label pattern.

On Features:
I agree we need flexibility. We can do both:

Put user-safe options in spec: We can add spec.ttlStartPolicy: [OnCreate, OnReady] for users

Keep the admin-only override in a label: The WarmPool uses sandbox.io/no-ttl-apply (protected by the webhook) to disable TTL. AFAIK, it may be very difficult for a user to know when they want the timer to start while a Sandbox is still in the pool, so I think this is a good use case for the label.

This separates user intent (spec) from system state (label) and system triggers (annotation). WDYT?

Also, are we thinking of any default value for ttlFrom, for example "infinite"?
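
To make the label-protection idea concrete, here is a minimal Go sketch of the check such a validating webhook could run. It is only an illustration: `validateLabelChange` and the `isAdmin` flag are hypothetical, and a real webhook would derive the old and new labels and the caller's identity from the AdmissionReview request rather than take them as plain arguments.

```go
package main

import "fmt"

// noTTLLabel is the label name suggested in the comment above.
const noTTLLabel = "sandbox.io/no-ttl-apply"

// validateLabelChange returns an error when a non-admin request adds,
// removes, or changes the protected label. oldLabels is nil on create.
func validateLabelChange(oldLabels, newLabels map[string]string, isAdmin bool) error {
	if isAdmin {
		return nil
	}
	oldVal, oldOK := oldLabels[noTTLLabel]
	newVal, newOK := newLabels[noTTLLabel]
	if oldOK != newOK || oldVal != newVal {
		return fmt.Errorf("label %q may only be changed by an admin", noTTLLabel)
	}
	return nil
}

func main() {
	// A non-admin user tries to create a Sandbox with the label already set.
	err := validateLabelChange(nil, map[string]string{noTTLLabel: "true"}, false)
	fmt.Println(err)
}
```

Changes to other labels pass through untouched, so the check only constrains the one key that represents system state.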

@barney-s
Contributor Author

> I see, the key difference is that we can use a Validating Webhook to block non-admin users from setting a label (like sandbox.io/no-ttl-apply). We can't cleanly block users from setting a specific value in the spec. This means the security flaw is in scope and is solvable with the label pattern.

Can policies not be applied to spec values?

> On Features: I agree we need flexibility. We can do both:
> Put user-safe options in spec: We can add spec.ttlStartPolicy: [OnCreate, OnReady] for users

I like ttlStartPolicy: [OnCreate|OnReady|Never|OnEnable]

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 18, 2025
@barney-s
Contributor Author

Updated the code to match the current state of proposal #18

Member

@janetkuo janetkuo left a comment

FYI @flpanbin proposed a feature for "shutdown time" (#23 (comment)). Which field suits our target users better? Or do we need both?

@barney-s
Contributor Author

> FYI @flpanbin proposed a feature for "shutdown time" (#23 (comment)). Which field suits our target users better? Or do we need both?

I would think both. The WarmPool requires a relative TTL.

#18 (comment)

@flpanbin
Contributor

> FYI @flpanbin proposed a feature for "shutdown time" (#23 (comment)). Which field suits our target users better? Or do we need both?

I think we need both. The TTL is the lifetime of the sandbox: after the TTL expires, the sandbox is deleted. The "shutdown time" is used to control when the sandbox is paused (pausing was introduced by #36).

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 23, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Member

@janetkuo janetkuo left a comment

> I think we need both. The TTL is the lifetime of the sandbox: after the TTL expires, the sandbox is deleted. The "shutdown time" is used to control when the sandbox is paused (pausing was introduced by #36).

I had a live discussion with @justinsb, @vicentefb, @tomergee about TTL and shutdownTime/endsAt today. We think that starting with shutdownTime/endsAt is better given that it's less complex. With TTL we need to think about the UX regarding when the clock starts and track the original timestamp (needed for updating TTL). We can change it to TTL or add TTL later if needed. @barney-s FYI

For pausing, let's continue the discussion in #36. @flpanbin IMO pausing and shutdown are different. After the shutdown time, the Sandbox is deleted without preserving state; if it is paused, its state is preserved.
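
As a rough illustration of why an absolute shutdownTime/endsAt is simpler than a TTL, here is a Go sketch of the only decision a reconciler would need to make. It is not the implementation from any PR: `nextAction` and `mustParse` are hypothetical names, and a nil time is assumed to mean "no shutdown scheduled".

```go
package main

import (
	"fmt"
	"time"
)

// nextAction decides what a reconciler would do for a sandbox with an
// absolute shutdown time: delete it now, or check again after a delay.
func nextAction(shutdownTime *time.Time, now time.Time) (deleteNow bool, requeueAfter time.Duration) {
	if shutdownTime == nil {
		return false, 0 // assumption: nil means no shutdown is scheduled
	}
	if !shutdownTime.After(now) {
		return true, 0 // the deadline has passed
	}
	return false, shutdownTime.Sub(now)
}

// mustParse is a small helper for the example timestamps.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	now := mustParse("2025-09-23T12:00:00Z")
	shutdown := mustParse("2025-09-23T12:30:00Z")
	deleteNow, wait := nextAction(&shutdown, now)
	fmt.Println(deleteNow, wait)
}
```

There is no start policy and no original timestamp to track: the deadline is the field itself, and updating it is a plain overwrite.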

@maqiuyujoyce
maqiuyujoyce commented Sep 23, 2025

Rethinking the ttlFrom values, I'm actually a bit concerned about the OnCreate option. It takes an unpredictable amount of time to spin up the underlying sandbox pods, so the sandbox could expire before it has spun up. I feel this is a risky and confusing option to offer users.
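
A small worked example of this concern, as a hedged Go sketch: `usableSeconds` is a hypothetical helper, and the 60 second TTL and 90 second startup time are made-up numbers chosen only to show the effect.

```go
package main

import "fmt"

// usableSeconds returns how long a sandbox remains usable after it becomes
// ready, given the TTL and how long the underlying pods took to start.
// countFromCreate selects OnCreate semantics; otherwise OnReady is assumed.
func usableSeconds(countFromCreate bool, ttlSeconds, startupSeconds int64) int64 {
	if !countFromCreate {
		return ttlSeconds // OnReady: the clock starts once the sandbox is ready
	}
	if startupSeconds >= ttlSeconds {
		return 0 // OnCreate: the TTL ran out before the sandbox was ready
	}
	return ttlSeconds - startupSeconds
}

func main() {
	// TTL of 60s with a 90s startup: OnCreate leaves nothing, OnReady leaves 60s.
	fmt.Println(usableSeconds(true, 60, 90), usableSeconds(false, 60, 90))
}
```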

@barney-s
Contributor Author

Created #51 with just shutdownAt.

@barney-s
Contributor Author

> I think we need both. The TTL is the lifetime of the sandbox: after the TTL expires, the sandbox is deleted. The "shutdown time" is used to control when the sandbox is paused (pausing was introduced by #36).
>
> I had a live discussion with @justinsb, @vicentefb, @tomergee about TTL and shutdownTime/endsAt today. We think that starting with shutdownTime/endsAt is better given that it's less complex. With TTL we need to think about the UX regarding when the clock starts and track the original timestamp (needed for updating TTL). We can change it to TTL or add TTL later if needed. @barney-s FYI
>
> For pausing, let's continue the discussion in #36. @flpanbin IMO pausing and shutdown are different. After the shutdown time, the Sandbox is deleted without preserving state; if it is paused, its state is preserved.

:( but nevertheless ...

Created #51 with just shutdownAt.

@k8s-ci-robot
Contributor

@barney-s: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
presubmit-agent-sandbox-unit-test ecc2bd0 link true /test presubmit-agent-sandbox-unit-test
presubmit-agent-sandbox-lint-go ecc2bd0 link true /test presubmit-agent-sandbox-lint-go
presubmit-agent-sandbox-e2e-test ecc2bd0 link true /test presubmit-agent-sandbox-e2e-test

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Feature Request: TTL support for sandbox

6 participants