feat(sandbox): add /delete endpoint + auto-delete background scan by zhangjaycee · Pull Request #1038 · alibaba/ROCK

zhangjaycee · 2026-05-27T14:23:38Z

When `--storage-opt size=` is in effect, docker allocates an XFS project id on the container's overlay2 upper dir and binds a bhard limit to it. Previously the sandbox log dir got an independently-allocated prjid + its own bhard (disk_limit_log), so one sandbox had two separate quotas — log writes could exhaust the log quota while rootfs sat idle (or vice versa) even though there was no operational reason to split them. Reuse docker's prjid for the log dir instead. Both paths now share a single bhard (the rootfs limit) — the user-facing knob becomes "this sandbox's total disk write budget", which matches how operators reason about it. Implementation (rock/deployments/docker.py): - _get_docker_rootfs_prjid_and_mountpoint (new): reads the prjid docker assigned to the container's upper dir via `docker inspect` + `xfs_io -r -c lsproj`, plus the mountpoint via findmnt. Only meaningful after `docker create` — the upper dir doesn't exist before that. - _setup_log_dir_quota_shared (new): runs in the create→start gap added by the previous commit. If the log dir lives on the same XFS mountpoint as the rootfs upper dir, attaches the same prjid to it with `xfs_quota project -s` (binds prjid + sets inherit flag). Does NOT call `limit -p` — the bhard belongs to docker; a second limit would clobber it. Cleanup is also docker's responsibility: `docker rm` resets the prjid's limit when the upper dir is torn down. - start() invokes the shared path unconditionally. On any failure (no rootfs prjid, log dir on a different XFS, subprocess error) the log dir simply has no quota in effect — there is no independent fallback. Deployments that need shared-quota enforcement must keep the log dir and docker root on the same filesystem. - _effective_disk_limit_log mirrors the rootfs limit on success, stays None otherwise. Drops the legacy independent-prjid code path entirely: - _try_set_log_dir_quota / _cleanup_log_dir_xfs_quota methods removed - XFS_PRJID_MIN / XFS_PRJID_RANGE constants removed - log_dir_xfs_prjid / log_dir_xfs_mountpoint fields removed (no longer need to track the prjid for manual cleanup)

…ive key Add pick_sandbox_info_fields() that keeps only keys declared in the SandboxInfo TypedDict, and apply it inside SandboxMetaStore.create/update before json_set on the alive key. Why: SandboxRecord.to_dict() returns extra DB-only keys (spec, status) that the DB-fallback read path (meta_store.get(check_db=True)) surfaces. Without filtering, callers that round-trip the dict back through update() would leak those keys into Redis. Letting the schema drive the allowlist also auto-protects future DB-only columns and removes the need for ad-hoc blocklists at every call site. DB writes are unchanged — sandbox_table has its own column-based filter. (cherry picked from commit b0191a1) Signed-off-by: Jiachen Zhang <zjc462490@alibaba-inc.com>

…e semantics archive_cache() reads the current Redis alive-key, merges final_info on top (e.g. stop_time, state), persists the snapshot to DB, then evicts both Redis keys. The DB always receives a complete snapshot regardless of what the caller passes.

When sm is None, _get_current_statemachine already checked both Redis (alive key) and DB (check_db=True) and found nothing. archive_cache in this path would: read Redis (empty), merge (empty), skip DB write (nothing to write), delete Redis keys (already absent) — a pure no-op. The operator.stop() call is kept to kill any lingering Ray actor.

restore_cache() re-seeds Redis alive + timeout keys from stored state, making an archived sandbox live again. Used by restart to fill the cache after on_stop has evicted it via archive_cache.

The previous shell script spawned the watchdog via `nohup ... &`, so the process saved in `self._clean_container_background_process` was only the short-lived setup script (which exited within milliseconds after detaching the real watchdog). The subsequent `.kill()` call in `actor.stop()` was a no-op and the real watchdog was unreachable from Python. Run the watchdog as the foreground process of the Popen call and add `start_new_session=True` to preserve the SIGHUP isolation that `nohup` provided. With this, the Popen handle points at the actual watchdog and `.kill()` works, which is the prerequisite for the upcoming restart fix (restart must terminate the old watchdog before docker start, otherwise the old watchdog races and `docker stop`s the freshly started container).

restart brings a stopped sandbox back up by running `docker start` on the original container, preserving its filesystem state across the stop/restart cycle. start() remains the path for fresh containers via `docker run`. - POST /restart admin route + SDK Sandbox.restart() - AbstractOperator / RayOperator / SandboxActor / DockerDeployment each expose a restart() method; DockerDeployment.restart() runs `docker start` and validates the container is running. - SandboxStateMachine adds a stopped -> pending transition. on_restart rebuilds DockerDeploymentConfig from the spec snapshot in sandbox_record.spec (DockerDeploymentConfig.model_dump written once by sandbox_table.create), so the new actor wraps the existing container with the same image / memory / cpus / auto_clear. Sandboxes without a spec snapshot fall back to flat sandbox_info fields plus pydantic field defaults. - SandboxManager.restart_async validates the transition and dispatches the SM event; symmetric with stop(). perf(ports): find_free_port resolves docker-published host ports from a module-level cache. do_port_mapping refreshes the cache once per call so its three find_free_port lookups share a single docker scan; standalone find_free_port callers lazy-refresh when the cache is empty. Tests: integration suites for find_free_port and DockerDeployment diagnostics; unit tests for the SM restart transition and the manager restart path.

…rror _stop() calls _cleanup_kata_disk() which deletes the host .img file bound to the container via -v. On restart, docker start cannot mount the missing volume and kata-agent's createContainer fails. Until a dedicated delete API moves disk cleanup out of _stop(), restart is blocked for kata containers. Raise NotImplementedError early in DockerDeployment.restart() so callers get an actionable error instead of a cryptic kata-agent gRPC failure. Signed-off-by: Jiachen Zhang <zjc462490@alibaba-inc.com>

- Add DELETE /sandbox/{id} API that permanently removes stopped containers - State machine guarantees delete only triggers from stopped (actor already killed), so delete reuses the standard `sandbox-{id}` actor name - DockerDeployment.delete() issues `docker rm -f` for permanent removal - SandboxStateMachine: new `deleted` terminal state with `delete` transition - MetaStore: hard-delete Redis key + DB record on delete - Auto-delete scheduler scans for sandboxes stopped > N hours and deletes them - SDK client: add `delete()` method

zhangjaycee added 10 commits May 27, 2026 06:59

feat(meta-store): add restore_cache as inverse of archive_cache

8bffbae

restore_cache() re-seeds Redis alive + timeout keys from stored state, making an archived sandbox live again. Used by restart to fill the cache after on_stop has evicted it via archive_cache.

feat(sandbox-table): add optional limit param to list_by_in query

61dd6b1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(sandbox): add /delete endpoint + auto-delete background scan#1038

feat(sandbox): add /delete endpoint + auto-delete background scan#1038
zhangjaycee wants to merge 10 commits into
alibaba:masterfrom
zhangjaycee:feature/delete

zhangjaycee commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zhangjaycee commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant