feat(scheduler): DB-driven SandboxLogArchiveTask, drop sentinel file design #1025

Open
jinbai340997 wants to merge 4 commits into alibaba:master from jinbai340997:feat/sandbox-log-archive-db-driven

Summary

New files

  • rock/admin/scheduler/tasks/sandbox_log_archive_task.py — task implementation + module-level providers + _run_on_main_loop
    cross-loop dispatch
  • tests/unit/admin/scheduler/test_sandbox_log_archive_task.py — 18 unit tests covering classification (orphan / alive / too_young
    / archived), credential-via-env, OSS key format, error isolation, cross-loop dispatch behavior
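The four classification outcomes named in the test list could be sketched roughly as follows. This is only an illustration under stated assumptions: the SandboxRecord shape, the is_alive and stopped_at fields, and the exact rules are hypothetical and not taken from the task implementation. Only the four labels and the keep_days_before_archive=3 default come from this PR.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SandboxRecord:
    """Hypothetical stand-in for a sandbox_table row (fields are assumptions)."""
    sandbox_id: str
    is_alive: bool
    stopped_at: Optional[datetime]


def classify(record: Optional[SandboxRecord], now: datetime, keep_days: int = 3) -> str:
    """Return one of: orphan, alive, too_young, archived."""
    if record is None:
        # A log directory with no matching DB row.
        return "orphan"
    if record.is_alive:
        # The sandbox is still running, so its logs are left alone.
        return "alive"
    if record.stopped_at is None or now - record.stopped_at < timedelta(days=keep_days):
        # Not old enough yet. An unknown stop time is treated conservatively.
        return "too_young"
    # Old enough: in this sketch the label means "eligible for, and sent to, archive".
    return "archived"
```

Keeping the decision in one pure function with an explicit now argument makes each outcome easy to cover in a unit test without touching the DB or the worker.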

Modified files

  • rock/admin/main.py — wire sandbox_table / rock_config / main_loop providers from lifespan
  • rock/admin/scheduler/tasks/__init__.py — register new task

Cross-loop dispatch contract

Safety cap: _CROSS_LOOP_DISPATCH_TIMEOUT = 60.0 seconds. If the main loop is gone (for example, admin was SIGKILLed before
SchedulerThread stopped), a dispatch from the child loop would otherwise hang forever. 60 s is generous for any sandbox_table
query; the dispatch raises TimeoutError if the cap is exceeded.
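A minimal sketch of what such a bounded cross-loop dispatch can look like, assuming the caller is a coroutine running on the scheduler thread's own event loop. The function name and signature here are illustrative and may differ from the actual _run_on_main_loop in this PR. Only the 60.0 constant and the TimeoutError behavior are taken from the description above.

```python
import asyncio

_CROSS_LOOP_DISPATCH_TIMEOUT = 60.0  # seconds, value from the PR description


async def run_on_main_loop(coro, main_loop, timeout=_CROSS_LOOP_DISPATCH_TIMEOUT):
    """Schedule `coro` on `main_loop` (another thread) and await its result.

    Illustrative helper, not the PR's implementation.
    """
    # Thread-safe hand-off: returns a concurrent.futures.Future.
    cf_future = asyncio.run_coroutine_threadsafe(coro, main_loop)
    try:
        # wrap_future makes the result awaitable on the *calling* loop,
        # and wait_for bounds how long we are willing to wait for it.
        return await asyncio.wait_for(asyncio.wrap_future(cf_future), timeout)
    except asyncio.TimeoutError:
        # The main loop never answered (for example, it is no longer running).
        cf_future.cancel()
        raise TimeoutError(
            f"main loop did not respond within {timeout} seconds"
        ) from None
```

Because the coroutine executes on the main loop, any loop-bound resources it touches stay on the loop that created them, which is what avoids "Future attached to a different loop" errors.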

Worker discovery hardening

_discover_candidates uses find -maxdepth 1 -mindepth 1 -type d to restrict results to directories only. The earlier ls -1
also returned daemon-written files (docuum.log, rocklet.log, rock_worker.log, access.log, command.log, rsync_logs_to_host.log,
worker_metrics_monitor.log, image_pull.log) alongside the real sandbox subdirectories, and each of those files triggered a
useless DB lookup and a spurious "orphan log dir" warning.
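For illustration, the directory-only listing can be sketched as a command builder plus an output parser. Both helper names are hypothetical, and the assumption that sandbox IDs are the basenames of the returned paths is mine. Only the find flags come from this PR.

```python
import os
import shlex
from typing import List


def build_discover_command(log_root: str) -> str:
    """Shell command that lists only the immediate subdirectories of log_root."""
    # -mindepth 1 excludes log_root itself, -maxdepth 1 stops recursion,
    # and -type d drops plain files such as rocklet.log or access.log.
    return f"find {shlex.quote(log_root)} -maxdepth 1 -mindepth 1 -type d"


def parse_candidates(find_output: str) -> List[str]:
    """Turn the command's stdout into a list of directory basenames."""
    return [
        os.path.basename(line.rstrip("/"))
        for line in find_output.splitlines()
        if line.strip()
    ]
```

Quoting the root with shlex.quote keeps the command safe if the path contains spaces or shell metacharacters.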

Test plan

  • Unit: uv run pytest tests/unit/admin/scheduler/test_sandbox_log_archive_task.py -v — 18/18 PASS
  • Pre-prod: direct run_action invocation on 3 workers gave scanned=21–33, failed=0, zero "Future attached to a different
    loop" errors, and classification counts matching DB state
  • ossutil v1.7.18 verified installed on worker (out of scope for this PR; handled in the internal worker Dockerfile)
  • OSS upload path manually verified (ossutil cp /tmp/test.txt oss://chatos-rock/... succeeded)
  • Post-launch verification: confirm oss://chatos-rock/rock-archives/sandbox-logs/<sandbox_id>.tar.gz appears after first
    sandbox ages past keep_days_before_archive=3 (validates admin → worker env-cred injection path end-to-end)

Fixes [Feature] DB-driven SandboxLogArchiveTask (replaces sentinel-file design) #1024
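
The archive location in the post-launch check follows the pattern oss://chatos-rock/rock-archives/sandbox-logs/<sandbox_id>.tar.gz. A small sketch of building that URL is below. The helper name, its parameters, and the validation rule are hypothetical, and the real task may derive the bucket and prefix from rock_config instead.

```python
def archive_object_url(
    sandbox_id: str,
    bucket: str = "chatos-rock",
    prefix: str = "rock-archives/sandbox-logs",
) -> str:
    """Build the OSS URL for one sandbox's log archive (illustrative helper)."""
    # Assumed validation: an ID containing "/" would change the object's key path.
    if not sandbox_id or "/" in sandbox_id:
        raise ValueError(f"invalid sandbox_id: {sandbox_id!r}")
    return f"oss://{bucket}/{prefix.strip('/')}/{sandbox_id}.tar.gz"
```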

@jinbai340997 force-pushed the feat/sandbox-log-archive-db-driven branch from f370ff3 to 4ab29e2 on May 27, 2026 12:24
@jinbai340997 force-pushed the feat/sandbox-log-archive-db-driven branch from 4ab29e2 to 30a4724 on May 28, 2026 03:28