feat(scheduler): RayLogCleanupTask 4-stage cleanup of session_latest/logs #1029

Open
jinbai340997 wants to merge 3 commits into alibaba:master from jinbai340997:feat/ray-log-cleanup-session-latest

@jinbai340997
Collaborator

Summary

Modified files

  • rock/admin/scheduler/tasks/ray_log_cleanup_task.py — 4-stage shell pipeline + daemon whitelist + 3 new tunable params
    (live_log_keep_days, old_logs_keep_hours, setup_log_keep_minutes) + per-category counters in return dict
  • tests/unit/admin/scheduler/test_ray_log_cleanup_task.py — 29 tests covering all 4 stages, daemon whitelist regression guards
    (especially PART 2a), output parsing, init validation, from_config defaults
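
The category split described above can be illustrated with a short Python sketch. This is not the shell pipeline from the PR: the function names, the filename regex, and the assumption that a worker log carries its PID as the last numeric field are all hypothetical. Only the three daemon prefixes and the `runtime_env_setup-` prefix come from this description. The point it shows is ordering: the daemon whitelist is checked before PID extraction, so a file such as `agent-1183572637.err` is never treated as a dead-PID worker log.

```python
import os
import re
from typing import Optional

# Hypothetical whitelist; these three prefixes are the ones named in this PR.
DAEMON_PREFIXES = ("agent", "log_monitor", "dashboard_agent")

# Assumed naming convention: the PID is the last numeric field before the extension.
_PID_RE = re.compile(r"-(\d+)\.(?:out|err|log)$")


def is_daemon_log(name: str) -> bool:
    """True if the file belongs to a whitelisted long-lived daemon."""
    return any(name.startswith((p + "-", p + ".")) for p in DAEMON_PREFIXES)


def extract_pid(name: str) -> Optional[int]:
    """Return the PID embedded in the filename, or None if there is none."""
    match = _PID_RE.search(name)
    return int(match.group(1)) if match else None


def pid_is_alive(pid: int) -> bool:
    """POSIX liveness probe: signal 0 checks existence without sending a signal."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def classify(name: str, in_old_dir: bool = False) -> str:
    """Map a log filename to the stage that would handle it."""
    if in_old_dir:
        return "old"      # PART 3: logs/old/, aged out by hours
    if is_daemon_log(name):
        return "daemon"   # whitelisted, never deleted
    if name.startswith("runtime_env_setup-"):
        return "setup"    # PART 2c: aged out by minutes
    if extract_pid(name) is not None:
        return "pid"      # PART 2a: deleted once the PID is dead
    return "live"         # PART 2b: aged out by days
```

Checking the whitelist first is the design choice that matters here; swapping the daemon and PID branches would reintroduce the failure mode the regression guards are meant to catch.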

New tunables (all backward compatible, defaults preserve existing behavior)

  • live_log_keep_days — PART 2b mtime threshold for non-PID non-daemon files (default 7)
  • old_logs_keep_hours — PART 3 mtime threshold for logs/old/ files (default 24)
  • setup_log_keep_minutes — PART 2c mtime threshold for runtime_env_setup-* (default 60)
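
As a rough illustration of how the three tunables translate into age cutoffs, the sketch below converts each one to seconds and compares it against a file's mtime. The parameter names and defaults are taken from the list above; the `CleanupThresholds` class, the `is_expired` helper, and the positive-integer validation rule are assumptions made for this example and may not match the task's actual init validation. A shell `find -mtime` based pipeline may also round ages differently.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CleanupThresholds:
    """Hypothetical container for the three tunables, with the PR's defaults."""

    live_log_keep_days: int = 7
    old_logs_keep_hours: int = 24
    setup_log_keep_minutes: int = 60

    def __post_init__(self) -> None:
        # Assumed validation rule: every threshold must be a positive integer.
        for name in ("live_log_keep_days", "old_logs_keep_hours", "setup_log_keep_minutes"):
            value = getattr(self, name)
            if not isinstance(value, int) or value <= 0:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")

    def max_age_seconds(self, category: str) -> int:
        """Return the retention window for a category, in seconds."""
        return {
            "live": self.live_log_keep_days * 86400,    # PART 2b
            "old": self.old_logs_keep_hours * 3600,     # PART 3
            "setup": self.setup_log_keep_minutes * 60,  # PART 2c
        }[category]


def is_expired(
    mtime: float,
    category: str,
    thresholds: CleanupThresholds,
    now: Optional[float] = None,
) -> bool:
    """True if a file with this mtime is older than its category's window."""
    now = time.time() if now is None else now
    return (now - mtime) > thresholds.max_age_seconds(category)
```

Passing `now` explicitly keeps the check deterministic in tests, for example `is_expired(now - 25 * 3600, "old", CleanupThresholds(), now)`.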

Cron mirror (out of scope of this PR)

The ray-head node runs an identical shell pipeline daily via cron, because no rocklet runs on ray-head and the scheduler cannot reach it. The shell
mirror lives in the rock-internal repo and is kept in lock-step with this task's pipeline (manual sync; both stages added together).

Test plan

  • Unit: uv run pytest tests/unit/admin/scheduler/test_ray_log_cleanup_task.py -v — 29/29 PASS
  • Pre-prod worker 11.8.74.237 (6-month Ray session, 1705 files):
    • 160 daemon files preserved (BEFORE=160 → AFTER=160, zero diff) ← validates fix
    • 41 dead-PID worker files removed in single run (PART 2a)
    • runtime_env_setup-<hex> files correctly handled by new PART 2c (verified 770+ such files exist on production worker)
    • PART 3 verified by synthetic test: touched 25h-old file → deleted, fresh file → preserved
  • Regression: ran the task on a long-lived session containing files such as agent-1183572637.err; agent / log_monitor / dashboard_agent
    files were preserved as expected

Fixes [Feature] RayLogCleanupTask 4-stage cleanup of session_latest/logs (PID-aware + daemon-safe) #1028
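
The synthetic PART 3 check from the test plan (a 25h-old file is deleted, a fresh file is kept) could be reproduced locally along these lines. This is a standalone Python approximation, not the task's shell stage: `purge_old_logs` and `run_synthetic_check` are hypothetical helpers, and the file is aged by rewriting its mtime with `os.utime` inside a temporary directory.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List, Tuple


def purge_old_logs(old_dir: Path, keep_hours: int = 24) -> List[str]:
    """Delete regular files in old_dir whose mtime is older than keep_hours."""
    cutoff = time.time() - keep_hours * 3600
    removed = []
    for path in sorted(old_dir.iterdir()):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return removed


def run_synthetic_check() -> Tuple[List[str], List[str]]:
    """Create one stale and one fresh file, purge, and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        old_dir = Path(tmp) / "logs" / "old"
        old_dir.mkdir(parents=True)

        stale = old_dir / "stale.log"
        fresh = old_dir / "fresh.log"
        stale.write_text("x")
        fresh.write_text("x")

        # Backdate the stale file by 25 hours, past the 24-hour default.
        aged = time.time() - 25 * 3600
        os.utime(stale, (aged, aged))

        removed = purge_old_logs(old_dir, keep_hours=24)
        remaining = sorted(p.name for p in old_dir.iterdir())
        return removed, remaining
```

Calling `run_synthetic_check()` returns the removed and remaining filenames, so the same helper can back both a manual check and a unit-test assertion.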

@jinbai340997 jinbai340997 force-pushed the feat/ray-log-cleanup-session-latest branch from 97477cc to d27cb60 Compare May 27, 2026 12:24