Skip to content

Conversation

@cblmemo
Copy link
Collaborator

@cblmemo cblmemo commented Oct 23, 2025

Fixes #7707

In managed jobs, HA recovery and update_managed_jobs_statuses should not run at the same time to avoid it being set to CONTROLLER_FAILED, while it could be just not recovered yet. We previously run the recovery script before we start the job status refresh deamon, who calls update_managed_jobs_statuses, but this function is also called in sky jobs cancel which causes the conflict. This PR fixes this issue.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

@cblmemo
Copy link
Collaborator Author

cblmemo commented Oct 23, 2025

/quicktest-core
/smoke-test

Copy link
Collaborator

@cg505 cg505 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like the update_managed_jobs_statuses could still begin before we get a chance to trigger ha_recovery_for_consolidation_mode, right? or do we have some protection against that?

@cblmemo
Copy link
Collaborator Author

cblmemo commented Oct 26, 2025

seems like the update_managed_jobs_statuses could still begin before we get a chance to trigger ha_recovery_for_consolidation_mode, right? or do we have some protection against that?

technically yes. but very unlikely. so the update_managed_jobs_statuses is only called in 2 places:

  1. the request deamon
  2. cancel_jobs_by_id, which is used by sky jobs cancel

in the first case, we dont call update_managed_jobs_statuses until ha_recovery_for_consolidation_mode finish. if a user submit a sky jobs cancel so fast that it is scheduled even before the request deamon start execution, it is possible. but this is really not possible as those request deamon starts at the api server startup.

@cblmemo cblmemo merged commit a4f8094 into master Oct 30, 2025
21 of 22 checks passed
@cblmemo cblmemo deleted the fix-controller-failed-on-ha branch October 30, 2025 22:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Jobs][Consolidation] HA failed on controller restart

3 participants