Skip to content

[Slurm] Add retry_delays when waiting for workers #176

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 8, 2021

Conversation

David96
Copy link
Contributor

@David96 David96 commented Sep 30, 2021

Fixes JuliaParallel/SlurmClusterManager.jl#55

Inspired by the LSF code. It's not very well tested though, especially about the error path I'm not sure. From reading the code I think just throwing an exception is fine?

Copy link
Collaborator

@kescobo kescobo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I'll wait to merge for a couple of days in case anyone has bright ideas for how to test. This package needs a lot more testing in general, but as it is, I think it's worth merging

@kescobo kescobo added enhancement manager: SLURM The Slurm Workload Manager labels Oct 4, 2021
@David96
Copy link
Contributor Author

David96 commented Oct 4, 2021

Just something maybe worth noting: for people using SlurmManager directly (not addprocs_slurm) this is actually a breaking change as the retry_delays parameter is required.

@kescobo
Copy link
Collaborator

kescobo commented Oct 5, 2021

Oh, thank you for noting that, it's definitely important to know about that for SemVer purposes. Can you set a default value? I think you should be able to just define a constructor like SlurmManager(n::Int) = SlurmManager(n, 3) or something.

@David96
Copy link
Contributor Author

David96 commented Oct 6, 2021

Added the default constructor, I used the exponential backoff as used in addprocs_slurm (which I got from LSF), too, though. If you want to minimize changes in behavior I can change both of them to a repeated iterator.

@kescobo
Copy link
Collaborator

kescobo commented Oct 6, 2021

I don't think minimizing changes is the goal, I think we want the behavior that makes most sense as a default. I'd consider this not technically breaking if the same code works to launch jobs, but I'm happy to defer to others. Incrementing a version isn't the worst thing in the world.

@kescobo
Copy link
Collaborator

kescobo commented Oct 7, 2021

Not sure if you want to use #154 to try to add tests, but AFAIC this is ready to merge. Let me know

@David96
Copy link
Contributor Author

David96 commented Oct 8, 2021

I'm not exactly sure how a test on this would look like so from my side, it's ready to merge, too.

@kescobo kescobo merged commit 14e7302 into JuliaParallel:master Oct 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement manager: SLURM The Slurm Workload Manager
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Better handling of SLURM job submission timing
2 participants