Dynamic allocation #73

Open
1 of 5 tasks
Tracked by #71
cmelone opened this issue Jul 31, 2024 · 1 comment
Collaborator

cmelone commented Jul 31, 2024

We would like to use historical resource utilization data to predict future usage and set container memory and CPU requests in Kubernetes accordingly, with the goal of improving utilization.

To avoid negatively impacting CI, we are starting with requests only.

Because we are only setting resource requests, jobs can still use more CPU/RAM than we predict. The next phase of experimentation will focus on introducing limits.
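
As a rough illustration of the requests-only approach, the sketch below derives CPU and memory requests from past usage samples and builds a Kubernetes-style `resources` block that deliberately has no `limits` key. The function names, the use of a simple mean as the predictor, and the sample values are all assumptions made for illustration; this is not gantry's actual prediction algorithm.

```python
from statistics import mean


def predict_requests(cpu_samples, mem_samples):
    """Hypothetical predictor: use the mean of past usage as the request."""
    return {"cpu": mean(cpu_samples), "memory": mean(mem_samples)}


def to_k8s_resources(pred):
    """Format a prediction as a container `resources` block (requests only)."""
    cpu_millicores = round(pred["cpu"] * 1000)      # cores -> millicores
    mem_mib = round(pred["memory"] / (1024 ** 2))   # bytes -> MiB
    # No "limits" key: the job may still exceed these values at runtime.
    return {"requests": {"cpu": f"{cpu_millicores}m", "memory": f"{mem_mib}Mi"}}


# Made-up history for one job: CPU in cores, memory in bytes.
history_cpu = [1.5, 2.0, 2.5]
history_mem = [2 * 1024**3, 3 * 1024**3, 4 * 1024**3]

resources = to_k8s_resources(predict_requests(history_cpu, history_mem))
print(resources)
```

Because only `requests` is set, the scheduler uses these values for placement, but nothing caps what the container actually consumes.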

The ability to predict usage and assign requests has been implemented in the gantry web API, but a few steps remain to complete this goal:

  • Merge CI scriptable config into spack
  • Perform tests to ensure the system will function as expected
  • Enable gantry in spack
  • Monitor results to see if resources are indeed being optimized and fix any issues
  • Implement assigning resource limits after designing an accurate prediction algorithm
@cmelone cmelone added the feature New feature or request label Jul 31, 2024
@cmelone cmelone self-assigned this Jul 31, 2024
Collaborator Author

cmelone commented Oct 29, 2024

I've now deployed gantry to the CI staging cluster and will be testing how well it works. Things to experiment with:

  • setting requests=limits
    • this would guarantee that jobs cannot escape their "cocoon" of resources, but it would also mean leaving mem_max - mem_mean on the table, in exchange for more stability in CI
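
To make the trade-off concrete, here is a minimal sketch that assumes both the request and the limit are pinned to the observed peak memory (`mem_max`). Under that assumption, the amount reserved but unused on average is `mem_max - mem_mean`. The helper names and the sample values are hypothetical.

```python
def idle_headroom(mem_samples_mib):
    """Memory reserved but unused on average when request == limit == peak."""
    mem_max = max(mem_samples_mib)
    mem_mean = sum(mem_samples_mib) / len(mem_samples_mib)
    return mem_max - mem_mean


def pinned_resources(mem_samples_mib):
    """Assumed policy: set the memory request and limit to the observed peak."""
    value = f"{max(mem_samples_mib)}Mi"
    return {"requests": {"memory": value}, "limits": {"memory": value}}


# Made-up memory usage samples for one job, in MiB.
samples = [1000, 2000, 3000]

headroom = idle_headroom(samples)
resources = pinned_resources(samples)
print(headroom, resources)
```

With these samples, 1000 MiB out of the 3000 MiB reserved would sit idle on average; that idle share is the price of guaranteeing the job never exceeds its allocation.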

How to evaluate: compare the average duration of each spec before and after dynamic allocation to see how performance has been impacted. Possibly compare cost as well?
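
One possible shape for that comparison is sketched below. It assumes job records are available as `(spec, duration_seconds)` pairs for the periods before and after dynamic allocation; the record format, function names, and numbers are made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def avg_duration_by_spec(jobs):
    """Average duration per spec from (spec, duration_seconds) pairs."""
    durations = defaultdict(list)
    for spec, seconds in jobs:
        durations[spec].append(seconds)
    return {spec: mean(values) for spec, values in durations.items()}


def relative_change(before_jobs, after_jobs):
    """Fractional change in average duration for specs seen in both periods."""
    before = avg_duration_by_spec(before_jobs)
    after = avg_duration_by_spec(after_jobs)
    common = before.keys() & after.keys()
    return {spec: (after[spec] - before[spec]) / before[spec] for spec in common}


# Made-up build durations in seconds.
before_jobs = [("zlib", 100), ("zlib", 120), ("hdf5", 600)]
after_jobs = [("zlib", 121), ("hdf5", 570)]

change = relative_change(before_jobs, after_jobs)
for spec, delta in sorted(change.items()):
    print(f"{spec}: {delta:+.1%}")
```

A positive value means the spec got slower after dynamic allocation and a negative value means it got faster. Specs that appear in only one period are skipped, since there is nothing to compare them against.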
