@roclark roclark commented Apr 5, 2021

To make it easier to run on large clusters, Bobber should be able to run on SLURM clusters with Pyxis and Enroot installed. This would replace the need for mpirun and SSH keys/daemons inside the containers, making it easier to run tests without copying images between nodes or synchronizing SSH keys.

Closes #1

Signed-Off-By: Robert Clark [email protected]
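
To illustrate the launch path described above, here is a minimal sketch of how a test could be started through `srun` with Pyxis instead of `mpirun` over SSH. It is not taken from this pull request: the helper name `build_srun_command`, the image name, and the mount path are hypothetical, and `--mpi=pmix` assumes the cluster's Slurm build includes PMIx support. The `--container-image` and `--container-mounts` options are the ones Pyxis adds to `srun`.

```python
import shlex


def build_srun_command(image, nodes, tasks_per_node, test_cmd, mounts=None):
    """Build an srun argument list that runs test_cmd inside a Pyxis container.

    Hypothetical helper for illustration; Bobber's own code may differ.
    """
    cmd = [
        "srun",
        f"--nodes={nodes}",
        f"--ntasks-per-node={tasks_per_node}",
        "--mpi=pmix",  # assumption: Slurm was built with PMIx support
        f"--container-image={image}",  # Pyxis option; Enroot provides the container
    ]
    if mounts:
        # Pyxis takes a comma-separated list of SRC:DST mounts.
        cmd.append("--container-mounts=" + ",".join(mounts))
    return cmd + list(test_cmd)


if __name__ == "__main__":
    command = build_srun_command(
        image="example/bobber:latest",  # hypothetical image name
        nodes=2,
        tasks_per_node=8,
        # all_reduce_perf comes from the nccl-tests suite.
        test_cmd=["all_reduce_perf", "-b", "8", "-e", "128M", "-f", "2", "-g", "1"],
        mounts=["/mnt/fs:/mnt/fs"],  # hypothetical filesystem under test
    )
    print(shlex.join(command))
```

Because Slurm starts the tasks on every node and Pyxis sets up the container on each of them, no SSH daemon or shared key is needed inside the image.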

@roclark roclark added the enhancement (New feature or request) and slurm (Any items related to running tests with SLURM) labels Apr 5, 2021
@roclark roclark requested review from fredvx and joehandzik April 5, 2021 18:52
@roclark roclark self-assigned this Apr 5, 2021

roclark commented Apr 5, 2021

This is currently a draft based on the ongoing discussion in #1. At this point, the NCCL tests should be fully functional using the Python wheel. As I see it, the following items are still required:

  • Add DALI tests
  • Add FIO tests
  • Add mdtest
  • Document the installation and usage
  • Update the troubleshooting guide with steps to fix common issues
  • Get a public image up on NGC

@roclark roclark mentioned this pull request Apr 5, 2021
@roclark roclark force-pushed the slurm-support branch 2 times, most recently from b41f382 to 4866c75 Compare April 5, 2021 19:14
@roclark roclark force-pushed the slurm-support branch 10 times, most recently from e1cfc68 to e1125ff Compare April 8, 2021 16:45
To make it easier to run on large clusters, Bobber should be able to run
on SLURM clusters with Pyxis and Enroot installed. This would replace the
need for mpirun and SSH keys/daemons inside the containers, making it
easier to run tests without copying images between nodes or synchronizing
SSH keys.

Signed-Off-By: Robert Clark <[email protected]>

When using Slurm, Bobber may be launched from a head node that does
not have Docker installed. In this case, Docker should be ignored
unless a command directly needs the Docker runtime.

Signed-Off-By: Robert Clark <[email protected]>
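
The behavior described in this commit message can be sketched as a deferred Docker check: the runtime is looked up only when a command that needs it is invoked. This is an illustrative sketch, not the code from this commit. The names `DOCKER_COMMANDS`, `DockerUnavailableError`, and `check_docker`, as well as the subcommand names in the set, are hypothetical.

```python
import shutil

# Hypothetical subcommands that need a local Docker runtime.
DOCKER_COMMANDS = {"build", "export", "load"}


class DockerUnavailableError(RuntimeError):
    """Raised when a Docker-dependent command runs without Docker installed."""


def docker_available(which=shutil.which):
    """Return True if a docker executable is found on PATH."""
    return which("docker") is not None


def check_docker(command, which=shutil.which):
    """Raise only if `command` needs Docker and Docker is missing.

    Commands outside DOCKER_COMMANDS (for example Slurm-launched tests)
    skip the lookup entirely, so a head node without Docker still works.
    """
    if command not in DOCKER_COMMANDS:
        return
    if not docker_available(which):
        raise DockerUnavailableError(
            f"'{command}' requires Docker, but no docker executable was found"
        )


if __name__ == "__main__":
    check_docker("run-nccl")  # hypothetical Slurm-launched command: never checks Docker
    try:
        check_docker("build")
    except DockerUnavailableError as err:
        print(err)
```

The `which` parameter exists only so the lookup can be replaced in tests; real callers would use the default.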

Development

Successfully merging this pull request may close these issues.

Update mechanism for synchronizing SSH keys
