
multigpu_torchrun.py does not show a speedup when training on multiple GPUs! #1298

Open
MostafaCham opened this issue Nov 4, 2024 · 0 comments

I have tried the same example provided in multigpu_torchrun.py, but trained on the MNIST dataset and replaced the model with a simple CNN. However, when I increase the number of GPUs on a single node, the training time increases.

I have also tried doubling the batch size whenever I double the number of GPUs, but I still see no improvement: the total training time increases instead of decreasing.

Again, my code is identical to the one in the repository. I would appreciate any help identifying the issue. Thanks in advance.
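For context, my dataloader follows the tutorial's pattern of sharding the dataset across ranks with DistributedSampler, so `--batch_size` is a per-GPU batch size and each GPU should only see roughly 1/N of the data per epoch. A minimal sketch of that pattern (the `prepare_dataloader` name mirrors multigpu_torchrun.py; treat the exact signature as an assumption):

```python
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler

def prepare_dataloader(dataset: Dataset, batch_size: int) -> DataLoader:
    """Per-rank dataloader, mirroring the tutorial's setup (sketch)."""
    return DataLoader(
        dataset,
        batch_size=batch_size,                # per-GPU batch size
        shuffle=False,                        # DistributedSampler handles shuffling
        pin_memory=True,
        sampler=DistributedSampler(dataset),  # each rank gets ~len(dataset)/world_size samples
    )
```

With the sampler in place, each GPU should run fewer steps per epoch as the GPU count grows, which is why I expected the epoch time to drop.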

Here is the content of the Slurm batch file I am using:

```bash
#!/bin/bash
#SBATCH --job-name=4gp
#SBATCH --output=pytorch-DP-%j-%u-4gpu-64-slurm.out
#SBATCH --error=pytorch-DP-%j-%u-4gpu-64-slurm.err
#SBATCH --mem=24G               # Job memory request
#SBATCH --gres=gpu:4            # Number of requested GPU(s)
#SBATCH --time=3-23:00:00       # Time limit days-hrs:min:sec
#SBATCH --constraint=rtx_6000   # Specific hardware constraint

nvidia-smi

# positional args: total_epochs=50, save_every=5; --batch_size is per GPU
torchrun --nnodes=1 --nproc_per_node=4 main_ddp.py 50 5 --batch_size 64
```
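For completeness, the torchrun launch above starts 4 processes and sets the RANK/LOCAL_RANK/WORLD_SIZE environment variables for each of them; my process-group setup follows the tutorial (a sketch mirroring multigpu_torchrun.py's `ddp_setup`):

```python
import os
import torch
from torch.distributed import init_process_group

def ddp_setup():
    """Initialize the DDP process group; torchrun supplies the rank env vars."""
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # one GPU per process
    init_process_group(backend="nccl")                    # NCCL for GPU collectives
```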
