I have tried the same example provided in multigpu_torchrun.py, training on the MNIST dataset with the model replaced by a simple CNN. However, when I increase the number of GPUs on a single node, the training time increases.
I have also tried doubling the batch size when doubling the number of GPUs, but I still see no speedup; the total training time goes up instead of down.
Again, my code is identical to the one in the repository. I would appreciate any help identifying the issue. Thanks in advance.
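For reference, here is a condensed sketch of the setup I am describing. It mirrors the structure of multigpu_torchrun.py; `SimpleCNN` is just a stand-in for my model and the hyperparameters are illustrative:

```python
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed import init_process_group, destroy_process_group
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms

class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)   # 28x28 -> 26x26
        self.conv2 = nn.Conv2d(32, 64, 3)  # 26x26 -> 24x24
        self.fc = nn.Linear(64 * 12 * 12, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # 24x24 -> 12x12
        return self.fc(torch.flatten(x, 1))

def main(total_epochs: int, batch_size: int):
    init_process_group(backend="nccl")          # torchrun provides RANK/WORLD_SIZE/MASTER_ADDR
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dataset = datasets.MNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())
    # DistributedSampler shards the dataset, so each rank sees 1/world_size of the
    # samples; --batch_size is therefore the per-GPU batch size, not the global one
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=False,
                        sampler=DistributedSampler(dataset), pin_memory=True)

    model = DDP(SimpleCNN().to(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for epoch in range(total_epochs):
        loader.sampler.set_epoch(epoch)         # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            optimizer.zero_grad()
            F.cross_entropy(model(x), y).backward()
            optimizer.step()

    destroy_process_group()

if __name__ == "__main__":
    main(total_epochs=50, batch_size=64)
```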
Here is the content of the SLURM file I am using:

```bash
#SBATCH --job-name=4gp
#SBATCH --output=pytorch-DP-%j-%u-4gpu-64-slurm.out
#SBATCH --error=pytorch-DP-%j-%u-4gpu-64-slurm.err
#SBATCH --mem=24G # Job memory request
#SBATCH --gres=gpu:4 # Number of requested GPU(s)
#SBATCH --time=3-23:00:00 # Time limit days-hrs:min:sec
#SBATCH --constraint=rtx_6000 # Specific hardware constraint
nvidia-smi
torchrun --nnodes=1 --nproc_per_node=4 main_ddp.py 50 5 --batch_size 64  # 50 = total_epochs, 5 = save_every (as in the example script)
```
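To narrow down whether the extra time comes from compute, data loading, or gradient synchronization, I can time each epoch per rank. A minimal sketch, assuming the `model`, `loader`, `optimizer`, and `local_rank` from the snippet above:

```python
import time
import torch
import torch.distributed as dist
import torch.nn.functional as F

def timed_epoch(model, loader, optimizer, local_rank):
    torch.cuda.synchronize(local_rank)           # drain queued GPU work before starting the clock
    start = time.perf_counter()
    for x, y in loader:
        x, y = x.to(local_rank), y.to(local_rank)
        optimizer.zero_grad()
        F.cross_entropy(model(x), y).backward()  # DDP all-reduces gradients during backward
        optimizer.step()
    torch.cuda.synchronize(local_rank)           # wait for outstanding kernels so the timing is honest
    print(f"[rank {dist.get_rank()}] epoch time: {time.perf_counter() - start:.2f}s")
```

Since the sampler shards the data, each rank processes a quarter of the dataset per epoch on 4 GPUs; if the per-epoch time does not drop accordingly, the overhead is presumably in NCCL synchronization or data loading rather than compute.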