With the container nvcr.io/nvidia/nemo:24.07 (Pyxis/Enroot), run in Slurm interactive mode with 1 GPU, if I execute the command

python3 /opt/NeMo/examples/asr/transcribe_speech_parallel.py \ ...

the script gets stuck on the second of multiple consecutive runs. The point where it hangs is:
> HERE Using bfloat16 Automatic Mixed Precision (AMP) GPU available: True (cuda), used: True ...
However, if I execute the command
torchrun --standalone --nnodes=1 --nproc-per-node=1 /opt/NeMo/examples/asr/transcribe_speech_parallel.py \ ...
I can run it multiple consecutive times without any issue.
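One observable difference between the two launch methods (a sketch of my debugging, not from the original report): torchrun exports the rendezvous environment variables that torch.distributed reads at initialization, while a bare python3 launch leaves them unset. Printing them in both modes shows which initialization path the script takes:

```python
# Minimal sketch (assumption, not part of the issue): report the rendezvous
# environment variables that torchrun exports for each worker but a bare
# `python3` launch leaves unset. A difference here changes how the script
# initializes torch.distributed.
import os

RDZV_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")

def report_dist_env():
    """Return the current value (or None) of each torchrun rendezvous variable."""
    return {var: os.environ.get(var) for var in RDZV_VARS}

if __name__ == "__main__":
    for var, value in report_dist_env().items():
        print(f"{var}={value if value is not None else '<unset>'}")
```

Under the bare python3 launch these all print `<unset>`; under the torchrun launch they are populated by the local rendezvous that `--standalone` sets up.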
I have tried the standard debug environment variables, e.g.

TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=INFO NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL

but nothing unusual stands out in the logs.
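For completeness, the variables were set inline for the launch; the quick check below (a sketch, with the script's arguments elided as in the report) confirms they actually reach the child process:

```shell
# Sketch: prepend the PyTorch/NCCL debug variables to the launch and verify
# they are visible to the child Python process before running the real script.
TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=INFO \
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL \
python3 -c 'import os; print(os.environ["NCCL_DEBUG"])'
```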