Slurm interactive mode, transcribe_speech_parallel.py gets stuck on consecutive runs #11105

Open · itzsimpl opened this issue Oct 30, 2024 · 0 comments · Labels: bug

With the container nvcr.io/nvidia/nemo:24.07 (Pyxis/Enroot), running in Slurm interactive mode with 1 GPU, if I execute the command

```bash
python3 /opt/NeMo/examples/asr/transcribe_speech_parallel.py \
...
```

the script gets stuck on the second of multiple consecutive runs. The point where it does so is:

> HERE

```
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
...
```

However, if I execute the command

```bash
torchrun --standalone --nnodes=1 --nproc-per-node=1 /opt/NeMo/examples/asr/transcribe_speech_parallel.py \
...
```

I can run it multiple consecutive times without any issues.
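
The obvious difference between the two launch modes is the environment the launcher prepares. As far as I understand, torchrun exports the rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, WORLD_SIZE) to the worker process, while a bare python3 launch leaves them unset, so the process-group setup in the script takes a different path. A quick, illustrative way to see what torchrun injects (not taken from my runs):

```bash
# Illustrative only: print the distributed/rendezvous variables torchrun exports
# to its workers; with a plain `python3` launch none of these are set.
torchrun --standalone --nnodes=1 --nproc-per-node=1 --no-python \
  env | grep -E 'MASTER_ADDR|MASTER_PORT|RANK|WORLD_SIZE|LOCAL_RANK'
```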

I have tried with standard debug environment variables like

```bash
TORCH_CPP_LOG_LEVEL=INFO
TORCH_DISTRIBUTED_DEBUG=INFO
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=ALL
```

but nothing peculiar pops out.
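
For reference, setting these amounts to exporting them in the interactive shell before the run (or prefixing them on the command line); a minimal sketch:

```bash
# Sketch: export the debug variables in the interactive shell before re-running the script
export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
```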
