Slurm interactive mode, transcribe_speech_parallel.py gets stuck on consecutive runs #11105

Open · itzsimpl opened this issue Oct 30, 2024 · 0 comments · Labels: bug

With the container nvcr.io/nvidia/nemo:24.07 (Pyxis/Enroot), running in Slurm interactive mode with 1 GPU, if I execute the command

```bash
python3 /opt/NeMo/examples/asr/transcribe_speech_parallel.py \
...
```

the script gets stuck on the second of multiple consecutive runs. The point where it does so is:

> HERE

```
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
...
```

However, if I execute the command

```bash
torchrun --standalone --nnodes=1 --nproc-per-node=1 /opt/NeMo/examples/asr/transcribe_speech_parallel.py \
...
```

I can run it multiple consecutive times without any issues.
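
The obvious difference between the two launch modes is the environment the launcher prepares. As far as I understand, torchrun exports the rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, WORLD_SIZE) to the worker process, while a bare python3 launch leaves them unset, so the process-group setup in the script takes a different path. A quick, illustrative way to see what torchrun injects (not taken from my runs):

```bash
# Illustrative only: print the distributed/rendezvous variables torchrun exports
# to its workers; with a plain `python3` launch none of these are set.
torchrun --standalone --nnodes=1 --nproc-per-node=1 --no-python \
  env | grep -E 'MASTER_ADDR|MASTER_PORT|RANK|WORLD_SIZE|LOCAL_RANK'
```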

I have tried with standard debug environment variables like

```bash
TORCH_CPP_LOG_LEVEL=INFO
TORCH_DISTRIBUTED_DEBUG=INFO
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=ALL
```

but nothing peculiar pops out.
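
For reference, setting these amounts to exporting them in the interactive shell before the run (or prefixing them on the command line); a minimal sketch:

```bash
# Sketch: export the debug variables in the interactive shell before re-running the script
export TORCH_CPP_LOG_LEVEL=INFO
export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
```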
