[BUG] Batch inference DDP + zero stage 3 = inference code hangs #7128

Open
ShengYun-Peng opened this issue Mar 11, 2025 · 3 comments

Comments

@ShengYun-Peng

I ran the batch inference code with DeepSpeed generation, not the vLLM one. The code hangs when I set zero stage = 3. I created a minimal code snippet for you to reproduce the error.

import os

import torch
import torch.distributed as dist
from transformers import AutoModelForCausalLM, AutoTokenizer

import deepspeed

# Initialize distributed environment
def setup_distributed():
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    return local_rank


def load_model(model_name="facebook/opt-1.3b", local_rank=0):
    # Ensure distributed environment is set up
    if not dist.is_initialized():
        dist.init_process_group(backend="nccl", init_method="env://")

    world_size = dist.get_world_size()  # Number of GPUs available
    torch.cuda.set_device(local_rank)  # Assign each process to a GPU

    print(
        f"Loading model {model_name} on rank {local_rank}, using {world_size} GPUs for model parallelism"
    )

    # Load model and tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # ✅ DeepSpeed Inference config for Model Parallelism
    ds_config = {
        # "replace_with_kernel_inject": False,  # Enables optimized inference kernels
        "tensor_parallel": {"tp_size": 1},  # Tensor-parallel degree (1 = model parallelism disabled)
        "dtype": "bf16"
        if torch.cuda.is_bf16_supported()
        else "fp16",  # Automatic dtype selection
    }

    # ✅ Initialize DeepSpeed for Model Parallel Inference
    model = deepspeed.init_inference(model, config=ds_config)

    return model, tokenizer


# Perform inference with data parallelism
def batch_inference(model, tokenizer, prompts, local_rank):
    inputs = tokenizer(prompts, return_tensors="pt", padding=True, truncation=True).to(
        f"cuda:{local_rank}"
    )
    with torch.no_grad():
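        # synced_gpus=True keeps all ranks stepping through generate() together (relevant when weights are sharded across GPUs)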
        outputs = model.generate(**inputs, max_length=150, synced_gpus=True)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


def main():
    local_rank = setup_distributed()
    model, tokenizer = load_model(local_rank=local_rank)

    # Each GPU gets a different batch
    global_batch = [
        [
            "What is AI?",
            "Explain deep learning.",
        ],  # Batch for GPU 0
        [
            "Tell me a joke.",
            "What is reinforcement learning? Tell me all the details",
        ],  # Batch for GPU 1
    ]
    prompts = global_batch[local_rank] if local_rank < len(global_batch) else []

    print(f"GPU {local_rank} prompts:", prompts)
    # Perform batch inference
    results = batch_inference(model, tokenizer, prompts, local_rank)
    print(f"GPU {local_rank} results:", results)

    dist.barrier()  # Ensure all GPUs finish


if __name__ == "__main__":
    main()

Run the code with

NCCL_DEBUG=INFO NCCL_BLOCKING_WAIT=1 NCCL_ASYNC_ERROR_HANDLING=1 deepspeed --num_gpus 2 test_deepspeed.py

The code runs without error in this form because with tp_size=1 it is pure DDP.
Now, if we change "tensor_parallel": {"tp_size": 1} to "tensor_parallel": {"tp_size": 2} and rerun, the code hangs forever. Note that the bug happens when DDP + TP are enabled together.
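
For reference, the only change between the working run and the hanging run is the tp_size value in the inference config (a minimal sketch of the modified lines; everything else in the script stays the same):

ds_config = {
    "tensor_parallel": {"tp_size": 2},  # was 1 in the working run; tp_size=2 triggers the hang
    "dtype": "bf16" if torch.cuda.is_bf16_supported() else "fp16",
}
model = deepspeed.init_inference(model, config=ds_config)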

@ShengYun-Peng ShengYun-Peng changed the title Batch inference DDP + zero stage 3 = inference code hangs [BUG] Batch inference DDP + zero stage 3 = inference code hangs Mar 11, 2025
@loadams
Collaborator

loadams commented Mar 14, 2025

@ShengYun-Peng can you share the system you're working on, the transformers and deepspeed versions, and the full error message please?

@ShengYun-Peng
Author

@ShengYun-Peng can you share the system you're working on, the transformers and deepspeed versions, and the full error message please?

Thanks! Here is the full error message if I run the code with {"tp_size": 2}

[2025-03-19 09:55:21,518] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-19 09:55:24,072] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2025-03-19 09:55:24,073] [INFO] [runner.py:607:main] cmd = /home/<my user name>/anaconda3/envs/llm/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None test_deepspeed.py
[2025-03-19 09:55:26,132] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-19 09:55:28,724] [INFO] [launch.py:139:main] 0 NCCL_ASYNC_ERROR_HANDLING=1
[2025-03-19 09:55:28,724] [INFO] [launch.py:139:main] 0 NCCL_DEBUG=INFO
[2025-03-19 09:55:28,724] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2025-03-19 09:55:28,724] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2025-03-19 09:55:28,724] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2025-03-19 09:55:28,724] [INFO] [launch.py:164:main] dist_world_size=2
[2025-03-19 09:55:28,724] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2025-03-19 09:55:28,726] [INFO] [launch.py:256:main] process 3704567 spawned with command: ['/home/<my user name>/anaconda3/envs/llm/bin/python', '-u', 'test_deepspeed.py', '--local_rank=0']
[2025-03-19 09:55:28,727] [INFO] [launch.py:256:main] process 3704568 spawned with command: ['/home/<my user name>/anaconda3/envs/llm/bin/python', '-u', 'test_deepspeed.py', '--local_rank=1']
[2025-03-19 09:55:32,570] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-19 09:55:32,655] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[W319 09:55:34.249597544 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W319 09:55:34.257145620 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
Loading model facebook/opt-1.3b on rank 0, using 2 GPUs for model parallelism
Loading model facebook/opt-1.3b on rank 1, using 2 GPUs for model parallelism
[2025-03-19 09:55:35,121] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown
[2025-03-19 09:55:35,122] [INFO] [logging.py:128:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2025-03-19 09:55:35,392] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown
[2025-03-19 09:55:35,393] [INFO] [logging.py:128:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2025-03-19 09:55:35,875] [INFO] [comm.py:652:init_distributed] cdb=None
AutoTP:  [(<class 'transformers.models.opt.modeling_opt.OPTDecoderLayer'>, ['.fc2', 'self_attn.out_proj'])]
[2025-03-19 09:55:36,294] [INFO] [comm.py:652:init_distributed] cdb=None
AutoTP:  [(<class 'transformers.models.opt.modeling_opt.OPTDecoderLayer'>, ['.fc2', 'self_attn.out_proj'])]
cosmo:3704567:3704567 [0] NCCL INFO Bootstrap : Using enp226s0:130.207.126.157<0>
cosmo:3704567:3704567 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
cosmo:3704567:3704567 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
cosmo:3704567:3704567 [0] NCCL INFO NET/Plugin: Using internal network plugin.
cosmo:3704567:3704567 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
cosmo:3704568:3704568 [1] NCCL INFO cudaDriverVersion 12040
cosmo:3704568:3704568 [1] NCCL INFO Bootstrap : Using enp226s0:130.207.126.157<0>
cosmo:3704568:3704568 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
cosmo:3704568:3704568 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
cosmo:3704568:3704568 [1] NCCL INFO NET/Plugin: Using internal network plugin.
cosmo:3704567:3705042 [0] NCCL INFO NET/IB : No device found.
cosmo:3704567:3705042 [0] NCCL INFO NET/Socket : Using [0]enp226s0:130.207.126.157<0>
cosmo:3704567:3705042 [0] NCCL INFO Using non-device net plugin version 0
cosmo:3704567:3705042 [0] NCCL INFO Using network Socket
cosmo:3704568:3705043 [1] NCCL INFO NET/IB : No device found.
cosmo:3704568:3705043 [1] NCCL INFO NET/Socket : Using [0]enp226s0:130.207.126.157<0>
cosmo:3704568:3705043 [1] NCCL INFO Using non-device net plugin version 0
cosmo:3704568:3705043 [1] NCCL INFO Using network Socket
cosmo:3704568:3705043 [1] NCCL INFO ncclCommInitRank comm 0xa55ead0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId f000 commId 0xea5c4e2426f0b236 - Init START
cosmo:3704567:3705042 [0] NCCL INFO ncclCommInitRank comm 0x8df0a00 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 7000 commId 0xea5c4e2426f0b236 - Init START
cosmo:3704568:3705043 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
cosmo:3704567:3705042 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
cosmo:3704567:3705042 [0] NCCL INFO comm 0x8df0a00 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
cosmo:3704568:3705043 [1] NCCL INFO comm 0xa55ead0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
cosmo:3704567:3705042 [0] NCCL INFO Channel 00/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 01/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 02/24 :    0   1
cosmo:3704568:3705043 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
cosmo:3704567:3705042 [0] NCCL INFO Channel 03/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 04/24 :    0   1
cosmo:3704568:3705043 [1] NCCL INFO P2P Chunksize set to 524288
cosmo:3704567:3705042 [0] NCCL INFO Channel 05/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 06/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 07/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 08/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 09/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 10/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 11/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 12/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 13/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 14/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 15/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 16/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 17/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 18/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 19/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 20/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 21/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 22/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Channel 23/24 :    0   1
cosmo:3704567:3705042 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
cosmo:3704567:3705042 [0] NCCL INFO P2P Chunksize set to 524288
cosmo:3704568:3705043 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704567:3705042 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705043 [1] NCCL INFO Connected all rings
cosmo:3704567:3705042 [0] NCCL INFO Connected all rings
cosmo:3704568:3705043 [1] NCCL INFO Connected all trees
cosmo:3704568:3705043 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
cosmo:3704567:3705042 [0] NCCL INFO Connected all trees
cosmo:3704568:3705043 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
cosmo:3704567:3705042 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
cosmo:3704567:3705042 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
cosmo:3704568:3705043 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
cosmo:3704567:3705042 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
cosmo:3704568:3705043 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
cosmo:3704567:3705042 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
cosmo:3704568:3705043 [1] NCCL INFO ncclCommInitRank comm 0xa55ead0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId f000 commId 0xea5c4e2426f0b236 - Init COMPLETE
cosmo:3704567:3705042 [0] NCCL INFO ncclCommInitRank comm 0x8df0a00 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 7000 commId 0xea5c4e2426f0b236 - Init COMPLETE
GPU 1 prompts: ['Tell me a joke.', 'What is reinforcement learning? Tell me all the details']
GPU 0 prompts: ['What is AI?', 'Explain deep learning.']
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
cosmo:3704567:3705577 [0] NCCL INFO Using non-device net plugin version 0
cosmo:3704568:3705578 [1] NCCL INFO Using non-device net plugin version 0
cosmo:3704567:3705577 [0] NCCL INFO Using network Socket
cosmo:3704568:3705578 [1] NCCL INFO Using network Socket
cosmo:3704567:3705577 [0] NCCL INFO bootstrapSplit: comm 0x10772970 parent 0x8df0a00 rank 0 nranks 2 color 1530306504 key 0 prev 1 next 1 - DONE
cosmo:3704568:3705578 [1] NCCL INFO bootstrapSplit: comm 0x226b1c70 parent 0xa55ead0 rank 1 nranks 2 color 1530306504 key 1 prev 0 next 0 - DONE
cosmo:3704568:3705578 [1] NCCL INFO ncclCommSplit comm 0x226b1c70 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId f000 parent 0xa55ead0 color 1530306504 key 1 commId 0xa421cd9c71d9bccd - Init START
cosmo:3704567:3705577 [0] NCCL INFO ncclCommSplit comm 0x10772970 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 7000 parent 0x8df0a00 color 1530306504 key 0 commId 0xa421cd9c71d9bccd - Init START
cosmo:3704567:3705577 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
cosmo:3704568:3705578 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
cosmo:3704568:3705578 [1] NCCL INFO comm 0x226b1c70 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
cosmo:3704567:3705577 [0] NCCL INFO comm 0x10772970 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
cosmo:3704568:3705578 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
cosmo:3704567:3705577 [0] NCCL INFO Channel 00/24 :    0   1
cosmo:3704568:3705578 [1] NCCL INFO P2P Chunksize set to 524288
cosmo:3704567:3705577 [0] NCCL INFO Channel 01/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 02/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 03/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 04/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 05/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 06/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 07/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 08/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 09/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 10/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 11/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 12/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 13/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 14/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 15/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 16/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 17/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 18/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 19/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 20/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 21/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 22/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Channel 23/24 :    0   1
cosmo:3704567:3705577 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
cosmo:3704567:3705577 [0] NCCL INFO P2P Chunksize set to 524288
cosmo:3704568:3705578 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704567:3705577 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3704568:3705578 [1] NCCL INFO Connected all rings
cosmo:3704567:3705577 [0] NCCL INFO Connected all rings
cosmo:3704567:3705577 [0] NCCL INFO Connected all trees
cosmo:3704568:3705578 [1] NCCL INFO Connected all trees
cosmo:3704568:3705578 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
cosmo:3704568:3705578 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
cosmo:3704567:3705577 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
cosmo:3704567:3705577 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
cosmo:3704567:3705577 [0] NCCL INFO ncclCommSplit comm 0x10772970 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 7000 parent 0x8df0a00 color 1530306504 key 0 commId 0xa421cd9c71d9bccd - Init COMPLETE
cosmo:3704568:3705578 [1] NCCL INFO ncclCommSplit comm 0x226b1c70 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId f000 parent 0xa55ead0 color 1530306504 key 1 commId 0xa421cd9c71d9bccd - Init COMPLETE

Basically the code hangs after printing Init COMPLETE. For comparison, below is the log from running the same code with tp_size=1:

[2025-03-19 10:01:24,600] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-19 10:01:27,152] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2025-03-19 10:01:27,152] [INFO] [runner.py:607:main] cmd = /home/<my user name>/anaconda3/envs/llm/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None test_deepspeed.py
[2025-03-19 10:01:29,285] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-19 10:01:31,860] [INFO] [launch.py:139:main] 0 NCCL_ASYNC_ERROR_HANDLING=1
[2025-03-19 10:01:31,860] [INFO] [launch.py:139:main] 0 NCCL_DEBUG=INFO
[2025-03-19 10:01:31,860] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2025-03-19 10:01:31,860] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2025-03-19 10:01:31,860] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2025-03-19 10:01:31,860] [INFO] [launch.py:164:main] dist_world_size=2
[2025-03-19 10:01:31,860] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2025-03-19 10:01:31,862] [INFO] [launch.py:256:main] process 3711589 spawned with command: ['/home/<my user name>/anaconda3/envs/llm/bin/python', '-u', 'test_deepspeed.py', '--local_rank=0']
[2025-03-19 10:01:31,863] [INFO] [launch.py:256:main] process 3711590 spawned with command: ['/home/<my user name>/anaconda3/envs/llm/bin/python', '-u', 'test_deepspeed.py', '--local_rank=1']
[2025-03-19 10:01:35,736] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-03-19 10:01:35,799] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[W319 10:01:37.240582657 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
[W319 10:01:37.245839433 Utils.hpp:135] Warning: Environment variable NCCL_ASYNC_ERROR_HANDLING is deprecated; use TORCH_NCCL_ASYNC_ERROR_HANDLING instead (function operator())
Loading model facebook/opt-1.3b on rank 0, using 2 GPUs for model parallelism
Loading model facebook/opt-1.3b on rank 1, using 2 GPUs for model parallelism
[2025-03-19 10:01:38,072] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown
[2025-03-19 10:01:38,073] [INFO] [logging.py:128:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2025-03-19 10:01:38,303] [INFO] [logging.py:128:log_dist] [Rank -1] DeepSpeed info: version=0.16.3, git-hash=unknown, git-branch=unknown
[2025-03-19 10:01:38,305] [INFO] [logging.py:128:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
GPU 0 prompts: ['What is AI?', 'Explain deep learning.']
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
GPU 1 prompts: ['Tell me a joke.', 'What is reinforcement learning? Tell me all the details']
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
cosmo:3711589:3711589 [0] NCCL INFO Bootstrap : Using enp226s0:130.207.126.157<0>
cosmo:3711589:3711589 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
cosmo:3711589:3711589 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
cosmo:3711589:3711589 [0] NCCL INFO NET/Plugin: Using internal network plugin.
cosmo:3711589:3711589 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
cosmo:3711590:3711590 [1] NCCL INFO cudaDriverVersion 12040
cosmo:3711590:3711590 [1] NCCL INFO Bootstrap : Using enp226s0:130.207.126.157<0>
cosmo:3711590:3711590 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
cosmo:3711590:3711590 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
cosmo:3711590:3711590 [1] NCCL INFO NET/Plugin: Using internal network plugin.
cosmo:3711589:3712583 [0] NCCL INFO NET/IB : No device found.
cosmo:3711589:3712583 [0] NCCL INFO NET/Socket : Using [0]enp226s0:130.207.126.157<0>
cosmo:3711589:3712583 [0] NCCL INFO Using non-device net plugin version 0
cosmo:3711589:3712583 [0] NCCL INFO Using network Socket
cosmo:3711590:3712584 [1] NCCL INFO NET/IB : No device found.
cosmo:3711590:3712584 [1] NCCL INFO NET/Socket : Using [0]enp226s0:130.207.126.157<0>
cosmo:3711590:3712584 [1] NCCL INFO Using non-device net plugin version 0
cosmo:3711590:3712584 [1] NCCL INFO Using network Socket
cosmo:3711590:3712584 [1] NCCL INFO ncclCommInitRank comm 0x117b0300 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId f000 commId 0xc2c4503b884a437a - Init START
cosmo:3711589:3712583 [0] NCCL INFO ncclCommInitRank comm 0x101d0870 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 7000 commId 0xc2c4503b884a437a - Init START
cosmo:3711590:3712584 [1] NCCL INFO Setting affinity for GPU 1 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
cosmo:3711589:3712583 [0] NCCL INFO Setting affinity for GPU 0 to ffff0000,00000000,00000000,00000000,ffff0000,00000000
cosmo:3711590:3712584 [1] NCCL INFO comm 0x117b0300 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
cosmo:3711589:3712583 [0] NCCL INFO comm 0x101d0870 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
cosmo:3711589:3712583 [0] NCCL INFO Channel 00/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 01/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 02/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 03/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 04/24 :    0   1
cosmo:3711590:3712584 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1 [8] 0/-1/-1->1->-1 [9] 0/-1/-1->1->-1 [10] 0/-1/-1->1->-1 [11] 0/-1/-1->1->-1 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] 0/-1/-1->1->-1 [19] 0/-1/-1->1->-1 [20] 0/-1/-1->1->-1 [21] 0/-1/-1->1->-1 [22] 0/-1/-1->1->-1 [23] 0/-1/-1->1->-1
cosmo:3711589:3712583 [0] NCCL INFO Channel 05/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 06/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 07/24 :    0   1
cosmo:3711590:3712584 [1] NCCL INFO P2P Chunksize set to 524288
cosmo:3711589:3712583 [0] NCCL INFO Channel 08/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 09/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 10/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 11/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 12/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 13/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 14/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 15/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 16/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 17/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 18/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 19/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 20/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 21/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 22/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Channel 23/24 :    0   1
cosmo:3711589:3712583 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1 [8] -1/-1/-1->0->1 [9] -1/-1/-1->0->1 [10] -1/-1/-1->0->1 [11] -1/-1/-1->0->1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] -1/-1/-1->0->1 [19] -1/-1/-1->0->1 [20] -1/-1/-1->0->1 [21] -1/-1/-1->0->1 [22] -1/-1/-1->0->1 [23] -1/-1/-1->0->1
cosmo:3711589:3712583 [0] NCCL INFO P2P Chunksize set to 524288
cosmo:3711590:3712584 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 16/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 17/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 18/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 19/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 20/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 21/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 22/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM/read
cosmo:3711590:3712584 [1] NCCL INFO Channel 23/0 : 1[1] -> 0[0] via P2P/CUMEM/read
cosmo:3711589:3712583 [0] NCCL INFO Connected all rings
cosmo:3711589:3712583 [0] NCCL INFO Connected all trees
cosmo:3711590:3712584 [1] NCCL INFO Connected all rings
cosmo:3711590:3712584 [1] NCCL INFO Connected all trees
cosmo:3711589:3712583 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
cosmo:3711589:3712583 [0] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
cosmo:3711590:3712584 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
cosmo:3711590:3712584 [1] NCCL INFO 24 coll channels, 24 collnet channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
cosmo:3711590:3712584 [1] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
cosmo:3711590:3712584 [1] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
cosmo:3711590:3712584 [1] NCCL INFO ncclCommInitRank comm 0x117b0300 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId f000 commId 0xc2c4503b884a437a - Init COMPLETE
cosmo:3711589:3712583 [0] NCCL INFO TUNER/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-tuner.so
cosmo:3711589:3712583 [0] NCCL INFO TUNER/Plugin: Using internal tuner plugin.
cosmo:3711589:3712583 [0] NCCL INFO ncclCommInitRank comm 0x101d0870 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 7000 commId 0xc2c4503b884a437a - Init COMPLETE
GPU 0 results: ['What is AI?\n\nArtificial Intelligence is a branch of computer science that deals with the study of intelligent machines. It is a branch of computer science that deals with the study of intelligent machines.\n\nArtificial Intelligence is a branch of computer science that deals with the study of intelligent machines. It is a branch of computer science that deals with the study of intelligent machines.\n\nWhat is AI?\n\nArtificial Intelligence is a branch of computer science that deals with the study of intelligent machines. It is a branch of computer science that deals with the study of intelligent machines.\n\nWhat is AI?\n\nArtificial Intelligence is a branch of computer science that deals with the study of intelligent machines. It is a branch', 'Explain deep learning.\n\nDeep learning is a branch of computer science that uses artificial neural networks to learn from data.\n\nThe most common type of deep learning is called reinforcement learning.\n\nIt uses a computer to learn from data.\n\nThe computer learns by observing the data and then making decisions based on the data.\n\nThe computer can learn from data that is not labeled.\n\nThe computer can learn from data that is labeled incorrectly.\n\nThe computer can learn from data that is not labeled at all.\n\nThe computer can learn from data that is not labeled at all.\n\nThe computer can learn from data that is not labeled at all.\n\nThe computer can learn from data that is']
GPU 1 results: ["Tell me a joke.\nWhat's the difference between a man and a bag of chips?                                                                                                                             ", 'What is reinforcement learning? Tell me all the details!\n\nReinforcement learning is a form of artificial intelligence that uses machine learning to learn from past experiences. It is a form of artificial intelligence that uses machine learning to learn from past experiences.\n\nReinforcement learning is a form of artificial intelligence that uses machine learning to learn from past experiences. It is a form of artificial intelligence that uses machine learning to learn from past experiences.\n\nReinforcement learning is a form of artificial intelligence that uses machine learning to learn from past experiences. It is a form of artificial intelligence that uses machine learning to learn from past experiences.\n\nReinforcement learning is a form of artificial intelligence that uses machine learning to learn from past experiences']
cosmo:3711590:3712592 [1] NCCL INFO [Service thread] Connection closed by localRank 1
[rank0]:[W319 10:01:43.320007968 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
cosmo:3711589:3712594 [0] NCCL INFO [Service thread] Connection closed by localRank 0
cosmo:3711590:3712609 [1] NCCL INFO comm 0x117b0300 rank 1 nranks 2 cudaDev 1 busId f000 - Abort COMPLETE
cosmo:3711589:3712610 [0] NCCL INFO comm 0x101d0870 rank 0 nranks 2 cudaDev 0 busId 7000 - Abort COMPLETE
[2025-03-19 10:01:44,876] [INFO] [launch.py:351:main] Process 3711590 exits successfully.
[2025-03-19 10:01:45,878] [INFO] [launch.py:351:main] Process 3711589 exits successfully.

I'm running the code on a DGX A100 with deepspeed==0.16.3, transformers==4.48.3, torch==2.5.1.

@ShengYun-Peng
Author

ShengYun-Peng commented Mar 19, 2025

Can this code run on your side? Why do I get the following error when I run it: [rank1]: raise ValueError( [rank1]: ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).

Thanks! The opt-1.3b model explicitly defines a pad_token in its tokenizer_config (link), so there's no need to manually set tokenizer.pad_token = tokenizer.eos_token. I printed tokenizer.pad_token in the code above, and it returned <pad>. Perhaps we're using different versions of transformers, so adding tokenizer.pad_token = tokenizer.eos_token to your code might work around the issue.
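
For anyone hitting the pad-token error above, a minimal sketch of the suggested workaround (only needed if your tokenizer version ships without a pad token):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
if tokenizer.pad_token is None:
    # Fall back to the EOS token for padding, as suggested above
    tokenizer.pad_token = tokenizer.eos_token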
