Hi all,
I am trying to deploy DeepSeek-V3 on two H100 nodes, but the deployment gets stuck at a certain point. I have already set the required environment variables, including:
NCCL_SOCKET_IFNAME
NCCL_IB_GID_INDEX=3
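To confirm that both nodes actually see the same settings, a small check like the one below could be run inside the launched process on each node. This is only a sketch: the helper name `nccl_env` is made up for illustration, and it simply lists whatever `NCCL_*` variables are present in the environment.

```python
import os


def nccl_env(environ=None):
    """Return all NCCL_* variables visible to this process, sorted by name.

    `environ` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if environ is None else environ
    return {name: env[name] for name in sorted(env) if name.startswith("NCCL_")}


if __name__ == "__main__":
    # Print in NAME=value form so the output of two nodes can be diffed.
    for name, value in nccl_env().items():
        print(f"{name}={value}")
```

Comparing the output from the two nodes should show whether a variable is missing or differs on one side.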
When I enabled the NCCL_DEBUG environment variable to investigate, the last lines of the log were:
[Node-A]:[PID]:[Thread] [0] NCCL WARN NET/IB : Got completion from peer [PRIVATE_IP] with error 12, opcode 0, len 0, vendor err 129 (Recv)
localGid ::ffff:[MASKED_IP] remoteGid ::ffff:[MASKED_IP]
[Node-A]:[PID]:[Thread] [0] NCCL INFO transport/net.cc:1302 -> 6
[Node-A]:[PID]:[Thread] [0] NCCL INFO proxy.cc:698 -> 6
[Node-A]:[PID]:[Thread] [0] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]
[Node-A]:[PID]:[Thread] [0] NCCL WARN NET/IB : Got completion from peer [PRIVATE_IP] with error 5, opcode 29989, len 29993, vendor err 244 (Recv)
localGid ::ffff:[MASKED_IP] remoteGid ::ffff:[MASKED_IP]
[Node-A]:[PID]:[Thread] [0] NCCL INFO transport/net.cc:1302 -> 6
[Node-A]:[PID]:[Thread] [0] NCCL INFO proxy.cc:698 -> 6
[Node-A]:[PID]:[Thread] [0] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]
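For reference, here is a rough sketch of how I am reading the two warnings. It assumes the "error N" field is the `ibv_wc_status` value from libibverbs (I have not confirmed this against the NCCL source). The function name `decode_ib_warning` and the regular expression are my own, and the table only covers a subset of the enum.

```python
import re

# Subset of `enum ibv_wc_status` from libibverbs (infiniband/verbs.h).
# Assumption: the "error N" field in the NCCL warning is this status code.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",        # work request flushed after the QP entered the error state
    12: "IBV_WC_RETRY_EXC_ERR",      # transport retry counter exceeded (peer did not respond)
    13: "IBV_WC_RNR_RETRY_EXC_ERR",  # receiver-not-ready retry counter exceeded
}

# Matches the warning format shown in the log above.
WARN_RE = re.compile(
    r"NCCL WARN NET/IB : Got completion from peer (\S+) "
    r"with error (\d+), opcode (\d+), len (\d+), vendor err (\d+)"
)


def decode_ib_warning(line):
    """Parse one NCCL NET/IB warning line; return None if it does not match."""
    match = WARN_RE.search(line)
    if match is None:
        return None
    peer, status, opcode, length, vendor_err = match.groups()
    code = int(status)
    return {
        "peer": peer,
        "status": code,
        "status_name": IBV_WC_STATUS.get(code, "UNKNOWN"),
        "opcode": int(opcode),
        "len": int(length),
        "vendor_err": int(vendor_err),
    }


if __name__ == "__main__":
    sample = (
        "[Node-A]:[PID]:[Thread] [0] NCCL WARN NET/IB : Got completion from peer "
        "[PRIVATE_IP] with error 12, opcode 0, len 0, vendor err 129 (Recv)"
    )
    print(decode_ib_warning(sample))
```

Under that assumption, the first warning (error 12) would be a retry-exceeded failure and the second (error 5) a flush that follows once the queue pair is already in the error state. The odd opcode and len values in the second warning may simply be unset fields, since the verbs documentation says only `wr_id`, `status`, `qp_num`, and `vendor_err` are valid on a failed completion.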
I am using NCCL version 2.21.5.
Has anyone encountered a similar issue, or does anyone have insights on resolving this? Any guidance would be appreciated!