Hi all,
I am trying to deploy DeepSeek-V3 on two H100 nodes, but the deployment gets stuck at a certain point. I have already set the required environment variables, including:
NCCL_SOCKET_IFNAME
NCCL_IB_GID_INDEX=3
When I enabled NCCL debug logging via the NCCL_DEBUG environment variable, the last lines of the output were:
[Node-A]:[PID]:[Thread] [0] NCCL WARN NET/IB : Got completion from peer [PRIVATE_IP] with error 12, opcode 0, len 0, vendor err 129 (Recv)
localGid ::ffff:[MASKED_IP] remoteGid ::ffff:[MASKED_IP]
[Node-A]:[PID]:[Thread] [0] NCCL INFO transport/net.cc:1302 -> 6
[Node-A]:[PID]:[Thread] [0] NCCL INFO proxy.cc:698 -> 6
[Node-A]:[PID]:[Thread] [0] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]
[Node-A]:[PID]:[Thread] [0] NCCL WARN NET/IB : Got completion from peer [PRIVATE_IP] with error 5, opcode 29989, len 29993, vendor err 244 (Recv)
localGid ::ffff:[MASKED_IP] remoteGid ::ffff:[MASKED_IP]
[Node-A]:[PID]:[Thread] [0] NCCL INFO transport/net.cc:1302 -> 6
[Node-A]:[PID]:[Thread] [0] NCCL INFO proxy.cc:698 -> 6
[Node-A]:[PID]:[Thread] [0] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]
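To make the two WARN lines easier to read, here is a minimal sketch that pulls out the numeric fields and maps the "error N" value to a name. It assumes that NCCL prints the raw libibverbs work-completion status (`ibv_wc_status`) in that position, which I believe is the case but have not confirmed for 2.21.5. The function name `decode_completion_errors` and the regular expression are my own, not part of NCCL, and the table only covers a few status values from `infiniband/verbs.h`.

```python
import re

# Subset of the ibv_wc_status enum from libibverbs (infiniband/verbs.h).
# Assumption: the "error N" field in the NCCL warning is this status code.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    5: "IBV_WC_WR_FLUSH_ERR",        # work request flushed after the QP entered the error state
    12: "IBV_WC_RETRY_EXC_ERR",      # transport retry counter exceeded (peer did not respond)
    13: "IBV_WC_RNR_RETRY_EXC_ERR",  # receiver-not-ready retry counter exceeded
}

# Hypothetical pattern matching the warning format shown in the log above.
WARN_RE = re.compile(
    r"Got completion from peer (\S+) with error (\d+), "
    r"opcode (\d+), len (\d+), vendor err (\d+)"
)


def decode_completion_errors(log_text):
    """Return one dict per NET/IB completion warning found in log_text."""
    results = []
    for match in WARN_RE.finditer(log_text):
        status = int(match.group(2))
        results.append(
            {
                "peer": match.group(1),
                "status": status,
                "name": IBV_WC_STATUS.get(status, "UNKNOWN"),
                "vendor_err": int(match.group(5)),
            }
        )
    return results


if __name__ == "__main__":
    sample = (
        "NCCL WARN NET/IB : Got completion from peer [PRIVATE_IP] "
        "with error 12, opcode 0, len 0, vendor err 129 (Recv)\n"
        "NCCL WARN NET/IB : Got completion from peer [PRIVATE_IP] "
        "with error 5, opcode 29989, len 29993, vendor err 244 (Recv)\n"
    )
    for entry in decode_completion_errors(sample):
        print(entry)
```

If that assumption holds, the first warning (error 12) would be a retry-exceeded error and the second (error 5) a flush that follows once the queue pair is already in the error state, so the first one is probably the one to investigate.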
I am using NCCL version 2.21.5.
Has anyone encountered a similar issue, or does anyone have insights on resolving this? Any guidance would be appreciated!