Deployment of DeepseekV3 on 2*H100 nodes is getting stuck. #4857

Open
@abhay-tih

Hi all,
I am trying to deploy DeepSeek-V3 on two H100 nodes, but the deployment gets stuck at a certain point. I have already set the required environment variables, including:

  • NCCL_SOCKET_IFNAME
  • NCCL_IB_GID_INDEX=3
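
One thing worth ruling out is that the variables are not actually visible inside the launched process on every node. The following is a minimal sketch of such a check, not part of the original setup; the helper name `missing_nccl_vars` is hypothetical, and the variable list contains only the two names mentioned above.

```python
import os

# Variables named in this report; extend the tuple if more are needed.
REQUIRED_VARS = ("NCCL_SOCKET_IFNAME", "NCCL_IB_GID_INDEX")


def missing_nccl_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to the environment of the current process.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_nccl_vars()
    print("missing:", ", ".join(missing) if missing else "none")
```

Running this on each node, under the same launcher that starts the deployment, shows whether the settings reach the worker processes.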

When I enabled the NCCL_DEBUG environment variable to investigate, the last lines of the log were:

[Node-A]:[PID]:[Thread] [0] NCCL WARN NET/IB : Got completion from peer [PRIVATE_IP] with error 12, opcode 0, len 0, vendor err 129 (Recv)  
localGid ::ffff:[MASKED_IP] remoteGid ::ffff:[MASKED_IP]  

[Node-A]:[PID]:[Thread] [0] NCCL INFO transport/net.cc:1302 -> 6  
[Node-A]:[PID]:[Thread] [0] NCCL INFO proxy.cc:698 -> 6  
[Node-A]:[PID]:[Thread] [0] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]  

[Node-A]:[PID]:[Thread] [0] NCCL WARN NET/IB : Got completion from peer [PRIVATE_IP] with error 5, opcode 29989, len 29993, vendor err 244 (Recv)  
localGid ::ffff:[MASKED_IP] remoteGid ::ffff:[MASKED_IP]  

[Node-A]:[PID]:[Thread] [0] NCCL INFO transport/net.cc:1302 -> 6  
[Node-A]:[PID]:[Thread] [0] NCCL INFO proxy.cc:698 -> 6  
[Node-A]:[PID]:[Thread] [0] NCCL INFO proxy.cc:878 -> 6 [Progress Thread]  
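
To make the two warnings easier to read, here is a minimal sketch that pulls the numeric fields out of the `NET/IB : Got completion` lines and maps the `error` value to a name. It assumes that `error` is the libibverbs work-completion status (`ibv_wc_status`); the function name `decode_completion_errors` and the regular expression are illustrative, not part of NCCL.

```python
import re

# ibv_wc_status values as defined in libibverbs (<infiniband/verbs.h>).
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Matches the warning format seen in the log excerpt above.
PATTERN = re.compile(
    r"Got completion from peer (?P<peer>\S+) with error (?P<status>\d+), "
    r"opcode (?P<opcode>\d+), len (?P<len>\d+), vendor err (?P<vendor>\d+)"
)


def decode_completion_errors(log_text):
    """Return (peer, status, status_name, vendor_err) for each warning found."""
    results = []
    for match in PATTERN.finditer(log_text):
        status = int(match.group("status"))
        name = IBV_WC_STATUS.get(status, "UNKNOWN")
        results.append((match.group("peer"), status, name, int(match.group("vendor"))))
    return results


SAMPLE = """\
NCCL WARN NET/IB : Got completion from peer [PRIVATE_IP] with error 12, opcode 0, len 0, vendor err 129 (Recv)
NCCL WARN NET/IB : Got completion from peer [PRIVATE_IP] with error 5, opcode 29989, len 29993, vendor err 244 (Recv)
"""

for peer, status, name, vendor in decode_completion_errors(SAMPLE):
    print(f"peer={peer} status={status} ({name}) vendor_err={vendor}")
```

Under that assumption, the first warning is a transport retry-count-exceeded error (status 12) and the second is a work-request flush error (status 5), which typically follows once the queue pair has already entered the error state. For a failed completion, the `opcode` and `len` fields are not guaranteed to be valid, so the odd values in the second warning are probably not meaningful on their own.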

I am using NCCL version 2.21.5.

Has anyone encountered a similar issue, or does anyone have insights on resolving this? Any guidance would be appreciated!
