Describe the bug
I'm trying to combine DDP with a trained MoE model built with DeepSpeed MoE. I set ep_size to 1 and there is no tensor parallelism. To enable DDP, I launch with the deepspeed launcher and initialize the process groups, then slice the data manually across ranks, so each rank holds an identical copy of the model and processes a different data batch.
However, the all-to-all communication inside the MoE layer hangs even though ep_size is 1. My goal is to run the processes independently so I can do inference in parallel: the data is distributed across ranks, and I gather the outputs after all processes finish their generation.
For what it's worth, inference runs fine when I use just one GPU.
To Reproduce
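A rough, simplified sketch of this kind of setup (the toy model, hidden size, and random inputs below are placeholders for the real trained checkpoint; the launch / per-rank slicing / gather pattern is the part that matters). Launched with something like `deepspeed --num_gpus=2 repro.py`:

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
import deepspeed
from deepspeed.moe.layer import MoE

deepspeed.init_distributed()
rank = dist.get_rank()
world_size = dist.get_world_size()
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

HIDDEN = 128  # placeholder hidden size


class ToyMoEModel(nn.Module):
    def __init__(self):
        super().__init__()
        expert = nn.Sequential(
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, HIDDEN)
        )
        # ep_size=1: every rank holds all experts, so no cross-rank
        # expert parallelism should be needed during the forward pass.
        self.moe = MoE(hidden_size=HIDDEN, expert=expert, num_experts=4, ep_size=1, k=1)

    def forward(self, x):
        out, _aux_loss, _exp_counts = self.moe(x)
        return out


model = ToyMoEModel().to(device).eval()

# The full dataset is sliced manually so each rank sees a different batch.
samples_per_rank = 8
full_data = torch.randn(samples_per_rank * world_size, 16, HIDDEN)
local_data = full_data[rank * samples_per_rank:(rank + 1) * samples_per_rank].to(device)

with torch.no_grad():
    # This is where the hang is observed (inside the MoE all-to-all)
    # when running with more than one process.
    local_out = model(local_data)

# Gather every rank's outputs after all processes finish.
gathered = [None] * world_size
dist.all_gather_object(gathered, local_out.cpu())
if rank == 0:
    print([t.shape for t in gathered])
```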
Expected behavior
I expect the processes not to hang: each rank should be able to run inference on its own data slice without waiting on the others.
System info (please complete the following information):