Description
[rank7]:[W624 12:52:07.296197118 ProcessGroupNCCL.cpp:1659] [PG ID 0 PG GUID 0 Rank 7] Failed to check the "should dump" flag on TCPStore, (maybe TCPStore server has shut down too early), with error: Broken pipe
[rank7]:[W624 12:52:08.296327263 TCPStore.cpp:106] [c10d] sendBytes failed on SocketImpl(fd=97, addr=[openmatrix-minimax-m1-80k-prod-hzh20-custom-serving-58c98cpwxbz]:50828, remote=[openmatrix-minimax-m1-80k-prod-hzh20-custom-serving-58c98cpwxbz]:45903): Broken pipe
Exception raised from sendBytes at /pytorch/torch/csrc/distributed/c10d/Utils.hpp:653 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7f4d327785e8 in /opt/venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x5ba8afe (0x7f4d1b3a8afe in /opt/venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #2: <unknown function> + 0x5baa358 (0x7f4d1b3aa358 in /opt/venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x5babb3e (0x7f4d1b3abb3e in /opt/venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&) + 0x298 (0x7f4d1b3a5198 in /opt/venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::ProcessGroupNCCL::heartbeatMonitor() + 0x379 (0x7f4cd83e09a9 in /opt/venv/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #6: <unknown function> + 0xecdb4 (0x7f4cc8196db4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: <unknown function> + 0x9caa4 (0x7f4d3349caa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: <unknown function> + 0x129c3c (0x7f4d33529c3c in /usr/lib/x86_64-linux-gnu/libc.so.6)