-
Notifications
You must be signed in to change notification settings - Fork 26
Fix "disconnected: write(m_post_fd, &buffer, 1): Broken pipe" EventLoop shutdown races. #129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -188,29 +188,31 @@ void EventLoop::loop() | |
|
||
kj::Own<kj::AsyncIoStream> wait_stream{ | ||
m_io_context.lowLevelProvider->wrapSocketFd(m_wait_fd, kj::LowLevelAsyncIoProvider::TAKE_OWNERSHIP)}; | ||
int post_fd{m_post_fd}; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why make a copy of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. re: #129 (comment)
This is just saving a copy of the variable while the lock is held. There are three methods (loop, post, removeClient) where As you say, this should never be necessary because the code that changes this variable can only run after the code that reads it, but at least when I started debugging this issue, thread sanitizer complained that this variable was being read and written to without synchronization, so I stopped accessing it without a lock to prevent this. Going forward it might also be clearer to access I wouldn't object to a followup accessing There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Hmm, but the lock is not held when making the copy at line 191?
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. re: #129 (comment)
Oh, that is a good point, I didn't notice that. For consistency it would be good to just acquire the lock whereever the variable is used, even when it not needed, or have some other more consistent rule. I think a good followup would be to add clang annotations and add the lock everywhere it says they are required, and remove it places where it isn't necessary and the annotations don't say it is required. This would add an unnecessary lock here as you are suggesting, and remove an unnecessary lock in the destructor as you have also suggested. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would it be possible to use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. re: #129 (comment)
Definitely possible. I think it wouldn't be a performance difference because in practice whenever m_post_fd is used some state update is being posted, and a lock is needed anyway to update the shared state. But there could be other reasons for preferring an atomic reasons like style reasons. Very possible writing this code in a different style could make it clearer. I think at very least it should be using clang thread safety annotations which would make intent more clear. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, I think |
||
char buffer = 0; | ||
for (;;) { | ||
size_t read_bytes = wait_stream->read(&buffer, 0, 1).wait(m_io_context.waitScope); | ||
if (read_bytes == 1) { | ||
std::unique_lock<std::mutex> lock(m_mutex); | ||
if (m_post_fn) { | ||
Unlock(lock, *m_post_fn); | ||
m_post_fn = nullptr; | ||
} | ||
} else { | ||
throw std::logic_error("EventLoop wait_stream closed unexpectedly"); | ||
} | ||
m_cv.notify_all(); | ||
if (m_num_clients == 0 && m_async_fns.empty()) { | ||
log() << "EventLoop::loop done, cancelling event listeners."; | ||
m_task_set.reset(); | ||
log() << "EventLoop::loop bye."; | ||
if (read_bytes != 1) throw std::logic_error("EventLoop wait_stream closed unexpectedly"); | ||
std::unique_lock<std::mutex> lock(m_mutex); | ||
if (m_post_fn) { | ||
Unlock(lock, *m_post_fn); | ||
m_post_fn = nullptr; | ||
m_cv.notify_all(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Previously There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. re: #129 (comment)
Yep, that's intended. The only reason to notify a condition variable is after updating some shared state, like changing a reference count or adding or removing something from a queue. The notify_all() call here is made right after setting |
||
} else if (done(lock)) { | ||
// Intentionally do not break if m_post_fn was set, even if done() | ||
// would return true, to ensure that the removeClient write(post_fd) | ||
// call always succeeds and the loop does not exit between the time | ||
// that the done condition is set and the write call is made. | ||
break; | ||
} | ||
} | ||
log() << "EventLoop::loop done, cancelling event listeners."; | ||
m_task_set.reset(); | ||
log() << "EventLoop::loop bye."; | ||
wait_stream = nullptr; | ||
KJ_SYSCALL(::close(post_fd)); | ||
std::unique_lock<std::mutex> lock(m_mutex); | ||
m_wait_fd = -1; | ||
KJ_SYSCALL(::close(m_post_fd)); | ||
m_post_fd = -1; | ||
} | ||
|
||
|
@@ -222,9 +224,10 @@ void EventLoop::post(const std::function<void()>& fn) | |
std::unique_lock<std::mutex> lock(m_mutex); | ||
m_cv.wait(lock, [this] { return m_post_fn == nullptr; }); | ||
m_post_fn = &fn; | ||
int post_fd{m_post_fd}; | ||
Unlock(lock, [&] { | ||
char buffer = 0; | ||
KJ_SYSCALL(write(m_post_fd, &buffer, 1)); | ||
KJ_SYSCALL(write(post_fd, &buffer, 1)); | ||
Comment on lines
+227
to
+230
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Am I right that the problem with this code was that the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. re: #129 (comment)
This was actually not the problem. The problem wasn't that the code was trying to The cause of all these problems (theoretical and actual) was that There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, so there is another problem in
Right? Or do I miss some high level context here (I am new to the internals of this library)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. re: #129 (comment)
Yes, the high level context that prevents what you described from happening is the addClient/removeClient reference counting. If you call post() without calling addClient(), the race condition you described will be present, but that is a higher level bug. The eventloop code is assuming that addClient and removeClient are being used whenever post() is used. The bug being fixed in this PR is a lower-level bug that only happens during shutdown when the reference count is 0 and at the very last second, right very last removeClient call in the program is about to call The cause of the spurious writes is that previously removeClient was only checking the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, conclude this. I have another question. Posting here because it is related: This pattern, used in
looks odd to me. Assuming the usual semantics that the mutex is protecting the variable - it is not allowed to change while holding the mutex and can change when not holding the mutex. This means that at the time of 4. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. re: #129 (comment)
Not exactly. I think you keep basically asking the question again and again "Why is m_post_fd accessed with a lock? It doesn't need to be locked." And sorry I haven't given a more direct answer. It just has to do with behavior of thread sanitizer with earlier versions of this code. When I first enabled thread sanitizer to debug this issue, I fixed immediate tsan errors that happened running the code by restructuring the loop in the way it is currently written, but without the post_fd variable. But then when I ran the test in a loop, after around 20-30 minutes eventually there were tsan errors complaining both about the I did not feel any need to remove these variables because I was thinking of the m_post_fd, m_num_clients, and m_async_fns as a unit and wanted to follow a simple rule of using a single lock to protect them. You're completely right though that a lock is probably not needed to access m_post_fd, although I still think clang thread annotation might complain if it is accessed without a lock outside of the destructor, and I am not 100% sure that tsan would not complain if I removed the post_fd variables, but I am 99% sure so that it would be ok, so that change could definitely be a followup. Thanks for asking this and feel free to ask about anything else that isn't clear or could be improved. |
||
}); | ||
m_cv.wait(lock, [this, &fn] { return m_post_fn != &fn; }); | ||
} | ||
|
@@ -233,13 +236,13 @@ void EventLoop::addClient(std::unique_lock<std::mutex>& lock) { m_num_clients += | |
|
||
void EventLoop::removeClient(std::unique_lock<std::mutex>& lock) | ||
{ | ||
assert(m_num_clients > 0); | ||
m_num_clients -= 1; | ||
if (m_num_clients == 0) { | ||
if (done(lock)) { | ||
m_cv.notify_all(); | ||
int post_fd{m_post_fd}; | ||
Unlock(lock, [&] { | ||
char buffer = 0; | ||
KJ_SYSCALL(write(m_post_fd, &buffer, 1)); // NOLINT(bugprone-suspicious-semicolon) | ||
KJ_SYSCALL(write(post_fd, &buffer, 1)); // NOLINT(bugprone-suspicious-semicolon) | ||
}); | ||
} | ||
} | ||
|
@@ -268,6 +271,14 @@ void EventLoop::startAsyncThread(std::unique_lock<std::mutex>& lock) | |
} | ||
} | ||
|
||
bool EventLoop::done(std::unique_lock<std::mutex>& lock) | ||
{ | ||
assert(m_num_clients >= 0); | ||
assert(lock.owns_lock()); | ||
assert(lock.mutex() == &m_mutex); | ||
return m_num_clients == 0 && m_async_fns.empty(); | ||
} | ||
|
||
std::tuple<ConnThread, bool> SetThread(ConnThreads& threads, std::mutex& mutex, Connection* connection, std::function<Thread::Client()> make_thread) | ||
{ | ||
std::unique_lock<std::mutex> lock(mutex); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(posting in a random place coz github wouldn't allow me to comment on the destructor)
Not related to this PR, just an observation. The destructor is:
locking
m_mutex
shouldn't be necessary because there cannot be two threads destroying the same object concurrently. Or if there are then this is a serious bug elsewhere that will cause a double free bug after the destructor completes. Also, there cannot be one thread destroying the object while another one is calling a method of that object.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #129 (comment)
I think that is basically true and the lock is probably unnecessary here.
But there might be scenarios where threadsanitizer will throw errors without a lock, if it sees a variable being written here after being read from another thread with no synchronization in between. I am also not sure if clang thread safety annotations would complain without a lock here.
Also to be pedantic though I think "there cannot be one thread destroying the object while another one is calling a method of that object" is not true. The state in the object is still valid while the destructor method is running, and it should be fine for other threads to access it. The destructor method is not different from other methods in this respect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clang thread safety is aware of this:
Generates only a warning for
f()
:I assume the thread sanitizer is the same even though I couldn't demonstrate this easily.
True, but then the problem arises that after the destructor finishes, without further synchronization, the memory that contains the object is freed. So it would be a serious read-after-free bug if another method is being executed while the destructor is being called because it could happen that the destructor finishes and the memory freed while the other method is still executing. Any mutexes within the object locked inside the destructor will be unlocked when the destructor finishes and before the memory is freed, so they cannot be used for synchronization with the other method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: #129 (comment)
Thanks, this is interesting. I'm a little unsure how clang knows that no mutex is required to access the variable in the destructor. It doesn't seem like a totally safe assumption to make because I could easily think of cases where locking the mutex in a destructor would be required, and then clang wouldn't diagnose any problem. But maybe it is just assuming that typical destructors aren't blocking or doing anything complicated, and the mutex is about to be destroyed anyway, and that is enough proof of ownership.
This could also be the case, but I wouldn't assume it. Thread sanitizer (as far as I know) does not treat destructors differently than other methods, and is operating at a lower level just looking at what mutexes are held during reads and write and what synchronization events are happening between reads and writes.
It's definitely possible you are right and dropping locks from this code is perfectly ok. So just for the sake of this PR I don't want to expand it by dropping an already existing lock, but, that could be a good change to make in a followup.
This is all true but it is ok for a destructor to block and wait for other events and other threads. And as long as it is waiting, it is ok for other threads to access the object and call its methods.