bugfix: Do not lock EventLoop::mutex after EventLoop is done

user · user · commit be6ad9f95512 · 2025-02-09T04:19:30.000-08:00
Currently, there are two cases where code may attempt to lock `EventLoop::m_mutex` after the `EventLoop::loop()` method has finished running and the `EventLoop` object has potentially been destroyed. Both cases are caused by uses of the `Unlock()` utility function, which temporarily unlocks a mutex, runs a callback function, and then relocks the mutex.

The first case happens in `EventLoop::removeClient` when the `EventLoop::done()` condition is reached and it calls `Unlock()` in order to be able to call `write(post_fd, ...)` without blocking and wake the EventLoop thread. In this case, since `EventLoop` execution is done, there is no point in using `Unlock()` and relocking the mutex after calling `write()`, so the code is updated to use a simple `lock.unlock()` call, permanently letting go of the lock and not trying to reacquire it. `removeClient` now returns `true` when this happens so that callers know the lock is no longer held.

The second case happens in `EventLoop::post`, where `Unlock()` is used in a similar way. Depending on thread scheduling (if the post thread is delayed for a long time before relocking), relocking `EventLoop::m_mutex` after calling `write()` can fail. In this case, since relocking the mutex is actually necessary for the code that follows, the fix is different: new `addClient`/`removeClient` calls are added around this code, so the `EventLoop::loop()` function will not exit while the `post()` function is waiting.

These two changes are labeled as a bugfix even though there is technically no bug here in the libmultiprocess code or API. The `EventLoop` object itself was perfectly safe before these changes as long as outside code waited for `EventLoop` methods to finish executing before deleting the `EventLoop` object.
Originally the multiprocess code added in bitcoin/bitcoin#10102 did this, but later, as more features were added for binding and connecting to unix sockets, and as unit tests were added which would immediately delete the `EventLoop` object after `EventLoop::loop()` returned, these two methods could start failing due to their uses of `Unlock()`, depending on thread scheduling.

A previous attempt was made to fix this bug in bitcoin/bitcoin#31815, outside of libmultiprocess code. But it only addressed the first case of a failing `Unlock()` in `EventLoop::removeClient`, not the second case in `EventLoop::post`.

The first case described above was not actually observed and is only theoretically possible. The second case happens intermittently on macOS and was reported in bitcoin-core#154.
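
For illustration only, the reference counting idea behind both fixes can be sketched in isolation. Everything below is an assumption made for the example: `MiniLoop` and `postLike` are hypothetical names, the `Unlock()` shown is a simplified stand-in rather than the library's actual helper, and the `write(post_fd, ...)` wakeup is omitted. It is a rough model of the pattern, not the libmultiprocess implementation.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Simplified stand-in for the Unlock() utility: release the lock, run the
// callback, then reacquire the lock before returning.
template <typename Lock, typename Callback>
void Unlock(Lock& lock, Callback&& callback)
{
    lock.unlock();
    callback();
    lock.lock(); // Only safe if the mutex is guaranteed to still exist here.
}

// Hypothetical miniature of the client counting scheme (not the real EventLoop).
struct MiniLoop
{
    std::mutex m_mutex;
    std::condition_variable m_cv;
    int m_num_clients{0};

    void addClient(std::unique_lock<std::mutex>&) { m_num_clients += 1; }

    // Returns true if this call dropped the last reference. In that case the
    // lock is released and NOT reacquired, because loop() may now return and
    // the object may be destroyed by whoever owns it.
    bool removeClient(std::unique_lock<std::mutex>& lock)
    {
        m_num_clients -= 1;
        if (m_num_clients == 0) {
            m_cv.notify_all();
            lock.unlock();
            return true;
        }
        return false;
    }

    // Like EventLoop::loop() in spirit: returns only when no clients remain.
    void loop()
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_num_clients == 0; });
    }
};

// Simulates the post() fix: hold a client reference across the temporary
// unlock so that loop() cannot return before the mutex is relocked.
inline int postLike(MiniLoop& loop)
{
    std::unique_lock<std::mutex> lock(loop.m_mutex);
    loop.addClient(lock);
    int ran = 0;
    Unlock(lock, [&] { ran = 1; });
    loop.removeClient(lock);
    return ran;
}
```

In this sketch, a thread blocked in `loop()` cannot return while `postLike()` is between `addClient` and `removeClient`, so the relock inside `Unlock()` always targets a live mutex. The boolean result of `removeClient` tells the caller whether it still holds the lock.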
diff --git a/include/mp/proxy-io.h b/include/mp/proxy-io.h
@@ -170,7 +170,7 @@ class EventLoop
 
     //! Add/remove remote client reference counts.
     void addClient(std::unique_lock<std::mutex>& lock);
-    void removeClient(std::unique_lock<std::mutex>& lock);
+    bool removeClient(std::unique_lock<std::mutex>& lock);
     //! Check if loop should exit.
     bool done(std::unique_lock<std::mutex>& lock);
 
diff --git a/src/mp/proxy.cpp b/src/mp/proxy.cpp
@@ -225,6 +225,7 @@ void EventLoop::post(const std::function<void()>& fn)
         return;
     }
     std::unique_lock<std::mutex> lock(m_mutex);
+    addClient(lock);
     m_cv.wait(lock, [this] { return m_post_fn == nullptr; });
     m_post_fn = &fn;
     int post_fd{m_post_fd};
@@ -233,21 +234,23 @@ void EventLoop::post(const std::function<void()>& fn)
         KJ_SYSCALL(write(post_fd, &buffer, 1));
     });
     m_cv.wait(lock, [this, &fn] { return m_post_fn != &fn; });
+    removeClient(lock);
 }
 
 void EventLoop::addClient(std::unique_lock<std::mutex>& lock) { m_num_clients += 1; }
 
-void EventLoop::removeClient(std::unique_lock<std::mutex>& lock)
+bool EventLoop::removeClient(std::unique_lock<std::mutex>& lock)
 {
     m_num_clients -= 1;
     if (done(lock)) {
         m_cv.notify_all();
         int post_fd{m_post_fd};
-        Unlock(lock, [&] {
-            char buffer = 0;
-            KJ_SYSCALL(write(post_fd, &buffer, 1)); // NOLINT(bugprone-suspicious-semicolon)
-        });
+        lock.unlock();
+        char buffer = 0;
+        KJ_SYSCALL(write(post_fd, &buffer, 1)); // NOLINT(bugprone-suspicious-semicolon)
+        return true;
     }
+    return false;
 }
 
 void EventLoop::startAsyncThread(std::unique_lock<std::mutex>& lock)
@@ -263,7 +266,7 @@ void EventLoop::startAsyncThread(std::unique_lock<std::mutex>& lock)
                     const std::function<void()> fn = std::move(m_async_fns.front());
                     m_async_fns.pop_front();
                     Unlock(lock, fn);
-                    removeClient(lock);
+                    if (removeClient(lock)) break;
                     continue;
                 } else if (m_num_clients == 0) {
                     break;