Fix: Consensus Worker thread join to prevent shutdown crash #3419
Merged
peilun-conflux merged 1 commit into Conflux-Chain:master on Mar 29, 2026
f5e7deb to fe3d025
peilun-conflux approved these changes on Mar 27, 2026
@peilun-conflux reviewed 12 files and all commit messages.
Reviewable status: complete! All files reviewed, all discussions resolved (waiting on ChenxingLi).
Problem

Conflux nodes intermittently crash with SIGABRT (exit code −6) during shutdown. This has been a known flaky issue since at least December 2021, when commit 5cff42c added a workaround to the test framework that silently ignores the resulting `pthread lock: Invalid argument` stderr output rather than fixing the root cause.

Root cause: a race between C++ static destructors and Rust `Arc` drop.

The shutdown sequence had three independent bugs that combined to create the race:

Bug 1: Consensus Worker thread was detached (no `JoinHandle` retained).
`SynchronizationGraph::new()` spawned the Consensus Worker thread but immediately discarded the `JoinHandle`. The struct had no `Drop` impl and no shutdown method, so the thread ran freely with no lifecycle management.

Bug 2: The thread was self-blocking on the very channel it helped keep alive.
The worker loop blocked on `block_on(consensus_receiver.recv())`. The channel closes only when all senders are dropped, but the thread closure itself captured an `Arc<ConsensusGraph>`, which indirectly held a reference to `Arc<Notifications>` (the sender). The thread was waiting for a signal that its own existence was preventing.

Bug 3: `check_graceful_shutdown` conflated "Arc strong count = 0" with "destructor complete".
The shutdown handler polled `Weak<BlockDataManager>::upgrade()` and declared success the moment it returned `None`. However, Rust's `Arc::drop` is a two-step operation: the strong count is decremented atomically before `drop_slow()` runs. There is no barrier between these steps. The main thread could observe `strong count = 0` and proceed to `exit()` while the Worker thread was still midway through the `rocksdb_close` call chain (buried under multiple layers of nested `Arc` drops).

The crash:
`exit()` invokes C++ static destructors in reverse construction order, which destroys `PeriodicWorkScheduler`'s internal mutex. If the Worker thread reaches `RocksDB::~DBImpl → PeriodicWorkScheduler::Unregister → pthread_mutex_lock` after this point, RocksDB's `PthreadCall` wrapper asserts that the return value is zero. It is not (`EINVAL`), so the wrapper calls `abort()`.

Fix

1. `ConsensusWorkerHandle`: explicit thread join via `unsubscribe`.
Introduce a `ConsensusWorkerHandle` struct that holds the thread's `JoinHandle` and is stored in `SynchronizationGraph`. Its `Drop` impl shuts down the worker cleanly:
- `Channel::unsubscribe(id)` to drop the worker's sender, closing the channel from the outside and breaking the self-blocking cycle.
- `handle.join()` to block until the thread, and its entire destructor chain including `rocksdb_close`, has fully completed.

This establishes a hard happens-before guarantee: by the time `SynchronizationGraph` drops, all RocksDB cleanup performed by the Worker thread is complete. The main thread cannot reach `exit()` until after `join()` returns.

The Worker loop is restored to a simple `recv_blocking()` call: no polling, no timeout, no `AtomicBool` flag.

2. Metrics threads: cooperative stop.
Signal metrics reporter threads (`cfx-metrics` and `diem-metrics`) to exit before the component drop chain begins, so they don't access freed state during shutdown.

3. `_exit()` safety net.
If `check_graceful_shutdown` times out (components not released within the deadline), fall back to `libc::_exit(1)` to skip C++ static destructors entirely, avoiding the crash as a last resort for the degraded case.

Files changed

| File | Change |
| --- | --- |
| `sync/synchronization_graph.rs` | `ConsensusWorkerHandle` holds `JoinHandle`; shutdown via `unsubscribe` + `join`; worker loop restored to `recv_blocking` |
| `client/src/common/shutdown_handler.rs` | `metrics::stop()` and `diem_metrics::stop()` before dropping components |
| `util/metrics/src/metrics.rs` | `stop()` / `is_stopped()` global flag |
| `util/metrics/src/report.rs` | `is_stopped()` in reporter loops |
| `pos/common/metrics/src/lib.rs` | |
| `bins/conflux/src/main.rs` | `_exit(1)` fallback on unclean shutdown |
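
For readers unfamiliar with the join-on-drop pattern described in fix 1, here is a minimal, self-contained sketch. It is not the code from this PR: it uses `std::sync::mpsc` in place of Conflux's notification channel, and `Channel`, `Resource`, `WorkerHandle`, `spawn_worker`, and `run_demo` are hypothetical names chosen for illustration. The worker deliberately keeps an `Arc` to the object that owns the sender (mimicking Bug 2), so only an explicit `unsubscribe` from outside can end its blocking `recv()` loop.

```rust
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

type Log = Arc<Mutex<Vec<String>>>;

/// Hypothetical stand-in for the notification channel. It owns the only
/// sender, so `unsubscribe` can close the channel from outside the worker.
struct Channel {
    sender: Mutex<Option<Sender<u64>>>,
}

impl Channel {
    fn send(&self, value: u64) {
        if let Some(tx) = self.sender.lock().unwrap().as_ref() {
            let _ = tx.send(value);
        }
    }

    /// Drops the sender; the worker's `recv()` then drains any queued
    /// messages and returns `Err`, ending the loop.
    fn unsubscribe(&self) {
        self.sender.lock().unwrap().take();
    }
}

/// Hypothetical stand-in for state whose destructor must finish before the
/// process exits (the role `rocksdb_close` plays in the description above).
struct Resource {
    log: Log,
}

impl Drop for Resource {
    fn drop(&mut self) {
        self.log.lock().unwrap().push("resource closed".to_string());
    }
}

/// Holds the `JoinHandle` so the worker is never detached.
struct WorkerHandle {
    channel: Arc<Channel>,
    thread: Option<JoinHandle<()>>,
}

impl Drop for WorkerHandle {
    fn drop(&mut self) {
        // Step 1: close the channel from the outside.
        self.channel.unsubscribe();
        // Step 2: block until the worker and all its destructors are done.
        if let Some(thread) = self.thread.take() {
            let _ = thread.join();
        }
    }
}

fn spawn_worker(log: Log) -> WorkerHandle {
    let (tx, rx) = mpsc::channel::<u64>();
    let channel = Arc::new(Channel { sender: Mutex::new(Some(tx)) });
    let worker_channel = Arc::clone(&channel);
    let worker_log = Arc::clone(&log);
    let resource = Resource { log };
    let thread = thread::spawn(move || {
        // The worker keeps the sender's owner alive, as in Bug 2.
        let _keep_alive = worker_channel;
        let _resource = resource;
        while let Ok(value) = rx.recv() {
            worker_log.lock().unwrap().push(format!("got {}", value));
        }
        // `_resource` is dropped here, before the thread finishes.
    });
    WorkerHandle { channel, thread: Some(thread) }
}

/// Runs the scenario and returns the ordered event log.
fn run_demo() -> Vec<String> {
    let log: Log = Arc::new(Mutex::new(Vec::new()));
    let handle = spawn_worker(Arc::clone(&log));
    handle.channel.send(1);
    handle.channel.send(2);
    drop(handle); // unsubscribe, then join
    log.lock().unwrap().push("after join".to_string());
    let events = log.lock().unwrap().clone();
    events
}
```

Calling `run_demo()` from `main` or a test yields the events `got 1`, `got 2`, `resource closed`, `after join` in that order: the entry recorded after `drop(handle)` always follows the worker's destructor, which is the happens-before relationship the fix relies on. Without the `unsubscribe` step, `join()` in this sketch would block forever, because the worker's own `Arc<Channel>` keeps the sender alive.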