Skip to content

Fix: Consensus Worker thread join to prevent shutdown crash#3419

Merged
peilun-conflux merged 1 commit intoConflux-Chain:masterfrom
ChenxingLi:graceful_shutdown
Mar 29, 2026
Merged

Fix: Consensus Worker thread join to prevent shutdown crash#3419
peilun-conflux merged 1 commit intoConflux-Chain:masterfrom
ChenxingLi:graceful_shutdown

Conversation

@ChenxingLi
Copy link
Copy Markdown
Contributor

@ChenxingLi ChenxingLi commented Mar 27, 2026

Problem

Conflux nodes intermittently crash with SIGABRT (exit code −6) during shutdown. This has been a known flaky issue since at least December 2021, when commit 5cff42c added a workaround to the test framework to silently ignore the resulting pthread lock: Invalid argument stderr output rather than fixing the root cause.

Root cause: a race between C++ static destructors and Rust Arc drop.

The shutdown sequence had three independent bugs that combined to create the race:

Bug 1 — Consensus Worker thread was detached (no JoinHandle retained).
SynchronizationGraph::new() spawned the Consensus Worker thread but immediately discarded the JoinHandle. The struct had no Drop impl and no shutdown method, so the thread ran freely with no lifecycle management.

Bug 2 — The thread was self-blocking on the very channel it helped keep alive.
The worker loop blocked on block_on(consensus_receiver.recv()). The channel closes only when all senders are dropped — but the thread closure itself captured an Arc<ConsensusGraph> which indirectly held a reference to Arc<Notifications> (the sender). The thread was waiting for a signal that its own existence was preventing.

Bug 3 — check_graceful_shutdown conflated "Arc strong count = 0" with "destructor complete".
The shutdown handler polled Weak<BlockDataManager>::upgrade() and declared success the moment it returned None. However, Rust's Arc::drop is a two-step operation: the strong count is decremented atomically before drop_slow() runs. There is no barrier between these steps. The main thread could observe strong count = 0 and proceed to exit() while the Worker thread was still mid-way through the rocksdb_close call chain (buried under multiple layers of nested Arc drops).

The crash: exit() invokes C++ static destructors in reverse construction order, which destroys PeriodicWorkScheduler's internal mutex. If the Worker thread reaches RocksDB::~DBImpl → PeriodicWorkScheduler::Unregister → pthread_mutex_lock after this point, RocksDB's PthreadCall wrapper asserts the return value is zero — it isn't (EINVAL) — and calls abort().

Fix

1. ConsensusWorkerHandle — explicit thread join via unsubscribe.

Introduce a ConsensusWorkerHandle struct that holds the thread's JoinHandle and is stored in SynchronizationGraph. Its Drop impl shuts down the worker cleanly:

  • Call Channel::unsubscribe(id) to drop the worker's sender, closing the channel from the outside and breaking the self-blocking cycle.
  • Call handle.join() to block until the thread — and its entire destructor chain, including rocksdb_close — has fully completed.

This establishes a hard happens-before guarantee: by the time SynchronizationGraph drops, all RocksDB cleanup performed by the Worker thread is complete. The main thread cannot reach exit() until after join() returns.

The Worker loop is restored to a simple recv_blocking() call — no polling, no timeout, no AtomicBool flag.

2. Metrics threads — cooperative stop.

Signal metrics reporter threads (cfx-metrics and diem-metrics) to exit before the component drop chain begins, so they don't access freed state during shutdown.

3. _exit() safety net.

If check_graceful_shutdown times out (components not released within the deadline), fall back to libc::_exit(1) to skip C++ static destructors entirely, avoiding the crash as a last resort for the degraded case.

Files changed

File Change
sync/synchronization_graph.rs ConsensusWorkerHandle holds JoinHandle; shutdown via unsubscribe + join; worker loop restored to recv_blocking
client/src/common/shutdown_handler.rs Call metrics::stop() and diem_metrics::stop() before dropping components
util/metrics/src/metrics.rs Add stop() / is_stopped() global flag
util/metrics/src/report.rs Check is_stopped() in reporter loops
pos/common/metrics/src/lib.rs Same stop flag for diem metrics reporters
bins/conflux/src/main.rs _exit(1) fallback on unclean shutdown

This change is Reviewable

Copy link
Copy Markdown
Contributor

@peilun-conflux peilun-conflux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peilun-conflux reviewed 12 files and all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on ChenxingLi).

@peilun-conflux peilun-conflux merged commit a2e999c into Conflux-Chain:master Mar 29, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants