Fix: Consensus Worker thread join to prevent shutdown crash by ChenxingLi · Pull Request #3419 · Conflux-Chain/conflux-rust

ChenxingLi · 2026-03-27T05:06:22Z

Problem

Conflux nodes intermittently crash with SIGABRT (exit code −6) during shutdown. This has been a known flaky issue since at least December 2021, when commit 5cff42c added a workaround to the test framework to silently ignore the resulting pthread lock: Invalid argument stderr output rather than fixing the root cause.

Root cause: a race between C++ static destructors and Rust Arc drop.

The shutdown sequence had three independent bugs that combined to create the race:

Bug 1 — Consensus Worker thread was detached (no JoinHandle retained).
SynchronizationGraph::new() spawned the Consensus Worker thread but immediately discarded the JoinHandle. The struct had no Drop impl and no shutdown method, so the thread ran freely with no lifecycle management.

Bug 2 — The thread was self-blocking on the very channel it helped keep alive.
The worker loop blocked on block_on(consensus_receiver.recv()). The channel closes only when all senders are dropped — but the thread closure itself captured an Arc<ConsensusGraph> which indirectly held a reference to Arc<Notifications> (the sender). The thread was waiting for a signal that its own existence was preventing.

Bug 3 — check_graceful_shutdown conflated "Arc strong count = 0" with "destructor complete".
The shutdown handler polled Weak<BlockDataManager>::upgrade() and declared success the moment it returned None. However, Rust's Arc::drop is a two-step operation: the strong count is decremented atomically before drop_slow() runs. There is no barrier between these steps. The main thread could observe strong count = 0 and proceed to exit() while the Worker thread was still mid-way through the rocksdb_close call chain (buried under multiple layers of nested Arc drops).

The crash: exit() invokes C++ static destructors in reverse construction order, which destroys PeriodicWorkScheduler's internal mutex. If the Worker thread reaches RocksDB::~DBImpl → PeriodicWorkScheduler::Unregister → pthread_mutex_lock after this point, RocksDB's PthreadCall wrapper asserts the return value is zero — it isn't (EINVAL) — and calls abort().

Fix

1. ConsensusWorkerHandle — explicit thread join via unsubscribe.

Introduce a ConsensusWorkerHandle struct that holds the thread's JoinHandle and is stored in SynchronizationGraph. Its Drop impl shuts down the worker cleanly:

Call Channel::unsubscribe(id) to drop the worker's sender, closing the channel from the outside and breaking the self-blocking cycle.
Call handle.join() to block until the thread — and its entire destructor chain, including rocksdb_close — has fully completed.

This establishes a hard happens-before guarantee: by the time SynchronizationGraph drops, all RocksDB cleanup performed by the Worker thread is complete. The main thread cannot reach exit() until after join() returns.

The Worker loop is restored to a simple recv_blocking() call — no polling, no timeout, no AtomicBool flag.

2. Metrics threads — cooperative stop.

Signal metrics reporter threads (cfx-metrics and diem-metrics) to exit before the component drop chain begins, so they don't access freed state during shutdown.

3. _exit() safety net.

If check_graceful_shutdown times out (components not released within the deadline), fall back to libc::_exit(1) to skip C++ static destructors entirely, avoiding the crash as a last resort for the degraded case.

Files changed

File	Change
`sync/synchronization_graph.rs`	`ConsensusWorkerHandle` holds `JoinHandle`; shutdown via `unsubscribe` + `join`; worker loop restored to `recv_blocking`
`client/src/common/shutdown_handler.rs`	Call `metrics::stop()` and `diem_metrics::stop()` before dropping components
`util/metrics/src/metrics.rs`	Add `stop()` / `is_stopped()` global flag
`util/metrics/src/report.rs`	Check `is_stopped()` in reporter loops
`pos/common/metrics/src/lib.rs`	Same stop flag for diem metrics reporters
`bins/conflux/src/main.rs`	`_exit(1)` fallback on unclean shutdown

This change is

peilun-conflux

@peilun-conflux reviewed 12 files and all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on ChenxingLi).

Fix: Consensus Worker thread join to prevent shutdown crash

fe3d025

ChenxingLi force-pushed the graceful_shutdown branch from f5e7deb to fe3d025 Compare March 27, 2026 05:23

peilun-conflux approved these changes Mar 27, 2026

View reviewed changes

peilun-conflux merged commit a2e999c into Conflux-Chain:master Mar 29, 2026
11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix: Consensus Worker thread join to prevent shutdown crash#3419

Fix: Consensus Worker thread join to prevent shutdown crash#3419
peilun-conflux merged 1 commit intoConflux-Chain:masterfrom
ChenxingLi:graceful_shutdown

ChenxingLi commented Mar 27, 2026 •

edited by peilun-conflux

Loading

Uh oh!

peilun-conflux left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ChenxingLi commented Mar 27, 2026 • edited by peilun-conflux Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Files changed

Uh oh!

peilun-conflux left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ChenxingLi commented Mar 27, 2026 •

edited by peilun-conflux

Loading