net_plugin Performance Optimizations #204

Open
heifner wants to merge 20 commits into master from feature/net-plugin-perf
Conversation

heifner (Contributor) commented Feb 17, 2026

Summary

Eliminates heap allocations, mutex contention, and redundant computation from net_plugin hot paths: block/transaction/vote receive and broadcast. These are the highest-frequency code paths in net_plugin, executing thousands of times per second across all peer connections.

Changes

Atomic scalars for chain_info reads

Replace exclusive mutex acquisition with std::atomic<uint32_t> for fork_db_root_num, head_num, and fork_db_head_num. These are read on every block/trx/vote receive across all net threads.

Estimated savings: ~25ns per read x 3 fields x every message = ~75ns per inbound message (eliminates mutex contention across N threads)
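
A minimal sketch of the pattern, assuming the field and getter names from this summary rather than the exact net_plugin code. Note that relaxed atomics give up the cross-field snapshot the mutex provided; the hot-path reads here each need only a single scalar.

#include <atomic>
#include <cstdint>

struct chain_info_t {
   // Independent scalars: a relaxed load/store per field replaces the
   // exclusive mutex for single-field reads on the net threads.
   std::atomic<uint32_t> fork_db_root_num{0};
   std::atomic<uint32_t> head_num{0};
   std::atomic<uint32_t> fork_db_head_num{0};
};

// Hot-path read on any net thread: no lock, no contention.
inline uint32_t get_fork_db_head_num(const chain_info_t& ci) {
   return ci.fork_db_head_num.load(std::memory_order_relaxed);
}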

Reduce nested lock acquisition in peer_connections()

Move peer_connections() call (which acquires its own lock) out of for_each_connection (which holds connections_mtx), eliminating redundant nested locking.
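
A minimal sketch of the reordering, with hypothetical stand-ins for the connection list; only the lock ordering matters here.

#include <cstdint>
#include <mutex>
#include <vector>

std::mutex connections_mtx;
std::vector<int> connections;            // stand-in for the real connection objects

std::mutex peer_ids_mtx;
std::vector<uint32_t> peer_ids;

std::vector<uint32_t> peer_connections() {
   std::lock_guard g(peer_ids_mtx);      // acquires its own lock
   return peer_ids;
}

template <typename F>
void for_each_connection(F&& f) {
   std::lock_guard g(connections_mtx);   // held for the whole traversal
   for (auto& c : connections) f(c);
}

void broadcast() {
   // Snapshot first, so peer_connections()'s lock is never taken while
   // connections_mtx is held (previously a nested acquisition per call).
   auto ids = peer_connections();
   for_each_connection([&](int& c) { /* use ids with c */ });
}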

Atomic _write_queue_size

Convert queued_buffer::_write_queue_size to std::atomic<uint32_t>, eliminating exclusive mutex acquisition in the read-only write_queue_size() called on every read cycle in start_read_message.

Estimated savings: ~25ns per read cycle per connection, fires on every inbound TCP read
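
A sketch of the counter split, assuming the queued_buffer shape described above: structural queue mutations stay behind the existing mutex, only the byte counter becomes atomic so the read loop can skip the lock.

#include <atomic>
#include <cstdint>

class queued_buffer {
public:
   // Called from start_read_message on every read cycle: lock-free.
   uint32_t write_queue_size() const {
      return _write_queue_size.load(std::memory_order_relaxed);
   }
   // Called by writers while already holding the queue mutex.
   void on_enqueue(uint32_t bytes) { _write_queue_size.fetch_add(bytes, std::memory_order_relaxed); }
   void on_dequeue(uint32_t bytes) { _write_queue_size.fetch_sub(bytes, std::memory_order_relaxed); }
private:
   std::atomic<uint32_t> _write_queue_size{0};
};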

Synchronous integrate_received_qc_to_block

Run QC integration synchronously on the critical path before fork_db_.add() instead of posting to thread pool. Only consider_voting is posted asynchronously.

Estimated savings: Eliminates thread pool scheduling delay (~1-5us) from block processing latency
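
A control-flow sketch using the function names above; the scaffolding types are placeholders, not the controller.cpp code.

#include <memory>
#include <boost/asio.hpp>

struct block_state {};
using block_state_ptr = std::shared_ptr<block_state>;

void integrate_received_qc_to_block(const block_state_ptr&) {}
void consider_voting(const block_state_ptr&) {}
struct { void add(const block_state_ptr&) {} } fork_db_;
boost::asio::thread_pool thread_pool{2};

void on_block_received(const block_state_ptr& bsp) {
   integrate_received_qc_to_block(bsp);   // now synchronous, before fork_db_.add()
   fork_db_.add(bsp);
   boost::asio::post(thread_pool.get_executor(), [bsp]() {
      consider_voting(bsp);               // only voting stays on the thread pool
   });
}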

Atomic for queued_buffer::_out_queue empty check

Replace mutex-guarded _out_queue.empty() read with atomic flag.

Estimated savings: ~25ns per enqueue call per connection

Eliminate std::function in queued_write

Replace std::function<void(boost::system::error_code, std::size_t)> callback stored per queued message with a compact struct holding an enum tag + optional fields. Eliminates heap allocation from std::function SBO overflow.

Estimated savings: ~50-100ns per enqueued message (heap alloc + dealloc). At 1000 TPS x 25 peers = 1.25-2.5ms/s saved
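
A sketch of the replacement struct; the field names (connection_id, close_after_send, net_msg, block_num) come from the commit message further below, the rest is assumed.

#include <cstdint>
#include <memory>
#include <optional>
#include <vector>

using send_buffer_type = std::shared_ptr<std::vector<char>>;
enum class msg_type_t : uint8_t { transaction_message, vote_message, signed_block };

// Flat fields instead of std::function<void(error_code, size_t)>: the
// write-completion handler switches on net_msg and reads these fields
// directly, so enqueueing no longer heap-allocates a closure.
struct queued_write {
   send_buffer_type        buff;
   uint32_t                connection_id = 0;
   msg_type_t              net_msg{};
   std::optional<uint32_t> block_num;       // set only for block messages
   bool                    close_after_send = false;
};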

Stack-local small_vector for peer_connections()

Replace unordered_flat_set<uint32_t> returned by value (heap-allocating) with small_vector<connection_id_t, 64> (256 bytes on stack via RVO).

Estimated savings: ~80-120ns per transaction broadcast (heap alloc + dealloc eliminated)
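
A sketch of the return-type change, assuming boost::container::small_vector as named above.

#include <cstdint>
#include <boost/container/small_vector.hpp>

using connection_id_t = uint32_t;
// 64 ids x 4 bytes = 256 bytes of inline storage in the caller's frame;
// the container only touches the heap beyond 64 connected peers.
using peer_conn_ids_t = boost::container::small_vector<connection_id_t, 64>;

peer_conn_ids_t peer_connections() {
   peer_conn_ids_t ids;
   // ... push_back the id of each currently connected peer under the lock ...
   return ids;   // (N)RVO: no copy and no allocation in the common case
}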

Eliminate per-peer boost::asio::post lambda heap allocations

Pre-bind broadcast lambdas to avoid per-peer lambda capture heap allocations in block/trx/vote broadcast fan-out loops. With 25 peers, each broadcast previously created 25 separate lambda heap allocations.

Estimated savings: ~50-80ns x 25 peers = ~1.25-2us per broadcast event

Wire-level transaction ID prefix for zero-allocation duplicate detection

Embed the transaction ID as a fixed prefix in the transaction_message wire format, allowing the receiver to extract the 32-byte ID directly from the network buffer without deserializing the full packed transaction. Duplicate
transactions (the common case in a gossip network) are rejected via a single memcmp + index lookup with zero heap allocation.

Estimated savings: ~200-500ns per duplicate transaction (eliminates fc::raw::unpack of packed_transaction + packed_transaction_ptr construction). At 80%+ duplicate rate with 1000 TPS x 25 peers: ~5-12ms/s saved
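
A sender-side sketch of the wire layout, assuming fc::raw serialization; transaction_message_which is a hypothetical tag value, and Trx stands in for chain::packed_transaction.

#include <fc/io/raw.hpp>

// Pack order: [which (varint)][trx_id (32 bytes)][packed_transaction ...]
// The id is a fixed-size field right after the variant tag, so a receiver can
// read it from the raw buffer without unpacking the packed_transaction.
template <typename Stream, typename Trx>
void pack_transaction_message(Stream& ds, const Trx& trx) {
   constexpr unsigned transaction_message_which = 0;        // hypothetical tag value
   fc::raw::pack(ds, fc::unsigned_int(transaction_message_which));
   fc::raw::pack(ds, trx.id());                             // 32-byte transaction id
   fc::raw::pack(ds, trx);                                  // full payload
}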

Wire-level vote ID prefix for zero-allocation duplicate vote detection

Same approach as transactions: embed a vote ID (SHA-256 of key fields) as a wire prefix, allowing duplicate votes to be rejected from the raw network buffer without BLS signature deserialization.

Estimated savings: ~2-5us per duplicate vote (eliminates BLS pubkey + signature deserialization). With 25 finalizers x 25 peers at 80%+ duplicate rate: ~1-3ms/s saved
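
A sketch of the id computation described above, assuming fc::sha256 and treating finalizer_key as pre-serialized public-key bytes; the real compute_vote_id() in protocol.hpp may differ.

#include <vector>
#include <fc/crypto/sha256.hpp>

using vote_id_type = fc::sha256;

vote_id_type compute_vote_id(const fc::sha256& block_id, bool strong,
                             const std::vector<char>& finalizer_key) {
   fc::sha256::encoder enc;
   enc.write(block_id.data(), block_id.data_size());
   const char s = strong ? 1 : 0;
   enc.write(&s, 1);
   enc.write(finalizer_key.data(), finalizer_key.size());
   return enc.result();   // signature excluded: deterministic given these inputs
}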

small_vector for write buffer descriptors

Replace std::vector<boost::asio::const_buffer> in do_queue_write with small_vector<const_buffer, 16>. Eliminates 2 heap allocations per drain cycle (original vector + copy into async_write's internal write_op).

Estimated savings: ~80-120ns x 2 = ~160-240ns per write drain cycle per connection

Aggregate Performance Estimate

Under production load (1000 TPS, 25 finalizers, 25 peers, ~80% duplicate rate):

  • Transaction path: ~10-20ms/s saved
  • Vote path: ~3-8ms/s saved
  • Write path: ~5-10ms/s saved
  • Total: ~18-38ms/s of CPU time returned to the event loop

Commit messages

…the three most frequently read chain_info fields: fork_db_root_num, head_num, and fork_db_head_num. These scalar getters are called on every block/trx/vote receive path across all net threads, making the exclusive mutex a potentially significant contention point.
…eliminate exclusive mutex acquisition in the read-only write_queue_size() method, which is called on every read cycle in start_read_message.

…thread pool, saving the received QC immediately on the critical path before fork_db_.add(). Only consider_voting is posted asynchronously. Eliminates thread pool scheduling delay from block processing latency.
…ueued message

Replace the std::function<void(ec, size_t)> callback in queued_write with
concrete struct fields (connection_id, close_after_send, net_msg, block_num)
and inline the callback logic into clear_out_queue. This removes the per-message
heap allocation caused by the lambda closure exceeding std::function's SBO limit,
along with the shared_from_this() refcount bump/release in enqueue_buffer.

At 10K TPS with 25 peers: eliminates ~250K heap alloc/dealloc pairs and ~500K
atomic refcount operations per second.
…th stack-local small_vector.

peer_connections() returned connection_id_set (unordered_flat_set<uint32_t>) by value,
heap-allocating on every transaction broadcast. Switch return type to
boost::container::small_vector<connection_id_t, 64> which holds up to 64 connection IDs
(256 bytes) on the caller's stack frame via RVO, avoiding heap allocation in the common case.
…cast hot paths.

Every transaction/vote broadcast previously posted a lambda per peer to its
strand, heap-allocating ~64 bytes and copying two shared_ptrs (8 atomic ops
per peer). At 25 peers and 1000 TPS this was 25K heap allocs/s + 200K atomic
ops/s.

add_write_queue is already mutex-protected, and _out_queue.empty() under the
same lock atomically tells us whether an async_write completion handler will
drain the queue. When a write IS in flight, newly queued messages are picked
up automatically — no strand post needed. When the connection is idle, a
single coalesced drain post (guarded by _write_drain_pending atomic in
queued_buffer) kicks do_queue_write.
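
A sketch of that idle-path coalescing, using the names from this commit message; the scaffolding around connection is simplified.

#include <atomic>
#include <memory>
#include <mutex>
#include <vector>
#include <boost/asio.hpp>

using send_buffer_type = std::shared_ptr<std::vector<char>>;

struct queued_buffer {
   std::mutex                    _mtx;
   std::vector<send_buffer_type> _out_queue;
   std::atomic<bool>             _write_drain_pending{false};

   void add_write_queue(const send_buffer_type& sb) {
      std::lock_guard g(_mtx);
      _out_queue.push_back(sb);
   }
};

struct connection : std::enable_shared_from_this<connection> {
   boost::asio::io_context::strand strand;
   queued_buffer                   buffer_queue;

   explicit connection(boost::asio::io_context& ctx) : strand(ctx) {}

   void do_queue_write() { /* drain _out_queue via async_write ... */ }

   // Callable from any thread: enqueue, then post at most one drain.
   void queue_write_mt(const send_buffer_type& sb) {
      buffer_queue.add_write_queue(sb);
      // Losers of the exchange ride along with the pending drain; the winner
      // posts exactly one do_queue_write to the strand.
      if (!buffer_queue._write_drain_pending.exchange(true, std::memory_order_acq_rel))
         boost::asio::post(strand, [self = shared_from_this()]() {
            self->buffer_queue._write_drain_pending.store(false, std::memory_order_release);
            self->do_queue_write();
         });
   }
};
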
Every inbound packed_transaction triggers make_shared + fc::raw::unpack +
reflector_init (6-8 heap allocations) before checking the dedup cache. In a
gossip network most messages are duplicates, so these allocations are created
and immediately freed — ~24K-32K allocator calls/s wasted at 2000 dups/s.

Add a fast path in process_next_trx_message that peeks the wire bytes via
mb_peek_datastream and computes the transaction ID directly from the packed_trx
SHA-256 hash without deserializing. For the common case (k1/r1/em/ed signatures,
compression_type::none), this identifies duplicates with zero heap allocations.

- trx_dedup.hpp: templated parse_trx_dedup_info() skips signatures by known
  fixed sizes, reads compression byte, skips context_free_data, then hashes
  packed_trx bytes via sha256::encoder + 512-byte stack buffer. static_assert
  guards against new signature types and transaction_header field reordering.
- dispatch_manager::have_txn(): read-only local_txns lookup (no insertion,
  no connection_id tracking, at worst causes one redundant notice to this peer).
- connection::try_early_dedup(): peek-parse-check-consume pipeline, falls back
  to the full unpack path for webauthn/bls signatures, zlib compression, or
  any parse error.

Replace packed_transaction with transaction_message{id, trx}.
The sender prepends the 32-byte transaction ID before the packed_transaction
payload; the receiver peeks those bytes and checks the dedup table before any heap
allocation. Duplicate transactions (the common case in gossip) now skip
make_shared + unpack + reflector_init entirely, eliminating
~6-8 heap allocations per duplicate (~24K-32K wasted calls/s at 2000 dups/s).

Every inbound vote_message triggers make_shared<vote_message>() + full
fc::raw::unpack including two BLS Montgomery field conversions (~1.1 us)
before any duplicate check. In a 21-finalizer network with only 10 peers,
~378 duplicate votes/s arrive, each wasting CPU on deserialization that
is immediately discarded.

Same pattern as the transaction_message optimization: prepend a 32-byte
vote_id (SHA-256 of block_id || strong || finalizer_key, excluding the
signature which is deterministic) on the wire. The receiver peeks the ID,
checks a bounded dedup cache, and only deserializes for new votes.

- protocol.hpp: add vote_id_type alias and compute_vote_id() helper
- buffer_factory.hpp: add vote_buffer_factory (prepends vote_id on wire)
- net_plugin.cpp: add vote_dedup multi_index cache to dispatch_manager
  with hashed_unique<vote_id> + ordered_non_unique<block_num> indices;
  restructure process_next_vote_message() to peek-and-check before
  deserialization; validate wire vote_id matches computed on full path;
  prune cache entries in expire_blocks() at LIB advancement
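
A sketch of the cache shape just described, with an assumed vote_entry record and an explicit hasher (a SHA-256 id is already uniformly distributed, so its first bytes suffice).

#include <cstdint>
#include <cstring>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <fc/crypto/sha256.hpp>

using vote_id_type = fc::sha256;

struct vote_entry {
   vote_id_type vote_id;
   uint32_t     block_num;
};

struct vote_id_hash {
   std::size_t operator()(const vote_id_type& id) const {
      std::size_t h;
      std::memcpy(&h, id.data(), sizeof(h));   // first 8 bytes of the digest
      return h;
   }
};

namespace bmi = boost::multi_index;
using vote_dedup_index = bmi::multi_index_container<
   vote_entry,
   bmi::indexed_by<
      bmi::hashed_unique<bmi::member<vote_entry, vote_id_type, &vote_entry::vote_id>, vote_id_hash>,
      bmi::ordered_non_unique<bmi::member<vote_entry, uint32_t, &vote_entry::block_num>>>>;

// Pruning at LIB advancement: drop every entry at or below lib_num.
void expire_votes(vote_dedup_index& cache, uint32_t lib_num) {
   auto& by_num = cache.get<1>();
   by_num.erase(by_num.begin(), by_num.upper_bound(lib_num));
}
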
heifner marked this pull request as ready for review February 17, 2026 21:09
Copilot AI left a comment
Pull request overview

This PR optimizes net_plugin hot paths by reducing allocations/locking and enabling zero-allocation duplicate detection for transactions and votes via new wire prefixes.

Changes:

  • Introduces new wire formats for transaction_message and vote_message with fixed ID prefixes to enable early duplicate drops.
  • Reduces contention/allocations in send/receive paths (atomics for chain info and write queue size, small_vector buffers, reduced posting/callback overhead).
  • Adds unit tests validating the new wire formats and ID “peek” behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Summary per file:

tests/trx_generator/trx_provider.cpp — Updates the trx generator to emit the new [which][trx_id][packed_transaction] payload format
plugins/net_plugin/test/net_msg_wire_unittest.cpp — Adds unit tests for trx/vote wire formats and ID peeking
plugins/net_plugin/test/CMakeLists.txt — Registers the new wire-format unit test
plugins/net_plugin/src/net_plugin.cpp — Main performance changes: atomics, new dedup caches, new send/receive behavior for trx/vote, write-queue changes
plugins/net_plugin/include/sysio/net_plugin/protocol.hpp — Adds transaction_message variant and compute_vote_id helper
plugins/net_plugin/include/sysio/net_plugin/buffer_factory.hpp — Updates trx buffer construction to include the wire ID; adds vote buffer factory with vote_id prefix
libraries/chain/controller.cpp — Makes QC integration synchronous before fork_db_.add() while keeping voting async


- _write_queue_size = 0;
+ _write_queue_size.store(0, std::memory_order_relaxed);
  _trx_write_queue.clear();
  _out_queue.clear();
Copilot AI commented Feb 18, 2026:

queued_buffer::reset() clears the queues and _write_queue_size but does not reset _write_drain_pending. If a previous queue_write_mt() set _write_drain_pending=true, subsequent queue_write_mt() calls can fail to post a drain and messages may remain queued until some other strand activity triggers do_queue_write(). Reset should clear _write_drain_pending (same as clear_write_queue()).

Suggested change:

- _out_queue.clear();
+ _out_queue.clear();
+ _write_drain_pending.store(false, std::memory_order_relaxed);

heifner (author) replied:

Fixed

Comment on lines 3326 to 3338
// Early dedup: check if we already have this transaction — zero heap allocations on the duplicate path.
// Peek the transaction ID (first field after variant which) for zero-allocation dedup.
// Wire format: [which (varint)][transaction_id (32 bytes)][packed_transaction ...]
auto peek_ds = pending_message_buffer.create_peek_datastream();
unsigned_int which{};
fc::raw::unpack( peek_ds, which );
transaction_id_type trx_id;
fc::raw::unpack( peek_ds, trx_id );
if( my_impl->dispatcher.have_txn( trx_id ) ) {
   peer_dlog( p2p_trx_log, this, "got a duplicate transaction - dropping {}", trx_id );
   pending_message_buffer.advance_read_ptr( message_length );
   return true;
}
Copilot AI commented Feb 18, 2026:

The early-duplicate fast path drops the message without recording that this peer has announced/sent the transaction. That skips dispatch_manager::add_peer_txn()/add_peer_txn_notice(), so connection_ids and per-connection trx_entries_size accounting won't be updated. This can increase redundant rebroadcasts to the same peer and weakens the per-peer tracking/limit behavior. Consider still recording the peer (e.g., via a lightweight notice/add) before advancing the read pointer.

heifner (author) replied:

Fixed

Comment on lines 3466 to 3467
my_impl->dispatcher.add_vote_id( vote_id, block_header::num_from_id(ptr->block_id) );
handle_message( ptr );
Copilot AI commented Feb 18, 2026:

vote_id is inserted into the global dedup cache before the vote has been validated by the chain (signature, finality rules, etc.). A malformed/invalid vote can therefore poison the cache and cause later valid votes with the same (block_id,strong,key) to be dropped, and can also grow the cache with arbitrary (including far-future) block_nums before pruning can occur. Only add to vote_dedup after successful validation/acceptance, and/or enforce bounds on block_num before caching.

Suggested change:

- my_impl->dispatcher.add_vote_id( vote_id, block_header::num_from_id(ptr->block_id) );
- handle_message( ptr );
+ handle_message( ptr );
+ my_impl->dispatcher.add_vote_id( vote_id, block_header::num_from_id(ptr->block_id) );

heifner (author) replied:

Fixed

Comment on lines 2939 to 2943

  const send_buffer_type& sb = buff_factory.get_send_buffer( trx );
  fc_dlog( p2p_trx_log, "sending trx: {}, to connection - {}, size {}", trx->id(), cp->connection_id, sb->size() );
- boost::asio::post(cp->strand, [cp, sb]() {
-    cp->enqueue_buffer( msg_type_t::packed_transaction, std::nullopt, queued_buffer::queue_t::general, sb, go_away_reason::no_reason );
- } );
+ cp->queue_write_mt( msg_type_t::transaction_message, queued_buffer::queue_t::general, sb, go_away_reason::no_reason );
  } );
Copilot AI commented Feb 18, 2026:

bcast_transaction now unconditionally sends msg_type_t::transaction_message. Since this PR changes the wire format (adds a trx_id prefix and replaces packed_transaction in net_message), this is not backward-compatible with peers speaking the previous protocol. The net plugin typically gates wire/behavior changes on proto_version_t/network_version; consider bumping the protocol version and only using transaction_message with peers that negotiated support (and maintaining a compatibility path during upgrades if needed).

heifner (author) replied:

Launching a new chain. No need for backwards compatibility.

Comment on lines 4460 to 4466
  boost::asio::post( thread_pool.get_executor(), [exclude_peer, msg, this]() mutable {
-    buffer_factory buff_factory;
+    vote_buffer_factory buff_factory;
     const auto& send_buffer = buff_factory.get_send_buffer( *msg );

+    dispatcher.add_vote_id( buff_factory.get_vote_id(), block_header::num_from_id(msg->block_id) );
     dispatcher.bcast_vote_msg( exclude_peer, send_buffer );
  });
Copilot AI commented Feb 18, 2026:

send_vote now serializes vote_message as [which][vote_id][vote_message]. This changes the on-the-wire representation of vote_message but is still only gated by proto_version_t::savanna elsewhere. Unless all savanna peers are upgraded simultaneously, this needs a new protocol version (and conditional send/receive) to avoid disconnects or decode failures when talking to older savanna nodes.

heifner (author) replied:

Launching a new chain. No need for backwards compatibility. Next PR cleans up the proto versions.

…as sent the transaction, preventing redundant rebroadcasts back to the same peer and keeping per-connection tracking consistent.

The vote_id was being inserted into the dedup cache before chain
validation (BLS signature, finality rules). A malformed vote could
poison the cache and cause valid votes with the same
(block_id, strong, key) tuple to be silently dropped.

The post-validation path in bcast_vote_message already calls
add_vote_id on success, so the early insert was redundant for valid
votes. Removing it ensures only chain-validated votes enter the cache.