net_plugin Performance Optimizations #204

Open
heifner wants to merge 20 commits into master from feature/net-plugin-perf
Conversation

heifner (Contributor) commented Feb 17, 2026

Summary

Eliminates heap allocations, mutex contention, and redundant computation from net_plugin hot paths: block/transaction/vote receive and broadcast. These are the highest-frequency code paths in net_plugin, executing thousands of times per second across all peer connections.

Changes

Atomic scalars for chain_info reads

Replace exclusive mutex acquisition with std::atomic<uint32_t> for fork_db_root_num, head_num, and fork_db_head_num. These are read on every block/trx/vote receive across all net threads.

Estimated savings: ~25ns per read x 3 fields x every message = ~75ns per inbound message (eliminates mutex contention across N threads)
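
A minimal sketch of the pattern, assuming the field and getter names from this summary rather than the exact net_plugin code. Note that relaxed atomics give up the cross-field snapshot the mutex provided; the hot-path reads here each need only a single scalar.

#include <atomic>
#include <cstdint>

struct chain_info_t {
   // Independent scalars: a relaxed load/store per field replaces the
   // exclusive mutex for single-field reads on the net threads.
   std::atomic<uint32_t> fork_db_root_num{0};
   std::atomic<uint32_t> head_num{0};
   std::atomic<uint32_t> fork_db_head_num{0};
};

// Hot-path read on any net thread: no lock, no contention.
inline uint32_t get_fork_db_head_num(const chain_info_t& ci) {
   return ci.fork_db_head_num.load(std::memory_order_relaxed);
}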

Reduce nested lock acquisition in peer_connections()

Move peer_connections() call (which acquires its own lock) out of for_each_connection (which holds connections_mtx), eliminating redundant nested locking.
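
A minimal sketch of the reordering, with hypothetical stand-ins for the connection list; only the lock ordering matters here.

#include <cstdint>
#include <mutex>
#include <vector>

std::mutex connections_mtx;
std::vector<int> connections;            // stand-in for the real connection objects

std::mutex peer_ids_mtx;
std::vector<uint32_t> peer_ids;

std::vector<uint32_t> peer_connections() {
   std::lock_guard g(peer_ids_mtx);      // acquires its own lock
   return peer_ids;
}

template <typename F>
void for_each_connection(F&& f) {
   std::lock_guard g(connections_mtx);   // held for the whole traversal
   for (auto& c : connections) f(c);
}

void broadcast() {
   // Snapshot first, so peer_connections()'s lock is never taken while
   // connections_mtx is held (previously a nested acquisition per call).
   auto ids = peer_connections();
   for_each_connection([&](int& c) { /* use ids with c */ });
}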

Atomic _write_queue_size

Convert queued_buffer::_write_queue_size to std::atomic<uint32_t>, eliminating exclusive mutex acquisition in the read-only write_queue_size() called on every read cycle in start_read_message.

Estimated savings: ~25ns per read cycle per connection, fires on every inbound TCP read
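
A sketch of the counter split, assuming the queued_buffer shape described above: structural queue mutations stay behind the existing mutex, only the byte counter becomes atomic so the read loop can skip the lock.

#include <atomic>
#include <cstdint>

class queued_buffer {
public:
   // Called from start_read_message on every read cycle: lock-free.
   uint32_t write_queue_size() const {
      return _write_queue_size.load(std::memory_order_relaxed);
   }
   // Called by writers while already holding the queue mutex.
   void on_enqueue(uint32_t bytes) { _write_queue_size.fetch_add(bytes, std::memory_order_relaxed); }
   void on_dequeue(uint32_t bytes) { _write_queue_size.fetch_sub(bytes, std::memory_order_relaxed); }
private:
   std::atomic<uint32_t> _write_queue_size{0};
};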

Synchronous integrate_received_qc_to_block

Run QC integration synchronously on the critical path before fork_db_.add() instead of posting to thread pool. Only consider_voting is posted asynchronously.

Estimated savings: Eliminates thread pool scheduling delay (~1-5us) from block processing latency
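
A control-flow sketch using the function names above; the scaffolding types are placeholders, not the controller.cpp code.

#include <memory>
#include <boost/asio.hpp>

struct block_state {};
using block_state_ptr = std::shared_ptr<block_state>;

void integrate_received_qc_to_block(const block_state_ptr&) {}
void consider_voting(const block_state_ptr&) {}
struct { void add(const block_state_ptr&) {} } fork_db_;
boost::asio::thread_pool thread_pool{2};

void on_block_received(const block_state_ptr& bsp) {
   integrate_received_qc_to_block(bsp);   // now synchronous, before fork_db_.add()
   fork_db_.add(bsp);
   boost::asio::post(thread_pool.get_executor(), [bsp]() {
      consider_voting(bsp);               // only voting stays on the thread pool
   });
}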

Atomic for queued_buffer::_out_queue empty check

Replace mutex-guarded _out_queue.empty() read with atomic flag.

Estimated savings: ~25ns per enqueue call per connection

Eliminate std::function in queued_write

Replace std::function<void(boost::system::error_code, std::size_t)> callback stored per queued message with a compact struct holding an enum tag + optional fields. Eliminates heap allocation from std::function SBO overflow.

Estimated savings: ~50-100ns per enqueued message (heap alloc + dealloc). At 1000 TPS x 25 peers = 1.25-2.5ms/s saved
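
A sketch of the replacement struct; the field names (connection_id, close_after_send, net_msg, block_num) come from the commit message further below, the rest is assumed.

#include <cstdint>
#include <memory>
#include <optional>
#include <vector>

using send_buffer_type = std::shared_ptr<std::vector<char>>;
enum class msg_type_t : uint8_t { transaction_message, vote_message, signed_block };

// Flat fields instead of std::function<void(error_code, size_t)>: the
// write-completion handler switches on net_msg and reads these fields
// directly, so enqueueing no longer heap-allocates a closure.
struct queued_write {
   send_buffer_type        buff;
   uint32_t                connection_id = 0;
   msg_type_t              net_msg{};
   std::optional<uint32_t> block_num;       // set only for block messages
   bool                    close_after_send = false;
};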

Stack-local small_vector for peer_connections()

Replace unordered_flat_set<uint32_t> returned by value (heap-allocating) with small_vector<connection_id_t, 64> (256 bytes on stack via RVO).

Estimated savings: ~80-120ns per transaction broadcast (heap alloc + dealloc eliminated)
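
A sketch of the return-type change, assuming boost::container::small_vector as named above.

#include <cstdint>
#include <boost/container/small_vector.hpp>

using connection_id_t = uint32_t;
// 64 ids x 4 bytes = 256 bytes of inline storage in the caller's frame;
// the container only touches the heap beyond 64 connected peers.
using peer_conn_ids_t = boost::container::small_vector<connection_id_t, 64>;

peer_conn_ids_t peer_connections() {
   peer_conn_ids_t ids;
   // ... push_back the id of each currently connected peer under the lock ...
   return ids;   // (N)RVO: no copy and no allocation in the common case
}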

Eliminate per-peer boost::asio::post lambda heap allocations

Pre-bind broadcast lambdas to avoid per-peer lambda capture heap allocations in block/trx/vote broadcast fan-out loops. With 25 peers, each broadcast previously created 25 separate lambda heap allocations.

Estimated savings: ~50-80ns x 25 peers = ~1.25-2us per broadcast event

Wire-level transaction ID prefix for zero-allocation duplicate detection

Embed the transaction ID as a fixed prefix in the transaction_message wire format, allowing the receiver to extract the 32-byte ID directly from the network buffer without deserializing the full packed transaction. Duplicate
transactions (the common case in a gossip network) are rejected via a single memcmp + index lookup with zero heap allocation.

Estimated savings: ~200-500ns per duplicate transaction (eliminates fc::raw::unpack of packed_transaction + packed_transaction_ptr construction). At 80%+ duplicate rate with 1000 TPS x 25 peers: ~5-12ms/s saved
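
A sender-side sketch of the wire layout, assuming fc::raw serialization; transaction_message_which is a hypothetical tag value, and Trx stands in for chain::packed_transaction.

#include <fc/io/raw.hpp>

// Pack order: [which (varint)][trx_id (32 bytes)][packed_transaction ...]
// The id is a fixed-size field right after the variant tag, so a receiver can
// read it from the raw buffer without unpacking the packed_transaction.
template <typename Stream, typename Trx>
void pack_transaction_message(Stream& ds, const Trx& trx) {
   constexpr unsigned transaction_message_which = 0;        // hypothetical tag value
   fc::raw::pack(ds, fc::unsigned_int(transaction_message_which));
   fc::raw::pack(ds, trx.id());                             // 32-byte transaction id
   fc::raw::pack(ds, trx);                                  // full payload
}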

Wire-level vote ID prefix for zero-allocation duplicate vote detection

Same approach as transactions: embed a vote ID (SHA-256 of key fields) as a wire prefix, allowing duplicate votes to be rejected from the raw network buffer without BLS signature deserialization.

Estimated savings: ~2-5us per duplicate vote (eliminates BLS pubkey + signature deserialization). With 25 finalizers x 25 peers at 80%+ duplicate rate: ~1-3ms/s saved
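
A sketch of the id computation described above, assuming fc::sha256 and treating finalizer_key as pre-serialized public-key bytes; the real compute_vote_id() in protocol.hpp may differ.

#include <vector>
#include <fc/crypto/sha256.hpp>

using vote_id_type = fc::sha256;

vote_id_type compute_vote_id(const fc::sha256& block_id, bool strong,
                             const std::vector<char>& finalizer_key) {
   fc::sha256::encoder enc;
   enc.write(block_id.data(), block_id.data_size());
   const char s = strong ? 1 : 0;
   enc.write(&s, 1);
   enc.write(finalizer_key.data(), finalizer_key.size());
   return enc.result();   // signature excluded: deterministic given these inputs
}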

small_vector for write buffer descriptors

Replace std::vector<boost::asio::const_buffer> in do_queue_write with small_vector<const_buffer, 16>. Eliminates 2 heap allocations per drain cycle (original vector + copy into async_write's internal write_op).

Estimated savings: ~80-120ns x 2 = ~160-240ns per write drain cycle per connection

Aggregate Performance Estimate

Under production load (1000 TPS, 25 finalizers, 25 peers, ~80% duplicate rate):

  • Transaction path: ~10-20ms/s saved
  • Vote path: ~3-8ms/s saved
  • Write path: ~5-10ms/s saved
  • Total: ~18-38ms/s of CPU time returned to the event loop

Commit messages

…the three most frequently read chain_info fields: fork_db_root_num, head_num, and fork_db_head_num. These scalar getters are called on every block/trx/vote receive path across all net threads, making the exclusive mutex a potentially significant contention point.
…eliminate exclusive mutex acquisition in the read-only write_queue_size() method, which is called on every read cycle in start_read_message.

…thread pool, saving the received QC immediately on the critical path before fork_db_.add(). Only consider_voting is posted asynchronously. Eliminates thread pool scheduling delay from block processing latency.
…ueued message

Replace the std::function<void(ec, size_t)> callback in queued_write with
concrete struct fields (connection_id, close_after_send, net_msg, block_num)
and inline the callback logic into clear_out_queue. This removes the per-message
heap allocation caused by the lambda closure exceeding std::function's SBO limit,
along with the shared_from_this() refcount bump/release in enqueue_buffer.

At 10K TPS with 25 peers: eliminates ~250K heap alloc/dealloc pairs and ~500K
atomic refcount operations per second.
…th stack-local small_vector.

peer_connections() returned connection_id_set (unordered_flat_set<uint32_t>) by value,
heap-allocating on every transaction broadcast. Switch return type to
boost::container::small_vector<connection_id_t, 64> which holds up to 64 connection IDs
(256 bytes) on the caller's stack frame via RVO, avoiding heap allocation in the common case.
…cast hot paths.

Every transaction/vote broadcast previously posted a lambda per peer to its
strand, heap-allocating ~64 bytes and copying two shared_ptrs (8 atomic ops
per peer). At 25 peers and 1000 TPS this was 25K heap allocs/s + 200K atomic
ops/s.

add_write_queue is already mutex-protected, and _out_queue.empty() under the
same lock atomically tells us whether an async_write completion handler will
drain the queue. When a write IS in flight, newly queued messages are picked
up automatically — no strand post needed. When the connection is idle, a
single coalesced drain post (guarded by _write_drain_pending atomic in
queued_buffer) kicks do_queue_write.
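
A sketch of that idle-path coalescing, using the names from this commit message; the scaffolding around connection is simplified.

#include <atomic>
#include <memory>
#include <mutex>
#include <vector>
#include <boost/asio.hpp>

using send_buffer_type = std::shared_ptr<std::vector<char>>;

struct queued_buffer {
   std::mutex                    _mtx;
   std::vector<send_buffer_type> _out_queue;
   std::atomic<bool>             _write_drain_pending{false};

   void add_write_queue(const send_buffer_type& sb) {
      std::lock_guard g(_mtx);
      _out_queue.push_back(sb);
   }
};

struct connection : std::enable_shared_from_this<connection> {
   boost::asio::io_context::strand strand;
   queued_buffer                   buffer_queue;

   explicit connection(boost::asio::io_context& ctx) : strand(ctx) {}

   void do_queue_write() { /* drain _out_queue via async_write ... */ }

   // Callable from any thread: enqueue, then post at most one drain.
   void queue_write_mt(const send_buffer_type& sb) {
      buffer_queue.add_write_queue(sb);
      // Losers of the exchange ride along with the pending drain; the winner
      // posts exactly one do_queue_write to the strand.
      if (!buffer_queue._write_drain_pending.exchange(true, std::memory_order_acq_rel))
         boost::asio::post(strand, [self = shared_from_this()]() {
            self->buffer_queue._write_drain_pending.store(false, std::memory_order_release);
            self->do_queue_write();
         });
   }
};
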
Every inbound packed_transaction triggers make_shared + fc::raw::unpack +
reflector_init (6-8 heap allocations) before checking the dedup cache. In a
gossip network most messages are duplicates, so these allocations are created
and immediately freed — ~24K-32K allocator calls/s wasted at 2000 dups/s.

Add a fast path in process_next_trx_message that peeks the wire bytes via
mb_peek_datastream and computes the transaction ID directly from the packed_trx
SHA-256 hash without deserializing. For the common case (k1/r1/em/ed signatures,
compression_type::none), this identifies duplicates with zero heap allocations.

- trx_dedup.hpp: templated parse_trx_dedup_info() skips signatures by known
  fixed sizes, reads compression byte, skips context_free_data, then hashes
  packed_trx bytes via sha256::encoder + 512-byte stack buffer. static_assert
  guards against new signature types and transaction_header field reordering.
- dispatch_manager::have_txn(): read-only local_txns lookup (no insertion,
  no connection_id tracking, at worst causes one redundant notice to this peer).
- connection::try_early_dedup(): peek-parse-check-consume pipeline, falls back
  to the full unpack path for webauthn/bls signatures, zlib compression, or
  any parse error.

Replace packed_transaction with transaction_message{id, trx}.
The sender prepends the 32-byte transaction ID before the packed_transaction
payload; the receiver peeks those bytes and checks the dedup table before any heap
allocation. Duplicate transactions (the common case in gossip) now skip
make_shared + unpack + reflector_init entirely, eliminating
~6-8 heap allocations per duplicate (~24K-32K wasted calls/s at 2000 dups/s).

Every inbound vote_message triggers make_shared<vote_message>() + full
fc::raw::unpack including two BLS Montgomery field conversions (~1.1 us)
before any duplicate check. In a 21-finalizer network with only 10 peers,
~378 duplicate votes/s arrive, each wasting CPU on deserialization that
is immediately discarded.

Same pattern as the transaction_message optimization: prepend a 32-byte
vote_id (SHA-256 of block_id || strong || finalizer_key, excluding the
signature which is deterministic) on the wire. The receiver peeks the ID,
checks a bounded dedup cache, and only deserializes for new votes.

- protocol.hpp: add vote_id_type alias and compute_vote_id() helper
- buffer_factory.hpp: add vote_buffer_factory (prepends vote_id on wire)
- net_plugin.cpp: add vote_dedup multi_index cache to dispatch_manager
  with hashed_unique<vote_id> + ordered_non_unique<block_num> indices;
  restructure process_next_vote_message() to peek-and-check before
  deserialization; validate wire vote_id matches computed on full path;
  prune cache entries in expire_blocks() at LIB advancement
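
A sketch of the cache shape just described, with an assumed vote_entry record and an explicit hasher (a SHA-256 id is already uniformly distributed, so its first bytes suffice).

#include <cstdint>
#include <cstring>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <fc/crypto/sha256.hpp>

using vote_id_type = fc::sha256;

struct vote_entry {
   vote_id_type vote_id;
   uint32_t     block_num;
};

struct vote_id_hash {
   std::size_t operator()(const vote_id_type& id) const {
      std::size_t h;
      std::memcpy(&h, id.data(), sizeof(h));   // first 8 bytes of the digest
      return h;
   }
};

namespace bmi = boost::multi_index;
using vote_dedup_index = bmi::multi_index_container<
   vote_entry,
   bmi::indexed_by<
      bmi::hashed_unique<bmi::member<vote_entry, vote_id_type, &vote_entry::vote_id>, vote_id_hash>,
      bmi::ordered_non_unique<bmi::member<vote_entry, uint32_t, &vote_entry::block_num>>>>;

// Pruning at LIB advancement: drop every entry at or below lib_num.
void expire_votes(vote_dedup_index& cache, uint32_t lib_num) {
   auto& by_num = cache.get<1>();
   by_num.erase(by_num.begin(), by_num.upper_bound(lib_num));
}
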
heifner marked this pull request as ready for review February 17, 2026 21:09
Copilot AI left a comment
Pull request overview

This PR optimizes net_plugin hot paths by reducing allocations/locking and enabling zero-allocation duplicate detection for transactions and votes via new wire prefixes.

Changes:

  • Introduces new wire formats for transaction_message and vote_message with fixed ID prefixes to enable early duplicate drops.
  • Reduces contention/allocations in send/receive paths (atomics for chain info and write queue size, small_vector buffers, reduced posting/callback overhead).
  • Adds unit tests validating the new wire formats and ID “peek” behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Summary per file:

tests/trx_generator/trx_provider.cpp — Updates the trx generator to emit the new [which][trx_id][packed_transaction] payload format
plugins/net_plugin/test/net_msg_wire_unittest.cpp — Adds unit tests for trx/vote wire formats and ID peeking
plugins/net_plugin/test/CMakeLists.txt — Registers the new wire-format unit test
plugins/net_plugin/src/net_plugin.cpp — Main performance changes: atomics, new dedup caches, new send/receive behavior for trx/vote, write-queue changes
plugins/net_plugin/include/sysio/net_plugin/protocol.hpp — Adds transaction_message variant and compute_vote_id helper
plugins/net_plugin/include/sysio/net_plugin/buffer_factory.hpp — Updates trx buffer construction to include the wire ID; adds vote buffer factory with vote_id prefix
libraries/chain/controller.cpp — Makes QC integration synchronous before fork_db_.add() while keeping voting async


- _write_queue_size = 0;
+ _write_queue_size.store(0, std::memory_order_relaxed);
  _trx_write_queue.clear();
  _out_queue.clear();
Copilot AI commented Feb 18, 2026:

queued_buffer::reset() clears the queues and _write_queue_size but does not reset _write_drain_pending. If a previous queue_write_mt() set _write_drain_pending=true, subsequent queue_write_mt() calls can fail to post a drain and messages may remain queued until some other strand activity triggers do_queue_write(). Reset should clear _write_drain_pending (same as clear_write_queue()).

Suggested change:

- _out_queue.clear();
+ _out_queue.clear();
+ _write_drain_pending.store(false, std::memory_order_relaxed);

heifner (author) replied:

Fixed

Comment on lines 3326 to 3338
// Early dedup: check if we already have this transaction — zero heap allocations on the duplicate path.
// Peek the transaction ID (first field after variant which) for zero-allocation dedup.
// Wire format: [which (varint)][transaction_id (32 bytes)][packed_transaction ...]
auto peek_ds = pending_message_buffer.create_peek_datastream();
unsigned_int which{};
fc::raw::unpack( peek_ds, which );
transaction_id_type trx_id;
fc::raw::unpack( peek_ds, trx_id );
if( my_impl->dispatcher.have_txn( trx_id ) ) {
   peer_dlog( p2p_trx_log, this, "got a duplicate transaction - dropping {}", trx_id );
   pending_message_buffer.advance_read_ptr( message_length );
   return true;
}
Copilot AI commented Feb 18, 2026:

The early-duplicate fast path drops the message without recording that this peer has announced/sent the transaction. That skips dispatch_manager::add_peer_txn()/add_peer_txn_notice(), so connection_ids and per-connection trx_entries_size accounting won't be updated. This can increase redundant rebroadcasts to the same peer and weakens the per-peer tracking/limit behavior. Consider still recording the peer (e.g., via a lightweight notice/add) before advancing the read pointer.

heifner (author) replied:

Fixed

Comment on lines 3466 to 3467
my_impl->dispatcher.add_vote_id( vote_id, block_header::num_from_id(ptr->block_id) );
handle_message( ptr );
Copilot AI commented Feb 18, 2026:

vote_id is inserted into the global dedup cache before the vote has been validated by the chain (signature, finality rules, etc.). A malformed/invalid vote can therefore poison the cache and cause later valid votes with the same (block_id,strong,key) to be dropped, and can also grow the cache with arbitrary (including far-future) block_nums before pruning can occur. Only add to vote_dedup after successful validation/acceptance, and/or enforce bounds on block_num before caching.

Suggested change:

- my_impl->dispatcher.add_vote_id( vote_id, block_header::num_from_id(ptr->block_id) );
- handle_message( ptr );
+ handle_message( ptr );
+ my_impl->dispatcher.add_vote_id( vote_id, block_header::num_from_id(ptr->block_id) );

heifner (author) replied:

Fixed

Comment on lines 2939 to 2943

  const send_buffer_type& sb = buff_factory.get_send_buffer( trx );
  fc_dlog( p2p_trx_log, "sending trx: {}, to connection - {}, size {}", trx->id(), cp->connection_id, sb->size() );
- boost::asio::post(cp->strand, [cp, sb]() {
-    cp->enqueue_buffer( msg_type_t::packed_transaction, std::nullopt, queued_buffer::queue_t::general, sb, go_away_reason::no_reason );
- } );
+ cp->queue_write_mt( msg_type_t::transaction_message, queued_buffer::queue_t::general, sb, go_away_reason::no_reason );
  } );
Copilot AI commented Feb 18, 2026:

bcast_transaction now unconditionally sends msg_type_t::transaction_message. Since this PR changes the wire format (adds a trx_id prefix and replaces packed_transaction in net_message), this is not backward-compatible with peers speaking the previous protocol. The net plugin typically gates wire/behavior changes on proto_version_t/network_version; consider bumping the protocol version and only using transaction_message with peers that negotiated support (and maintaining a compatibility path during upgrades if needed).

heifner (author) replied:

Launching a new chain. No need for backwards compatibility.

Comment on lines 4460 to 4466
  boost::asio::post( thread_pool.get_executor(), [exclude_peer, msg, this]() mutable {
-    buffer_factory buff_factory;
+    vote_buffer_factory buff_factory;
     const auto& send_buffer = buff_factory.get_send_buffer( *msg );

+    dispatcher.add_vote_id( buff_factory.get_vote_id(), block_header::num_from_id(msg->block_id) );
     dispatcher.bcast_vote_msg( exclude_peer, send_buffer );
  });
Copilot AI commented Feb 18, 2026:

send_vote now serializes vote_message as [which][vote_id][vote_message]. This changes the on-the-wire representation of vote_message but is still only gated by proto_version_t::savanna elsewhere. Unless all savanna peers are upgraded simultaneously, this needs a new protocol version (and conditional send/receive) to avoid disconnects or decode failures when talking to older savanna nodes.

heifner (author) replied:

Launching a new chain. No need for backwards compatibility. Next PR cleans up the proto versions.

…as sent the transaction, preventing redundant rebroadcasts back to the same peer and keeping per-connection tracking consistent.

The vote_id was being inserted into the dedup cache before chain
validation (BLS signature, finality rules). A malformed vote could
poison the cache and cause valid votes with the same
(block_id, strong, key) tuple to be silently dropped.

The post-validation path in bcast_vote_message already calls
add_vote_id on success, so the early insert was redundant for valid
votes. Removing it ensures only chain-validated votes enter the cache.