Expose the ability to have zero allocation sends. #4802
The Objective:
At a high level, I want to use zeromq messages to send data as fast as possible from a sending thread. However, as I have started going through the optimization process (via perf on x64 Linux), about 25% of my steady-state time is spent in malloc calls allocating the control blocks for the reference-counted long messages.
This pull request is a request to expose a public API to the zero-copy long message type type_zclmsg used internally on the receive side.
References:
As part of this process I have looked at previous issues that reference this topic, trying to not stomp on things, and get a better understanding of the concerns (If I have missed some it would be great to know):
The most discussion of this seems to be in:
#2795
Though this PR also solves the issue here:
#4343
Changes:
The primary change in this PR is to expose the internal function init_external_storage and allow users to pass in a pointer to a preallocated memory block in which the init function constructs the content_t control block. The method/struct we want to expose is below:
In order to not expose private implementation details, and to allow for future modification of the internals, we round up the control block size from ~40 to 64 bytes and expose the larger structure to users in the draft API.
Thus, users just allocate a 64 byte control block, and pass it to zmq. They are then responsible for the lifetime of the block (In most use cases, it will probably be handled in the zmq_free_fn * ).
Internally all objects in the control data block are manually destructed as part of zmq_close (any nontrivial types are constructed with placement new):
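As a rough illustration of the idea (not the PR's actual code), the sketch below shows an opaque 64-byte block that the caller owns, with a private control structure constructed into it by placement new and destroyed manually without freeing the storage. All names here (opaque_control_block_t, internal_content_t, construct_in, destroy_in) are hypothetical stand-ins and are not part of the libzmq API.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical public type: opaque, fixed-size storage owned by the caller.
struct alignas(8) opaque_control_block_t {
    unsigned char bytes[64];
};

// Hypothetical stand-in for the private control block (about 40 bytes on x64).
struct internal_content_t {
    typedef void free_fn(void *data, void *hint);

    internal_content_t(void *d, std::size_t s, free_fn *f, void *h)
        : data(d), size(s), ffn(f), hint(h), refcnt(1) {}

    void *data;
    std::size_t size;
    free_fn *ffn;
    void *hint;
    std::atomic<std::uint32_t> refcnt;
};

// The private type may change over time as long as it still fits the public block.
static_assert(sizeof(internal_content_t) <= sizeof(opaque_control_block_t),
              "control block no longer fits in the public storage");
static_assert(alignof(internal_content_t) <= alignof(opaque_control_block_t),
              "public storage is under-aligned for the control block");

// Construct the control block inside caller-provided storage: no heap allocation.
static internal_content_t *construct_in(opaque_control_block_t *storage,
                                        void *data, std::size_t size,
                                        internal_content_t::free_fn *ffn,
                                        void *hint)
{
    assert(storage != nullptr);
    return new (storage->bytes) internal_content_t(data, size, ffn, hint);
}

// Run the destructor only; the storage itself still belongs to the caller.
static void destroy_in(internal_content_t *content)
{
    content->~internal_content_t();
}
```

Rounding the public size up to 64 bytes leaves headroom for the private structure to grow, and the static_asserts turn any overflow into a compile-time error rather than a silent overrun.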
My Use-case:
For my sending thread, I have a slab allocator which pulls memory from a lock-free object pool. Under the hood this uses a freelist that batches memory allocations. This is fast, though to reduce thread contention in the CAS loop I generally pull larger chunks and construct multiple smaller messages within each allocated buffer.
The messages I am using are not huge, but not tiny, at around 1 KB, so I am currently using zmq_msg_init_data. This works, and there is an inline reference-counted control block at the beginning of the allocation that is decremented in zmq's free_fn.
At this point, the malloc time from control block allocation starts to add up.
I realize I could send larger messages, and I will be doing that, but it involves rewriting a lot of the code and message logic on both the send and receive sides. Just allowing me to point to a preallocated control block at the beginning of the message gives me an easy ~25% speedup.
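A minimal sketch of the layout described above, under stated assumptions: one slab holds a shared reference count plus several message slots, each slot reserves 64 bytes for a control block ahead of its payload, and a free function with the same (data, hint) shape as zmq_free_fn releases the slab once the last slot is done. The names slab_t, slot_t, acquire_slab and slot_free_fn are hypothetical, and malloc/free stand in for the lock-free pool.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

static const int k_slots_per_slab = 4;  // hypothetical batch size

// One message carved out of the slab: control block space, then payload.
struct alignas(8) slot_t {
    unsigned char control[64];    // preallocated space for the control block
    unsigned char payload[1024];  // roughly 1 KB of message body
};

// The slab header carries the inline reference count for all of its slots.
struct slab_t {
    slab_t() : live(k_slots_per_slab) {}
    std::atomic<int> live;  // slots still referenced by in-flight messages
    slot_t slots[k_slots_per_slab];
};

static int g_slabs_returned = 0;  // counts releases; stands in for the pool

// Hypothetical: a real implementation would pop a slab from the object pool.
static slab_t *acquire_slab()
{
    void *mem = std::malloc(sizeof(slab_t));
    return mem ? new (mem) slab_t() : nullptr;
}

// Same shape as zmq_free_fn: called once per message when it is released.
// The hint carries the owning slab; the last release hands the slab back.
static void slot_free_fn(void *data, void *hint)
{
    (void) data;
    slab_t *slab = static_cast<slab_t *>(hint);
    if (slab->live.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        slab->~slab_t();
        std::free(slab);  // a real pool would push the slab onto its freelist
        ++g_slabs_returned;
    }
}
```

With this layout, the only per-message cost on the hot path is one atomic decrement; the control block storage is already part of the slab, so nothing needs to be allocated per message.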
Additionally, I am generally wary of the non-deterministic nature of new/delete, especially in the hot path loop.
Gotchas? / random thoughts?
Trying to think of issues with this, it seems pretty safe. It obviously involves properly managing and releasing the control block memory, but that should be easy for people already using a free function. It is not a solution for all issues, but requiring new/delete for any message above ~33 bytes seems less than ideal.
It also seems like the content object, as an internal API, is very stable (last touched 9 years ago?).
One open question is alignment. 8 bytes is the minimum on x64, and it was easy to copy the msg_t alignment, but it may be better to raise it to 16 bytes if there is any chance of needing 128-bit CAS-type instructions. That being said, people who use the library should respect the alignment of the types they are given and not assume one.
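To illustrate respecting the declared alignment instead of assuming 8 bytes, here is a small sketch that carves a block out of a raw buffer using alignof and std::align. The 16-byte-aligned control_block16_t and the carve_block helper are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Hypothetical variant of the public block with the stricter 16-byte alignment.
struct alignas(16) control_block16_t {
    unsigned char bytes[64];
};

// Carve one Block out of a raw buffer, honoring alignof(Block).
// On success, advances cursor past the block and shrinks space accordingly.
template <typename Block>
Block *carve_block(void *&cursor, std::size_t &space)
{
    void *p = std::align(alignof(Block), sizeof(Block), cursor, space);
    if (p == nullptr)
        return nullptr;  // not enough room left in the buffer
    cursor = static_cast<unsigned char *>(p) + sizeof(Block);
    space -= sizeof(Block);
    return new (p) Block;
}
```

Code written this way keeps working if the library later changes the alignment from 8 to 16 bytes, since the requirement is read from the type rather than hard-coded.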
Finally, as a gripe, I am personally not a fan of the name content_t. It is not actually the message content but the control block for the message content, and it keeps confusing me (which is why I say "control block" throughout this PR).