Experimental, managed ASP.NET Core transport layer based on `io_uring`. This library is inspired by kestrel-linux-transport, a similar Linux-specific transport layer based on `epoll`.
This transport layer supports both server (`IConnectionListenerFactory`) and client (`IConnectionFactory`) scenarios. It can be registered with `services.AddIoUringTransport();` in the `ConfigureServices` method.
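For example, in a typical `Startup` class (assuming the library's namespace, which brings the extension method into scope, is imported):

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Registers the io_uring transport in place of the default socket transport.
        services.AddIoUringTransport();
    }

    public void Configure(IApplicationBuilder app)
    {
        app.Run(context => context.Response.WriteAsync("Hello from io_uring"));
    }
}
```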
A configurable number of `TransportThread`s are started. Each thread opens an accept-socket on the server endpoint (IP and port) using the `SO_REUSEPORT` option. This allows all threads to accept inbound connections and lets the kernel load-balance between the accept-sockets. The threads are also able to initiate outbound connections.
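A minimal sketch of how such an accept-socket could be created in managed code. The constants are Linux-specific and not exposed via `SocketOptionName`; the library itself may set the option through its own interop rather than `SetRawSocketOption`:

```csharp
using System;
using System.Net;
using System.Net.Sockets;

internal static class AcceptSocketFactory
{
    // Linux constants (not part of the managed SocketOptionName enum).
    private const int SOL_SOCKET = 1;
    private const int SO_REUSEPORT = 15;

    // Each transport thread binds its own accept-socket to the same endpoint;
    // SO_REUSEPORT lets the kernel distribute incoming connections across them.
    public static Socket Create(IPEndPoint endPoint)
    {
        var socket = new Socket(endPoint.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
        socket.SetRawSocketOption(SOL_SOCKET, SO_REUSEPORT, BitConverter.GetBytes(1));
        socket.Bind(endPoint);
        socket.Listen(128);
        return socket;
    }
}
```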
All threads are provided with the writing end of the same `Channel`, to which they write accepted connections. This `Channel` is read from when `AcceptAsync` is invoked on the `IConnectionListener`. The `Channel` is unbounded; back-pressure to temporarily disable accepting new connections is not yet supported.
The `IConnectionFactory` delegates the handling of new outbound connections to a `TransportThread` in a round-robin fashion.
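The following sketch illustrates both hand-offs; the type and member names are hypothetical, not the library's actual API:

```csharp
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Connections;

// Placeholder for the per-thread event loop described below.
public sealed class TransportThread { /* owns one io_uring and its event loop */ }

public sealed class ConnectionDispatcher
{
    // Every transport thread writes accepted connections to this channel;
    // the listener's AcceptAsync reads from it.
    private readonly Channel<ConnectionContext> _acceptQueue =
        Channel.CreateUnbounded<ConnectionContext>();

    private readonly TransportThread[] _threads;
    private int _nextThread = -1;

    public ConnectionDispatcher(TransportThread[] threads) => _threads = threads;

    public ChannelWriter<ConnectionContext> AcceptWriter => _acceptQueue.Writer;

    // Backs IConnectionListener.AcceptAsync.
    public ValueTask<ConnectionContext> AcceptAsync(CancellationToken token = default) =>
        _acceptQueue.Reader.ReadAsync(token);

    // Backs IConnectionFactory.ConnectAsync: pick a transport thread round-robin.
    public TransportThread NextThread()
    {
        uint next = (uint)Interlocked.Increment(ref _nextThread);
        return _threads[next % (uint)_threads.Length];
    }
}
```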
Each thread creates an `io_uring` to schedule IO operations and to get notified of their completion.
Each thread also creates an `eventfd` in semaphore mode (`EFD_SEMAPHORE`) with an initial value of 0 and places a `readv` operation (`IORING_OP_READV`) from that `eventfd` onto the `io_uring`. This allows us - as we shall see and use later - to unblock the thread with a normal `write` to the `eventfd` if the thread is blocked by an `io_uring_enter` syscall waiting for an IO operation to complete. The same could be achieved by sending a no-op (`IORING_OP_NOP`) through the `io_uring`, but that would require synchronizing access to said ring, as multiple threads would then be writing to it. This trick allows us to be mostly lock-free in the event loop, the only exceptions being data structures such as `Channel` and `ConcurrentQueue`.
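A sketch of this wakeup mechanism, assuming direct P/Invoke to libc (the library's actual interop layer may be shaped differently):

```csharp
using System;
using System.Runtime.InteropServices;

internal static class EventFdWakeup
{
    private const int EFD_SEMAPHORE = 1;

    [DllImport("libc", SetLastError = true)]
    private static extern int eventfd(uint initval, int flags);

    [DllImport("libc", SetLastError = true)]
    private static extern IntPtr write(int fd, ref ulong value, UIntPtr count);

    // Created once per transport thread, with an initial value of 0.
    public static int Create() => eventfd(0, EFD_SEMAPHORE);

    // Callable from any thread: completes the pending IORING_OP_READV on the
    // eventfd, which unblocks a transport thread stuck in io_uring_enter
    // without requiring synchronized access to the ring itself.
    public static void Unblock(int eventFd)
    {
        ulong one = 1;
        write(eventFd, ref one, (UIntPtr)sizeof(ulong));
    }
}
```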
Before the event loop is started, we place the above-mentioned `readv` from the `eventfd` onto the `io_uring`, as well as a `poll` (`IORING_OP_POLL_ADD`) for acceptable connections (`POLLIN`) on the accept-socket.
The event loop is then made up of the following actions (a condensed code sketch follows the list):
- Check the accept-socket-queue. This `ConcurrentQueue` contains newly bound sockets for server endpoints. For each connection in this queue, a `poll` (`IORING_OP_POLL_ADD`) for "acceptability" (`POLLIN`) is added to the `io_uring`.
- Check the client-socket-queue. This `ConcurrentQueue` contains sockets to client endpoints for which a `connect` is in progress. For each connection in this queue, a `poll` (`IORING_OP_POLL_ADD`) for "writability" (`POLLOUT`) is added to the `io_uring`. "Writability" will indicate the completion of the `connect`.
- Check the read-poll-queue. This `ConcurrentQueue` contains connections that could be read from again, after a `FlushAsync` to the application completed asynchronously, indicating that there was a need for back-pressure. The synchronous case is handled with a fast-path below. For each connection in this queue, a `poll` (`IORING_OP_POLL_ADD`) for incoming bytes (`POLLIN`) is added to the `io_uring`.
- Check the write-poll-queue. This `ConcurrentQueue` contains connections that should be written to, after a `ReadAsync` from the application completed asynchronously. The synchronous case is handled with a fast-path below. For each connection in this queue, a `poll` (`IORING_OP_POLL_ADD`) for "writability" (`POLLOUT`) is added to the `io_uring`.
- Submit all previously prepared operations to the kernel and block until at least one operation has completed. (This involves one syscall to `io_uring_enter`.)
- Handle all completed operations. Typically, each (successfully) completed operation causes another operation to be prepared for submission in the next iteration of the event loop. The recognized types of completed operations are:
  - **eventfd poll completion**: The `poll` for the `eventfd` completed. This indicates that a `ReadAsync` from or a `FlushAsync` to the application completed asynchronously and that the corresponding connection was added to one of the above-mentioned queues. The immediate action taken is to prepare another `poll` (`IORING_OP_POLL_ADD`) for the `eventfd`, as the connection-specific `poll`s are added when handling the queues at the beginning of the next event loop iteration. This ensures that the transport thread can again be unblocked if the next `io_uring_enter` blocks.
  - **accept poll completion**: The `poll` on an accept-socket completed. This indicates that one or more connections could be accepted. One connection is accepted by invoking the syscall `accept`. In a future release, this could be done via the `io_uring` (`IORING_OP_ACCEPT`) to avoid the syscall, but that feature will only be available in kernel version 5.5, which is unreleased at the time of writing. The accepted connection is added to the above-mentioned channel, and two operations are triggered: a `poll` (`IORING_OP_POLL_ADD`) for incoming bytes (`POLLIN`) is added to the `io_uring`, and a `ReadAsync` from the application is started to get bytes to be sent. If the latter completes synchronously, a `poll` (`IORING_OP_POLL_ADD`) for "writability" (`POLLOUT`) is added to the `io_uring` directly. In the asynchronous case, a callback is scheduled that will register the connection with the write-poll-queue and unblock the transport thread, if necessary, by writing to the `eventfd`.
  - **read poll completion**: The `poll` for available data (`POLLIN`) on a socket completed. A `readv` (`IORING_OP_READV`) is added to the `io_uring` to read the data from the socket.
  - **write poll completion**: One of two things could have happened:
    - The `poll` for "writability" (`POLLOUT`) of an outbound socket completed. A `poll` (`IORING_OP_POLL_ADD`) for incoming bytes (`POLLIN`) is added to the `io_uring`. Additionally, a `ReadAsync` from the application is started, as for the write-poll-queue items above.
    - The `poll` for "writability" (`POLLOUT`) of an inbound socket completed. A `writev` (`IORING_OP_WRITEV`) for the data previously acquired during a `ReadAsync` is added to the `io_uring`.
  - **read completion**: The `readv` previously added for the affected socket completed. The `Pipeline` is advanced past the number of bytes read, and the data is handed over to the application using `FlushAsync`. If `FlushAsync` completes synchronously, a `poll` (`IORING_OP_POLL_ADD`) for incoming bytes (`POLLIN`) is added to the `io_uring` directly. In the asynchronous case, a callback is scheduled that will register the connection with the read-poll-queue and unblock the transport thread, if necessary, by writing to the `eventfd`.
  - **write completion**: The `writev` previously added for the affected socket completed. The `Pipeline` is advanced past the number of bytes written, and more data from the application is requested using `ReadAsync`. If `ReadAsync` completes synchronously, a `poll` for "writability" (`POLLOUT`) is added to the `io_uring` directly. In the asynchronous case, a callback is scheduled that will register the connection with the write-poll-queue and unblock the transport thread, if necessary, by writing to the `eventfd`.
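Condensed into code, the loop might look roughly like the following sketch. The `Prepare*`, `SubmitAndWait`, `TryReadCompletion`, and `HandleCompletion` members are placeholders for the library's actual `io_uring` interop, not its real API:

```csharp
using System.Collections.Concurrent;

// Hypothetical, heavily condensed shape of the event loop described above.
public sealed class TransportThreadLoop
{
    private readonly ConcurrentQueue<int> _acceptSocketQueue = new ConcurrentQueue<int>();
    private readonly ConcurrentQueue<int> _clientSocketQueue = new ConcurrentQueue<int>();
    private readonly ConcurrentQueue<int> _readPollQueue = new ConcurrentQueue<int>();
    private readonly ConcurrentQueue<int> _writePollQueue = new ConcurrentQueue<int>();

    public void Run()
    {
        while (true)
        {
            int fd;
            // Drain the queues, preparing one poll submission per connection.
            while (_acceptSocketQueue.TryDequeue(out fd)) PreparePollIn(fd);  // POLLIN: "acceptability"
            while (_clientSocketQueue.TryDequeue(out fd)) PreparePollOut(fd); // POLLOUT: connect completed
            while (_readPollQueue.TryDequeue(out fd)) PreparePollIn(fd);      // POLLIN: readable again
            while (_writePollQueue.TryDequeue(out fd)) PreparePollOut(fd);    // POLLOUT: writable again

            // One io_uring_enter syscall: submit everything prepared above
            // and block until at least one operation has completed.
            SubmitAndWait(minComplete: 1);

            // Dispatch completions; each handler typically prepares the
            // follow-up operation for the next loop iteration.
            ulong userData; int result;
            while (TryReadCompletion(out userData, out result))
            {
                HandleCompletion(userData, result);
            }
        }
    }

    // Placeholders standing in for the actual io_uring interop.
    private void PreparePollIn(int socketFd) { /* add IORING_OP_POLL_ADD (POLLIN) */ }
    private void PreparePollOut(int socketFd) { /* add IORING_OP_POLL_ADD (POLLOUT) */ }
    private void SubmitAndWait(int minComplete) { /* io_uring_enter */ }
    private bool TryReadCompletion(out ulong userData, out int result)
    {
        userData = 0; result = 0; return false; // placeholder
    }
    private void HandleCompletion(ulong userData, int result) { /* see completion types above */ }
}
```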
Once an IO operation handed over to the `io_uring` completes, the transport needs to restore contextual information regarding the completed operation. This includes:
- The type of operation that completed (listed in bold above).
- The socket (and associated data) the operation was performed on.
`io_uring` allows 64 bits of user data to be provided with each submission; this value is routed through unchanged to the completion of the request. The lower 32 bits of the value are set to the file descriptor of the socket the operation is performed on, and the upper 32 bits are set to an operation-type indicator. This ensures that context can be restored after the completion of an asynchronous operation. The socket file descriptor is then used as the key into a `Dictionary` to fetch the data associated with the socket.
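An illustrative sketch of this user-data scheme (the enum values and helper names are hypothetical, not the library's actual API):

```csharp
// Operation-type indicator stored in the high 32 bits of user_data.
internal enum OperationType : uint
{
    EventFdPoll = 1, AcceptPoll, ReadPoll, WritePoll, Read, Write
}

internal static class AsyncOperationUserData
{
    // High 32 bits: operation type; low 32 bits: socket file descriptor.
    public static ulong Pack(int socketFd, OperationType operation) =>
        ((ulong)operation << 32) | (uint)socketFd;

    public static void Unpack(ulong userData, out int socketFd, out OperationType operation)
    {
        socketFd = (int)(uint)userData;
        operation = (OperationType)(userData >> 32);
    }
}
```

On completion, `Unpack` yields the file descriptor used to look the connection's data up in the per-thread `Dictionary`.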
Remaining work includes:

- Error handling in general. This is currently a very minimal PoC.
- Polishing in general. Again, this is currently a very minimal PoC.
- Testing with more than a simple demo app...
- Benchmark and optimize
- Enable CPU affinity
- Investigate whether the use of zero-copy options is profitable (vis-à-vis registered buffers)
- Use multi-`iovec` `readv`s if more than `_memoryPool.MaxBufferSize` bytes are readable, and ensure that the syscall to `ioctl(_fd, FIONREAD, &readableBytes)` is avoided in the typical cases where one `iovec` is enough.
- Create the largest possible (and reasonable) `io_uring`s. The max number of `entries` differs between kernel versions; perform auto-sensing.
- Implement `accept`-ing new connections using `io_uring`, once supported on non-rc kernel versions (v5.5).
- Implement `connect`-ing to new endpoints using `io_uring`, once supported on non-rc kernel versions (v5.5).
- Profit from `IORING_FEAT_NODROP`, or implement safety measures to ensure that no more than `io_uring_params->cq_entries` operations are in flight at any given moment.
- Profit from `IORING_FEAT_SUBMIT_STABLE`. Currently, the `iovec`s are allocated and fixed per connection to ensure they don't "move" during the execution of an operation.
- Profit from `io_uring_register` and `IORING_REGISTER_BUFFERS` to speed up IO.