feat(io_uring): Add support for registered buffers by kavirajk · Pull Request #72 · ClickHouse/silk

kavirajk · 2026-06-12T22:28:39Z

Changes

Added the following new APIs to the scheduler:

registerBuffers()
readFixed()
writeFixed()

These internally use the liburing helpers io_uring_register_buffers(), io_uring_prep_read_fixed(), and io_uring_prep_write_fixed().

This lets us register pre-allocated buffers once so that io_uring can reuse them across I/O operations, rather than paying the buffer setup cost (pinning/allocation) on every I/O.

This is mainly based on best practices learned from the TUM DBMS paper:
https://arxiv.org/pdf/2512.04859
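As a rough illustration of what "registered buffers" means on the client side, the sketch below slices one pre-allocated arena into equally sized buffers, described as the `iovec` array that io_uring_register_buffers() accepts, and checks that a fixed read/write stays inside the buffer named by its index. The helper names `sliceArena` and `fitsRegisteredBuffer` are hypothetical and not part of silk or liburing; no ring is created here, so it only shows the data layout and the bounds rule.

```cpp
#include <sys/uio.h> // struct iovec (POSIX)

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of silk: describe one pre-allocated arena as
// `count` equally sized buffers. The resulting array has the shape that
// io_uring_register_buffers(ring, iovecs, nr_iovecs) takes.
inline std::vector<iovec> sliceArena(void * arena, size_t count, size_t bufLen)
{
    assert(arena != nullptr && bufLen > 0);
    std::vector<iovec> table(count);
    auto * base = static_cast<char *>(arena);
    for (size_t i = 0; i < count; ++i)
    {
        table[i].iov_base = base + i * bufLen;
        table[i].iov_len = bufLen;
    }
    return table;
}

// A fixed read/write names a registered buffer by index, and the bytes
// [addr, addr + nbytes) must lie inside that buffer.
inline bool fitsRegisteredBuffer(const std::vector<iovec> & table, int bufIndex, const void * addr, uint32_t nbytes)
{
    if (bufIndex < 0 || static_cast<size_t>(bufIndex) >= table.size())
        return false;
    const iovec & buf = table[static_cast<size_t>(bufIndex)];
    const auto start = reinterpret_cast<uintptr_t>(buf.iov_base);
    const auto at = reinterpret_cast<uintptr_t>(addr);
    if (at < start || nbytes > buf.iov_len)
        return false;
    return at - start <= buf.iov_len - nbytes; // offset + nbytes fits in the buffer
}
```

With a 16 KiB arena split into four 4 KiB buffers, a 4 KiB request at the start of buffer 1 fits, while the same request shifted by one byte does not.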

I also integrated this into the file-perf benchmark, and the numbers look promising. See below for the measured improvements.

Performance

My setup is an m8id.8xlarge EC2 instance.

Normal

ec2-dev$ ./bb -b release perf --duration 60s --warmup 10s file
2026-06-12 21:48:21.567 [INFO ] bb:1986: command=perf preset=release
[0/2] Re-checking globbed directories...
ninja: no work to do.

## file-perf -- async file I/O

file=/dev/shm/file-perf.bin, bs=4k, size=1g, duration=60s, warmup=10s

| numjobs  | iodepth  | mode       | IOPS     | BW         | avg      | p50      | p95      | p99      | p99.9    |
|----------|----------|------------|----------|------------|----------|----------|----------|----------|----------|
| 1        | 1        | randwrite  | 185k     | 721.0 MiB/s | 5.39 µs  | 3.06 µs  | 12.83 µs | 13.57 µs | 22.39 µs |
| 1        | 16       | randwrite  | 575k     | 2245.0 MiB/s | 27.81 µs | 26.82 µs | 36.7 µs  | 41.02 µs | 54.04 µs |
| 16       | 1        | randwrite  | 893k     | 3489.0 MiB/s | 17.89 µs | 18.79 µs | 26.6 µs  | 37.29 µs | 52.2 µs  |
| 16       | 16       | randwrite  | 805k     | 3143.0 MiB/s | 318.16 µs | 262.8 µs | 850.48 µs | 1333.68 µs | 1847.18 µs |
| 1        | 1        | randread   | 232k     | 906.0 MiB/s | 4.29 µs  | 2.42 µs  | 12.35 µs | 12.99 µs | 20.79 µs |
| 1        | 16       | randread   | 682k     | 2665.0 MiB/s | 23.43 µs | 25.55 µs | 29.8 µs  | 38.12 µs | 53.12 µs |
| 16       | 1        | randread   | 2663k    | 10404.0 MiB/s | 5.98 µs  | 3.91 µs  | 15.24 µs | 29.43 µs | 98.45 µs |
| 16       | 16       | randread   | 4955k    | 19355.0 MiB/s | 51.63 µs | 50.92 µs | 80.68 µs | 99.89 µs | 126.67 µs |

Fixed Buffers

ec2-dev$ ./bb -b release perf --duration 60s --warmup 10s file --fixed-buffers
2026-06-12 21:58:31.006 [INFO ] bb:1986: command=perf preset=release
[0/2] Re-checking globbed directories...
ninja: no work to do.

## file-perf -- async file I/O

file=/dev/shm/file-perf.bin, bs=4k, size=1g, duration=60s, warmup=10s

| numjobs  | iodepth  | mode       | IOPS     | BW         | avg      | p50      | p95      | p99      | p99.9    |
|----------|----------|------------|----------|------------|----------|----------|----------|----------|----------|
| 1        | 1        | randwrite  | 193k     | 754.0 MiB/s | 5.15 µs  | 2.79 µs  | 12.83 µs | 13.65 µs | 22.48 µs |
| 1        | 16       | randwrite  | 608k     | 2373.0 MiB/s | 26.31 µs | 25.26 µs | 34.84 µs | 40.16 µs | 53.23 µs |
| 16       | 1        | randwrite  | 1117k    | 4362.0 MiB/s | 14.3 µs  | 14.89 µs | 23.68 µs | 29.79 µs | 39.66 µs |
| 16       | 16       | randwrite  | 1040k    | 4063.0 MiB/s | 246.11 µs | 222.31 µs | 505.13 µs | 901.61 µs | 1277.51 µs |
| 1        | 1        | randread   | 236k     | 924.0 MiB/s | 4.21 µs  | 2.38 µs  | 12.33 µs | 12.97 µs | 20.76 µs |
| 1        | 16       | randread   | 694k     | 2711.0 MiB/s | 23.03 µs | 25.11 µs | 29.24 µs | 37.22 µs | 52.57 µs |
| 16       | 1        | randread   | 2710k    | 10587.0 MiB/s | 5.88 µs  | 3.8 µs   | 15.16 µs | 28.89 µs | 100.26 µs |
| 16       | 16       | randread   | 5495k    | 21465.0 MiB/s | 46.56 µs | 45.05 µs | 71.7 µs  | 90.96 µs | 120.31 µs |

Throughput diff (normal vs fixed buffers)


| numjobs | iodepth | mode | IOPS before | IOPS after | Δ | BW before (MiB/s) | BW after (MiB/s) | Δ |
|---------|---------|-----------|-------------|------------|---------|-----------|----------|---------|
| 1       | 1       | randwrite | 185k        | 193k       | +4.3%   | 721.0     | 754.0    | +4.6%   |
| 1       | 16      | randwrite | 575k        | 608k       | +5.7%   | 2245.0    | 2373.0   | +5.7%   |
| 16      | 1       | randwrite | 893k        | 1117k      | +25.1%  | 3489.0    | 4362.0   | +25.0%  |
| 16      | 16      | randwrite | 805k        | 1040k      | +29.2%  | 3143.0    | 4063.0   | +29.3%  |
| 1       | 1       | randread  | 232k        | 236k       | +1.7%   | 906.0     | 924.0    | +2.0%   |
| 1       | 16      | randread  | 682k        | 694k       | +1.8%   | 2665.0    | 2711.0   | +1.7%   |
| 16      | 1       | randread  | 2663k       | 2710k      | +1.8%   | 10404.0   | 10587.0  | +1.8%   |
| 16      | 16      | randread  | 4955k       | 5495k      | +10.9%  | 19355.0   | 21465.0  | +10.9%  |
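The Δ columns are consistent with the usual relative-change formula, (after − before) / before, rounded to one decimal place. A minimal sketch for reproducing them follows; the function names are illustrative only.

```cpp
#include <cassert>
#include <cmath>

// Relative change from `before` to `after`, in percent.
inline double percentChange(double before, double after)
{
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

// Round to one decimal place, matching the precision used in the tables.
inline double roundTo1(double value)
{
    return std::round(value * 10.0) / 10.0;
}
```

For the 16-job, iodepth-1 randwrite row, `percentChange(893, 1117)` is about 25.08, which rounds to the +25.1% shown.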

Latency diff in µs (normal vs fixed buffers)

| numjobs | iodepth | mode      | avg                     | p50                     | p95                      | p99                       | p99.9                     |
|---------|---------|-----------|-------------------------|-------------------------|--------------------------|---------------------------|---------------------------|
| 1       | 1       | randwrite | 5.39→5.15 (−4.5%)       | 3.06→2.79 (−8.8%)       | 12.83→12.83 (0.0%)       | 13.57→13.65 (+0.6%)       | 22.39→22.48 (+0.4%)       |
| 1       | 16      | randwrite | 27.81→26.31 (−5.4%)     | 26.82→25.26 (−5.8%)     | 36.7→34.84 (−5.1%)       | 41.02→40.16 (−2.1%)       | 54.04→53.23 (−1.5%)       |
| 16      | 1       | randwrite | 17.89→14.3 (−20.1%)     | 18.79→14.89 (−20.8%)    | 26.6→23.68 (−11.0%)      | 37.29→29.79 (−20.1%)      | 52.2→39.66 (−24.0%)       |
| 16      | 16      | randwrite | 318.16→246.11 (−22.6%)  | 262.8→222.31 (−15.4%)   | 850.48→505.13 (−40.6%)   | 1333.68→901.61 (−32.4%)   | 1847.18→1277.51 (−30.8%)  |
| 1       | 1       | randread  | 4.29→4.21 (−1.9%)       | 2.42→2.38 (−1.7%)       | 12.35→12.33 (−0.2%)      | 12.99→12.97 (−0.2%)       | 20.79→20.76 (−0.1%)       |
| 1       | 16      | randread  | 23.43→23.03 (−1.7%)     | 25.55→25.11 (−1.7%)     | 29.8→29.24 (−1.9%)       | 38.12→37.22 (−2.4%)       | 53.12→52.57 (−1.0%)       |
| 16      | 1       | randread  | 5.98→5.88 (−1.7%)       | 3.91→3.8 (−2.8%)        | 15.24→15.16 (−0.5%)      | 29.43→28.89 (−1.8%)       | 98.45→100.26 (+1.8%)      |
| 16      | 16      | randread  | 51.63→46.56 (−9.8%)     | 50.92→45.05 (−11.5%)    | 80.68→71.7 (−11.1%)      | 99.89→90.96 (−8.9%)       | 126.67→120.31 (−5.0%)     |
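The throughput and latency tables can be cross-checked against each other. Assuming file-perf keeps numjobs × iodepth requests in flight at all times (an assumption about the benchmark, not something stated in this PR), Little's law gives average latency ≈ requests in flight / IOPS. The reported averages agree with that estimate to within a few hundredths of a microsecond in the rows checked here, which suggests the latency gains and the IOPS gains are two views of the same effect.

```cpp
#include <cassert>
#include <cmath>

// Little's law for a closed loop: avg latency = requests in flight / throughput.
// Returns microseconds; `iops` is operations per second.
inline double expectedAvgLatencyUs(unsigned numjobs, unsigned iodepth, double iops)
{
    assert(iops > 0.0);
    return static_cast<double>(numjobs) * iodepth / iops * 1e6;
}
```

For example, 16 jobs at iodepth 16 and 4955k IOPS gives roughly 51.7 µs, against the 51.63 µs average reported for randread without fixed buffers.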

vadimskipin

Cool! I thought about this optimization but have not tried it. The results are really impressive.

Copilot

Pull request overview

Adds io_uring “fixed/registered buffer” support to FiberScheduler and integrates it into the file-perf benchmark to avoid per-IO buffer pinning/allocation overhead.

Changes:

Added FiberScheduler::registerBuffers(), readFixed(), and writeFixed() APIs backed by liburing helpers.
Updated file-perf to optionally register per-job buffers and issue IORING_OP_READ_FIXED / IORING_OP_WRITE_FIXED via a new --fixed-buffers flag.
Extended the bb perf runner to pass through --fixed-buffers for file-perf runs.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|------|-------------|
| src/perf/file-perf.cpp | Adds a `--fixed-buffers` mode that registers per-job buffers and switches IO submission to fixed-buffer ops. |
| src/fibers/fiber.cpp | Implements `readFixed`, `writeFixed`, and `registerBuffers` on the scheduler’s per-CPU rings. |
| include/silk/fibers/fiber.h | Exposes and documents the new fixed-buffer APIs on the public scheduler interface. |
| bb | Adds CLI plumbing to enable fixed-buffer mode for `file-perf` via `bb perf` and `bb file-perf`. |

vadimskipin · 2026-06-14T07:19:04Z

+        {
+            continue;
+        }
+        int r = ::io_uring_register_buffers(&processor->ring, iovecs, count);


What about NUMA awareness here? Allocating once and using the buffers on any CPU does not look optimal. Should we maintain separate buffers per node (or per CPU)?

kavirajk

Fair point 👍 I think this may require changes to all three new APIs (read_fixed/write_fixed/register_buffers). Currently the buffers can physically live on one node, and cores on other nodes have to pay the remote-memory cost.

I'm thinking of a simple struct, something like:

struct FixedBuf {
    void * base[SILK_MAX_NUMA_NODES]; // node-local pinned bases
    uint32_t index;                   // node-relative index
    uint32_t len;
};

and making the read_fixed and write_fixed APIs accept this FixedBuf along with an offset, instead of a plain void * pointer.

What do you think? Maybe it's too complex? I'm open to other ideas if you have a simpler approach (I'm not super familiar with NUMA in general :) )
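To make the proposal above concrete, here is a minimal sketch of how such a struct might resolve a node-local address for a fixed read/write. It is an illustration only: the value of `SILK_MAX_NUMA_NODES` and the `resolveLocal` helper are assumptions, and the thread below ends up not adopting this design.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical value; the thread does not define SILK_MAX_NUMA_NODES.
constexpr size_t SILK_MAX_NUMA_NODES = 8;

struct FixedBuf
{
    void * base[SILK_MAX_NUMA_NODES]; // node-local pinned bases
    uint32_t index;                   // node-relative index
    uint32_t len;
};

// Hypothetical helper: the address a fixed read/write issued from `node`
// would use, or nullptr if that node has no copy or the offset is out of range.
inline void * resolveLocal(const FixedBuf & buf, size_t node, uint32_t offset)
{
    assert(node < SILK_MAX_NUMA_NODES);
    if (buf.base[node] == nullptr || offset >= buf.len)
        return nullptr;
    return static_cast<char *>(buf.base[node]) + offset;
}
```

The cost of this shape is visible even in the sketch: every logical buffer needs one pinned copy per node, and the caller must know which node it is running on at submission time.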

vadimskipin

It seems silk just needs to expose a raw register-buffer API. It would be better to write the client code first and then decide what can be pushed into silk.

kavirajk

After a bit of experimenting, I think it's better to keep the NUMA-aware part out of silk's registerBuffers API.

Here is the rationale.

Essentially, io_uring_register_buffers() just registers the given buffers (blocks of memory addresses) for use on the kernel side. It doesn't do any malloc or memset; that is up to the client. Whichever (CPU, node) touches that memory first (via memset, for example) determines the node on which it becomes resident. See, for example, [how we do the aligned allocation and memset in file-perf itself](https://github.com/kavirajk/silk/blob/c8ee4a7c3189c27b4cac905016cd8a4421e1cfbe/src/perf/file-perf.cpp?plain=1#L209-L211).
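A minimal sketch of the allocate-then-touch pattern described above, assuming Linux's default memory policy, under which a page is normally placed on the node of the CPU that first touches it. The helper name `allocTouched` is hypothetical and this is not the file-perf code.

```cpp
#include <unistd.h> // sysconf

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: allocate a page-aligned buffer and touch every page
// from the calling thread, so the pages become resident before the buffer is
// handed to io_uring_register_buffers(). Returns nullptr on failure.
inline void * allocTouched(size_t len)
{
    assert(len > 0);
    const long page = ::sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return nullptr;
    void * p = nullptr;
    if (::posix_memalign(&p, static_cast<size_t>(page), len) != 0)
        return nullptr;
    std::memset(p, 0, len); // first touch: faults the pages in on this thread
    return p;
}
```

Calling this from a thread pinned to a given node is what makes the buffer "local" to that node; the registration call itself does not choose placement. Release the buffer with `std::free()` once it is unregistered.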

The nature of work-stealing in silk is itself in tension with this. With a "buffer on a specific node" approach, no amount of care in making the buffer resident on the local node guarantees locality once the fiber is stolen by a CPU on another node; it pays the remote-node cost anyway. I don't think we should complicate things at this point by using move_pages to migrate those addresses to the local node when work is stolen.

I ran a few tests with numactl --membind and --cpubind to understand the remote-node cost in file-perf. If users really want to take advantage of "local node buffer registration", they can bind to a node via numactl --cpunodebind=N to avoid work-stealing by CPUs on a different node. Even that doesn't need any API changes on the silk side.

My honest take is that we should leave the registerBuffers API as is and document its "shared nature of the buffer and how work-stealing can add remote node latency".

Curious to know your opinion @vadimskipin

kavirajk · 2026-06-24T09:40:01Z

@vadimskipin curious to get your thoughts here.
#72 (comment)

vadimskipin

So the client is responsible for picking the correct buffer index. OK, looks good.
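One way a client could take on that responsibility is to register its buffers grouped by NUMA node and compute the index from the node it is currently running on. The layout and the `pickBufferIndex` helper below are purely hypothetical; the PR does not prescribe any scheme.

```cpp
#include <cassert>

// Hypothetical layout: buffers are registered node-major, with `perNode`
// buffers for each of `nodes` NUMA nodes. Returns the index to pass as the
// buffer index of a fixed read/write, or -1 if the arguments are out of range.
inline int pickBufferIndex(unsigned node, unsigned slot, unsigned nodes, unsigned perNode)
{
    if (node >= nodes || slot >= perNode)
        return -1;
    return static_cast<int>(node * perNode + slot);
}
```

With two nodes and 16 buffers per node, slot 3 on node 1 maps to index 19. As noted in the discussion, a stolen fiber may still end up using a buffer that is remote to its new CPU.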

vadimskipin · 2026-06-24T17:27:06Z

It would probably be better to rename this file to fiber-io-fixed-test.cpp.

vadimskipin

Need to squash commits

praktika-gh · 2026-06-24T17:49:32Z

Workflow [PR], commit [e3eea85]

vadimskipin · 2026-06-24T17:51:38Z

Please rebase your changes. Main branch must have a clean linear history of commits. No merge commits!

Add support for new APIs to the scheduler:

1. register_buffers()
2. read_fixed()
3. write_fixed()

This lets us register pre-allocated buffers that io_uring can use during I/O operations rather than allocating them per I/O.

This is mainly based on best practices learned from the TUM DBMS paper https://arxiv.org/pdf/2512.04859

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

Add a flag to run file-perf with the registered-buffer io_uring API:

```
./bb -b release perf --duration 60s --warmup 10s file --fixed-buffers
```

The numbers look very promising, so it is worth adding upstream.

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

`./bb fmt` Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

Changes:

1. Make sure the readFixed API on the registered buffer is checked by MSan for uninitialized memory (similar to the readv API).
2. Fix the type of the nbytes length field (uint64_t -> uint32_t), because that is what the underlying io_uring_* API expects.
3. Add a round-trip test for the new API.

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>
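On the length fix in item 2: liburing's io_uring_prep_read_fixed() and io_uring_prep_write_fixed() take the byte count as `unsigned`, so a 64-bit length has to be narrowed somewhere. A minimal sketch of a checked conversion is shown below; the helper name `toFixedLen` is hypothetical and is not how silk does it.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper: narrow a 64-bit byte count to the 32-bit length a
// fixed read/write accepts, refusing values that would silently truncate.
inline std::optional<uint32_t> toFixedLen(uint64_t nbytes)
{
    if (nbytes > std::numeric_limits<uint32_t>::max())
        return std::nullopt;
    return static_cast<uint32_t>(nbytes);
}
```

Taking `uint32_t` directly in the public signature, as the commit does, moves this check to the caller and rules out silent truncation by construction.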

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

Document the new apis in corresponding docs Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

kavirajk · 2026-06-24T18:33:23Z

thanks for the review @vadimskipin. Rebased with clean commit history.

vadimskipin

Please squash commits when merging!

kavirajk · 2026-06-24T18:45:19Z

@vadimskipin I don't have permission to merge :)

vadimskipin reviewed Jun 14, 2026

Comment thread src/fibers/fiber.cpp Outdated

vadimskipin requested a review from Copilot June 14, 2026 06:43

Copilot started reviewing on behalf of vadimskipin June 14, 2026 06:44

Copilot AI reviewed Jun 14, 2026

Comment thread src/fibers/fiber.cpp Outdated

Comment thread src/fibers/fiber.cpp Outdated

Comment thread src/fibers/fiber.cpp

Comment thread include/silk/fibers/fiber.h

vadimskipin reviewed Jun 14, 2026

kavirajk marked this pull request as ready for review June 14, 2026 23:03

kavirajk requested a review from vadimskipin June 14, 2026 23:03

vadimskipin previously approved these changes Jun 24, 2026

vadimskipin reviewed Jun 24, 2026

kavirajk dismissed vadimskipin’s stale review via ed7cffd June 24, 2026 17:54

kavirajk added 8 commits June 24, 2026 17:57

chore: make formatter happy

aa6ba0c

`./bb fmt` Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

chore: doc strings and assert fix

c6fa6c6

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

chore: fix ASSERT convention with key=value

3f68265

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

doc: update README, perf and scheduler doc

aaaf64a

Document the new apis in corresponding docs Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

chore: rename the test file and doc comment

3944ece

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

kavirajk force-pushed the feat/registered-buffers branch from ed7cffd to 3944ece Compare June 24, 2026 17:59

chore: fix build with explicit size_t convertion

e3eea85

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

kavirajk requested a review from vadimskipin June 24, 2026 18:32

vadimskipin approved these changes Jun 24, 2026

vadimskipin merged commit 3b14bc8 into ClickHouse:main Jun 24, 2026
15 checks passed

kavirajk mentioned this pull request Jun 28, 2026

Potential io_uring optimizations, experiments and improvements. #92

Open

9 tasks

Review thread comments: vadimskipin (Jun 14, 2026), kavirajk (Jun 14, 2026), vadimskipin (Jun 18, 2026), kavirajk (Jun 22, 2026)
