feat(io_uring): Add support for registered buffers#72

Draft
kavirajk wants to merge 3 commits into ClickHouse:main from kavirajk:feat/registered-buffers
kavirajk commented Jun 12, 2026

Changes

Added the following new APIs to the scheduler:

  1. registerBuffers()
  2. readFixed()
  3. writeFixed()

These internally use the liburing helpers `io_uring_register_buffers()`, `io_uring_prep_read_fixed()`, and `io_uring_prep_write_fixed()`.

This lets us register pre-allocated buffers once so that io_uring can reuse them across I/O operations, rather than allocating them per I/O.
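
To make the buffer-index idea concrete, here is a minimal Python model of the pattern. It is not io_uring and not the scheduler code from this PR: `FixedBufferPool` and `read_fixed` are hypothetical names, and plain file I/O stands in for the kernel path. It only shows buffers being allocated once and then referenced by index on every read.

```python
import tempfile


class FixedBufferPool:
    """Hypothetical stand-in for a registered-buffer table."""

    def __init__(self, count: int, size: int) -> None:
        # Allocate every buffer once, up front. In the real scheduler this is
        # the set of buffers that registerBuffers() would hand to io_uring.
        self.buffers = [bytearray(size) for _ in range(count)]
        self.free = list(range(count))

    def acquire(self) -> int:
        # Return the index of a free buffer (IndexError if none are left).
        return self.free.pop()

    def release(self, index: int) -> None:
        # Make the buffer available again; its memory is never freed.
        self.free.append(index)


def read_fixed(f, pool: FixedBufferPool, index: int, nbytes: int, offset: int) -> int:
    """Read nbytes at offset into the pre-allocated buffer at `index`.

    Uses ordinary seek + readinto as a stand-in; no allocation happens here.
    """
    f.seek(offset)
    view = memoryview(pool.buffers[index])[:nbytes]
    return f.readinto(view)


def demo() -> bytes:
    pool = FixedBufferPool(count=2, size=4096)
    with tempfile.TemporaryFile(buffering=0) as f:
        f.write(b"0123456789")
        idx = pool.acquire()
        n = read_fixed(f, pool, idx, nbytes=4, offset=2)
        data = bytes(pool.buffers[idx][:n])
        pool.release(idx)
    return data


if __name__ == "__main__":
    print(demo())
```

The point of the model is the calling convention: callers pass a buffer index rather than a freshly allocated buffer, which is the same shape as the `buf_index` argument of `io_uring_prep_read_fixed()`.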

This is mainly based on best practices learned from the TUM DBMS paper:
https://arxiv.org/pdf/2512.04859

I also integrated this with the file-perf benchmark, and the numbers look promising. See below for the measured improvements.

Performance

My setup is an m8id.8xlarge EC2 instance.

Normal

ec2-dev$ ./bb -b release perf --duration 60s --warmup 10s file
2026-06-12 21:48:21.567 [INFO ] bb:1986: command=perf preset=release
[0/2] Re-checking globbed directories...
ninja: no work to do.

## file-perf -- async file I/O

file=/dev/shm/file-perf.bin, bs=4k, size=1g, duration=60s, warmup=10s

| numjobs  | iodepth  | mode       | IOPS     | BW         | avg      | p50      | p95      | p99      | p99.9    |
|----------|----------|------------|----------|------------|----------|----------|----------|----------|----------|
| 1        | 1        | randwrite  | 185k     | 721.0 MiB/s | 5.39 µs  | 3.06 µs  | 12.83 µs | 13.57 µs | 22.39 µs |
| 1        | 16       | randwrite  | 575k     | 2245.0 MiB/s | 27.81 µs | 26.82 µs | 36.7 µs  | 41.02 µs | 54.04 µs |
| 16       | 1        | randwrite  | 893k     | 3489.0 MiB/s | 17.89 µs | 18.79 µs | 26.6 µs  | 37.29 µs | 52.2 µs  |
| 16       | 16       | randwrite  | 805k     | 3143.0 MiB/s | 318.16 µs | 262.8 µs | 850.48 µs | 1333.68 µs | 1847.18 µs |
| 1        | 1        | randread   | 232k     | 906.0 MiB/s | 4.29 µs  | 2.42 µs  | 12.35 µs | 12.99 µs | 20.79 µs |
| 1        | 16       | randread   | 682k     | 2665.0 MiB/s | 23.43 µs | 25.55 µs | 29.8 µs  | 38.12 µs | 53.12 µs |
| 16       | 1        | randread   | 2663k    | 10404.0 MiB/s | 5.98 µs  | 3.91 µs  | 15.24 µs | 29.43 µs | 98.45 µs |
| 16       | 16       | randread   | 4955k    | 19355.0 MiB/s | 51.63 µs | 50.92 µs | 80.68 µs | 99.89 µs | 126.67 µs |

Fixed Buffers

ec2-dev$ ./bb -b release perf --duration 60s --warmup 10s file --fixed-buffers
2026-06-12 21:58:31.006 [INFO ] bb:1986: command=perf preset=release
[0/2] Re-checking globbed directories...
ninja: no work to do.

## file-perf -- async file I/O

file=/dev/shm/file-perf.bin, bs=4k, size=1g, duration=60s, warmup=10s

| numjobs  | iodepth  | mode       | IOPS     | BW         | avg      | p50      | p95      | p99      | p99.9    |
|----------|----------|------------|----------|------------|----------|----------|----------|----------|----------|
| 1        | 1        | randwrite  | 193k     | 754.0 MiB/s | 5.15 µs  | 2.79 µs  | 12.83 µs | 13.65 µs | 22.48 µs |
| 1        | 16       | randwrite  | 608k     | 2373.0 MiB/s | 26.31 µs | 25.26 µs | 34.84 µs | 40.16 µs | 53.23 µs |
| 16       | 1        | randwrite  | 1117k    | 4362.0 MiB/s | 14.3 µs  | 14.89 µs | 23.68 µs | 29.79 µs | 39.66 µs |
| 16       | 16       | randwrite  | 1040k    | 4063.0 MiB/s | 246.11 µs | 222.31 µs | 505.13 µs | 901.61 µs | 1277.51 µs |
| 1        | 1        | randread   | 236k     | 924.0 MiB/s | 4.21 µs  | 2.38 µs  | 12.33 µs | 12.97 µs | 20.76 µs |
| 1        | 16       | randread   | 694k     | 2711.0 MiB/s | 23.03 µs | 25.11 µs | 29.24 µs | 37.22 µs | 52.57 µs |
| 16       | 1        | randread   | 2710k    | 10587.0 MiB/s | 5.88 µs  | 3.8 µs   | 15.16 µs | 28.89 µs | 100.26 µs |
| 16       | 16       | randread   | 5495k    | 21465.0 MiB/s | 46.56 µs | 45.05 µs | 71.7 µs  | 90.96 µs | 120.31 µs |

Throughput diff (normal vs fixed buffers)


| numjobs | iodepth | mode | IOPS before | IOPS after | IOPS Δ | BW before (MiB/s) | BW after (MiB/s) | BW Δ |
|---------|---------|-----------|-------------|------------|---------|-----------|----------|---------|
| 1       | 1       | randwrite | 185k        | 193k       | +4.3%   | 721.0     | 754.0    | +4.6%   |
| 1       | 16      | randwrite | 575k        | 608k       | +5.7%   | 2245.0    | 2373.0   | +5.7%   |
| 16      | 1       | randwrite | 893k        | 1117k      | +25.1%  | 3489.0    | 4362.0   | +25.0%  |
| 16      | 16      | randwrite | 805k        | 1040k      | +29.2%  | 3143.0    | 4063.0   | +29.3%  |
| 1       | 1       | randread  | 232k        | 236k       | +1.7%   | 906.0     | 924.0    | +2.0%   |
| 1       | 16      | randread  | 682k        | 694k       | +1.8%   | 2665.0    | 2711.0   | +1.7%   |
| 16      | 1       | randread  | 2663k       | 2710k      | +1.8%   | 10404.0   | 10587.0  | +1.8%   |
| 16      | 16      | randread  | 4955k       | 5495k      | +10.9%  | 19355.0   | 21465.0  | +10.9%  |
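
The Δ columns are plain relative change, (after − before) / before. As a quick way to recompute them, here is a small sketch using three IOPS rows copied from the table above (values in thousands); `pct_change` is just an illustrative helper name.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (numjobs, iodepth, mode, IOPS before, IOPS after), IOPS in thousands,
# copied from the throughput table.
rows = [
    (16, 1, "randwrite", 893, 1117),
    (16, 16, "randwrite", 805, 1040),
    (16, 16, "randread", 4955, 5495),
]

for numjobs, iodepth, mode, before, after in rows:
    delta = pct_change(before, after)
    print(f"{numjobs:>2} {iodepth:>2} {mode:<9} {delta:+.1f}%")
```

Because the inputs are already rounded to the nearest thousand IOPS, the last digit can differ slightly from a value computed on raw counters.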



Latency diff (normal vs fixed buffers)

| numjobs | iodepth | mode      | avg (µs) | p50 (µs) | p95 (µs) | p99 (µs) | p99.9 (µs) |
|---------|---------|-----------|-------------------------|-------------------------|--------------------------|---------------------------|---------------------------|
| 1       | 1       | randwrite | 5.39→5.15 (−4.5%)       | 3.06→2.79 (−8.8%)       | 12.83→12.83 (0.0%)       | 13.57→13.65 (+0.6%)       | 22.39→22.48 (+0.4%)       |
| 1       | 16      | randwrite | 27.81→26.31 (−5.4%)     | 26.82→25.26 (−5.8%)     | 36.7→34.84 (−5.1%)       | 41.02→40.16 (−2.1%)       | 54.04→53.23 (−1.5%)       |
| 16      | 1       | randwrite | 17.89→14.3 (−20.1%)     | 18.79→14.89 (−20.8%)    | 26.6→23.68 (−11.0%)      | 37.29→29.79 (−20.1%)      | 52.2→39.66 (−24.0%)       |
| 16      | 16      | randwrite | 318.16→246.11 (−22.6%)  | 262.8→222.31 (−15.4%)   | 850.48→505.13 (−40.6%)   | 1333.68→901.61 (−32.4%)   | 1847.18→1277.51 (−30.8%)  |
| 1       | 1       | randread  | 4.29→4.21 (−1.9%)       | 2.42→2.38 (−1.7%)       | 12.35→12.33 (−0.2%)      | 12.99→12.97 (−0.2%)       | 20.79→20.76 (−0.1%)       |
| 1       | 16      | randread  | 23.43→23.03 (−1.7%)     | 25.55→25.11 (−1.7%)     | 29.8→29.24 (−1.9%)       | 38.12→37.22 (−2.4%)       | 53.12→52.57 (−1.0%)       |
| 16      | 1       | randread  | 5.98→5.88 (−1.7%)       | 3.91→3.8 (−2.8%)        | 15.24→15.16 (−0.5%)      | 29.43→28.89 (−1.8%)       | 98.45→100.26 (+1.8%)      |
| 16      | 16      | randread  | 51.63→46.56 (−9.8%)     | 50.92→45.05 (−11.5%)    | 80.68→71.7 (−11.1%)      | 99.89→90.96 (−8.9%)       | 126.67→120.31 (−5.0%)     |
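
As a sanity check that the latency and throughput numbers agree, Little's law gives IOPS ≈ (numjobs × iodepth) / average latency. The sketch below applies it to the 16×16 randwrite rows. It assumes each job keeps `iodepth` requests in flight for the whole run, which is an assumption about how file-perf drives I/O rather than something stated in this PR.

```python
def expected_iops(numjobs: int, iodepth: int, avg_latency_us: float) -> float:
    """Little's law: throughput = requests in flight / mean latency."""
    in_flight = numjobs * iodepth
    return in_flight / (avg_latency_us * 1e-6)


# (label, numjobs, iodepth, avg latency in µs, reported IOPS), from the tables above.
samples = [
    ("normal", 16, 16, 318.16, 805_000),
    ("fixed buffers", 16, 16, 246.11, 1_040_000),
]

for label, numjobs, iodepth, avg_us, reported in samples:
    estimate = expected_iops(numjobs, iodepth, avg_us)
    error = abs(estimate - reported) / reported * 100.0
    print(f"{label:<14} estimated {estimate / 1000:.0f}k, reported {reported // 1000}k, off by {error:.2f}%")
```

Both estimates land within 1% of the reported IOPS, so the latency drop and the throughput gain describe the same effect.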

kavirajk added 3 commits June 12, 2026 22:03
Add support for new APIs to the scheduler

1. register_buffers()
2. read_fixed()
3. write_fixed()

This lets us register pre-allocated buffers that io_uring can use
during I/O operations rather than allocating them per I/O.

This is mainly based on best practices learned from TUM DBMS paper
https://arxiv.org/pdf/2512.04859

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

Add a flag to run file-perf with the registered-buffer io_uring API

```
./bb -b release perf --duration 60s --warmup 10s file --fixed-buffers
```

The numbers look very promising, so this is worth adding upstream.

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>

`./bb fmt`

Signed-off-by: Kaviraj <kavirajkanagaraj@gmail.com>