
fio: prevent OOM by not allocating buffer for TRIM range #2069

Open
dennischerchang wants to merge 1 commit into axboe:master from
dennischerchang:prevent_large_trim_size_and_high_iodepth_causing_oom

Conversation

@dennischerchang


Large TRIM ranges combined with high iodepth can cause memory usage
to spike and trigger Out Of Memory (OOM) errors. Since TRIM is a
metadata-only operation, it does not require a data payload buffer.
This patch adds a function to calculate the buffer size without DDIR_TRIM,
preventing unnecessary memory allocation.

Fixes: Issue axboe#2056 axboe#2056
Signed-off-by: Dennis Chang <cherhungc@google.com>
@sitsofe
Collaborator

sitsofe commented Mar 7, 2026

@dennischerchang:

Please don't close an existing PR and open a new one to continue a minor rewrite. If you use git's (interactive) rebase and then force-push your changes to the same branch, you can keep updating an existing PR even after squashing, which has the benefit of preserving the prior PR conversation.

A few notes:

  • Your Fixes line references the same issue twice
  • I did a quick rg --type c '\?.*"' and fio house style is not to have brackets on the first part of a ternary

My biggest concern is that a fuller motivating example needs to be included in the commit message because it's not obvious how to hit the problem that is being resolved:

I tried to reproduce the problem with the following but ran into various issues:

sudo -s
modprobe null_blk memory_backed=1 discard=1
rm -f trace.*; blktrace -o trace -d /dev/nullb0 & rm -f log; ./fio --name=makelog --write_iolog=log --rw=trim --bs=4k,4k,4g --filename /dev/nullb0 --number_ios=16 --iodepth=4 --ioengine=io_uring; killall blktrace;
blkparse -i trace -d trace.p
./fio --thread --name=replay --read_iolog trace.p --ioengine=posixaio --iodepth=4 --bs=128,128,128 --rw=trim

@dennischerchang
Author

dennischerchang commented Mar 9, 2026

Please don't close an existing PR and open a new one to continue a minor rewrite.

Will be careful next time. Sorry about the confusion.

but how do you read a value of 4*1024**3 when the bytes field in a binary blkparse log is only a u32

Correct. The original command was actually a TRIM of 3.8 TB (3,801,035,759,616 or 0x00000374ff62e000) that overflowed when converting to blktrace and became roughly 4.2 GB (4_284_669_952 or 0x00000000ff62e000; the high bytes 0x0000037400000000 are dropped).

What ioengine is being used and what command line options are being used?

The original ioengine being used is nvme_kq. There is no issue sending the 3.8 TB TRIM itself, but there is an OOM during replay. I should have made it clear from the beginning that this is a replay-specific issue. My apologies.

how come you're hitting memory problems when the memory is only allocated but would otherwise be untouched

Here is the message:

Before being killed:
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1599374 root      20   0 2045.4g 112.5g   3712 S  96.7  89.6   0:19.81 fio_ssd

dmesg:
[487219.869947] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=sys,mems_allowed=0,oom_memcg=sys,task_memcg=/sys,task=fio_ssd,pid=1599374,uid=0
[487219.869976] Memory cgroup out of memory: Killed process 1599374 (fio_ssd) total-vm:2144714648kB, anon-rss:6086320kB, file-rss:1280kB, shmem-rss:2176kB, UID:0 pgtables:12452kB oom_score_adj:0, hugetlb-usage:0kB

@dennischerchang
Author

dennischerchang commented Mar 9, 2026

Here is the command sequence after converting to blktrace

--debug=blktrace

// 10 flushes
blktrace 1606666 store flush delay=0
blktrace 1606666 store flush delay=3000021
blktrace 1606666 store flush delay=3000000
blktrace 1606666 store flush delay=3000053
blktrace 1606666 store flush delay=12000066
blktrace 1606666 store flush delay=3000052
blktrace 1606666 store flush delay=6000003
blktrace 1606666 store flush delay=9000035
blktrace 1606666 store flush delay=6000065
blktrace 1606666 store flush delay=9000073
// 11th event: trim
blktrace 1606666 store discard, off=39470252032, len=4284669952, delay=7155652
// all writes starting from 12th event
blktrace 1606666 store ddir=1, off=39477592064, len=65536, delay=684366
blktrace 1606666 store ddir=1, off=39477657600, len=65536, delay=2
blktrace 1606666 store ddir=1, off=39477723136, len=65536, delay=0
blktrace 1606666 store ddir=1, off=39477788672, len=65536, delay=0

--debug=mem (iodepth=512)

mem      1608550 io_u alloc 0x7f66d485cc40, index 509
mem      1608550 io_u alloc 0x7f66d485df00, index 510
mem      1608550 io_u alloc 0x7f66d485f1c0, index 511
// 2193751023614 =~ 2TiB
mem      1608550 Alloc 2193751023614 for buffers
mem      1608550 malloc 2193751023614 0x7d67db306010
mem      1608550 io_u alloc 0x7f66dc001b80, index 0
mem      1608550 io_u 0x7f66dc001b80, mem 0x7d67db307000
mem      1608550 io_u alloc 0x7f66d46064c0, index 1
mem      1608550 io_u 0x7f66d46064c0, mem 0x7d68da935000
// OOM
Killed1 (f=0)

Another experiment I did is to use read_iolog_chunked=1.
It still hits OOM with the default code, but if I change the default from 10 to 12, there is no OOM.

int64_t iolog_items_to_fetch(struct thread_data *td)
{
        if (!td->io_log_highmark)
                return 10; // no OOM if changed to 12

Note: the existing code does not handle this trace properly because of a corner case.

iolog_items_to_fetch() will return 0 and not fetch any more when the consumption rate is less than 1 item per second. In this case the consumption rate over the first 10 events is less than 0.03 events per second, so it exits after consuming the first 10 flushes without fetching more.

I made some changes on my end to proceed (e.g. still fetch more even when there are 0 events per second). With the default first-batch size changed from 10 to 12 it does not OOM. I think the OOM still happens with the default of 10 even when fetching at 0 eps, but I am not sure of my memory right now.

Summary (with iodepth=512)
read_iolog_chunked=0: OOM
read_iolog_chunked=1 with default fetch 10 items + fetch more when 0 eps: OOM
read_iolog_chunked=1 with fetch 12 items: no OOM

@sitsofe
Collaborator

sitsofe commented Mar 9, 2026

@dennischerchang:

Will be careful next time. Sorry about the confusion.

Not to worry.

Original ioengine being used is nvme_kq

I'm not familiar with that fio ioengine (nor does a quick search on Google turn up anything). Is it an open source ioengine? Are you also able to reproduce the problem with something like the plain io_uring ioengine so we can rule that out as a factor?

What ioengine is being used and what command line options are being used?

It might help if you can share your full (replay) command line/job file.

The original command was actually a TRIM of 3.8 TB (3,801,035,759,616 or 0x00000374ff62e000) that overflowed when converting to blktrace and became roughly 4.2 GB (4_284_669_952 or 0x00000000ff62e000; the high bytes 0x0000037400000000 are dropped).

Ah that explains that. The wraparound issue extends beyond fio though... I assume the non-processed blktrace trace file stores sizes in units of 512 and thus can represent larger values than the blkparse processed trace file.

Before being killed:
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1599374 root      20   0 2045.4g 112.5g   3712 S  96.7  89.6   0:19.81 fio_ssd

That RES is indeed huge. Which part of fio do you believe is causing the high usage - is it storing the log itself pushing usage through the roof or is there more to it e.g. is a high iodepth also required (but don't stop there)?

the iolog_items_to_fetch() will return 0 and does not fetch any more when the consumption rate is less than 1 item per second. In this case the consumption rate of first 10 events is less than 0.03 event per second so it will exit after consuming out first 10 flushes without fetching more.

Another problem! Perhaps a command line option to control it would be useful...

One more question: are you able to "manually" create a synthetic text based iolog that shows the OOM while using a built-in ioengine (note that text iologs have the same 32 bit length restriction for now)?

@dennischerchang
Author

dennischerchang commented Mar 10, 2026

ioengine=nvme_kq
One more question: are you able to "manually" create a synthetic text based iolog that shows the OOM while using a built-in ioengine (note that text iologs have the same 32 bit length restriction for now)?

nvme_kq is a Google proprietary ioengine, but I think the ioengine is irrelevant. I can reproduce the OOM when replaying against any ioengine, such as libaio.

large TRIM size, overflow, etc.

It is independent of this issue and we can discuss it in another thread.

That RES is indeed huge. Which part of fio do you believe is causing the high usage - is it storing the log itself pushing usage through the roof or is there more to it e.g. is a high iodepth also required (but don't stop there)?

It is the I/O buffer which fio sends along with each read/write event. fio allocates one contiguous buffer (2 TiB of contiguous virtual address space) for the whole iodepth at once.

int init_io_u_buffers(struct thread_data *td)
{
	max_units = td->o.iodepth; // 512
	max_bs = td_max_bs(td); // =~ 4GB, which is what this proposed change tries to fix
	// 4GB * 512 =~ 2TB
	td->orig_buffer_size = (unsigned long long) max_bs
					* (unsigned long long) max_units;
			
int allocate_io_mem(struct thread_data *td)
{
	total_mem = td->orig_buffer_size;
	// malloc(2TB), BOOOOOOOOM
	ret = alloc_mem_malloc(td, total_mem);

@vincentkfu
Collaborator

Here is a way to trigger the issue on my system with 16G of RAM:

vincent@fedora:~/fio-dev/fio-2069$ cat iolog3
fio version 3 iolog
30 /dev/nvme1n1 add
10731 /dev/nvme1n1 open
10736 /dev/nvme1n1 trim 0 2147483648
10752 /dev/nvme1n1 write 0 4096
10780 /dev/nvme1n1 trim 4096 2147483648
10782 /dev/nvme1n1 write 4096 4096
10784 /dev/nvme1n1 trim 8192 2147483648
10785 /dev/nvme1n1 write 8192 4096
10787 /dev/nvme1n1 trim 12288 2147483648
10788 /dev/nvme1n1 write 12288 4096
10791 /dev/nvme1n1 trim 16384 2147483648
10793 /dev/nvme1n1 write 16384 4096
10795 /dev/nvme1n1 trim 20480 2147483648
10797 /dev/nvme1n1 write 20480 4096
10798 /dev/nvme1n1 trim 24576 2147483648
10799 /dev/nvme1n1 write 24576 4096
10801 /dev/nvme1n1 trim 28672 2147483648
10802 /dev/nvme1n1 write 28672 4096
10804 /dev/nvme1n1 trim 32768 2147483648
10806 /dev/nvme1n1 write 32768 4096
10808 /dev/nvme1n1 trim 36864 2147483648
10809 /dev/nvme1n1 write 36864 4096
10892 /dev/nvme1n1 close
vincent@fedora:~/fio-dev/fio-2069$ sudo ./fio --name=test --iodepth=16 --read_iolog=iolog3 --debug=mem
fio: set debug option mem
test: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=psync, iodepth=16
fio-3.41
Starting 1 process
note: both iodepth >= 1 and synchronous I/O engine are selected, queue depth will be capped at 1
mem      9158  io_u alloc 0x12ce200, index 0
mem      9158  io_u alloc 0x12ce3c0, index 1
mem      9158  io_u alloc 0x12ce5c0, index 2
mem      9158  io_u alloc 0x12ce780, index 3
mem      9158  io_u alloc 0x12ce980, index 4
mem      9158  io_u alloc 0x12ceb40, index 5
mem      9158  io_u alloc 0x12ced40, index 6
mem      9158  io_u alloc 0x12cef00, index 7
mem      9158  io_u alloc 0x12cf100, index 8
mem      9158  io_u alloc 0x12cf2c0, index 9
mem      9158  io_u alloc 0x12cf4c0, index 10
mem      9158  io_u alloc 0x12cf680, index 11
mem      9158  io_u alloc 0x12cf880, index 12
mem      9158  io_u alloc 0x12cfa40, index 13
mem      9158  io_u alloc 0x12cfc40, index 14
mem      9158  io_u alloc 0x12cfe00, index 15
mem      9158  Alloc 34359738368 for buffers
mem      9158  malloc 34359738368 (nil)
fio: pid=9158, err=12/file:memory.c:341, func=iomem allocation, error=Cannot allocate memory
mem      9158  free malloc mem (nil)


Run status group 0 (all jobs):

However, this patch does not resolve the issue because by default --trim_verify_zero=1, so o->trim_zero==1. The condition to check would be td->trim_verify && td->o.trim_zero. To make the code easier to read, please just use an if statement instead of a ternary operator.

@sitsofe
Collaborator

sitsofe commented Mar 11, 2026

@vincentkfu ah thanks - that's what I was looking for!

So basically if there is even a single read or write then space will be reserved for the largest sized I/O (which might be a giant trim) * iodepth; my previous job only had trims so it never triggered the problem. I can see how a revised patch would address the author's original issue.

@sitsofe
Collaborator

sitsofe commented Mar 11, 2026

@vincentkfu

Am I correct in thinking that:

  1. trim_verify_zero is only really for verifying that trimming after writing returns zeroes?
  2. trim_verify_zero=1 only works when trim_percentage > 0 AND trim_backlog > 0?
  3. rw=write trim_verify_zero=1 trim_percentage=100 trim_backlog=1 will only send trims that are the size of the write block?
  4. If you have a pure trim workload (e.g. rw=trim) that doesn't do writes, then trim_verify_zero=1 trim_percentage=100 trim_backlog=1 verify=crc32c won't actually trigger any verification reads?

(I've been using a job like:

modprobe null_blk memory_backed=1 discard=1
./fio --debug=io --name=trimzerotest --filename=/dev/nullb0 --rw=write --verify_pattern=0xff --verify=pattern --bs=4k,8k,16k --size=10m --trim_percentage=100 --trim_backlog=1 --number_ios=4
hexdump -C /dev/nullb0 -n 256

and varying the rw values to test various things out)

Additionally what do you think of td_max_rw_bs() being kept only inside backend.c as for now that's the only user? It would save fio.h carrying code that has to be continually optimised out from all the places it gets included...

@vincentkfu
Collaborator

vincentkfu commented Mar 11, 2026

@vincentkfu

Am I correct in thinking that:

  1. trim_verify_zero is only really for verifying that trimming after writing returns zeroes?

Yes, although I should add that fio needs to be running a verify workload for this to be activated.

  2. trim_verify_zero=1 only works when trim_percentage > 0 AND trim_backlog > 0?

Yes, although, I think it would be nice if trim_backlog worked the way verify_backlog did. In other words, if trim_backlog is zero, fio should just iterate through the trim list at the conclusion of the write workload.

  3. rw=write trim_verify_zero=1 trim_percentage=100 trim_backlog=1 will only send trims that are the size of the write block?

Yes, this feature is designed to verify trims of a block after writing data to that block.

  4. If you have a pure trim workload (e.g. rw=trim) that doesn't do writes, then trim_verify_zero=1 trim_percentage=100 trim_backlog=1 verify=crc32c won't actually trigger any verification reads?

Yes, the trim zero feature appears to be designed as a variation of a verify workload where instead of validating a checksum, fio will trim the block and then make sure it's all zeroes.

(I've been using a job like:

modprobe null_blk memory_backed=1 discard=1
./fio --debug=io --name=trimzerotest --filename=/dev/nullb0 --rw=write --verify_pattern=0xff --verify=pattern --bs=4k,8k,16k --size=10m --trim_percentage=100 --trim_backlog=1 --number_ios=4
hexdump -C /dev/nullb0 -n 256

and varying the rw values to test various things out)

Additionally what do you think of td_max_rw_bs() being kept only inside backend.c as for now that's the only user? It would save fio.h carrying code that has to be continually optimised out from all the places it gets included...

I don't have strong feelings either way about this.
