fio: prevent OOM by not allocating buffer for TRIM range #2069
dennischerchang wants to merge 1 commit into axboe:master
Conversation
Large TRIM ranges combined with high iodepth can cause memory usage to spike and trigger Out Of Memory (OOM) errors. Since TRIM is a metadata-only operation, it does not require a data payload buffer. This patch adds a function to calculate the buffer size without DDIR_TRIM, preventing unnecessary memory allocation.

Fixes: axboe#2056

Signed-off-by: Dennis Chang <cherhungc@google.com>
Please don't close an existing PR and open a new one to continue a minor rewrite. If you use git's (interactive) rebase and then force push your changes to the same branch you can keep updating an existing PR even after squashing which is a benefit because we keep prior PR conversation. A few notes:
My biggest concern is that a fuller motivating example needs to be included in the commit message because it's not obvious how to hit the problem that is being resolved:
I tried to reproduce the problem with the following but ran into various issues:
Will be careful next time. Sorry about the confusion.
Correct. The original command was actually a TRIM of 3.8 TB (3,801,035,759,616 or 0x00000374ff62e000), which overflowed when converting to blktrace and became roughly 4.2 GB (4,284,669,952 or 0x00000000ff62e000; the high bytes 0x0000037400000000 are dropped).
The original ioengine being used is nvme_kq. There is no issue sending the 3.8 TB TRIM itself, but it causes the OOM.
Here is the message:
Here is the command sequence after converting to blktrace, run with --debug=blktrace --debug=mem (iodepth=512). Another experiment I did was to use read_iolog_chunked=1. Note: the existing code does not handle this trace properly because of the corner case, so I made some changes at my end to proceed (e.g. still fetch more entries even when there are 0 events per second). After changing the default number in the first batch from 10 to 12, it does not OOM. I think the OOM still happens with the default of 10 even when fetching with 0 eps, but I am not sure from memory right now. Summary (with iodepth=512):
Not to worry.
I'm not familiar with that fio ioengine (nor does a quick search on Google turn up anything). Is it an open source ioengine? Are you also able to reproduce the problem with something like the plain
It might help if you can share your full (replay) command line/job file.
Ah that explains that. The wraparound issue extends beyond fio though... I assume the non-processed blktrace trace file stores sizes in units of 512 and thus can represent larger values than the blkparse processed trace file.
That RES is indeed huge. Which part of fio do you believe is causing the high usage - is it storing the log itself pushing usage through the roof or is there more to it e.g. is a high
Another problem! Perhaps a command line option to control it would be useful... One more question: are you able to "manually" create a synthetic text based iolog that shows the OOM while using a built-in ioengine (note that text iologs have the same 32 bit length restriction for now)?
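Such a synthetic text iolog might look like the following sketch (illustrative only; the device path and lengths are made up, and as noted above the text format shares the 32-bit length restriction, so the trim length here is the already-truncated ~4.2 GB value):

```text
fio version 2 iolog
/dev/nvme0n1 add
/dev/nvme0n1 open
/dev/nvme0n1 read 0 4096
/dev/nvme0n1 trim 4096 4284669952
/dev/nvme0n1 close
```

Replayed with read_iolog against a built-in ioengine, the single read forces a data buffer allocation while the huge trim drives the buffer sizing.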
nvme_kq is a Google-proprietary ioengine, but I think the ioengine is irrelevant. I can reproduce the OOM when I replay against other ioengines, such as libaio.
It is independent of this issue and we can discuss it in another thread.
It is the I/O buffer that fio sends along with each read/write event. fio allocates a contiguous buffer (2 TiB of contiguous virtual address space here) for the whole iodepth at once.
Here is a way to trigger the issue on my system with 16G of RAM:
However, this patch does not resolve the issue because by default
@vincentkfu ah thanks - that's what I was looking for! So basically if there is even a single read or write then space will be reserved for the largest sized I/O (which might be a giant trim) * iodepth; my previous job only had trims so it never triggered the problem. I can see how a revised patch would address the author's original issue.
Am I correct in thinking that:
(I've been using a job like: and varying the
Additionally, what do you think of
Yes, although I should add that fio needs to be running a verify workload for this to be activated.
Yes, although I think it would be nice if
Yes, this feature is designed to verify trims of a block after writing data to that block.
Yes, the trim zero feature appears to be designed as a variation of a verify workload where instead of validating a checksum, fio will trim the block and then make sure it's all zeroes.
I don't have strong feelings either way about this.