eBPF (Extended Berkeley Packet Filter) is a new technology in the Linux kernel that allows users to execute custom programmes in kernel space without changing the kernel code. This provides system administrators and developers with powerful tools to gain insight into and monitor system behaviour for optimisation.
In this tutorial, we will explore how to use eBPF to write programs to count random and sequential disk I/O. Disk I/O is one of the key metrics of computer performance, especially in data-intensive applications.
As technology advances and data volumes explode, disk I/O becomes a critical bottleneck in system performance. The performance of an application depends heavily on how it interacts with the storage tier. Therefore, it becomes especially important to deeply understand and optimise disk I/O, especially random and sequential I/O.
-
Random I/O: Random I/O occurs when an application reads or writes data from or to a non-sequential location on the disk. The main characteristic of this I/O mode is that the disk head needs to move frequently between locations, causing it to be typically slower than sequential I/O. Typical scenarios that generate random I/O include database queries, file system metadata operations, and concurrent tasks in virtualised environments.
-
Sequential I/O: In contrast to random I/O, sequential I/O occurs when an application continuously reads or writes blocks of data to or from disk. The advantage of this I/O mode is that the disk head can move continuously in one direction, which greatly increases the speed at which data can be read and written. Video playback, downloading or uploading large files, and continuous logging are typical applications that generate sequential I/O.
To optimise storage performance, it is critical to understand both random and sequential disk I/O. For example, random I/O-sensitive applications typically perform far better on SSDs than on traditional hard drives because SSDs have virtually no addressing latency when dealing with random I/Os. Conversely, for applications with a lot of sequential I/O, it is much more critical to maximize the sequential read and write speed of the disk.
In the rest of this tutorial, we will discuss in detail how to use the eBPF tool to monitor and count both types of disk I/O in real time, which will not only help us better understand the I/O behaviour of the system, but will also provide us with strong data for further performance optimization.
Biopattern counts the percentage of random/sequential disk I/Os.
First of all, make sure that you have installed libbpf and the associated toolset correctly, you can find the source code here: bpf-developer-tutorial
Navigate to the biopattern
source directory and compile it using the make
command:
cd ~/bpf-developer-tutorial/src/17-biopattern
make
After successful compilation, you should see the biopattern
executable in the current directory. The basic runtime commands are as follows:
sudo ./biopattern [interval] [count]
For example, to print the output once per second for 10 seconds, you can run:
$ sudo ./biopattern 1 10
Tracing block device I/O requested seeks... Hit Ctrl-C to end.
DISK %RND %SEQ COUNT KBYTES
sr0 0 100 3 0
sr1 0 100 8 0
sda 0 100 1 4
sda 100 0 26 136
sda 0 100 1 4
The output columns have the following meanings:
DISK
: Name of the disk being tracked.%RND
: Percentage of random I/O.%SEQ
: percentage of sequential I/O.COUNT
: Number of I/O requests in the specified interval.KBYTES
: amount of data (in KB) read and written in the specified time interval.
From the above output, we can draw the following conclusions:
- The
sr0
andsr1
devices performed mostly sequential I/O during the observation period, but the amount of data was small. - The
sda
device performed only random I/O during some time periods and only sequential I/O during other time periods.
This information can help us understand the I/O pattern of the system so that we can target optimisation.
First, let's look at the eBPF kernel state code at the heart of biopattern:
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "biopattern.h"
#include "maps.bpf.h"
#include "core_fixes.bpf.h"
const volatile bool filter_dev = false;
const volatile __u32 targ_dev = 0;
struct {
__uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 64);
__type(key, u32);
__type(value, struct counter);
} counters SEC(".maps");
SEC("tracepoint/block/block_rq_complete")
int handle__block_rq_complete(void *args)
{
struct counter *counterp, zero = {};
sector_t sector;
u32 nr_sector;
u32 dev;
if (has_block_rq_completion()) {
struct trace_event_raw_block_rq_completion___x *ctx = args;
sector = BPF_CORE_READ(ctx, sector);
nr_sector = BPF_CORE_READ(ctx, nr_sector);
dev = BPF_CORE_READ(ctx, dev);
} else {
struct trace_event_raw_block_rq_complete___x *ctx = args;
sector = BPF_CORE_READ(ctx, sector);
nr_sector = BPF_CORE_READ(ctx, nr_sector);
dev = BPF_CORE_READ(ctx, dev);
}
if (filter_dev && targ_dev != dev)
return 0;
counterp = bpf_map_lookup_or_try_init(&counters, &dev, &zero);
if (!counterp)
return 0;
if (counterp->last_sector) {
if (counterp->last_sector == sector)
__sync_fetch_and_add(&counterp->sequential, 1);
else
__sync_fetch_and_add(&counterp->random, 1);
__sync_fetch_and_add(&counterp->bytes, nr_sector * 512);
}
counterp->last_sector = sector + nr_sector;
return 0;
}
char LICENSE[] SEC("license") = "GPL";
Global variable definitions:
const volatile bool filter_dev = false;
const volatile __u32 targ_dev = 0;
These two global variables are used for device filtering. filter_dev
determines whether device filtering is enabled or not, and targ_dev
is the identifier of the target device we want to track.
BPF map definition:
struct { __uint(type, BPF_MAP_TYPE_HASH);
__uint(max_entries, 64); __type(key, u32);
__type(value, struct counter);
} counters SEC(".maps").
This part of the code defines a BPF map of type hash table. The key of the map is the identifier of the device, and the value is a counter
struct, which is used to store the I/O statistics of the device.
The tracepoint function:
SEC("tracepoint/block/block_rq_complete")
int handle__block_rq_complete(void *args)
{
struct counter *counterp, zero = {};
sector_t sector;
u32 nr_sector;
u32 dev;
if (has_block_rq_completion()) {
struct trace_event_raw_block_rq_completion___x *ctx = args;
sector = BPF_CORE_READ(ctx, sector);
nr_sector = BPF_CORE_READ(ctx, nr_sector);
dev = BPF_CORE_READ(ctx, dev);
} else {
struct trace_event_raw_block_rq_complete___x *ctx = args;
sector = BPF_CORE_READ(ctx, sector);
nr_sector = BPF_CORE_READ(ctx, nr_sector);
dev = BPF_CORE_READ(ctx, dev);
}
if (filter_dev && targ_dev != dev)
return 0;
counterp = bpf_map_lookup_or_try_init(&counters, &dev, &zero);
if (!counterp)
return 0;
if (counterp->last_sector) {
if (counterp->last_sector == sector)
__sync_fetch_and_add(&counterp->sequential, 1);
else
__sync_fetch_and_add(&counterp->random, 1);
__sync_fetch_and_add(&counterp->bytes, nr_sector * 512);
}
counterp->last_sector = sector + nr_sector;
return 0;
}
In Linux, a trace point called block_rq_complete
is triggered every time an I/O request for a block device completes. This provides an opportunity to capture these events with eBPF and further analyse the I/O patterns.
Main Logic Analysis:
-
Extracting I/O request information: get information about the I/O request from the incoming parameters. There are two possible context structures depending on the return value of
has_block_rq_completion
. This is because different versions of the Linux kernel may have different tracepoint definitions. In either case, we extract the sector number(sector)
, the number of sectors(nr_sector)
and the device identifier(dev)
from the context. -
Device filtering: If device filtering is enabled
(filter_dev
istrue
) and the current device is not the target device(targ_dev
), it is returned directly. This allows the user to track only specific devices, not all devices. -
Statistics update:
- Lookup or initialise statistics: use the
bpf_map_lookup_or_try_init
function to lookup or initialise statistics related to the current device. If there is no statistics for the current device in the map, it will be initialised using thezero
structure.
- Lookup or initialise statistics: use the
-
Determine the I/O mode: Based on the sector number of the current I/O request and the previous I/O request, we can determine whether the current request is random or sequential. If the sector numbers of the two requests are the same, then it is sequential; otherwise, it is random. We then use the
__sync_fetch_and_add
function to update the corresponding statistics. This is an atomic operation that ensures data consistency in a concurrent environment. -
Update the amount of data: we also update the total amount of data for the device, which is done by multiplying the number of sectors
(nr_sector
) by 512 (the number of bytes per sector). -
Update the sector number of the last I/O request: for the next comparison, we update the value of
last_sector
.
In some versions of the Linux kernel, the naming and structure of the tracepoint has changed due to the introduction of a new tracepoint, block_rq_error
. This means that the structural name of the former block_rq_complete
tracepoint has been changed from trace_event_raw_block_rq_complete
to trace_event_raw_block_rq_completion
, a change which may cause compatibility issues with eBPF programs on different versions of the kernel. This change may cause compatibility issues with eBPF programs on different versions of the kernel.
To address this issue, the biopattern
utility introduces a mechanism to dynamically detect which trace point structure is currently used by the kernel, namely the has_block_rq_completion
function.
- Define two trace point structures:
struct trace_event_raw_block_rq_complete___x {
dev_t dev;
sector_t sector;
unsigned int nr_sector;
} __attribute__((preserve_access_index));
struct trace_event_raw_block_rq_completion___x {
dev_t dev;
sector_t sector;
unsigned int nr_sector;
} __attribute__((preserve_access_index));
Two tracepoint structures are defined here, corresponding to different versions of the kernel. Each structure contains a device identifier (dev
), sector number (sector
), and number of sectors (nr_sector
).
Dynamic detection of trackpoint structures:
static __always_inline bool has_block_rq_completion()
{
if (bpf_core_type_exists(struct trace_event_raw_block_rq_completion___x))
return true;
return false;
}
The has_block_rq_completion
function uses the bpf_core_type_exists
function to detect the presence of the structure trace_event_raw_block_rq_completion___x
in the current kernel. If it exists, the function returns true
, indicating that the current kernel is using the new tracepoint structure; otherwise, it returns false
, indicating that it is using the old structure. The two different definitions are handled separately in the corresponding eBPF code, which is a common solution for adapting to changes between kernel versions.
The biopattern
tool's userland code is responsible for reading statistics from the BPF mapping and presenting them to the user. In this way, system administrators can monitor the I/O patterns of each device in real time to better understand and optimise the I/O performance of the system.
- Main loop
/* main: poll */
while (1) {
sleep(env.interval);
err = print_map(obj->maps.counters, partitions);
if (err)
break;
if (exiting || --env.times == 0)
break;
}
This is the main loop of the biopattern
utility, and its workflow is as follows:
- Wait: use the
sleep
function to wait for the specified interval(env.interval
). print_map
: callprint_map
function to print the statistics in BPF map.- Exit condition: if an exit signal is received
(exiting
istrue
) or if the specified number of runs is reached(env.times
reaches 0), the loop exits.
Print mapping function:
static int print_map(struct bpf_map *counters, struct partitions *partitions)
{
__u32 total, lookup_key = -1, next_key;
int err, fd = bpf_map__fd(counters);
const struct partition *partition;
struct counter counter;
struct tm *tm;
char ts[32];
time_t t;
while (!bpf_map_get_next_key(fd, &lookup_key, &next_key)) {
err = bpf_map_lookup_elem(fd, &next_key, &counter);
if (err < 0) {
fprintf(stderr, "failed to lookup counters: %d\n", err);
return -1;
}
lookup_key = next_key;
total = counter.sequential + counter.random;
if (!total)
continue;
if (env.timestamp) {
time(&t);
tm = localtime(&t);
strftime(ts, sizeof(ts), "%H:%M:%S", tm);
printf("%-9s ", ts);
}
partition = partitions__get_by_dev(partitions, next_key);
printf("%-7s %5ld %5ld %8d %10lld\n",
partition ? partition->name : "Unknown",
counter.random * 100L / total,
counter.sequential * 100L / total, total,
counter.bytes / 1024);
}
lookup_key = -1;
while (!bpf_map_get_next_key(fd, &lookup_key, &next_key)) {
err = bpf_map_delete_elem(fd, &next_key);
if (err < 0) {
fprintf(stderr, "failed to cleanup counters: %d\n", err);
return -1;
}
lookup_key = next_key;
}
return 0;
}
The print_map
function is responsible for reading statistics from the BPF map and printing them to the console. The main logic is as follows:
- Traverse the BPF map: Use the
bpf_map_get_next_key
andbpf_map_lookup_elem
functions to traverse the BPF map and get the statistics for each device. - Calculate totals: Calculate the total number of random and sequential I/Os for each device.
- Print statistics: If timestamp is enabled
(env.timestamp
istrue
), the current time is printed first. Next, the device name, percentage of random I/O, percentage of sequential I/O, total I/O, and total data in KB are printed. - Cleaning up the BPF map: For the next count, use the
bpf_map_get_next_key
andbpf_map_delete_elem
functions to clean up all entries in the BPF map.
In this tutorial, we have taken an in-depth look at how to use the eBPF tool biopattern to monitor and count random and sequential disk I/O in real-time. we started by understanding the importance of random and sequential disk I/O and their impact on system performance. We then describe in detail how biopattern works, including how to define and use BPF maps, how to deal with tracepoint variations in different versions of the Linux kernel, and how to capture and analyse disk I/O events in an eBPF program.
You can visit our tutorial code repository at https://github.com/eunomia-bpf/bpf-developer-tutorial or our website at https://eunomia.dev/zh/tutorials/ for more examples and a complete tutorial.
- Source repo:https://github.com/eunomia-bpf/bpf-developer-tutorial/tree/main/src/17-biopattern
- bcc tool:https://github.com/iovisor/bcc/blob/master/libbpf-tools/biopattern.c
The original link of this article: https://eunomia.dev/tutorials/17-biopattern