by Sam Hadow
On my server I self-host quite a lot of services, but I only have 5900rpm HDDs for the data and an SSD just for the OS and binaries.
Sometimes these HDDs struggle to keep up with the I/O operations of all my services.
So in this short blog post I’ll show you the troubleshooting steps to find the culprit behind high disk I/O, and how to limit its disk usage.
To check disk usage we can use the tool iostat (provided by the package sysstat on Fedora, Debian and Arch Linux)
to see the extended stats every second:
iostat -x 1
You’ll then get an output like this:
avg-cpu: %user %nice %system %iowait %steal %idle
4.37 0.00 6.94 21.85 0.00 66.84
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 524.00 8384.00 0.00 0.00 47.82 16.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 25.06 100.00
dm-4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda 0.00 0.00 0.00 0.00 0.00 0.00 524.00 8384.00 0.00 0.00 46.02 16.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 24.11 99.30
sdb 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdc 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
zram0 1.00 4.00 0.00 0.00 0.00 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Let’s explain each column:
| Column | Meaning |
|---|---|
| Device | The block device name (e.g. sda, dm-0, etc.). |
| r/s | Number of read requests per second issued to the device. |
| rkB/s | Amount of data read per second, in kilobytes. |
| rrqm/s | Number of merged read requests per second (the kernel merges adjacent reads into a single I/O). |
| %rrqm | Percentage of read requests merged - calculated as 100 * rrqm/s / (r/s + rrqm/s). |
| r_await | Average time (in milliseconds) for read requests to be served - includes both queue time and service time. |
| rareq-sz | Average size (in kilobytes) of each read request. |
| w/s | Number of write requests per second issued to the device. |
| wkB/s | Amount of data written per second, in kilobytes. |
| wrqm/s | Number of merged write requests per second. |
| %wrqm | Percentage of write requests that were merged - calculated in a similar way to %rrqm. |
| w_await | Average time (ms) for write requests to complete. |
| wareq-sz | Average size (kB) of each write request. |
| d/s | Number of discard requests per second (TRIM / UNMAP commands - mostly on SSDs). |
| dkB/s | Amount of data discarded per second (in kB). |
| drqm/s | Merged discard requests per second. |
| %drqm | Percentage of discard requests merged. |
| d_await | Average time (ms) for discard requests to complete. |
| dareq-sz | Average discard request size (kB). |
| f/s | Number of flush requests per second — these force buffered data to non-volatile storage. |
| f_await | Average time (ms) for flush requests to complete. |
| aqu-sz | Average queue size — the average number of I/O requests waiting in the queue or being serviced during the sample interval. |
| %util | Percentage of time the device was busy processing I/O requests. Values near 100% mean the device was busy almost constantly; but for devices that serve requests in parallel (RAID arrays, modern SSDs, multi-queue devices) this number doesn’t necessarily reflect their real limits. |
The most interesting columns for us are w/s, wkB/s, w_await (and their read counterparts), aqu-sz and %util. In the example above, dm-3 and the underlying sda clearly stand out: 524 writes per second, 8384 wkB/s written, an average queue size above 24 and roughly 100% utilization.
Fun fact: although iostat displays units corresponding to kilobytes (kB), megabytes (MB) and so on, it actually uses kibibytes (KiB), mebibytes (MiB)… A kibibyte is equal to 1024 bytes, and a mebibyte is equal to 1024 kibibytes.
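As a quick worked example: the 8384 wkB/s above is really 8384 KiB/s, i.e. 8384 / 1024 ≈ 8.2 MiB/s of sustained writes.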
In the previous example, the dm-* devices are virtual block devices managed by the device mapper (in my case, created by LVM).
To identify which physical volumes they correspond to, we can run this command:
ls -l /dev/mapper
Or this command as root:
dmsetup ls --tree
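Alternatively, sysfs exposes the mapping directly. For example, to list the physical device(s) backing dm-3, the saturated device from the iostat output above:
ls /sys/block/dm-3/slaves
Here it should print sda, since dm-3’s write load mirrors sda’s in the output.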
To find the process responsible, we can use the iotop command (the package is named iotop on Fedora, Debian and Arch Linux). We usually need to run iotop as root, as it needs elevated privileges.
With the following options it’s easier to spot processes causing a high I/O usage:
sudo iotop -aoP
What these options do:
-a = accumulated I/O since start
-o = only show processes actually doing I/O
-P = show per-process, not per-thread
We can also use pidstat (provided by the package sysstat on Fedora, Debian and Arch Linux). It’s better to run this command as root, otherwise you’ll only see the processes of the user running the command and not all the processes.
To show per-process read/write operations, updating every second:
pidstat -d 1
We can then write down the PID, or the command name, corresponding to the line showing a lot of disk reads or writes.
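Once we have a suspect, we can also watch just that process (1234 is a placeholder PID):
pidstat -d -p 1234 1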
With podman we can use arguments in the run command to limit disk I/O, as mentioned in the documentation:
| Argument | Effect |
|---|---|
| --device-read-bps=path:rate | Limit read rate (in bytes per second) from a device (e.g. --device-read-bps=/dev/sda:1mb). |
| --device-read-iops=path:rate | Limit read rate (in IO operations per second) from a device (e.g. --device-read-iops=/dev/sda:1000). |
| --device-write-bps=path:rate | Limit write rate (in bytes per second) to a device (e.g. --device-write-bps=/dev/sda:1mb). |
| --device-write-iops=path:rate | Limit write rate (in IO operations per second) to a device (e.g. --device-write-iops=/dev/sda:1000). |
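Putting it together, a container capped at 10 MB/s of reads and writes on /dev/sda could be started like this (the container name, image and rates are just placeholders):
podman run -d --name io-limited-app \
  --device-read-bps=/dev/sda:10mb \
  --device-write-bps=/dev/sda:10mb \
  docker.io/library/nginx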
These may not work in rootless mode unless I/O delegation is enabled.
You can verify which resource controllers are delegated with this command:
cat "/sys/fs/cgroup/user.slice/user-$(id -u).slice/user@$(id -u).service/cgroup.controllers"
In our case we need io in the output.
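On a system with full delegation the output looks something like this (the exact set varies by distribution and systemd version):
cpuset cpu io memory pids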
If it’s not present, you can create the file /etc/systemd/system/user@.service.d/delegate.conf with the following content:
[Service]
Delegate=io
You can also add the other resource controllers you want to delegate to users, for example memory pids cpu cpuset; the file would then look like this:
[Service]
Delegate=io memory pids cpu cpuset
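After creating or editing this drop-in, make systemd pick up the change:
sudo systemctl daemon-reload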
You then need to log out and log back in for the new delegation to take effect.
To limit disk I/O for a systemd service we can use slices.
The most useful options we can put in the [Slice] section are the following; you can see all the available options in the documentation:
| Property | Description |
|---|---|
| IOAccounting= | Enables collection of I/O statistics (used by systemd-cgtop, systemd-analyze, etc.). |
| IOWeight=weight | Sets relative I/O priority (1–10000, default 100). A higher value gives the unit a larger share of available bandwidth when multiple units compete. |
| IODeviceWeight=device weight | Assigns a per-device weight, overriding IOWeight= for that device. |
| IOReadBandwidthMax=device bytes | Sets an absolute cap on read bandwidth, e.g. /dev/sda 10M. The unit cannot exceed this, even if idle bandwidth exists. Possible suffixes are K, M, G or T for kilobytes, megabytes, gigabytes or terabytes respectively; without a suffix the bandwidth is parsed in bytes per second. |
| IOWriteBandwidthMax=device bytes | Same, but for write bandwidth. |
| IOReadIOPSMax=device limit | Caps the number of read operations per second, e.g. /dev/nvme0n1 500. |
| IOWriteIOPSMax=device limit | Caps the number of write operations per second. |
To create a slice unit, put for example the following in /etc/systemd/system/io-limited.slice:
[Unit]
Description=Slice for IO-limited services
[Slice]
IOAccounting=yes
IOWriteBandwidthMax=/dev/sda 20M
IOReadBandwidthMax=/dev/sda 20M
We can then assign services to this slice; in the [Service] section we would have:
[Service]
Slice=io-limited.slice
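Assuming the edited unit is a hypothetical myservice.service, reload systemd and restart it for the slice to take effect:
sudo systemctl daemon-reload
sudo systemctl restart myservice.service
Since the slice enables IOAccounting, systemd-cgtop can then be used to check that the service’s I/O actually stays under the caps.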