Commit 9773add

feat(disk-io): add normalized iowait monitoring on Linux
1 parent 32d6838 commit 9773add

3 files changed, +342 -58 lines changed

check-plugins/disk-io/README.md

Lines changed: 75 additions & 34 deletions
@@ -8,10 +8,22 @@ On Linux, the check plugin by default tries to find "important" disks automatica

 Disk I/O always starts at 10 MiB/sec, but stores the highest measured bandwidth, so it adjusts the `RWmax/s` value accordingly. For this reason, this check takes some time to warm up its (cached) readings: The check will throw some warnings and criticals during the first major disk activities above 10 MiB/sec until the maximum bandwidth of the disk has been determined.
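The warm-up behaviour described above can be illustrated with a small Python sketch. It is not the plugin's code: the names `INITIAL_RWMAX`, `update_rwmax`, `saturation_percent` and `replay` are hypothetical, the readings are made up, and the sketch assumes the stored maximum is updated before the percentage is computed.

```python
MIB = 1024 * 1024
INITIAL_RWMAX = 10 * MIB  # README: disk I/O always starts at 10 MiB/sec


def update_rwmax(stored_max, measured):
    """Keep the highest bandwidth (bytes/sec) seen so far."""
    return max(stored_max, measured)


def saturation_percent(bandwidth, rwmax):
    """Bandwidth as a percentage of the highest bandwidth observed."""
    return bandwidth * 100.0 / rwmax


def replay(measurements, rwmax=INITIAL_RWMAX):
    """Return (saturation %, RWmax) per reading; RWmax is updated first (assumption)."""
    results = []
    for measured in measurements:
        rwmax = update_rwmax(rwmax, measured)
        results.append((saturation_percent(measured, rwmax), rwmax))
    return results


if __name__ == "__main__":
    # Hypothetical readings in bytes/sec: idle, first burst, bigger burst, normal load.
    for pct, rwmax in replay([2 * MIB, 50 * MIB, 200 * MIB, 40 * MIB]):
        print(f"{pct:5.1f}% of {rwmax / MIB:.0f} MiB/s")
```

In this replay both bursts count as full saturation while the maximum is still being learned. Once 200 MiB/s has been recorded, a 40 MiB/s load is only 20% of `RWmax`.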
-Example: The (shortened) result of `./disk-io --count 5 --warning 80 --critical 90` could look like this:
+
+### iowait (Linux only)
+
+On Linux, the check also monitors the system-wide iowait percentage. iowait represents CPU time spent idle while waiting for I/O operations to complete. While technically a CPU metric, its diagnostic value is entirely in the disk I/O context, which is why it is part of this check rather than a separate one.
+
+The raw iowait value is normalized by multiplying it with the number of logical CPUs, so that 100% always means one CPU core is fully I/O-saturated, regardless of the total number of CPUs. Values above 100% indicate that more than one core is waiting for I/O. This normalization approach is inspired by [Glances](https://github.com/nicolargo/glances), which uses `100 / N` (where N = number of CPUs) as its critical threshold for raw iowait. The reason such thresholds appear low in Glances is that raw iowait is reported as a percentage of total CPU time across all cores: on a 4-core system, 25% raw iowait already means one entire core is doing nothing but waiting for I/O. By normalizing the value, the default thresholds (80/90%) work consistently across any hardware.
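The normalization arithmetic can be written out as a short Python sketch. The function names are hypothetical and the raw percentage is passed in as a plain number. How the plugin actually reads iowait is not shown in this diff.

```python
import os


def normalize_iowait(raw_percent, logical_cpus):
    """Scale raw iowait (percent of total CPU time) so that 100% equals one fully waiting core."""
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    return raw_percent * logical_cpus


def raw_critical_threshold(logical_cpus):
    """The 100 / N rule for raw iowait, as described for Glances above."""
    return 100.0 / logical_cpus


if __name__ == "__main__":
    # 4-core example from the text: 25% raw iowait is one whole core waiting.
    print(normalize_iowait(25.0, 4))
    # On the local machine; os.cpu_count() may return None, so fall back to 1.
    cpus = os.cpu_count() or 1
    print(normalize_iowait(0.5, cpus))
```

With 4 logical CPUs, 25% raw iowait normalizes to 100%, which is the same point at which the `100 / N` rule would fire on the raw value.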
+
+Like bandwidth alerts, iowait alerts only trigger after `--count` consecutive threshold violations, suppressing short spikes.
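One way to picture the "consecutive violations" rule is the following Python sketch. It is an illustration under assumptions, not the plugin's implementation: `sustained_state` is a hypothetical helper, the history is a plain list with the newest value last, and a value counts as a violation when it is greater than or equal to the threshold, matching the `>= 80` / `>= 90` defaults.

```python
def sustained_state(history, warn=80.0, crit=90.0, count=5):
    """Return "OK", "WARN" or "CRIT" based on the last `count` readings."""
    recent = history[-count:]
    if len(recent) < count:
        return "OK"  # not enough readings yet to call the load sustained
    if all(value >= crit for value in recent):
        return "CRIT"
    if all(value >= warn for value in recent):
        return "WARN"
    return "OK"


if __name__ == "__main__":
    print(sustained_state([95, 96, 97, 98, 99]))  # five criticals in a row
    print(sustained_state([95, 96, 10, 98, 99]))  # one dip resets the streak
```

A single reading below the threshold within the window is enough to keep the state at OK, which is how short spikes are suppressed.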
+
+
+### Example
+
+The (shortened) result of `./disk-io --count 5 --warning 80 --critical 90` could look like this:

 ```text
-/dev/dm-4: 0.0B/s read1, 48.7KiB/s write1, 48.7KiB/s total, 227.9MiB/s max
+iowait: 0.1%. /dev/dm-4: 0.0B/s read1, 48.7KiB/s write1, 48.7KiB/s total, 227.9MiB/s max

 Name ! RWmax/s ! R1/s ! W1/s ! R5/s ! W5/s ! RW5/s
 -----+---------+----------+----------+----------+----------+--------------------
@@ -20,7 +32,7 @@ dm-1 ! 10.0MiB ! 4.7KiB ! 4.0KiB ! 2.0KiB ! 6.8KiB ! 8.7KiB
 ...
 ```

-The first line always shows the disk with the currently highest bandwidth usage (here `dm-0`).
+The first line shows the current iowait percentage followed by the disk with the currently highest bandwidth usage (here `dm-0`).

 The table columns mean:
@@ -52,45 +64,67 @@ Hints:

 ```text
 usage: disk-io [-h] [-V] [--always-ok] [--count COUNT] [--critical CRIT]
+               [--iowait-critical IOWAIT_CRIT] [--iowait-warning IOWAIT_WARN]
                [--match MATCH] [--top TOP] [--warning WARN]

 Checks disk I/O bandwidth over time and alerts on sustained saturation, not
 short spikes. The check records per-disk read/write counters and then derives
 current (R1/W1) and period averages (R{COUNT}/W{COUNT}). It compares the
 period’s total bandwidth against the maximum ever observed for that disk
 (RWmax). WARN/CRIT trigger if the period average exceeds the configured
-percentage of RWmax for COUNT consecutive runs. Perfdata is emitted for each
-disk (busy_time, read_bytes, read_time, write_bytes, write_time) so you can
-graph trends. On Linux the check automatically focuses on "real" block devices
-with mountpoints; on Windows it uses psutil’s disk counters. Optionally,
-`--top` lists the processes that generated the most I/O traffic (read/write
-totals) to help identify offenders. This check is cross-platform and works on
-Linux, Windows, and all psutil-supported systems. The check stores its short
-trend state locally in an SQLite DB to evaluate sustained load across runs.
+percentage of RWmax for COUNT consecutive runs. On Linux, the check also
+monitors the system-wide iowait percentage (CPU time spent waiting for I/O).
+The raw iowait value is normalized by multiplying it with the number of
+logical CPUs, so that 100% always means one CPU core is fully I/O-saturated,
+regardless of the total number of CPUs. This makes the default thresholds
+(80/90%) work consistently across different hardware. Like bandwidth alerts,
+iowait alerts require COUNT consecutive threshold violations. Perfdata is
+emitted for each disk (busy_time, read_bytes, read_time, write_bytes,
+write_time) and for iowait, so you can graph trends. On Linux the check
+automatically focuses on "real" block devices with mountpoints; on Windows it
+uses psutil’s disk counters. Optionally, `--top` lists the processes that
+generated the most I/O traffic (read/write totals) to help identify offenders.
+This check is cross-platform and works on Linux, Windows, and all psutil-
+supported systems. The check stores its short trend state locally in an SQLite
+DB to evaluate sustained load across runs.

 options:
-  -h, --help         show this help message and exit
-  -V, --version      show program's version number and exit
-  --always-ok        Always returns OK.
-  --count COUNT      Number of times the value must exceed specified thresholds
-                     before alerting. Default: 5
-  --critical CRIT    Threshold for disk bandwidth saturation (over the last
-                     `--count` measurements) as a percentage of the maximum
-                     bandwidth the disk can support. Default: >= 90
-  --match MATCH      Match on disk names. Uses Python regular expressions
-                     without any external flags like `re.IGNORECASE`. The
-                     regular expression is applied to each line of the output.
-                     Examples: `(?i)example` to match the word "example" in a
-                     case-insensitive manner. `^(?!.*example).*$` to match any
-                     string except "example" (negative lookahead). `(?: ... )*`
-                     is a non-capturing group that matches any sequence of
-                     characters that satisfy the condition inside it, zero or
-                     more times. Default:
-  --top TOP          List x "Top processes that generated the most I/O traffic".
-                     Use `--top=0` to disable this feature. Default: 5
-  --warning WARN     Threshold for disk bandwidth saturation (over the last
-                     `--count` measurements) as a percentage of the maximum
-                     bandwidth the disk can support. Default: >= 80
+  -h, --help            show this help message and exit
+  -V, --version         show program's version number and exit
+  --always-ok           Always returns OK.
+  --count COUNT         Number of times the value must exceed specified
+                        thresholds before alerting. Default: 5
+  --critical CRIT       Threshold for disk bandwidth saturation (over the last
+                        `--count` measurements) as a percentage of the maximum
+                        bandwidth the disk can support. Default: >= 90
+  --iowait-critical IOWAIT_CRIT
+                        Set the critical threshold for normalized iowait in
+                        percent (Linux only). The iowait value is normalized
+                        so that 100% means one CPU core is fully
+                        I/O-saturated. Values above 100% indicate that more
+                        than one core is waiting for I/O. Default: >= 90
+  --iowait-warning IOWAIT_WARN
+                        Set the warning threshold for normalized iowait in
+                        percent (Linux only). The iowait value is normalized
+                        so that 100% means one CPU core is fully
+                        I/O-saturated. Values above 100% indicate that more
+                        than one core is waiting for I/O. Default: >= 80
+  --match MATCH         Match on disk names. Uses Python regular expressions
+                        without any external flags like `re.IGNORECASE`. The
+                        regular expression is applied to each line of the
+                        output. Examples: `(?i)example` to match the word
+                        "example" in a case-insensitive manner.
+                        `^(?!.*example).*$` to match any string except
+                        "example" (negative lookahead). `(?: ... )*` is a non-
+                        capturing group that matches any sequence of
+                        characters that satisfy the condition inside it, zero
+                        or more times. Default:
+  --top TOP             List x "Top processes that generated the most I/O
+                        traffic". Use `--top=0` to disable this feature.
+                        Default: 5
+  --warning WARN        Threshold for disk bandwidth saturation (over the last
+                        `--count` measurements) as a percentage of the maximum
+                        bandwidth the disk can support. Default: >= 80
 ```
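The two `--match` patterns quoted in the help text can be tried out with Python's `re` module. This is a minimal sketch with made-up disk names. `filter_disks` is a hypothetical helper, and the use of `re.search` is an assumption, since the help text only states that no extra flags are passed.

```python
import re


def filter_disks(names, pattern):
    """Return the names the pattern matches; no extra flags are passed to re."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]


if __name__ == "__main__":
    disks = ["sda", "example1", "EXAMPLE2", "dm-0"]  # hypothetical names
    # Inline flag (?i): case-insensitive match on "example".
    print(filter_disks(disks, r"(?i)example"))
    # Negative lookahead: everything that does not contain lowercase "example".
    print(filter_disks(disks, r"^(?!.*example).*$"))
```

Because no flags such as `re.IGNORECASE` are passed, case-insensitivity has to be requested inside the pattern itself, which is why the second pattern still lets `EXAMPLE2` through.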
@@ -111,7 +145,7 @@ Match all disks except `vdc`, `vdh` and `vdz`:
 Example Output:

 ```text
-/dev/dm-8: 5.6KiB/s read1, 2.2MiB/s write1, 2.2MiB/s total, 10.0MiB/s max
+iowait: 0.1%. /dev/dm-8: 5.6KiB/s read1, 2.2MiB/s write1, 2.2MiB/s total, 10.0MiB/s max

 Name ! MntPnts ! DvMppr ! RWmax/s ! R1/s ! W1/s ! R5/s ! W5/s ! RW5/s
 -----+----------------+------------------+---------+--------+---------+--------+---------+---------
@@ -138,10 +172,17 @@ Top 5 processes that generate the most I/O traffic (r/w):
 ## States

 * WARN or CRIT if the bandwidth over the last n measured values is above a certain percentage, compared to the all time maximum bandwidth of this drive.
+* WARN or CRIT if iowait exceeds the threshold for `--count` consecutive runs (Linux only).


 ## Perfdata / Metrics

+Global:
+
+| Name | Type | Description |
+|----|----|----|
+| iowait | Percentage | System-wide iowait (Linux only). |
+
 Per (matched) disk, where <disk\> is the block device name:

 | Name | Type | Description |
