Commit 7258c2e

feat(mailq)!: replace count-based alerting with age-based, add --mta autodetection
BREAKING CHANGE: --warning and --critical now take a duration string (1h, 3D, 30m) instead of a mail count. The defaults change from 2 / 250 mails to 1h / 3D. The rationale is that a queue with 100 fresh mails is still OK when they are delivered within minutes, while a single mail stuck for more than an hour is always interesting.

Adds --mta=auto|postfix|exim|sendmail to override MTA autodetection. Postfix is now read via "postqueue -j" (JSON with arrival_time as Unix epoch) for a rock-solid timestamp that needs no date parser. Exim still reads "mailq" (aliased to "exim -bp") and uses the age literals exim prints next to each queued message. Everything else falls back to "mailq" with RFC2822 Date: line parsing.

The mail count stays in perfdata (mailq) so existing Grafana dashboards keep working, and a new oldest_mail_age perfdata metric carries the age in seconds with warn/crit thresholds.

Existing Icinga services that pass the old mail-count syntax (e.g. mailq_warning=2 mailq_critical=250) will hard-UNKNOWN with a clear "Invalid duration" message after the upgrade, forcing the admin to read the release notes and migrate to duration strings. A silent fallback that reinterprets old integer values as hours was considered and rejected because it would silently change the alerting semantic from "queue size" to "queue age" without the admin noticing, which is the worst possible outcome for monitoring.

Unit test fixtures were renamed to follow the CONTRIBUTING convention (name describes the shape of the data, not the expected plugin state). Tests pin TZ=UTC and use a hidden --now test hook to make age-based assertions deterministic regardless of the host timezone.

Bump linuxfabrik-lib pin from 3.1.0 to 3.1.1, picking up the human2seconds() / humanduration2seconds() lowercase d/w day/week marker support that the mailq plugin now relies on to parse exim age literals like 1d12h without normalizing the input first.

The mailq Icinga Director basket needs a regen to pick up the new --mta parameter and the changed --warning/--critical semantics.

Closes #781
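The hard-UNKNOWN behaviour for legacy integer thresholds can be illustrated with a small sketch. `parse_duration()` and `check_threshold()` are hypothetical stand-ins written for this note, not the `human2seconds()` implementation from linuxfabrik-lib, and the 30-day month and 365-day year are assumptions made only for the example.

```python
import re

# Seconds per unit. Unit letters are case-sensitive: "m" is minutes, "M" is months.
# ASSUMPTION: a month counts as 30 days and a year as 365 days here;
# the real library may use different values.
UNIT_SECONDS = {
    's': 1,
    'm': 60,
    'h': 3600,
    'D': 86400, 'd': 86400,      # lowercase d/w as used in exim age literals
    'W': 604800, 'w': 604800,
    'M': 30 * 86400,
    'Y': 365 * 86400,
}

_TOKEN = re.compile(r'(\d+)([smhDdWwMY])')
_WHOLE = re.compile(r'(?:\d+[smhDdWwMY])+')


def parse_duration(text):
    """Return the duration in seconds; raise ValueError if any part lacks a unit."""
    text = text.strip()
    # A bare integer such as "250" has no unit suffix and is rejected outright
    # instead of being reinterpreted as hours or seconds.
    if not _WHOLE.fullmatch(text):
        raise ValueError(f'Invalid duration: {text!r}')
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in _TOKEN.findall(text))


def check_threshold(text):
    """Return (state, value): state 0 with seconds, or state 3 (UNKNOWN) with a message."""
    try:
        return 0, parse_duration(text)
    except ValueError as exc:
        return 3, str(exc)


print(check_threshold('1h'))
print(check_threshold('1d12h'))
print(check_threshold('250'))
```

Rejecting the bare integer rather than guessing a unit is what keeps an old `mailq_critical=250` from quietly becoming an age threshold.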
1 parent 203b2ef commit 7258c2e

21 files changed: +457 -130 lines changed

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -16,6 +16,7 @@ Build, CI/CD:

 Monitoring Plugins:

+* mailq: the alerting semantic has flipped from "number of mails in the queue" to "age of the oldest mail in the queue". `--warning` and `--critical` now take a duration string with a unit suffix (`1h`, `3D`, `30m`, `72h`, ...) and the defaults are `1h` / `3D` (down from `2` / `250` mails). The rationale is that a queue with 100 fresh mails is still OK when they are delivered within minutes, while a single mail stuck for more than an hour is always interesting and is exactly when an admin wants to look. The mail count stays in perfdata (`mailq`) so Grafana trending keeps working, and the new `oldest_mail_age` perfdata metric carries the age in seconds. Existing Icinga services that set `mailq_warning=10` and `mailq_critical=500` need to be migrated to duration strings (e.g. `mailq_warning=1h`, `mailq_critical=3D`). Also adds `--mta=auto|postfix|exim|sendmail` to override MTA autodetection: Postfix is now read via `postqueue -j` (JSON with `arrival_time` as Unix epoch) for a rock-solid timestamp, Exim still uses `mailq` (= `exim -bp`) and its built-in age literals, and everything else falls back to `mailq` with `Date:` line parsing ([#781](https://github.com/Linuxfabrik/monitoring-plugins/issues/781))
 * procs: `--argument`, `--command` and `--username` now use regular expressions instead of substring/startswith matching. Existing filters like `--command=httpd` still work but now match anywhere in the name. Use `--command='^httpd'` for the previous startswith behavior, or `--username='^apache$'` for exact matches.
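The anchoring behaviour described in the procs entry can be checked with a short sketch. It assumes the plugin applies the pattern with search semantics (a match anywhere in the string), as the entry states; the `matching()` helper and the sample process names are made up for illustration.

```python
import re

# Hypothetical process and user names, for illustration only.
commands = ['httpd', '/usr/sbin/httpd', 'httpd-worker', 'lighttpd']
usernames = ['apache', 'apache2', 'www-apache']


def matching(pattern, names):
    """Keep the names in which the pattern matches anywhere (re.search semantics)."""
    return [name for name in names if re.search(pattern, name)]


unanchored = matching(r'httpd', commands)    # matches anywhere, including "lighttpd"
anchored = matching(r'^httpd', commands)     # previous startswith behaviour
exact = matching(r'^apache$', usernames)     # exact match only

print(unanchored)
print(anchored)
print(exact)
```

An unanchored `httpd` also catches `lighttpd`, which is why filters that relied on the old startswith matching may need a leading `^` after the upgrade.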
@@ -70,7 +71,7 @@ Assets:

 Build, CI/CD:

-* Bump pinned `linuxfabrik-lib` dependency from 3.0.0 to 3.1.0, picking up the new `run_mariadb()` / `MARIADB_LTS_IMAGES` container-test helpers, the `attach_each()` / `attach_tests()` unit-test helpers and the `disk.dir_exists()` directory check
+* Bump pinned `linuxfabrik-lib` dependency from 3.0.0 to 3.1.1, picking up the new `run_mariadb()` / `MARIADB_LTS_IMAGES` container-test helpers, the `attach_each()` / `attach_tests()` unit-test helpers, the `disk.dir_exists()` directory check, and `human2seconds()` / `humanduration2seconds()` accepting the lowercase `d` / `w` day/week markers (which the mailq plugin needs to parse exim age literals)
 * Windows MSI still installs all plugins to ProgramFiles64Folder/ICINGA2/sbin/linuxfabrik, but does not depend on an Icinga2 agent any longer

check-plugins/mailq/README.md

Lines changed: 36 additions & 13 deletions
@@ -31,21 +31,44 @@ Checks the number of messages in the mail queue using the `mailq` command. Alert
 ## Help

 ```text
-usage: mailq [-h] [-V] [--always-ok] [-c CRIT] [--test TEST] [-w WARN]
-
-Checks the number of messages in the mail queue using the "mailq" command.
-Alerts when the queue length exceeds the configured thresholds.
+usage: mailq [-h] [-V] [--always-ok] [-c CRIT]
+             [--mta {auto,postfix,exim,sendmail}] [--test TEST] [-w WARN]
+
+Checks how long the oldest mail in the local mail queue has been waiting and
+alerts when it exceeds the configured duration thresholds. On hosts with
+Postfix, reads the queue via `postqueue -j` (JSON, with `arrival_time` as Unix
+epoch) for maximum accuracy. On Exim hosts, reads `mailq` (which is aliased to
+`exim -bp` by exim) and parses the age literal that exim prints next to each
+queued message. On other hosts, falls back to running `mailq` and parsing
+`Date:` lines from the output. A non-empty queue with 100 mails that are all a
+few minutes old is still OK, while a single mail stuck for more than an hour
+triggers a WARN, which matches how most admins actually want to be alerted on
+a mail queue.

 options:
-  -h, --help           show this help message and exit
-  -V, --version        show program's version number and exit
-  --always-ok          Always returns OK.
-  -c, --critical CRIT  CRIT threshold for the number of mails in the queue.
-                       Default: 250
-  --test TEST          For unit tests. Needs "path-to-stdout-file,path-to-
-                       stderr-file,expected-retc".
-  -w, --warning WARN   WARN threshold for the number of mails in the queue.
-                       Default: 2
+  -h, --help            show this help message and exit
+  -V, --version         show program's version number and exit
+  --always-ok           Always returns OK.
+  -c, --critical CRIT   CRIT threshold for the age of the oldest mail in the
+                        queue. Accepts a duration with a unit suffix (`Ns`,
+                        `Nm`, `Nh`, `ND`, `NW`, `NM`, `NY`, case-sensitive
+                        units). Example: `--critical=3D` to alert when the
+                        oldest mail has been in the queue for 3 days or more.
+                        Default: 3D
+  --mta {auto,postfix,exim,sendmail}
+                        Which mail transfer agent to query. The default `auto`
+                        probes for `postqueue` (Postfix), then `exim`/`exim4`
+                        (Exim), and falls back to `mailq` (Sendmail-style)
+                        otherwise. Override this if the detection picks the
+                        wrong MTA. Default: auto
+  --test TEST           For unit tests. Needs "path-to-stdout-file,path-to-
+                        stderr-file,expected-retc".
+  -w, --warning WARN    WARN threshold for the age of the oldest mail in the
+                        queue. Accepts a duration with a unit suffix (`Ns`,
+                        `Nm`, `Nh`, `ND`, `NW`, `NM`, `NY`, case-sensitive
+                        units). Example: `--warning=1h` to alert when the
+                        oldest mail has been in the queue for an hour or more.
+                        Default: 1h
 ```
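The probe order and the Postfix path described in the new help text can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's code: `detect_mta()`, `oldest_age_seconds()` and `evaluate()` are hypothetical names, the sample queue entries are invented, and only the `arrival_time` field and the one-JSON-object-per-line output of `postqueue -j` are taken as given.

```python
import json
import shutil


def detect_mta(which=shutil.which):
    """Probe for postqueue, then exim/exim4, else fall back to Sendmail-style mailq."""
    if which('postqueue'):
        return 'postfix'
    if which('exim') or which('exim4'):
        return 'exim'
    return 'sendmail'


def oldest_age_seconds(postqueue_json, now):
    """Return (age of the oldest mail in seconds, mail count) from `postqueue -j` output.

    `postqueue -j` prints one JSON object per queued message, one per line;
    an empty queue produces no output at all.
    """
    arrivals = []
    for line in postqueue_json.splitlines():
        line = line.strip()
        if line:
            arrivals.append(json.loads(line)['arrival_time'])
    if not arrivals:
        return 0, 0
    return max(0, now - min(arrivals)), len(arrivals)


def evaluate(age, warn, crit):
    """Map an age to a plugin state: 0 OK, 1 WARN, 2 CRIT ("or more" means >=)."""
    if age >= crit:
        return 2
    if age >= warn:
        return 1
    return 0


# Invented sample: one mail stuck for 10000 s, one fresh mail 100 s old.
SAMPLE = (
    '{"queue_name": "deferred", "queue_id": "4F1A2B3C4D", "arrival_time": 1700000000}\n'
    '{"queue_name": "active", "queue_id": "5A6B7C8D9E", "arrival_time": 1700009900}\n'
)

age, count = oldest_age_seconds(SAMPLE, now=1700010000)
state = evaluate(age, warn=3600, crit=3 * 86400)
# ASSUMPTION: perfdata layout follows the usual label=value[unit];warn;crit;min;max form.
print(f'state={state} | mailq={count};;;0; oldest_mail_age={age}s;3600;259200;0;')
```

Passing `now` in explicitly mirrors the idea behind the hidden `--now` test hook mentioned in the commit message: the age calculation stays deterministic in tests.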