Commit fa64795

feat(procs): add CPU usage monitoring, regex filters and top process table
Move --top from cpu-usage to procs with a detailed table showing CPU user/system/total time and process status per process name. Add --warning-cpu-percent/--critical-cpu-percent thresholds using SQLite delta calculation. Convert --command, --argument and --username filters to regular expressions. Show system uptime in the summary line for CPU time context.
1 parent 9773add commit fa64795
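
The switch from substring/startswith matching to regular expressions changes which process names a filter selects. The following is a minimal sketch of that behavior, assuming the filters use unanchored search semantics (as the changelog's "match anywhere in the name" suggests). The helper `matches` and the sample names are hypothetical and not taken from the plugin.

```python
import re


def matches(pattern, value):
    """Return True if the regular expression matches anywhere in value.

    Hypothetical helper: assumes unanchored re.search() semantics.
    """
    return re.search(pattern, value) is not None


# Illustrative process/user names, not real plugin output
names = ['httpd', 'php-httpd-helper', 'apache', 'apache2']

# Unanchored pattern: matches anywhere in the name
anywhere = [n for n in names if matches(r'httpd', n)]

# Anchored at the start: the previous startswith behavior
prefix = [n for n in names if matches(r'^httpd', n)]

# Anchored at both ends: exact match only
exact = [n for n in names if matches(r'^apache$', n)]

print(anywhere)  # ['httpd', 'php-httpd-helper']
print(prefix)    # ['httpd']
print(exact)     # ['apache']
```

Anchoring with `^` and `$` is the way to restore the narrower pre-change matching in existing service definitions.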

8 files changed

+599
-317
lines changed

CHANGELOG.md

Lines changed: 12 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -19,6 +19,10 @@ Build, CI/CD:
1919

2020
* Remove `flatdict` dependency as it breaks builds on newer targets (RHEL 10, SLE 15/16) and no fix is available that also supports older targets (such as RHEL 8). The statuspal plugin has been rewritten to no longer depend on `flatdict` ([#1044](https://github.com/Linuxfabrik/monitoring-plugins/issues/1044)).
2121

22+
Monitoring Plugins:
23+
24+
* procs: `--argument`, `--command` and `--username` now use regular expressions instead of substring/startswith matching. Existing filters like `--command=httpd` still work but now match anywhere in the name. Use `--command='^httpd'` for the previous startswith behavior, or `--username='^apache$'` for exact matches.
25+
2226

2327
### Added
2428

@@ -35,6 +39,8 @@ Monitoring Plugins:
3539
* infomaniak-swiss-backup-devices: add `--ignore-customer`, `--ignore-name`, `--ignore-tag`, `--ignore-user` parameters to skip devices by regex
3640
* infomaniak-swiss-backup-products: add `--ignore-customer`, `--ignore-tag` parameters to skip products by regex
3741
* nextcloud-enterprise: provides information about an installed Nextcloud Enterprise subscription
42+
* procs: add `--top` parameter to list the top N processes by CPU time (user/system/total) with status, excluding sleeping processes by default
43+
* procs: add `--warning-cpu-percent` / `--critical-cpu-percent` thresholds for aggregated CPU usage of filtered processes (requires SQLite for delta calculation between runs)
3844
* statuspal: also detect 'emergency-maintenance' state
3945
* valkey-status: support user and password credentials [PR #954](https://github.com/Linuxfabrik/monitoring-plugins/pull/954), thanks to [Claudio Kuenzler](https://github.com/Napsty)
4046

@@ -65,6 +71,7 @@ Monitoring Plugins:
6571

6672
* all plugins: ignore unknown arguments instead of generating an error (this helps with updating Icinga and Nagios service definitions considerably)
6773
* by-ssh, by-winrm, disk-usage, example, file-ownership, fs-ro, infomaniak-events, journald-query, logfile, matomo-reporting, mysql-logfile, php-status, pip-updates, systemd-unit: fix `append` parameters so that user-specified values replace defaults instead of being appended to them ([#540](https://github.com/Linuxfabrik/monitoring-plugins/issues/540))
74+
* cpu-usage: remove `--top` parameter (the top N processes by CPU time are now reported by the procs check via `--top`)
6875
* disk-io: also monitor normalized iowait on Linux (100% = one fully I/O-saturated core)
6976
* file-count: stopping when number of files actually exceed thresholds, therefore dramatically faster for large directories
7077
* file-ownership: `--filename` now merges with the default file list instead of replacing it; use `--no-default-files` to check only user-supplied files
@@ -80,6 +87,11 @@ Monitoring Plugins:
8087

8188
### Removed
8289

90+
Monitoring Plugins:
91+
92+
* cpu-usage: remove `--top` parameter (the top N processes by CPU time are now reported by the procs check via `--top`)
93+
94+
8395
Tools:
8496

8597
* remove legacy `grafana-tool`
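
The changelog notes that the CPU percentage thresholds need SQLite for a delta calculation between runs. The diff shown here does not include the procs implementation, so the following is only a sketch of the general technique under stated assumptions: the previous sample (timestamp and accumulated CPU seconds) is persisted, and usage is the CPU time consumed divided by the wall-clock time elapsed. The table `sample`, the function `cpu_percent_since_last_run` and all values are hypothetical.

```python
import sqlite3


def cpu_percent_since_last_run(conn, now, cpu_seconds):
    """Store the current sample and return CPU usage in percent since the
    previous one, or None if no usable previous sample exists.

    Hypothetical sketch: 100% is assumed to mean one fully used core,
    so multi-threaded workloads could exceed 100.
    """
    conn.execute(
        'CREATE TABLE IF NOT EXISTS sample (ts REAL, cpu_seconds REAL)'
    )
    row = conn.execute(
        'SELECT ts, cpu_seconds FROM sample ORDER BY ts DESC LIMIT 1'
    ).fetchone()
    conn.execute(
        'INSERT INTO sample (ts, cpu_seconds) VALUES (?, ?)',
        (now, cpu_seconds),
    )
    conn.commit()

    if row is None:
        # First run: nothing to compare against yet
        return None
    elapsed = now - row[0]
    used = cpu_seconds - row[1]
    if elapsed <= 0 or used < 0:
        # Clock jump, or processes restarted and their counters reset
        return None
    return used / elapsed * 100.0


conn = sqlite3.connect(':memory:')
first = cpu_percent_since_last_run(conn, now=1000.0, cpu_seconds=50.0)
second = cpu_percent_since_last_run(conn, now=1060.0, cpu_seconds=80.0)
print(first)   # None
print(second)  # 50.0
```

The first run yields no value because there is no earlier sample, which is why a delta-based threshold can only alert from the second run onwards.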

check-plugins/cpu-usage/README.md

Lines changed: 5 additions & 16 deletions
@@ -16,8 +16,6 @@ Monitors system-wide CPU utilization with sustained load detection to avoid fals
1616
* System-wide aggregate CPU statistics (not per-core)
1717
* Non-blocking measurement using SQLite state persistence between runs
1818
* Platform-specific extended metrics where available (context switches, interrupts, soft interrupts)
19-
* Optional top-N CPU-consuming processes (`--top`, default: 5)
20-
2119
**Compatibility:**
2220

2321
* Cross-platform: Linux, Windows, and all psutil-supported systems
@@ -40,8 +38,7 @@ Monitors system-wide CPU utilization with sustained load detection to avoid fals
4038
## Help
4139

4240
```text
43-
usage: cpu-usage [-h] [-V] [--always-ok] [--count COUNT] [-c CRIT] [--top TOP]
44-
[-w WARN]
41+
usage: cpu-usage [-h] [-V] [--always-ok] [--count COUNT] [-c CRIT] [-w WARN]
4542
4643
Reports CPU utilization percentages for all available time categories (user,
4744
system, idle, nice, iowait, irq, softirq, steal, guest, guest_nice) plus the
@@ -50,10 +47,9 @@ against user, system, iowait, and cpu-usage. An alert is raised only if the
5047
threshold is exceeded for COUNT consecutive runs, suppressing short spikes and
5148
focusing on sustained load. Perfdata is emitted for every field to enable full
5249
graphing. Extended stats (context switches, interrupts, etc.) are included if
53-
supported on this platform. With `--top`, the most CPU-intensive processes are
54-
also listed for quick diagnosis. This check is cross-platform and works on
55-
Linux, Windows, and all psutil-supported systems. The check stores its short
56-
trend state locally in an SQLite DB to evaluate sustained load across runs.
50+
supported on this platform. This check is cross-platform and works on Linux,
51+
Windows, and all psutil-supported systems. The check stores its short trend
52+
state locally in an SQLite DB to evaluate sustained load across runs.
5753
5854
options:
5955
-h, --help show this help message and exit
@@ -63,8 +59,6 @@ options:
6359
thresholds before alerting. Default: 5
6460
-c, --critical CRIT Set the critical threshold CPU Usage Percentage.
6561
Default: 90
66-
--top TOP List x "Top processes using the most cpu time". Use
67-
`--top=0` to disable this feature. Default: 5
6862
-w, --warning WARN Set the warning threshold CPU Usage Percentage.
6963
Default: 80
7064
```
@@ -73,7 +67,7 @@ options:
7367
## Usage Examples
7468

7569
```bash
76-
./cpu-usage --count=15 --warning=50 --critical=70 --top=3
70+
./cpu-usage --count=15 --warning=50 --critical=70
7771
```
7872

7973
Output:
@@ -82,11 +76,6 @@ Output:
8276
2.6% - user: 1.6%, system: 0.7%, irq: 0.2%, softirq: 0.1%
8377
guest: 0.0%, iowait: 0.0%, guest_nice: 0.0%, steal: 0.0%, nice: 0.0%
8478
interrupts: 582.9M, soft_interrupts: 343.6M, ctx_switches: 1.1G
85-
86-
Top 3 processes using the most cpu time:
87-
1. Xorg: 2h 13m
88-
2. gnome-shell: 2h 1m
89-
3. firefox: 1h 24m
9079
```
9180

9281

check-plugins/cpu-usage/cpu-usage

Lines changed: 2 additions & 72 deletions
@@ -13,7 +13,6 @@
1313

1414
import argparse
1515
import sys
16-
from collections import Counter
1716
from types import SimpleNamespace
1817

1918
import lib.base
@@ -30,7 +29,7 @@ except ImportError:
3029

3130

3231
__author__ = 'Linuxfabrik GmbH, Zurich/Switzerland'
33-
__version__ = '2025100601'
32+
__version__ = '2026040701'
3433

3534
DESCRIPTION = """Reports CPU utilization percentages for all available time categories
3635
(user, system, idle, nice, iowait, irq, softirq, steal, guest, guest_nice) plus the overall
@@ -41,8 +40,7 @@ only if the threshold is exceeded for COUNT consecutive runs, suppressing short
4140
on sustained load.
4241
4342
Perfdata is emitted for every field to enable full graphing. Extended stats (context switches,
44-
interrupts, etc.) are included if supported on this platform. With `--top`, the most CPU-intensive
45-
processes are also listed for quick diagnosis.
43+
interrupts, etc.) are included if supported on this platform.
4644
4745
This check is cross-platform and works on Linux, Windows, and all psutil-supported systems.
4846
The check stores its short trend state locally in an SQLite DB to evaluate sustained load across
@@ -51,7 +49,6 @@ runs."""
5149

5250
DEFAULT_COUNT = 5 # measurements; if check runs once per minute, this is a 5 minute interval
5351
DEFAULT_CRIT = 90 # %
54-
DEFAULT_TOP = 5
5552
DEFAULT_WARN = 80 # %
5653

5754

@@ -92,16 +89,6 @@ def parse_args():
9289
default=DEFAULT_CRIT,
9390
)
9491

95-
parser.add_argument(
96-
'--top',
97-
help='List x "Top processes using the most cpu time". '
98-
'Use `--top=0` to disable this feature. '
99-
'Default: %(default)s',
100-
dest='TOP',
101-
type=int,
102-
default=DEFAULT_TOP,
103-
)
104-
10592
parser.add_argument(
10693
'-w', '--warning',
10794
help='Set the warning threshold CPU Usage Percentage. '
@@ -238,60 +225,6 @@ def get_from_db(conn, threshold):
238225
return int(result['cnt'])
239226

240227

241-
def top(count):
242-
"""Get top X processes using the most cpu time.
243-
"""
244-
# Fast path: nothing to print, so nothing to scan
245-
if count <= 0:
246-
return ''
247-
248-
cnt = Counter()
249-
msg = f'\n\nTop {count} processes using the most cpu time:\n'
250-
251-
# Prefer attrs path (psutil >= 5.3.0): fewer syscalls, fewer exceptions
252-
if lib.version.version(psutil.__version__) >= lib.version.version('5.3.0'):
253-
try:
254-
# name + cpu_times are all we need
255-
# use ad_value to avoid AccessDenied; ad_value=None keeps types intact
256-
for p in psutil.process_iter(attrs=['name', 'cpu_times'], ad_value=None):
257-
try:
258-
name = p.info.get('name') or ''
259-
if lib.base.WINDOWS and name == 'System Idle Process':
260-
# yes, the System Idle Process on Windows consumes CPU time
261-
continue
262-
cput = p.info.get('cpu_times')
263-
if cput:
264-
# user + system
265-
cnt[name] += (cput.user or 0) + (getattr(cput, 'system', 0) or 0)
266-
except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
267-
# process vanished or denied: skip
268-
continue
269-
except Exception:
270-
# Defensive: if psutil attrs path misbehaves on some platform/version, fall back below.
271-
pass
272-
273-
# Legacy / fallback path
274-
if not cnt:
275-
try:
276-
for proc in psutil.process_iter():
277-
try:
278-
p = proc.as_dict(attrs=['name', 'cpu_times'])
279-
except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
280-
continue
281-
name = p.get('name') or ''
282-
if lib.base.WINDOWS and name == 'System Idle Process':
283-
continue
284-
cput = p.get('cpu_times')
285-
if cput:
286-
cnt[name] += sum(cput[:2])
287-
except psutil.NoSuchProcess:
288-
pass
289-
290-
for i, (name, seconds) in enumerate(cnt.most_common(count), start=1):
291-
msg += f'{i}. {name}: {lib.human.seconds2human(seconds)}\n'
292-
return msg
293-
294-
295228
def main():
296229
"""The main function. This is where the action is.
297230
"""
@@ -443,9 +376,6 @@ def main():
443376
perfdata += lib.base.get_perfdata(key, val, 'c', None, None, 0, None)
444377
msg += '\n' + ', '.join(ext_parts)
445378

446-
# Top X processes using the most cpu time
447-
msg += top(args.TOP)
448-
449379
# over and out
450380
lib.base.oao(msg, state, perfdata, always_ok=args.ALWAYS_OK)
451381

check-plugins/cpu-usage/icingaweb2-module-director/cpu-usage.json

Lines changed: 10 additions & 50 deletions
@@ -11,9 +11,6 @@
1111
"--critical": {
1212
"value": "$cpu_usage_critical$"
1313
},
14-
"--top": {
15-
"value": "$cpu_usage_top$"
16-
},
1714
"--warning": {
1815
"value": "$cpu_usage_warning$"
1916
}
@@ -40,11 +37,6 @@
4037
"datafield_id": 4,
4138
"is_required": "n",
4239
"var_filter": null
43-
},
44-
{
45-
"datafield_id": 5,
46-
"is_required": "n",
47-
"var_filter": null
4840
}
4941
],
5042
"imports": [],
@@ -68,9 +60,6 @@
6860
"--critical": {
6961
"value": "$cpu_usage_windows_critical$"
7062
},
71-
"--top": {
72-
"value": "$cpu_usage_windows_top$"
73-
},
7463
"--warning": {
7564
"value": "$cpu_usage_windows_warning$"
7665
}
@@ -79,27 +68,22 @@
7968
"disabled": false,
8069
"fields": [
8170
{
82-
"datafield_id": 6,
83-
"is_required": "n",
84-
"var_filter": null
85-
},
86-
{
87-
"datafield_id": 7,
71+
"datafield_id": 5,
8872
"is_required": "n",
8973
"var_filter": null
9074
},
9175
{
92-
"datafield_id": 8,
76+
"datafield_id": 6,
9377
"is_required": "n",
9478
"var_filter": null
9579
},
9680
{
97-
"datafield_id": 9,
81+
"datafield_id": 7,
9882
"is_required": "n",
9983
"var_filter": null
10084
},
10185
{
102-
"datafield_id": 10,
86+
"datafield_id": 8,
10387
"is_required": "n",
10488
"var_filter": null
10589
}
@@ -162,7 +146,7 @@
162146
"tpl-service-generic"
163147
],
164148
"max_check_attempts": 5,
165-
"notes": "Reports CPU utilization percentages for all available time categories (user, system, idle, nice, iowait, irq, softirq, steal, guest, guest_nice) plus the overall cpu-usage (100 \u2212 idle \u2212 nice). Thresholds (WARN/CRIT) are checked against user, system, iowait, and cpu-usage. An alert is raised only if the threshold is exceeded for COUNT consecutive runs, suppressing short spikes and focusing on sustained load. Perfdata is emitted for every field to enable full graphing. Extended stats (context switches, interrupts, etc.) are included if supported on this platform. With `--top`, the most CPU-intensive processes are also listed for quick diagnosis. This check is cross-platform and works on Linux, Windows, and all psutil-supported systems. The check stores its short trend state locally in an SQLite DB to evaluate sustained load across runs.",
149+
"notes": "Reports CPU utilization percentages for all available time categories (user, system, idle, nice, iowait, irq, softirq, steal, guest, guest_nice) plus the overall cpu-usage (100 \u2212 idle \u2212 nice). Thresholds (WARN/CRIT) are checked against user, system, iowait, and cpu-usage. An alert is raised only if the threshold is exceeded for COUNT consecutive runs, suppressing short spikes and focusing on sustained load. Perfdata is emitted for every field to enable full graphing. Extended stats (context switches, interrupts, etc.) are included if supported on this platform. This check is cross-platform and works on Linux, Windows, and all psutil-supported systems. The check stores its short trend state locally in an SQLite DB to evaluate sustained load across runs.",
166150
"notes_url": "https://github.com/Linuxfabrik/monitoring-plugins/tree/main/check-plugins/cpu-usage",
167151
"object_name": "tpl-service-cpu-usage",
168152
"object_type": "template",
@@ -176,7 +160,6 @@
176160
"cpu_usage_always_ok": false,
177161
"cpu_usage_count": 5,
178162
"cpu_usage_critical": 90,
179-
"cpu_usage_top": 5,
180163
"cpu_usage_warning": 80
181164
},
182165
"volatile": null,
@@ -212,7 +195,7 @@
212195
"tpl-service-generic"
213196
],
214197
"max_check_attempts": 5,
215-
"notes": "Reports CPU utilization percentages for all available time categories (user, system, idle, nice, iowait, irq, softirq, steal, guest, guest_nice) plus the overall cpu-usage (100 \u2212 idle \u2212 nice). Thresholds (WARN/CRIT) are checked against user, system, iowait, and cpu-usage. An alert is raised only if the threshold is exceeded for COUNT consecutive runs, suppressing short spikes and focusing on sustained load. Perfdata is emitted for every field to enable full graphing. Extended stats (context switches, interrupts, etc.) are included if supported on this platform. With `--top`, the most CPU-intensive processes are also listed for quick diagnosis. This check is cross-platform and works on Linux, Windows, and all psutil-supported systems. The check stores its short trend state locally in an SQLite DB to evaluate sustained load across runs.",
198+
"notes": "Reports CPU utilization percentages for all available time categories (user, system, idle, nice, iowait, irq, softirq, steal, guest, guest_nice) plus the overall cpu-usage (100 \u2212 idle \u2212 nice). Thresholds (WARN/CRIT) are checked against user, system, iowait, and cpu-usage. An alert is raised only if the threshold is exceeded for COUNT consecutive runs, suppressing short spikes and focusing on sustained load. Perfdata is emitted for every field to enable full graphing. Extended stats (context switches, interrupts, etc.) are included if supported on this platform. This check is cross-platform and works on Linux, Windows, and all psutil-supported systems. The check stores its short trend state locally in an SQLite DB to evaluate sustained load across runs.",
216199
"notes_url": "https://github.com/Linuxfabrik/monitoring-plugins/tree/main/check-plugins/cpu-usage",
217200
"object_name": "tpl-service-cpu-usage-windows",
218201
"object_type": "template",
@@ -226,7 +209,6 @@
226209
"cpu_usage_windows_always_ok": false,
227210
"cpu_usage_windows_count": 5,
228211
"cpu_usage_windows_critical": 90,
229-
"cpu_usage_windows_top": 5,
230212
"cpu_usage_windows_warning": 80
231213
},
232214
"volatile": null,
@@ -267,17 +249,6 @@
267249
"uuid": "a9c69583-5e98-4ae8-bec0-a7f4826dceb1"
268250
},
269251
"4": {
270-
"varname": "cpu_usage_top",
271-
"caption": "CPU Usage: Top",
272-
"description": "List x \"Top processes using the most cpu time\". Use `--top=0` to disable this feature.",
273-
"datatype": "Icinga\\Module\\Director\\DataType\\DataTypeString",
274-
"format": null,
275-
"settings": {
276-
"visibility": "visible"
277-
},
278-
"uuid": "217b99f6-47ba-45ea-9dfd-43fe139fe2d1"
279-
},
280-
"5": {
281252
"varname": "cpu_usage_warning",
282253
"caption": "CPU Usage: Warning",
283254
"description": "Set the warning threshold CPU Usage Percentage.",
@@ -288,7 +259,7 @@
288259
},
289260
"uuid": "d59b27e4-44f7-4fa6-aaf4-36e4885af075"
290261
},
291-
"6": {
262+
"5": {
292263
"varname": "cpu_usage_windows_always_ok",
293264
"caption": "CPU Usage: Always OK?",
294265
"description": "Always returns OK.",
@@ -297,7 +268,7 @@
297268
"settings": {},
298269
"uuid": "5a7722fb-e5ab-48a2-bcc9-8dc3036e92de"
299270
},
300-
"7": {
271+
"6": {
301272
"varname": "cpu_usage_windows_count",
302273
"caption": "CPU Usage: Count",
303274
"description": "Number of times the value must exceed specified thresholds before alerting.",
@@ -308,7 +279,7 @@
308279
},
309280
"uuid": "f3f24630-5a4a-4461-accc-0c4399ed9d8a"
310281
},
311-
"8": {
282+
"7": {
312283
"varname": "cpu_usage_windows_critical",
313284
"caption": "CPU Usage: Critical",
314285
"description": "Set the critical threshold CPU Usage Percentage.",
@@ -319,18 +290,7 @@
319290
},
320291
"uuid": "368a0e17-0456-4024-8b4b-412a6367d387"
321292
},
322-
"9": {
323-
"varname": "cpu_usage_windows_top",
324-
"caption": "CPU Usage: Top",
325-
"description": "List x \"Top processes using the most cpu time\". Use `--top=0` to disable this feature.",
326-
"datatype": "Icinga\\Module\\Director\\DataType\\DataTypeString",
327-
"format": null,
328-
"settings": {
329-
"visibility": "visible"
330-
},
331-
"uuid": "838e00dc-a791-4354-ae03-56794931f211"
332-
},
333-
"10": {
293+
"8": {
334294
"varname": "cpu_usage_windows_warning",
335295
"caption": "CPU Usage: Warning",
336296
"description": "Set the warning threshold CPU Usage Percentage.",
