Commit e9e6b1c

refactor(docker-stats,podman-stats): replace per-container perfdata with aggregates

- Remove per-container cpu_usage and mem_usage perfdata (too dynamic, containers come and go)
- Add aggregate perfdata: containers_running, cpu
- podman-stats: switch to JSON output via podman stats --format '{{json .}}' for precise numeric values; add block_input, block_output, images, net_rx, net_tx, ram perfdata
- Use CRIT (not UNKNOWN) on return codes != 0 across all four plugins
- Align States section in all READMEs

1 parent e9f64f5 commit e9e6b1c

14 files changed, +164 -118 lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -132,6 +132,8 @@ Monitoring Plugins:
 * by-ssh: add missing `--verbose` parameter
 * cpu-usage: fix false 100% readings on Windows with 64+ cores caused by all-zero CPU time samples from psutil ([#626](https://github.com/Linuxfabrik/monitoring-plugins/issues/626))
 * docker-stats: fix memory perfdata using CPU thresholds instead of memory thresholds
+* docker-stats: replace per-container perfdata with aggregate metrics (containers, cpu)
+* podman-stats: use `podman stats --format '{{json .}}'` for precise numeric values; aggregate perfdata includes block I/O and network I/O totals
 * file-age: handle `FileNotFoundError` race condition when files disappear on busy file systems
 * fs-ro: ignore `/run/credentials` (https://systemd.io/CREDENTIALS/)
 * keycloak-stats: fix incorrect symlink for lib

check-plugins/docker-info/docker-info

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@ def main():
         lib.shell.shell_exec('docker info'),
     )
     if retc != 0:
-        lib.base.cu(f'{stderr}\n{stdout}')
+        lib.base.oao(f'{stderr}\n{stdout}', STATE_CRIT)
     if 'server version:' not in stdout.lower():
         lib.base.cu(
             'Unable to parse docker info output.'
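
Note on the change above: a non-zero return code now exits via `lib.base.oao(..., STATE_CRIT)` instead of `lib.base.cu(...)`, which per the commit message moves the result from UNKNOWN to CRIT. The following is only a minimal standalone sketch of that decision, assuming the conventional Nagios/Icinga exit codes (0 OK, 1 WARN, 2 CRIT, 3 UNKNOWN). `state_for_command` is a hypothetical helper for illustration and is not part of the Linuxfabrik library.

```python
# Conventional Nagios/Icinga exit codes (assumed here, not imported from lib.globals).
STATE_OK, STATE_WARN, STATE_CRIT, STATE_UNKNOWN = 0, 1, 2, 3


def state_for_command(retc, stdout, stderr):
    """Return (state, message) for a finished command (hypothetical helper)."""
    if retc != 0:
        # A failing command is reported as CRIT rather than UNKNOWN.
        return STATE_CRIT, f'{stderr}\n{stdout}'.strip()
    return STATE_OK, stdout.strip()


state, message = state_for_command(1, '', 'Cannot connect to the Docker daemon')
print(state, message)
```

Returning the state instead of exiting directly keeps the sketch easy to test; the plugins themselves print and exit in one step.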

check-plugins/docker-stats/README.md

Lines changed: 9 additions & 8 deletions
@@ -83,19 +83,20 @@ myconti_ds_1 ! 0.0 ! 11.42

 ## States

-Alerts if
+* CRIT on `docker info` or `docker stats` return codes != 0
+* WARN if any container cpu usage is above the warning cpu threshold during the last n checks (default: 5)
+* CRIT if any container cpu usage is above the critical cpu threshold during the last n checks (default: 5)
+* WARN or CRIT if any container memory usage is above the memory thresholds

-* any container memory usage is above the memory thresholds
-* any container cpu usage is above the cpu thresholds during the last n checks (default: 5)
+CPU usage is normalized by dividing by the number of host CPUs, so 100% means all host CPUs are fully utilized. On an 8-core system, a container using one core at full capacity would show 12.5%. Memory usage is relative to the container's memory limit if one is set, otherwise relative to the total host memory.


 ## Perfdata / Metrics

-| Name                         | Type       | Description                        |
-|------------------------------|------------|------------------------------------|
-| cpu                          | Number     | Number of Host CPUs                |
-| \<containername\>\_cpu_usage | Percentage | Container's CPU usage (normalized) |
-| \<containername\>\_mem_usage | Percentage | Container's memory usage (Percent) |
+| Name       | Type   | Description                  |
+|------------|--------|------------------------------|
+| containers_running | Number | Number of running containers |
+| cpu        | Number | Number of Host CPUs          |


 ## Credits, License
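
Note on the normalization described in the new README text: the 12.5% figure follows from dividing the raw per-container CPU percentage by the number of host CPUs. A small sketch of that arithmetic, mirroring the expression used in the plugin source (`normalize_cpu` is a name chosen here for illustration only):

```python
def normalize_cpu(cpu_percent, host_cpus):
    """Turn a raw CPU string such as '100.00%' into a host-wide percentage."""
    # Strip the percent sign, then divide by the number of host CPUs.
    return round(float(cpu_percent.replace('%', '').strip()) / host_cpus, 1)


# One fully used core on an 8-core host.
print(normalize_cpu('100.00%', 8))
# All eight cores fully used.
print(normalize_cpu('800.00%', 8))
```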

check-plugins/docker-stats/docker-stats

Lines changed: 17 additions & 28 deletions
@@ -24,7 +24,7 @@ from lib.globals import (STATE_CRIT, STATE_OK,
                          STATE_UNKNOWN, STATE_WARN)

 __author__ = 'Linuxfabrik GmbH, Zurich/Switzerland'
-__version__ = '2026040802'
+__version__ = '2026040803'

 DESCRIPTION = """This check prints cpu and memory statistics for all running Docker
 containers, using the "docker stats" command. Container CPU usage is divided
@@ -166,7 +166,7 @@ def main():
             lib.shell.shell_exec('docker info'),
         )
         if retc != 0:
-            lib.base.cu(f'{stderr}\n{stdout}')
+            lib.base.oao(f'{stderr}\n{stdout}', STATE_CRIT)
         if 'server version:' not in stdout.lower():
             lib.base.cu(
                 'Unable to parse docker info output.'
@@ -181,7 +181,7 @@ def main():
             lib.shell.shell_exec('docker stats --no-stream'),
         )
         if retc != 0:
-            lib.base.cu(stderr)
+            lib.base.oao(stderr, STATE_CRIT)
     else:
         # do not call the command, put in test data
         host_cpus = 1
@@ -190,11 +190,7 @@ def main():
     # init some vars
     msg = ''
     state = STATE_OK
-    perfdata = lib.base.get_perfdata(
-        'cpu',
-        host_cpus,
-        _min=0,
-    )
+    perfdata = ''
     table_values = []

     # analyze data
@@ -209,22 +205,13 @@ def main():
             if not args.FULL_NAME:
                 name = shorten(name)
             cpu_percent = container[2]
-            mem_usage = container[3]
             mem_percent = container[6]
         except Exception:
             continue

         # divide by number of cores (got by docker info)
         cpu_usage = round(float(cpu_percent.replace('%', '').strip()) / host_cpus, 1)
-        perfdata += lib.base.get_perfdata(
-            f'{name}_cpu_usage',
-            cpu_usage,
-            uom='%',
-            warn=args.WARN_CPU,
-            crit=args.CRIT_CPU,
-            _min=0,
-            _max=100,
-        )
+        mem_usage = round(float(mem_percent.replace('%', '').strip()), 1)

         # save trend data to local sqlite database, limited to "count" rows max.
         lib.base.coe(
@@ -246,16 +233,6 @@ def main():
         state = lib.base.get_worst(cpu_state, state)

         # alert when container mem_usage is exceeded
-        mem_usage = float(mem_percent.replace('%', '').strip())
-        perfdata += lib.base.get_perfdata(
-            f'{name}_mem_usage',
-            mem_usage,
-            uom='%',
-            warn=args.WARN_MEM,
-            crit=args.CRIT_MEM,
-            _min=0,
-            _max=100,
-        )
         mem_state = lib.base.get_state(mem_usage, args.WARN_MEM, args.CRIT_MEM)
         if mem_state != STATE_OK:
             msg += f'"{name}" memory {mem_usage}% {lib.base.state2str(mem_state)}, '
@@ -271,6 +248,18 @@ def main():
     lib.db_sqlite.commit(conn)
     lib.db_sqlite.close(conn)

+    # build perfdata
+    perfdata += lib.base.get_perfdata(
+        'containers_running',
+        len(table_values),
+        _min=0,
+    )
+    perfdata += lib.base.get_perfdata(
+        'cpu',
+        host_cpus,
+        _min=0,
+    )
+
     # create output
     if state == STATE_OK:
         msg = f'Everything is ok, {len(table_values)} containers checked.\n\n'
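
Note on the aggregate perfdata built above: the plugin now emits two fixed labels (`containers_running`, `cpu`) instead of one pair of labels per container. The sketch below shows what such a string can look like in the common Nagios perfdata layout `'label'=value[UOM];warn;crit;min;max`. `format_perfdata` is a hypothetical stand-in written for this note; it is not `lib.base.get_perfdata`, whose exact output may differ.

```python
def format_perfdata(label, value, uom='', warn='', crit='', _min='', _max=''):
    """Format one metric as 'label'=value[UOM];warn;crit;min;max (hypothetical helper)."""
    return f"'{label}'={value}{uom};{warn};{crit};{_min};{_max} "


# Example values only: three running containers on an 8-CPU host.
perfdata = ''
perfdata += format_perfdata('containers_running', 3, _min=0)
perfdata += format_perfdata('cpu', 8, _min=0)
print(perfdata.strip())
```

Because the label set no longer depends on container names, the number of time series stays constant as containers come and go, which is the motivation given in the commit message.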

check-plugins/docker-stats/unit-test/run

Lines changed: 6 additions & 6 deletions
@@ -57,9 +57,9 @@ class TestCheck(unittest.TestCase):
         self.assertIn('Everything is ok,', stdout)
         self.assertIn('Container ! CPU % ! Mem %', stdout)
         self.assertIn('--------------+-------+------', stdout)
-        self.assertIn('elasticsearch ! 188.8 ! 16.73', stdout)
-        self.assertIn('graylog ! 204.2 ! 5.69', stdout)
-        self.assertIn('mongo ! 0.3 ! 1.95', stdout)
+        self.assertIn('elasticsearch ! 188.8 ! 16.7', stdout)
+        self.assertIn('graylog ! 204.2 ! 5.7', stdout)
+        self.assertIn('mongo ! 0.3 ! 1.9', stdout)
         self.assertEqual(stderr, '')
         self.assertEqual(retc, STATE_OK)

@@ -68,9 +68,9 @@ class TestCheck(unittest.TestCase):
         self.assertIn('Everything is ok,', stdout)
         self.assertIn('Container ! CPU % ! Mem %', stdout)
         self.assertIn('--------------------------------------------------------------------+-------+------', stdout)
-        self.assertIn('runner-7ayh6h5f-project-107-concurrent-0-37b2c7aee9359db9-build ! 95.0 ! 1.22 ', stdout)
-        self.assertIn('runner-7ayh6h5f-project-19-concurrent-0-99f0211c36d59d01-build ! 59.5 ! 0.99 ', stdout)
-        self.assertIn('runner-7ayh6h5f-project-49-concurrent-0-e180afe41fc754dc-predefined ! 79.5 ! 0.15', stdout)
+        self.assertIn('runner-7ayh6h5f-project-107-concurrent-0-37b2c7aee9359db9-build ! 95.0 ! 1.2', stdout)
+        self.assertIn('runner-7ayh6h5f-project-19-concurrent-0-99f0211c36d59d01-build ! 59.5 ! 1.0', stdout)
+        self.assertIn('runner-7ayh6h5f-project-49-concurrent-0-e180afe41fc754dc-predefined ! 79.5 ! 0.1', stdout)
         self.assertEqual(stderr, '')
         self.assertEqual(retc, STATE_OK)
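
Note on the changed expectations: memory values are now passed through `round(..., 1)`, so the table shows one decimal place. Values that end in 5 at the second decimal can round down (1.95 becomes 1.9, 0.15 becomes 0.1) because neither number is exactly representable in binary floating point. A short sketch reproducing the values asserted above, using the same expression as the plugin (`mem_percent_to_usage` is a name chosen for this note):

```python
def mem_percent_to_usage(mem_percent):
    """Convert a raw memory string such as '16.73%' to a float with one decimal."""
    return round(float(mem_percent.replace('%', '').strip()), 1)


# Raw values taken from the old test expectations.
for raw in ('16.73%', '5.69%', '1.95%', '1.22%', '0.99%', '0.15%'):
    print(raw, '->', mem_percent_to_usage(raw))
```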

check-plugins/podman-info/README.md

Lines changed: 2 additions & 0 deletions
@@ -51,6 +51,8 @@ Output:

 ## States

+* WARN on `podman info` warnings
+* CRIT on `podman info` errors
 * CRIT on `podman info` return codes != 0


check-plugins/podman-info/podman-info

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ def main():
         lib.shell.shell_exec('podman info --format json'),
     )
     if retc != 0:
-        lib.base.cu(f'{stderr}\n{stdout}')
+        lib.base.oao(f'{stderr}\n{stdout}', STATE_CRIT)
     try:
         result = json.loads(stdout)
     except Exception:

check-plugins/podman-stats/README.md

Lines changed: 15 additions & 8 deletions
@@ -84,19 +84,26 @@ myconti_ds_1 ! 0.0 ! 11.42

 ## States

-Alerts if
+* CRIT on `podman info` or `podman stats` return codes != 0
+* WARN if any container cpu usage is above the warning cpu threshold during the last n checks (default: 5)
+* CRIT if any container cpu usage is above the critical cpu threshold during the last n checks (default: 5)
+* WARN or CRIT if any container memory usage is above the memory thresholds

-* any container memory usage is above the memory thresholds
-* any container cpu usage is above the cpu thresholds during the last n checks (default: 5)
+CPU usage is normalized by dividing by the number of host CPUs, so 100% means all host CPUs are fully utilized. On an 8-core system, a container using one core at full capacity would show 12.5%. Memory usage is relative to the container's memory limit if one is set, otherwise relative to the total host memory.


 ## Perfdata / Metrics

-| Name                         | Type       | Description                        |
-|------------------------------|------------|------------------------------------|
-| cpu                          | Number     | Number of Host CPUs                |
-| \<containername\>\_cpu_usage | Percentage | Container's CPU usage (normalized) |
-| \<containername\>\_mem_usage | Percentage | Container's memory usage (Percent) |
+| Name               | Type   | Description                                              |
+|--------------------|--------|----------------------------------------------------------|
+| block_input        | Bytes  | Total data read from block device across all containers  |
+| block_output       | Bytes  | Total data written to block device across all containers |
+| containers_running | Number | Number of running containers                             |
+| cpu                | Number | Number of Host CPUs                                      |
+| images             | Number | Number of images                                         |
+| net_rx             | Bytes  | Total network bytes received across all containers       |
+| net_tx             | Bytes  | Total network bytes transmitted across all containers    |
+| ram                | Bytes  | Total Host Memory                                        |


 ## Credits, License
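
Note on the aggregate metrics listed above: the podman-stats source is not part of this excerpt, so the following is only a rough sketch of how per-container JSON records could be summed into totals such as `net_rx` or `block_input`. It assumes one JSON object per container per line, and the input field names in `SAMPLE` are hypothetical placeholders; the keys that `podman stats --format '{{json .}}'` really emits may differ and should be checked against actual output.

```python
import json

# Sample input with HYPOTHETICAL field names; real podman output may use other keys.
SAMPLE = '''
{"name": "web", "net_rx": 1200, "net_tx": 300, "block_input": 4096, "block_output": 0}
{"name": "db", "net_rx": 800, "net_tx": 700, "block_input": 8192, "block_output": 2048}
'''


def aggregate(lines, keys=('block_input', 'block_output', 'net_rx', 'net_tx')):
    """Sum the given numeric keys over all JSON lines and count the containers."""
    totals = dict.fromkeys(keys, 0)
    running = 0
    for line in lines.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            container = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        running += 1
        for key in keys:
            totals[key] += container.get(key, 0)
    totals['containers_running'] = running
    return totals


print(aggregate(SAMPLE))
```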
