Commit 685facd

[configure] Collecting Performance Metrics With sar and iostat on Minimal-OS Nodes
1 parent c8b7b50 commit 685facd

1 file changed

Lines changed: 89 additions & 0 deletions

@@ -0,0 +1,89 @@
---
kind:
  - How To
products:
  - Alauda Container Platform
ProductsVersion:
  - 4.1.0,4.2.x
---
## Issue

Troubleshooting a slow node usually starts with `sar`, `iostat`, or `vmstat` to see CPU load, memory pressure, disk throughput, and context-switch rates. On modern container-optimised node OSes these tools are not installed into the host PATH by design: the OS image is intentionally minimal. Running `sar` over SSH returns `command not found`, and operators lose a familiar first-level triage surface.

## Root Cause

ACP nodes typically run an immutable, minimal operating system. These distributions omit discretionary packages (including `sysstat`, which provides `sar`, `iostat`, and `mpstat`) to reduce the attack surface and keep the image small. Installing the package directly on the host is either impossible (read-only root) or discouraged (the change is rolled back at the next node reconcile).

The pragmatic replacement is `kubectl debug node`. It schedules a debug pod on the target node that runs in the host's network, PID, and IPC namespaces, mounts the host's root filesystem at `/host`, and runs a toolbox image of your choosing. The pod is privileged only when started with a profile that grants it, such as `--profile=sysadmin`. Anything that would have been installed on the host can live in that image instead, so a node is never modified outside its declarative configuration.

## Resolution

Run `sysstat` inside a debug container rather than on the node. Two patterns work well:

### One-Off Samples

For an ad-hoc look, run a one-shot debug pod that includes the tools. Any image carrying `sysstat` works; the example below uses a public image, so confirm that it ships the package before relying on it.

```bash
NODE=<node-name>
IMAGE=quay.io/praqma/network-multitool:latest   # assumed to carry sysstat; verify first

# Run the tools from the debug image. The host has no sysstat, so a
# `chroot /host` here would end in "command not found".
kubectl debug node/$NODE -it --image=$IMAGE -- sh -c '
  sar -q 1 10      # load average
  sar -r 1 10      # memory
  sar -u 1 10      # CPU total
  sar -P ALL 1 10  # CPU per core
  sar -d 1 10      # block devices
  sar -w 1 10      # context switches
  iostat -xz 1 10  # extended I/O stats
'
```

Do not wrap these commands in `chroot /host`: the host image has no `sysstat`, so the binaries must come from the debug image. The tools read `/proc` and `/sys`, and because the debug pod runs in the host's PID and network namespaces, its own `/proc` reports node-wide CPU, memory, and disk counters. If a counter is unreadable under the default profile, start the pod with `--profile=sysadmin`.
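
When only one figure is needed from a saved `sar -u` transcript, the summary row can be filtered out with `awk`. The following is a minimal sketch: the function name `sar_avg_idle` and the sample text are illustrative, and it assumes English-locale output (for example with `LC_ALL=C`) where the summary row starts with `Average:` and `%idle` is the last column.

```bash
# Print the average %idle from `sar -u` text read on stdin.
# Assumption: English output, summary row begins with "Average:",
# and %idle is the final column. Trailing CRs from a TTY are stripped.
sar_avg_idle() {
  awk '{ sub(/\r$/, "") } $1 == "Average:" && $2 == "all" { print $NF }'
}

# Illustrative sample only; not captured from a real node.
sample='12:00:01        CPU     %user     %nice   %system   %iowait    %steal     %idle
12:00:02        all      3.02      0.00      1.51      0.25      0.00     95.22
12:00:03        all      4.98      0.00      2.00      0.50      0.00     92.52
Average:        all      4.00      0.00      1.75      0.38      0.00     93.87'

printf '%s\n' "$sample" | sar_avg_idle   # prints 93.87
```

The same filter can be applied to output saved from the debug pod; the column layout differs between `sar` options, so check the header row before reusing it for reports other than `-u`.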

### Continuous Captures

For slowness that shows up only intermittently, write the samples to a file on the node and copy it out afterwards. The debug pod, and normally every process started in it, ends when its main command exits, so a backgrounded `sar` does not outlive the pod. Run `sar` in the foreground of a detached pod instead, and write through the `/host` mount to a host path that survives the pod:

```bash
NODE=<node-name>
# No -it: kubectl returns once the pod is created and the pod keeps sampling.
# If the write is denied (for example by SELinux), add --profile=sysadmin.
kubectl debug node/$NODE --image=quay.io/praqma/network-multitool:latest \
  -- sh -c '
    mkdir -p /host/var/log/perf-capture
    exec sar -A -o /host/var/log/perf-capture/sar-$(date -u +%Y%m%dT%H%M%SZ).dat 10 360 >/dev/null 2>&1
  '
# 10 s x 360 samples = 1 h; the pod completes when sar exits.
```
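
The `10 360` pair in the capture is an interval in seconds followed by a sample count, so the window is their product (10 × 360 = 3600 s). A small helper, sketched here under the hypothetical name `samples_for_window`, turns a desired window into the count to pass to `sar`:

```bash
# Number of samples needed to cover at least window_s seconds at the
# given interval, rounded up. Both arguments are positive integers.
samples_for_window() {
  interval_s=$1
  window_s=$2
  echo $(( (window_s + interval_s - 1) / interval_s ))
}

samples_for_window 10 3600   # prints 360, the one-hour capture above
samples_for_window 30 7200   # prints 240, two hours at 30 s
```

A longer interval keeps the output file smaller for the same window, at the cost of missing short spikes.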
57+
58+
Collect the artefact once the window closes:
59+
60+
```bash
61+
kubectl debug node/$NODE -it --image=quay.io/praqma/network-multitool:latest \
62+
-- chroot /host ls -lh /var/log/perf-capture/
63+
```

`kubectl cp` cannot read node paths directly. To copy the file out, use `kubectl cp` against a temporary pod that mounts the same host path, or rsync from a DaemonSet that exposes `/var/log/perf-capture`.
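
One way to set up such a temporary pod is to generate its manifest from a shell function and review it before applying. This is a sketch under assumptions: the function name `perf_copy_manifest`, the pod name `perf-copy`, the mount point `/capture`, and the example node and image names are all hypothetical, the image must contain `tar` for `kubectl cp` to work, and a PodSecurity policy that forbids `hostPath` volumes will reject the pod.

```bash
# Print a manifest for a short-lived pod pinned to one node that mounts
# the capture directory read-only. All names here are placeholders.
perf_copy_manifest() {
  node=$1
  image=$2
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: perf-copy
spec:
  nodeName: ${node}
  restartPolicy: Never
  containers:
    - name: copy
      image: ${image}
      command: ["sleep", "3600"]
      volumeMounts:
        - name: capture
          mountPath: /capture
          readOnly: true
  volumes:
    - name: capture
      hostPath:
        path: /var/log/perf-capture
        type: Directory
EOF
}

# Review the generated manifest first.
perf_copy_manifest worker-1 registry.example.com/tools/toolbox:1.0
# Then, once it looks right:
#   perf_copy_manifest "$NODE" "$IMAGE" | kubectl apply -f -
#   kubectl cp perf-copy:/capture ./perf-capture
#   kubectl delete pod perf-copy
```

Mounting the directory read-only keeps the copy pod from altering the capture, and deleting the pod afterwards leaves the files on the node until they are cleaned up separately.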

### Guardrails

- Always bound the sampling run: `sar X N`, where `X` is the interval in seconds and `N` is the number of samples. An unbounded run can pile up gigabytes of samples on a node that is already under pressure.
- Pick the debug image deliberately. Any image you run gains root on the node; stick with a vetted internal registry image when possible.
- Prefer Prometheus-based metrics for long-running diagnosis. `node-exporter` exposes much the same kernel counters that `sar` reports, and Prometheus keeps historical data available for post-incident analysis without needing to return to the node.
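
The first guardrail can be enforced in a wrapper script rather than left to memory. The sketch below uses a hypothetical function, `bounded_sar_args`, with an assumed default cap of 3600 seconds; it refuses to emit `sar` arguments unless both the interval and the count are positive integers and the total window fits under the cap.

```bash
# Echo "<interval> <count>" only if both are positive integers (no
# leading zeros) and interval * count does not exceed max_seconds.
# The 3600 s default cap is an assumption for this sketch.
bounded_sar_args() {
  interval=$1
  count=$2
  max_seconds=${3:-3600}
  for value in "$interval" "$count"; do
    case "$value" in
      ''|0*|*[!0-9]*)
        echo "interval and count must be positive integers" >&2
        return 1 ;;
    esac
  done
  if [ $(( interval * count )) -gt "$max_seconds" ]; then
    echo "capture window exceeds ${max_seconds}s" >&2
    return 1
  fi
  echo "$interval $count"
}

bounded_sar_args 10 360   # prints "10 360"
# Possible use inside the debug pod command:
#   args=$(bounded_sar_args 10 360) && sar -A $args
```

A missing count, which would make `sar` sample indefinitely, fails the check instead of starting a capture.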

## Diagnostic Steps

Confirm that you have permission to start a debug pod and that the node is in a state to run one. `kubectl debug node` creates an ordinary pod in the current namespace and attaches to it, so check both verbs:

```bash
kubectl auth can-i create pods
kubectl auth can-i create pods --subresource=attach
kubectl get node <node> -o yaml | grep -A2 -E 'conditions|taints'
```

Verify the debug image you picked actually carries the tools before you rely on it in an incident:

```bash
kubectl debug node/<node> -it --image=<image> -- sh -c '
  for t in sar iostat mpstat vmstat; do command -v "$t" || echo "missing: $t"; done
  rpm -q sysstat 2>/dev/null || apk info sysstat 2>/dev/null'
```

If `kubectl debug node/` is unavailable (for example, a hardened PodSecurity admission policy blocks the host-namespace pod), use the platform's node-inspection feature under `observability/inspection` instead. That surface already has debug permissions granted and renders the same counters from a browser.
