|
| 1 | +--- |
| 2 | +kind: |
| 3 | + - How To |
| 4 | +products: |
| 5 | + - Alauda Container Platform |
| 6 | +ProductsVersion: |
| 7 | + - 4.1.0,4.2.x |
| 8 | +--- |
| 9 | +## Issue |
| 10 | + |
| 11 | +Troubleshooting a slow node usually starts with `sar`, `iostat`, or `vmstat` to see CPU load, memory pressure, disk throughput, and context-switch rates. On modern container-optimised node OSes these tools are not installed into the host PATH by design — the OS image is intentionally minimal. Running `sar` via SSH returns `command not found`, and operators lose a familiar first-level triage surface. |
| 12 | + |
| 13 | +## Root Cause |
| 14 | + |
| 15 | +ACP nodes typically run an immutable, minimal operating system. These distributions omit optional packages (including `sysstat`, which provides `sar`, `iostat`, and `mpstat`) to reduce the attack surface and keep the image small. Installing the package directly on the host is either impossible (read-only root) or discouraged (the change is reverted at the next node reconcile). |
| 16 | + |
| 17 | +The pragmatic replacement is `kubectl debug node/<node-name>`. It creates a pod on the target node that runs in the host's network, PID, and IPC namespaces, mounts the host's root filesystem at `/host`, and runs a toolbox image of your choosing. Whether the container is also privileged depends on the debug profile in use. Anything that would have been installed on the host can live in that image instead, so the node is never modified outside its declarative configuration. |
| 18 | + |
| 19 | +## Resolution |
| 20 | + |
| 21 | +Run `sysstat` inside a debug container rather than on the node. Two patterns work well: |
| 22 | + |
| 23 | +### One-Off Samples |
| 24 | + |
| 25 | +For an ad-hoc look, run a one-shot debug pod that includes the tools. Any image carrying `sysstat` works; the example below uses a public image that ships the package. |
| 26 | + |
| 27 | +```bash |
| 28 | +NODE=<node-name> |
| 29 | +IMAGE=quay.io/praqma/network-multitool:latest # includes iproute, sysstat, tcpdump |
| 30 | + |
| 31 | +kubectl debug node/$NODE -it --image=$IMAGE -- sh -c ' |
| 32 | + sar -q 1 10; # load average |
| 33 | + sar -r 1 10; # memory |
| 34 | + sar -u 1 10; # CPU total |
| 35 | + sar -P ALL 1 10; # CPU per core |
| 36 | + sar -d 1 10; # block devices |
| 37 | + sar -w 1 10; # context switches |
| 38 | + iostat -xz 1 10 # extended I/O stats |
| 39 | +' |
| 40 | +``` |
| 41 | + |
| 42 | +The commands run from the image rather than under `chroot /host`: a chroot would resolve `sar` against the host's filesystem, where `sysstat` is not installed. No chroot is needed, because the `sysstat` tools read `/proc` and `/sys`, and the debug pod shares the host's PID and network namespaces, so its own `/proc` normally reports the node's counters. Reserve `chroot /host` for binaries that exist on the host, and add `--profile=sysadmin` only when a tool needs a privileged container. |
| 43 | + |
| 44 | +### Continuous Captures |
| 45 | + |
| 46 | +For slowness that only shows up intermittently, record the samples to a file on the node and copy it out afterwards. Start the debug pod without `-it` so that `kubectl` returns immediately and the pod keeps sampling in the foreground for the whole window; a job backgrounded with `nohup ... &` is typically killed as soon as the pod's main command exits. Write through `/host` so the file survives the pod: |
| 47 | + |
| 48 | +```bash |
| 49 | +NODE=<node-name> |
| 50 | +kubectl debug node/$NODE --image=quay.io/praqma/network-multitool:latest \ |
| 51 | + -- sh -c ' |
| 52 | + mkdir -p /host/var/log/perf-capture |
| 53 | + # 10 s interval x 360 samples = 1 h; the pod completes when sar finishes |
| 54 | + exec sar -A -o /host/var/log/perf-capture/sar-$(date -u +%Y%m%dT%H%M%SZ).dat 10 360 >/dev/null 2>&1 |
| 55 | + ' |
| 56 | +``` |
| 57 | + |
| 58 | +Once the window closes, list the capture directory to confirm the file was written: |
| 59 | + |
| 60 | +```bash |
| 61 | +kubectl debug node/$NODE -it --image=quay.io/praqma/network-multitool:latest \ |
| 62 | + -- chroot /host ls -lh /var/log/perf-capture/ |
| 63 | +``` |
| 64 | + |
| 65 | +`kubectl cp` cannot read node paths directly. To copy the file off the node, either start a temporary pod that mounts the same host path and run `kubectl cp` against that pod, or rsync from a DaemonSet that exposes `/var/log/perf-capture`. |
| 66 | + |
| 67 | +### Guardrails |
| 68 | + |
| 69 | +- Always bound the sample count: `sar <interval> <count>`. Without a count, `sar` keeps sampling until it is interrupted, and a capture written with `-o` can grow to gigabytes on a node that is already under pressure. |
| 70 | +- Pick the debug image deliberately. Any image you run gains root on the node; stick with a vetted internal registry image when possible. |
| 71 | +- Prefer Prometheus-based metrics for long-running diagnosis. `node-exporter` exposes most of the kernel counters that `sar` reports, and Prometheus keeps the history available for post-incident analysis without needing to return to the node. |
| 72 | + |
| 73 | +## Diagnostic Steps |
| 74 | + |
| 75 | +Confirm that you are allowed to create and attach to pods (a node debug session is a regular pod, not an ephemeral container), then check the node's conditions and taints: |
| 76 | + |
| 77 | +```bash |
| 78 | +kubectl auth can-i create pods; kubectl auth can-i create pods --subresource=attach |
| 79 | +kubectl get node <node> -o yaml | grep -A2 -E 'conditions|taints' |
| 80 | +``` |
| 81 | + |
| 82 | +Verify the debug image you picked actually carries the tools before you rely on it in an incident: |
| 83 | + |
| 84 | +```bash |
| 85 | +kubectl debug node/<node> -it --image=<image> \ |
| 86 | + -- sh -c 'for t in sar iostat mpstat vmstat; do command -v "$t" || echo "missing: $t"; done; rpm -q sysstat 2>/dev/null || apk info sysstat 2>/dev/null' |
| 87 | +``` |
| 88 | + |
| 89 | +If `kubectl debug node/` is unavailable (e.g., a hardened PodSecurity admission policy blocks the host-namespace pod), use the platform's node-inspection feature under `observability/inspection` instead. That surface already has debug permissions granted and renders the same counters from a browser. |
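
The guardrail about bounding `sar` runs can be made mechanical. The sketch below assumes a hypothetical helper, `sar_count` (not part of `sysstat`), that converts a capture window and a sampling interval, both in seconds, into the count argument for `sar`. It rounds up so that the whole window is covered.

```bash
# sar_count DURATION_SECONDS INTERVAL_SECONDS
# Hypothetical helper (not part of sysstat): prints how many samples are
# needed to cover DURATION_SECONDS at one sample every INTERVAL_SECONDS.
sar_count() {
  for value in "$1" "$2"; do
    case "$value" in
      # Reject empty input, zero or leading zeros, and anything non-numeric.
      ''|0*|*[!0-9]*)
        echo "sar_count: arguments must be positive integers without leading zeros" >&2
        return 1 ;;
    esac
  done
  # Round up so a trailing partial interval is still sampled.
  echo $(( ($1 + $2 - 1) / $2 ))
}

# A one-hour window at a 10-second interval needs 360 samples.
echo "sar 10 $(sar_count 3600 10)"
```

One way to use it is to compute the count on the workstation and substitute it for the literal `360` before the command string is passed to `kubectl debug`, so the run is always bounded by an explicit window.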