Commit fc91411

azure-diagnostics: Add Inspektor Gadget Reference
Signed-off-by: Qasim Sarfraz <qasimsarfraz@microsoft.com>
1 parent 946fa11 commit fc91411

6 files changed

Lines changed: 222 additions & 2 deletions

plugin/skills/azure-diagnostics/aks-troubleshooting/aks-troubleshooting.md

Lines changed: 4 additions & 1 deletion
@@ -20,6 +20,8 @@ Primary AKS troubleshooting guide for incidents routed from [../SKILL.md](../SKI
 
 When gathering AKS diagnostic evidence, prefer `mcp_azure_mcp_aks`, then the smallest discovered AKS-MCP tool that fits the read, then supporting Azure tools such as `mcp_azure_mcp_applens`, `mcp_azure_mcp_monitor`, or `mcp_azure_mcp_resourcehealth`. Use raw `az aks` and `kubectl` only when the AKS-MCP surface cannot perform the needed check.
 
+When standard diagnostics do not reveal root cause, use **Inspektor Gadget** for real-time, low-level node and pod observability (DNS traces, TCP traces, process snapshots, file access traces). See [references/inspektor-gadget.md](references/inspektor-gadget.md) for the gadget catalog, command patterns, and symptom-to-gadget mapping.
+
 See [references/aks-mcp.md](references/aks-mcp.md), [references/structured-input-modes.md](references/structured-input-modes.md), [references/command-flows.md](references/command-flows.md)
 
 ## Required Inputs
@@ -46,7 +48,8 @@ If cluster identity is missing, stop and ask for it.
 
 1. Azure-side state first: cluster state, resource health, recent operations, node pool state, detector or monitoring output.
 2. Kubernetes-side state second: cluster reachability, nodes, `kube-system`, events, affected namespace, pod detail, logs.
-3. Use detector, warning-event, or metrics modes when the incoming data already matches them.
+3. Deep diagnostics third: when steps 1–2 do not reveal root cause, use Inspektor Gadget for real-time tracing and snapshots on the affected node.
+4. Use detector, warning-event, or metrics modes when the incoming data already matches them.
 
 ## Workflow

plugin/skills/azure-diagnostics/aks-troubleshooting/networking.md

Lines changed: 40 additions & 0 deletions
@@ -30,6 +30,32 @@ kubectl run netdebug --image=curlimages/curl -it --rm -n <ns> -- \
 
 Pods that are running but not Ready are removed from Endpoints. Check `kubectl get pod <pod> -n <ns>`.
 
+**Deep diagnostics with Inspektor Gadget** (when the above checks are inconclusive):
+
+```bash
+NODE=$(kubectl get pod <pod-name> -n <ns> -o jsonpath='{.spec.nodeName}')
+
+# Check what ports the pod is actually listening on
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run snapshot_socket:v0.51.0 -o json --timeout 5 --k8s-namespace <ns> --k8s-podname <pod-name>
+
+# Trace TCP connections in real-time to see connect/accept/close events
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run trace_tcp:v0.51.0 -o json --timeout 30 --k8s-namespace <ns>
+# Trace TCP retransmissions and packet drops for deeper network issues
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run trace_tcpdrop:v0.51.0 -o json --timeout 30 --k8s-namespace <ns>
+
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run trace_tcpretrans:v0.51.0 -o json --timeout 30 --k8s-namespace <ns>
+```
+
+See [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
 ---
 
 ## DNS Resolution Failures
@@ -69,6 +95,20 @@ kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
 
 Custom VNet DNS must forward `.cluster.local` to the CoreDNS ClusterIP and other lookups to `168.63.129.16`.
 
+**Deep diagnostics with Inspektor Gadget** (when the above checks are inconclusive):
+
+```bash
+# Resolve node for the affected pod
+NODE=$(kubectl get pod <pod-name> -n <ns> -o jsonpath='{.spec.nodeName}')
+
+# Trace live DNS queries and responses — look for rcode != 0
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run trace_dns:v0.51.0 -o json --timeout 30 --k8s-namespace <ns>
+```
+
+Key signals: `rcode=3` (NXDOMAIN), `rcode=2` (SERVFAIL), high `latency` values, queries going to unexpected destinations. See [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
 ---
 
 ## Load Balancer Stuck in Pending

plugin/skills/azure-diagnostics/aks-troubleshooting/node-issues.md

Lines changed: 12 additions & 1 deletion
@@ -17,7 +17,7 @@ kubectl describe node <node-name>
 | `Ready` | `False` | kubelet stopped reporting | SSH to node; if unrecoverable, consider cordon/drain/delete\* |
 | `MemoryPressure` | `True` | Node running out of memory | Evict pods; scale out pool; reduce pod density |
 | `DiskPressure` | `True` | OS disk or ephemeral storage full | Check logs and images; clean up or increase disk |
-| `PIDPressure` | `True` | Too many processes | App spawning excessive threads/processes |
+| `PIDPressure` | `True` | Too many processes | App spawning excessive threads/processes; use IG `snapshot_process` |
 | `NetworkUnavailable` | `True` | CNI plugin issue | Check CNI pods in kube-system; node network config |
 
 \*Only after explicit user request for remediation and confirmation of workload impact.
@@ -103,6 +103,17 @@ kubectl debug node/<node> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.
 
 Common culprit: high-volume container logs accumulating in `/var/log/containers`.
 
+**Deep diagnostics with Inspektor Gadget** (PID pressure or unknown process load):
+
+```bash
+# List all processes on the node to find the offender
+kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run snapshot_process:v0.51.0 -o json --timeout 5
+```
+
+See [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
 ---
 
 ## Node Image / OS Upgrade Issues

plugin/skills/azure-diagnostics/aks-troubleshooting/pod-failures.md

Lines changed: 25 additions & 0 deletions
@@ -52,6 +52,31 @@ kubectl describe pod <pod-name> -n <namespace> | grep -A2 "Last State"
 
 Fix: increase `resources.limits.memory` or optimize application memory usage. Check `kubectl top pod <pod-name> -n <namespace>` for actual usage.
 
+For real-time OOM kill tracing, use Inspektor Gadget `trace_oomkill` — see [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
+**Deep diagnostics with Inspektor Gadget** (when logs and describe are inconclusive):
+
+```bash
+NODE=$(kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}')
+
+# See what the container actually executes at startup (catch bad entrypoints, unexpected child processes)
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run trace_exec:v0.51.0 -o json --timeout 30 --k8s-namespace <namespace> --k8s-podname <pod-name>
+
+# Trace file opens to find missing configs, secrets, or permission errors (retval -2 = ENOENT, -13 = EACCES)
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run trace_open:v0.51.0 -o json --timeout 30 --k8s-namespace <namespace> --k8s-podname <pod-name>
+
+# List running processes in the pod
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run snapshot_process:v0.51.0 -o json --timeout 5 --k8s-namespace <namespace> --k8s-podname <pod-name>
+```
+
+See [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
 ---
 
 ## ImagePullBackOff

plugin/skills/azure-diagnostics/aks-troubleshooting/references/command-flows.md

Lines changed: 8 additions & 0 deletions
@@ -76,6 +76,14 @@ kubectl get pvc -n <namespace>
 kubectl describe quota -n <namespace>
 ```
 
+## Deep Diagnostics Flow (Inspektor Gadget)
+
+```text
+Standard diagnostics inconclusive -> resolve target node -> select gadget from symptom-to-gadget map -> run IG command with namespace/pod filters -> interpret output -> correlate with prior evidence
+```
+
+Use when steps 1–2 of the evidence order (Azure-side and Kubernetes-side) do not reveal root cause. See [inspektor-gadget.md](inspektor-gadget.md) for the full gadget catalog and command patterns.
+
 ## Safety Boundary
 
 Treat the following as change operations and avoid them unless the user explicitly asks for remediation:

plugin/skills/azure-diagnostics/aks-troubleshooting/references/inspektor-gadget.md

Lines changed: 133 additions & 0 deletions

@@ -0,0 +1,133 @@

# Inspektor Gadget (IG) Reference

Use Inspektor Gadget for real-time, low-level node/pod diagnostics when standard `kubectl` is insufficient.

## Base Command Pattern

```bash
kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
  -- ig run <gadget>:v0.51.0 -o json --timeout <seconds> [filters...]
```

Always set `--timeout` after the `--` separator to cap runtime. Use `--timeout 5` for snapshot/top gadgets and `--timeout 30` for trace/profile gadgets.

**Required:** Resolve the node name first:

```bash
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}'
```
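The base pattern and node lookup above can be wrapped in a small helper for repeated use. A minimal sketch under one assumption: the `ig_cmd` function name is hypothetical, and it only *prints* the `kubectl debug` invocation, so it can be reviewed or logged before anything is run against the cluster (the example node name is made up):

```bash
# Sketch: build (but do not execute) an IG invocation from the base pattern.
IG_IMAGE="mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0"
IG_VERSION="v0.51.0"

# ig_cmd <gadget> <node> <timeout-seconds> [filters...]
# Prints the full kubectl debug command; pipe to `sh` only after review.
ig_cmd() {
  gadget="$1"; node="$2"; timeout="$3"; shift 3   # remaining args are filters
  printf 'kubectl debug --profile=sysadmin node/%s --attach --quiet --image=%s -- ig run %s:%s -o json --timeout %s %s\n' \
    "$node" "$IG_IMAGE" "$gadget" "$IG_VERSION" "$timeout" "$*"
}

# Example: a 30s DNS trace scoped to one namespace (hypothetical node name)
ig_cmd trace_dns aks-nodepool1-12345678-vmss000000 30 --k8s-namespace default
```

Filters pass through unchanged, so the same helper covers `top_process` flag combinations such as `--interval 5s --max-entries 20`.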
## Common Filters

| Filter | Description |
|---|---|
| `--k8s-namespace <ns>` | Scope to a Kubernetes namespace |
| `--k8s-podname <pod>` | Scope to a specific pod |
| `--k8s-containername <ctr>` | Scope to a specific container |
| `--timeout <seconds>` | Cap streaming duration for trace/profile gadgets |
| `--max-entries <n>` | Max entries per batch for top/profile gadgets |
| `--map-fetch-interval <dur>` | Map fetch interval for top (except `top_process`) and profile gadgets (default `1000ms`) |
| `--interval <dur>` | Reporting interval for `top_process` only (e.g. `5s`) |
| `--syscall-filters <list>` | Comma-separated syscalls for `traceloop` (e.g. `open,connect,accept`). **Always specify** to limit data volume |

> **Tip:** For top/profile gadgets, set `--map-fetch-interval` to at most half of `--timeout` to collect at least one batch. E.g. `--timeout 2 --map-fetch-interval 1000ms --max-entries 20`.
>
> **Note:** `top_process` uses `--interval` instead of `--map-fetch-interval`. E.g. `--timeout 10 --interval 5s --max-entries 20`.

## Gadget Catalog

### Networking

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_dns` | trace | Trace DNS queries and responses with latency | DNS failures, NXDOMAIN, SERVFAIL, slow resolution, intermittent DNS |
| `trace_tcp` | trace | Trace TCP connect/accept/close events | Connection refused, timeouts, unexpected drops, mapping pod connectivity |
| `trace_tcpdrop` | trace | Trace kernel TCP packet drops | Silent connection failures, packet loss, buffer overflows |
| `trace_tcpretrans` | trace | Trace TCP retransmissions | Network congestion, lossy links, high latency between pods/services |
| `trace_bind` | trace | Trace socket bind calls | Port conflicts, address-already-in-use errors |
| `trace_sni` | trace | Trace TLS SNI (Server Name Indication) values | HTTPS routing issues, ingress TLS debugging, mTLS problems |
| `snapshot_socket` | snapshot | List open sockets (TCP/UDP/Unix) | Port conflicts, listening ports, connection leaks, ECONNREFUSED |
| `tcpdump` | special | Capture raw packets in pcap-ng format | Deep packet inspection, protocol-level debugging, reproducing network issues |

#### tcpdump gadget

`tcpdump` outputs raw pcap-ng data. Pipe it to `tcpdump` for readable output:

```bash
kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
  -- ig run tcpdump:v0.51.0 -o pcap-ng --k8s-namespace <ns> --k8s-podname <pod> \
  --timeout 30 --filter "port 80" \
  | tcpdump -nvr -
```

Parameters:
- `--filter "<expr>"` — tcpdump filter expression (e.g., `port 80`, `host 10.0.0.1`, `tcp and port 443`)
- `-o pcap-ng` — required output format (not `-o json`)

### Process & Workload

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `snapshot_process` | snapshot | List running processes in pod/node | PID pressure, unknown processes, verifying entrypoint, CrashLoopBackOff |
| `trace_exec` | trace | Trace process execution (execve calls) | CrashLoopBackOff (what actually runs), unexpected child processes, security audit |
| `trace_oomkill` | trace | Trace OOM kill events with victim details | OOMKilled pods — see which process was killed, memory usage at kill time |
| `trace_signal` | trace | Trace signals delivered to processes | Unexpected SIGKILL/SIGTERM, liveness probe kills, graceful shutdown issues |
| `top_process` | top | Rank processes by CPU/memory usage | Identifying resource-hungry processes inside a pod or across a node |
| `profile_cpu` | profile | CPU profiling via stack sampling | High CPU usage investigation, finding hot code paths |
| `traceloop` | trace | Record syscalls as a flight recorder | Catch-all for intermittent issues. **Always use `--syscall-filters`** (e.g., `open,connect,accept`) to limit data volume |
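During PID pressure, `snapshot_process` JSON can be aggregated with standard shell tools to rank process counts per command. A sketch against a hypothetical sample; the `comm` field name and one-object-per-line shape are assumptions, so verify against real gadget output first:

```bash
# Hypothetical snapshot_process sample (field names are illustrative, not
# guaranteed to match real gadget output).
cat <<'EOF' > /tmp/procs.jsonl
{"comm":"worker","pid":4101}
{"comm":"worker","pid":4102}
{"comm":"worker","pid":4103}
{"comm":"nginx","pid":212}
EOF

# Count processes per command name, most frequent first
grep -o '"comm":"[^"]*"' /tmp/procs.jsonl | sort | uniq -c | sort -rn
```

A runaway fork loop shows up as one `comm` dominating the count.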
### File & Storage

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_open` | trace | Trace openat syscalls | Missing config/secret files (ENOENT), permission denied (EACCES), startup failures |
| `trace_fsslower` | trace | Trace slow filesystem operations | Slow disk I/O, PVC performance issues, NFS/Azure Disk latency |
| `top_file` | top | Rank files by read/write activity | Identifying I/O-heavy files, noisy log writers, disk pressure diagnosis |
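The ENOENT/EACCES hints in the `trace_open` row can be surfaced mechanically from the JSON stream. A sketch on a hypothetical event sample; the `retval` field name and event shape are assumptions (a negative return value is `-errno`):

```bash
# Hypothetical trace_open events: retval -2 = ENOENT, -13 = EACCES,
# non-negative = successful open (field names are illustrative).
cat <<'EOF' > /tmp/opens.jsonl
{"comm":"app","fname":"/etc/app/config.yaml","retval":-2}
{"comm":"app","fname":"/var/run/secrets/token","retval":-13}
{"comm":"app","fname":"/etc/hosts","retval":3}
EOF

# Flag only the failing opens and label the errno
while IFS= read -r ev; do
  case "$ev" in
    *'"retval":-2'*)  echo "ENOENT (missing file): $ev" ;;
    *'"retval":-13'*) echo "EACCES (permission denied): $ev" ;;
  esac
done < /tmp/opens.jsonl
```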
### Security & Audit

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_capabilities` | trace | Trace Linux capability checks | Permission denied from dropped capabilities, SecurityContext debugging |

## Symptom-to-Gadget Map

| Symptom | Gadget(s) | Why |
|---|---|---|
| DNS resolution failures | `trace_dns` | See actual queries, responses, rcodes, and latency |
| Connection refused / timeout | `trace_tcp` + `snapshot_socket` | Trace connections and verify listening ports |
| Silent connection drops | `trace_tcpdrop` + `trace_tcpretrans` | Kernel-level packet drops and retransmissions |
| High network latency | `trace_tcpretrans` | Spot retransmissions indicating congestion or lossy links |
| TLS / HTTPS routing issues | `trace_sni` | See which SNI values are sent in TLS handshakes |
| Port already in use | `trace_bind` + `snapshot_socket` | See bind failures and what holds the port |
| CrashLoopBackOff (unknown cause) | `trace_exec` + `trace_open` | See what runs at startup and what files it opens |
| OOMKilled pods | `trace_oomkill` + `top_process` | See OOM victim details and memory-heavy processes |
| Pod killed unexpectedly | `trace_signal` | See which signal killed the process and who sent it |
| PID pressure on node | `snapshot_process` + `top_process` | List and rank all processes |
| "Too many open files" | `top_file` | Find I/O-heavy files contributing to FD pressure |
| Missing config / secret mount | `trace_open` | See ENOENT / EACCES on file opens |
| Slow disk / PVC performance | `trace_fsslower` + `top_file` | Find slow FS operations and I/O-heavy files |
| Permission denied (capabilities) | `trace_capabilities` | See which capability check failed |
| High CPU (unknown cause) | `profile_cpu` + `top_process` | Profile stack traces and rank processes |
| Deep packet inspection | `tcpdump` | Raw pcap capture piped to `tcpdump -nvr -` |
| Catch-all / intermittent issues | `traceloop` | Syscall flight recorder — specify `--syscall-filters` to limit output |
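For the DNS row above, failing responses can be pulled out of a `trace_dns` JSON stream with plain `grep`. A sketch on a hypothetical sample; the field names are illustrative, with a numeric `rcode` where `0` means success:

```bash
# Hypothetical trace_dns sample: rcode 0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN
cat <<'EOF' > /tmp/dns.jsonl
{"name":"api.internal.svc.cluster.local.","rcode":3,"latency":180000}
{"name":"db.example.com.","rcode":0,"latency":950000}
{"name":"cache.example.com.","rcode":2,"latency":2100000}
EOF

# Keep only responses whose rcode is nonzero (resolution failures)
grep -v '"rcode":0' /tmp/dns.jsonl
```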
## Gadget Type Reference

| Type | Behavior | Recommended `--timeout` |
|---|---|---|
| `snapshot` | Point-in-time data, returns immediately | `--timeout 5` |
| `top` | Aggregated view, returns quickly | `--timeout 5` |
| `trace` | Streams events in real-time | `--timeout 30` |
| `profile` | Samples over a duration | `--timeout 30` |
| `tcpdump` | Streams pcap-ng data, pipe to `tcpdump -nvr -` | `--timeout 30` |

## Guardrails

- All IG gadgets are **read-only** — they do not modify cluster or application state.
- Resolve the correct node name before running IG commands.
- Set `--timeout` on every IG command to cap runtime.
- Prefer snapshot/top gadgets for quick checks; use trace/profile to observe behavior over time.
- Check the ephemeral debug pod logs to view the gadget output again if needed.