Commit e5298d0

azure-diagnostics: Add Inspektor Gadget Reference
Signed-off-by: Qasim Sarfraz <qasimsarfraz@microsoft.com>
1 parent 946fa11 commit e5298d0

7 files changed

Lines changed: 179 additions & 2 deletions

plugin/skills/azure-diagnostics/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ description: "Debug Azure production issues on Azure using AppLens, Azure Monito
 license: MIT
 metadata:
   author: Microsoft
-  version: "1.0.4"
+  version: "1.0.5"
 ---

 # Azure Diagnostics

plugin/skills/azure-diagnostics/aks-troubleshooting/aks-troubleshooting.md

Lines changed: 3 additions & 0 deletions
@@ -20,6 +20,8 @@ Primary AKS troubleshooting guide for incidents routed from [../SKILL.md](../SKI

 When gathering AKS diagnostic evidence, prefer `mcp_azure_mcp_aks`, then the smallest discovered AKS-MCP tool that fits the read, then supporting Azure tools such as `mcp_azure_mcp_applens`, `mcp_azure_mcp_monitor`, or `mcp_azure_mcp_resourcehealth`. Use raw `az aks` and `kubectl` only when the AKS-MCP surface cannot perform the needed check.

+When standard diagnostics do not reveal root cause, use **Inspektor Gadget** for real-time, low-level node and pod observability (DNS traces, TCP traces, process snapshots, file access traces). See [references/inspektor-gadget.md](references/inspektor-gadget.md) for the gadget catalog, command patterns, and symptom-to-gadget mapping.
+
 See [references/aks-mcp.md](references/aks-mcp.md), [references/structured-input-modes.md](references/structured-input-modes.md), [references/command-flows.md](references/command-flows.md)

 ## Required Inputs
@@ -47,6 +49,7 @@ If cluster identity is missing, stop and ask for it.
 1. Azure-side state first: cluster state, resource health, recent operations, node pool state, detector or monitoring output.
 2. Kubernetes-side state second: cluster reachability, nodes, `kube-system`, events, affected namespace, pod detail, logs.
 3. Use detector, warning-event, or metrics modes when the incoming data already matches them.
+4. Deep diagnostics: when steps 1–3 do not reveal root cause, use Inspektor Gadget for real-time tracing and snapshots on the affected node.

 ## Workflow

plugin/skills/azure-diagnostics/aks-troubleshooting/networking.md

Lines changed: 16 additions & 0 deletions
@@ -30,6 +30,16 @@ kubectl run netdebug --image=curlimages/curl -it --rm -n <ns> -- \

 Pods that are running but not Ready are removed from Endpoints. Check `kubectl get pod <pod> -n <ns>`.

+**Deep diagnostics with Inspektor Gadget** (when the above checks are inconclusive):
+
+Use the [IG base command pattern](references/inspektor-gadget.md) with `--k8s-namespace <ns> --k8s-podname <pod-name>` and these gadgets:
+
+- `snapshot_socket` (timeout 5): check what ports the pod is listening on
+- `trace_tcp` (timeout 30): trace connect/accept/close events
+- `trace_tcpdrop` + `trace_tcpretrans` (timeout 30): packet drops and retransmissions
+
+See [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
 ---

 ## DNS Resolution Failures
@@ -69,6 +79,12 @@ kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'

 Custom VNet DNS must forward `.cluster.local` to the CoreDNS ClusterIP and other lookups to `168.63.129.16`.

+**Deep diagnostics with Inspektor Gadget** (when the above checks are inconclusive):
+
+Use the [IG base command pattern](references/inspektor-gadget.md) with `--k8s-namespace <ns> --k8s-podname <pod-name>` and `trace_dns` (timeout 30). Key signals: `rcode=3` (NXDOMAIN), `rcode=2` (SERVFAIL), high `latency` values, queries going to unexpected destinations.
+
+See [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
 ---

 ## Load Balancer Stuck in Pending
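
The `trace_dns` signals named in the DNS hunk above (`rcode=3`, `rcode=2`) can be decoded with a small lookup. This is a minimal sketch, assuming the numeric rcode has already been extracted from the gadget output; the JSON field layout is not shown in the commit and is not assumed here. The function name `rcode_name` is hypothetical, and the `0` and `5` entries come from the standard DNS rcode table rather than from the commit.

```shell
# Hypothetical helper (not part of the commit): map a numeric DNS rcode
# to its standard name. Only 2 and 3 are called out in networking.md.
rcode_name() {
  case "$1" in
    0) echo "NOERROR" ;;
    2) echo "SERVFAIL" ;;
    3) echo "NXDOMAIN" ;;
    5) echo "REFUSED" ;;
    *) echo "rcode $1 (not decoded by this sketch)" ;;
  esac
}

rcode_name 3
rcode_name 2
```

A steady stream of `NXDOMAIN` for names that should exist usually points at search-domain or forwarding configuration, while `SERVFAIL` points at the upstream resolver.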

plugin/skills/azure-diagnostics/aks-troubleshooting/node-issues.md

Lines changed: 5 additions & 1 deletion
@@ -17,7 +17,7 @@ kubectl describe node <node-name>
 | `Ready` | `False` | kubelet stopped reporting | SSH to node; if unrecoverable, consider cordon/drain/delete\* |
 | `MemoryPressure` | `True` | Node running out of memory | Evict pods; scale out pool; reduce pod density |
 | `DiskPressure` | `True` | OS disk or ephemeral storage full | Check logs and images; clean up or increase disk |
-| `PIDPressure` | `True` | Too many processes | App spawning excessive threads/processes |
+| `PIDPressure` | `True` | Too many processes | App spawning excessive threads/processes; use IG `snapshot_process` |
 | `NetworkUnavailable` | `True` | CNI plugin issue | Check CNI pods in kube-system; node network config |

 \*Only after explicit user request for remediation and confirmation of workload impact.
@@ -103,6 +103,10 @@ kubectl debug node/<node> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.

 Common culprit: high-volume container logs accumulating in `/var/log/containers`.

+**Deep diagnostics with Inspektor Gadget** (PID pressure or unknown process load):
+
+Use `snapshot_process` (timeout 5) to list all processes on the node. For node-wide scope, omit pod filters. See [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
 ---

 ## Node Image / OS Upgrade Issues

plugin/skills/azure-diagnostics/aks-troubleshooting/pod-failures.md

Lines changed: 12 additions & 0 deletions
@@ -52,6 +52,18 @@ kubectl describe pod <pod-name> -n <namespace> | grep -A2 "Last State"

 Fix: increase `resources.limits.memory` or optimize application memory usage. Check `kubectl top pod <pod-name> -n <namespace>` for actual usage.

+**OOM kill tracing with Inspektor Gadget:** Use `trace_oomkill` (timeout 30) with `--k8s-namespace <namespace> --k8s-podname <pod-name>` to see which process was killed and memory at kill time. See [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
+**Deep diagnostics with Inspektor Gadget** (when logs and describe are inconclusive):
+
+Use the [IG base command pattern](references/inspektor-gadget.md) with `--k8s-namespace <namespace> --k8s-podname <pod-name>` and these gadgets:
+
+- `trace_exec` (timeout 30): see what the container executes at startup
+- `trace_open` (timeout 30): find missing configs/secrets (retval -2 = ENOENT, -13 = EACCES)
+- `snapshot_process` (timeout 5): list running processes in the pod
+
+See [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
 ---

 ## ImagePullBackOff
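
The `trace_open` return values quoted in the pod-failures hunk above (`-2` and `-13`) follow the usual kernel convention of negative errno on failure and a non-negative file descriptor on success. A minimal sketch of that interpretation, assuming the numeric `retval` has already been pulled out of the gadget output; `retval_name` is a hypothetical name, not something the commit defines.

```shell
# Hypothetical helper (not part of the commit): interpret an open/openat
# return value as reported by trace_open.
retval_name() {
  case "$1" in
    -2)  echo "ENOENT (no such file or directory)" ;;
    -13) echo "EACCES (permission denied)" ;;
    -*)  echo "error $1 (other errno)" ;;
    *)   echo "ok (file descriptor $1)" ;;
  esac
}

retval_name -2
retval_name -13
retval_name 3
```

`ENOENT` on a config or secret path suggests a missing mount, while `EACCES` suggests file ownership or `securityContext` settings.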

plugin/skills/azure-diagnostics/aks-troubleshooting/references/command-flows.md

Lines changed: 8 additions & 0 deletions
@@ -76,6 +76,14 @@ kubectl get pvc -n <namespace>
 kubectl describe quota -n <namespace>
 ```

+## Deep Diagnostics Flow (Inspektor Gadget)
+
+```text
+Standard diagnostics inconclusive -> resolve target node -> select gadget from symptom-to-gadget map -> run IG command with namespace/pod filters -> interpret output -> correlate with prior evidence
+```
+
+Use when steps 1–3 of the evidence order (Azure-side, Kubernetes-side, and detector evidence) do not reveal root cause. See [inspektor-gadget.md](inspektor-gadget.md) for the full gadget catalog and command patterns.
+
 ## Safety Boundary

 Treat the following as change operations and avoid them unless the user explicitly asks for remediation:

plugin/skills/azure-diagnostics/aks-troubleshooting/references/inspektor-gadget.md

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
+# Inspektor Gadget (IG) Reference
+
+Use Inspektor Gadget for real-time, low-level node/pod diagnostics when `kubectl` is insufficient.
+
+## IG Version
+
+`<ig-version>` = `v0.51.0`. Substitute this exact tag (with `v` prefix) wherever `<ig-version>` appears. Bump this line only.
+
+## Base Command Pattern
+
+```bash
+kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:<ig-version> \
+  -- ig run <gadget>:<ig-version> -o json --timeout <seconds> [filters...]
+```
+
+Always set `--timeout` after `--` to cap runtime. Use `--timeout 5` for snapshot/top, `--timeout 30` for trace/profile.
+
+> **Note:** IG uses `kubectl debug --profile=sysadmin` (privileged debug pod). Only run with explicit user approval and appropriate RBAC.
+
+**Required:** Resolve the node name first:
+
+```bash
+kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}'
+```
+
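
To make the base command pattern above concrete, here is a minimal sketch of a helper that only prints the assembled command, so it can be reviewed and approved before anyone runs a privileged debug pod. The function name `ig_cmd`, the variable `IG_VERSION`, and the node, namespace, and pod names in the example call are all hypothetical; the flags themselves are copied from the pattern in this file.

```shell
# Hypothetical dry-run helper (not part of the commit). It prints the
# kubectl debug command from the base pattern instead of executing it.
IG_VERSION="v0.51.0"   # mirrors the "IG Version" section

ig_cmd() {
  local node="$1" gadget="$2" timeout="$3"
  shift 3
  printf 'kubectl debug --profile=sysadmin node/%s --attach --quiet' "$node"
  printf ' --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:%s' "$IG_VERSION"
  printf ' -- ig run %s:%s -o json --timeout %s' "$gadget" "$IG_VERSION" "$timeout"
  # Any remaining arguments are passed through as gadget filters.
  if [ "$#" -gt 0 ]; then
    printf ' %s' "$@"
  fi
  printf '\n'
}

# Resolve the node first (as required above), for example with:
#   kubectl get pod web-0 -n demo -o jsonpath='{.spec.nodeName}'
# The node name below is a made-up placeholder for that result.
ig_cmd aks-nodepool1-000000 trace_dns 30 --k8s-namespace demo --k8s-podname web-0
```

Printing rather than executing keeps the approval step explicit, which matches the note about privileged access.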
+## Common Filters
+
+| Filter | Description |
+|---|---|
+| `--k8s-namespace <ns>` | Scope to a Kubernetes namespace |
+| `--k8s-podname <pod>` | Scope to a specific pod |
+| `--k8s-containername <ctr>` | Scope to a specific container |
+| `--timeout <seconds>` | Cap streaming duration for trace/profile gadgets |
+| `--max-entries <n>` | Max entries per batch for top/profile gadgets |
+| `--map-fetch-interval <dur>` | Map fetch interval for top (except `top_process`) and profile gadgets (default `1000ms`) |
+| `--interval <dur>` | Reporting interval for `top_process` only (e.g. `5s`) |
+| `--syscall-filters <list>` | Comma-separated syscalls for `traceloop` (e.g. `open,connect,accept`). **Always specify** to limit data volume |
+
+> **Tip:** For top/profile, set `--map-fetch-interval` ≤ half of `--timeout` to collect at least one batch. E.g. `--timeout 2 --map-fetch-interval 1000ms --max-entries 20`.
+>
+> **Note:** `top_process` uses `--interval` instead of `--map-fetch-interval`. E.g. `--timeout 10 --interval 5s --max-entries 20`.
+
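
The tip above (fetch interval at most half of the timeout) is simple arithmetic that can be checked before a command is assembled. A minimal sketch, assuming the timeout is given in whole seconds and the interval in milliseconds; `interval_ok` is a hypothetical name.

```shell
# Hypothetical check (not part of the commit): succeeds when the interval
# is at most half of the timeout, so at least one batch is collected.
# Arguments: timeout in seconds, interval in milliseconds.
interval_ok() {
  local timeout_s="$1" interval_ms="$2"
  [ $((interval_ms * 2)) -le $((timeout_s * 1000)) ]
}

interval_ok 2 1000 && echo "ok: --timeout 2 --map-fetch-interval 1000ms"
interval_ok 2 1500 || echo "too long: --timeout 2 --map-fetch-interval 1500ms"
interval_ok 10 5000 && echo "ok: --timeout 10 --interval 5s"
```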
+## Gadget Catalog
+
+### Networking
+
+| Gadget | Type | What It Does | When To Use |
+|---|---|---|---|
+| `trace_dns` | trace | Trace DNS queries and responses with latency | DNS failures, NXDOMAIN, SERVFAIL, slow resolution, intermittent DNS |
+| `trace_tcp` | trace | Trace TCP connect/accept/close events | Connection refused, timeouts, unexpected drops, mapping pod connectivity |
+| `trace_tcpretrans` | trace | Trace TCP retransmissions | Network congestion, lossy links, high latency between pods/services |
+| `trace_bind` | trace | Trace socket bind calls | Port conflicts, address-already-in-use errors |
+| `trace_sni` | trace | Trace TLS SNI (Server Name Indication) values | HTTPS routing issues, ingress TLS debugging, mTLS problems |
+| `snapshot_socket` | snapshot | List open sockets (TCP/UDP/Unix) | Port conflicts, listening ports, connection leaks, ECONNREFUSED |
+| `tcpdump` | special | Capture raw packets in pcap-ng format | Deep packet inspection, protocol-level debugging, reproducing network issues |
+
+#### tcpdump gadget
+
+Outputs raw pcap-ng data. Pipe to `tcpdump` for readable output:
+
+```bash
+kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:<ig-version> \
+  -- ig run tcpdump:<ig-version> -o pcap-ng --k8s-namespace <ns> --k8s-podname <pod> \
+  --timeout 30 --filter "port 80" \
+  | tcpdump -nvr -
+```
+
+Use `--filter "<expr>"` for tcpdump filters (e.g., `port 80`, `host 10.0.0.1`). Output must be `-o pcap-ng` (not `-o json`).
+
+### Process & Workload
+
+| Gadget | Type | What It Does | When To Use |
+|---|---|---|---|
+| `snapshot_process` | snapshot | List running processes in pod/node | PID pressure, unknown processes, verifying entrypoint, CrashLoopBackOff |
+| `trace_exec` | trace | Trace process execution (execve calls) | CrashLoopBackOff (what actually runs), unexpected child processes, security audit |
+| `trace_oomkill` | trace | Trace OOM kill events with victim details | OOMKilled pods: see which process was killed, memory usage at kill time |
+| `trace_signal` | trace | Trace signals delivered to processes | Unexpected SIGKILL/SIGTERM, liveness probe kills, graceful shutdown issues |
+| `top_process` | top | Rank processes by CPU/memory usage | Identifying resource-hungry processes inside a pod or across a node |
+| `profile_cpu` | profile | CPU profiling via stack sampling | High CPU usage investigation, finding hot code paths |
+| `traceloop` | trace | Record syscalls as a flight recorder | Catch-all for intermittent issues. **Always use `--syscall-filters`** (e.g., `open,connect,accept`) to limit data volume |
+
+### File & Storage
+
+| Gadget | Type | What It Does | When To Use |
+|---|---|---|---|
+| `trace_open` | trace | Trace openat syscall | Missing config/secret files (ENOENT), permission denied (EACCES), startup failures |
+| `trace_fsslower` | trace | Trace slow filesystem operations | Slow disk I/O, PVC performance issues, NFS/Azure Disk latency |
+| `top_file` | top | Rank files by read/write activity | Identifying I/O-heavy files, noisy log writers, disk pressure diagnosis |
+
+### Security & Audit
+
+| Gadget | Type | What It Does | When To Use |
+|---|---|---|---|
+| `trace_capabilities` | trace | Trace Linux capability checks | Permission denied from dropped capabilities, SecurityContext debugging |
+
+## Symptom-to-Gadget Map
+
+| Symptom | Gadget(s) |
+|---|---|
+| DNS resolution failures | `trace_dns` |
+| Connection refused / timeout | `trace_tcp` + `snapshot_socket` |
+| Silent connection drops | `trace_tcpretrans` |
+| High network latency | `trace_tcpretrans` |
+| TLS / HTTPS routing issues | `trace_sni` |
+| Port already in use | `trace_bind` + `snapshot_socket` |
+| CrashLoopBackOff (unknown cause) | `trace_exec` + `trace_open` |
+| OOMKilled pods | `trace_oomkill` + `top_process` |
+| Pod killed unexpectedly | `trace_signal` |
+| PID pressure on node | `snapshot_process` + `top_process` |
+| "Too many open files" | `top_file` |
+| Missing config / secret mount | `trace_open` |
+| Slow disk / PVC performance | `trace_fsslower` + `top_file` |
+| Permission denied (capabilities) | `trace_capabilities` |
+| High CPU (unknown cause) | `profile_cpu` + `top_process` |
+| Deep packet inspection | `tcpdump` |
+| Catch-all / intermittent issues | `traceloop` (use `--syscall-filters`) |
+
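
The symptom-to-gadget map above is effectively a lookup table, so it can be sketched as a `case` statement. The short symptom keys (`dns`, `conn-refused`, and so on) and the function name `gadgets_for` are hypothetical labels chosen for this illustration; only the gadget names come from the table, and only a subset of rows is shown.

```shell
# Hypothetical lookup (not part of the commit): print the gadgets the
# table suggests for a symptom key. Keys are made up for this sketch.
gadgets_for() {
  case "$1" in
    dns)           echo "trace_dns" ;;
    conn-refused)  echo "trace_tcp snapshot_socket" ;;
    port-in-use)   echo "trace_bind snapshot_socket" ;;
    crashloop)     echo "trace_exec trace_open" ;;
    oomkilled)     echo "trace_oomkill top_process" ;;
    pid-pressure)  echo "snapshot_process top_process" ;;
    slow-disk)     echo "trace_fsslower top_file" ;;
    high-cpu)      echo "profile_cpu top_process" ;;
    *)             echo "unknown symptom key: $1" >&2; return 1 ;;
  esac
}

for g in $(gadgets_for crashloop); do
  echo "run next: $g"
done
```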
+## Gadget Type Reference
+
+| Type | Behavior | IG --timeout |
+|---|---|---|
+| `snapshot` | Point-in-time data, returns immediately | `--timeout 5` |
+| `top` | Aggregated view, returns quickly | `--timeout 5` |
+| `trace` | Streams events in real-time | `--timeout 30` |
+| `profile` | Samples over a duration | `--timeout 30` |
+| `tcpdump` | Streams pcap-ng data, pipe to `tcpdump -nvr -` | `--timeout 30` |
+
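
Because gadget names in the catalog carry their type as a prefix, the timeouts in the table above can be derived from the name alone. A minimal sketch under that assumption; `default_timeout` is a hypothetical name, and `traceloop` is treated as a trace gadget as the catalog lists it.

```shell
# Hypothetical helper (not part of the commit): pick the --timeout value
# from the gadget type table based on the gadget name.
default_timeout() {
  case "$1" in
    snapshot_*|top_*)                    echo 5 ;;
    trace_*|traceloop|profile_*|tcpdump) echo 30 ;;
    *) echo "unknown gadget: $1" >&2; return 1 ;;
  esac
}

for gadget in snapshot_socket top_file trace_dns profile_cpu tcpdump; do
  echo "$gadget --timeout $(default_timeout "$gadget")"
done
```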
+## Guardrails
+
+- IG gadgets are **read-only**: they do not modify cluster or application state.
+- Resolve the correct node name before running any IG command.
+- Always set `--timeout` to cap runtime. Prefer snapshot/top for quick checks; trace/profile for behavior over time.
