Commit fc91411

azure-diagnostics: Add Inspektor Gadget Reference
Signed-off-by: Qasim Sarfraz <qasimsarfraz@microsoft.com>
1 parent 946fa11 commit fc91411

6 files changed

Lines changed: 222 additions & 2 deletions

plugin/skills/azure-diagnostics/aks-troubleshooting/aks-troubleshooting.md

Lines changed: 4 additions & 1 deletion
@@ -20,6 +20,8 @@ Primary AKS troubleshooting guide for incidents routed from [../SKILL.md](../SKI
 
 When gathering AKS diagnostic evidence, prefer `mcp_azure_mcp_aks`, then the smallest discovered AKS-MCP tool that fits the read, then supporting Azure tools such as `mcp_azure_mcp_applens`, `mcp_azure_mcp_monitor`, or `mcp_azure_mcp_resourcehealth`. Use raw `az aks` and `kubectl` only when the AKS-MCP surface cannot perform the needed check.
 
+When standard diagnostics do not reveal root cause, use **Inspektor Gadget** for real-time, low-level node and pod observability (DNS traces, TCP traces, process snapshots, file access traces). See [references/inspektor-gadget.md](references/inspektor-gadget.md) for the gadget catalog, command patterns, and symptom-to-gadget mapping.
+
 See [references/aks-mcp.md](references/aks-mcp.md), [references/structured-input-modes.md](references/structured-input-modes.md), [references/command-flows.md](references/command-flows.md)
 
 ## Required Inputs
@@ -46,7 +48,8 @@ If cluster identity is missing, stop and ask for it.
 
 1. Azure-side state first: cluster state, resource health, recent operations, node pool state, detector or monitoring output.
 2. Kubernetes-side state second: cluster reachability, nodes, `kube-system`, events, affected namespace, pod detail, logs.
-3. Use detector, warning-event, or metrics modes when the incoming data already matches them.
+3. Deep diagnostics third: when steps 1–2 do not reveal root cause, use Inspektor Gadget for real-time tracing and snapshots on the affected node.
+4. Use detector, warning-event, or metrics modes when the incoming data already matches them.
 
 ## Workflow

plugin/skills/azure-diagnostics/aks-troubleshooting/networking.md

Lines changed: 40 additions & 0 deletions
@@ -30,6 +30,32 @@ kubectl run netdebug --image=curlimages/curl -it --rm -n <ns> -- \
 
 Pods that are running but not Ready are removed from Endpoints. Check `kubectl get pod <pod> -n <ns>`.
 
+**Deep diagnostics with Inspektor Gadget** (when the above checks are inconclusive):
+
+```bash
+NODE=$(kubectl get pod <pod-name> -n <ns> -o jsonpath='{.spec.nodeName}')
+
+# Check what ports the pod is actually listening on
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run snapshot_socket:v0.51.0 -o json --timeout 5 --k8s-namespace <ns> --k8s-podname <pod-name>
+
+# Trace TCP connections in real-time to see connect/accept/close events
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run trace_tcp:v0.51.0 -o json --timeout 30 --k8s-namespace <ns>
+# Trace TCP retransmissions and packet drops for deeper network issues
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run trace_tcpdrop:v0.51.0 -o json --timeout 30 --k8s-namespace <ns>
+
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run trace_tcpretrans:v0.51.0 -o json --timeout 30 --k8s-namespace <ns>
+```
+
+See [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
 ---
 
 ## DNS Resolution Failures
@@ -69,6 +95,20 @@ kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}'
 
 Custom VNet DNS must forward `.cluster.local` to the CoreDNS ClusterIP and other lookups to `168.63.129.16`.
 
+**Deep diagnostics with Inspektor Gadget** (when the above checks are inconclusive):
+
+```bash
+# Resolve node for the affected pod
+NODE=$(kubectl get pod <pod-name> -n <ns> -o jsonpath='{.spec.nodeName}')
+
+# Trace live DNS queries and responses — look for rcode != 0
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run trace_dns:v0.51.0 -o json --timeout 30 --k8s-namespace <ns>
+```
+
+Key signals: `rcode=3` (NXDOMAIN), `rcode=2` (SERVFAIL), high `latency` values, queries going to unexpected destinations. See [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
 ---
 
 ## Load Balancer Stuck in Pending

plugin/skills/azure-diagnostics/aks-troubleshooting/node-issues.md

Lines changed: 12 additions & 1 deletion
@@ -17,7 +17,7 @@ kubectl describe node <node-name>
 | `Ready` | `False` | kubelet stopped reporting | SSH to node; if unrecoverable, consider cordon/drain/delete\* |
 | `MemoryPressure` | `True` | Node running out of memory | Evict pods; scale out pool; reduce pod density |
 | `DiskPressure` | `True` | OS disk or ephemeral storage full | Check logs and images; clean up or increase disk |
-| `PIDPressure` | `True` | Too many processes | App spawning excessive threads/processes |
+| `PIDPressure` | `True` | Too many processes | App spawning excessive threads/processes; use IG `snapshot_process` |
 | `NetworkUnavailable` | `True` | CNI plugin issue | Check CNI pods in kube-system; node network config |
 
 \*Only after explicit user request for remediation and confirmation of workload impact.
@@ -103,6 +103,17 @@ kubectl debug node/<node> -it --image=mcr.microsoft.com/cbl-mariner/base/core:2.
 
 Common culprit: high-volume container logs accumulating in `/var/log/containers`.
 
+**Deep diagnostics with Inspektor Gadget** (PID pressure or unknown process load):
+
+```bash
+# List all processes on the node to find the offender
+kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run snapshot_process:v0.51.0 -o json --timeout 5
+```
+
+See [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
 ---
 
 ## Node Image / OS Upgrade Issues

plugin/skills/azure-diagnostics/aks-troubleshooting/pod-failures.md

Lines changed: 25 additions & 0 deletions
@@ -52,6 +52,31 @@ kubectl describe pod <pod-name> -n <namespace> | grep -A2 "Last State"
 
 Fix: increase `resources.limits.memory` or optimize application memory usage. Check `kubectl top pod <pod-name> -n <namespace>` for actual usage.
 
+For real-time OOM kill tracing, use Inspektor Gadget `trace_oomkill` — see [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
+**Deep diagnostics with Inspektor Gadget** (when logs and describe are inconclusive):
+
+```bash
+NODE=$(kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}')
+
+# See what the container actually executes at startup (catch bad entrypoints, unexpected child processes)
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run trace_exec:v0.51.0 -o json --timeout 30 --k8s-namespace <namespace> --k8s-podname <pod-name>
+
+# Trace file opens to find missing configs, secrets, or permission errors (retval -2 = ENOENT, -13 = EACCES)
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run trace_open:v0.51.0 -o json --timeout 30 --k8s-namespace <namespace> --k8s-podname <pod-name>
+
+# List running processes in the pod
+kubectl debug --profile=sysadmin node/$NODE --attach --quiet \
+  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
+  -- ig run snapshot_process:v0.51.0 -o json --timeout 5 --k8s-namespace <namespace> --k8s-podname <pod-name>
+```
+
+See [references/inspektor-gadget.md](references/inspektor-gadget.md).
+
 ---
 
 ## ImagePullBackOff

plugin/skills/azure-diagnostics/aks-troubleshooting/references/command-flows.md

Lines changed: 8 additions & 0 deletions
@@ -76,6 +76,14 @@ kubectl get pvc -n <namespace>
 kubectl describe quota -n <namespace>
 ```
 
+## Deep Diagnostics Flow (Inspektor Gadget)
+
+```text
+Standard diagnostics inconclusive -> resolve target node -> select gadget from symptom-to-gadget map -> run IG command with namespace/pod filters -> interpret output -> correlate with prior evidence
+```
+
+Use when steps 1–2 of the evidence order (Azure-side and Kubernetes-side) do not reveal root cause. See [inspektor-gadget.md](inspektor-gadget.md) for the full gadget catalog and command patterns.
+
 ## Safety Boundary
 
 Treat the following as change operations and avoid them unless the user explicitly asks for remediation:

plugin/skills/azure-diagnostics/aks-troubleshooting/references/inspektor-gadget.md

Lines changed: 133 additions & 0 deletions

@@ -0,0 +1,133 @@

# Inspektor Gadget (IG) Reference

Use Inspektor Gadget for real-time, low-level node/pod diagnostics when standard `kubectl` is insufficient.

## Base Command Pattern

```bash
kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
  -- ig run <gadget>:v0.51.0 -o json --timeout <seconds> [filters...]
```

Always set `--timeout` after the `--` separator to cap runtime. Use `--timeout 5` for snapshot/top gadgets and `--timeout 30` for trace/profile gadgets.

**Required:** Resolve the node name first:

```bash
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}'
```
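The base pattern and node lookup above can be wrapped in a small helper for repeated use. A minimal sketch under one assumption: the `ig_cmd` function name is hypothetical, and it only *prints* the `kubectl debug` invocation, so it can be reviewed or logged before anything is run against the cluster (the example node name is made up):

```bash
# Sketch: build (but do not execute) an IG invocation from the base pattern.
IG_IMAGE="mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0"
IG_VERSION="v0.51.0"

# ig_cmd <gadget> <node> <timeout-seconds> [filters...]
# Prints the full kubectl debug command; pipe to `sh` only after review.
ig_cmd() {
  gadget="$1"; node="$2"; timeout="$3"; shift 3   # remaining args are filters
  printf 'kubectl debug --profile=sysadmin node/%s --attach --quiet --image=%s -- ig run %s:%s -o json --timeout %s %s\n' \
    "$node" "$IG_IMAGE" "$gadget" "$IG_VERSION" "$timeout" "$*"
}

# Example: a 30s DNS trace scoped to one namespace (hypothetical node name)
ig_cmd trace_dns aks-nodepool1-12345678-vmss000000 30 --k8s-namespace default
```

Filters pass through unchanged, so the same helper covers `top_process` flag combinations such as `--interval 5s --max-entries 20`.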
## Common Filters

| Filter | Description |
|---|---|
| `--k8s-namespace <ns>` | Scope to a Kubernetes namespace |
| `--k8s-podname <pod>` | Scope to a specific pod |
| `--k8s-containername <ctr>` | Scope to a specific container |
| `--timeout <seconds>` | Cap streaming duration for trace/profile gadgets |
| `--max-entries <n>` | Max entries per batch for top/profile gadgets |
| `--map-fetch-interval <dur>` | Map fetch interval for top (except `top_process`) and profile gadgets (default `1000ms`) |
| `--interval <dur>` | Reporting interval for `top_process` only (e.g. `5s`) |
| `--syscall-filters <list>` | Comma-separated syscalls for `traceloop` (e.g. `open,connect,accept`). **Always specify** to limit data volume |

> **Tip:** For top/profile gadgets, set `--map-fetch-interval` to at most half of `--timeout` to collect at least one batch. E.g. `--timeout 2 --map-fetch-interval 1000ms --max-entries 20`.
>
> **Note:** `top_process` uses `--interval` instead of `--map-fetch-interval`. E.g. `--timeout 10 --interval 5s --max-entries 20`.

## Gadget Catalog

### Networking

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_dns` | trace | Trace DNS queries and responses with latency | DNS failures, NXDOMAIN, SERVFAIL, slow resolution, intermittent DNS |
| `trace_tcp` | trace | Trace TCP connect/accept/close events | Connection refused, timeouts, unexpected drops, mapping pod connectivity |
| `trace_tcpdrop` | trace | Trace kernel TCP packet drops | Silent connection failures, packet loss, buffer overflows |
| `trace_tcpretrans` | trace | Trace TCP retransmissions | Network congestion, lossy links, high latency between pods/services |
| `trace_bind` | trace | Trace socket bind calls | Port conflicts, address-already-in-use errors |
| `trace_sni` | trace | Trace TLS SNI (Server Name Indication) values | HTTPS routing issues, ingress TLS debugging, mTLS problems |
| `snapshot_socket` | snapshot | List open sockets (TCP/UDP/Unix) | Port conflicts, listening ports, connection leaks, ECONNREFUSED |
| `tcpdump` | special | Capture raw packets in pcap-ng format | Deep packet inspection, protocol-level debugging, reproducing network issues |

#### tcpdump gadget

`tcpdump` outputs raw pcap-ng data. Pipe it to `tcpdump` for readable output:

```bash
kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \
  --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
  -- ig run tcpdump:v0.51.0 -o pcap-ng --k8s-namespace <ns> --k8s-podname <pod> \
  --timeout 30 --filter "port 80" \
  | tcpdump -nvr -
```

Parameters:
- `--filter "<expr>"` — tcpdump filter expression (e.g., `port 80`, `host 10.0.0.1`, `tcp and port 443`)
- `-o pcap-ng` — required output format (not `-o json`)

### Process & Workload

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `snapshot_process` | snapshot | List running processes in pod/node | PID pressure, unknown processes, verifying entrypoint, CrashLoopBackOff |
| `trace_exec` | trace | Trace process execution (execve calls) | CrashLoopBackOff (what actually runs), unexpected child processes, security audit |
| `trace_oomkill` | trace | Trace OOM kill events with victim details | OOMKilled pods — see which process was killed, memory usage at kill time |
| `trace_signal` | trace | Trace signals delivered to processes | Unexpected SIGKILL/SIGTERM, liveness probe kills, graceful shutdown issues |
| `top_process` | top | Rank processes by CPU/memory usage | Identifying resource-hungry processes inside a pod or across a node |
| `profile_cpu` | profile | CPU profiling via stack sampling | High CPU usage investigation, finding hot code paths |
| `traceloop` | trace | Record syscalls as a flight recorder | Catch-all for intermittent issues. **Always use `--syscall-filters`** (e.g., `open,connect,accept`) to limit data volume |
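During PID pressure, `snapshot_process` JSON can be aggregated with standard shell tools to rank process counts per command. A sketch against a hypothetical sample; the `comm` field name and one-object-per-line shape are assumptions, so verify against real gadget output first:

```bash
# Hypothetical snapshot_process sample (field names are illustrative, not
# guaranteed to match real gadget output).
cat <<'EOF' > /tmp/procs.jsonl
{"comm":"worker","pid":4101}
{"comm":"worker","pid":4102}
{"comm":"worker","pid":4103}
{"comm":"nginx","pid":212}
EOF

# Count processes per command name, most frequent first
grep -o '"comm":"[^"]*"' /tmp/procs.jsonl | sort | uniq -c | sort -rn
```

A runaway fork loop shows up as one `comm` dominating the count.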
### File & Storage

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_open` | trace | Trace openat syscalls | Missing config/secret files (ENOENT), permission denied (EACCES), startup failures |
| `trace_fsslower` | trace | Trace slow filesystem operations | Slow disk I/O, PVC performance issues, NFS/Azure Disk latency |
| `top_file` | top | Rank files by read/write activity | Identifying I/O-heavy files, noisy log writers, disk pressure diagnosis |
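The ENOENT/EACCES hints in the `trace_open` row can be surfaced mechanically from the JSON stream. A sketch on a hypothetical event sample; the `retval` field name and event shape are assumptions (a negative return value is `-errno`):

```bash
# Hypothetical trace_open events: retval -2 = ENOENT, -13 = EACCES,
# non-negative = successful open (field names are illustrative).
cat <<'EOF' > /tmp/opens.jsonl
{"comm":"app","fname":"/etc/app/config.yaml","retval":-2}
{"comm":"app","fname":"/var/run/secrets/token","retval":-13}
{"comm":"app","fname":"/etc/hosts","retval":3}
EOF

# Flag only the failing opens and label the errno
while IFS= read -r ev; do
  case "$ev" in
    *'"retval":-2'*)  echo "ENOENT (missing file): $ev" ;;
    *'"retval":-13'*) echo "EACCES (permission denied): $ev" ;;
  esac
done < /tmp/opens.jsonl
```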
### Security & Audit

| Gadget | Type | What It Does | When To Use |
|---|---|---|---|
| `trace_capabilities` | trace | Trace Linux capability checks | Permission denied from dropped capabilities, SecurityContext debugging |

## Symptom-to-Gadget Map

| Symptom | Gadget(s) | Why |
|---|---|---|
| DNS resolution failures | `trace_dns` | See actual queries, responses, rcodes, and latency |
| Connection refused / timeout | `trace_tcp` + `snapshot_socket` | Trace connections and verify listening ports |
| Silent connection drops | `trace_tcpdrop` + `trace_tcpretrans` | Kernel-level packet drops and retransmissions |
| High network latency | `trace_tcpretrans` | Spot retransmissions indicating congestion or lossy links |
| TLS / HTTPS routing issues | `trace_sni` | See which SNI values are sent in TLS handshakes |
| Port already in use | `trace_bind` + `snapshot_socket` | See bind failures and what holds the port |
| CrashLoopBackOff (unknown cause) | `trace_exec` + `trace_open` | See what runs at startup and what files it opens |
| OOMKilled pods | `trace_oomkill` + `top_process` | See OOM victim details and memory-heavy processes |
| Pod killed unexpectedly | `trace_signal` | See which signal killed the process and who sent it |
| PID pressure on node | `snapshot_process` + `top_process` | List and rank all processes |
| "Too many open files" | `top_file` | Find I/O-heavy files contributing to FD pressure |
| Missing config / secret mount | `trace_open` | See ENOENT / EACCES on file opens |
| Slow disk / PVC performance | `trace_fsslower` + `top_file` | Find slow FS operations and I/O-heavy files |
| Permission denied (capabilities) | `trace_capabilities` | See which capability check failed |
| High CPU (unknown cause) | `profile_cpu` + `top_process` | Profile stack traces and rank processes |
| Deep packet inspection | `tcpdump` | Raw pcap capture piped to `tcpdump -nvr -` |
| Catch-all / intermittent issues | `traceloop` | Syscall flight recorder — specify `--syscall-filters` to limit output |
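For the DNS row above, failing responses can be pulled out of a `trace_dns` JSON stream with plain `grep`. A sketch on a hypothetical sample; the field names are illustrative, with a numeric `rcode` where `0` means success:

```bash
# Hypothetical trace_dns sample: rcode 0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN
cat <<'EOF' > /tmp/dns.jsonl
{"name":"api.internal.svc.cluster.local.","rcode":3,"latency":180000}
{"name":"db.example.com.","rcode":0,"latency":950000}
{"name":"cache.example.com.","rcode":2,"latency":2100000}
EOF

# Keep only responses whose rcode is nonzero (resolution failures)
grep -v '"rcode":0' /tmp/dns.jsonl
```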
## Gadget Type Reference

| Type | Behavior | Recommended `--timeout` |
|---|---|---|
| `snapshot` | Point-in-time data, returns immediately | `--timeout 5` |
| `top` | Aggregated view, returns quickly | `--timeout 5` |
| `trace` | Streams events in real-time | `--timeout 30` |
| `profile` | Samples over a duration | `--timeout 30` |
| `tcpdump` | Streams pcap-ng data, pipe to `tcpdump -nvr -` | `--timeout 30` |

## Guardrails

- All IG gadgets are **read-only** — they do not modify cluster or application state.
- Resolve the correct node name before running IG commands.
- Set `--timeout` on every IG command to cap runtime.
- Prefer snapshot/top gadgets for quick checks; use trace/profile to observe behavior over time.
- Check the ephemeral debug pod logs to view the gadget output again if needed.