|
| 1 | +# Inspektor Gadget (IG) Reference |
| 2 | + |
| 3 | +Use Inspektor Gadget for real-time, low-level node/pod diagnostics when standard `kubectl` is insufficient. |
| 4 | + |
| 5 | +## Base Command Pattern |
| 6 | + |
| 7 | +```bash |
| 8 | +kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \ |
| 9 | + --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \ |
| 10 | + -- ig run <gadget>:v0.51.0 -o json --timeout <seconds> [filters...] |
| 11 | +``` |
| 12 | + |
| 13 | +Always pass `--timeout` to `ig` (after the `--` separator, not to `kubectl debug`) to cap runtime. Use `--timeout 5` for snapshot/top gadgets and `--timeout 30` for trace/profile gadgets. |
| 14 | + |
| 15 | +**Required:** Resolve the node name first: |
| 16 | + |
| 17 | +```bash |
| 18 | +kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}' |
| 19 | +``` |
| 20 | + |
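The two steps above (resolve the node, then run a gadget on it) can be chained in a small wrapper. This is a minimal sketch, not part of the reference itself: the names `IG_VERSION`, `IG_IMAGE`, and `ig_run_for_pod` are hypothetical, and it assumes `kubectl` is already pointed at the right cluster.

```shell
#!/usr/bin/env bash
# Sketch only: IG_VERSION, IG_IMAGE and ig_run_for_pod are illustrative names.
IG_VERSION="v0.51.0"
IG_IMAGE="mcr.microsoft.com/oss/v2/inspektor-gadget/ig:${IG_VERSION}"

# Usage: ig_run_for_pod <namespace> <pod> <gadget> <timeout-seconds> [extra ig flags...]
ig_run_for_pod() {
  local ns="$1" pod="$2" gadget="$3" timeout="$4"
  shift 4

  # Step 1: resolve the node the pod is scheduled on.
  local node
  node="$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')" || return 1
  if [ -z "$node" ]; then
    echo "pod $ns/$pod has no node assigned" >&2
    return 1
  fi

  # Step 2: run the gadget on that node, scoped to the pod.
  kubectl debug --profile=sysadmin "node/${node}" --attach --quiet \
    --image="$IG_IMAGE" \
    -- ig run "${gadget}:${IG_VERSION}" -o json --timeout "$timeout" \
       --k8s-namespace "$ns" --k8s-podname "$pod" "$@"
}
```

For example, `ig_run_for_pod default web trace_dns 30` would trace DNS for pod `web` in namespace `default` for 30 seconds. Pinning the gadget tag to the same version as the image mirrors the base command pattern.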
| 21 | +## Common Filters |
| 22 | + |
| 23 | +| Filter | Description | |
| 24 | +|---|---| |
| 25 | +| `--k8s-namespace <ns>` | Scope to a Kubernetes namespace | |
| 26 | +| `--k8s-podname <pod>` | Scope to a specific pod | |
| 27 | +| `--k8s-containername <ctr>` | Scope to a specific container | |
| 28 | +| `--timeout <seconds>` | Cap gadget runtime; set on every command | |
| 29 | +| `--max-entries <n>` | Max entries per batch for top/profile gadgets | |
| 30 | +| `--map-fetch-interval <dur>` | Map fetch interval for top (except `top_process`) and profile gadgets (default `1000ms`) | |
| 31 | +| `--interval <dur>` | Reporting interval for `top_process` only (e.g. `5s`) | |
| 32 | +| `--syscall-filters <list>` | Comma-separated syscalls for `traceloop` (e.g. `open,connect,accept`). **Always specify** to limit data volume | |
| 33 | + |
| 34 | +> **Tip:** For top/profile, set `--map-fetch-interval` ≤ half of `--timeout` to collect at least one batch. E.g. `--timeout 2 --map-fetch-interval 1000ms --max-entries 20`. |
| 35 | +> |
| 36 | +> **Note:** `top_process` uses `--interval` instead of `--map-fetch-interval`. E.g. `--timeout 10 --interval 5s --max-entries 20`. |
| 37 | + |
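The arithmetic behind the tip can be sketched as a helper that derives the largest `--map-fetch-interval` still satisfying "at most half of `--timeout`". The function names `half_timeout_ms` and `top_flags` are hypothetical, and the sketch assumes whole-second timeouts; it does not apply to `top_process`, which takes `--interval` instead.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not part of ig.

# half_timeout_ms <timeout-seconds>
# Prints half of the timeout as a millisecond duration, e.g. 2 -> 1000ms.
half_timeout_ms() {
  local timeout_s="$1"
  echo "$(( timeout_s * 1000 / 2 ))ms"
}

# top_flags <timeout-seconds> <max-entries>
# Prints a flag set for top/profile gadgets (except top_process).
top_flags() {
  echo "--timeout $1 --map-fetch-interval $(half_timeout_ms "$1") --max-entries $2"
}
```

With a 2-second timeout this reproduces the flags from the tip: `top_flags 2 20` prints `--timeout 2 --map-fetch-interval 1000ms --max-entries 20`.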
| 38 | +## Gadget Catalog |
| 39 | + |
| 40 | +### Networking |
| 41 | + |
| 42 | +| Gadget | Type | What It Does | When To Use | |
| 43 | +|---|---|---|---| |
| 44 | +| `trace_dns` | trace | Trace DNS queries and responses with latency | DNS failures, NXDOMAIN, SERVFAIL, slow resolution, intermittent DNS | |
| 45 | +| `trace_tcp` | trace | Trace TCP connect/accept/close events | Connection refused, timeouts, unexpected drops, mapping pod connectivity | |
| 46 | +| `trace_tcpdrop` | trace | Trace kernel TCP packet drops | Silent connection failures, packet loss, buffer overflows | |
| 47 | +| `trace_tcpretrans` | trace | Trace TCP retransmissions | Network congestion, lossy links, high latency between pods/services | |
| 48 | +| `trace_bind` | trace | Trace socket bind calls | Port conflicts, address-already-in-use errors | |
| 49 | +| `trace_sni` | trace | Trace TLS SNI (Server Name Indication) values | HTTPS routing issues, ingress TLS debugging, mTLS problems | |
| 50 | +| `snapshot_socket` | snapshot | List open sockets (TCP/UDP/Unix) | Port conflicts, listening ports, connection leaks, ECONNREFUSED | |
| 51 | +| `tcpdump` | special | Capture raw packets in pcap-ng format | Deep packet inspection, protocol-level debugging, reproducing network issues | |
| 52 | + |
| 53 | +#### tcpdump gadget |
| 54 | + |
| 55 | +The `tcpdump` gadget outputs raw pcap-ng data. Pipe it to the `tcpdump` CLI on your workstation for readable output: |
| 56 | + |
| 57 | +```bash |
| 58 | +kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \ |
| 59 | + --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \ |
| 60 | + -- ig run tcpdump:v0.51.0 -o pcap-ng --k8s-namespace <ns> --k8s-podname <pod> \ |
| 61 | + --timeout 30 --filter "port 80" \ |
| 62 | + | tcpdump -nvr - |
| 63 | +``` |
| 64 | + |
| 65 | +Parameters: |
| 66 | +- `--filter "<expr>"` — tcpdump filter expression (e.g., `port 80`, `host 10.0.0.1`, `tcp and port 443`) |
| 67 | +- `-o pcap-ng` — required output format (not `-o json`) |
| 68 | + |
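If the capture needs to be inspected more than once, the same stream can be written to a file instead of piped. This is a sketch under the assumption that redirecting stdout behaves the same as the pipe shown above; `capture_pcap` and its argument order are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: capture_pcap is an illustrative wrapper, not an ig feature.

# Usage: capture_pcap <node> <namespace> <pod> <timeout-seconds> <filter> <outfile>
capture_pcap() {
  local node="$1" ns="$2" pod="$3" timeout="$4" filter="$5" out="$6"

  # Same command as above, with stdout redirected to a file.
  kubectl debug --profile=sysadmin "node/${node}" --attach --quiet \
    --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:v0.51.0 \
    -- ig run tcpdump:v0.51.0 -o pcap-ng \
       --k8s-namespace "$ns" --k8s-podname "$pod" \
       --timeout "$timeout" --filter "$filter" > "$out"
}
```

After `capture_pcap <node-name> <ns> <pod> 30 "port 80" capture.pcapng`, the file can be read repeatedly with `tcpdump -nvr capture.pcapng`.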
| 69 | +### Process & Workload |
| 70 | + |
| 71 | +| Gadget | Type | What It Does | When To Use | |
| 72 | +|---|---|---|---| |
| 73 | +| `snapshot_process` | snapshot | List running processes in pod/node | PID pressure, unknown processes, verifying entrypoint, CrashLoopBackOff | |
| 74 | +| `trace_exec` | trace | Trace process execution (execve calls) | CrashLoopBackOff (what actually runs), unexpected child processes, security audit | |
| 75 | +| `trace_oomkill` | trace | Trace OOM kill events with victim details | OOMKilled pods — see which process was killed, memory usage at kill time | |
| 76 | +| `trace_signal` | trace | Trace signals delivered to processes | Unexpected SIGKILL/SIGTERM, liveness probe kills, graceful shutdown issues | |
| 77 | +| `top_process` | top | Rank processes by CPU/memory usage | Identifying resource-hungry processes inside a pod or across a node | |
| 78 | +| `profile_cpu` | profile | CPU profiling via stack sampling | High CPU usage investigation, finding hot code paths | |
| 79 | +| `traceloop` | trace | Record syscalls as a flight recorder | Catch-all for intermittent issues. **Always use `--syscall-filters`** (e.g., `open,connect,accept`) to limit data volume | |
| 80 | + |
| 81 | +### File & Storage |
| 82 | + |
| 83 | +| Gadget | Type | What It Does | When To Use | |
| 84 | +|---|---|---|---| |
| 85 | +| `trace_open` | trace | Trace file opens (`openat` syscalls) | Missing config/secret files (ENOENT), permission denied (EACCES), startup failures | |
| 86 | +| `trace_fsslower` | trace | Trace slow filesystem operations | Slow disk I/O, PVC performance issues, NFS/Azure Disk latency | |
| 87 | +| `top_file` | top | Rank files by read/write activity | Identifying I/O-heavy files, noisy log writers, disk pressure diagnosis | |
| 88 | + |
| 89 | +### Security & Audit |
| 90 | + |
| 91 | +| Gadget | Type | What It Does | When To Use | |
| 92 | +|---|---|---|---| |
| 93 | +| `trace_capabilities` | trace | Trace Linux capability checks | Permission denied from dropped capabilities, SecurityContext debugging | |
| 94 | + |
| 95 | +## Symptom-to-Gadget Map |
| 96 | + |
| 97 | +| Symptom | Gadget(s) | Why | |
| 98 | +|---|---|---| |
| 99 | +| DNS resolution failures | `trace_dns` | See actual queries, responses, rcodes, and latency | |
| 100 | +| Connection refused / timeout | `trace_tcp` + `snapshot_socket` | Trace connections and verify listening ports | |
| 101 | +| Silent connection drops | `trace_tcpdrop` + `trace_tcpretrans` | Kernel-level packet drops and retransmissions | |
| 102 | +| High network latency | `trace_tcpretrans` | Spot retransmissions indicating congestion or lossy links | |
| 103 | +| TLS / HTTPS routing issues | `trace_sni` | See which SNI values are sent in TLS handshakes | |
| 104 | +| Port already in use | `trace_bind` + `snapshot_socket` | See bind failures and what holds the port | |
| 105 | +| CrashLoopBackOff (unknown cause) | `trace_exec` + `trace_open` | See what runs at startup and what files it opens | |
| 106 | +| OOMKilled pods | `trace_oomkill` + `top_process` | See OOM victim details and memory-heavy processes | |
| 107 | +| Pod killed unexpectedly | `trace_signal` | See which signal killed the process and who sent it | |
| 108 | +| PID pressure on node | `snapshot_process` + `top_process` | List and rank all processes | |
| 109 | +| "Too many open files" | `top_file` | Find I/O-heavy files contributing to FD pressure | |
| 110 | +| Missing config / secret mount | `trace_open` | See ENOENT / EACCES on file opens | |
| 111 | +| Slow disk / PVC performance | `trace_fsslower` + `top_file` | Find slow FS operations and I/O-heavy files | |
| 112 | +| Permission denied (capabilities) | `trace_capabilities` | See which capability check failed | |
| 113 | +| High CPU (unknown cause) | `profile_cpu` + `top_process` | Profile stack traces and rank processes | |
| 114 | +| Deep packet inspection | `tcpdump` | Raw pcap capture piped to `tcpdump -nvr -` | |
| 115 | +| Catch-all / intermittent issues | `traceloop` | Syscall flight recorder — specify `--syscall-filters` to limit output | |
| 116 | + |
| 117 | +## Gadget Type Reference |
| 118 | + |
| 119 | +| Type | Behavior | Recommended `--timeout` | |
| 120 | +|---|---|---| |
| 121 | +| `snapshot` | Point-in-time data, returns immediately | `--timeout 5` | |
| 122 | +| `top` | Aggregated view, returns quickly | `--timeout 5` | |
| 123 | +| `trace` | Streams events in real-time | `--timeout 30` | |
| 124 | +| `profile` | Samples over a duration | `--timeout 30` | |
| 125 | +| `special` (`tcpdump`) | Streams pcap-ng data; pipe to `tcpdump -nvr -` | `--timeout 30` | |
| 126 | + |
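The timeout column above can be encoded as a lookup keyed on the gadget name. This sketch assumes the naming convention seen in the catalog (a `snapshot_`, `top_`, `trace_`, or `profile_` prefix, plus the two unprefixed gadgets `traceloop` and `tcpdump`); `recommended_timeout` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: recommended_timeout is an illustrative helper.

# Usage: recommended_timeout <gadget-name>
# Prints the timeout (in seconds) suggested by the type table.
recommended_timeout() {
  case "$1" in
    snapshot_*|top_*)                    echo 5 ;;   # quick, point-in-time or aggregated
    trace_*|traceloop|profile_*|tcpdump) echo 30 ;;  # streaming or sampling over time
    *) echo "unknown gadget: $1" >&2; return 1 ;;
  esac
}
```

A caller might write `--timeout "$(recommended_timeout trace_dns)"` and override the value only when a longer observation window is needed.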
| 127 | +## Guardrails |
| 128 | + |
| 129 | +- All IG gadgets are **read-only** — they do not modify cluster or application state. |
| 130 | +- Resolve the correct node name before running IG commands. |
| 131 | +- Set `--timeout` on every IG command to cap runtime. |
| 132 | +- Prefer snapshot/top for quick checks; use trace/profile to observe behavior over time. |
| 133 | +- To view gadget output again, check the logs of the debug pod created by `kubectl debug`. |
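Re-reading earlier output from the debug pod can be sketched as follows. It assumes that `kubectl debug node/<node-name>` names its pod `node-debugger-<node-name>-<suffix>` and creates it in the current namespace; `debug_pod_logs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: debug_pod_logs is an illustrative helper.

# Usage: debug_pod_logs <node-name>
# Prints the logs of the most recently created node debug pod for that node.
debug_pod_logs() {
  local node="$1" pod

  # Oldest first, so the last match is the newest debug pod.
  pod="$(kubectl get pods --sort-by=.metadata.creationTimestamp -o name \
          | grep -F "node-debugger-${node}-" | tail -n 1)"
  if [ -z "$pod" ]; then
    echo "no debug pod found for node $node" >&2
    return 1
  fi

  kubectl logs "$pod"
}
```

`grep -F` is used so that dots in node names are matched literally rather than as regular-expression wildcards.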