| 1 | +# Inspektor Gadget (IG) Reference |
| 2 | + |
| 3 | +Use Inspektor Gadget for real-time, low-level node/pod diagnostics when `kubectl` is insufficient. |
| 4 | + |
| 5 | +## IG Version |
| 6 | + |
| 7 | +`<ig-version>` = `v0.51.0`. Substitute this exact tag (including the `v` prefix) wherever `<ig-version>` appears. To upgrade, change only this line. |
| 8 | + |
| 9 | +## Base Command Pattern |
| 10 | + |
| 11 | +```bash |
| 12 | +kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \ |
| 13 | + --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:<ig-version> \ |
| 14 | + -- ig run <gadget>:<ig-version> -o json --timeout <seconds> [filters...] |
| 15 | +``` |
| 16 | + |
| 17 | +Always pass `--timeout` to `ig run` (that is, after the `--` separator) to cap runtime. Use `--timeout 5` for snapshot/top gadgets and `--timeout 30` for trace/profile gadgets. |
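As a rough sketch of how the pattern expands, the helper below prints the assembled command instead of running it, so it can be reviewed before a privileged debug pod is created. The names `ig_command`, `IG_VERSION`, and `IG_IMAGE` are illustrative assumptions, not part of IG or `kubectl`.

```bash
# Hypothetical helper: prints (does not execute) an IG command built from the base pattern.
IG_VERSION="${IG_VERSION:-v0.51.0}"  # assumed to mirror the tag in "IG Version"
IG_IMAGE="mcr.microsoft.com/oss/v2/inspektor-gadget/ig"

ig_command() {
  # usage: ig_command <node-name> <gadget> <timeout-seconds> [filters...]
  local node="$1" gadget="$2" timeout="$3"
  shift 3
  local filters="$*"  # remaining arguments are appended as gadget filters
  printf 'kubectl debug --profile=sysadmin node/%s --attach --quiet --image=%s:%s -- ig run %s:%s -o json --timeout %s%s\n' \
    "$node" "$IG_IMAGE" "$IG_VERSION" "$gadget" "$IG_VERSION" "$timeout" "${filters:+ $filters}"
}

# Example: preview a 30-second DNS trace scoped to one namespace.
ig_command my-node trace_dns 30 --k8s-namespace default
```

Printing rather than executing keeps the approval step explicit: the output can be shown to the user and run only once they agree.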
| 18 | + |
| 19 | +> **Note:** These commands run IG through `kubectl debug --profile=sysadmin`, which creates a privileged debug pod on the node. Only run them with explicit user approval and appropriate RBAC. |
| 20 | + |
| 21 | +**Required:** Resolve the node name first: |
| 22 | + |
| 23 | +```bash |
| 24 | +kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeName}' |
| 25 | +``` |
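The lookup above returns an empty string when the pod has not been scheduled yet, so it is worth checking the result before building an IG command. A minimal sketch, assuming a hypothetical wrapper named `node_for_pod`:

```bash
# Hypothetical wrapper around the jsonpath lookup; fails loudly on an empty result.
node_for_pod() {
  # usage: node_for_pod <pod-name> <namespace>
  local pod="$1" ns="$2" node
  node="$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')" || return 1
  # A pod that is still Pending has no nodeName, so the lookup prints nothing.
  if [ -z "$node" ]; then
    echo "pod ${ns}/${pod} is not scheduled on a node yet" >&2
    return 1
  fi
  printf '%s\n' "$node"
}

# Usage (requires cluster access):
#   node="$(node_for_pod <pod-name> <namespace>)" || exit 1
```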
| 26 | + |
| 27 | +## Common Filters |
| 28 | + |
| 29 | +| Filter | Description | |
| 30 | +|---|---| |
| 31 | +| `--k8s-namespace <ns>` | Scope to a Kubernetes namespace | |
| 32 | +| `--k8s-podname <pod>` | Scope to a specific pod | |
| 33 | +| `--k8s-containername <ctr>` | Scope to a specific container | |
| 34 | +| `--timeout <seconds>` | Cap gadget runtime in seconds; set it on every run | |
| 35 | +| `--max-entries <n>` | Max entries per batch for top/profile gadgets | |
| 36 | +| `--map-fetch-interval <dur>` | Map fetch interval for top (except `top_process`) and profile gadgets (default `1000ms`) | |
| 37 | +| `--interval <dur>` | Reporting interval for `top_process` only (e.g. `5s`) | |
| 38 | +| `--syscall-filters <list>` | Comma-separated syscalls for `traceloop` (e.g. `open,connect,accept`). **Always specify** to limit data volume | |
| 39 | + |
| 40 | +> **Tip:** For top/profile, set `--map-fetch-interval` ≤ half of `--timeout` to collect at least one batch. E.g. `--timeout 2 --map-fetch-interval 1000ms --max-entries 20`. |
| 41 | +> |
| 42 | +> **Note:** `top_process` uses `--interval` instead of `--map-fetch-interval`. E.g. `--timeout 10 --interval 5s --max-entries 20`. |
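The half-of-timeout rule is simple arithmetic. One possible way to derive the value is sketched below; `map_fetch_interval` is a hypothetical function name, and capping at the `1000ms` default is a choice made for this example rather than an IG requirement.

```bash
# Hypothetical helper: picks a --map-fetch-interval no larger than half the timeout.
map_fetch_interval() {
  # usage: map_fetch_interval <timeout-seconds>
  local timeout_s="$1"
  local half_ms=$(( timeout_s * 1000 / 2 ))
  # Cap at the 1000ms default; only very short timeouts need a smaller interval.
  if [ "$half_ms" -gt 1000 ]; then
    half_ms=1000
  fi
  printf '%sms\n' "$half_ms"
}

map_fetch_interval 2   # prints 1000ms
map_fetch_interval 1   # prints 500ms
```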
| 43 | + |
| 44 | +## Gadget Catalog |
| 45 | + |
| 46 | +### Networking |
| 47 | + |
| 48 | +| Gadget | Type | What It Does | When To Use | |
| 49 | +|---|---|---|---| |
| 50 | +| `trace_dns` | trace | Trace DNS queries and responses with latency | DNS failures, NXDOMAIN, SERVFAIL, slow resolution, intermittent DNS | |
| 51 | +| `trace_tcp` | trace | Trace TCP connect/accept/close events | Connection refused, timeouts, unexpected drops, mapping pod connectivity | |
| 52 | +| `trace_tcpretrans` | trace | Trace TCP retransmissions | Network congestion, lossy links, high latency between pods/services | |
| 53 | +| `trace_bind` | trace | Trace socket bind calls | Port conflicts, address-already-in-use errors | |
| 54 | +| `trace_sni` | trace | Trace TLS SNI (Server Name Indication) values | HTTPS routing issues, ingress TLS debugging, mTLS problems | |
| 55 | +| `snapshot_socket` | snapshot | List open sockets (TCP/UDP/Unix) | Port conflicts, listening ports, connection leaks, ECONNREFUSED | |
| 56 | +| `tcpdump` | special | Capture raw packets in pcap-ng format | Deep packet inspection, protocol-level debugging, reproducing network issues | |
| 57 | + |
| 58 | +#### tcpdump gadget |
| 59 | + |
| 60 | +Outputs raw pcap-ng data. Pipe to `tcpdump` for readable output: |
| 61 | + |
| 62 | +```bash |
| 63 | +kubectl debug --profile=sysadmin node/<node-name> --attach --quiet \ |
| 64 | + --image=mcr.microsoft.com/oss/v2/inspektor-gadget/ig:<ig-version> \ |
| 65 | + -- ig run tcpdump:<ig-version> -o pcap-ng --k8s-namespace <ns> --k8s-podname <pod> \ |
| 66 | + --timeout 30 --filter "port 80" \ |
| 67 | + | tcpdump -nvr - |
| 68 | +``` |
| 69 | + |
| 70 | +Use `--filter "<expr>"` for tcpdump filters (e.g., `port 80`, `host 10.0.0.1`). Output must be `-o pcap-ng` (not `-o json`). |
| 71 | + |
| 72 | +### Process & Workload |
| 73 | + |
| 74 | +| Gadget | Type | What It Does | When To Use | |
| 75 | +|---|---|---|---| |
| 76 | +| `snapshot_process` | snapshot | List running processes in pod/node | PID pressure, unknown processes, verifying entrypoint, CrashLoopBackOff | |
| 77 | +| `trace_exec` | trace | Trace process execution (execve calls) | CrashLoopBackOff (what actually runs), unexpected child processes, security audit | |
| 78 | +| `trace_oomkill` | trace | Trace OOM kill events with victim details | OOMKilled pods — see which process was killed, memory usage at kill time | |
| 79 | +| `trace_signal` | trace | Trace signals delivered to processes | Unexpected SIGKILL/SIGTERM, liveness probe kills, graceful shutdown issues | |
| 80 | +| `top_process` | top | Rank processes by CPU/memory usage | Identifying resource-hungry processes inside a pod or across a node | |
| 81 | +| `profile_cpu` | profile | CPU profiling via stack sampling | High CPU usage investigation, finding hot code paths | |
| 82 | +| `traceloop` | trace | Record syscalls as a flight recorder | Catch-all for intermittent issues. **Always use `--syscall-filters`** (e.g., `open,connect,accept`) to limit data volume | |
| 83 | + |
| 84 | +### File & Storage |
| 85 | + |
| 86 | +| Gadget | Type | What It Does | When To Use | |
| 87 | +|---|---|---|---| |
| 88 | +| `trace_open` | trace | Trace openat syscall | Missing config/secret files (ENOENT), permission denied (EACCES), startup failures | |
| 89 | +| `trace_fsslower` | trace | Trace slow filesystem operations | Slow disk I/O, PVC performance issues, NFS/Azure Disk latency | |
| 90 | +| `top_file` | top | Rank files by read/write activity | Identifying I/O-heavy files, noisy log writers, disk pressure diagnosis | |
| 91 | + |
| 92 | +### Security & Audit |
| 93 | + |
| 94 | +| Gadget | Type | What It Does | When To Use | |
| 95 | +|---|---|---|---| |
| 96 | +| `trace_capabilities` | trace | Trace Linux capability checks | Permission denied from dropped capabilities, SecurityContext debugging | |
| 97 | + |
| 98 | +## Symptom-to-Gadget Map |
| 99 | + |
| 100 | +| Symptom | Gadget(s) | |
| 101 | +|---|---| |
| 102 | +| DNS resolution failures | `trace_dns` | |
| 103 | +| Connection refused / timeout | `trace_tcp` + `snapshot_socket` | |
| 104 | +| Silent connection drops | `trace_tcpretrans` | |
| 105 | +| High network latency | `trace_tcpretrans` | |
| 106 | +| TLS / HTTPS routing issues | `trace_sni` | |
| 107 | +| Port already in use | `trace_bind` + `snapshot_socket` | |
| 108 | +| CrashLoopBackOff (unknown cause) | `trace_exec` + `trace_open` | |
| 109 | +| OOMKilled pods | `trace_oomkill` + `top_process` | |
| 110 | +| Pod killed unexpectedly | `trace_signal` | |
| 111 | +| PID pressure on node | `snapshot_process` + `top_process` | |
| 112 | +| "Too many open files" | `top_file` | |
| 113 | +| Missing config / secret mount | `trace_open` | |
| 114 | +| Slow disk / PVC performance | `trace_fsslower` + `top_file` | |
| 115 | +| Permission denied (capabilities) | `trace_capabilities` | |
| 116 | +| High CPU (unknown cause) | `profile_cpu` + `top_process` | |
| 117 | +| Deep packet inspection | `tcpdump` | |
| 118 | +| Catch-all / intermittent issues | `traceloop` (use `--syscall-filters`) | |
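For scripted triage, the map can be encoded as a lookup. The sketch below is one way to do that; the function name `gadgets_for_symptom` and the short symptom keys (`dns`, `crashloop`, and so on) are hypothetical labels invented for this example, while the gadget names come from the table above.

```bash
# Hypothetical lookup: maps a short symptom key to the gadget(s) from the table.
gadgets_for_symptom() {
  case "$1" in
    dns)                       echo "trace_dns" ;;
    conn-refused|conn-timeout) echo "trace_tcp snapshot_socket" ;;
    conn-drops|net-latency)    echo "trace_tcpretrans" ;;
    tls)                       echo "trace_sni" ;;
    port-in-use)               echo "trace_bind snapshot_socket" ;;
    crashloop)                 echo "trace_exec trace_open" ;;
    oomkilled)                 echo "trace_oomkill top_process" ;;
    pod-killed)                echo "trace_signal" ;;
    pid-pressure)              echo "snapshot_process top_process" ;;
    open-files)                echo "top_file" ;;
    missing-file)              echo "trace_open" ;;
    slow-disk)                 echo "trace_fsslower top_file" ;;
    capabilities)              echo "trace_capabilities" ;;
    high-cpu)                  echo "profile_cpu top_process" ;;
    packets)                   echo "tcpdump" ;;
    intermittent)              echo "traceloop" ;;  # remember --syscall-filters
    *) echo "unknown symptom: $1" >&2; return 1 ;;
  esac
}

gadgets_for_symptom crashloop   # prints: trace_exec trace_open
```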
| 119 | + |
| 120 | +## Gadget Type Reference |
| 121 | + |
| 122 | +| Type | Behavior | Recommended `--timeout` | |
| 123 | +|---|---|---| |
| 124 | +| `snapshot` | Point-in-time data, returns immediately | `--timeout 5` | |
| 125 | +| `top` | Aggregated view, returns quickly | `--timeout 5` | |
| 126 | +| `trace` | Streams events in real-time | `--timeout 30` | |
| 127 | +| `profile` | Samples over a duration | `--timeout 30` | |
| 128 | +| `special` (`tcpdump`) | Streams pcap-ng data, pipe to `tcpdump -nvr -` | `--timeout 30` | |
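The timeouts in this table can be selected automatically. The sketch below assumes the gadget type can be read from the name prefix, which holds for every gadget in the catalog here but is not guaranteed for other gadgets; `timeout_for_gadget` is a hypothetical name.

```bash
# Hypothetical helper: returns the recommended --timeout (seconds) for a gadget name.
timeout_for_gadget() {
  case "$1" in
    snapshot_*|top_*)                    echo 5 ;;   # snapshot and top gadgets
    trace_*|traceloop|profile_*|tcpdump) echo 30 ;;  # trace, profile, and tcpdump
    *) echo "unrecognized gadget: $1" >&2; return 1 ;;
  esac
}

timeout_for_gadget snapshot_socket   # prints 5
timeout_for_gadget trace_dns         # prints 30
```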
| 129 | + |
| 130 | +## Guardrails |
| 131 | + |
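| 132 | +- IG gadgets are **read-only**: they observe and do not modify workloads or application state. The only cluster change is the privileged debug pod that `kubectl debug` creates to run them. |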
| 133 | +- Resolve the correct node name before running any IG command. |
| 134 | +- Always set `--timeout` to cap runtime. Prefer snapshot/top for quick checks; trace/profile for behavior over time. |