fix: resolve Vale spelling errors in k8s_infra_monitor README
Wrap Kubernetes-specific terms (OOMKilled, PIDPressure, CrashLoopBackOff,
MemoryPressure, DiskPressure, kubeconfig) in backticks so Vale treats them
as code. Replace "Triaging" with "Investigating" and "siloed" with
"isolated" to use standard dictionary words.
Signed-off-by: futhgar <jmaldonado.rosa@gmail.com>
Kubernetes clusters generate a constant stream of operational signals — node conditions, pod status changes, events, and resource metrics. Triaging these signals manually is time-consuming, especially in clusters running dozens of workloads across multiple namespaces.
+
Kubernetes clusters generate a constant stream of operational signals — node conditions, pod status changes, events, and resource metrics. Investigating these signals manually is time-consuming, especially in clusters running dozens of workloads across multiple namespaces.
This example provides an agentic system that:
-
1. **Gathers node diagnostics**: Checks node readiness, conditions (MemoryPressure, DiskPressure, PIDPressure), and resource utilization via `kubectl top`.
-
2. **Scans pod health**: Identifies unhealthy pods (CrashLoopBackOff, OOMKilled, Pending, Evicted) and flags containers with high restart counts.
+
1. **Gathers node diagnostics**: Checks node readiness, conditions (`MemoryPressure`, `DiskPressure`, `PIDPressure`), and resource utilization via `kubectl top`.
+
2. **Scans pod health**: Identifies unhealthy pods (`CrashLoopBackOff`, `OOMKilled`, `Pending`, `Evicted`) and flags containers with high restart counts.
3. **Collects cluster events**: Retrieves recent Warning events and correlates them with affected resources.
4. **Analyzes resource pressure**: Detects nodes approaching CPU or memory thresholds and flags active pressure conditions.
5. **Classifies severity**: Uses an LLM to classify the overall incident severity based on collected evidence.
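The pod health step in the list above could be sketched roughly as follows. This is only an illustration, not the example's actual tool code: the `PodStatus` shape, the `scan_pod_health` name, and the `restart_threshold` default are all hypothetical, and the pod data is hard-coded rather than fetched from a cluster.

```python
from dataclasses import dataclass

# Status reasons the README lists as unhealthy.
UNHEALTHY_REASONS = {"CrashLoopBackOff", "OOMKilled", "Pending", "Evicted"}


@dataclass
class PodStatus:
    """Hypothetical, simplified view of a pod (not the example's real schema)."""
    name: str
    namespace: str
    reason: str         # e.g. "Running", "CrashLoopBackOff"
    restart_count: int


def scan_pod_health(pods, restart_threshold=5):
    """Return (pod, findings) pairs for pods that look unhealthy.

    restart_threshold is an assumed default, not a value from the README.
    """
    flagged = []
    for pod in pods:
        findings = []
        if pod.reason in UNHEALTHY_REASONS:
            findings.append(f"status: {pod.reason}")
        if pod.restart_count >= restart_threshold:
            findings.append(f"restarts: {pod.restart_count}")
        if findings:
            flagged.append((pod, findings))
    return flagged


# Made-up sample data for illustration only.
pods = [
    PodStatus("api-7f9c", "prod", "Running", 0),
    PodStatus("worker-2b1d", "prod", "CrashLoopBackOff", 12),
    PodStatus("cache-55aa", "staging", "Running", 7),
]
for pod, findings in scan_pod_health(pods):
    print(f"{pod.namespace}/{pod.name}: {', '.join(findings)}")
```

Keeping the status check and the restart check separate means a pod that is nominally `Running` but restarts often is still reported.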
@@ -79,7 +79,7 @@ This example provides an agentic system that:
An agentic approach provides significant advantages over static dashboards or rule-based alerting:
- **Contextual investigation**: The agent decides which tools to call based on the query, rather than running every check every time.
-
- **Cross-signal correlation**: Unlike siloed monitoring tools, the agent correlates data from nodes, pods, events, and resources to identify root causes (e.g., OOMKilled pods + MemoryPressure condition = memory exhaustion on a specific node).
+
- **Cross-signal correlation**: Unlike isolated monitoring tools, the agent correlates data from nodes, pods, events, and resources to identify root causes (e.g., `OOMKilled` pods + `MemoryPressure` condition = memory exhaustion on a specific node).
- **Natural language reports**: Produces human-readable incident summaries that can be directly shared with team members or fed into ticketing systems.
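The cross-signal example in the list above (`OOMKilled` pods plus a `MemoryPressure` condition) can be illustrated with a small sketch. In the README the agent performs this correlation through tool calls and an LLM; the deterministic function below is only a stand-in to show the idea, and the name `correlate_memory_exhaustion` and both input shapes are assumptions.

```python
from collections import defaultdict


def correlate_memory_exhaustion(node_conditions, pod_statuses):
    """Map node name -> OOMKilled pod names, for nodes reporting MemoryPressure.

    node_conditions: {node_name: set of active condition names}
    pod_statuses: iterable of (pod_name, node_name, reason) tuples
    Both shapes are illustrative assumptions.
    """
    # Group OOMKilled pods by the node they ran on.
    oom_by_node = defaultdict(list)
    for pod_name, node_name, reason in pod_statuses:
        if reason == "OOMKilled":
            oom_by_node[node_name].append(pod_name)

    # Keep only nodes where both signals are present.
    return {
        node: oom_by_node[node]
        for node, conditions in node_conditions.items()
        if "MemoryPressure" in conditions and node in oom_by_node
    }


# Made-up sample data for illustration only.
node_conditions = {
    "worker-1": {"Ready", "MemoryPressure"},
    "worker-2": {"Ready"},
}
pod_statuses = [
    ("etl-0", "worker-1", "OOMKilled"),
    ("etl-1", "worker-1", "OOMKilled"),
    ("web-0", "worker-2", "OOMKilled"),
    ("web-1", "worker-2", "Running"),
]
print(correlate_memory_exhaustion(node_conditions, pod_statuses))
```

Here `worker-2` is excluded even though it has an `OOMKilled` pod, because the node itself reports no `MemoryPressure`; that case points at a per-container limit rather than node-level exhaustion.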
## How It Works
@@ -123,7 +123,7 @@ functions:
- `offline_mode`: When `true`, tools return pre-defined responses from the offline scenario dataset.
- `cpu_threshold_percent` / `memory_threshold_percent`: Configurable thresholds for resource pressure alerts.
-
- `kubeconfig_path`: Optional path to a kubeconfig file for live mode. Defaults to the standard `kubectl` config.
+
- `kubeconfig_path`: Optional path to a `kubeconfig` file for live mode. Defaults to the standard `kubectl` config.
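As a rough illustration of how the `cpu_threshold_percent` and `memory_threshold_percent` options above might be applied, consider the sketch below. The function name, the input shape, and the default values of 80 and 85 are assumptions made for this example; the diff does not show the real defaults or the tool's implementation.

```python
def find_resource_pressure(node_usage,
                           cpu_threshold_percent=80.0,
                           memory_threshold_percent=85.0):
    """Return alert strings for nodes at or above either threshold.

    node_usage: {node_name: {"cpu_percent": float, "memory_percent": float}}
    The input shape and the default thresholds are illustrative assumptions.
    """
    alerts = []
    for node, usage in sorted(node_usage.items()):
        if usage["cpu_percent"] >= cpu_threshold_percent:
            alerts.append(f"{node}: CPU at {usage['cpu_percent']:.0f}%")
        if usage["memory_percent"] >= memory_threshold_percent:
            alerts.append(f"{node}: memory at {usage['memory_percent']:.0f}%")
    return alerts


# Made-up sample data for illustration only.
usage = {
    "worker-1": {"cpu_percent": 42.0, "memory_percent": 91.0},
    "worker-2": {"cpu_percent": 88.0, "memory_percent": 60.0},
}
for alert in find_resource_pressure(usage):
    print(alert)
```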
#### Workflow
@@ -152,7 +152,7 @@ Offline mode uses predefined scenarios to simulate cluster issues without requir
Three scenarios are included:
- **`node-not-ready`**: A worker node becomes unreachable, causing pod evictions.
-
- **`memory-pressure`**: Multiple pods are OOMKilled due to memory exhaustion on a worker node.
+
- **`memory-pressure`**: Multiple pods are `OOMKilled` due to memory exhaustion on a worker node.
- **`healthy-cluster`**: Normal cluster operations with no issues.
```bash
@@ -196,4 +196,4 @@ nat run \
You can customize the live mode configuration to:
- Target specific namespaces with the `namespaces` list in `pod_health_check`.
- Adjust resource thresholds with `cpu_threshold_percent` and `memory_threshold_percent`.
-
- Point to a specific kubeconfig file with `kubeconfig_path`.
+
- Point to a specific `kubeconfig` file with `kubeconfig_path`.