| 1 | +<!-- |
| 2 | +SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| 3 | +SPDX-License-Identifier: Apache-2.0 |
| 4 | + |
| 5 | +Licensed under the Apache License, Version 2.0 (the "License"); |
| 6 | +you may not use this file except in compliance with the License. |
| 7 | +You may obtain a copy of the License at |
| 8 | + |
| 9 | +http://www.apache.org/licenses/LICENSE-2.0 |
| 10 | + |
| 11 | +Unless required by applicable law or agreed to in writing, software |
| 12 | +distributed under the License is distributed on an "AS IS" BASIS, |
| 13 | +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 14 | +See the License for the specific language governing permissions and |
| 15 | +limitations under the License. |
| 16 | +--> |
| 17 | + |
| 18 | +# Kubernetes Infrastructure Monitor using NeMo Agent Toolkit |
| 19 | + |
| 20 | +**Complexity:** 🟨 Intermediate |
| 21 | + |
| 22 | +This example demonstrates how to build a Kubernetes cluster monitoring agent using NeMo Agent Toolkit and LangGraph. The agent answers cluster health queries by gathering diagnostics from multiple tools (node status, pod health, cluster events, and resource utilization), then correlates the findings to produce structured incident reports with severity classification. |
| 23 | + |
| 24 | +## Table of Contents |
| 25 | + |
| 26 | +- [Key Features](#key-features) |
| 27 | +- [Installation and Setup](#installation-and-setup) |
| 28 | +- [Use Case Description](#use-case-description) |
| 29 | + - [Why Use an Agentic Design?](#why-use-an-agentic-design) |
| 30 | +- [How It Works](#how-it-works) |
| 31 | + - [Understanding the Configuration](#understanding-the-configuration) |
| 32 | +- [Example Usage](#example-usage) |
| 33 | + - [Running in Offline Mode](#running-in-offline-mode) |
| 34 | + - [Running in Live Mode](#running-in-live-mode) |
| 35 | + |
| 36 | +## Key Features |
| 37 | + |
| 38 | +- **Automated Cluster Health Analysis:** An agent that autonomously investigates Kubernetes cluster health queries using multiple diagnostic tools and generates structured reports. |
| 39 | +- **Multi-Tool Diagnostic Framework:** Integrates node status checks, pod health scanning, event collection, and resource pressure analysis for comprehensive cluster diagnosis. |
| 40 | +- **Dynamic Tool Selection:** The agent selects appropriate diagnostic tools based on the query context — a question about crashing pods triggers pod health and event checks, while a node issue triggers node status and resource analysis. |
| 41 | +- **Severity Classification:** Automatically classifies incidents as critical, warning, or informational based on the collected evidence. |
| 42 | +- **Offline and Live Modes:** Run with synthetic scenarios for development and testing, or connect to a real Kubernetes cluster via `kubectl` for production monitoring. |
| 43 | + |
| 44 | +## Installation and Setup |
| 45 | + |
| 46 | +If you have not already done so, install the NeMo Agent Toolkit following the [official documentation](https://docs.nvidia.com/nemo/agent-toolkit/latest/get-started/installation.html). |
| 47 | + |
| 48 | +### Install This Workflow |
| 49 | + |
| 50 | +From the root directory of the NeMo Agent Toolkit repository: |
| 51 | + |
| 52 | +```bash |
| 53 | +uv pip install -e examples/k8s_infra_monitor |
| 54 | +``` |
| 55 | + |
| 56 | +### Set Up API Keys |
| 57 | + |
| 58 | +Export your NVIDIA API key: |
| 59 | + |
| 60 | +```bash |
| 61 | +export NVIDIA_API_KEY=<YOUR_API_KEY> |
| 62 | +``` |
| 63 | + |
| 64 | +## Use Case Description |
| 65 | + |
| 66 | +Kubernetes clusters generate a constant stream of operational signals — node conditions, pod status changes, events, and resource metrics. Triaging these signals manually is time-consuming, especially in clusters running dozens of workloads across multiple namespaces. |
| 67 | + |
| 68 | +This example provides an agentic system that: |
| 69 | + |
| 70 | +1. **Gathers node diagnostics**: Checks node readiness, conditions (MemoryPressure, DiskPressure, PIDPressure), and resource utilization via `kubectl top`. |
| 71 | +2. **Scans pod health**: Identifies unhealthy pods (CrashLoopBackOff, OOMKilled, Pending, Evicted) and flags containers with high restart counts. |
| 72 | +3. **Collects cluster events**: Retrieves recent Warning events and correlates them with affected resources. |
| 73 | +4. **Analyzes resource pressure**: Detects nodes approaching CPU or memory thresholds and flags active pressure conditions. |
| 74 | +5. **Classifies severity**: Uses an LLM to classify the overall incident severity based on collected evidence. |
| 75 | +6. **Generates structured reports**: Produces markdown reports with findings, root cause analysis, and recommended remediation steps. |
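
The pod health step in the list above can be sketched as follows. This is a minimal illustration, not the example's actual implementation: it assumes the input has the shape of `kubectl get pods -o json` output, and the function name `find_unhealthy_pods`, the `HighRestartCount` label, and the restart threshold of 5 are hypothetical.

```python
from typing import Any

# Hypothetical threshold; the example's real default is not stated in this README.
HIGH_RESTART_THRESHOLD = 5
UNHEALTHY_REASONS = {"CrashLoopBackOff", "OOMKilled", "Evicted"}


def find_unhealthy_pods(
    pod_list: dict[str, Any],
    restart_threshold: int = HIGH_RESTART_THRESHOLD,
) -> list[dict[str, Any]]:
    """Return one finding per pod that looks unhealthy in a PodList-shaped dict."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        reasons = set()

        # Pod-level signals: Pending phase, or a pod-level reason such as Evicted.
        if status.get("phase") == "Pending":
            reasons.add("Pending")
        if status.get("reason") in UNHEALTHY_REASONS:
            reasons.add(status["reason"])

        # Container-level signals: current and previous container states.
        restarts = 0
        for cs in status.get("containerStatuses", []):
            restarts += cs.get("restartCount", 0)
            for key in ("state", "lastState"):
                for detail in cs.get(key, {}).values():
                    if detail.get("reason") in UNHEALTHY_REASONS:
                        reasons.add(detail["reason"])

        if restarts >= restart_threshold:
            reasons.add("HighRestartCount")

        if reasons:
            findings.append({
                "namespace": meta.get("namespace"),
                "name": meta.get("name"),
                "reasons": sorted(reasons),
                "restarts": restarts,
            })
    return findings


# Synthetic sample data for illustration only.
sample = {
    "items": [
        {"metadata": {"namespace": "ml-serving", "name": "model-a"},
         "status": {"phase": "Running", "containerStatuses": [
             {"restartCount": 7,
              "state": {"waiting": {"reason": "CrashLoopBackOff"}},
              "lastState": {"terminated": {"reason": "OOMKilled"}}}]}},
        {"metadata": {"namespace": "default", "name": "web"},
         "status": {"phase": "Running", "containerStatuses": [
             {"restartCount": 0, "state": {"running": {}}}]}},
    ]
}
findings = find_unhealthy_pods(sample)
print(findings)
```

Only `model-a` is reported here, because `web` is running with no restarts and no failure reason.
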
| 76 | + |
| 77 | +### Why Use an Agentic Design? |
| 78 | + |
| 79 | +An agentic approach has several advantages over static dashboards and rule-based alerting: |
| 80 | + |
| 81 | +- **Contextual investigation**: The agent decides which tools to call based on the query, rather than running every check every time. |
| 82 | +- **Cross-signal correlation**: Unlike siloed monitoring tools, the agent correlates data from nodes, pods, events, and resources to identify root causes (e.g., OOMKilled pods + MemoryPressure condition = memory exhaustion on a specific node). |
| 83 | +- **Natural language reports**: Produces human-readable incident summaries that can be directly shared with team members or fed into ticketing systems. |
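
The cross-signal correlation described above (OOMKilled pods plus a MemoryPressure condition pointing to memory exhaustion on one node) can be illustrated with a small rule. In this example the agent's LLM performs the correlation; the sketch below is a hand-written stand-in, and the record fields (`name`, `node`, `reasons`) and the function name are assumptions made for illustration.

```python
def correlate_memory_exhaustion(pod_findings, node_conditions):
    """Pair OOMKilled pods with nodes that report MemoryPressure.

    pod_findings: list of dicts with hypothetical keys name, node, reasons.
    node_conditions: mapping of node name to a list of active condition names.
    """
    pressured = {
        node for node, conditions in node_conditions.items()
        if "MemoryPressure" in conditions
    }
    pods_by_node = {}
    for pod in pod_findings:
        if "OOMKilled" in pod["reasons"] and pod["node"] in pressured:
            pods_by_node.setdefault(pod["node"], []).append(pod["name"])
    return [
        {"node": node, "root_cause": "memory exhaustion", "pods": sorted(pods)}
        for node, pods in sorted(pods_by_node.items())
    ]


# Synthetic sample data for illustration only.
node_conditions = {"worker-1": [], "worker-2": ["MemoryPressure"]}
pod_findings = [
    {"name": "model-b", "node": "worker-2", "reasons": ["OOMKilled"]},
    {"name": "model-a", "node": "worker-2", "reasons": ["CrashLoopBackOff", "OOMKilled"]},
    {"name": "web", "node": "worker-1", "reasons": ["CrashLoopBackOff"]},
]
incidents = correlate_memory_exhaustion(pod_findings, node_conditions)
print(incidents)
```

Neither signal alone names a root cause; the rule only fires when both appear on the same node, which is the kind of joined evidence a siloed tool does not surface.
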
| 84 | + |
| 85 | +## How It Works |
| 86 | + |
| 87 | +### Diagnostic Tools |
| 88 | + |
| 89 | +| Tool | Description | Live Mode | |
| 90 | +|------|-------------|-----------| |
| 91 | +| `node_status_check` | Retrieves node readiness, conditions, and resource utilization (`kubectl get nodes`, `kubectl top nodes`) | Uses `kubectl` | |
| 92 | +| `pod_health_check` | Scans for unhealthy pods and high restart counts across namespaces | Uses `kubectl` | |
| 93 | +| `event_collector` | Collects recent Warning events and correlates them with affected resources | Uses `kubectl` | |
| 94 | +| `resource_pressure_check` | Analyzes CPU/memory utilization against configurable thresholds, checks for pressure conditions | Uses `kubectl` | |
| 95 | +| `severity_classifier` | Classifies the final report's severity as critical, warning, or informational | LLM-based | |
| 96 | + |
| 97 | +### Workflow |
| 98 | + |
| 99 | +1. A cluster health query is received (natural language or JSON with scenario context). |
| 100 | +2. The monitor agent selects relevant diagnostic tools based on the query. |
| 101 | +3. Tools gather data (from `kubectl` in live mode, or from offline scenarios). |
| 102 | +4. The agent correlates findings across all tool outputs. |
| 103 | +5. A structured diagnostic report is generated. |
| 104 | +6. The severity classifier appends an incident severity classification. |
| 105 | + |
| 106 | +### Understanding the Configuration |
| 107 | + |
| 108 | +#### Functions |
| 109 | + |
| 110 | +Each tool is configured in the `functions` section: |
| 111 | + |
| 112 | +```yaml |
| 113 | +functions: |
| 114 | + node_status_check: |
| 115 | + _type: node_status_check |
| 116 | + offline_mode: true |
| 117 | + resource_pressure_check: |
| 118 | + _type: resource_pressure_check |
| 119 | + offline_mode: true |
| 120 | + cpu_threshold_percent: 80 |
| 121 | + memory_threshold_percent: 85 |
| 122 | +``` |
| 123 | + |
| 124 | +- `offline_mode`: When `true`, tools return pre-defined responses from the offline scenario dataset. |
| 125 | +- `cpu_threshold_percent` / `memory_threshold_percent`: Configurable thresholds for resource pressure alerts. |
| 126 | +- `kubeconfig_path`: Optional path to a kubeconfig file for live mode. Defaults to the standard `kubectl` config. |
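
To make the threshold options concrete, here is a minimal sketch of how utilization might be compared against them. Only the parameter names `cpu_threshold_percent` and `memory_threshold_percent` and the values 80 and 85 come from the configuration above; the `NodeUsage` type, the function name, and the choice of an inclusive (`>=`) comparison are assumptions.

```python
from dataclasses import dataclass


@dataclass
class NodeUsage:
    """Hypothetical per-node utilization record, in percent of capacity."""
    name: str
    cpu_percent: float
    memory_percent: float


def check_resource_pressure(nodes, cpu_threshold_percent=80, memory_threshold_percent=85):
    """Return one alert string per node metric at or above its threshold."""
    alerts = []
    for node in nodes:
        if node.cpu_percent >= cpu_threshold_percent:
            alerts.append(
                f"{node.name}: CPU at {node.cpu_percent:.0f}% "
                f"(threshold {cpu_threshold_percent}%)"
            )
        if node.memory_percent >= memory_threshold_percent:
            alerts.append(
                f"{node.name}: memory at {node.memory_percent:.0f}% "
                f"(threshold {memory_threshold_percent}%)"
            )
    return alerts


# Synthetic sample data for illustration only.
nodes = [NodeUsage("worker-1", 42, 60), NodeUsage("worker-2", 55, 91)]
alerts = check_resource_pressure(nodes)
print(alerts)
```

Lowering a threshold in the YAML makes the corresponding check more sensitive; for instance, a memory threshold of 60 would also flag `worker-1` in this sample.
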
| 127 | + |
| 128 | +#### Workflow |
| 129 | + |
| 130 | +```yaml |
| 131 | +workflow: |
| 132 | + _type: k8s_infra_monitor |
| 133 | + tool_names: |
| 134 | + - node_status_check |
| 135 | + - pod_health_check |
| 136 | + - event_collector |
| 137 | + - resource_pressure_check |
| 138 | + llm_name: monitor_agent_llm |
| 139 | + offline_mode: true |
| 140 | + offline_data_path: examples/k8s_infra_monitor/data/offline_scenarios.json |
| 141 | +``` |
| 142 | + |
| 143 | +#### LLMs |
| 144 | + |
| 145 | +All tools and the main agent use NVIDIA NIM with `nvidia/nemotron-3-nano-30b-a3b` by default. You can swap this for any supported model. |
| 146 | + |
| 147 | +## Example Usage |
| 148 | + |
| 149 | +### Running in Offline Mode |
| 150 | + |
| 151 | +Offline mode uses predefined scenarios to simulate cluster issues without requiring a real Kubernetes cluster. |
| 152 | + |
| 153 | +Three scenarios are included: |
| 154 | +- **`node-not-ready`**: A worker node becomes unreachable, causing pod evictions. |
| 155 | +- **`memory-pressure`**: Multiple pods are OOMKilled due to memory exhaustion on a worker node. |
| 156 | +- **`healthy-cluster`**: Normal cluster operations with no issues. |
| 157 | + |
| 158 | +```bash |
| 159 | +# Investigate a node failure |
| 160 | +nat run \ |
| 161 | + --config_file=examples/k8s_infra_monitor/configs/config_offline_mode.yml \ |
| 162 | + --input '{"scenario_id": "node-not-ready", "query": "Worker node worker-2 appears to be down. Investigate the cluster health."}' |
| 163 | +``` |
| 164 | + |
| 165 | +```bash |
| 166 | +# Investigate OOMKilled pods |
| 167 | +nat run \ |
| 168 | + --config_file=examples/k8s_infra_monitor/configs/config_offline_mode.yml \ |
| 169 | + --input '{"scenario_id": "memory-pressure", "query": "Multiple pods are crashing in the ml-serving namespace. Check what is happening."}' |
| 170 | +``` |
| 171 | + |
| 172 | +```bash |
| 173 | +# Routine health check |
| 174 | +nat run \ |
| 175 | + --config_file=examples/k8s_infra_monitor/configs/config_offline_mode.yml \ |
| 176 | + --input '{"scenario_id": "healthy-cluster", "query": "Run a routine health check on the Kubernetes cluster."}' |
| 177 | +``` |
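
When scripting these runs, hand-writing the JSON inside shell quotes is error-prone. The sketch below builds the same `nat run` invocation from Python. The flags and config path are taken from the commands above; the helper name `build_offline_command` is hypothetical, and the code only constructs and prints the command without running it.

```python
import json
import shlex

# Config path as used in the commands above.
CONFIG = "examples/k8s_infra_monitor/configs/config_offline_mode.yml"


def build_offline_command(scenario_id, query, config_file=CONFIG):
    """Return the argument list for an offline-mode `nat run` invocation."""
    payload = json.dumps({"scenario_id": scenario_id, "query": query})
    return ["nat", "run", f"--config_file={config_file}", "--input", payload]


cmd = build_offline_command(
    "healthy-cluster",
    "Run a routine health check on the Kubernetes cluster.",
)
# shlex.join quotes the JSON payload so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root if you want to launch the run directly.
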
| 178 | + |
| 179 | +To evaluate the agent across all scenarios: |
| 180 | + |
| 181 | +```bash |
| 182 | +nat eval --config_file=examples/k8s_infra_monitor/configs/config_offline_mode.yml |
| 183 | +``` |
| 184 | + |
| 185 | +### Running in Live Mode |
| 186 | + |
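| 187 | +Live mode connects to a real Kubernetes cluster using `kubectl`. Ensure `kubectl` can reach the cluster through its standard configuration (the default kubeconfig or the `KUBECONFIG` environment variable), or specify `kubeconfig_path` in each tool's configuration. |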
| 188 | + |
| 189 | +```bash |
| 190 | +# Run a live cluster health check |
| 191 | +nat run \ |
| 192 | + --config_file=examples/k8s_infra_monitor/configs/config_live_mode.yml \ |
| 193 | + --input "Check the overall health of the Kubernetes cluster. Are there any unhealthy pods or nodes under resource pressure?" |
| 194 | +``` |
| 195 | + |
| 196 | +You can customize the live mode configuration to: |
| 197 | +- Target specific namespaces with the `namespaces` list in `pod_health_check`. |
| 198 | +- Adjust resource thresholds with `cpu_threshold_percent` and `memory_threshold_percent`. |
| 199 | +- Point to a specific kubeconfig file with `kubeconfig_path`. |