Commit e8ad305

feat: add Kubernetes infrastructure monitor agent example
Add a new example that demonstrates automated Kubernetes cluster health monitoring using NeMo Agent Toolkit and LangGraph. The agent orchestrates four diagnostic tools (node status, pod health, event collection, resource pressure analysis) to investigate cluster health queries and classify incident severity, producing structured markdown reports with root cause analysis and remediation steps.

Tools operate in offline mode (with three bundled scenarios: node-not-ready, memory-pressure, healthy-cluster) or live mode via kubectl integration. Includes evaluation dataset, RAGAS-based eval config, integration tests, and full documentation.

Moved from NVIDIA/NeMo-Agent-Toolkit#1805 per maintainer guidance.

Signed-off-by: futhgar <jmaldonado.rosa@gmail.com>
1 parent 27ac411 commit e8ad305

18 files changed

Lines changed: 1470 additions & 0 deletions
Lines changed: 199 additions & 0 deletions
@@ -0,0 +1,199 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Kubernetes Infrastructure Monitor using NeMo Agent Toolkit

**Complexity:** 🟨 Intermediate

This example demonstrates how to build a Kubernetes cluster monitoring agent using NeMo Agent Toolkit and LangGraph. The agent answers cluster health queries by gathering diagnostics from multiple tools (node status, pod health, cluster events, and resource utilization), then correlates the findings to produce structured incident reports with a severity classification.

## Table of Contents

- [Key Features](#key-features)
- [Installation and Setup](#installation-and-setup)
- [Use Case Description](#use-case-description)
  - [Why Use an Agentic Design?](#why-use-an-agentic-design)
- [How It Works](#how-it-works)
  - [Understanding the Configuration](#understanding-the-configuration)
- [Example Usage](#example-usage)
  - [Running in Offline Mode](#running-in-offline-mode)
  - [Running in Live Mode](#running-in-live-mode)

## Key Features

- **Automated Cluster Health Analysis:** An agent that autonomously investigates Kubernetes cluster health queries using multiple diagnostic tools and generates structured reports.
- **Multi-Tool Diagnostic Framework:** Integrates node status checks, pod health scanning, event collection, and resource pressure analysis for comprehensive cluster diagnosis.
- **Dynamic Tool Selection:** The agent selects diagnostic tools based on the query context: a question about crashing pods triggers pod health and event checks, while a node issue triggers node status and resource analysis.
- **Severity Classification:** Automatically classifies incidents as critical, warning, or informational based on the collected evidence.
- **Offline and Live Modes:** Run with synthetic scenarios for development and testing, or connect to a real Kubernetes cluster via `kubectl` for production monitoring.

## Installation and Setup

If you have not already done so, install the NeMo Agent Toolkit following the [official documentation](https://docs.nvidia.com/nemo/agent-toolkit/latest/get-started/installation.html).

### Install This Workflow

From the root directory of the NeMo Agent Toolkit library:

```bash
uv pip install -e examples/k8s_infra_monitor
```

### Set Up API Keys

Export your NVIDIA API key:

```bash
export NVIDIA_API_KEY=<YOUR_API_KEY>
```

## Use Case Description

Kubernetes clusters generate a constant stream of operational signals: node conditions, pod status changes, events, and resource metrics. Triaging these signals manually is time-consuming, especially in clusters running dozens of workloads across multiple namespaces.

This example provides an agentic system that:

1. **Gathers node diagnostics**: Checks node readiness, conditions (MemoryPressure, DiskPressure, PIDPressure), and resource utilization via `kubectl top`.
2. **Scans pod health**: Identifies unhealthy pods (CrashLoopBackOff, OOMKilled, Pending, Evicted) and flags containers with high restart counts.
3. **Collects cluster events**: Retrieves recent Warning events and correlates them with affected resources.
4. **Analyzes resource pressure**: Detects nodes approaching CPU or memory thresholds and flags active pressure conditions.
5. **Classifies severity**: Uses an LLM to classify the overall incident severity based on collected evidence.
6. **Generates structured reports**: Produces markdown reports with findings, root cause analysis, and recommended remediation steps.
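
The pod health scan in step 2 can be sketched in a few lines of Python. This is a minimal illustration, not the example's actual tool code: it assumes the input is the parsed output of `kubectl get pods -o json`, and the function name `find_unhealthy_pods`, the `restart_threshold` default, and the `HighRestartCount` label are hypothetical.

```python
from typing import Any

# Container state reasons treated as unhealthy in this sketch (an assumption
# that mirrors the states named in the list above).
UNHEALTHY_REASONS = {"CrashLoopBackOff", "OOMKilled"}


def find_unhealthy_pods(pod_list: dict[str, Any], restart_threshold: int = 5) -> list[dict[str, Any]]:
    """Return one finding per unhealthy pod in a `kubectl get pods -o json` document."""
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        reasons = set()

        # Evicted pods carry status.reason "Evicted"; unscheduled pods stay in phase "Pending".
        if status.get("reason") == "Evicted":
            reasons.add("Evicted")
        if status.get("phase") == "Pending":
            reasons.add("Pending")

        restarts = 0
        for cs in status.get("containerStatuses", []):
            restarts += cs.get("restartCount", 0)
            # Inspect both the current state and the last terminated state.
            for state in (cs.get("state", {}), cs.get("lastState", {})):
                for detail in state.values():
                    if detail.get("reason") in UNHEALTHY_REASONS:
                        reasons.add(detail["reason"])
        if restarts >= restart_threshold:
            reasons.add("HighRestartCount")  # hypothetical label for this sketch

        if reasons:
            findings.append({
                "namespace": meta.get("namespace"),
                "name": meta.get("name"),
                "reasons": sorted(reasons),
                "restarts": restarts,
            })
    return findings
```

Healthy pods produce no finding, so an empty list corresponds to the `healthy-cluster` case.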

### Why Use an Agentic Design?

An agentic approach has three advantages over static dashboards or rule-based alerting:

- **Contextual investigation**: The agent decides which tools to call based on the query, rather than running every check every time.
- **Cross-signal correlation**: Unlike siloed monitoring tools, the agent correlates data from nodes, pods, events, and resources to identify root causes. For example, OOMKilled pods combined with a MemoryPressure condition point to memory exhaustion on a specific node.
- **Natural language reports**: Produces human-readable incident summaries that can be directly shared with team members or fed into ticketing systems.
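
The cross-signal correlation described above is performed by the LLM in this example, but the OOMKilled plus MemoryPressure case can be written as a plain rule to make the idea concrete. The sketch below is illustrative only; the function name and the input shapes (a mapping of node name to active conditions, and a list of OOMKilled pods with their node) are assumptions.

```python
def correlate_memory_exhaustion(
    node_conditions: dict[str, list[str]],
    oom_pods: list[dict[str, str]],
) -> list[str]:
    """Report nodes that show MemoryPressure and also host OOMKilled pods."""
    # Group OOMKilled pod names by the node they ran on.
    pods_by_node: dict[str, list[str]] = {}
    for pod in oom_pods:
        pods_by_node.setdefault(pod["node"], []).append(pod["name"])

    findings = []
    for node, conditions in sorted(node_conditions.items()):
        # Both signals must point at the same node before a root cause is suggested.
        if "MemoryPressure" in conditions and node in pods_by_node:
            names = ", ".join(sorted(pods_by_node[node]))
            findings.append(
                f"{node}: likely memory exhaustion (MemoryPressure with OOMKilled pods: {names})"
            )
    return findings
```

Neither signal alone yields a finding here, which is the point of correlating them: a single OOMKilled pod may only indicate a low memory limit on that container.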

## How It Works

### Diagnostic Tools

| Tool | Description | Live Mode Backend |
|------|-------------|-------------------|
| `node_status_check` | Retrieves node readiness, conditions, and resource utilization (`kubectl get nodes`, `kubectl top nodes`) | Uses `kubectl` |
| `pod_health_check` | Scans for unhealthy pods and high restart counts across namespaces | Uses `kubectl` |
| `event_collector` | Collects recent Warning events and correlates them with affected resources | Uses `kubectl` |
| `resource_pressure_check` | Analyzes CPU/memory utilization against configurable thresholds, checks for pressure conditions | Uses `kubectl` |
| `severity_classifier` | Classifies the final report's severity as critical, warning, or informational | LLM-based |

### Workflow

1. A cluster health query is received (natural language or JSON with scenario context).
2. The monitor agent selects relevant diagnostic tools based on the query.
3. Tools gather data (from `kubectl` in live mode, or from offline scenarios).
4. The agent correlates findings across all tool outputs.
5. A structured diagnostic report is generated.
6. The severity classifier appends an incident severity classification.
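
Step 1 accepts either plain text or a JSON object. One way to normalize both forms is sketched below; the function name `parse_monitor_input` is hypothetical, and the `scenario_id` and `query` keys are taken from the offline-mode commands in this README rather than from the workflow's source.

```python
import json
from typing import Optional


def parse_monitor_input(raw: str) -> dict[str, Optional[str]]:
    """Normalize workflow input into {"scenario_id": ..., "query": ...}."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        payload = None

    # A JSON object with a "query" key carries optional scenario context.
    if isinstance(payload, dict) and "query" in payload:
        return {"scenario_id": payload.get("scenario_id"), "query": str(payload["query"])}

    # Anything else is treated as a plain natural-language query.
    return {"scenario_id": None, "query": raw.strip()}
```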

### Understanding the Configuration

#### Functions

Each tool is configured in the `functions` section:

```yaml
functions:
  node_status_check:
    _type: node_status_check
    offline_mode: true
  resource_pressure_check:
    _type: resource_pressure_check
    offline_mode: true
    cpu_threshold_percent: 80
    memory_threshold_percent: 85
```

- `offline_mode`: When `true`, tools return pre-defined responses from the offline scenario dataset.
- `cpu_threshold_percent` / `memory_threshold_percent`: Configurable thresholds for resource pressure alerts.
- `kubeconfig_path`: Optional path to a kubeconfig file for live mode. Defaults to the standard `kubectl` config.
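
To show how the two thresholds might be applied, here is a minimal sketch that checks the text output of `kubectl top nodes --no-headers` (columns: NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%). It is not the tool's actual implementation; the function name and the choice to alert at or above the threshold are assumptions.

```python
def check_resource_pressure(
    top_nodes_output: str,
    cpu_threshold_percent: int = 80,
    memory_threshold_percent: int = 85,
) -> list[tuple[str, str, int]]:
    """Return (node, resource, percent) for every node at or above a threshold."""
    alerts = []
    for line in top_nodes_output.strip().splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        name, cpu_pct, mem_pct = fields[0], fields[2], fields[4]
        for resource, value, limit in (
            ("cpu", cpu_pct, cpu_threshold_percent),
            ("memory", mem_pct, memory_threshold_percent),
        ):
            if not value.endswith("%"):
                continue  # no usable metric for this node
            percent = int(value.rstrip("%"))
            if percent >= limit:
                alerts.append((name, resource, percent))
    return alerts
```

With the defaults from the configuration above, a node at 88% CPU and 90% memory yields two alerts, while a node at 45% CPU and 86% memory yields only a memory alert.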

#### Workflow

```yaml
workflow:
  _type: k8s_infra_monitor
  tool_names:
    - node_status_check
    - pod_health_check
    - event_collector
    - resource_pressure_check
  llm_name: monitor_agent_llm
  offline_mode: true
  offline_data_path: examples/k8s_infra_monitor/data/offline_scenarios.json
```

#### LLMs

The monitor agent and the severity classifier use NVIDIA NIM with `nvidia/nemotron-3-nano-30b-a3b` by default; the evaluation LLM in the offline configuration uses the same model. You can swap this for any supported model.

## Example Usage

### Running in Offline Mode

Offline mode uses predefined scenarios to simulate cluster issues without requiring a real Kubernetes cluster.

Three scenarios are included:
- **`node-not-ready`**: A worker node becomes unreachable, causing pod evictions.
- **`memory-pressure`**: Multiple pods are OOMKilled due to memory exhaustion on a worker node.
- **`healthy-cluster`**: Normal cluster operations with no issues.

```bash
# Investigate a node failure
nat run \
  --config_file=examples/k8s_infra_monitor/configs/config_offline_mode.yml \
  --input '{"scenario_id": "node-not-ready", "query": "Worker node worker-2 appears to be down. Investigate the cluster health."}'
```

```bash
# Investigate OOMKilled pods
nat run \
  --config_file=examples/k8s_infra_monitor/configs/config_offline_mode.yml \
  --input '{"scenario_id": "memory-pressure", "query": "Multiple pods are crashing in the ml-serving namespace. Check what is happening."}'
```

```bash
# Routine health check
nat run \
  --config_file=examples/k8s_infra_monitor/configs/config_offline_mode.yml \
  --input '{"scenario_id": "healthy-cluster", "query": "Run a routine health check on the Kubernetes cluster."}'
```

To evaluate the agent across all scenarios:

```bash
nat eval --config_file=examples/k8s_infra_monitor/configs/config_offline_mode.yml
```

### Running in Live Mode

Live mode connects to a real Kubernetes cluster using `kubectl`. Make sure the `KUBECONFIG` environment variable is set, or specify `kubeconfig_path` in each tool's configuration.

```bash
# Run a live cluster health check
nat run \
  --config_file=examples/k8s_infra_monitor/configs/config_live_mode.yml \
  --input "Check the overall health of the Kubernetes cluster. Are there any unhealthy pods or nodes under resource pressure?"
```

You can customize the live mode configuration to:
- Target specific namespaces with the `namespaces` list in `pod_health_check`.
- Adjust resource thresholds with `cpu_threshold_percent` and `memory_threshold_percent`.
- Point to a specific kubeconfig file with `kubeconfig_path`.
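
The `namespaces` and `kubeconfig_path` options map naturally onto standard `kubectl` flags (`--namespace`, `--all-namespaces`, `--kubeconfig`). The sketch below shows one way a live-mode tool could assemble and run such a command; the helper names are hypothetical and the example's real tools may build their commands differently.

```python
import json
import subprocess
from typing import Any, Optional


def build_get_pods_command(
    namespace: Optional[str] = None,
    kubeconfig_path: Optional[str] = None,
) -> list[str]:
    """Build the argument list for a `kubectl get pods -o json` call."""
    cmd = ["kubectl"]
    if kubeconfig_path:
        # Without this flag kubectl falls back to KUBECONFIG or its default config.
        cmd += ["--kubeconfig", kubeconfig_path]
    cmd += ["get", "pods", "-o", "json"]
    cmd += ["--namespace", namespace] if namespace else ["--all-namespaces"]
    return cmd


def run_kubectl_json(cmd: list[str], timeout_seconds: int = 30) -> dict[str, Any]:
    """Run a kubectl command and parse its JSON output (requires cluster access)."""
    result = subprocess.run(
        cmd, capture_output=True, text=True, check=True, timeout=timeout_seconds
    )
    return json.loads(result.stdout)
```

Passing the command as a list rather than a shell string avoids quoting problems when a namespace or path contains unusual characters.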

examples/k8s_infra_monitor/configs

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
src/nat_k8s_infra_monitor/configs

examples/k8s_infra_monitor/data

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
src/nat_k8s_infra_monitor/data
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

[build-system]
build-backend = "setuptools.build_meta"
requires = ["setuptools >= 64", "setuptools-scm>=8"]

[tool.setuptools_scm]
git_describe_command = "git describe --long --first-parent"
root = "../.."

[tool.setuptools.packages.find]
where = ["src"]

[project]
name = "nat_k8s_infra_monitor"
dynamic = ["version"]
requires-python = ">=3.11,<3.14"
description = "Kubernetes Infrastructure Monitor using NeMo Agent Toolkit"
dependencies = [
  "nvidia-nat[eval,langchain,profiler,test]~=1.5",
  "langchain-core",
  "langgraph>=0.0.10",
]
keywords = ["ai", "kubernetes", "monitoring", "agents"]
classifiers = ["Programming Language :: Python"]

[project.entry-points.'nat.components']
nat_k8s_infra_monitor = "nat_k8s_infra_monitor.register"

examples/k8s_infra_monitor/src/nat_k8s_infra_monitor/__init__.py

Whitespace-only changes.
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Live mode configuration: requires kubectl access to a Kubernetes cluster.
# Set the KUBECONFIG environment variable or specify kubeconfig_path in each tool.

functions:
  node_status_check:
    _type: node_status_check
    offline_mode: false
    # kubeconfig_path: /path/to/kubeconfig  # Optional: uses default kubectl config if omitted
  pod_health_check:
    _type: pod_health_check
    offline_mode: false
    # namespaces:  # Optional: check specific namespaces instead of all
    #   - default
    #   - monitoring
    #   - production
  event_collector:
    _type: event_collector
    offline_mode: false
    event_limit: 50
  resource_pressure_check:
    _type: resource_pressure_check
    offline_mode: false
    cpu_threshold_percent: 80
    memory_threshold_percent: 85
  severity_classifier:
    _type: severity_classifier
    llm_name: classifier_llm

workflow:
  _type: k8s_infra_monitor
  tool_names:
    - node_status_check
    - pod_health_check
    - event_collector
    - resource_pressure_check
  llm_name: monitor_agent_llm
  offline_mode: false

llms:
  monitor_agent_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    temperature: 0
    max_tokens: 16384

  classifier_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    temperature: 0
    max_tokens: 2048
Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


functions:
  node_status_check:
    _type: node_status_check
    offline_mode: true
  pod_health_check:
    _type: pod_health_check
    offline_mode: true
  event_collector:
    _type: event_collector
    offline_mode: true
    event_limit: 50
  resource_pressure_check:
    _type: resource_pressure_check
    offline_mode: true
    cpu_threshold_percent: 80
    memory_threshold_percent: 85
  severity_classifier:
    _type: severity_classifier
    llm_name: classifier_llm

workflow:
  _type: k8s_infra_monitor
  tool_names:
    - node_status_check
    - pod_health_check
    - event_collector
    - resource_pressure_check
  llm_name: monitor_agent_llm
  offline_mode: true
  offline_data_path: examples/k8s_infra_monitor/data/offline_scenarios.json

llms:
  monitor_agent_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    temperature: 0
    max_tokens: 16384

  classifier_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    temperature: 0
    max_tokens: 2048

  nim_rag_eval_llm:
    _type: nim
    model_name: nvidia/nemotron-3-nano-30b-a3b
    max_tokens: 8

eval:
  general:
    output_dir: .tmp/nat/examples/k8s_infra_monitor/output/
    dataset:
      _type: json
      file_path: examples/k8s_infra_monitor/data/eval_dataset.json
  evaluators:
    accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: nim_rag_eval_llm
    groundedness:
      _type: ragas
      metric: ResponseGroundedness
      llm_name: nim_rag_eval_llm
