|
1 | 1 | # Cluster Health Monitor |
2 | 2 |
|
3 | | -A lightweight, real-time monitoring tool for NVIDIA GPUs. Track GPU utilization, memory, temperature, and power during ML training or any GPU workload. |
| 3 | +Real-time GPU and system monitoring with a web dashboard and a CLI. Includes GPU stress testing with auto-scaling workloads and per-GPU performance baselines. |
4 | 4 |
|
5 | | -## System Requirements |
| 5 | +## Features |
6 | 6 |
|
7 | | -### Hardware |
| 7 | +### Monitoring |
| 8 | +- Real-time GPU metrics (utilization, memory, temperature, power) |
| 9 | +- System metrics (CPU, memory, disk I/O) |
| 10 | +- Web dashboard with live charts |
| 11 | +- Terminal interface with auto-refresh |
| 12 | +- Historical data storage and alerting |
8 | 13 |
|
9 | | -- NVIDIA GPU (GeForce, RTX, Quadro, Tesla, etc.) |
| 14 | +### GPU Benchmarking |
| 15 | +- GEMM (matrix multiplication) stress test |
| 16 | +- Particle simulation workload |
| 17 | +- Auto-scaling stress test (dynamically increases load to 98% GPU utilization) |
| 18 | +- Performance baseline tracking per GPU and benchmark type |
| 19 | +- Multiple test modes: Quick (15s), Standard (60s), Extended (180s), Stress Test, Custom |
10 | 20 |
|
11 | | -### Software |
| 21 | +## Requirements |
12 | 22 |
|
13 | | -- Windows 10/11 or Linux (Ubuntu 18.04+) |
14 | | -- Python 3.8 or higher |
15 | | -- NVIDIA Driver 450.0 or higher |
| 23 | +### Core Monitoring (Always Available) |
| 24 | +- Python 3.8+ |
| 25 | +- NVIDIA GPU with drivers installed |
| 26 | +- `nvidia-smi` command available |
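As a rough illustration of why `nvidia-smi` is the only hard requirement for core monitoring, the sketch below checks that the command is on `PATH` and reads the four GPU metrics through its CSV query interface. The helper names (`parse_row`, `query_gpus`) are hypothetical and are not taken from this repository; only the `nvidia-smi` flags are real.

```python
import shutil
import subprocess
from typing import Dict, List, Optional

# Standard nvidia-smi query fields for the metrics listed under Monitoring.
FIELDS = ["utilization.gpu", "memory.used", "memory.total", "temperature.gpu", "power.draw"]


def parse_row(line: str) -> Dict[str, Optional[float]]:
    """Parse one CSV row; unsupported values such as "[N/A]" become None."""
    row: Dict[str, Optional[float]] = {}
    for name, raw in zip(FIELDS, (v.strip() for v in line.split(","))):
        try:
            row[name] = float(raw)
        except ValueError:
            row[name] = None
    return row


def query_gpus() -> List[Dict[str, Optional[float]]]:
    """Return one metrics dict per GPU, or raise if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        raise RuntimeError("nvidia-smi not found on PATH")
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_row(line) for line in result.stdout.splitlines() if line.strip()]
```

Calling `query_gpus()` on a machine with a working driver should return one dictionary per GPU; on a machine without the driver it raises before any subprocess is started.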
16 | 27 |
|
17 | | -### Verify Your Setup |
| 28 | +### GPU Benchmarking (Optional) |
| 29 | +- CUDA Toolkit 12.0+ or compatible |
| 30 | +- One of: |
| 31 | + - CuPy: `pip install cupy-cuda12x` (or appropriate CUDA version) |
| 32 | + - PyTorch: `pip install torch --index-url https://download.pytorch.org/whl/cu121` |
18 | 33 |
|
19 | | -Before installing, confirm your GPU is detected: |
| 34 | +## Installation |
20 | 35 |
|
| 36 | +### 1. Clone Repository |
21 | 37 | ```bash |
22 | | -nvidia-smi |
| 38 | +git clone https://github.com/DataBoySu/cluster-monitor.git |
| 39 | +cd cluster-monitor |
23 | 40 | ``` |
24 | 41 |
|
25 | | -You should see your GPU listed with driver version. If this command fails, install NVIDIA drivers first. |
| 42 | +### 2. Create Virtual Environment |
| 43 | +```bash |
| 44 | +python -m venv .venv |
| 45 | +``` |
26 | 46 |
|
27 | | -## Installation |
| 47 | +Activate: |
| 48 | +- Windows: `.venv\Scripts\activate` |
| 49 | +- Linux/Mac: `source .venv/bin/activate` |
28 | 50 |
|
29 | | -### Step 1: Clone the Repository |
| 51 | +### 3. Install Dependencies |
30 | 52 |
|
31 | | -```git |
32 | | -git clone https://github.com/DataBoySu/cluster-monitor.git |
33 | | -cd cluster-monitor |
| 53 | +**Basic Monitoring:** |
| 54 | +```bash |
| 55 | +pip install -r requirements.txt |
34 | 56 | ``` |
35 | 57 |
|
36 | | -### Step 2: Create Virtual Environment |
37 | | - |
38 | | -Windows: |
39 | | - |
40 | | -```python |
41 | | -python -m venv venv |
42 | | -venv\Scripts\activate |
| 58 | +**With GPU Benchmarking (CuPy):** |
| 59 | +```bash |
| 60 | +pip install -r requirements.txt |
| 61 | +pip install cupy-cuda12x # Adjust for your CUDA version |
43 | 62 | ``` |
44 | 63 |
|
45 | | -Linux/macOS: |
| 64 | +**With GPU Benchmarking (PyTorch):** |
| 65 | +```bash |
| 66 | +pip install -r requirements.txt |
| 67 | +pip install torch --index-url https://download.pytorch.org/whl/cu121 |
| 68 | +``` |
46 | 69 |
|
47 | | -```python |
48 | | -python3 -m venv venv |
49 | | -source venv/bin/activate |
| 70 | +### 4. Verify Installation |
| 71 | +```bash |
| 72 | +python health_monitor.py --help |
50 | 73 | ``` |
51 | 74 |
|
52 | | -### Step 3: Install Dependencies |
| 75 | +## Usage |
53 | 76 |
|
54 | | -```python |
55 | | -pip install -r requirements.txt |
| 77 | +### Web Dashboard (Recommended) |
| 78 | +```bash |
| 79 | +python health_monitor.py monitor --web |
56 | 80 | ``` |
57 | 81 |
|
58 | | -### Step 4: Verify Installation |
| 82 | +Access at: http://localhost:8090 |
| 83 | + |
| 84 | +Features: |
| 85 | +- Real-time GPU/system metrics |
| 86 | +- Interactive benchmark controls |
| 87 | +- Live performance charts |
| 88 | +- Historical data visualization |
59 | 89 |
|
60 | | -```python |
61 | | -python health_monitor.py --once |
| 90 | +### Terminal Dashboard |
| 91 | +```bash |
| 92 | +python health_monitor.py monitor |
62 | 93 | ``` |
63 | 94 |
|
64 | | -This should print your GPU information once and exit. |
| 95 | +Displays live metrics in the terminal with auto-refresh. |
65 | 96 |
|
66 | | -## Usage |
| 97 | +### CLI Benchmark |
| 98 | +```bash |
| 99 | +# Quick 15-second test |
| 100 | +python health_monitor.py benchmark --mode quick |
67 | 101 |
|
68 | | -### CLI Dashboard (Terminal) |
| 102 | +# Standard 60-second test |
| 103 | +python health_monitor.py benchmark --mode standard |
69 | 104 |
|
70 | | -Live monitoring in your terminal with auto-refresh: |
| 105 | +# Stress test with auto-scaling (pushes GPU to 98% util) |
| 106 | +python health_monitor.py benchmark --mode stress-test --type particle |
71 | 107 |
|
72 | | -```python |
73 | | -python health_monitor.py --cli |
74 | | -``` |
| 108 | +# Extended 180-second burn-in |
| 109 | +python health_monitor.py benchmark --mode extended |
75 | 110 |
|
76 | | -Press Ctrl+C to exit. |
| 111 | +# Custom configuration |
| 112 | +python health_monitor.py benchmark --mode custom --duration 120 --temp-limit 85 |
| 113 | +``` |
77 | 114 |
|
78 | | -### Single Snapshot |
| 115 | +## Benchmark Modes |
79 | 116 |
|
80 | | -Print GPU info once and exit: |
| 117 | +| Mode | Duration | Workload | Auto-Scale | Use Case | |
| 118 | +|------|----------|----------|------------|----------| |
| 119 | +| Quick | 15s | Fixed | No | Quick baseline check | |
| 120 | +| Standard | 60s | Fixed | No | Standard benchmark | |
| 121 | +| Extended | 180s | Fixed | No | Long-term stability | |
| 122 | +| Stress Test | 60s | Dynamic | Yes | Maximum GPU load testing | |
| 123 | +| Custom | Variable | Fixed | Optional | User-defined parameters | |
81 | 124 |
|
82 | | -```python |
83 | | -python health_monitor.py --once |
84 | | -``` |
| 125 | +### Auto-Scaling Stress Test |
85 | 126 |
|
86 | | -### Web Dashboard (Optional) |
| 127 | +The Stress Test mode automatically increases workload intensity: |
87 | 128 |
|
88 | | -Start a web server with browser-based dashboard: |
| 129 | +1. Starts with baseline workload (2048x2048 GEMM or 100K particles) |
| 130 | +2. Every 2 seconds, checks GPU utilization |
| 131 | +3. Scales workload aggressively if GPU util < target: |
| 132 | + - `<70% util`: 2.0x scaling |
| 133 | + - `70-85% util`: 1.5x scaling |
| 134 | + - `85-93% util`: 1.2x scaling |
| 135 | + - `>93% util`: Target reached |
| 136 | +4. Continues scaling up to 15 times or until 98% GPU utilization achieved |
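The scaling steps above can be sketched as a small loop. This is a minimal illustration, not the code in `runner.py`: the function names, the `read_util` callback, and the handling of the exact boundary values (70, 85, 93) are assumptions, and the loop stops at the ">93%" band described in step 3.

```python
import time
from typing import Callable, List, Optional


def scale_factor(util: float) -> Optional[float]:
    """Map a GPU utilization reading (%) to a workload multiplier.

    Returns None once utilization is above 93%, i.e. the target is reached.
    Boundary handling (e.g. exactly 85%) is an assumption.
    """
    if util < 70:
        return 2.0
    if util < 85:
        return 1.5
    if util <= 93:
        return 1.2
    return None


def autoscale(
    read_util: Callable[[int], float],
    start_size: int = 100_000,   # baseline particle count from step 1
    max_scales: int = 15,        # scaling limit from step 4
    interval_s: float = 2.0,     # check period from step 2
) -> List[int]:
    """Grow the workload until the target is reached; return every size tried."""
    size = start_size
    sizes = [size]
    for _ in range(max_scales):
        time.sleep(interval_s)
        factor = scale_factor(read_util(size))
        if factor is None:
            break
        size = int(size * factor)
        sizes.append(size)
    return sizes
```

Here `read_util` stands in for whatever reports utilization at the current workload size; passing a fake function and `interval_s=0` makes the loop easy to exercise without a GPU.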
89 | 137 |
|
90 | | -```python |
91 | | -python health_monitor.py --web --port 8888 |
| 138 | +Example progression: |
| 139 | +``` |
| 140 | +100K particles → 200K → 400K → 800K → 1.2M → 1.8M → 2.2M → 2.6M (94% GPU util) |
92 | 141 | ``` |
93 | 142 |
|
94 | | -Then open <http://localhost:8888> in your browser. |
| 143 | +## Benchmark Types |
95 | 144 |
|
96 | | -## What You See |
| 145 | +### GEMM (Matrix Multiplication) |
| 146 | +Dense matrix multiplication for maximum compute stress. Measures TFLOPS. |
| 147 | + |
| 148 | +```bash |
| 149 | +python health_monitor.py benchmark --type gemm --mode stress-test |
| 150 | +``` |
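For reference, a TFLOPS figure for dense GEMM is conventionally derived from the operation count: multiplying two N×N matrices takes about 2·N³ floating-point operations. The helper below is a hypothetical sketch of that arithmetic, not the benchmark's own reporting code.

```python
def gemm_tflops(n: int, iterations: int, seconds: float) -> float:
    """Estimate throughput for `iterations` N x N matrix multiplications.

    Uses the conventional 2 * N^3 operation count per multiplication
    (N^3 multiplies plus N^3 adds).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2 * n**3 * iterations
    return flops / seconds / 1e12
```

For example, 100 multiplications of 2048×2048 matrices completed in one second correspond to roughly 1.72 TFLOPS.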
97 | 151 |
|
98 | | -The monitor displays: |
| 152 | +### Particle Simulation |
| 153 | +Vectorized particle physics simulation with collision detection. Measures steps/second. |
99 | 154 |
|
100 | | -- GPU utilization (%) |
101 | | -- Memory usage (used/total GB) |
102 | | -- Temperature (C) |
103 | | -- Power draw (W) |
104 | | -- CPU and RAM usage (system) |
| 155 | +```bash |
| 156 | +python health_monitor.py benchmark --type particle --mode stress-test |
| 157 | +``` |
105 | 158 |
|
106 | 159 | ## Configuration |
107 | 160 |
|
108 | | -Edit `config.yaml` to customize: |
| 161 | +Edit `config.yaml`: |
109 | 162 |
|
110 | 163 | ```yaml |
111 | 164 | monitoring: |
112 | | - interval_seconds: 5 # How often to refresh |
| 165 | + interval_seconds: 5 |
| 166 | + history_retention_hours: 168 |
113 | 167 |
|
114 | 168 | alerts: |
115 | | - gpu_temperature_warn: 80 # Warn at 80C |
116 | | - gpu_temperature_critical: 90 # Critical at 90C |
| 169 | + gpu_temperature_warn: 80 |
| 170 | + gpu_temperature_critical: 90 |
| 171 | + gpu_memory_usage_warn: 90 |
| 172 | + |
| 173 | +web: |
| 174 | + host: 0.0.0.0 |
| 175 | + port: 8090 |
| 176 | + |
| 177 | +storage: |
| 178 | + path: ./metrics.db |
117 | 179 | ``` |
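One plausible reading of the `alerts` thresholds is sketched below. The dictionary mirrors the YAML values above; the function name and the choice of "greater than or equal" comparisons are assumptions, so check the collector code for the actual rule.

```python
# Values copied from the alerts section of config.yaml shown above.
ALERTS = {
    "gpu_temperature_warn": 80,
    "gpu_temperature_critical": 90,
    "gpu_memory_usage_warn": 90,
}


def classify_temperature(temp_c: float, alerts: dict = ALERTS) -> str:
    """Return "critical", "warn", or "ok" for a GPU temperature in Celsius."""
    if temp_c >= alerts["gpu_temperature_critical"]:
        return "critical"
    if temp_c >= alerts["gpu_temperature_warn"]:
        return "warn"
    return "ok"
```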
118 | 180 |
|
119 | | -## Troubleshooting |
| 181 | +## Project Structure |
120 | 182 |
|
121 | | -### "No NVIDIA GPU detected" |
| 183 | +``` |
| 184 | +cluster-health-monitor/ |
| 185 | +├── monitor/ |
| 186 | +│ ├── benchmark/ |
| 187 | +│ │ ├── config.py # Benchmark configuration |
| 188 | +│ │ ├── storage.py # Baseline storage (SQLite) |
| 189 | +│ │ ├── workloads.py # GPU workloads (GEMM/Particle) |
| 190 | +│ │ └── runner.py # Benchmark orchestration |
| 191 | +│ ├── collectors/ |
| 192 | +│ │ ├── gpu.py # GPU metrics via nvidia-smi |
| 193 | +│ │ ├── system.py # CPU, memory, disk |
| 194 | +│ │ └── network.py # Network info |
| 195 | +│ ├── storage/ |
| 196 | +│ │ └── sqlite.py # Metrics persistence |
| 197 | +│ ├── api/ |
| 198 | +│ │ ├── server.py # FastAPI web server |
| 199 | +│ │ └── templates/ |
| 200 | +│ │ └── index.html # Web dashboard |
| 201 | +│ └── cli/ |
| 202 | +│ └── benchmark_cli.py # CLI commands |
| 203 | +├── config.yaml # Configuration |
| 204 | +├── requirements.txt # Dependencies |
| 205 | +└── health_monitor.py # Main entry point |
| 206 | +``` |
| 207 | + |
| 208 | +## API Endpoints |
| 209 | + |
| 210 | +When running the web server (`--web`): |
122 | 211 |
|
123 | | -- Run `nvidia-smi` to verify driver is installed |
124 | | -- Make sure you have a discrete NVIDIA GPU (not Intel/AMD integrated) |
| 212 | +- `GET /` - Web dashboard |
| 213 | +- `GET /api/status` - Current metrics |
| 214 | +- `GET /api/history` - Historical data |
| 215 | +- `POST /api/benchmark/start` - Start benchmark |
| 216 | +- `GET /api/benchmark/status` - Benchmark progress |
| 217 | +- `POST /api/benchmark/stop` - Stop benchmark |
| 218 | +- `GET /api/benchmark/results` - Get results |
| 219 | +- `GET /api/benchmark/baseline` - Get baseline for GPU |
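A minimal client for the read-only endpoints might look like the sketch below. It assumes the server is running on the default port 8090 and that the `GET /api/...` endpoints return JSON; the response fields and the request bodies for the `POST` endpoints are not documented here, so they are left out. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8090"  # default from config.yaml


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an API path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_json(path: str, base_url: str = BASE_URL, timeout: float = 5.0):
    """Fetch a GET endpoint and decode the body as JSON (assumed format)."""
    with urlopen(endpoint(path, base_url), timeout=timeout) as resp:
        return json.load(resp)
```

With the dashboard running, `get_json("/api/status")` would fetch the current metrics and `get_json("/api/benchmark/status")` the benchmark progress.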
125 | 220 |
|
126 | | -### "pynvml not found" or "ModuleNotFoundError" |
| 221 | +## Troubleshooting |
127 | 222 |
|
128 | | -- Make sure virtual environment is activated |
129 | | -- Run: `pip install pynvml` |
| 223 | +### "nvidia-smi not found" |
| 224 | +- Install NVIDIA drivers |
| 225 | +- Add nvidia-smi to PATH |
| 226 | +- Verify: `nvidia-smi` in terminal |
130 | 227 |
|
131 | | -### "rich not found" |
| 228 | +### "No CUDA libraries found" |
| 229 | +Benchmarking features are disabled when no CUDA library is found. Install CuPy or PyTorch. |
132 | 230 |
|
133 | | -- Run: `pip install rich` |
| 231 | +### Web dashboard not loading data |
| 232 | +- Check terminal for errors |
| 233 | +- Verify port 8090 is available |
| 234 | +- Check firewall settings |
| 235 | +- Try: `http://127.0.0.1:8090` |
134 | 236 |
|
135 | | -### Web dashboard not loading |
| 237 | +### Benchmark not scaling GPU to 98% |
| 238 | +- Increase `max_scales` in `monitor/benchmark/runner.py` |
| 239 | +- Check GPU has available memory |
| 240 | +- Verify no other GPU workloads running |
| 241 | +- Try different benchmark type (GEMM vs Particle) |
136 | 242 |
|
137 | | -- Install web dependencies: `pip install fastapi uvicorn` |
138 | | -- Check if port 8080 is available |
| 243 | +## Performance Tips |
139 | 244 |
|
140 | | -### High CPU usage |
| 245 | +1. **Close other GPU applications** during benchmarking |
| 246 | +2. **Ensure adequate cooling** for stress tests |
| 247 | +3. **Monitor temperatures**: tests stop when the temperature limit is reached |
| 248 | +4. **Use Stress Test mode** to find maximum GPU performance |
| 249 | +5. **Run Extended mode** for stability validation |
141 | 250 |
|
142 | | -- Increase refresh interval in config.yaml |
| 251 | +## Development |
143 | 252 |
|
144 | | -## Dependencies |
| 253 | +### Run Tests |
| 254 | +```bash |
| 255 | +pytest tests/ |
| 256 | +``` |
145 | 257 |
|
146 | | -- pynvml - NVIDIA GPU metrics |
147 | | -- psutil - System metrics (CPU, RAM, disk) |
148 | | -- pyyaml - Configuration file parsing |
149 | | -- click - Command line interface |
150 | | -- rich - Terminal UI |
151 | | -- fastapi - REST API |
152 | | -- uvicorn - Web server |
| 258 | +### Code Structure |
| 259 | +- Modular design: config, storage, workloads, runner separated |
| 260 | +- Clean API exports via `__init__.py` |
| 261 | +- Type hints throughout |
| 262 | +- Comprehensive error handling |
| 263 | + |
| 264 | +### Contributing |
| 265 | +1. Fork repository |
| 266 | +2. Create feature branch |
| 267 | +3. Add tests for new features |
| 268 | +4. Submit pull request |
153 | 269 |
|
154 | 270 | ## License |
155 | 271 |
|
156 | | -MIT License |
| 272 | +MIT License - See LICENSE file |
| 273 | + |
| 274 | +## Acknowledgments |
| 275 | + |
| 276 | +- Built with FastAPI, Rich, Chart.js |
| 277 | +- GPU compute via CuPy and PyTorch |
| 278 | +- Inspired by nvidia-smi and GPU monitoring tools |
| 279 | + |
| 280 | +## Support |
| 281 | + |
| 282 | +- Issues: GitHub Issues |
| 283 | +- Documentation: This README |
| 284 | +- CUDA setup: https://developer.nvidia.com/cuda-downloads |