Commit c721caf

Merge pull request #1 from DataBoySu/modularity
Modulize the files
2 parents d5291a9 + 4374854 commit c721caf

14 files changed

Lines changed: 2935 additions & 1371 deletions

README.md

Lines changed: 214 additions & 86 deletions
@@ -1,156 +1,284 @@
11
# Cluster Health Monitor
22

3-
A lightweight, real-time monitoring tool for NVIDIA GPUs. Track GPU utilization, memory, temperature, and power during ML training or any GPU workload.
3+
Real-time GPU and system monitoring with a web dashboard and a CLI. Includes GPU stress testing with auto-scaling workloads and per-GPU performance baselines.
44

5-
## System Requirements
5+
## Features
66

7-
### Hardware
7+
### Monitoring
8+
- Real-time GPU metrics (utilization, memory, temperature, power)
9+
- System metrics (CPU, memory, disk I/O)
10+
- Web dashboard with live charts
11+
- Terminal interface with auto-refresh
12+
- Historical data storage and alerting
813

9-
- NVIDIA GPU (GeForce, RTX, Quadro, Tesla, etc.)
14+
### GPU Benchmarking
15+
- GEMM (matrix multiplication) stress test
16+
- Particle simulation workload
17+
- Auto-scaling stress test (dynamically increases load to 98% GPU utilization)
18+
- Performance baseline tracking per GPU and benchmark type
19+
- Multiple test modes: Quick (15s), Standard (60s), Extended (180s), Stress Test, Custom
1020

11-
### Software
21+
## Requirements
1222

13-
- Windows 10/11 or Linux (Ubuntu 18.04+)
14-
- Python 3.8 or higher
15-
- NVIDIA Driver 450.0 or higher
23+
### Core Monitoring (Always Available)
24+
- Python 3.8+
25+
- NVIDIA GPU with drivers installed
26+
- `nvidia-smi` command available
1627

17-
### Verify Your Setup
28+
### GPU Benchmarking (Optional)
29+
- CUDA Toolkit 12.0+ or compatible
30+
- One of:
31+
- CuPy: `pip install cupy-cuda12x` (or appropriate CUDA version)
32+
- PyTorch: `pip install torch --index-url https://download.pytorch.org/whl/cu121`
1833

19-
Before installing, confirm your GPU is detected:
34+
## Installation
2035

36+
### 1. Clone Repository
2137
```bash
22-
nvidia-smi
38+
git clone https://github.com/DataBoySu/cluster-monitor.git
39+
cd cluster-monitor
2340
```
2441

25-
You should see your GPU listed with driver version. If this command fails, install NVIDIA drivers first.
42+
### 2. Create Virtual Environment
43+
```bash
44+
python -m venv .venv
45+
```
2646

27-
## Installation
47+
Activate:
48+
- Windows: `.venv\Scripts\activate`
49+
- Linux/Mac: `source .venv/bin/activate`
2850

29-
### Step 1: Clone the Repository
51+
### 3. Install Dependencies
3052

31-
```git
32-
git clone https://github.com/DataBoySu/cluster-monitor.git
33-
cd cluster-monitor
53+
**Basic Monitoring:**
54+
```bash
55+
pip install -r requirements.txt
3456
```
3557

36-
### Step 2: Create Virtual Environment
37-
38-
Windows:
39-
40-
```python
41-
python -m venv venv
42-
venv\Scripts\activate
58+
**With GPU Benchmarking (CuPy):**
59+
```bash
60+
pip install -r requirements.txt
61+
pip install cupy-cuda12x # Adjust for your CUDA version
4362
```
4463

45-
Linux/macOS:
64+
**With GPU Benchmarking (PyTorch):**
65+
```bash
66+
pip install -r requirements.txt
67+
pip install torch --index-url https://download.pytorch.org/whl/cu121
68+
```
4669

47-
```python
48-
python3 -m venv venv
49-
source venv/bin/activate
70+
### 4. Verify Installation
71+
```bash
72+
python health_monitor.py --help
5073
```
5174

52-
### Step 3: Install Dependencies
75+
## Usage
5376

54-
```python
55-
pip install -r requirements.txt
77+
### Web Dashboard (Recommended)
78+
```bash
79+
python health_monitor.py monitor --web
5680
```
5781

58-
### Step 4: Verify Installation
82+
Access at: http://localhost:8090
83+
84+
Features:
85+
- Real-time GPU/system metrics
86+
- Interactive benchmark controls
87+
- Live performance charts
88+
- Historical data visualization
5989

60-
```python
61-
python health_monitor.py --once
90+
### Terminal Dashboard
91+
```bash
92+
python health_monitor.py monitor
6293
```
6394

64-
This should print your GPU information once and exit.
95+
Displays live metrics in the terminal with auto-refresh.
6596

66-
## Usage
97+
### CLI Benchmark
98+
```bash
99+
# Quick 15-second test
100+
python health_monitor.py benchmark --mode quick
67101

68-
### CLI Dashboard (Terminal)
102+
# Standard 60-second test
103+
python health_monitor.py benchmark --mode standard
69104

70-
Live monitoring in your terminal with auto-refresh:
105+
# Stress test with auto-scaling (pushes GPU to 98% util)
106+
python health_monitor.py benchmark --mode stress-test --type particle
71107

72-
```python
73-
python health_monitor.py --cli
74-
```
108+
# Extended 180-second burn-in
109+
python health_monitor.py benchmark --mode extended
75110

76-
Press Ctrl+C to exit.
111+
# Custom configuration
112+
python health_monitor.py benchmark --mode custom --duration 120 --temp-limit 85
113+
```
77114

78-
### Single Snapshot
115+
## Benchmark Modes
79116

80-
Print GPU info once and exit:
117+
| Mode | Duration | Workload | Auto-Scale | Use Case |
118+
|------|----------|----------|------------|----------|
119+
| Quick | 15s | Fixed | No | Quick baseline check |
120+
| Standard | 60s | Fixed | No | Standard benchmark |
121+
| Extended | 180s | Fixed | No | Long-term stability |
122+
| Stress Test | 60s | Dynamic | Yes | Maximum GPU load testing |
123+
| Custom | Variable | Fixed | Optional | User-defined parameters |
81124

82-
```python
83-
python health_monitor.py --once
84-
```
125+
### Auto-Scaling Stress Test
85126

86-
### Web Dashboard (Optional)
127+
The Stress Test mode automatically increases workload intensity:
87128

88-
Start a web server with browser-based dashboard:
129+
1. Starts with baseline workload (2048x2048 GEMM or 100K particles)
130+
2. Every 2 seconds, checks GPU utilization
131+
3. Scales workload aggressively if GPU util < target:
132+
- `<70% util`: 2.0x scaling
133+
- `70-85% util`: 1.5x scaling
134+
- `85-93% util`: 1.2x scaling
135+
- `>93% util`: Target reached
136+
4. Continues scaling up to 15 times or until 98% GPU utilization achieved
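
The scaling rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the code in `runner.py`: the function names, the `read_util` callback, and the handling of values exactly on a threshold are assumptions, and the 2-second wait between checks is omitted so the logic can be tested without a GPU.

```python
from typing import Callable, List, Optional

MAX_SCALES = 15  # README: "up to 15 times"


def scale_factor(util_percent: float) -> Optional[float]:
    """Return the workload multiplier for a utilization reading.

    Returns None once utilization is high enough that no further
    scaling is needed. Boundary handling (70, 85, 93) is an assumption.
    """
    if util_percent < 70:
        return 2.0
    if util_percent < 85:
        return 1.5
    if util_percent < 93:
        return 1.2
    return None  # target reached


def auto_scale(
    initial_size: int,
    read_util: Callable[[int], float],
    max_scales: int = MAX_SCALES,
) -> List[int]:
    """Grow the workload until the target is reached or max_scales is hit.

    read_util is a hypothetical callback that runs the workload at the
    given size and returns the observed GPU utilization in percent.
    Returns every workload size that was tried, in order.
    """
    size = initial_size
    history = [size]
    for _ in range(max_scales):
        factor = scale_factor(read_util(size))
        if factor is None:
            break
        size = int(size * factor)
        history.append(size)
    return history
```

Passing the utilization reader as a callback keeps the loop independent of how utilization is actually measured, so it can be exercised with a fake reader.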
89137

90-
```python
91-
python health_monitor.py --web --port 8888
138+
Example progression:
139+
```
140+
100K particles → 200K → 400K → 800K → 1.2M → 1.8M → 2.2M → 2.6M (94% GPU util)
92141
```
93142

94-
Then open <http://localhost:8888> in your browser.
143+
## Benchmark Types
95144

96-
## What You See
145+
### GEMM (Matrix Multiplication)
146+
Dense matrix multiplication for maximum compute stress. Measures TFLOPS.
147+
148+
```bash
149+
python health_monitor.py benchmark --type gemm --mode stress-test
150+
```
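
For reference, a TFLOPS figure for a square GEMM is conventionally derived by counting `2 * n**3` floating-point operations per `n x n` multiply. The helper below is a minimal sketch of that arithmetic only; the function name and parameters are hypothetical, and the project's own measurement may differ.

```python
def gemm_tflops(n: int, seconds: float, iterations: int = 1) -> float:
    """Estimate TFLOPS for `iterations` multiplies of two n x n matrices.

    Uses the conventional count of 2 * n^3 floating-point operations
    (one multiply and one add per inner-product term).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    flops = 2.0 * n**3 * iterations
    return flops / seconds / 1e12
```

For example, a single 2048x2048 multiply that takes one second works out to roughly 0.017 TFLOPS.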
97151

98-
The monitor displays:
152+
### Particle Simulation
153+
Vectorized particle physics simulation with collision detection. Measures steps/second.
99154

100-
- GPU utilization (%)
101-
- Memory usage (used/total GB)
102-
- Temperature (C)
103-
- Power draw (W)
104-
- CPU and RAM usage (system)
155+
```bash
156+
python health_monitor.py benchmark --type particle --mode stress-test
157+
```
105158

106159
## Configuration
107160

108-
Edit `config.yaml` to customize:
161+
Edit `config.yaml`:
109162

110163
```yaml
111164
monitoring:
112-
interval_seconds: 5 # How often to refresh
165+
interval_seconds: 5
166+
history_retention_hours: 168
113167

114168
alerts:
115-
gpu_temperature_warn: 80 # Warn at 80C
116-
gpu_temperature_critical: 90 # Critical at 90C
169+
gpu_temperature_warn: 80
170+
gpu_temperature_critical: 90
171+
gpu_memory_usage_warn: 90
172+
173+
web:
174+
host: 0.0.0.0
175+
port: 8090
176+
177+
storage:
178+
path: ./metrics.db
117179
```
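
As an illustration of how the `alerts` thresholds might be applied, the sketch below classifies a temperature reading and a memory reading. It is not taken from the project: the class and function names are hypothetical, and it assumes that thresholds are inclusive and that `gpu_memory_usage_warn` is a percentage of total memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertThresholds:
    """Defaults mirror the `alerts` section of config.yaml shown above."""
    gpu_temperature_warn: float = 80.0      # degrees C
    gpu_temperature_critical: float = 90.0  # degrees C
    gpu_memory_usage_warn: float = 90.0     # assumed to be percent of total


def temperature_level(temp_c: float, t: AlertThresholds = AlertThresholds()) -> str:
    """Return "critical", "warn", or "ok" for a GPU temperature reading."""
    if temp_c >= t.gpu_temperature_critical:
        return "critical"
    if temp_c >= t.gpu_temperature_warn:
        return "warn"
    return "ok"


def memory_warning(used: float, total: float, t: AlertThresholds = AlertThresholds()) -> bool:
    """Return True when used/total memory reaches the warning percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total >= t.gpu_memory_usage_warn
```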
118180
119-
## Troubleshooting
181+
## Project Structure
120182
121-
### "No NVIDIA GPU detected"
183+
```
184+
cluster-health-monitor/
185+
├── monitor/
186+
│ ├── benchmark/
187+
│ │ ├── config.py # Benchmark configuration
188+
│ │ ├── storage.py # Baseline storage (SQLite)
189+
│ │ ├── workloads.py # GPU workloads (GEMM/Particle)
190+
│ │ └── runner.py # Benchmark orchestration
191+
│ ├── collectors/
192+
│ │ ├── gpu.py # GPU metrics via nvidia-smi
193+
│ │ ├── system.py # CPU, memory, disk
194+
│ │ └── network.py # Network info
195+
│ ├── storage/
196+
│ │ └── sqlite.py # Metrics persistence
197+
│ ├── api/
198+
│ │ ├── server.py # FastAPI web server
199+
│ │ └── templates/
200+
│ │ └── index.html # Web dashboard
201+
│ └── cli/
202+
│ └── benchmark_cli.py # CLI commands
203+
├── config.yaml # Configuration
204+
├── requirements.txt # Dependencies
205+
└── health_monitor.py # Main entry point
206+
```
207+
208+
## API Endpoints
209+
210+
When running the web server (`--web`), the following endpoints are available:
122211

123-
- Run `nvidia-smi` to verify driver is installed
124-
- Make sure you have a discrete NVIDIA GPU (not Intel/AMD integrated)
212+
- `GET /` - Web dashboard
213+
- `GET /api/status` - Current metrics
214+
- `GET /api/history` - Historical data
215+
- `POST /api/benchmark/start` - Start benchmark
216+
- `GET /api/benchmark/status` - Benchmark progress
217+
- `POST /api/benchmark/stop` - Stop benchmark
218+
- `GET /api/benchmark/results` - Get results
219+
- `GET /api/benchmark/baseline` - Get baseline for GPU
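
The endpoints above can be polled from a script using only the standard library. The sketch below assumes the server is reachable at `http://localhost:8090` and that responses are JSON; the README does not document the response schema, so the helper names are hypothetical and the result is returned unparsed beyond `json.loads`.

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:8090"  # default port from config.yaml


def endpoint_url(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path with exactly one slash."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_json(path: str, base_url: str = BASE_URL, timeout: float = 5.0):
    """GET an endpoint and decode the body as JSON (assumed format)."""
    with urlopen(endpoint_url(path, base_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example, with the web server running:
#   status = get_json("/api/status")
#   history = get_json("/api/history")
```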
125220

126-
### "pynvml not found" or "ModuleNotFoundError"
221+
## Troubleshooting
127222

128-
- Make sure virtual environment is activated
129-
- Run: `pip install pynvml`
223+
### "nvidia-smi not found"
224+
- Install NVIDIA drivers
225+
- Add nvidia-smi to PATH
226+
- Verify: `nvidia-smi` in terminal
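
The same check can be scripted. The sketch below looks for `nvidia-smi` on `PATH` and parses its CSV query output; it is an illustration rather than the project's `gpu.py`, and the function names and the choice of queried fields are assumptions. With `nounits`, memory is reported in MiB, temperature in degrees C, and power in watts.

```python
import shutil
import subprocess
from typing import Dict, List, Optional

QUERY_FIELDS = [
    "utilization.gpu",
    "memory.used",
    "memory.total",
    "temperature.gpu",
    "power.draw",
]


def nvidia_smi_available() -> bool:
    """True when the nvidia-smi executable can be found on PATH."""
    return shutil.which("nvidia-smi") is not None


def _to_float(value: str) -> Optional[float]:
    """Convert a field to float; unsupported fields such as "[N/A]" become None."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_line(line: str) -> Dict[str, Optional[float]]:
    """Parse one CSV line of nvidia-smi output into a field -> value mapping."""
    values = [v.strip() for v in line.split(",")]
    if len(values) != len(QUERY_FIELDS):
        raise ValueError(f"expected {len(QUERY_FIELDS)} fields, got {len(values)}")
    return {name: _to_float(v) for name, v in zip(QUERY_FIELDS, values)}


def query_gpus() -> List[Dict[str, Optional[float]]]:
    """Run nvidia-smi once and return one mapping per detected GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_gpu_line(l) for l in result.stdout.splitlines() if l.strip()]
```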
130227

131-
### "rich not found"
228+
### "No CUDA libraries found"
229+
Benchmarking features are disabled when no CUDA library is found. Install CuPy or PyTorch.
132230

133-
- Run: `pip install rich`
231+
### Web dashboard not loading data
232+
- Check terminal for errors
233+
- Verify port 8090 is available
234+
- Check firewall settings
235+
- Try: `http://127.0.0.1:8090`
134236

135-
### Web dashboard not loading
237+
### Benchmark not scaling GPU to 98%
238+
- Increase `max_scales` in `runner.py`
239+
- Check GPU has available memory
240+
- Verify no other GPU workloads running
241+
- Try different benchmark type (GEMM vs Particle)
136242

137-
- Install web dependencies: `pip install fastapi uvicorn`
138-
- Check if port 8080 is available
243+
## Performance Tips
139244

140-
### High CPU usage
245+
1. **Close other GPU applications** during benchmarking
246+
2. **Ensure adequate cooling** for stress tests
247+
3. **Monitor temperatures**: tests stop when the temperature limit is reached
248+
4. **Use Stress Test mode** to find maximum GPU performance
249+
5. **Run Extended mode** for stability validation
141250

142-
- Increase refresh interval in config.yaml
251+
## Development
143252

144-
## Dependencies
253+
### Run Tests
254+
```bash
255+
pytest tests/
256+
```
145257

146-
- pynvml - NVIDIA GPU metrics
147-
- psutil - System metrics (CPU, RAM, disk)
148-
- pyyaml - Configuration file parsing
149-
- click - Command line interface
150-
- rich - Terminal UI
151-
- fastapi - REST API
152-
- uvicorn - Web server
258+
### Code Structure
259+
- Modular design: config, storage, workloads, runner separated
260+
- Clean API exports via `__init__.py`
261+
- Type hints throughout
262+
- Comprehensive error handling
263+
264+
### Contributing
265+
1. Fork repository
266+
2. Create feature branch
267+
3. Add tests for new features
268+
4. Submit pull request
153269

154270
## License
155271

156-
MIT License
272+
MIT License - See LICENSE file
273+
274+
## Acknowledgments
275+
276+
- Built with FastAPI, Rich, Chart.js
277+
- GPU compute via CuPy and PyTorch
278+
- Inspired by nvidia-smi and GPU monitoring tools
279+
280+
## Support
281+
282+
- Issues: GitHub Issues
283+
- Documentation: This README
284+
- CUDA setup: https://developer.nvidia.com/cuda-downloads
