|
| 1 | +# Agent Documentation (AGENTS.md) |
| 2 | + |
| 3 | +This document provides a technical overview of the **MyGPU** repository to assist LLM agents and developers in understanding the codebase structure, data flow, and implementation details. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## 🏗 Repository Structure |
| 8 | + |
| 9 | +### Core Package: `monitor/` |
| 10 | +The heart of the application, organized by responsibility: |
| 11 | + |
| 12 | +- **`monitor/api/`**: The FastAPI-based web server. |
| 13 | + - `server.py`: Main API definition, WebSocket handling, and routing. |
| 14 | + - `templates/` & `static/`: Frontend assets (Vanilla JS, CSS, HTML). |
| 15 | +- **`monitor/collectors/`**: Data acquisition layer. |
| 16 | + - `gpu.py`: NVIDIA/Apple Silicon GPU metric collection. |
| 17 | + - `system.py`: CPU, RAM, Disk, and Hostname info (via `psutil`). |
| 18 | + - `network.py`: Network interface statistics. |
| 19 | +- **`monitor/benchmark/`**: Stress-testing and physics workloads. |
| 20 | + - `runner.py`: Orchestrates benchmark execution. |
| 21 | + - `physics_torch.py` / `gpu_setup.py`: PyTorch-based particle physics engine. |
| 22 | + - `workloads.py`: GEMM and other computational stress tests. |
| 23 | +- **`monitor/storage/`**: Persistence layer. |
| 24 | + - `sqlite.py`: Manages the `metrics.db` SQLite database using a unified connector. |
| 25 | +- **`monitor/alerting/`**: Alert engine and notifications. |
| 26 | + - `rules.py`: Threshold evaluation logic. |
| 27 | + - `toaster.py`: Cross-platform system notifications (Windows, Linux, macOS). |
| 28 | +- **`monitor/utils/`**: Helper utilities. |
| 29 | + - `features.py`: Capability detection (CUDA, CuPy, PyTorch, Platform). |
| 30 | + |
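A minimal sketch of the kind of capability detection that `features.py` is described as performing. The function name `detect_features` and the returned keys are assumptions for illustration, not the repository's actual API; only standard-library calls are used, and `pynvml` is the module name installed by `nvidia-ml-py`.

```python
import importlib.util
import platform


def has_module(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for half-initialized or malformed modules.
        return False


def detect_features() -> dict:
    """Hypothetical capability report; key names are illustrative only."""
    system = platform.system()  # "Windows", "Linux", or "Darwin"
    return {
        "platform": system,
        "apple_silicon": system == "Darwin" and platform.machine() == "arm64",
        "torch": has_module("torch"),
        "cupy": has_module("cupy"),
        "nvml": has_module("pynvml"),
    }


features = detect_features()
print(features)
```

Checking with `find_spec` avoids importing heavy packages such as PyTorch just to learn whether they are installed, which keeps startup fast on machines without a GPU stack.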
| 31 | +### External Entry Points |
| 32 | +- **`health_monitor.py`**: The primary CLI entry point. Uses `click` for commands (`web`, `cli`, `benchmark`, `refresh`). |
| 33 | +- **`setup.ps1` / `setup.sh`**: Cross-platform environment installers (both use `uv`). |
| 34 | + |
| 35 | +--- |
| 36 | + |
| 37 | +## 🔄 Data Flow & Connectivity |
| 38 | + |
| 39 | +1. **Collection**: `health_monitor.py` starts a background thread or process that periodically runs the collectors in `monitor/collectors/`. |
| 40 | +2. **Storage**: Collected metrics are passed to `monitor.storage.sqlite` and appended to the `metrics.db` file. |
| 41 | +3. **API Service**: `monitor.api.server` reads live data from memory (cached state) and historical data from the SQLite database. |
| 42 | +4. **Frontend**: The web dashboard polls the `/api/status` endpoint for live updates and uses WebSockets (`/ws/simulation`) for real-time benchmark visualization. |
| 43 | +5. **Alerting**: The `AlertEngine` evaluates every new metric sample against rules defined in `config.yaml`. If a threshold is hit, it triggers `toaster.py`. |
| 44 | + |
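The collect, store, and alert steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's implementation: the `samples` table, the metric names, the fixed placeholder readings, and the `rules` mapping (standing in for values loaded from `config.yaml`) are all hypothetical.

```python
import sqlite3
import time


def collect_sample() -> dict:
    """Stand-in for the real collectors; returns fixed placeholder readings."""
    return {"ts": time.time(), "gpu_temp_c": 71.0, "cpu_percent": 12.5}


def store_sample(conn: sqlite3.Connection, sample: dict) -> None:
    """Append one sample to a hypothetical `samples` table."""
    conn.execute(
        "INSERT INTO samples (ts, gpu_temp_c, cpu_percent) VALUES (?, ?, ?)",
        (sample["ts"], sample["gpu_temp_c"], sample["cpu_percent"]),
    )
    conn.commit()


def evaluate_rules(sample: dict, rules: dict) -> list:
    """Return the metric names whose value meets or exceeds its threshold."""
    return [name for name, limit in rules.items() if sample.get(name, 0) >= limit]


def run_once(conn: sqlite3.Connection, rules: dict, latest: dict) -> list:
    """One iteration: collect, persist, refresh cached state, check alerts."""
    sample = collect_sample()
    store_sample(conn, sample)
    latest.update(sample)  # in-memory state that a live-status endpoint could read
    return evaluate_rules(sample, rules)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (ts REAL, gpu_temp_c REAL, cpu_percent REAL)")
latest = {}
rules = {"gpu_temp_c": 70.0, "cpu_percent": 90.0}  # assumed thresholds
alerts = run_once(conn, rules, latest)
print(alerts)
```

In a real loop, `run_once` would be called on a timer from the background thread, and a non-empty `alerts` list would be handed to the notification layer.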
| 45 | +--- |
| 46 | + |
| 47 | +## 🛠 Technology Stack |
| 48 | + |
| 49 | +- **Backend**: Python 3.10+, FastAPI (Web Server), Click (CLI). |
| 50 | +- **Frontend**: Vanilla JS (Dynamic UI), Chart.js (History graphs). |
| 51 | +- **GPU Computing**: |
| 52 | + - NVIDIA: `nvidia-ml-py` (NVML) for metrics, `CuPy` or `PyTorch` for benchmarks. |
| 53 | + - Apple Silicon: `psutil` and native commands for basic metrics. |
| 54 | +- **Environment**: `uv` is the preferred package manager for virtual environments. |
| 55 | + |
| 56 | +--- |
| 57 | + |
| 58 | +## 🤖 LLM Implementation Principles |
| 59 | + |
| 60 | +When modifying this repository, please adhere to these guidelines: |
| 61 | + |
| 62 | +1. **Cross-Platform First**: Always consider Windows, Linux, and macOS. Use `platform.system()` and provide fallbacks. |
| 63 | +2. **Modular Collectors**: If adding a new metric, create a new file in `monitor/collectors/` and register it in the main loop within `health_monitor.py`. |
| 64 | +3. **Non-Blocking API**: API endpoints and WebSockets must remain non-blocking. Use `asyncio` for I/O and `threading` for compute-heavy benchmarks. |
| 65 | +4. **Graceful Degradation**: Ensure the dashboard works even if no GPU is detected (fall back to CPU metrics). |
| 66 | +5. **Database Integrity**: Use the existing `SQLiteManager` in `monitor/storage/sqlite.py` to ensure thread-safe database access. |
| 67 | + |
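Principles 3 and 5 can be illustrated together. The sketch below is an assumption-laden example, not the project's code: `LockedSQLite` is a hypothetical stand-in for `SQLiteManager` (whose real interface is not shown here), and `heavy_benchmark` is a placeholder workload. It uses only the standard library, with `asyncio.to_thread` moving blocking calls off the event loop.

```python
import asyncio
import sqlite3
import threading


class LockedSQLite:
    """Hypothetical stand-in for SQLiteManager: one connection guarded by a lock."""

    def __init__(self, path: str = ":memory:") -> None:
        # check_same_thread=False lets worker threads reuse the connection;
        # the lock below serializes every access to it.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        with self._lock:
            self._conn.execute("CREATE TABLE IF NOT EXISTS results (score REAL)")
            self._conn.commit()

    def add_result(self, score: float) -> None:
        with self._lock:
            self._conn.execute("INSERT INTO results (score) VALUES (?)", (score,))
            self._conn.commit()

    def count(self) -> int:
        with self._lock:
            return self._conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]


def heavy_benchmark(n: int) -> float:
    """Placeholder for a compute-heavy workload."""
    return float(sum(i * i for i in range(n)))


async def handle_request(db: LockedSQLite, n: int) -> float:
    # Run blocking work in worker threads so the event loop stays responsive.
    score = await asyncio.to_thread(heavy_benchmark, n)
    await asyncio.to_thread(db.add_result, score)
    return score


async def main():
    db = LockedSQLite()
    scores = await asyncio.gather(*(handle_request(db, n) for n in (10, 100, 1000)))
    return db, scores


db, scores = asyncio.run(main())
print(scores, db.count())
```

Pure-Python loops like the placeholder still contend for the interpreter lock, so the benefit of threads here is mainly that the event loop keeps serving other requests; workloads that spend their time inside native GPU libraries are the better fit for this pattern.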
| 68 | +--- |
| 69 | + |
| 70 | +## ⚠️ Known "Old" or Volatile Files |
| 71 | +- **`old/`**: Contains legacy translation and utility scripts. These are preserved for reference but are not part of the runtime. |
| 72 | +- **`metrics.db`**: Automatically generated. Can be safely deleted to reset history. |
| 73 | +- **`.features_cache`**: Caches hardware detection results. Run `python health_monitor.py refresh` to clear. |