|
| 1 | +# 🧠 The AI Log SRE: Turning Noise into Insights |
| 2 | + |
| 3 | +**Project Status:** ✅ Operational |
| 4 | +**Components:** Grafana Loki, Google Gemini 2.0 Flash, Home Assistant, Unraid, Python |
| 5 | + |
| 6 | +## 1. The Problem: Log Fatigue |
| 7 | +In a distributed homelab (Unraid, Proxmox VE, edge servers, DNS (AdGuard + Unbound), Traefik, UniFi Network, Tailscale, ...), logs are scattered everywhere. |
| 8 | +* **Volume:** My servers generate ~1GB of text logs daily. |
| 9 | +* **Visibility:** I only looked at logs *after* I noticed something was broken. |
| 10 | +* **Noise:** 99% of logs are "Info", masking the 1% "Critical" errors. |
| 11 | + |
| 12 | +I needed a system that wouldn't just *store* logs, but actively *analyze* them and tap me on the shoulder only when it found something I actually needed to see. |
| 13 | + |
| 14 | +## 2. The Solution |
| 15 | +I built a centralized logging pipeline using **Loki** (for storage), **Grafana** (for visualization), and a custom **Python + Gemini** script (for analysis). |
| 16 | + |
| 17 | +Instead of feeding raw logs to an LLM (which is slow and expensive), I implemented a **"Pre-processing Engine"** that: |
| 18 | +1. **Fetches** the last 24 hours of history. |
| 19 | +2. **Deduplicates** repetitive errors (e.g., compressing 5,000 "Connection Refused" lines into 1 line). |
| 20 | +3. **Summarizes** the context using Google Gemini. |
| 21 | +4. **Reports** the findings to my Home Assistant dashboard. |
| 22 | + |
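The deduplication step only works if lines that differ in an IP address, port, or counter are treated as the same error. The sketch below shows one way to do that by masking the variable parts before counting. The masking rules and the function names `signature` and `collapse` are illustrative assumptions, not taken from `reporter.py`.

```python
import re
from collections import Counter

# Hypothetical masking rules (assumption): tune these for your own log formats.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<ip>"),  # IPv4, optional port
    (re.compile(r"\b[0-9a-f]{12,64}\b"), "<id>"),                   # container IDs, hashes
    (re.compile(r"\d+"), "<n>"),                                    # any remaining number
]


def signature(line: str) -> str:
    """Replace the variable parts of a log line so repeats share one key."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()


def collapse(lines):
    """Return one '[xN] signature' row per distinct error, most frequent first."""
    counts = Counter(signature(line) for line in lines)
    return [f"[x{n}] {sig}" for sig, n in counts.most_common()]


if __name__ == "__main__":
    sample = [
        "dial tcp 10.0.0.5:5432: connection refused",
        "dial tcp 10.0.0.9:5432: connection refused",
    ]
    print("\n".join(collapse(sample)))
```

Masking addresses and IDs before counting also means fewer raw identifiers leave the network, which fits the privacy goal of the project.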
| 23 | +<video width="100%" autoplay loop muted playsinline> |
| 24 | + <source src="ai-home-assistant-dashboard.mp4" type="video/mp4"> |
| 25 | + Your browser does not support the video tag. |
| 26 | +</video> |
| 27 | + |
| 28 | +## 3. Architecture Diagram |
| 29 | + |
| 30 | +```mermaid |
| 31 | +graph TD |
| 32 | + %% --- LEVEL 1: EDGE --- |
| 33 | + subgraph Edge ["Edge Nodes (Collectors)"] |
| 34 | + direction TB |
| 35 | + Docker[Docker Logs] & System[System Logs] --> Promtail[Promtail Agent] |
| 36 | + end |
| 37 | +
|
| 38 | + %% --- LEVEL 2: UNRAID --- |
| 39 | + subgraph Unraid ["Unraid Server (The Brain)"] |
| 40 | + direction TB |
| 41 | + Promtail -->|Push| Loki[Loki DB] |
| 42 | + |
| 43 | + %% The Fork: Human vs AI |
| 44 | + Loki -->|Visualise| Grafana[Grafana UI] |
| 45 | + Loki -->|Fetch| Script[Python Script] |
| 46 | + |
| 47 | + Script <-->|Analyze| Gemini[Gemini API] |
| 48 | + end |
| 49 | +
|
| 50 | + %% --- LEVEL 3: HA --- |
| 51 | + subgraph HA ["Home Assistant (Interface)"] |
| 52 | + direction TB |
| 53 | + Script -->|Webhook| Core[Home Assistant] |
| 54 | + Core --> Phone[Mobile Alert] |
| 55 | + Core --> Wall[Dashboard] |
| 56 | + end |
| 57 | +
|
| 58 | + %% --- THE FIX: INVISIBLE STRUT --- |
| 59 | + %% Forces HA to stay at the bottom |
| 60 | + Gemini ~~~ Core |
| 61 | +
|
| 62 | + %% --- STYLING --- |
| 63 | + style Loki fill:#f9f,stroke:#333,stroke-width:2px |
| 64 | + style Script fill:#ff9,stroke:#333,stroke-width:2px |
| 65 | + style Gemini fill:#4285f4,stroke:#fff,stroke-width:2px,color:#fff |
| 66 | + style Grafana fill:#ff9900,stroke:#333,stroke-width:2px,color:white |
| 67 | +``` |
| 68 | + |
| 69 | +## 4. Key Features |
| 70 | +* **Cost Efficient:** Uses client-side deduplication to reduce token usage by ~95%. |
| 71 | +* **Massive Context:** Can analyze up to 50,000 log lines per run. |
| 72 | +* **Self-Healing:** If the report fails, Home Assistant retains the last known state. |
| 73 | +* **Privacy:** Only anonymized/filtered error logs are sent to the AI; raw logs stay local. |
| 74 | + |
| 75 | +*** |
| 76 | +<!-- |
| 77 | +### File 2: `02-implementation.md` |
| 78 | +*Use this file for the technical setup steps and code.* |
| 79 | +--> |
| 80 | +<br> |
| 81 | +<br> |
| 82 | +<br> |
| 83 | +<br> |
| 84 | +<br> |
| 85 | +<br> |
| 86 | + |
| 87 | +# 🛠️ Implementation Guide |
| 88 | + |
| 89 | +This guide details how to reproduce the "AI Log SRE" stack. |
| 90 | + |
| 91 | +## Prerequisites |
| 92 | +* **Unraid Server** (or any Docker host). |
| 93 | +* **Google Gemini API Key** (the free tier is sufficient; a paid tier is recommended for higher rate limits). |
| 94 | +* **Home Assistant** (for notifications). |
| 95 | + |
| 96 | +--- |
| 97 | + |
| 98 | +## Step 1: Central Server (Unraid) |
| 99 | +We run the `loki` database, `grafana`, and the `ai-log-reporter` script container in a single stack. |
| 100 | + |
| 101 | +### Docker Compose |
| 102 | +```yaml |
| 103 | +services: |
| 104 | + loki: |
| 105 | + image: grafana/loki:3.1.0 |
| 106 | + container_name: loki |
| 107 | + user: "0:0" |
| 108 | + volumes: |
| 109 | + - /mnt/docker/appdata/loki:/loki |
| 110 | + command: -config.file=/loki/local-config.yaml |
| 111 | + network_mode: host |
| 112 | + restart: unless-stopped |
| 113 | + |
| 114 | + grafana: |
| 115 | + image: grafana/grafana:latest |
| 116 | + container_name: grafana |
| 117 | + ports: |
| 118 | + - "3000:3000" |
| 119 | + volumes: |
| 120 | + - /mnt/docker/appdata/grafana:/var/lib/grafana |
| 121 | + restart: unless-stopped |
| 122 | + |
| 123 | + ai-log-reporter: |
| 124 | + image: python:3.11-slim |
| 125 | + container_name: ai-log-reporter |
| 126 | + volumes: |
| 127 | + - /mnt/docker/appdata/ai-reporter:/app |
| 128 | + environment: |
| 129 | + - LOKI_URL=http://localhost:3100 |
| 130 | + - GEMINI_API_KEY=your_gemini_key_here |
| 131 | + - HA_URL=http://[HA-IP]:8123 |
| 132 | + - HA_TOKEN=your_ha_long_lived_token |
| 133 | + # Installs dependencies, then idles; the script is run on demand via `docker exec` |
| 134 | + command: > |
| 135 | + sh -c "pip install requests google-genai && |
| 136 | + tail -f /dev/null" |
| 137 | + network_mode: host |
| 138 | + restart: unless-stopped |
| 139 | +``` |
| 140 | +
|
| 141 | +**Loki Configuration** (local-config.yaml) |
| 142 | +*Critical:* `max_entries_limit_per_query` must be raised from its default of 5000, otherwise the AI only sees a fraction of the day's history. |
| 143 | + |
| 144 | +```yaml |
| 145 | +auth_enabled: false |
| 146 | +
|
| 147 | +server: |
| 148 | + http_listen_port: 3100 |
| 149 | +
|
| 150 | +common: |
| 151 | + instance_addr: 0.0.0.0 |
| 152 | + path_prefix: /loki |
| 153 | + storage: |
| 154 | + filesystem: |
| 155 | + chunks_directory: /loki/chunks |
| 156 | + rules_directory: /loki/rules |
| 157 | + replication_factor: 1 |
| 158 | + ring: |
| 159 | + kvstore: |
| 160 | + store: inmemory |
| 161 | +
|
| 162 | +schema_config: |
| 163 | + configs: |
| 164 | + - from: 2024-04-01 |
| 165 | + store: tsdb |
| 166 | + object_store: filesystem |
| 167 | + schema: v13 |
| 168 | + index: |
| 169 | + prefix: index_ |
| 170 | + period: 24h |
| 171 | +
|
| 172 | +compactor: |
| 173 | + working_directory: /loki/boltdb-shipper-compactor |
| 174 | + retention_enabled: true |
| 175 | +
|
| 176 | +limits_config: |
| 177 | + retention_period: 720h # 30 Days |
| 178 | +
|
| 179 | + # --- Rate Limiting --- |
| 180 | + reject_old_samples: true |
| 181 | + reject_old_samples_max_age: 168h |
| 182 | + ingestion_rate_mb: 20 |
| 183 | + ingestion_burst_size_mb: 40 |
| 184 | + per_stream_rate_limit: 20MB |
| 185 | + per_stream_rate_limit_burst: 40MB |
| 186 | +
|
| 187 | + max_entries_limit_per_query: 50000 # <--- CRITICAL FOR AI |
| 188 | +``` |
| 189 | + |
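To show where the raised limit matters, here is a minimal sketch of how a script could build the request for Loki's `query_range` HTTP endpoint. The function name `build_query_url` and the example selector are assumptions. The `query`, `start`, `end`, `limit` and `direction` parameters are part of Loki's HTTP API, with timestamps given as Unix epoch nanoseconds.

```python
import time
from urllib.parse import urlencode

NANOS_PER_SECOND = 1_000_000_000


def build_query_url(loki_url, logql, hours=24, limit=50000, now=None):
    """Build a /loki/api/v1/query_range URL covering the last `hours` hours."""
    now = int(time.time() if now is None else now)
    end_ns = now * NANOS_PER_SECOND
    start_ns = end_ns - hours * 3600 * NANOS_PER_SECOND
    params = {
        "query": logql,
        "start": start_ns,
        "end": end_ns,
        # Loki rejects requests whose limit exceeds max_entries_limit_per_query.
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{loki_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Example selector (assumption): relies on a `level` label being set by Promtail.
    print(build_query_url("http://localhost:3100", '{level=~"error|warn"}'))
```

Requesting more entries than the server-side limit allows returns an error rather than a shortened result, so the `limit` here and the value in `limits_config` need to agree.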
| 190 | +## Step 2: Log Collection (Edge Nodes) |
| 191 | +On every other server (Proxmox, Pi, Edge), we run **Promtail** to ship logs to Unraid. |
| 192 | + |
| 193 | +**Promtail Config** (config.yml) |
| 194 | +```yaml |
| 195 | +server: |
| 196 | + http_listen_port: 9080 |
| 197 | + grpc_listen_port: 0 |
| 198 | +
|
| 199 | +positions: |
| 200 | + filename: /tmp/positions.yaml |
| 201 | +
|
| 202 | +clients: |
| 203 | + # Your Unraid IP |
| 204 | + - url: http://[UNRAID-IP]:3100/loki/api/v1/push |
| 205 | +
|
| 206 | +scrape_configs: |
| 207 | + # --- SYSTEM LOGS --- |
| 208 | + - job_name: system |
| 209 | + static_configs: |
| 210 | + - targets: |
| 211 | + - localhost |
| 212 | + labels: |
| 213 | + host: docker-server-edge # <--- set per node |
| 214 | + service_name: os-system |
| 215 | + __path__: /var/log/*.log |
| 216 | +
|
| 217 | + # --- DOCKER LOGS (Smart Proxy Mode) --- |
| 218 | + - job_name: docker |
| 219 | + docker_sd_configs: |
| 220 | + - host: tcp://docker-socket-proxy:2375 |
| 221 | + refresh_interval: 5s |
| 222 | +
|
| 223 | + relabel_configs: |
| 224 | + # 1. Get Container Name |
| 225 | + - source_labels: ['__meta_docker_container_name'] |
| 226 | + regex: '/(.*)' |
| 227 | + target_label: 'container_name' |
| 228 | +
|
| 229 | + # 2. Get Container ID |
| 230 | + - source_labels: ['__meta_docker_container_id'] |
| 231 | + target_label: 'container_id' |
| 232 | +
|
| 233 | + # 3. FORCE the "host" label |
| 234 | + - target_label: 'host' |
| 235 | + replacement: 'docker-server-edge' # <--- set per node |
| 236 | +
|
| 237 | + # 4. Force Service Name (AI Reporter) |
| 238 | + - target_label: 'service_name' |
| 239 | + replacement: 'docker' |
| 240 | +
|
| 241 | + # 5. Force Job Label |
| 242 | + - target_label: 'job' |
| 243 | + replacement: 'docker' |
| 244 | +
|
| 245 | + pipeline_stages: |
| 246 | + # 1. Standard Docker Unwrap |
| 247 | + - docker: {} |
| 248 | +
|
| 249 | + # 2. Socket Proxy Level Detection (200=info) |
| 250 | + - match: |
| 251 | + selector: '{container_name="docker-socket-proxy"}' |
| 252 | + stages: |
| 253 | + - regex: |
| 254 | + expression: '\s\d+/\d+/\d+/\d+/\d+\s+(?P<status_code>\d{3})\s' |
| 255 | + - template: |
| 256 | + source: level |
| 257 | + template: '{{ if hasPrefix "2" .status_code }}info{{ else if hasPrefix "3" .status_code }}info{{ else if hasPrefix "4" .status_code }}warn{{ else if hasPrefix "5" .status_code }}error{{ else }}unknown{{ end }}' |
| 258 | + - labels: |
| 259 | + level: |
| 260 | +
|
| 261 | + # 3. Standard Level Detection |
| 262 | + - regex: |
| 263 | + expression: '(?i)(?:level|lvl|severity)=(?P<level>\w+)|\[(?P<level>\w+)\]' |
| 264 | + - labels: |
| 265 | + level: |
| 266 | +
|
| 267 | +``` |
| 268 | + |
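The socket-proxy stage is easier to reason about outside of Promtail. The sketch below reuses the regular expression from the config and mirrors the template's mapping in Python so it can be checked against sample lines. The sample log line is made up for illustration, and treating 3xx as `info` is an assumption.

```python
import re

# Same expression as the Promtail `regex` stage for docker-socket-proxy.
STATUS_RE = re.compile(r"\s\d+/\d+/\d+/\d+/\d+\s+(?P<status_code>\d{3})\s")


def level_from_status(status_code: str) -> str:
    """Map an HTTP status code to a log level, mirroring the template stage."""
    if status_code.startswith(("2", "3")):  # 3xx -> info is an assumption
        return "info"
    if status_code.startswith("4"):
        return "warn"
    if status_code.startswith("5"):
        return "error"
    return "unknown"


def level_from_line(line: str) -> str:
    """Extract the status code from a proxy log line and map it to a level."""
    match = STATUS_RE.search(line)
    return level_from_status(match.group("status_code")) if match else "unknown"


if __name__ == "__main__":
    # Hypothetical HAProxy-style line, not copied from a real log.
    sample = 'frontend backend/socket 0/0/0/1/1 200 8789 - - ---- 1/1/0/0/0 0/0 "GET /containers/json HTTP/1.1"'
    print(level_from_line(sample))
```

Running a handful of real lines through `level_from_line` before deploying the config is a quick way to confirm the regex matches your proxy's log format.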
| 269 | +## Step 3: The Intelligence (Python Script) |
| 270 | +This script runs inside the `ai-log-reporter` container. |
| 271 | + |
| 272 | +**Key Logic:** |
| 273 | + |
| 274 | +1. Fetches the last 24h of logs (level=error or warn). |
| 275 | +2. Uses a `defaultdict` to count duplicates. |
| 276 | +3. Truncates the output if it exceeds 90k characters. |
| 277 | + |
| 278 | +(See repository for full reporter.py source code) |
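Since the full source lives in the repository, the following is only a minimal sketch of steps 2 and 3 from the key logic above: counting duplicates with a `defaultdict` and capping the text sent to the model at 90k characters. The function name `build_prompt_body`, the `(source, message)` input shape, and the output format are assumptions and may differ from the real `reporter.py`.

```python
from collections import defaultdict

MAX_CHARS = 90_000  # cap from the key logic above


def build_prompt_body(entries, max_chars=MAX_CHARS):
    """Collapse (source, message) pairs into counted rows and cap the length."""
    counts = defaultdict(int)
    for source, message in entries:
        counts[(source, message)] += 1

    # Most frequent errors first, so truncation drops the rarest rows.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    body = "\n".join(f"[{n}x] {source}: {message}" for (source, message), n in ranked)

    if len(body) > max_chars:
        body = body[:max_chars] + "\n...[truncated]"
    return body


if __name__ == "__main__":
    demo = [("traefik", "connection refused")] * 3 + [("loki", "timeout")]
    print(build_prompt_body(demo))
```

Sorting by frequency before truncating is a design choice: it keeps the noisiest errors in view, at the cost of possibly cutting a rare but important one, so a different ordering (for example by severity) may suit some setups better.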
| 279 | + |
| 280 | +*** |
| 281 | + |
| 282 | +## Step 4: Home Assistant Package |
| 283 | +The automation that triggers the report and displays it. |
| 284 | + |
| 285 | +```yaml |
| 286 | +shell_command: |
| 287 | + generate_ai_log_summary: > |
| 288 | + ssh -i /config/.ssh/id_rsa -o StrictHostKeyChecking=no root@10.0.0.23 'docker exec ai-log-reporter python /app/reporter.py' |
| 289 | +
|
| 290 | +automation: |
| 291 | + - alias: "Daily AI System Summary" |
| 292 | + trigger: |
| 293 | + - platform: time |
| 294 | + at: "07:00:00" |
| 295 | + - platform: homeassistant |
| 296 | + event: start |
| 297 | + action: |
| 298 | + - delay: "00:01:00" |
| 299 | + - action: script.run_ai_summary_now |
| 300 | +``` |
| 301 | + |
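The package above covers the trigger side. For the reporting side, here is a hedged sketch of how the script could hand its summary to Home Assistant using the `HA_URL` and `HA_TOKEN` variables from the compose file. It targets Home Assistant's REST endpoint `POST /api/states/<entity_id>`, which is an assumption: the architecture diagram labels this hop "Webhook", so the real script may use a webhook instead. The entity name `sensor.ai_log_summary` and the `full_report` attribute are hypothetical.

```python
import json
import urllib.request

STATE_MAX_LEN = 255  # Home Assistant limits entity state strings to 255 characters


def build_state_request(ha_url, token, summary, entity_id="sensor.ai_log_summary"):
    """Build (but do not send) a request that stores the summary on an entity."""
    payload = {
        "state": summary[:STATE_MAX_LEN],          # short preview for the state
        "attributes": {"full_report": summary},    # full text for the dashboard card
    }
    return urllib.request.Request(
        url=f"{ha_url.rstrip('/')}/api/states/{entity_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_state_request("http://homeassistant.local:8123", "TOKEN", "All quiet.")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request, timeout=10)
```

Keeping the long report in an attribute rather than the state avoids the 255-character state limit while still letting a Markdown card render the full text.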
| 302 | +*** |
| 303 | + |
| 304 | +<!-- |
| 305 | +### File 3: `03-user-manual.md` |
| 306 | +*Use this file to explain how to interpret the results.* |
| 307 | +--> |
| 308 | + |
| 309 | +<br> |
| 310 | +<br> |
| 311 | +<br> |
| 312 | +<br> |
| 313 | +<br> |
| 314 | +<br> |
| 315 | + |
| 316 | +# 📖 User Manual & Operations |
| 317 | + |
| 318 | +## How to Read the Daily Report |
| 319 | +The AI Summary appears in Home Assistant every morning at 07:00. |
| 320 | + |
| 321 | +### The Iconography |
| 322 | +* 🔴 **CRITICAL:** Immediate action required. |
| 323 | + * *Examples:* Database corruption, Disk failure (SMART), Service boot loops. |
| 324 | + * *Action:* Check Grafana immediately. |
| 325 | +* 🛡️ **SECURITY:** Passive protection info. |
| 326 | + * *Examples:* "CrowdSec blocked 50 IPs", "Brute force attempt on SSH". |
| 327 | + * *Action:* None (System is doing its job). |
| 328 | +* 🟡 **WARNING:** Non-critical noise. |
| 329 | + * *Examples:* Timeouts, configuration deprecation warnings. |
| 330 | + * *Action:* Add to "Technical Debt" to-do list. |
| 331 | + |
| 332 | +## Troubleshooting |
| 333 | +**"Report says: No critical errors found."** |
| 334 | +* **Good News:** Your system is healthy! |
| 335 | +* **Verification:** Check the `ai-log-reporter` container logs to ensure it actually ran and didn't just fail to fetch data. |
| 336 | + |
| 337 | +**"Report is Empty or Unknown"** |
| 338 | +* Check the Home Assistant logs for `shell_command` errors. |
| 339 | +* Ensure the SSH key in Home Assistant allows connection to Unraid without a password. |
| 340 | + |
| 341 | +## Grafana Deep Dive |
| 342 | +When the AI reports a "Critical" error, use Grafana to investigate. |
| 343 | + |
| 344 | +**Recommended LogQL Query:** |
| 345 | +To see the raw data the AI analyzed: |
| 346 | +```logql |
| 347 | +{job=~".+"} != "docker-socket-proxy" |= "error" |