Commit 1a475da

feat: Introduce AI Log SRE documentation and a companion dashboard video.
1 parent d3c48ba commit 1a475da

2 files changed

Lines changed: 347 additions & 0 deletions

Binary file not shown.
@@ -0,0 +1,347 @@
# 🧠 The AI Log SRE: Turning Noise into Insights

**Project Status:** ✅ Operational
**Components:** Grafana Loki, Google Gemini 2.0 Flash, Home Assistant, Unraid, Python

## 1. The Problem: Log Fatigue
In a distributed homelab (Unraid, Proxmox VE, edge servers, DNS (AdGuard + Unbound), Traefik, UniFi Network, Tailscale, ...), logs are scattered everywhere.
* **Volume:** My servers generate ~1GB of text logs daily.
* **Visibility:** I only looked at logs *after* I noticed something was broken.
* **Noise:** 99% of logs are "Info", masking the 1% "Critical" errors.

I needed a system that wouldn't just *store* logs, but actively *analyze* them and tap me on the shoulder only when it found something I actually needed to see.

## 2. The Solution
I built a centralized logging pipeline using **Loki** (for storage), **Grafana** (for visualization), and a custom **Python + Gemini** script (for analysis).

Instead of feeding raw logs to an LLM (which is slow and expensive), I implemented a **"Pre-processing Engine"** that:
1. **Fetches** the last 24 hours of history.
2. **Deduplicates** repetitive errors (e.g., compressing 5,000 "Connection Refused" lines into 1 line).
3. **Summarizes** the context using Google Gemini.
4. **Reports** the findings to my Home Assistant dashboard.
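
The deduplication step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual `reporter.py`: the `normalize` helper and its masking rules are assumptions made for the example, chosen so that lines differing only in an IP address or a number collapse into one entry.

```python
import re
from collections import defaultdict

# Hypothetical masking rules: collapse the parts of a line that vary between repeats.
_IP = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")
_NUM = re.compile(r"\d+")


def normalize(line: str) -> str:
    """Replace IP addresses and numbers so repeated errors share one key."""
    line = _IP.sub("<ip>", line)
    return _NUM.sub("<n>", line).strip()


def deduplicate(lines):
    """Return (count, normalized_line) pairs, most frequent first."""
    counts = defaultdict(int)
    for line in lines:
        counts[normalize(line)] += 1
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(((c, l) for l, c in counts.items()), key=lambda p: (-p[0], p[1]))
```

Feeding it 5,000 "Connection refused" lines that differ only in the target address yields a single `(5000, ...)` entry, which is the kind of compression that keeps the prompt small.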

<video width="100%" autoplay loop muted playsinline>
<source src="ai-home-assistant-dashboard.mp4" type="video/mp4">
Your browser does not support the video tag.
</video>

## 3. Architecture Diagram

```mermaid
graph TD
    %% --- LEVEL 1: EDGE ---
    subgraph Edge ["Edge Nodes (Collectors)"]
        direction TB
        Docker[Docker Logs] & System[System Logs] --> Promtail[Promtail Agent]
    end

    %% --- LEVEL 2: UNRAID ---
    subgraph Unraid ["Unraid Server (The Brain)"]
        direction TB
        Promtail -->|Push| Loki[Loki DB]

        %% The Fork: Human vs AI
        Loki -->|Visualise| Grafana[Grafana UI]
        Loki -->|Fetch| Script[Python Script]

        Script <-->|Analyze| Gemini[Gemini API]
    end

    %% --- LEVEL 3: HA ---
    subgraph HA ["Home Assistant (Interface)"]
        direction TB
        Script -->|Webhook| Core[Home Assistant]
        Core --> Phone[Mobile Alert]
        Core --> Wall[Dashboard]
    end

    %% --- THE FIX: INVISIBLE STRUT ---
    %% Forces HA to stay at the bottom
    Gemini ~~~ Core

    %% --- STYLING ---
    style Loki fill:#f9f,stroke:#333,stroke-width:2px
    style Script fill:#ff9,stroke:#333,stroke-width:2px
    style Gemini fill:#4285f4,stroke:#fff,stroke-width:2px,color:#fff
    style Grafana fill:#ff9900,stroke:#333,stroke-width:2px,color:white
```

## 4. Key Features
* **Cost-Efficient:** Uses client-side deduplication to reduce token usage by ~95%.
* **Massive Context:** Can analyze up to 50,000 log lines per run.
* **Self-Healing:** If the report fails, Home Assistant retains the last known state.
* **Privacy:** Only anonymized/filtered error logs are sent to the AI; raw logs stay local.
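
One way to approach the privacy point is to mask identifying values before anything leaves the network. The sketch below is an assumption about how such a filter could look; the patterns and the `redact` name are illustrative and are not taken from the project's script.

```python
import re

# Illustrative patterns only; a real deployment would tune these to its own logs.
_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "<email>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}\b"), "<mac>"),
    (re.compile(r"(?i)\b(token|password|api[_-]?key)=\S+"), r"\1=<redacted>"),
]


def redact(line: str) -> str:
    """Apply every masking pattern in order and return the sanitized line."""
    for pattern, replacement in _PATTERNS:
        line = pattern.sub(replacement, line)
    return line
```

Running the filter before deduplication has a side benefit: lines that differ only in a masked value become identical and collapse into one entry.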

***
<!--
### File 2: `02-implementation.md`
*Use this file for the technical setup steps and code.*
-->
<br>
<br>
<br>
<br>
<br>
<br>

# 🛠️ Implementation Guide

This guide details how to reproduce the "AI Log SRE" stack.

## Prerequisites
* **Unraid Server** (or any Docker host).
* **Google Gemini API Key** (the free tier is sufficient; a paid tier is recommended for higher rate limits).
* **Home Assistant** (for notifications).

---

## Step 1: Central Server (Unraid)
We run the `loki` database, `grafana`, and the `ai-log-reporter` script in a single stack.

### Docker Compose
```yaml
services:
  loki:
    image: grafana/loki:3.1.0
    container_name: loki
    user: "0:0"
    volumes:
      - /mnt/docker/appdata/loki:/loki
    command: -config.file=/loki/local-config.yaml
    network_mode: host
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - /mnt/docker/appdata/grafana:/var/lib/grafana
    restart: unless-stopped

  ai-log-reporter:
    image: python:3.11-slim
    container_name: ai-log-reporter
    volumes:
      - /mnt/docker/appdata/ai-reporter:/app
    environment:
      - LOKI_URL=http://localhost:3100
      - GEMINI_API_KEY=your_gemini_key_here
      - HA_URL=http://[HA-IP]:8123
      - HA_TOKEN=your_ha_long_lived_token
    # Installs dependencies and runs script on demand
    command: >
      sh -c "pip install requests google-genai &&
      tail -f /dev/null"
    network_mode: host
    restart: unless-stopped
```

**Loki Configuration** (local-config.yaml)
*Critical:* The `max_entries_limit_per_query` setting must be raised from Loki's default of 5000 so the AI can see the full history.

```yaml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  instance_addr: 0.0.0.0
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-04-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

compactor:
  working_directory: /loki/boltdb-shipper-compactor
  retention_enabled: true

limits_config:
  retention_period: 720h # 30 Days

  # --- Rate Limiting ---
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 20
  ingestion_burst_size_mb: 40
  per_stream_rate_limit: 20MB
  per_stream_rate_limit_burst: 40MB

  max_entries_limit_per_query: 50000 # <--- CRITICAL FOR AI
```

## Step 2: Log Collection (Edge Nodes)
On every other server (Proxmox, Pi, Edge), we run **Promtail** to ship logs to Unraid.

**Promtail Config** (config.yml)
```yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  # Your Unraid IP
  - url: http://[UNRAID-IP]:3100/loki/api/v1/push

scrape_configs:
  # --- SYSTEM LOGS ---
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          host: docker-server-edge # <--- UPDATED for Node 63
          service_name: os-system
          __path__: /var/log/*.log

  # --- DOCKER LOGS (Smart Proxy Mode) ---
  - job_name: docker
    docker_sd_configs:
      - host: tcp://docker-socket-proxy:2375
        refresh_interval: 5s

    relabel_configs:
      # 1. Get Container Name
      - source_labels: ['__meta_docker_container_name']
        regex: '/(.*)'
        target_label: 'container_name'

      # 2. Get Container ID
      - source_labels: ['__meta_docker_container_id']
        target_label: 'container_id'

      # 3. FORCE the "host" label
      - target_label: 'host'
        replacement: 'docker-server-edge' # <--- UPDATED for Node 63

      # 4. Force Service Name (AI Reporter)
      - target_label: 'service_name'
        replacement: 'docker'

      # 5. Force Job Label
      - target_label: 'job'
        replacement: 'docker'

    pipeline_stages:
      # 1. Standard Docker Unwrap
      - docker: {}

      # 2. Socket Proxy Level Detection (200=info)
      - match:
          selector: '{container_name="docker-socket-proxy"}'
          stages:
            - regex:
                expression: '\s\d+/\d+/\d+/\d+/\d+\s+(?P<status_code>\d{3})\s'
            - template:
                source: level
                template: '{{ if hasPrefix "2" .status_code }}info{{ else if hasPrefix "3" .status_code }}info{{ else if hasPrefix "4" .status_code }}warn{{ else if hasPrefix "5" .status_code }}error{{ else }}unknown{{ end }}'
            - labels:
                level:

      # 3. Standard Level Detection
      - regex:
          expression: '(?i)(?:level|lvl|severity)=(?P<level>\w+)|\[(?P<level>\w+)\]'
      - labels:
          level:

```
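
To sanity-check the socket-proxy stage before deploying it, the same regex and status-to-level mapping can be exercised offline. The Python sketch below mirrors the `regex` and `template` stages above, assuming 4xx responses map to `warn`; the `level_for` name is hypothetical, and any sample line you feed it is only assumed to resemble the proxy's HAProxy-style output.

```python
import re

# Same expression as the Promtail regex stage (Python shares the (?P<name>...) syntax).
STATUS_RE = re.compile(r"\s\d+/\d+/\d+/\d+/\d+\s+(?P<status_code>\d{3})\s")

# First digit of the HTTP status code -> log level.
_LEVELS = {"2": "info", "3": "info", "4": "warn", "5": "error"}


def level_for(line: str) -> str:
    """Map the HTTP status code found in a proxy log line to a log level."""
    match = STATUS_RE.search(line)
    if match is None:
        return "unknown"
    return _LEVELS.get(match.group("status_code")[0], "unknown")
```

A line containing ` 0/0/0/1/1 200 ` is classified as `info`, while the same line with `403` becomes `warn` and with `500` becomes `error`.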

## Step 3: The Intelligence (Python Script)
This script runs inside the `ai-log-reporter` container.

**Key Logic:**

1. Fetches the last 24h of logs (level=error or warn).
2. Uses a `defaultdict` to count duplicates.
3. Truncates the output if it exceeds 90k characters.

(See the repository for the full `reporter.py` source code.)
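
For readers without the repository at hand, the fetch and truncate steps might look roughly like this. It is a sketch under stated assumptions, not the contents of `reporter.py`: the label selector, the helper names, and the `[truncated]` marker are made up for illustration, while `/loki/api/v1/query_range` and its `query`, `start`, `end`, `limit`, and `direction` parameters are Loki's HTTP API.

```python
import urllib.parse

MAX_PROMPT_CHARS = 90_000  # truncation threshold named in the key logic above


def build_query_url(loki_url: str, now_s: int, hours: int = 24, limit: int = 50_000) -> str:
    """Build a Loki query_range URL covering the last `hours` hours."""
    params = {
        "query": '{level=~"error|warn"}',  # assumed selector for error/warn lines
        "start": str((int(now_s) - hours * 3600) * 10**9),  # Unix time in nanoseconds
        "end": str(int(now_s) * 10**9),
        "limit": str(limit),
        "direction": "backward",
    }
    return f"{loki_url.rstrip('/')}/loki/api/v1/query_range?{urllib.parse.urlencode(params)}"


def extract_lines(payload: dict) -> list[str]:
    """Flatten Loki's streams response into a plain list of log lines."""
    return [line for stream in payload["data"]["result"] for _ts, line in stream["values"]]


def truncate(text: str, max_chars: int = MAX_PROMPT_CHARS) -> str:
    """Cut the text at max_chars and mark that it was shortened."""
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n[truncated]"
```

In practice `now_s` would come from `time.time()` and the URL would be fetched with `requests.get`, which the container already installs. The `limit` value only helps if the server's `max_entries_limit_per_query` is at least as large.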

***

## Step 4: Home Assistant Package
The automation that triggers the report and displays it.

```yaml
shell_command:
  generate_ai_log_summary: >
    ssh -i /config/.ssh/id_rsa -o StrictHostKeyChecking=no root@10.0.0.23 'docker exec ai-log-reporter python /app/reporter.py'

automation:
  - alias: "Daily AI System Summary"
    trigger:
      - platform: time
        at: "07:00:00"
      - platform: homeassistant
        event: start
    action:
      - delay: "00:01:00"
      - action: script.run_ai_summary_now
```
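
On the reporter side, the summary has to get back into Home Assistant. One plausible approach, given the `HA_URL` and `HA_TOKEN` variables in the compose file, is Home Assistant's REST API (`POST /api/states/<entity_id>` with a bearer token). The sketch below assumes that route; the entity ID `sensor.ai_log_summary` and the function name are hypothetical, and the real script may use a webhook instead.

```python
import json
import urllib.request


def build_state_request(ha_url: str, token: str, summary: str,
                        entity_id: str = "sensor.ai_log_summary") -> urllib.request.Request:
    """Prepare a POST that stores the summary on a Home Assistant entity."""
    body = {
        # Home Assistant caps state values at 255 characters, so the state stays
        # short and the full text travels as an attribute.
        "state": "ok" if summary else "empty",
        "attributes": {"summary": summary},
    }
    return urllib.request.Request(
        url=f"{ha_url.rstrip('/')}/api/states/{entity_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen(req, timeout=10)` sends it. If the call fails, the entity simply keeps its previous value.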

***

<!--
### File 3: `03-user-manual.md`
*Use this file to explain how to interpret the results.*
-->

<br>
<br>
<br>
<br>
<br>
<br>

# 📖 User Manual & Operations

## How to Read the Daily Report
The AI Summary appears in Home Assistant every morning at 07:00.

### The Iconography
* 🔴 **CRITICAL:** Immediate action required.
    * *Examples:* Database corruption, Disk failure (SMART), Service boot loops.
    * *Action:* Check Grafana immediately.
* 🛡️ **SECURITY:** Passive protection info.
    * *Examples:* "CrowdSec blocked 50 IPs", "Brute force attempt on SSH".
    * *Action:* None (System is doing its job).
* 🟡 **WARNING:** Non-critical noise.
    * *Examples:* Timeouts, configuration deprecation warnings.
    * *Action:* Add to "Technical Debt" to-do list.

## Troubleshooting
**"Report says: No critical errors found."**
* **Good News:** Your system is healthy!
* **Verification:** Check the `ai-log-reporter` container logs to confirm that the script actually ran and did not simply fail to fetch data.

**"Report is Empty or Unknown"**
* Check the Home Assistant logs for `Shell Command` errors.
* Ensure the SSH key in Home Assistant allows connection to Unraid without a password.

## Grafana Deep Dive
When the AI reports a "Critical" error, use Grafana to investigate.

**Recommended LogQL Query:**
To see the raw data the AI analyzed:
```logql
{job=~".+"} != "docker-socket-proxy" |= "error"
```
