
K3s Homelab Monitoring Stack

Complete monitoring solution for your K3s homelab cluster with Prometheus, Grafana, Loki, and more.

🚀 Overview

This monitoring stack provides comprehensive observability for your K3s homelab:

  • 📊 Metrics Collection: Prometheus scrapes metrics from Kubernetes and system components
  • 📈 Visualization: Grafana dashboards for beautiful charts and graphs
  • 📝 Log Aggregation: Loki collects and indexes logs from all pods and services
  • 🚨 Alerting: Built-in alerts for common issues (high CPU, memory, pod crashes)
  • 🔍 System Monitoring: Node Exporter provides host-level metrics
  • ☸️ Kubernetes Monitoring: Kube State Metrics exposes cluster state

📦 Components

| Component | Purpose | Port | Image |
| --- | --- | --- | --- |
| Prometheus | Metrics collection & alerting | 30000 | prom/prometheus:v2.45.0 |
| Grafana | Data visualization | 30001 | grafana/grafana:10.0.3 |
| Loki | Log aggregation | 3100 | grafana/loki:2.9.0 |
| Promtail | Log collection agent | - | grafana/promtail:2.9.0 |
| Node Exporter | System metrics | 9100 | prom/node-exporter:v1.6.1 |
| Kube State Metrics | K8s cluster metrics | 8080 | k8s.gcr.io/kube-state-metrics/kube-state-metrics:v2.10.0 |

🛠️ Installation

Prerequisites

  • K3s cluster running (see main README.md)
  • Ansible configured and working
  • kubectl access to your cluster
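A quick pre-flight check for all three (the inventory path matches the deployment command below):

```bash
# Cluster reachable?
kubectl get nodes

# Ansible installed and able to reach the hosts?
ansible --version
ansible -i inventory/hosts.yml all -m ping
```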

Method 1: Ansible Deployment (Recommended)

```bash
# Deploy the complete monitoring stack
ansible-playbook -i inventory/hosts.yml playbooks/monitoring.yml
```
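
To preview what the playbook would change before applying anything, Ansible's dry-run flags work here (assuming the playbook's tasks support check mode):

```bash
# Show pending changes without applying them
ansible-playbook -i inventory/hosts.yml playbooks/monitoring.yml --check --diff
```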

Method 2: Manual kubectl

```bash
# Apply all manifests at once
kubectl apply -f manifests/monitoring/

# Or apply individually
kubectl apply -f manifests/monitoring/namespace.yml
kubectl apply -f manifests/monitoring/prometheus-rbac.yml
kubectl apply -f manifests/monitoring/prometheus-config.yml
kubectl apply -f manifests/monitoring/prometheus-deployment.yml
kubectl apply -f manifests/monitoring/grafana-config.yml
kubectl apply -f manifests/monitoring/grafana-deployment.yml
kubectl apply -f manifests/monitoring/loki-deployment.yml
kubectl apply -f manifests/monitoring/promtail-deployment.yml
kubectl apply -f manifests/monitoring/node-exporter.yml
kubectl apply -f manifests/monitoring/kube-state-metrics.yml
```

Verify Deployment

```bash
# Check all pods are running
kubectl get pods -n monitoring

# Check services
kubectl get svc -n monitoring

# Check deployments
kubectl get deployments -n monitoring
```
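
To block until everything is up (handy in scripts), kubectl can also wait on pod readiness:

```bash
# Wait up to 5 minutes for every pod in the namespace to become Ready
kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=300s
```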

🌐 Access

URLs

Replace 10.10.1.24 with your K3s server IP address:

| Service | URL | Credentials |
| --- | --- | --- |
| Grafana | http://10.10.1.24:30001 | admin / admin |
| Prometheus | http://10.10.1.24:30000 | No auth |

Port Forwarding (Alternative)

```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000

# Prometheus (listens on 9090 by default)
kubectl port-forward -n monitoring svc/prometheus-service 9090:9090
```
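
With the forwards active, each service's standard health endpoint gives a quick sanity check (Prometheus exposes /-/healthy, Grafana exposes /api/health):

```bash
# Prometheus liveness check
curl -s http://localhost:9090/-/healthy

# Grafana health check (returns JSON including database status)
curl -s http://localhost:3000/api/health
```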

📊 Default Dashboards

Grafana Dashboards

  1. Kubernetes Cluster Resource Usage - Pre-configured dashboard showing:
    • Cluster CPU and Memory capacity
    • Resource utilization trends
    • Pod and node status

Import Additional Dashboards

Popular dashboard IDs to import from grafana.com:

| ID | Name | Description |
| --- | --- | --- |
| 12119 | Kubernetes Cluster Overview | Complete K8s cluster monitoring |
| 315 | Kubernetes Cluster Monitoring | Detailed cluster metrics |
| 13332 | Kube State Metrics v2 | Kubernetes object state |
| 1860 | Node Exporter Full | Detailed system metrics |
| 13639 | Logs App | Loki log exploration |

How to import (a scripted alternative follows the steps):

  1. Open Grafana → Dashboards → Import
  2. Enter dashboard ID
  3. Select Prometheus data source
  4. Click Import
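
The same import can be scripted against Grafana's HTTP API, which helps make setups repeatable. A sketch for dashboard 1860, assuming the default admin/admin credentials and a Prometheus data source named "Prometheus" (the DS_PROMETHEUS input name matches this dashboard's export; other dashboards may differ):

```bash
# Fetch the latest revision of dashboard 1860 from grafana.com
curl -s https://grafana.com/api/dashboards/1860/revisions/latest/download \
  -o node-exporter-full.json

# Import it, mapping the dashboard's data-source input to our Prometheus
curl -s -u admin:admin -H "Content-Type: application/json" \
  -X POST http://10.10.1.24:30001/api/dashboards/import \
  -d "{\"dashboard\": $(cat node-exporter-full.json), \"overwrite\": true,
       \"inputs\": [{\"name\": \"DS_PROMETHEUS\", \"type\": \"datasource\",
                     \"pluginId\": \"prometheus\", \"value\": \"Prometheus\"}]}"
```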

🚨 Built-in Alerts

Prometheus Alerts

The following alerts are pre-configured:

| Alert | Condition | Severity |
| --- | --- | --- |
| NodeDown | Node unreachable for >5min | Critical |
| NodeHighCPUUsage | CPU usage >80% for 5min | Warning |
| NodeHighMemoryUsage | Memory usage >90% for 5min | Critical |
| KubernetesNodeReady | Node not ready for 10min | Critical |
| KubernetesMemoryPressure | Memory pressure detected | Critical |
| KubernetesPodCrashLooping | Pod restarts >3 times in 1min | Warning |

Alert Configuration

To configure alert channels (email, Slack, Discord, etc.):

  1. Open Grafana → Alerting → Contact points (Grafana 10 uses contact points in place of the legacy notification channels)
  2. Add your preferred notification method
  3. Test the configuration
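
Contact points can also be provisioned from a file rather than the UI. A minimal sketch, assuming a Slack webhook and Grafana's file-based alerting provisioning (files mounted under /etc/grafana/provisioning/alerting/; the names and webhook URL are placeholders):

```yaml
# contact-points.yml (illustrative)
apiVersion: 1
contactPoints:
  - orgId: 1
    name: homelab-slack
    receivers:
      - uid: homelab-slack-1
        type: slack
        settings:
          url: https://hooks.slack.com/services/XXX/YYY/ZZZ
```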

📝 Log Management

Loki Configuration

Loki automatically collects logs from:

  • All Kubernetes pods
  • System containers
  • Application containers

Viewing Logs

  1. Grafana Explore: http://10.10.1.24:30001/explore
  2. Select "Loki" data source
  3. Use LogQL queries:
```logql
# All logs from the monitoring namespace
{namespace="monitoring"}

# Logs from a specific pod
{pod="prometheus-xxxxx"}

# Error logs across the cluster (LogQL requires at least one non-empty matcher)
{namespace=~".+"} |= "error"

# Logs from a specific app
{app="nginx"}
```

⚙️ Configuration

Resource Limits

Default resource limits per component:

```yaml
# Prometheus
resources:
  requests: { cpu: "500m", memory: "512Mi" }
  limits: { cpu: "1000m", memory: "1Gi" }

# Grafana
resources:
  requests: { cpu: "500m", memory: "512Mi" }
  limits: { cpu: "1000m", memory: "1Gi" }

# Loki
resources:
  requests: { cpu: "500m", memory: "512Mi" }
  limits: { cpu: "1000m", memory: "1Gi" }
```
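
These values can be adjusted in place once kubectl top shows actual usage; one illustrative sketch (deployment name taken from the manifests):

```bash
# Raise the Prometheus memory request/limit without editing YAML
kubectl set resources deployment/prometheus -n monitoring \
  --requests=memory=1Gi --limits=memory=2Gi
```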

Persistent Storage

⚠️ Current Setup: Uses emptyDir (ephemeral storage), so metrics, dashboards, and logs are lost whenever a pod is rescheduled

For Production: Configure persistent volumes:

```yaml
# Add to deployments
volumes:
- name: storage
  persistentVolumeClaim:
    claimName: prometheus-pvc
```
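
The referenced claim must exist first. A minimal PVC sketch, assuming K3s's default local-path storage class (size is illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-path   # K3s ships this provisioner by default
  resources:
    requests:
      storage: 10Gi
```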

Data Retention

  • Prometheus: 30 days (configurable)
  • Loki: No retention limit (configurable)
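
Prometheus retention is a container flag, so changing it is a one-line edit in prometheus-deployment.yml (a sketch; the config-file path is the conventional default):

```yaml
args:
  - "--config.file=/etc/prometheus/prometheus.yml"
  - "--storage.tsdb.retention.time=30d"   # raise or lower to fit your disk
```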

🔧 Customization

Adding Custom Metrics

To monitor your applications:

  1. Add metrics endpoint to your app
  2. Annotate your service so Prometheus discovers it (a complete manifest follows below):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```
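
Put together, a Service carrying these annotations might look like the sketch below (my-app and its port are illustrative; this assumes the bundled Prometheus config discovers targets via the standard prometheus.io annotations):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app                     # hypothetical application service
  namespace: default
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: my-app
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080
```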

Custom Dashboards

  1. Create dashboards in Grafana
  2. Export JSON
  3. Store in version control
  4. Import via a ConfigMap (see the sketch after this list)
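
Step 4 can be as simple as packaging the exported JSON into a ConfigMap; how Grafana then loads it depends on how grafana-config.yml mounts dashboards (file and ConfigMap names here are illustrative):

```bash
# Package an exported dashboard as a ConfigMap
kubectl create configmap my-dashboard -n monitoring \
  --from-file=my-dashboard.json
```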

Custom Alerts

Add to prometheus-config.yml:

```yaml
groups:
- name: custom
  rules:
  - alert: HighResponseTime
    # 95th-percentile latency above 2s, assuming the metric is a standard histogram
    expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High response time detected"
```

🩺 Troubleshooting

Common Issues

1. Pods Not Starting

```bash
# Check pod status
kubectl describe pod -n monitoring <pod-name>

# Check logs
kubectl logs -n monitoring <pod-name>
```

2. Prometheus Not Scraping

```bash
# Check Prometheus targets
curl http://10.10.1.24:30000/targets

# Check service discovery
kubectl logs -n monitoring deployment/prometheus
```

3. Grafana Connection Issues

```bash
# Check Grafana logs
kubectl logs -n monitoring deployment/grafana

# Check data source configuration
kubectl get configmap -n monitoring grafana-datasources -o yaml
```

4. Missing Metrics

Common solutions (see also the targets-API check below):

  • Check service annotations
  • Verify network policies
  • Check RBAC permissions
  • Validate port configurations
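
The Prometheus targets API also gives a machine-readable view of what is and isn't being scraped (jq is optional but handy):

```bash
# List each scrape target with its health state and last error
curl -s http://10.10.1.24:30000/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health, lastError: .lastError}'
```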

Debugging Commands

```bash
# Get all monitoring resources
kubectl get all -n monitoring

# Check configurations
kubectl get configmaps -n monitoring

# Check persistent volumes (if used)
kubectl get pv,pvc -n monitoring

# Check resource usage
kubectl top pods -n monitoring
kubectl top nodes
```

🚀 Next Steps

Performance Optimization

  1. Add persistent storage for data retention
  2. Configure resource requests/limits based on usage
  3. Set up node affinity for better distribution
  4. Tune scrape intervals to balance data freshness against collection overhead

Enhanced Security

  1. Enable HTTPS/TLS for Grafana
  2. Configure RBAC properly
  3. Set up authentication (LDAP, OAuth, etc.)
  4. Use secrets for sensitive configuration

Advanced Features

  1. High Availability: Run multiple Prometheus instances
  2. Federation: Connect multiple Prometheus servers
  3. Remote Storage: Use external storage backends
  4. Service Mesh: Integrate with Istio/Linkerd
  5. GitOps: Automate deployments with ArgoCD

Monitoring Expansion

  1. Application Metrics: Add custom application monitoring
  2. Network Monitoring: Add network policy monitoring
  3. Security Monitoring: Add Falco for security events
  4. Cost Monitoring: Add resource cost tracking

🤝 Contributing

Found an issue or want to improve the monitoring setup?

  1. Check existing issues
  2. Create a pull request
  3. Update documentation
  4. Test thoroughly

Happy Monitoring! 📊✨