Complete monitoring solution for your K3s homelab cluster with Prometheus, Grafana, Loki, and more.
This monitoring stack provides comprehensive observability for your K3s homelab:
- 📊 Metrics Collection: Prometheus scrapes metrics from Kubernetes and system components
- 📈 Visualization: Grafana dashboards for beautiful charts and graphs
- 📝 Log Aggregation: Loki collects and indexes logs from all pods and services
- 🚨 Alerting: Built-in alerts for common issues (high CPU, memory, pod crashes)
- 🔍 System Monitoring: Node Exporter provides host-level metrics
- ☸️ Kubernetes Monitoring: Kube State Metrics exposes cluster state
| Component | Purpose | Port | Image |
|---|---|---|---|
| Prometheus | Metrics collection & alerting | 30000 | prom/prometheus:v2.45.0 |
| Grafana | Data visualization | 30001 | grafana/grafana:10.0.3 |
| Loki | Log aggregation | 3100 | grafana/loki:2.9.0 |
| Promtail | Log collection agent | - | grafana/promtail:2.9.0 |
| Node Exporter | System metrics | 9100 | prom/node-exporter:v1.6.1 |
| Kube State Metrics | K8s cluster metrics | 8080 | registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0 |
- K3s cluster running (see main README.md)
- Ansible configured and working
- `kubectl` access to your cluster
```bash
# Deploy the complete monitoring stack
ansible-playbook -i inventory/hosts.yml playbooks/monitoring.yml
```

Alternatively, apply the manifests with kubectl:

```bash
# Apply all manifests at once
kubectl apply -f manifests/monitoring/
# Or apply individually
kubectl apply -f manifests/monitoring/namespace.yml
kubectl apply -f manifests/monitoring/prometheus-rbac.yml
kubectl apply -f manifests/monitoring/prometheus-config.yml
kubectl apply -f manifests/monitoring/prometheus-deployment.yml
kubectl apply -f manifests/monitoring/grafana-config.yml
kubectl apply -f manifests/monitoring/grafana-deployment.yml
kubectl apply -f manifests/monitoring/loki-deployment.yml
kubectl apply -f manifests/monitoring/promtail-deployment.yml
kubectl apply -f manifests/monitoring/node-exporter.yml
kubectl apply -f manifests/monitoring/kube-state-metrics.yml
```

Verify the deployment:

```bash
# Check all pods are running
kubectl get pods -n monitoring
# Check services
kubectl get svc -n monitoring
# Check deployments
kubectl get deployments -n monitoring
```
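Optionally, you can block until every pod reports Ready (the 300s timeout is an arbitrary choice):

```bash
# Wait for all monitoring pods to become Ready
kubectl wait --for=condition=Ready pod --all -n monitoring --timeout=300s
```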
Replace 10.10.1.24 with your K3s server IP address:

| Service | URL | Credentials |
|---|---|---|
| Grafana | http://10.10.1.24:30001 | admin / admin |
| Prometheus | http://10.10.1.24:30000 | No auth |
```bash
# Grafana
kubectl port-forward -n monitoring svc/grafana 3000:3000
# Prometheus
kubectl port-forward -n monitoring svc/prometheus-service 9090:8080
```

Grafana includes a pre-configured Kubernetes Cluster Resource Usage dashboard showing:
- Cluster CPU and Memory capacity
- Resource utilization trends
- Pod and node status
Popular dashboard IDs to import from grafana.com:
| ID | Name | Description |
|---|---|---|
| 12119 | Kubernetes Cluster Overview | Complete K8s cluster monitoring |
| 315 | Kubernetes Cluster Monitoring | Detailed cluster metrics |
| 13332 | Kube State Metrics v2 | Kubernetes object state |
| 1860 | Node Exporter Full | Detailed system metrics |
| 13639 | Logs App | Loki log exploration |
How to import:
- Open Grafana → Dashboards → Import
- Enter dashboard ID
- Select Prometheus data source
- Click Import
The following alerts are pre-configured:
| Alert | Condition | Severity |
|---|---|---|
| NodeDown | Node unreachable for >5min | Critical |
| NodeHighCPUUsage | CPU usage >80% for 5min | Warning |
| NodeHighMemoryUsage | Memory usage >90% for 5min | Critical |
| KubernetesNodeReady | Node not ready for 10min | Critical |
| KubernetesMemoryPressure | Memory pressure detected | Critical |
| KubernetesPodCrashLooping | Pod restart >3 times in 1min | Warning |
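For reference, an alert such as NodeHighCPUUsage is typically expressed in the Prometheus rules along these lines. This is a sketch of the standard Node Exporter idiom, not necessarily the exact rule shipped in prometheus-config.yml:

```yaml
- alert: NodeHighCPUUsage
  # Busy CPU = 100% minus the average idle rate across all cores
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
```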
To configure alert channels (email, Slack, Discord, etc.):
- Open Grafana → Alerting → Contact points (named "Notification channels" in older Grafana releases)
- Add your preferred notification method
- Test the configuration
Loki automatically collects logs from:
- All Kubernetes pods
- System containers
- Application containers
- Grafana Explore: http://10.10.1.24:30001/explore
- Select "Loki" data source
- Use LogQL queries:
```logql
# All logs from monitoring namespace
{namespace="monitoring"}

# Logs from specific pod
{pod="prometheus-xxxxx"}

# Error logs across the cluster (LogQL requires at least one non-empty matcher)
{job=~".+"} |= "error"

# Logs from specific app
{app="nginx"}
```
Default resource limits per component:

```yaml
# Prometheus
resources:
  requests: { cpu: "500m", memory: "500M" }
  limits: { cpu: "1000m", memory: "1Gi" }

# Grafana
resources:
  requests: { cpu: "500m", memory: "500M" }
  limits: { cpu: "1000m", memory: "1Gi" }

# Loki
resources:
  requests: { cpu: "500m", memory: "512Mi" }
  limits: { cpu: "1000m", memory: "1Gi" }
```

All components default to `emptyDir` (ephemeral storage), so metrics and logs do not survive pod rescheduling.
For production, configure persistent volumes:

```yaml
# Add to deployments
volumes:
  - name: storage
    persistentVolumeClaim:
      claimName: prometheus-pvc
```
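The `prometheus-pvc` claim referenced above must exist before the deployment can mount it. A minimal sketch, assuming the `local-path` storage class that K3s ships by default; the 10Gi size is an arbitrary starting point:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-pvc
  namespace: monitoring
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: local-path  # K3s default provisioner; adjust for your cluster
  resources:
    requests:
      storage: 10Gi  # assumption: size according to your retention settings
```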
Default data retention:

- Prometheus: 30 days (configurable)
- Loki: no retention limit (configurable)
To monitor your applications:

- Add a metrics endpoint to your app
- Annotate your service (a complete Service example follows this list):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
```
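In context, the annotations sit on your application's Service object. A sketch for a hypothetical `myapp` Service exposing metrics on port 8080; this only takes effect if the bundled Prometheus scrape config uses the common annotation-based relabeling, so check prometheus-config.yml:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp        # hypothetical application
  namespace: default
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: myapp
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080
```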
To add custom dashboards:

- Create dashboards in Grafana
- Export the dashboard JSON
- Store it in version control
- Import via ConfigMap (see the sketch below)
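A sketch of the ConfigMap route; the names are hypothetical, and it assumes grafana-deployment.yml mounts the ConfigMap where a dashboard provisioning provider is configured to look (e.g. /var/lib/grafana/dashboards):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-myapp   # hypothetical name
  namespace: monitoring
data:
  # myapp.json holds the exported dashboard JSON (abbreviated here)
  myapp.json: |
    { "title": "My App", "uid": "myapp", "panels": [] }
```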
To add custom alert rules, add to `prometheus-config.yml`:

```yaml
groups:
  - name: custom
    rules:
      - alert: HighResponseTime
        expr: http_request_duration_seconds > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
```
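Rule files can be validated before they go live with `promtool`, which ships with Prometheus; the filename here is illustrative:

```bash
# Syntax-check alerting/recording rules locally
promtool check rules custom-rules.yml
```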
If pods are not starting, inspect them directly:

```bash
# Check pod status
kubectl describe pod -n monitoring <pod-name>
# Check logs
kubectl logs -n monitoring <pod-name>
```

If Prometheus is not scraping targets:

```bash
# Check targets via the Prometheus API (the /targets page is the browser UI)
curl http://10.10.1.24:30000/api/v1/targets
# Check service discovery
kubectl logs -n monitoring deployment/prometheus
```

If Grafana dashboards are not loading:

```bash
# Check Grafana logs
kubectl logs -n monitoring deployment/grafana
# Check data source configuration
kubectl get configmap -n monitoring grafana-datasources -o yaml
```

Common solutions:
- Check service annotations
- Verify network policies
- Check RBAC permissions
- Validate port configurations
General debugging commands:

```bash
# Get all monitoring resources
kubectl get all -n monitoring
# Check configurations
kubectl get configmaps -n monitoring
# Check persistent volumes (if used)
kubectl get pv,pvc -n monitoring
# Check resource usage
kubectl top pods -n monitoring
kubectl top nodes
```

Performance tuning recommendations:

- Add persistent storage for data retention
- Configure resource requests/limits based on usage
- Set up node affinity for better distribution
- Tune scrape intervals to balance data freshness against load (example below)
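Scrape and evaluation intervals live in the `global` block of the Prometheus configuration; the values below are illustrative, not the stack's defaults:

```yaml
global:
  scrape_interval: 30s      # how often targets are scraped (Prometheus default: 1m)
  evaluation_interval: 30s  # how often alerting rules are evaluated
```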
Security hardening recommendations:

- Enable HTTPS/TLS for Grafana
- Configure RBAC properly
- Set up authentication (LDAP, OAuth, etc.)
- Use Secrets for sensitive configuration (example below)
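For example, the default admin / admin login can be replaced by wiring a Secret into the Grafana container through Grafana's GF_SECURITY_ADMIN_PASSWORD environment variable; the Secret name here is an assumption:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: grafana-admin   # hypothetical name
  namespace: monitoring
type: Opaque
stringData:
  admin-password: change-me   # replace before applying

# Then, in grafana-deployment.yml, reference it from the container spec:
# env:
#   - name: GF_SECURITY_ADMIN_PASSWORD
#     valueFrom:
#       secretKeyRef:
#         name: grafana-admin
#         key: admin-password
```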
- High Availability: Run multiple Prometheus instances
- Federation: Connect multiple Prometheus servers
- Remote Storage: Use external storage backends
- Service Mesh: Integrate with Istio/Linkerd
- GitOps: Automate deployments with ArgoCD
- Application Metrics: Add custom application monitoring
- Network Monitoring: Add network policy monitoring
- Security Monitoring: Add Falco for security events
- Cost Monitoring: Add resource cost tracking
- Prometheus Documentation
- Grafana Documentation
- Loki Documentation
- Kubernetes Monitoring Best Practices
- PromQL Tutorial
- LogQL Tutorial
Found an issue or want to improve the monitoring setup?
- Check existing issues
- Create a pull request
- Update documentation
- Test thoroughly
Happy Monitoring! 📊✨