feat(monitoring): add Prometheus alerting rules for production deployments #245
…ments

Add 22 production-ready alert rules across 8 groups:
- Availability: service down, high error rate, no traffic
- Performance: p95/p99 latency, memory API latency, slow inference
- Resources: memory/CPU usage, active request saturation
- Cache: L1/L2 hit rate degradation
- Decay engine: stalled cycles, slow cycles, high expiry rate
- Storage: MinIO down, MinIO latency
- Cluster: node count, replica count (HA-only, documented)
- Prometheus: target down, scrape errors, storage usage

Changes:
- New monitoring/alerting-rules.yml with all alert definitions
- Updated monitoring/prometheus.yml to reference rule_files
- Updated all compose files (monitoring, single-node, HA) to mount alerting-rules.yml into the Prometheus container

Tuning notes included: `for` durations, container memory limit dependency, HA-only cluster alerts flagged for single-node users.
🤖 [Agent: CTO] Code Review — Prometheus Alerting Rules Assessment: Well-structured alerting rules with proper severity levels, flap prevention ( Technical observations:
Status: This PR is from an external contributor and adds new monitoring features to a public repo. Per governance policy, this requires explicit founder approval before merge. CI status also needs verification once checks complete.
ferhimedamine
left a comment
🤖 [Agent: CTO] — Deep Technical Review
Verdict: Approved ✅
Excellent contribution: this fills a genuine gap in our monitoring stack. We had dashboards (observability) but zero alerting. These 22 rules across 8 groups are well-structured, use appropriate `for` durations to avoid flapping, and follow Prometheus best practices for severity tiering.
What works well
- Alert grouping is clean and logical: availability, performance, resources, cache, decay engine, storage, cluster, and self-monitoring.
- `for` durations are well-tuned: 1m for hard-down (`DakeraDown`, `MinIODown`), 5-15m for degraded state, 30m for cache/decay slow-burn issues. This prevents alert storms on transient spikes.
- Annotations are operator-friendly: they include the actual values (`{{ $value | humanizePercentage }}`) and thresholds, so on-call doesn't need to memorize numbers.
- HA-awareness: Cluster alerts correctly assume 3-node HA and the header documents the single-node silencing path.
- Compose mounts are consistent across all 3 compose files: `monitoring/docker-compose.yml`, `docker/docker-compose.yml`, and `docker/docker-compose.ha.yml`.
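
For illustration, the pattern described in the list above (a short `for` window for hard-down alerts, a longer one for degraded states, severity labels, and annotations that render the live value) might look like the sketch below. The alert name `DakeraDown` comes from this PR. The `job="dakera"` selector, the `DakeraHighErrorRate` name, the `dakera_http_requests_total` metric, and the 5% threshold are assumptions made for the example, not the actual contents of `alerting-rules.yml`.

```yaml
groups:
  - name: availability
    rules:
      # Hard-down alert: short `for` window (1m) so real outages page quickly.
      - alert: DakeraDown
        expr: up{job="dakera"} == 0   # job label is an assumption
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Dakera instance {{ $labels.instance }} is down"

      # Degraded-state alert: longer `for` window to ride out transient spikes.
      # Metric name and threshold are hypothetical.
      - alert: DakeraHighErrorRate
        expr: |
          sum(rate(dakera_http_requests_total{status=~"5.."}[5m]))
            /
          sum(rate(dakera_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error rate is {{ $value | humanizePercentage }} (threshold 5%)"
```

The condition in `expr` must hold continuously for the whole `for` window before the alert moves from pending to firing, which is what suppresses flapping.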
Technical notes for future iterations
A few metrics referenced here aren't emitted by the current server version — these alerts will stay safely dormant (empty vectors = no fire) and will activate automatically when we add the metrics:
| Alert | Metric | Status |
|---|---|---|
| `DakeraDecayCycleSlow` | `dakera_decay_cycle_duration_seconds` | Not yet emitted; we track `dakera_decay_run_total` but not per-cycle duration |
| `DakeraClusterDegraded/Offline` | `dakera_cluster_nodes_total` | Not emitted; we have `dakera_replica_count` (gauge) which could be used instead |
| `DakeraL2CacheHitRateLow` | `dakera_l2_cache_hits_total` / `dakera_l2_cache_misses_total` | Not emitted; we have `dakera_cache_hits_total` / `dakera_cache_misses_total` (L1), no L2-specific counters yet |
| `DakeraHighMemoryUsage/Critical` | `container_spec_memory_limit_bytes` | Requires cAdvisor, which is not currently in the compose stack |
One PromQL note: the resource alerts use `/ on(instance) group()`. In standard PromQL binary operations, this should be `/ on(instance) group_left()` for many-to-one matching. `group()` is a valid aggregation operator but not a vector matching modifier. I'll address this in a fast follow-up.
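
A minimal sketch of the corrected matching, assuming the left-hand side is a per-process memory gauge and that both series carry the same `instance` label. `process_resident_memory_bytes`, the `job` selector, the 0.85 threshold, and the `for` value are placeholders for the example, not the expression used in the PR.

```yaml
# Many-to-one division: several left-hand series per instance may match
# the single right-hand series for that instance.
- alert: DakeraHighMemoryUsage
  expr: |
    (
      process_resident_memory_bytes{job="dakera"}
        / on(instance) group_left()
      max by (instance) (container_spec_memory_limit_bytes)
    ) > 0.85
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Memory usage is {{ $value | humanizePercentage }} of the container limit"
```

The `max by (instance)` wrapper is a design choice in this sketch: it guarantees the right-hand side has exactly one series per `instance`, which many-to-one matching requires on the "one" side.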
The `DakeraInferenceSlow` alert correctly references the `operation` label: our server emits `dakera_inference_duration_seconds{operation="embed"}`, so this will work as-is.
Merging
Founder has approved. I'll merge this now and create a small follow-up to:
- Fix the `group()` → `group_left()` PromQL syntax
- Swap `dakera_cluster_nodes_total` → `dakera_replica_count` (which we already emit)
- Add inline comments marking which alerts are "future-ready" (dormant until metrics are added)
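
As a rough sketch of the metric swap, a cluster alert built on the already-emitted `dakera_replica_count` gauge could look like this. The expected count of 3 follows the 3-node HA assumption mentioned in this review. The `for` duration, severity, and annotation text are assumptions, and the actual follow-up may differ.

```yaml
groups:
  - name: cluster
    rules:
      # HA-only: single-node deployments should silence this alert.
      - alert: DakeraClusterDegraded
        expr: dakera_replica_count < 3   # 3 = expected HA node count
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replica count is {{ $value }} (expected 3)"
```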
Thank you @extremecoder-rgb for this solid contribution! The alert coverage is comprehensive and the structure is production-grade. 🙌
Thank you @extremecoder-rgb for this excellent contribution! 🙌 Your alerting rules are now live on main. I also created a small follow-up (#246, merged) to align a few metric names with what the Dakera server currently emits and fix a PromQL syntax detail. Nothing to do on your end. Your work fills a genuine gap in our monitoring stack: we had dashboards but zero alerting. The severity tiering, flap-resistant

Thanks 👍
Adds 22 production-ready Prometheus alerting rules across 8 groups to the Dakera monitoring stack. Previously the monitoring setup only had dashboards (observability) but no alerting — this fills that gap.
Changes
- `monitoring/alerting-rules.yml`: alert definitions
- `monitoring/prometheus.yml`: added `rule_files` directive
- `monitoring/docker-compose.yml`, `docker/docker-compose.yml`, `docker/docker-compose.ha.yml`: mount `alerting-rules.yml` into Prometheus container

Alert Groups
Notes
- Cluster alerts (including `DakeraClusterDegraded`) assume HA mode (3 nodes). Single-node deployments should silence these via Alertmanager; this is documented in the file header.
- Memory alerts require `deploy.resources.limits.memory` to be set in compose for `container_spec_memory_limit_bytes` to be available. This is already configured in all compose files.
- `for` durations are tuned to avoid alert flapping on transient spikes.
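
The wiring described under Changes and Notes can be sketched roughly as follows. The file names come from this PR. The container paths, service names, image, and the `2G` limit are assumptions for illustration and may not match the actual compose files.

```yaml
# monitoring/prometheus.yml (fragment)
rule_files:
  - /etc/prometheus/alerting-rules.yml   # container path is an assumption
---
# docker-compose fragment (service names and image are hypothetical)
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      # Mount the rules file at the path referenced by rule_files above.
      - ./alerting-rules.yml:/etc/prometheus/alerting-rules.yml:ro
  dakera:
    deploy:
      resources:
        limits:
          memory: 2G   # placeholder value for the memory-limit dependency
```

The path listed under `rule_files` has to match the mount target inside the Prometheus container, otherwise Prometheus starts without loading any rules.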