Skip to content

feat(monitoring): add Prometheus alerting rules for production deployments#245

Merged
ferhimedamine merged 1 commit into
Dakera-AI:mainfrom
extremecoder-rgb:feature/add-prometheus-alerting-rules
Jun 26, 2026
Merged

feat(monitoring): add Prometheus alerting rules for production deployments#245
ferhimedamine merged 1 commit into
Dakera-AI:mainfrom
extremecoder-rgb:feature/add-prometheus-alerting-rules

Conversation

@extremecoder-rgb

Copy link
Copy Markdown
Contributor

Adds 22 production-ready Prometheus alerting rules across 8 groups to the Dakera monitoring stack. Previously the monitoring setup only had dashboards (observability) but no alerting — this fills that gap.

Changes

  • New file: monitoring/alerting-rules.yml — alert definitions
  • Modified: monitoring/prometheus.yml — added rule_files directive
  • Modified: monitoring/docker-compose.yml, docker/docker-compose.yml, docker/docker-compose.ha.yml — mount alerting-rules.yml into Prometheus container

Alert Groups

Group Alerts Severity Range
Availability DakeraDown, HighErrorRate, NoTraffic warning / critical
Performance HighLatencyP95, HighLatencyP99, MemoryApiLatency, InferenceSlow warning / critical
Resources HighMemoryUsage, CriticalMemoryUsage, HighCpuUsage, HighActiveRequests warning / critical
Cache L1HitRateLow, L2HitRateLow info / warning
Decay Engine Stalled, SlowCycle, HighExpiryRate warning / info
Storage MinIODown, MinIOHighLatency warning / critical
Cluster ClusterDegraded, ClusterOffline, ReplicaCountLow warning / critical
Prometheus TargetDown, ScrapeErrors, StorageNearLimit info / warning

Notes

  • Cluster alerts (DakeraClusterDegraded) assume HA mode (3 nodes). Single-node deployments should silence these via Alertmanager — documented in the file header.
  • Resource alerts require deploy.resources.limits.memory to be set in compose for container_spec_memory_limit_bytes to be available. This is already configured in all compose files.
  • All for durations are tuned to avoid alert flapping on transient spikes.

…ments

Add 22 production-ready alert rules across 8 groups:
- Availability: service down, high error rate, no traffic
- Performance: p95/p99 latency, memory API latency, slow inference
- Resources: memory/CPU usage, active request saturation
- Cache: L1/L2 hit rate degradation
- Decay engine: stalled cycles, slow cycles, high expiry rate
- Storage: MinIO down, MinIO latency
- Cluster: node count, replica count (HA-only, documented)
- Prometheus: target down, scrape errors, storage usage

Changes:
- New monitoring/alerting-rules.yml with all alert definitions
- Updated monitoring/prometheus.yml to reference rule_files
- Updated all compose files (monitoring, single-node, HA) to
  mount alerting-rules.yml into the Prometheus container

Tuning notes included: for durations, container memory limit
dependency, HA-only cluster alerts flagged for single-node users.
@ferhimedamine

Copy link
Copy Markdown
Contributor

🤖 [Agent: CTO] Code Review — Prometheus Alerting Rules

Assessment: Well-structured alerting rules with proper severity levels, flap prevention (for durations), and comprehensive coverage. Good work.

Technical observations:

  1. Metric name verification needed: Several referenced metrics (dakera_cache_hits_total, dakera_l2_cache_hits_total, dakera_decay_run_total, dakera_decay_cycle_duration_seconds, dakera_cluster_nodes_total, dakera_replica_count) should be verified against our actual /metrics endpoint export. Some may not exist yet — particularly cluster/replication metrics.

  2. Hardcoded CPU limit: The DakeraHighCpuUsage annotation says "container limit is 4 cores" — this is deployment-specific and should note it's configurable or use a query against container_spec_cpu_quota.

  3. Cluster alerts: DakeraClusterDegraded and DakeraClusterOffline assume HA mode (3 nodes). The header comment correctly notes single-node deployments should silence these, but consider adding a dakera_cluster_enabled guard expression.

  4. Good patterns: Tiered severities (warning → critical), info-level for expected high-volume events (memory expiry), and the cache hit rate recommendation to increase DAKERA_L1_CACHE_SIZE are helpful.

Status: This PR is from an external contributor and adds new monitoring features to a public repo. Per governance policy, this requires explicit founder approval before merge. Adding founder-review-required label.

CI status needs verification once checks complete.

@ferhimedamine ferhimedamine added the founder-review-required Requires explicit founder approval before merge label Jun 25, 2026

@ferhimedamine ferhimedamine left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 [Agent: CTO] — Deep Technical Review

Verdict: Approved

Excellent contribution — this fills a genuine gap in our monitoring stack. We had dashboards (observability) but zero alerting. These 22 rules across 8 groups are well-structured, use appropriate for durations to avoid flapping, and follow Prometheus best practices for severity tiering.

What works well

  • Alert grouping is clean and logical — availability, performance, resources, cache, decay engine, storage, cluster, and self-monitoring.
  • for durations are well-tuned: 1m for hard-down (DakeraDown, MinIODown), 5-15m for degraded state, 30m for cache/decay slow-burn issues. This prevents alert storms on transient spikes.
  • Annotations are operator-friendly — they include the actual values ({{ $value | humanizePercentage }}) and thresholds, so on-call doesn't need to memorize numbers.
  • HA-awareness: Cluster alerts correctly assume 3-node HA and the header documents the single-node silencing path.
  • Compose mounts are consistent across all 3 compose files — monitoring/docker-compose.yml, docker/docker-compose.yml, and docker/docker-compose.ha.yml.

Technical notes for future iterations

A few metrics referenced here aren't emitted by the current server version — these alerts will stay safely dormant (empty vectors = no fire) and will activate automatically when we add the metrics:

Alert Metric Status
DakeraDecayCycleSlow dakera_decay_cycle_duration_seconds Not yet emitted — we track dakera_decay_run_total but not per-cycle duration
DakeraClusterDegraded/Offline dakera_cluster_nodes_total Not emitted — we have dakera_replica_count (gauge) which could be used instead
DakeraL2CacheHitRateLow dakera_l2_cache_hits_total / dakera_l2_cache_misses_total Not emitted — we have dakera_cache_hits_total / dakera_cache_misses_total (L1), no L2-specific counters yet
DakeraHighMemoryUsage/Critical container_spec_memory_limit_bytes Requires cAdvisor — not currently in the compose stack

One PromQL note: the resource alerts use / on(instance) group() — in standard PromQL binary operations, this should be / on(instance) group_left() for many-to-one matching. group() is a valid aggregation operator but not a vector matching modifier. I'll address this in a fast follow-up.

The DakeraInferenceSlow alert correctly references the operation label — our server emits dakera_inference_duration_seconds{operation="embed"}, so this will work as-is.

Merging

Founder has approved. I'll merge this now and create a small follow-up to:

  1. Fix the group()group_left() PromQL syntax
  2. Swap dakera_cluster_nodes_totaldakera_replica_count (which we already emit)
  3. Add inline comments marking which alerts are "future-ready" (dormant until metrics are added)

Thank you @extremecoder-rgb for this solid contribution! The alert coverage is comprehensive and the structure is production-grade. 🙌

@ferhimedamine ferhimedamine added the agent/cto Reviewed/merged by CTO agent label Jun 26, 2026
@ferhimedamine ferhimedamine merged commit 95d636e into Dakera-AI:main Jun 26, 2026
@ferhimedamine

Copy link
Copy Markdown
Contributor

Thank you @extremecoder-rgb for this excellent contribution! 🙌

Your alerting rules are now live on main. I also created a small follow-up (#246, merged) to align a few metric names with what the Dakera server currently emits and fix a PromQL syntax detail — nothing to do on your end.

Your work fills a genuine gap in our monitoring stack — we had dashboards but zero alerting. The severity tiering, flap-resistant for durations, and operator-friendly annotations are all production-grade. Great job!

@extremecoder-rgb

Copy link
Copy Markdown
Contributor Author

Thanks 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent/cto Reviewed/merged by CTO agent founder-review-required Requires explicit founder approval before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants