feat(monitoring): add Prometheus alerting rules for production deployments by extremecoder-rgb · Pull Request #245 · Dakera-AI/dakera-deploy

extremecoder-rgb · 2026-06-25T09:09:13Z

Adds 22 production-ready Prometheus alerting rules across 8 groups to the Dakera monitoring stack. Previously the monitoring setup only had dashboards (observability) but no alerting — this fills that gap.

Changes

New file: monitoring/alerting-rules.yml — alert definitions
Modified: monitoring/prometheus.yml — added rule_files directive
Modified: monitoring/docker-compose.yml, docker/docker-compose.yml, docker/docker-compose.ha.yml — mount alerting-rules.yml into Prometheus container
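
The wiring described in this list can be sketched roughly as follows. The container-side path `/etc/prometheus/alerting-rules.yml`, the service name `prometheus`, and the host-side relative paths are assumptions for illustration; the actual values in the PR may differ.

```yaml
# monitoring/prometheus.yml (fragment): tell Prometheus where to load rules from.
# The path is the assumed mount point inside the container.
rule_files:
  - /etc/prometheus/alerting-rules.yml
---
# docker-compose fragment: bind-mount the rules file read-only into Prometheus.
# Service name and host-side relative paths are assumptions.
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./alerting-rules.yml:/etc/prometheus/alerting-rules.yml:ro
```

A rules file like this can be validated before deployment with `promtool check rules alerting-rules.yml`.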

Alert Groups

| Group | Alerts | Severity Range |
|---|---|---|
| Availability | DakeraDown, HighErrorRate, NoTraffic | warning / critical |
| Performance | HighLatencyP95, HighLatencyP99, MemoryApiLatency, InferenceSlow | warning / critical |
| Resources | HighMemoryUsage, CriticalMemoryUsage, HighCpuUsage, HighActiveRequests | warning / critical |
| Cache | L1HitRateLow, L2HitRateLow | info / warning |
| Decay Engine | Stalled, SlowCycle, HighExpiryRate | warning / info |
| Storage | MinIODown, MinIOHighLatency | warning / critical |
| Cluster | ClusterDegraded, ClusterOffline, ReplicaCountLow | warning / critical |
| Prometheus | TargetDown, ScrapeErrors, StorageNearLimit | info / warning |

Notes

Cluster alerts (DakeraClusterDegraded) assume HA mode (3 nodes). Single-node deployments should silence these via Alertmanager — documented in the file header.
Resource alerts require deploy.resources.limits.memory to be set in compose for container_spec_memory_limit_bytes to be available. This is already configured in all compose files.
All `for` durations are tuned to avoid alert flapping on transient spikes.
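
For single-node deployments, one possible way to suppress the HA-only alerts permanently is an Alertmanager route that sends them to a receiver with no integrations. This is a routing-based sketch rather than a time-limited silence, and it is not necessarily the approach documented in the file header; the receiver names are hypothetical.

```yaml
# alertmanager.yml (fragment, hypothetical): drop HA-only alerts on single-node installs.
route:
  receiver: default
  routes:
    # Matches DakeraClusterDegraded and DakeraClusterOffline by alert name.
    - matchers:
        - alertname =~ "DakeraCluster.*"
      receiver: "null"
receivers:
  - name: default   # real notification integrations would be configured here
  - name: "null"    # no integrations: alerts routed here are discarded
```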

…ments

Add 22 production-ready alert rules across 8 groups:
- Availability: service down, high error rate, no traffic
- Performance: p95/p99 latency, memory API latency, slow inference
- Resources: memory/CPU usage, active request saturation
- Cache: L1/L2 hit rate degradation
- Decay engine: stalled cycles, slow cycles, high expiry rate
- Storage: MinIO down, MinIO latency
- Cluster: node count, replica count (HA-only, documented)
- Prometheus: target down, scrape errors, storage usage

Changes:
- New monitoring/alerting-rules.yml with all alert definitions
- Updated monitoring/prometheus.yml to reference rule_files
- Updated all compose files (monitoring, single-node, HA) to mount alerting-rules.yml into the Prometheus container

Tuning notes included: `for` durations, container memory limit dependency, HA-only cluster alerts flagged for single-node users.

ferhimedamine · 2026-06-25T09:13:26Z

🤖 [Agent: CTO] Code Review — Prometheus Alerting Rules

Assessment: Well-structured alerting rules with proper severity levels, flap prevention (`for` durations), and comprehensive coverage. Good work.

Technical observations:

Metric name verification needed: Several referenced metrics (dakera_cache_hits_total, dakera_l2_cache_hits_total, dakera_decay_run_total, dakera_decay_cycle_duration_seconds, dakera_cluster_nodes_total, dakera_replica_count) should be verified against our actual /metrics endpoint export. Some may not exist yet — particularly cluster/replication metrics.
Hardcoded CPU limit: The DakeraHighCpuUsage annotation says "container limit is 4 cores" — this is deployment-specific and should note it's configurable or use a query against container_spec_cpu_quota.
Cluster alerts: DakeraClusterDegraded and DakeraClusterOffline assume HA mode (3 nodes). The header comment correctly notes single-node deployments should silence these, but consider adding a dakera_cluster_enabled guard expression.
Good patterns: Tiered severities (warning → critical), info-level for expected high-volume events (memory expiry), and the cache hit rate recommendation to increase DAKERA_L1_CACHE_SIZE are helpful.
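
The two suggestions above (deriving the CPU limit from the container spec, and guarding cluster alerts) might look something like the sketch below. It assumes cAdvisor-style container metrics and a container label `name="dakera"`; the thresholds and `for` values are placeholders, and `dakera_cluster_enabled` is the guard metric proposed in this review, not one the server is known to emit.

```yaml
groups:
  - name: dakera-review-sketches   # hypothetical group name
    rules:
      # CPU usage relative to the container's own CFS quota,
      # instead of a hardcoded "4 cores" in the annotation.
      - alert: DakeraHighCpuUsage
        expr: |
          sum by (instance, name) (rate(container_cpu_usage_seconds_total{name="dakera"}[5m]))
            /
          max by (instance, name) (
            container_spec_cpu_quota{name="dakera"} / container_spec_cpu_period{name="dakera"}
          ) > 0.8
        for: 10m   # placeholder duration
        labels:
          severity: warning

      # Cluster alert that only evaluates when clustering is switched on.
      # dakera_cluster_enabled is a proposed, not yet existing, metric.
      - alert: DakeraClusterDegraded
        expr: (dakera_cluster_nodes_total < 3) and on(instance) (dakera_cluster_enabled == 1)
        for: 5m    # placeholder duration
        labels:
          severity: warning
```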

Status: This PR is from an external contributor and adds new monitoring features to a public repo. Per governance policy, this requires explicit founder approval before merge. Adding founder-review-required label.

CI status needs verification once checks complete.

ferhimedamine

🤖 [Agent: CTO] — Deep Technical Review

Verdict: Approved ✅

Excellent contribution: this fills a genuine gap in our monitoring stack. We had dashboards (observability) but zero alerting. These 22 rules across 8 groups are well-structured, use appropriate `for` durations to avoid flapping, and follow Prometheus best practices for severity tiering.

What works well

Alert grouping is clean and logical — availability, performance, resources, cache, decay engine, storage, cluster, and self-monitoring.
`for` durations are well-tuned: 1m for hard-down (DakeraDown, MinIODown), 5-15m for degraded state, 30m for cache/decay slow-burn issues. This prevents alert storms on transient spikes.
Annotations are operator-friendly — they include the actual values ({{ $value | humanizePercentage }}) and thresholds, so on-call doesn't need to memorize numbers.
HA-awareness: Cluster alerts correctly assume 3-node HA and the header documents the single-node silencing path.
Compose mounts are consistent across all 3 compose files — monitoring/docker-compose.yml, docker/docker-compose.yml, and docker/docker-compose.ha.yml.
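
As an illustration of the patterns praised above (a long `for` window for slow-burn issues and annotations that embed the live value), a cache rule could be shaped roughly like this. The alert name, the 50% threshold, and the severity are assumptions; only the metric names, the 30m window, and the `DAKERA_L1_CACHE_SIZE` hint come from this thread.

```yaml
groups:
  - name: dakera-cache-sketch   # hypothetical group name
    rules:
      - alert: DakeraL1CacheHitRateLow   # assumed full alert name
        # Hit ratio = hits / (hits + misses), computed over 5-minute rates.
        expr: |
          sum(rate(dakera_cache_hits_total[5m]))
            /
          (sum(rate(dakera_cache_hits_total[5m])) + sum(rate(dakera_cache_misses_total[5m])))
            < 0.5
        for: 30m   # slow-burn condition, so a long window avoids flapping
        labels:
          severity: warning
        annotations:
          summary: "L1 cache hit rate low"
          description: "Hit rate is {{ $value | humanizePercentage }} (threshold: 50%). Consider increasing DAKERA_L1_CACHE_SIZE."
```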

Technical notes for future iterations

A few metrics referenced here aren't emitted by the current server version. These alerts will stay safely dormant (an expression over a missing metric returns an empty vector, so it never fires) and will activate automatically when we add the metrics:

| Alert | Metric | Status |
|---|---|---|
| `DakeraDecayCycleSlow` | `dakera_decay_cycle_duration_seconds` | Not yet emitted; we track `dakera_decay_run_total` but not per-cycle duration |
| `DakeraClusterDegraded/Offline` | `dakera_cluster_nodes_total` | Not emitted; we have `dakera_replica_count` (gauge) which could be used instead |
| `DakeraL2CacheHitRateLow` | `dakera_l2_cache_hits_total` / `dakera_l2_cache_misses_total` | Not emitted; we have `dakera_cache_hits_total` / `dakera_cache_misses_total` (L1), no L2-specific counters yet |
| `DakeraHighMemoryUsage/Critical` | `container_spec_memory_limit_bytes` | Requires cAdvisor, which is not currently in the compose stack |

One PromQL note: the resource alerts use `/ on(instance) group()`. For many-to-one matching in a PromQL binary operation, this should be `/ on(instance) group_left()`. `group` is a valid aggregation operator but not a vector matching modifier. I'll address this in a fast follow-up.
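
A corrected memory expression might look like the sketch below. It assumes cAdvisor-style metrics, a container label `name="dakera"` that leaves exactly one limit series per instance, and a placeholder 85% threshold; the real rule in the PR may select series differently.

```yaml
groups:
  - name: dakera-resources-sketch   # hypothetical group name
    rules:
      - alert: DakeraHighMemoryUsage
        # group_left() is the vector matching modifier: the left-hand side may
        # have many series per instance, the right-hand side must have one.
        expr: |
          container_memory_usage_bytes{name="dakera"}
            / on(instance) group_left()
          container_spec_memory_limit_bytes{name="dakera"}
            > 0.85
        for: 10m   # placeholder duration
        labels:
          severity: warning
```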

The DakeraInferenceSlow alert correctly references the operation label — our server emits dakera_inference_duration_seconds{operation="embed"}, so this will work as-is.
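
For reference, a rule keyed on that label could be sketched as below. This assumes `dakera_inference_duration_seconds` is exported as a Prometheus histogram (so a `_bucket` series exists), which this thread does not confirm; the p95 quantile, the 1-second threshold, and the `for` value are placeholders.

```yaml
groups:
  - name: dakera-performance-sketch   # hypothetical group name
    rules:
      - alert: DakeraInferenceSlow
        # p95 embed latency over the last 5 minutes (assumes a histogram metric).
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (rate(dakera_inference_duration_seconds_bucket{operation="embed"}[5m]))
          ) > 1
        for: 10m   # placeholder duration
        labels:
          severity: warning
```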

Merging

Founder has approved. I'll merge this now and create a small follow-up to:

Fix the group() → group_left() PromQL syntax
Swap dakera_cluster_nodes_total → dakera_replica_count (which we already emit)
Add inline comments marking which alerts are "future-ready" (dormant until metrics are added)
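
The last two follow-up items could take a shape like the sketch below. It is not the content of the actual follow-up: the `< 3` threshold assumes `dakera_replica_count` counts all nodes in a 3-node HA setup, and the group name, `FUTURE-READY` comment wording, L2 threshold, and `for` values are illustrative.

```yaml
groups:
  - name: dakera-followup-sketch   # hypothetical group name
    rules:
      # Uses dakera_replica_count (already emitted) in place of
      # dakera_cluster_nodes_total (not emitted).
      - alert: DakeraClusterDegraded
        expr: dakera_replica_count < 3
        for: 5m
        labels:
          severity: warning

      # FUTURE-READY: the dakera_l2_cache_* counters are not emitted yet, so this
      # expression returns an empty vector and the alert stays dormant.
      - alert: DakeraL2CacheHitRateLow
        expr: |
          sum(rate(dakera_l2_cache_hits_total[5m]))
            /
          (sum(rate(dakera_l2_cache_hits_total[5m])) + sum(rate(dakera_l2_cache_misses_total[5m])))
            < 0.5
        for: 30m
        labels:
          severity: warning
```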

Thank you @extremecoder-rgb for this solid contribution! The alert coverage is comprehensive and the structure is production-grade. 🙌

ferhimedamine · 2026-06-26T15:53:34Z

Thank you @extremecoder-rgb for this excellent contribution! 🙌

Your alerting rules are now live on main. I also created a small follow-up (#246, merged) to align a few metric names with what the Dakera server currently emits and fix a PromQL syntax detail — nothing to do on your end.

Your work fills a genuine gap in our monitoring stack: we had dashboards but zero alerting. The severity tiering, flap-resistant `for` durations, and operator-friendly annotations are all production-grade. Great job!

extremecoder-rgb · 2026-06-26T16:33:19Z

Thanks 👍

ferhimedamine added the founder-review-required label (Requires explicit founder approval before merge) Jun 25, 2026

ferhimedamine approved these changes Jun 26, 2026


ferhimedamine added the agent/cto label (Reviewed/merged by CTO agent) Jun 26, 2026

ferhimedamine merged commit 95d636e into Dakera-AI:main Jun 26, 2026

ferhimedamine mentioned this pull request Jun 26, 2026

fix(monitoring): correct PromQL syntax and align metric names #246

Merged

3 tasks

ferhimedamine merged 1 commit into Dakera-AI:main from extremecoder-rgb:feature/add-prometheus-alerting-rules