Commit 4513944

Merge branch 'main' into feat/57-duplicate-discussion-detection
2 parents 119645e + 8b81fda commit 4513944

2 files changed

Lines changed: 125 additions & 0 deletions

deploy/prometheus/README.md

Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
# Prometheus Alerting Rules

Alerting rules for the Decree service, covering DB pool exhaustion, cache health,
pub/sub reliability, rate limiting, JWKS refresh, and CEL validation.

## Loading the rules

Add a `rule_files` entry to your `prometheus.yml`:

```yaml
rule_files:
  - /path/to/deploy/prometheus/alerts.yaml
```

Prometheus reloads rule files on `SIGHUP` or on a POST to the `/-/reload` HTTP
endpoint (the endpoint requires `--web.enable-lifecycle`).

## Metrics source

Metrics are emitted by the Decree server using the **OpenTelemetry Go SDK** and
exported to Prometheus via the OTel Prometheus exporter (configured in
`deploy/otel-collector.yaml`).

### OTel-to-Prometheus name translation

OTel metric names use `.` as a separator; Prometheus uses `_`. The exporter
translates automatically:

| OTel name | Prometheus name |
|-----------|-----------------|
| `db.pool.acquired_connections` | `db_pool_acquired_connections` |
| `config.cache.hits` | `config_cache_hits_total` |

Counter instruments also receive a `_total` suffix per the Prometheus
exposition format convention.
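
The two translation rules above can be illustrated with a minimal Go sketch. The `promName` function is hypothetical and written only for this example; it is not the exporter's API, and the real exporter may apply further rules (for example unit suffixes or character sanitization) that are not modeled here.

```go
package main

import (
	"fmt"
	"strings"
)

// promName is a simplified, hypothetical stand-in for the exporter's name
// translation. It replaces "." with "_" and appends "_total" to counters
// that do not already end with that suffix. Nothing else is handled.
func promName(otelName string, isCounter bool) string {
	name := strings.ReplaceAll(otelName, ".", "_")
	if isCounter && !strings.HasSuffix(name, "_total") {
		name += "_total"
	}
	return name
}

func main() {
	// Gauge-style instrument: only the separator changes.
	fmt.Println(promName("db.pool.acquired_connections", false))
	// Counter instrument: separator changes and "_total" is appended.
	fmt.Println(promName("config.cache.hits", true))
}
```

When an alert expression returns no data, checking the metric name against this pattern is a reasonable first step, since the rules in `alerts.yaml` reference the translated Prometheus names rather than the OTel names.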

## Alert groups

| Group | Alerts |
|-------|--------|
| `decree.db` | `DecreeDBPoolExhaustion` (critical), `DecreeDBPoolHighUtilization` (warning) |
| `decree.cache` | `DecreeCacheMissRateHigh` (warning) |
| `decree.reliability` | `DecreePubSubDropped` (warning), `DecreeRateLimitRejectionHigh` (warning), `DecreeJWKSRefreshFailing` (critical) |
| `decree.validation` | `DecreeCELCostCapExceeded` (warning) |

> **Alpha**: Decree is pre-production software. Thresholds and alert definitions
> are subject to change.

deploy/prometheus/alerts.yaml

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
groups:
  - name: decree.db
    rules:
      - alert: DecreeDBPoolExhaustion
        expr: db_pool_acquired_connections / db_pool_max_connections > 0.95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "DB pool near exhaustion"
          description: "DB {{ $labels.pool }} pool at {{ $value | humanizePercentage }} utilization"

      - alert: DecreeDBPoolHighUtilization
        expr: db_pool_acquired_connections / db_pool_max_connections > 0.80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DB pool utilization high"
          description: "DB {{ $labels.pool }} pool at {{ $value | humanizePercentage }} utilization"

  - name: decree.cache
    rules:
      - alert: DecreeCacheMissRateHigh
        expr: >
          (
            rate(config_cache_misses_total[5m])
            /
            (rate(config_cache_hits_total[5m]) + rate(config_cache_misses_total[5m]))
          ) > 0.5
          and
          (rate(config_cache_hits_total[5m]) + rate(config_cache_misses_total[5m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Config cache miss rate is high"
          description: "Cache miss rate is {{ $value | humanizePercentage }} over the last 5 minutes"

  - name: decree.reliability
    rules:
      - alert: DecreePubSubDropped
        expr: rate(pubsub_dropped_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pub/sub events are being dropped"
          description: "Pub/sub events are being dropped — subscriber channel full"

      - alert: DecreeRateLimitRejectionHigh
        expr: rate(ratelimit_rejected_total[5m]) * 60 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limit rejection rate is high"
          description: "Rate limiter is rejecting {{ $value | humanize }} requests/min"

      - alert: DecreeJWKSRefreshFailing
        expr: increase(auth_jwks_refresh_failures_total[10m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "JWKS refresh is failing"
          description: "JWKS endpoint has not refreshed successfully in the last 10 minutes"

  - name: decree.validation
    rules:
      - alert: DecreeCELCostCapExceeded
        expr: rate(validation_cel_aggregate_cost_cap_exceeded_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CEL validation cost cap is being exceeded"
          description: "CEL aggregate cost cap is exceeded — validation rules may be too expensive"
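
As a worked illustration of the arithmetic in two of the rules above, the following Go sketch mirrors the threshold conditions of `DecreeCacheMissRateHigh` and `DecreeRateLimitRejectionHigh` on plain per-second rates. The function names are hypothetical, the inputs stand in for what `rate(...[5m])` would return, and the `for:` hold duration is deliberately not modeled; this is a sketch of the conditions, not of how Prometheus evaluates them.

```go
package main

import "fmt"

// cacheMissRateHigh mirrors the DecreeCacheMissRateHigh condition:
// misses / (hits + misses) > 0.5, evaluated only when there is traffic.
// Inputs are per-second rates (assumed, as rate() would return).
func cacheMissRateHigh(hitsPerSec, missesPerSec float64) bool {
	total := hitsPerSec + missesPerSec
	if total <= 0 {
		// Corresponds to the "and (...) > 0" guard: no traffic, no alert.
		return false
	}
	return missesPerSec/total > 0.5
}

// rateLimitRejectionHigh mirrors the DecreeRateLimitRejectionHigh condition:
// the per-second rate is scaled by 60 and compared with 10 rejections/min.
func rateLimitRejectionHigh(rejectedPerSec float64) bool {
	return rejectedPerSec*60 > 10
}

func main() {
	fmt.Println(cacheMissRateHigh(4, 6))      // 60% misses: condition holds
	fmt.Println(cacheMissRateHigh(0, 0))      // idle cache: condition does not hold
	fmt.Println(rateLimitRejectionHigh(0.25)) // 15 rejections/min: condition holds
}
```

The condition must hold for the rule's full `for:` window (10m and 5m respectively) before the alert fires, so a single evaluation that returns true here corresponds only to the pending state.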
