348 changes: 348 additions & 0 deletions docs/ops/ALERTING_RULES.md
@@ -0,0 +1,348 @@
# Alerting Rules (OPS-30)

Last Updated: 2026-04-22
Issue: `#868` OPS-30 Define monitoring and alerting rules
Depends on: `docs/ops/OBSERVABILITY_BASELINE.md` (OBS-01, #68)

---

## Overview

This document defines monitoring thresholds, alert priorities, escalation paths, and integration guidance for Taskdeck production deployments. It translates the raw metrics from `OBSERVABILITY_BASELINE.md` and the cloud alarm stubs in `CLOUD_REFERENCE_ARCHITECTURE.md` into actionable alerting rules that operators can wire into Grafana, CloudWatch, PagerDuty, or equivalent systems.

**Design principles:**
- Every alert must have a clear owner and runbook reference.
- Prefer fewer, high-signal alerts over many low-signal ones. Noisy alerts erode trust.
- All thresholds are starting points. Tune them based on production traffic baselines within 30 days of deployment.
- P1 alerts page on-call. P2 alerts notify the ops channel. P3 alerts create tickets for review.

---

## Priority Definitions

| Priority | Meaning | Response SLA | Notification channel |
| --- | --- | --- | --- |
| **P1 (Critical)** | Service is down or severely degraded for users. Data loss risk. | Acknowledge within 15 minutes. Mitigate within 1 hour. | Page on-call (PagerDuty / phone) |
| **P2 (Warning)** | Service is degraded but still functional. Risk of escalation to P1 if unaddressed. | Acknowledge within 1 hour. Investigate within 4 hours. | Ops channel (Slack / Teams / email) |
| **P3 (Info)** | Anomaly detected. No immediate user impact but warrants investigation. | Review within 1 business day. | Ops channel or ticket creation |

---

## Alert Rules

### 1. API Error Rate (5xx)

| Field | Value |
| --- | --- |
| **Metric** | `http.server.request.duration` status code dimension, or ALB `HTTPCode_Target_5XX_Count` / total request count |
| **Condition** | 5xx response rate > 1% of total requests |
| **Evaluation window** | 5 minutes (rolling) |
| **Minimum sample** | At least 50 requests in the window (suppress during very low traffic) |
| **Priority** | **P1** |
| **Runbook** | Check application logs for stack traces. Inspect `/health/ready` for subsystem failures. If database is unhealthy, follow `DISASTER_RECOVERY_RUNBOOK.md`. If queue-related, check worker health. |
| **Escalation** | If rate exceeds 10% for 3 minutes, escalate to incident declaration. |

### 2. API Latency (p95)

| Field | Value |
| --- | --- |
| **Metric** | `http.server.request.duration` p95, or ALB `TargetResponseTime` p95 |
| **Condition** | p95 latency > 2 seconds |
| **Evaluation window** | 10 minutes (rolling) |
| **Minimum sample** | At least 50 requests in the window |
| **Priority** | **P2** |
| **Runbook** | Identify slow routes via trace data. Check database query performance. Review queue backlog depth (a full queue can back-pressure API requests). Check host CPU/memory. |
| **Escalation** | If p95 exceeds 5 seconds for 5 minutes, escalate to P1. |
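
**Example PromQL for p95 latency (sketch):** The rule above can be approximated in a Prometheus-compatible backend as shown below. This is a minimal sketch, not a verified rule: it assumes the histogram is exported as `http_server_request_duration_seconds_bucket` (inferred from the `_count` series used elsewhere in this document) and that the minimum-sample guard is applied across all routes.

```promql
# p95 over the 10-minute window, across all routes and instances
histogram_quantile(
  0.95,
  sum by (le) (rate(http_server_request_duration_seconds_bucket[10m]))
)
> 2
# Minimum-sample guard: only fire when at least 50 requests were observed
and on()
sum(increase(http_server_request_duration_seconds_count[10m])) >= 50
```

Comparison operators bind more tightly than `and`, so the expression reads as "(p95 > 2s) and (request count >= 50)".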

### 3. Worker Heartbeat Missing

| Field | Value |
| --- | --- |
| **Metric** | `taskdeck.worker.heartbeat.staleness` (per worker name) |
| **Condition** | Heartbeat staleness > 5 minutes (300 seconds) for any registered worker |
| **Evaluation window** | 3 consecutive samples |
| **Applies to** | `LlmQueueToProposalWorker`, `ProposalHousekeepingWorker`, `OutboundWebhookDeliveryWorker` |

**Review comment (P2): Align heartbeat alert scope with emitted metrics**

The rule states this alert applies to `OutboundWebhookDeliveryWorker`, but `HealthController.ReadyCheck` only records `taskdeck.worker.heartbeat.staleness` for `LlmQueueToProposalWorker` and `ProposalHousekeepingWorker`. In production, configuring this rule as written creates a false sense of coverage: an outbound webhook worker stall will not produce the documented heartbeat metric series, so the alert cannot fire for that worker even though the runbook implies it will.

| **Priority** | **P1** |
| **Runbook** | Check if the worker process/container is running. Inspect worker logs for crash loops or unhandled exceptions. Verify the health endpoint: `GET /health/ready` reports worker status. If the worker is running but not heartbeating, check for deadlocks or stuck I/O (e.g., LLM provider timeout). Restart the worker container/process if unrecoverable. |
| **Startup grace** | Suppress for 30 seconds after process start (matches `WorkerHeartbeatRegistry.StartupTime` grace period in `HealthController`). |
| **Escalation** | If heartbeat is missing for more than 15 minutes, escalate to incident: queue items are not being processed and proposals will go stale. |

### 4. Disk Usage

| Field | Value |
| --- | --- |
| **Metric** | Host disk utilization % (CloudWatch `DiskSpaceUtilization`, node_exporter `node_filesystem_avail_bytes`, or equivalent) |
| **Condition** | Disk usage > 80% on any data volume |
| **Evaluation window** | 5 minutes |
| **Priority** | **P2** |
| **Runbook** | Identify the volume (data disk vs. root). For the SQLite data volume (`/var/lib/taskdeck`): check WAL file size, run backup and compact if possible, review log rotation settings, check if old backups are consuming space. For root volume: check Docker image layer cache, clean unused images. |
| **Escalation** | If disk usage > 95%, escalate to P1 — SQLite writes will fail and the application will become unavailable. |
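
**Example PromQL for disk usage (sketch):** For hosts scraped by node_exporter, the condition above can be expressed roughly as follows. It assumes the SQLite data directory `/var/lib/taskdeck` is its own mount point; if it lives on the root filesystem, the `mountpoint` matcher would need to change.

```promql
# Fraction of the data volume in use, derived from available vs. total bytes
(
  1 - node_filesystem_avail_bytes{mountpoint="/var/lib/taskdeck"}
    / node_filesystem_size_bytes{mountpoint="/var/lib/taskdeck"}
) > 0.80
```

The 5-minute evaluation window would typically be set as the alert rule's pending period rather than inside the query.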

### 5. Memory Usage

| Field | Value |
| --- | --- |
| **Metric** | Host/container memory utilization % (CloudWatch `MemoryUtilization`, ECS `MemoryUtilization`, or `node_memory_MemAvailable_bytes`) |
| **Condition** | Memory usage > 85% |
| **Evaluation window** | 5 minutes |
| **Priority** | **P2** |
| **Runbook** | Check if the API process has a memory leak (monitor RSS over time). Review recent deployments for memory regression. If running in a container with a hard memory limit, the OOM killer may terminate the process imminently. Consider scaling up (vertical) or out (horizontal) if the baseline memory footprint genuinely exceeds allocation. |
| **Escalation** | If memory usage > 95% for 3 minutes, escalate to P1 — OOM kill is imminent. |
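
**Example PromQL for memory usage (sketch):** A host-level version of this rule, assuming node_exporter metrics are available. Container-level limits (for example on ECS) would need a different source and are not covered by this sketch.

```promql
# Fraction of host memory in use, based on MemAvailable
(
  1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
) > 0.85
```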

### 6. Automation Queue Backlog

| Field | Value |
| --- | --- |
| **Metric** | `taskdeck.automation.queue.backlog` |
| **Condition** | Queue depth > 100 pending items |

**Review comment (medium):**

The static threshold of 100 items is inconsistent with the dynamic threshold used in `HealthController.cs` (line 77), which is `Math.Max(_workerSettings.MaxBatchSize * 20, 100)`. If `MaxBatchSize` is set to 10 or higher, the health check will remain 'Healthy' while this alert fires, which may cause confusion during incident response.

Suggested change:
Current: `| **Condition** | Queue depth > 100 pending items |`
Suggested: ``| **Condition** | Queue depth > 100 pending items (or matches dynamic threshold in `HealthController`) |``

| **Evaluation window** | 10 minutes (sustained) |
| **Priority** | **P2** |
| **Runbook** | Check if the `LlmQueueToProposalWorker` is healthy (see Alert 3). Check LLM provider response times and error rates. If the provider is degraded, the queue will grow. Verify the worker batch size setting (`WorkerSettings:MaxBatchSize`). If backlog is growing faster than drain rate, consider scaling the worker or throttling inbound capture. |
| **Escalation** | If queue depth > 500 for 10 minutes, escalate to P1 — significant user-facing delay in proposal generation. |

### 7. Database Connectivity

| Field | Value |
| --- | --- |
| **Metric** | `/health/ready` response — `checks.database.status` |
| **Condition** | Database status is `Unhealthy` |
| **Evaluation window** | 2 consecutive health check failures (poll interval dependent, typically 30s) |
| **Priority** | **P1** |
| **Runbook** | If SQLite: check that the database file exists and is not locked by another process. Check disk space (see Alert 4). Check file permissions. If the WAL file is corrupted, follow `DISASTER_RECOVERY_RUNBOOK.md` restore procedure. If PostgreSQL (cloud topology): check RDS status, connection count, and network connectivity between the API container and the database endpoint. |
| **Escalation** | Immediate — database failure means total service outage. |

### 8. Health Endpoint Failure

| Field | Value |
| --- | --- |
| **Metric** | HTTP status code of `GET /health/ready` |
| **Condition** | Returns 503 (NotReady) or connection timeout |
| **Evaluation window** | 3 consecutive failures |
| **Priority** | **P1** |
| **Runbook** | Parse the response body to identify which subsystem is unhealthy (database, queue, workers, signalrBackplane). Address the specific subsystem per its dedicated alert rule above. If the endpoint is unreachable entirely, the API process may have crashed — check container/process status and restart. |
| **Escalation** | Immediate — any sustained health endpoint failure indicates service degradation. |
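
**Example PromQL for health endpoint failure (sketch):** If `GET /health/ready` is probed by the Prometheus blackbox_exporter, the rule above could look like the following. The probing setup is an assumption, not part of the Taskdeck baseline, and the `job` label value is hypothetical.

```promql
# Fires when every probe in the last 90 seconds failed
# (roughly 3 consecutive failures at a 30-second probe interval).
# blackbox_exporter's HTTP probe treats non-2xx responses and timeouts as failures by default.
max_over_time(probe_success{job="taskdeck-health-ready"}[90s]) == 0
```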

### 9. CPU Usage

| Field | Value |
| --- | --- |
| **Metric** | Host/container CPU utilization % (ECS `CPUUtilization`, CloudWatch, or node_exporter) |
| **Condition** | CPU usage > 80% |
| **Evaluation window** | 5 minutes (sustained) |
| **Priority** | **P2** |
| **Runbook** | Identify hot processes (API, worker, or database). Check if a recent deployment introduced a CPU regression. Review request rate for traffic spikes. If the worker is CPU-bound on LLM processing, this may be expected during queue drain — check queue backlog trend alongside CPU. |
| **Escalation** | If CPU usage > 95% for 5 minutes, escalate to P1. |
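
**Example PromQL for CPU usage (sketch):** A host-level approximation using node_exporter's per-mode CPU counters. It assumes one `instance` label per host; container-level CPU would need a different metric source.

```promql
# Busy fraction = 1 - average idle fraction across all cores, per host
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.80
```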

### 10. SignalR Backplane (Redis) Health

| Field | Value |
| --- | --- |
| **Metric** | `/health/ready` response — `checks.signalrBackplane.status` |
| **Condition** | Status is `Unhealthy` (only when Redis is configured) |
| **Evaluation window** | 3 consecutive failures |
| **Priority** | **P2** |
| **Runbook** | Check Redis connectivity from the API host. Verify Redis memory usage (see `CLOUD_REFERENCE_ARCHITECTURE.md` — ElastiCache `BytesUsedForCache` > 80% of max). Check network security group rules. Note: `NotConfigured` status is normal for local/single-instance deployments and should not trigger alerts. |
| **Escalation** | If Redis is down for more than 10 minutes and the deployment uses multi-instance API, escalate to P1 — SignalR realtime events will not propagate across instances. |

---

## Alert Summary Table

| # | Alert | Metric | Threshold | Priority | Escalation trigger |
| --- | --- | --- | --- | --- | --- |
| 1 | API 5xx rate | HTTP 5xx / total | > 1% for 5 min | P1 | > 10% for 3 min |
| 2 | API p95 latency | `http.server.request.duration` p95 | > 2s for 10 min | P2 | > 5s for 5 min -> P1 |
| 3 | Worker heartbeat | `taskdeck.worker.heartbeat.staleness` | > 300s, 3 samples | P1 | > 15 min -> incident |
| 4 | Disk usage | Host disk % | > 80% | P2 | > 95% -> P1 |
| 5 | Memory usage | Host/container memory % | > 85% | P2 | > 95% for 3 min -> P1 |
| 6 | Queue backlog | `taskdeck.automation.queue.backlog` | > 100 for 10 min | P2 | > 500 for 10 min -> P1 |
| 7 | Database down | `/health/ready` DB check | Unhealthy x2 | P1 | Immediate |
| 8 | Health endpoint | `GET /health/ready` | 503 or timeout x3 | P1 | Immediate |
| 9 | CPU usage | Host/container CPU % | > 80% for 5 min | P2 | > 95% for 5 min -> P1 |
| 10 | Redis backplane | `/health/ready` Redis check | Unhealthy x3 | P2 | > 10 min -> P1 (multi-instance) |

---

## Integration Paths

### Grafana (self-hosted or Grafana Cloud)

Grafana is the recommended alerting platform for Taskdeck because it natively consumes OpenTelemetry data and supports flexible notification channels.

**Setup:**
1. Configure the OTLP endpoint in Taskdeck: set `Observability:OtlpEndpoint` to point at a Grafana-compatible OTLP collector (e.g., Grafana Alloy, Grafana Agent, or Grafana Cloud OTLP endpoint).
2. In Grafana, create a Prometheus-compatible data source (e.g., Prometheus or Grafana Mimir) pointing to the metrics backend that stores the OTLP-exported metrics.
3. Import or create alert rules matching the thresholds in this document.
4. Configure contact points for each priority level:
   - P1: PagerDuty integration or phone/SMS notification channel
   - P2: Slack/Teams webhook or email
   - P3: Ticket creation (Jira, GitHub Issues, or equivalent)
5. Configure notification policies to route alerts by priority label.

**Example PromQL for API 5xx rate:**
```promql
sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
/
sum(rate(http_server_request_duration_seconds_count[5m]))
> 0.01
```
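
**Example PromQL for API 5xx rate with the minimum-sample guard (sketch):** The query above does not enforce the 50-request minimum from Alert 1. One way to add it, using the same assumed metric name, is to `and` the rate condition with a request-count condition:

```promql
(
  sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
  /
  sum(rate(http_server_request_duration_seconds_count[5m]))
  > 0.01
)
# Suppress during very low traffic: require at least 50 requests in the window
and
sum(increase(http_server_request_duration_seconds_count[5m])) >= 50
```

Both sides aggregate away all labels, so `and` matches them without an explicit `on()` clause.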

**Example PromQL for worker heartbeat staleness:**
```promql
max(taskdeck_worker_heartbeat_staleness_seconds) by (taskdeck_worker_name)
> 300
```
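
**Example PromQL for a missing heartbeat series (sketch):** The staleness query only fires while the series exists. If the metric stops being exported altogether (for example, when the process that records it is down), a companion `absent()` rule can catch the gap. The sketch below covers only the two workers that the review note on Alert 3 says actually emit this metric; the metric and label names are the same assumed names as in the query above.

```promql
# Each absent() returns a single series only when no matching series exists
absent(taskdeck_worker_heartbeat_staleness_seconds{taskdeck_worker_name="LlmQueueToProposalWorker"})
or
absent(taskdeck_worker_heartbeat_staleness_seconds{taskdeck_worker_name="ProposalHousekeepingWorker"})
```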

**Example PromQL for queue backlog (sustained):**
```promql
avg_over_time(taskdeck_automation_queue_backlog[10m]) > 100
```
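
**Example PromQL for a strictly sustained backlog (sketch):** `avg_over_time` can exceed the threshold after a short, large spike even if the queue has already drained. If "sustained" should mean every sample in the window is above the threshold, `min_over_time` is a possible alternative. The second query applies the same idea to the escalation threshold from Alert 6.

```promql
# P2: backlog stayed above 100 for the entire 10-minute window
min_over_time(taskdeck_automation_queue_backlog[10m]) > 100

# P1 escalation: backlog stayed above 500 for the entire 10-minute window
min_over_time(taskdeck_automation_queue_backlog[10m]) > 500
```

These are two separate alert expressions, not one query.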

### AWS CloudWatch

CloudWatch is the natural fit when running on the AWS infrastructure defined in `DEPLOYMENT_TERRAFORM_BASELINE.md` and `CLOUD_REFERENCE_ARCHITECTURE.md`.

**Setup:**
1. Configure the OpenTelemetry Collector to export to CloudWatch via the `awsemf` exporter, or use the CloudWatch agent on EC2 instances.
2. Application metrics (`taskdeck.*`) land in a custom CloudWatch namespace (e.g., `Taskdeck/Application`).
3. Infrastructure metrics (CPU, memory, disk) are collected automatically by the CloudWatch agent or ECS container insights.
4. Create CloudWatch Alarms for each rule in the summary table.
5. Route alarms to an SNS topic per priority:
   - `taskdeck-alerts-p1` -> PagerDuty integration + ops email
   - `taskdeck-alerts-p2` -> Slack webhook + ops email
   - `taskdeck-alerts-p3` -> ticket creation Lambda or email

**Example CloudWatch Alarm (Terraform):**
```hcl
resource "aws_cloudwatch_metric_alarm" "worker_heartbeat_stale" {
  alarm_name          = "taskdeck-worker-heartbeat-stale"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "taskdeck.worker.heartbeat.staleness"
  namespace           = "Taskdeck/Application"
  period              = 60
  statistic           = "Maximum"
  threshold           = 300
  alarm_description   = "Worker heartbeat missing for >5 minutes. See docs/ops/ALERTING_RULES.md Alert #3."
  alarm_actions       = [aws_sns_topic.taskdeck_alerts_p1.arn]
  ok_actions          = [aws_sns_topic.taskdeck_alerts_p1.arn]

  dimensions = {
    WorkerName = "LlmQueueToProposalWorker"
  }
}
```

**ALB-based alarms** for API error rate and latency:
```hcl
resource "aws_cloudwatch_metric_alarm" "api_5xx_rate" {
  alarm_name          = "taskdeck-api-5xx-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  threshold           = 0.01

  metric_query {
    id          = "error_rate"
    expression  = "errors / requests"
    label       = "5xx Rate"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "HTTPCode_Target_5XX_Count"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions  = { LoadBalancer = var.alb_arn_suffix }
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions  = { LoadBalancer = var.alb_arn_suffix }
    }
  }

  alarm_description = "API 5xx rate >1%. See docs/ops/ALERTING_RULES.md Alert #1."
  alarm_actions     = [aws_sns_topic.taskdeck_alerts_p1.arn]
}
```

### PagerDuty

PagerDuty handles on-call routing, escalation policies, and incident tracking for P1 alerts.

**Setup:**
1. Create a PagerDuty service for Taskdeck (e.g., `Taskdeck Production`).
2. Create an escalation policy:
   - Level 1: On-call engineer (immediate page).
   - Level 2: Engineering lead (escalate after 15 minutes without acknowledgment).
   - Level 3: All engineering (escalate after 30 minutes).
3. Generate integration keys:
   - **Grafana**: Use the PagerDuty contact point integration with the Events API v2 integration key.
   - **CloudWatch**: Route SNS topics to PagerDuty via the PagerDuty AWS CloudWatch integration or an SNS-to-PagerDuty Lambda.
4. Map alert priorities to PagerDuty severities:
   - P1 -> `critical` (pages immediately)
   - P2 -> `warning` (creates incident, no page)
   - P3 -> `info` (suppressed or ticket-only)

### External Health Check (Uptime Monitoring)

In addition to internal alerting, configure an external uptime monitor to detect total outages that internal monitoring cannot report.

**Setup:**
1. Use an external service (e.g., Pingdom, UptimeRobot, Better Uptime, or AWS Route 53 health checks).
2. Monitor `GET /health/live` from at least 2 geographic regions.
3. Alert on 3 consecutive failures (typically 90 seconds with 30-second intervals).
4. Route to the P1 notification channel — if the live endpoint is unreachable, the entire service is down.

---

## Silence and Maintenance Windows

- **Planned maintenance**: Create a silence/maintenance window in the alerting system before deployments. Recommended duration: 15 minutes for rolling deployments, 30 minutes for full restarts.
- **Startup grace**: Worker heartbeat alerts should suppress for 30 seconds after deployment (matches the `WorkerHeartbeatRegistry` startup grace in the health controller).
- **Low-traffic suppression**: API error rate and latency alerts should require a minimum request count (50 requests in the evaluation window) to avoid false positives during off-peak hours.

---

## Threshold Tuning Guidance

These thresholds are initial values based on the application's architecture and expected traffic profile. Operators should tune them within the first 30 days of production deployment:

1. **Baseline measurement**: Run the application under normal load for 1-2 weeks. Record p50/p95/p99 latency, error rate, queue depth, and resource utilization.
2. **Set thresholds at 2-3x baseline**: If the p95 latency baseline is 500ms, a 2-3x threshold is 1-1.5s. The default 2s threshold is 4x that baseline, so tighten it if it proves too quiet, or keep the extra margin if tighter values are too noisy.
3. **Track alert frequency**: If an alert fires more than 3 times per week without actionable cause, the threshold is too tight. If it has never fired after 30 days of production traffic, verify the metric is being collected correctly.
4. **Document changes**: When tuning a threshold, update this document and note the rationale in a commit message.
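
**Example PromQL for baseline measurement (sketch):** The baseline step above can be supported with long-range queries such as the following. They assume the metrics backend retains at least 14 days of data and use the same assumed metric names as the Grafana examples in this document; they are intended for ad hoc exploration, not as alert rules.

```promql
# p95 request latency over the last two weeks
histogram_quantile(
  0.95,
  sum by (le) (rate(http_server_request_duration_seconds_bucket[14d]))
)

# p95 of observed queue depth over the last two weeks
quantile_over_time(0.95, taskdeck_automation_queue_backlog[14d])
```

Multiplying each result by 2-3 gives a candidate threshold to compare against the defaults in the summary table.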

---

## Relationship to Existing Health Infrastructure

The alerting rules in this document are designed to work with the health check infrastructure already implemented in Taskdeck:

- **`GET /health/live`** (liveness probe): Returns 200 if the process is running. Used by container orchestrators (ECS, Kubernetes) for restart decisions and by external uptime monitors.
- **`GET /health/ready`** (readiness probe): Returns 200/503 with detailed subsystem status (database, queue, workers, Redis backplane). Alerts 7, 8, and 10 consume this endpoint's output.
- **`WorkerHeartbeatRegistry`**: In-process registry tracking the last heartbeat of each background worker. The health controller reports staleness for `LlmQueueToProposalWorker` and `ProposalHousekeepingWorker`. Alert 3 monitors the corresponding OTLP metric.
- **`TaskdeckTelemetry`** (OpenTelemetry meter): Emits the custom metrics (`taskdeck.automation.queue.backlog`, `taskdeck.worker.items.processed`, `taskdeck.worker.heartbeat.staleness`, etc.) that Alerts 3 and 6 consume.

For the full list of emitted metrics and trace attributes, see `docs/ops/OBSERVABILITY_BASELINE.md`.

---

## Related Docs

- `docs/ops/OBSERVABILITY_BASELINE.md` — OpenTelemetry metric names, trace attributes, and dashboard definition
- `docs/ops/OBSERVABILITY_SETUP.md` — Error tracking, analytics, and telemetry configuration
- `docs/ops/CLOUD_REFERENCE_ARCHITECTURE.md` — Cloud topology and initial alarm stubs
- `docs/ops/DISASTER_RECOVERY_RUNBOOK.md` — Recovery procedures referenced by database and disk alerts
- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md` — Rehearsal program for validating alert-to-response paths
- `docs/ops/DEPLOYMENT_TERRAFORM_BASELINE.md` — Terraform IaC where CloudWatch alarms would be defined
- `docs/ops/BUDGET_BREACH_RUNBOOK.md` — Cost-related alerting (separate from operational alerts)
16 changes: 10 additions & 6 deletions docs/ops/OBSERVABILITY_BASELINE.md
@@ -61,12 +61,16 @@ Recommended panels:

## Alert Threshold Baseline

-Suggested initial alerts:
-1. API error rate > 5% for 5m.
-2. API p95 request latency > 1500ms for 10m.
-3. Queue backlog > 100 for 10m.
-4. Queue worker heartbeat staleness > 30s for 3 consecutive samples.
-5. Proposal housekeeping heartbeat staleness > 180s for 3 consecutive samples.
+> **Comprehensive alerting rules**: See `docs/ops/ALERTING_RULES.md` for full alert definitions
+> with priorities, escalation paths, runbook references, and Grafana/CloudWatch/PagerDuty
+> integration guidance.
+
+Suggested initial alerts (summary — see `ALERTING_RULES.md` for authoritative thresholds):
+1. API 5xx error rate > 1% for 5m (P1).
+2. API p95 request latency > 2s for 10m (P2).
+3. Queue backlog > 100 for 10m (P2).
+4. Worker heartbeat staleness > 300s for 3 consecutive samples (P1).
+5. Disk usage > 80% (P2). Memory usage > 85% (P2).

## Non-Prod Smoke Verification Path
