Last Updated: 2026-04-22
Issue: #868 OPS-30 Define monitoring and alerting rules
Depends on: docs/ops/OBSERVABILITY_BASELINE.md (OBS-01, #68)
This document defines monitoring thresholds, alert priorities, escalation paths, and integration guidance for Taskdeck production deployments. It translates the raw metrics from OBSERVABILITY_BASELINE.md and the cloud alarm stubs in CLOUD_REFERENCE_ARCHITECTURE.md into actionable alerting rules that operators can wire into Grafana, CloudWatch, PagerDuty, or equivalent systems.
Design principles:
- Every alert must have a clear owner and runbook reference.
- Prefer fewer, high-signal alerts over many low-signal ones. Noisy alerts erode trust.
- All thresholds are starting points. Tune them based on production traffic baselines within 30 days of deployment.
- P1 alerts page on-call. P2 alerts notify the ops channel. P3 alerts create tickets for review.
| Priority | Meaning | Response SLA | Notification channel |
|---|---|---|---|
| P1 (Critical) | Service is down or severely degraded for users. Data loss risk. | Acknowledge within 15 minutes. Mitigate within 1 hour. | Page on-call (PagerDuty / phone) |
| P2 (Warning) | Service is degraded but still functional. Risk of escalation to P1 if unaddressed. | Acknowledge within 1 hour. Investigate within 4 hours. | Ops channel (Slack / Teams / email) |
| P3 (Info) | Anomaly detected. No immediate user impact but warrants investigation. | Review within 1 business day. | Ops channel or ticket creation |
| Field | Value |
|---|---|
| Metric | http.server.request.duration status code dimension, or ALB HTTPCode_Target_5XX_Count / total request count |
| Condition | 5xx response rate > 1% of total requests |
| Evaluation window | 5 minutes (rolling) |
| Minimum sample | At least 50 requests in the window (suppress during very low traffic) |
| Priority | P1 |
| Runbook | Check application logs for stack traces. Inspect /health/ready for subsystem failures. If database is unhealthy, follow DISASTER_RECOVERY_RUNBOOK.md. If queue-related, check worker health. |
| Escalation | If rate exceeds 10% for 3 minutes, escalate to incident declaration. |
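
The condition, minimum-sample suppression, and escalation rule above can be sketched as a small evaluation function. This is a minimal illustration, not Taskdeck code: `evaluate_5xx_rate` and its parameters are hypothetical names, and the caller is assumed to supply request counts for the 5-minute and 3-minute windows.

```python
from typing import Optional

MIN_SAMPLE = 50               # minimum requests in the 5-minute window
P1_THRESHOLD = 0.01           # 5xx rate > 1% over 5 minutes
ESCALATION_THRESHOLD = 0.10   # 5xx rate > 10% over 3 minutes


def evaluate_5xx_rate(errors_5m: int, total_5m: int,
                      errors_3m: int, total_3m: int) -> Optional[str]:
    """Return None, "P1", or "incident" for the given window counts (hypothetical helper)."""
    if total_5m < MIN_SAMPLE:
        return None  # suppress during very low traffic
    if total_3m > 0 and errors_3m / total_3m > ESCALATION_THRESHOLD:
        return "incident"
    if errors_5m / total_5m > P1_THRESHOLD:
        return "P1"
    return None
```

For example, 2 errors in 100 requests yields `"P1"`, while 5 errors in 40 requests yields `None` because the sample is too small.
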
| Field | Value |
|---|---|
| Metric | http.server.request.duration p95, or ALB TargetResponseTime p95 |
| Condition | p95 latency > 2 seconds |
| Evaluation window | 10 minutes (rolling) |
| Minimum sample | At least 50 requests in the window |
| Priority | P2 |
| Runbook | Identify slow routes via trace data. Check database query performance. Review queue backlog depth (a full queue can back-pressure API requests). Check host CPU/memory. |
| Escalation | If p95 exceeds 5 seconds for 5 minutes, escalate to P1. |
| Field | Value |
|---|---|
| Metric | taskdeck.worker.heartbeat.staleness (per worker name) |
| Condition | Heartbeat staleness > 5 minutes (300 seconds) for any registered worker |
| Evaluation window | 3 consecutive samples |
| Applies to | LlmQueueToProposalWorker, ProposalHousekeepingWorker, OutboundWebhookDeliveryWorker |
| Priority | P1 |
| Runbook | Check if the worker process/container is running. Inspect worker logs for crash loops or unhandled exceptions. Verify the health endpoint: GET /health/ready reports worker status. If the worker is running but not heartbeating, check for deadlocks or stuck I/O (e.g., LLM provider timeout). Restart the worker container/process if unrecoverable. |
| Startup grace | Suppress for 30 seconds after process start (matches WorkerHeartbeatRegistry.StartupTime grace period in HealthController). |
| Escalation | If heartbeat is missing for more than 15 minutes, escalate to incident — queue items are not being processed and proposals will stale. |
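
As a rough sketch of how the staleness threshold, the 3-sample window, and the startup grace combine, the function below evaluates a series of staleness readings. The name `heartbeat_alert` and its inputs are assumptions for illustration; in practice the alerting backend performs this evaluation.

```python
from typing import Sequence

STALENESS_LIMIT_S = 300    # 5 minutes
CONSECUTIVE_SAMPLES = 3    # evaluation window from the table above
STARTUP_GRACE_S = 30       # suppress right after process start


def heartbeat_alert(staleness_samples: Sequence[float], uptime_s: float) -> bool:
    """True when the last 3 staleness samples all exceed 300 s, outside startup grace."""
    if uptime_s < STARTUP_GRACE_S:
        return False
    recent = staleness_samples[-CONSECUTIVE_SAMPLES:]
    return (len(recent) == CONSECUTIVE_SAMPLES
            and all(s > STALENESS_LIMIT_S for s in recent))
```

Requiring every sample in the window to breach the limit avoids paging on a single delayed heartbeat.
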
| Field | Value |
|---|---|
| Metric | Host disk utilization % (CloudWatch DiskSpaceUtilization, node_exporter node_filesystem_avail_bytes, or equivalent) |
| Condition | Disk usage > 80% on any data volume |
| Evaluation window | 5 minutes |
| Priority | P2 |
| Runbook | Identify the volume (data disk vs. root). For the SQLite data volume (/var/lib/taskdeck): check WAL file size, run backup and compact if possible, review log rotation settings, check if old backups are consuming space. For root volume: check Docker image layer cache, clean unused images. |
| Escalation | If disk usage > 95%, escalate to P1 — SQLite writes will fail and the application will become unavailable. |
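
Where no metrics agent is available yet, the 80% and 95% thresholds can be checked ad hoc with the Python standard library. This is a sketch under assumptions: `disk_usage_percent` and `classify_disk` are hypothetical helper names, and the result may differ slightly from `df` output because of filesystem reserved blocks.

```python
import shutil
from typing import Optional


def disk_usage_percent(path: str) -> float:
    """Percentage of the volume containing `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def classify_disk(percent_used: float) -> Optional[str]:
    """Map a usage percentage to the alert priority defined above."""
    if percent_used > 95:
        return "P1"
    if percent_used > 80:
        return "P2"
    return None
```

A call such as `classify_disk(disk_usage_percent("/var/lib/taskdeck"))` would cover the SQLite data volume named in the runbook.
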
| Field | Value |
|---|---|
| Metric | Host/container memory utilization % (CloudWatch MemoryUtilization, ECS MemoryUtilization, or node_memory_MemAvailable_bytes) |
| Condition | Memory usage > 85% |
| Evaluation window | 5 minutes |
| Priority | P2 |
| Runbook | Check if the API process has a memory leak (monitor RSS over time). Review recent deployments for memory regression. If running in a container with a hard memory limit, the OOM killer may terminate the process imminently. Consider scaling up (vertical) or out (horizontal) if the baseline memory footprint genuinely exceeds allocation. |
| Escalation | If memory usage > 95% for 3 minutes, escalate to P1 — OOM kill is imminent. |
| Field | Value |
|---|---|
| Metric | taskdeck.automation.queue.backlog |
| Condition | Queue depth > 100 pending items |
| Evaluation window | 10 minutes (sustained) |
| Priority | P2 |
| Runbook | Check if the LlmQueueToProposalWorker is healthy (see Alert 3). Check LLM provider response times and error rates. If the provider is degraded, the queue will grow. Verify the worker batch size setting (WorkerSettings:MaxBatchSize). If backlog is growing faster than drain rate, consider scaling the worker or throttling inbound capture. |
| Escalation | If queue depth > 500 for 10 minutes, escalate to P1 — significant user-facing delay in proposal generation. |
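
One way to read "sustained" is that every sample in the 10-minute window exceeds the threshold. The sketch below takes that stricter interpretation; it is an assumption, and an average over the window is an equally valid reading. `queue_backlog_priority` is a hypothetical name.

```python
from typing import Optional, Sequence

P2_DEPTH = 100   # pending items, sustained for 10 minutes
P1_DEPTH = 500   # escalation threshold, sustained for 10 minutes


def queue_backlog_priority(depth_samples_10m: Sequence[int]) -> Optional[str]:
    """Return the priority when all samples in the window exceed a threshold."""
    if not depth_samples_10m:
        return None  # no data: leave this to a separate missing-metric check
    lowest = min(depth_samples_10m)
    if lowest > P1_DEPTH:
        return "P1"
    if lowest > P2_DEPTH:
        return "P2"
    return None
```

Using the minimum means a single dip below the threshold resets the condition, which keeps short bursts from alerting.
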
| Field | Value |
|---|---|
| Metric | /health/ready response — checks.database.status |
| Condition | Database status is Unhealthy |
| Evaluation window | 2 consecutive health check failures (poll interval dependent, typically 30s) |
| Priority | P1 |
| Runbook | If SQLite: check that the database file exists and is not locked by another process. Check disk space (see Alert 4). Check file permissions. If the WAL file is corrupted, follow DISASTER_RECOVERY_RUNBOOK.md restore procedure. If PostgreSQL (cloud topology): check RDS status, connection count, and network connectivity between the API container and the database endpoint. |
| Escalation | Immediate — database failure means total service outage. |
| Field | Value |
|---|---|
| Metric | HTTP status code of GET /health/ready |
| Condition | Returns 503 (NotReady) or connection timeout |
| Evaluation window | 3 consecutive failures |
| Priority | P1 |
| Runbook | Parse the response body to identify which subsystem is unhealthy (database, queue, workers, signalrBackplane). Address the specific subsystem per its dedicated alert rule above. If the endpoint is unreachable entirely, the API process may have crashed — check container/process status and restart. |
| Escalation | Immediate — any sustained health endpoint failure indicates service degradation. |
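
The "parse the response body" step in the runbook above might look like the following. The response shape is an assumption inferred from the `checks.database.status` path used in this document: a JSON object with a `checks` map keyed by subsystem, each entry carrying a `status` string. `unhealthy_subsystems` is a hypothetical helper.

```python
import json
from typing import List


def unhealthy_subsystems(body: str) -> List[str]:
    """List subsystems whose status is "Unhealthy" in a /health/ready body (assumed shape)."""
    checks = json.loads(body).get("checks", {})
    return sorted(
        name for name, check in checks.items()
        if check.get("status") == "Unhealthy"
    )
```

Matching only `Unhealthy` means a `NotConfigured` Redis backplane is ignored, consistent with the Redis alert rule.
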
| Field | Value |
|---|---|
| Metric | Host/container CPU utilization % (ECS CPUUtilization, CloudWatch, or node_exporter) |
| Condition | CPU usage > 80% |
| Evaluation window | 5 minutes (sustained) |
| Priority | P2 |
| Runbook | Identify hot processes (API, worker, or database). Check if a recent deployment introduced a CPU regression. Review request rate for traffic spikes. If the worker is CPU-bound on LLM processing, this may be expected during queue drain — check queue backlog trend alongside CPU. |
| Escalation | If CPU usage > 95% for 5 minutes, escalate to P1. |
| Field | Value |
|---|---|
| Metric | /health/ready response — checks.signalrBackplane.status |
| Condition | Status is Unhealthy (only when Redis is configured) |
| Evaluation window | 3 consecutive failures |
| Priority | P2 |
| Runbook | Check Redis connectivity from the API host. Verify Redis memory usage (see CLOUD_REFERENCE_ARCHITECTURE.md — ElastiCache BytesUsedForCache > 80% of max). Check network security group rules. Note: NotConfigured status is normal for local/single-instance deployments and should not trigger alerts. |
| Escalation | If Redis is down for more than 10 minutes and the deployment uses multi-instance API, escalate to P1 — SignalR realtime events will not propagate across instances. |

| # | Alert | Metric | Threshold | Priority | Escalation trigger |
|---|---|---|---|---|---|
| 1 | API 5xx rate | HTTP 5xx / total | > 1% for 5 min | P1 | > 10% for 3 min |
| 2 | API p95 latency | `http.server.request.duration` p95 | > 2s for 10 min | P2 | > 5s for 5 min -> P1 |
| 3 | Worker heartbeat | `taskdeck.worker.heartbeat.staleness` | > 300s, 3 samples | P1 | > 15 min -> incident |
| 4 | Disk usage | Host disk % | > 80% | P2 | > 95% -> P1 |
| 5 | Memory usage | Host/container memory % | > 85% | P2 | > 95% for 3 min -> P1 |
| 6 | Queue backlog | `taskdeck.automation.queue.backlog` | > 100 for 10 min | P2 | > 500 for 10 min -> P1 |
| 7 | Database down | `/health/ready` DB check | Unhealthy x2 | P1 | Immediate |
| 8 | Health endpoint | `GET /health/ready` | 503 or timeout x3 | P1 | Immediate |
| 9 | CPU usage | Host/container CPU % | > 80% for 5 min | P2 | > 95% for 5 min -> P1 |
| 10 | Redis backplane | `/health/ready` Redis check | Unhealthy x3 | P2 | > 10 min -> P1 (multi-instance) |

Grafana is the recommended alerting platform for Taskdeck because it natively consumes OpenTelemetry data and supports flexible notification channels.
Setup:
- Configure the OTLP endpoint in Taskdeck: set `Observability:OtlpEndpoint` to point at a Grafana-compatible OTLP collector (e.g., Grafana Alloy, Grafana Agent, or Grafana Cloud OTLP endpoint).
- In Grafana, create a Prometheus or OTLP data source pointing to the metrics backend.
- Import or create alert rules matching the thresholds in this document.
- Configure contact points for each priority level:
  - P1: PagerDuty integration or phone/SMS notification channel
  - P2: Slack/Teams webhook or email
  - P3: Ticket creation (Jira, GitHub Issues, or equivalent)
- Configure notification policies to route alerts by priority label.
Example PromQL for API 5xx rate:
```promql
sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
/
sum(rate(http_server_request_duration_seconds_count[5m]))
> 0.01
```
Example PromQL for worker heartbeat staleness:
```promql
max(taskdeck_worker_heartbeat_staleness_seconds) by (taskdeck_worker_name)
> 300
```
Example PromQL for queue backlog (sustained):
```promql
avg_over_time(taskdeck_automation_queue_backlog[10m]) > 100
```
CloudWatch is the natural fit when running on the AWS infrastructure defined in DEPLOYMENT_TERRAFORM_BASELINE.md and CLOUD_REFERENCE_ARCHITECTURE.md.
Setup:
- Configure the OpenTelemetry Collector to export to CloudWatch via the `awsemfexporter`, or use the CloudWatch agent on EC2 instances.
- Application metrics (`taskdeck.*`) land in a custom CloudWatch namespace (e.g., `Taskdeck/Application`).
- Infrastructure metrics (CPU, memory, disk) are collected automatically by the CloudWatch agent or ECS container insights.
- Create CloudWatch Alarms for each rule in the summary table.
- Route alarms to an SNS topic per priority:
  - `taskdeck-alerts-p1` -> PagerDuty integration + ops email
  - `taskdeck-alerts-p2` -> Slack webhook + ops email
  - `taskdeck-alerts-p3` -> ticket creation Lambda or email
Example CloudWatch Alarm (Terraform):
```hcl
resource "aws_cloudwatch_metric_alarm" "worker_heartbeat_stale" {
  alarm_name          = "taskdeck-worker-heartbeat-stale"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3 # 3 consecutive 60-second samples
  metric_name         = "taskdeck.worker.heartbeat.staleness"
  namespace           = "Taskdeck/Application"
  period              = 60
  statistic           = "Maximum"
  threshold           = 300 # seconds (5 minutes)
  alarm_description   = "Worker heartbeat missing for >5 minutes. See docs/ops/ALERTING_RULES.md Alert #3."
  alarm_actions       = [aws_sns_topic.taskdeck_alerts_p1.arn]
  ok_actions          = [aws_sns_topic.taskdeck_alerts_p1.arn]

  # One alarm per worker: repeat this resource for each worker covered by Alert 3.
  dimensions = {
    WorkerName = "LlmQueueToProposalWorker"
  }
}
```

ALB-based alarms for API error rate and latency:
```hcl
resource "aws_cloudwatch_metric_alarm" "api_5xx_rate" {
  alarm_name          = "taskdeck-api-5xx-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  threshold           = 0.01 # 1% of total requests

  # Metric math: 5xx count divided by total request count over the same 5-minute period.
  metric_query {
    id          = "error_rate"
    expression  = "errors / requests"
    label       = "5xx Rate"
    return_data = true
  }

  metric_query {
    id = "errors"
    metric {
      metric_name = "HTTPCode_Target_5XX_Count"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions  = { LoadBalancer = var.alb_arn_suffix }
    }
  }

  metric_query {
    id = "requests"
    metric {
      metric_name = "RequestCount"
      namespace   = "AWS/ApplicationELB"
      period      = 300
      stat        = "Sum"
      dimensions  = { LoadBalancer = var.alb_arn_suffix }
    }
  }

  alarm_description = "API 5xx rate >1%. See docs/ops/ALERTING_RULES.md Alert #1."
  alarm_actions     = [aws_sns_topic.taskdeck_alerts_p1.arn]
}
```

PagerDuty handles on-call routing, escalation policies, and incident tracking for P1 alerts.
Setup:
- Create a PagerDuty service for Taskdeck (e.g., `Taskdeck Production`).
- Create an escalation policy:
  - Level 1: On-call engineer (immediate page).
  - Level 2: Engineering lead (escalate after 15 minutes without acknowledgment).
  - Level 3: All engineering (escalate after 30 minutes).
- Generate integration keys:
  - Grafana: Use the PagerDuty contact point integration with the Events API v2 integration key.
  - CloudWatch: Route SNS topics to PagerDuty via the PagerDuty AWS CloudWatch integration or an SNS-to-PagerDuty Lambda.
- Map alert priorities to PagerDuty severities:
  - P1 -> `critical` (pages immediately)
  - P2 -> `warning` (creates incident, no page)
  - P3 -> `info` (suppressed or ticket-only)
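
For custom integrations such as an SNS-to-PagerDuty Lambda, the priority-to-severity mapping above could be applied when building an Events API v2 payload. The sketch below only constructs the payload and does not send it; `build_pagerduty_event` and the example values are hypothetical.

```python
from typing import Any, Dict

# Mapping taken from the list above.
SEVERITY_BY_PRIORITY = {"P1": "critical", "P2": "warning", "P3": "info"}


def build_pagerduty_event(priority: str, summary: str,
                          source: str, routing_key: str) -> Dict[str, Any]:
    """Build an Events API v2 trigger payload for the given alert priority."""
    return {
        "routing_key": routing_key,   # integration key of the PagerDuty service
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": SEVERITY_BY_PRIORITY[priority],
        },
    }
```

An unknown priority raises `KeyError`, which is deliberate here: an unmapped alert should fail loudly rather than be sent with a guessed severity.
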
In addition to internal alerting, configure an external uptime monitor to detect total outages that internal monitoring cannot report.
Setup:
- Use an external service (e.g., Pingdom, UptimeRobot, Better Uptime, or AWS Route 53 health checks).
- Monitor `GET /health/live` from at least 2 geographic regions.
- Alert on 3 consecutive failures (typically 90 seconds with 30-second intervals).
- Route to the P1 notification channel — if the live endpoint is unreachable, the entire service is down.
- Planned maintenance: Create a silence/maintenance window in the alerting system before deployments. Recommended duration: 15 minutes for rolling deployments, 30 minutes for full restarts.
- Startup grace: Worker heartbeat alerts should suppress for 30 seconds after deployment (matches the `WorkerHeartbeatRegistry` startup grace in the health controller).
- Low-traffic suppression: API error rate and latency alerts should require a minimum request count (50 requests in the evaluation window) to avoid false positives during off-peak hours.
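
The consecutive-failure rule and the maintenance-window suppression described above can be combined in a small state tracker. This is an illustrative sketch only: `ConsecutiveFailureMonitor` is a hypothetical class, and hosted uptime services and alerting platforms implement this logic themselves.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureMonitor:
    threshold: int = 3   # consecutive failed probes before alerting
    failures: int = 0

    def record(self, probe_ok: bool, in_maintenance: bool = False) -> bool:
        """Record one probe result; return True when the alert should fire."""
        if probe_ok or in_maintenance:
            self.failures = 0   # success or silence window resets the streak
            return False
        self.failures += 1
        return self.failures >= self.threshold
```

Resetting the streak during a maintenance window is a design choice: failures seen during a planned deployment do not count toward an alert once the window ends.
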
These thresholds are initial values based on the application's architecture and expected traffic profile. Operators should tune them within the first 30 days of production deployment:
- Baseline measurement: Run the application under normal load for 1-2 weeks. Record p50/p95/p99 latency, error rate, queue depth, and resource utilization.
- Set thresholds at 2-3x baseline: If the p95 latency baseline is 500ms, the 2-3x guideline gives a threshold of 1-1.5s; the default 2s threshold is 4x baseline and leaves extra headroom. Adjust if this is too noisy or too quiet.
- Track alert frequency: If an alert fires more than 3 times per week without actionable cause, the threshold is too tight. If it has never fired after 30 days of production traffic, verify the metric is being collected correctly.
- Document changes: When tuning a threshold, update this document and note the rationale in a commit message.
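
The baseline and noise checks in the steps above can be sketched with the standard library. The function names are hypothetical, and the latency samples are assumed to be exported from the metrics backend after the 1-2 week baseline period.

```python
import statistics
from typing import Sequence, Tuple


def p95(samples: Sequence[float]) -> float:
    """95th percentile of the samples (needs at least two data points)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100)[94]


def suggested_threshold_range(baseline: float) -> Tuple[float, float]:
    """Threshold range at 2-3x the measured baseline."""
    return (2 * baseline, 3 * baseline)


def too_noisy(non_actionable_fires_per_week: int) -> bool:
    """True when an alert fires more than 3 times per week without actionable cause."""
    return non_actionable_fires_per_week > 3
```

For a measured p95 baseline of 0.5 seconds, `suggested_threshold_range(0.5)` returns `(1.0, 1.5)`.
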
The alerting rules in this document are designed to work with the health check infrastructure already implemented in Taskdeck:
- `GET /health/live` (liveness probe): Returns 200 if the process is running. Used by container orchestrators (ECS, Kubernetes) for restart decisions and by external uptime monitors.
- `GET /health/ready` (readiness probe): Returns 200/503 with detailed subsystem status (database, queue, workers, Redis backplane). Alerts 7, 8, and 10 consume this endpoint's output.
- `WorkerHeartbeatRegistry`: In-process registry tracking the last heartbeat of each background worker. The health controller reports staleness for `LlmQueueToProposalWorker` and `ProposalHousekeepingWorker`. Alert 3 monitors the corresponding OTLP metric.
- `TaskdeckTelemetry` (OpenTelemetry meter): Emits the custom metrics (`taskdeck.automation.queue.backlog`, `taskdeck.worker.items.processed`, `taskdeck.worker.heartbeat.staleness`, etc.) that Alerts 3 and 6 consume.
For the full list of emitted metrics and trace attributes, see docs/ops/OBSERVABILITY_BASELINE.md.
- `docs/ops/OBSERVABILITY_BASELINE.md`: OpenTelemetry metric names, trace attributes, and dashboard definition
- `docs/ops/OBSERVABILITY_SETUP.md`: Error tracking, analytics, and telemetry configuration
- `docs/ops/CLOUD_REFERENCE_ARCHITECTURE.md`: Cloud topology and initial alarm stubs
- `docs/ops/DISASTER_RECOVERY_RUNBOOK.md`: Recovery procedures referenced by database and disk alerts
- `docs/ops/INCIDENT_REHEARSAL_CADENCE.md`: Rehearsal program for validating alert-to-response paths
- `docs/ops/DEPLOYMENT_TERRAFORM_BASELINE.md`: Terraform IaC where CloudWatch alarms would be defined
- `docs/ops/BUDGET_BREACH_RUNBOOK.md`: Cost-related alerting (separate from operational alerts)