docs: define monitoring and alerting rules (OPS-30) #914
Conversation
Add docs/ops/ALERTING_RULES.md with 10 alert rules covering API error rate, latency, worker heartbeat, disk, memory, queue backlog, database connectivity, health endpoint, CPU, and Redis backplane. Each rule specifies metric source, threshold, evaluation window, priority (P1/P2), runbook steps, and escalation triggers. Includes integration guidance for Grafana, AWS CloudWatch, PagerDuty, and external uptime monitoring with example PromQL queries and Terraform alarm definitions. Closes #868
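To give a flavor of the PromQL the document ships, here is a minimal sketch in the style of Alert 1 (API 5xx error rate above 1%, P1). The metric name `taskdeck_api_requests_total` and its `status` label are hypothetical stand-ins, not names confirmed anywhere in this PR; the authoritative queries live in ALERTING_RULES.md itself.

```promql
# Hypothetical request counter; substitute the metric the API actually
# exports. Fires when more than 1% of requests over 5 minutes are 5xx.
(
  sum(rate(taskdeck_api_requests_total{status=~"5.."}[5m]))
  / sum(rate(taskdeck_api_requests_total[5m]))
) > 0.01
```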
Cross-reference the new alerting rules document from the ops directory index alongside the existing observability docs.
Update the alert threshold baseline section to match the authoritative thresholds in ALERTING_RULES.md and add a callout directing operators to the comprehensive alerting rules document.
Document three known gaps found during adversarial review:
1. OutboundWebhookDeliveryWorker not monitored by health endpoint
2. Health endpoint staleness thresholds differ from alert thresholds
3. No dedicated LLM provider error rate alert
Also clarify that Alert 3 applies to workers with OTLP metric emission (LlmQueueToProposalWorker and ProposalHousekeepingWorker only).
Adversarial Self-Review
Issues found and fixed
Items reviewed and confirmed correct
Code Review
This pull request introduces a comprehensive set of alerting rules and operational documentation for Taskdeck, including monitoring thresholds, escalation paths, and integration guides for Grafana, CloudWatch, and PagerDuty. The feedback identifies two inconsistencies between the documentation and the current backend implementation: the OutboundWebhookDeliveryWorker is not yet monitored by the HealthController, and the static queue backlog threshold of 100 items conflicts with the dynamic threshold logic used in the application's health checks.
| **Metric** | `taskdeck.worker.heartbeat.staleness` (per worker name) |
| **Condition** | Heartbeat staleness > 5 minutes (300 seconds) for any registered worker |
| **Evaluation window** | 3 consecutive samples |
| **Applies to** | `LlmQueueToProposalWorker`, `ProposalHousekeepingWorker` (OTLP metrics emitted by health controller). Note: `OutboundWebhookDeliveryWorker` reports heartbeats to the registry but is not yet monitored by the health endpoint or emitted as an OTLP metric — see Known Gaps below. |
The OutboundWebhookDeliveryWorker is listed here, but the current implementation in HealthController.cs does not monitor this worker or emit the taskdeck.worker.heartbeat.staleness metric for it. Consequently, Alert 3 will not function for this specific worker until the telemetry is added to the backend.
Previously:
| **Applies to** | `LlmQueueToProposalWorker`, `ProposalHousekeepingWorker` |
| Field | Value |
| --- | --- |
| **Metric** | `taskdeck.automation.queue.backlog` |
| **Condition** | Queue depth > 100 pending items |
The static threshold of 100 items is inconsistent with the dynamic threshold used in HealthController.cs (line 77), which is Math.Max(_workerSettings.MaxBatchSize * 20, 100). If MaxBatchSize is set to 10 or higher, the health check will remain 'Healthy' while this alert fires, which may cause confusion during incident response.
Updated to:
| **Condition** | Queue depth > 100 pending items (or matches dynamic threshold in `HealthController`) |
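To make the divergence concrete: with a hypothetical `Workers:MaxBatchSize` of 16, the health check's dynamic threshold evaluates to Math.Max(16 * 20, 100) = 320, so a static alert at 100 would fire while the endpoint still reports Healthy. A sketch of an alert expression pinned to that example value follows; the constant must be kept in sync with the deployed setting by hand.

```promql
# Illustrative threshold only: assumes Workers:MaxBatchSize = 16, matching
# the health check's max(16 * 20, 100) = 320. Windowed rates avoid the
# lifetime-average pitfall flagged elsewhere in this review.
(
  rate(taskdeck_automation_queue_backlog_sum[10m])
  / rate(taskdeck_automation_queue_backlog_count[10m])
) > 320
```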
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 77ab8d819f
| **Metric** | `taskdeck.worker.heartbeat.staleness` (per worker name) |
| **Condition** | Heartbeat staleness > 5 minutes (300 seconds) for any registered worker |
| **Evaluation window** | 3 consecutive samples |
| **Applies to** | `LlmQueueToProposalWorker`, `ProposalHousekeepingWorker`, `OutboundWebhookDeliveryWorker` |
Align heartbeat alert scope with emitted metrics
The rule states this alert applies to OutboundWebhookDeliveryWorker, but HealthController.ReadyCheck only records taskdeck.worker.heartbeat.staleness for LlmQueueToProposalWorker and ProposalHousekeepingWorker. In production, configuring this rule as written creates a false sense of coverage: an outbound webhook worker stall will not produce the documented heartbeat metric series, so the alert cannot fire for that worker even though the runbook implies it will.
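Until the backend emits that series, one hedged option is to surface the gap itself with PromQL's `absent()`. This sketch assumes the Prometheus-side series name and `taskdeck_worker_name` label used in the queries quoted later in this thread:

```promql
# Fires when no heartbeat staleness series exists for the webhook worker,
# turning the missing telemetry itself into an alert.
absent(taskdeck_worker_heartbeat_staleness_seconds_count{taskdeck_worker_name="OutboundWebhookDeliveryWorker"})
```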
Adversarial Review: PR #914 — Alerting Rules (OPS-30)

Overall Assessment

The document is well-structured and comprehensive. However, cross-referencing against the actual codebase reveals several accuracy issues that need correction before merge. I've also reviewed the bot comments and will address those.

Issues Found (will fix)

1. Config path error in Alert 6 runbook
2. Queue backlog threshold divergence from HealthController (bot finding -- confirmed valid)
3. Threshold conflicts with CLOUD_REFERENCE_ARCHITECTURE.md. These are not necessarily wrong (different environments may need different thresholds), but the docs should cross-reference each other and explain why the values differ, or the cloud doc thresholds should be updated to match.
4. PromQL examples use incorrect metric types. These should either use the `_sum`/`_count` histogram series with appropriate caveats or be re-emitted as gauges.
5. OutboundWebhookDeliveryWorker coverage (bot finding -- already addressed)

Minor Issues (will fix)

6. Health controller queue worker staleness formula is understated

Assessment of Bot Comments

Fixes Coming

I will push fixes for issues #1, #2, #4, and #6 above. Issue #3 (cloud doc threshold conflicts) I will address by adding a reconciliation note in ALERTING_RULES.md pointing operators to the cloud doc and explaining the relationship.
- Fix config path: WorkerSettings:MaxBatchSize -> Workers:MaxBatchSize
- Document queue backlog threshold divergence from HealthController's dynamic formula Math.Max(MaxBatchSize * 20, 100)
- Fix PromQL examples: metrics are Histograms, not gauges -- use _sum/_count series with appropriate caveats
- Add threshold reconciliation section explaining differences with CLOUD_REFERENCE_ARCHITECTURE.md alarm stubs
- Fix Known Gap #2: use exact default (30s) instead of approximate (~30s) and show the full Math.Max formula
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 04170c664b
```promql
max(
  taskdeck_worker_heartbeat_staleness_seconds_sum
  / taskdeck_worker_heartbeat_staleness_seconds_count
) by (taskdeck_worker_name)
```
Use rate-based histogram math for heartbeat alert query
This query divides taskdeck_worker_heartbeat_staleness_seconds_sum by _count directly, which yields a cumulative lifetime average for an OpenTelemetry histogram rather than recent staleness. In HealthController.ReadyCheck, staleness is recorded continuously, so long healthy periods can keep this average low even when a worker has been stale for >300s, causing Alert #3 to miss or significantly delay incidents. Use a windowed calculation (for example rate(sum)/rate(count)) or another recent-value strategy.
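A possible windowed rewrite along the lines the comment suggests, assuming the same exported series names; the 5m rate window is an arbitrary starting point, not a value from the document:

```promql
# Average staleness per worker over the recent window, not the lifetime
# of the process; fires when it exceeds the documented 300s threshold.
max(
  rate(taskdeck_worker_heartbeat_staleness_seconds_sum[5m])
  / rate(taskdeck_worker_heartbeat_staleness_seconds_count[5m])
) by (taskdeck_worker_name) > 300
```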
```promql
avg_over_time(
  (taskdeck_automation_queue_backlog_sum / taskdeck_automation_queue_backlog_count)[10m:]
) > 100
```
Fix queue backlog query to avoid lifetime averaging
The backlog expression computes taskdeck_automation_queue_backlog_sum / ..._count and then averages that, but _sum and _count are cumulative histogram counters, so this tracks lifetime average queue depth rather than sustained depth over the last 10 minutes. After long healthy uptime, a real backlog surge can stay below the alert threshold and Alert #6 may never fire when it should. The query should use a recent-window method (for example rate(sum)/rate(count)) or a gauge-style metric.
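A sketch of the recent-window variant, assuming the metric stays a histogram rather than being re-emitted as a gauge:

```promql
# Sustained average backlog over the last 10 minutes, built from windowed
# rates so long healthy uptime cannot dilute a genuine surge.
avg_over_time(
  (
    rate(taskdeck_automation_queue_backlog_sum[5m])
    / rate(taskdeck_automation_queue_backlog_count[5m])
  )[10m:]
) > 100
```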
* docs: update STATUS.md and AUDIT.md for 10-PR post-merge sweep

  Add delivery wave entry for PRs #914--#924 covering CI/hardening (SAST, migration validation, performance regression gate, circuit breaker), frontend decomposition (ReviewView, InboxView, AutomationChatView), ops (alerting rules), docs (data model ERD), and UX (session timeout warning). Mark resolved items in AUDIT.md: oversized views, session timeout, SAST, alerting rules, data model reference, performance regression tests.

* docs: update IMPLEMENTATION_MASTERPLAN.md with wave 27 delivery history

  Add delivery wave 27 for PRs #914--#924 covering CI/hardening (SAST, migration validation, performance regression gate, circuit breaker), frontend decomposition (ReviewView, InboxView, AutomationChatView), ops (alerting rules), docs (data model ERD), and session timeout warning. Note ADR-0031 and ADR-0032. Update wave 26 to cross-reference view decomposition resolution.

* docs: mark 10 issues delivered in ISSUE_EXECUTION_GUIDE.md

  Add Stage 7 with all 10 issues from PRs #914--#924 marked as delivered. Update Stage 6 execution note to reflect view decomposition is now resolved.

* docs: add ADR-0031 and ADR-0032 to decisions index

  ADR-0031: SAST Scanning with Semgrep (from PR #915, CI-01). ADR-0032: Circuit Breaker for External API Calls (from PR #924, HARD-01). Note: Both PRs originally created ADR-0031. Renumbered the circuit breaker ADR to ADR-0032 to resolve the conflict.

* docs: update TESTING_GUIDE.md with new CI gates and test totals

  Add migration-validation job to ci-required, SAST scanning and performance regression gate to ci-extended, and both to ci-nightly. Update test counts for circuit breaker (23 backend) and session timeout (19 frontend) tests.

* docs: update CLAUDE.md frontend architecture description

  Reflect view decomposition pattern (thin shells + extracted composables/components) and add examples of view-specific composables and component directories.

* docs: mark resolved items in EXPANSION_ROADMAP and HARDENING docs

  Mark view decomposition (ReviewView, InboxView, AutomationChatView) and monitoring/alerting setup as resolved in both roadmap files.

* fix: format STATUS.md Last Updated line to pass docs governance check

  The governance regex requires the line to end after the date (YYYY-MM-DD) with only optional whitespace. Move the parenthetical sweep note to its own line.

* fix: correct OPS-27 and PERF-12 issue numbers across docs

  OPS-27 (config validation) is GitHub issue #863, not #858. PERF-12 (board list pagination) is GitHub issue #848, not #859. The wrong numbers (#858/#859) belong to FE-17 and FE-18 respectively. Fixes references in STATUS.md, IMPLEMENTATION_MASTERPLAN.md, ISSUE_EXECUTION_GUIDE.md, AUDIT.md, HARDENING_AND_PERFORMANCE.md, TESTING_GUIDE.md, and CONFIGURATION_REFERENCE.md.
Summary
- `docs/ops/ALERTING_RULES.md` with 10 comprehensive alerting rules covering API error rate (>1% 5xx = P1), API latency (>2s p95 = P2), worker heartbeat missing (>5min = P1), disk usage (>80% = P2), memory usage (>85% = P2), queue backlog, database connectivity, health endpoint, CPU, and Redis backplane
- Cross-referenced from `docs/ops/OBSERVABILITY_BASELINE.md` and `docs/ops/README.md`
- Closes #868
Test plan
- `docs/ops/ALERTING_RULES.md` exists and renders correctly
- `OBSERVABILITY_BASELINE.md` and `README.md` link correctly
- Metric names verified against `TaskdeckTelemetry.cs`