
docs: define monitoring and alerting rules (OPS-30)#914

Merged
Chris0Jeky merged 5 commits into main from docs/ops-30-monitoring-alerting-rules
Apr 22, 2026
Conversation

@Chris0Jeky
Owner

Summary

  • Add docs/ops/ALERTING_RULES.md with 10 comprehensive alerting rules covering API error rate (>1% 5xx = P1), API latency (>2s p95 = P2), worker heartbeat missing (>5min = P1), disk usage (>80% = P2), memory usage (>85% = P2), queue backlog, database connectivity, health endpoint, CPU, and Redis backplane (a PromQL sketch of the error-rate rule appears after this list)
  • Each rule includes metric source, threshold, evaluation window, priority, runbook steps, and escalation triggers
  • Document integration paths for Grafana (with PromQL examples), AWS CloudWatch (with Terraform alarm examples), PagerDuty (escalation policy guidance), and external uptime monitoring
  • Cross-reference from docs/ops/OBSERVABILITY_BASELINE.md and docs/ops/README.md
  • Update the alert threshold baseline section in OBSERVABILITY_BASELINE.md to align with the new authoritative thresholds
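To make the first rule concrete, a minimal PromQL sketch of the 5xx error-rate condition might look like the following. The metric and label names are assumptions (the standard ASP.NET Core `http.server.request.duration` histogram as converted by an OTLP-to-Prometheus exporter), not names verified against TaskdeckTelemetry.cs, so substitute whatever the API actually emits.

# Alert 1 sketch: 5xx responses exceed 1% of all requests over the last 5 minutes (P1).
(
  sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
  /
  sum(rate(http_server_request_duration_seconds_count[5m]))
) > 0.01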

Closes #868

Test plan

  • Verify docs/ops/ALERTING_RULES.md exists and renders correctly
  • Verify all 10 alert rules specify metric, threshold, evaluation window, priority, and runbook
  • Verify cross-references from OBSERVABILITY_BASELINE.md and README.md link correctly
  • Verify PromQL examples reference actual metric names from TaskdeckTelemetry.cs
  • Verify CloudWatch Terraform examples reference metric names matching the OTLP exports
  • Verify thresholds match issue requirements: 5xx >1% (P1), p95 >2s (P2), disk >80% (P2), memory >85% (P2), heartbeat >5min (P1)

Add docs/ops/ALERTING_RULES.md with 10 alert rules covering API error
rate, latency, worker heartbeat, disk, memory, queue backlog, database
connectivity, health endpoint, CPU, and Redis backplane. Each rule
specifies metric source, threshold, evaluation window, priority (P1/P2),
runbook steps, and escalation triggers.

Includes integration guidance for Grafana, AWS CloudWatch, PagerDuty,
and external uptime monitoring with example PromQL queries and Terraform
alarm definitions.

Closes #868

Cross-reference the new alerting rules document from the ops directory
index alongside the existing observability docs.

Update the alert threshold baseline section to match the authoritative
thresholds in ALERTING_RULES.md and add a callout directing operators
to the comprehensive alerting rules document.

Document three known gaps found during adversarial review:
1. OutboundWebhookDeliveryWorker not monitored by health endpoint
2. Health endpoint staleness thresholds differ from alert thresholds
3. No dedicated LLM provider error rate alert

Also clarify that Alert 3 applies to workers with OTLP metric emission
(LlmQueueToProposalWorker and ProposalHousekeepingWorker only).

@Chris0Jeky
Owner Author

Adversarial Self-Review

Issues found and fixed

  1. OutboundWebhookDeliveryWorker gap (fixed): Alert 3 originally listed OutboundWebhookDeliveryWorker as a monitored worker, but the HealthController only checks LlmQueueToProposalWorker and ProposalHousekeepingWorker. The webhook delivery worker reports heartbeats to WorkerHeartbeatRegistry but no OTLP metric is emitted for it. Fixed by clarifying the "Applies to" field in Alert 3 and adding a Known Gaps section documenting this limitation with a follow-up recommendation.

  2. Added Known Gaps section (fixed): The document now explicitly calls out three known limitations:

    • Webhook worker not monitored via OTLP or health endpoint
    • Health endpoint staleness thresholds are tighter than the alert thresholds (intentional, documented why)
    • No dedicated LLM provider error rate alert (failures surface indirectly via queue backlog)

Items reviewed and confirmed correct

  • PromQL metric naming: Verified that OTLP metric names with dots (taskdeck.worker.heartbeat.staleness) are correctly converted to Prometheus naming with underscores and unit suffix (taskdeck_worker_heartbeat_staleness_seconds). All three PromQL examples use correct naming.
  • Threshold alignment with issue requirements: 5xx > 1% (P1), p95 > 2s (P2), disk > 80% (P2), memory > 85% (P2), heartbeat > 5min (P1) -- all match the acceptance criteria of issue #868 (OPS-30: Define monitoring and alerting rules). A p95 latency query sketch appears after this list.
  • OBSERVABILITY_BASELINE.md threshold updates: The old thresholds (5% error rate, 1500ms latency, 30s heartbeat) were intentionally updated to match the new alerting rules. The old thresholds were "suggested initial alerts" that predated a formal alerting rules document. The new doc supersedes them and the baseline now cross-references the authoritative source.
  • CloudWatch Terraform examples: The alarm definitions reference actual AWS metric namespaces and the correct metric names. The math expression alarm for 5xx rate uses the standard ALB metric pattern.
  • Escalation paths: Every P2 alert has an explicit escalation trigger to P1. Every P1 alert has an escalation to incident declaration. No alert is a dead end.
  • Runbook references: All alert runbooks reference specific existing docs (DISASTER_RECOVERY_RUNBOOK.md, health endpoints) rather than vague guidance.
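As a companion illustration of the p95 > 2s rule, a latency query against a histogram-typed metric would use histogram_quantile() over the converted Prometheus series. The metric name below is an assumption (the standard ASP.NET Core request-duration histogram), not one taken from TaskdeckTelemetry.cs:

# P2 latency sketch: API p95 request duration above 2 seconds over a 5-minute window.
histogram_quantile(
  0.95,
  sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le)
) > 2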


@gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive set of alerting rules and operational documentation for Taskdeck, including monitoring thresholds, escalation paths, and integration guides for Grafana, CloudWatch, and PagerDuty. The feedback identifies two inconsistencies between the documentation and the current backend implementation: the OutboundWebhookDeliveryWorker is not yet monitored by the HealthController, and the static queue backlog threshold of 100 items conflicts with the dynamic threshold logic used in the application's health checks.

| **Metric** | `taskdeck.worker.heartbeat.staleness` (per worker name) |
| **Condition** | Heartbeat staleness > 5 minutes (300 seconds) for any registered worker |
| **Evaluation window** | 3 consecutive samples |
| **Applies to** | `LlmQueueToProposalWorker`, `ProposalHousekeepingWorker` (OTLP metrics emitted by health controller). Note: `OutboundWebhookDeliveryWorker` reports heartbeats to the registry but is not yet monitored by the health endpoint or emitted as an OTLP metric — see Known Gaps below. |


Severity: medium

The OutboundWebhookDeliveryWorker is listed here, but the current implementation in HealthController.cs does not monitor this worker or emit the taskdeck.worker.heartbeat.staleness metric for it. Consequently, Alert 3 will not function for this specific worker until the telemetry is added to the backend.

Suggested change
| **Applies to** | `LlmQueueToProposalWorker`, `ProposalHousekeepingWorker` (OTLP metrics emitted by health controller). Note: `OutboundWebhookDeliveryWorker` reports heartbeats to the registry but is not yet monitored by the health endpoint or emitted as an OTLP metric — see Known Gaps below. |
| **Applies to** | `LlmQueueToProposalWorker`, `ProposalHousekeepingWorker` |

Comment thread: docs/ops/ALERTING_RULES.md (Outdated)
| Field | Value |
| --- | --- |
| **Metric** | `taskdeck.automation.queue.backlog` |
| **Condition** | Queue depth > 100 pending items |


Severity: medium

The static threshold of 100 items is inconsistent with the dynamic threshold used in HealthController.cs (line 77), which is Math.Max(_workerSettings.MaxBatchSize * 20, 100). If MaxBatchSize is set to 10 or higher, the health check will remain 'Healthy' while this alert fires, which may cause confusion during incident response.

Suggested change
| **Condition** | Queue depth > 100 pending items |
| **Condition** | Queue depth > 100 pending items (or matches dynamic threshold in `HealthController`) |


@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 77ab8d819f


Comment thread: docs/ops/ALERTING_RULES.md (Outdated)
| **Metric** | `taskdeck.worker.heartbeat.staleness` (per worker name) |
| **Condition** | Heartbeat staleness > 5 minutes (300 seconds) for any registered worker |
| **Evaluation window** | 3 consecutive samples |
| **Applies to** | `LlmQueueToProposalWorker`, `ProposalHousekeepingWorker`, `OutboundWebhookDeliveryWorker` |


P2: Align heartbeat alert scope with emitted metrics

The rule states this alert applies to OutboundWebhookDeliveryWorker, but HealthController.ReadyCheck only records taskdeck.worker.heartbeat.staleness for LlmQueueToProposalWorker and ProposalHousekeepingWorker. In production, configuring this rule as written creates a false sense of coverage: an outbound webhook worker stall will not produce the documented heartbeat metric series, so the alert cannot fire for that worker even though the runbook implies it will.


@Chris0Jeky
Owner Author

Adversarial Review: PR #914 — Alerting Rules (OPS-30)

Overall Assessment

The document is well-structured and comprehensive. However, cross-referencing against the actual codebase reveals several accuracy issues that need correction before merge. I've also reviewed the bot comments and will address those.


Issues Found (will fix)

1. Config path error in Alert 6 runbook
The doc refers to WorkerSettings:MaxBatchSize but the actual appsettings config section is Workers, not WorkerSettings. The correct config path is Workers:MaxBatchSize. (WorkerSettings is only the C# class name.)

2. Queue backlog threshold divergence from HealthController (bot finding -- confirmed valid)
Both gemini-code-assist and chatgpt-codex-connector flagged this. The HealthController uses Math.Max(_workerSettings.MaxBatchSize * 20, 100) (line 77). With the default MaxBatchSize=5, this gives 100, matching the alert threshold. But if operators increase MaxBatchSize to e.g. 10, the health check threshold becomes 200 while the alert still fires at 100. This creates a confusing state where the alert fires but /health/ready reports the queue as healthy. The doc should document this divergence explicitly and recommend aligning the alert threshold with the health check formula.

3. Threshold conflicts with CLOUD_REFERENCE_ARCHITECTURE.md
The cloud reference doc (lines 262-272) defines significantly different thresholds that are now inconsistent with the ALERTING_RULES.md:

  • Latency: Cloud doc says p95 > 300ms (SLO breach) / > 500ms (critical). ALERTING_RULES.md says > 2s (P2) / > 5s (P1). These are 6-10x apart.
  • Worker heartbeat: Cloud doc says > 120s. ALERTING_RULES.md says > 300s. 2.5x apart.
  • 5xx rate: Cloud doc uses absolute count (> 10/min). ALERTING_RULES.md uses percentage (> 1%).

These are not necessarily wrong (different environments may need different thresholds), but the docs should cross-reference each other and explain why the values differ, or the cloud doc thresholds should be updated to match.

4. PromQL examples use incorrect metric types
AutomationQueueBacklog is declared as Histogram<long> and WorkerHeartbeatStalenessSeconds is declared as Histogram<double> in TaskdeckTelemetry.cs. When exported to Prometheus via OpenTelemetry, histograms produce _bucket, _sum, and _count series -- not simple gauge series. The PromQL examples treat them as gauges:

  • max(taskdeck_worker_heartbeat_staleness_seconds) by (taskdeck_worker_name) -- won't work as written against a histogram
  • avg_over_time(taskdeck_automation_queue_backlog[10m]) > 100 -- won't work as written against a histogram

These should either use histogram_quantile() or note that operators may need to adapt based on their OTLP-to-Prometheus exporter configuration (some exporters can export histograms as gauges via delta temporality). At minimum, a caveat is needed.

5. OutboundWebhookDeliveryWorker coverage (bot finding -- already addressed)
Both bots flagged that Alert 3's "Applies to" field originally listed OutboundWebhookDeliveryWorker, but the doc already includes a note saying this worker "reports heartbeats to the registry but is not yet monitored by the health endpoint or emitted as an OTLP metric -- see Known Gaps below." This is correctly documented. The current text appropriately sets expectations. No change needed here.


Minor Issues (will fix)

6. Health controller queue worker staleness formula is understated
Known Gap #2 says the health controller uses QueuePollIntervalSeconds * 3 with "default ~30s". With default QueuePollIntervalSeconds=5, the actual formula gives Math.Max(5 * 3, 30) = 30s, which is accurate. But the parenthetical "default ~30s" is misleading -- it should say "default 30s" (it's exact, not approximate) or better, show the full formula.


Assessment of Bot Comments

  • gemini-code-assist (Alert 3, Applies to): Valid concern but already addressed in the current text with the Known Gaps note. No change needed.
  • gemini-code-assist (Alert 6, queue threshold): Valid. Will fix by adding the dynamic threshold formula reference.
  • chatgpt-codex-connector (Alert 3): Same as gemini's first comment. Already addressed.

Fixes Coming

I will push fixes for issues #1, #2, #4, and #6 above. Issue #3 (cloud doc threshold conflicts) I will address by adding a reconciliation note in ALERTING_RULES.md pointing operators to the cloud doc and explaining the relationship.

- Fix config path: WorkerSettings:MaxBatchSize -> Workers:MaxBatchSize
- Document queue backlog threshold divergence from HealthController's
  dynamic formula Math.Max(MaxBatchSize * 20, 100)
- Fix PromQL examples: metrics are Histograms, not gauges -- use
  _sum/_count series with appropriate caveats
- Add threshold reconciliation section explaining differences with
  CLOUD_REFERENCE_ARCHITECTURE.md alarm stubs
- Fix Known Gap #2: use exact default (30s) instead of approximate (~30s)
  and show the full Math.Max formula

@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 04170c664b


Comment on lines +195 to +198
max(
  taskdeck_worker_heartbeat_staleness_seconds_sum
  / taskdeck_worker_heartbeat_staleness_seconds_count
) by (taskdeck_worker_name)


P1: Use rate-based histogram math for heartbeat alert query

This query divides taskdeck_worker_heartbeat_staleness_seconds_sum by _count directly, which yields a cumulative lifetime average for an OpenTelemetry histogram rather than recent staleness. In HealthController.ReadyCheck, staleness is recorded continuously, so long healthy periods can keep this average low even when a worker has been stale for >300s, causing Alert #3 to miss or significantly delay incidents. Use a windowed calculation (for example rate(sum)/rate(count)) or another recent-value strategy.
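A windowed variant along those lines might look like the following sketch; the exact series names and behavior depend on the OTLP-to-Prometheus exporter configuration:

# Windowed sketch: average staleness observed per worker over the last 5 minutes (P1 at > 300s).
max(
  rate(taskdeck_worker_heartbeat_staleness_seconds_sum[5m])
  /
  rate(taskdeck_worker_heartbeat_staleness_seconds_count[5m])
) by (taskdeck_worker_name) > 300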


Comment on lines +207 to +209
avg_over_time(
  (taskdeck_automation_queue_backlog_sum / taskdeck_automation_queue_backlog_count)[10m:]
) > 100


P1: Fix queue backlog query to avoid lifetime averaging

The backlog expression computes taskdeck_automation_queue_backlog_sum / ..._count and then averages that, but _sum and _count are cumulative histogram counters, so this tracks lifetime average queue depth rather than sustained depth over the last 10 minutes. After long healthy uptime, a real backlog surge can stay below the alert threshold and Alert #6 may never fire when it should. The query should use a recent-window method (for example rate(sum)/rate(count)) or a gauge-style metric.
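A recent-window form of the backlog expression, under the same exporter assumptions, could be sketched as:

# Windowed sketch: average observed queue depth over the last 10 minutes.
(
  rate(taskdeck_automation_queue_backlog_sum[10m])
  /
  rate(taskdeck_automation_queue_backlog_count[10m])
) > 100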


@Chris0Jeky merged commit 52c7400 into main Apr 22, 2026
14 checks passed
@github-project-automation Bot moved this from Pending to Done in Taskdeck Execution Apr 22, 2026
Chris0Jeky added a commit that referenced this pull request Apr 23, 2026
* docs: update STATUS.md and AUDIT.md for 10-PR post-merge sweep

Add delivery wave entry for PRs #914--#924 covering CI/hardening
(SAST, migration validation, performance regression gate, circuit
breaker), frontend decomposition (ReviewView, InboxView,
AutomationChatView), ops (alerting rules), docs (data model ERD),
and UX (session timeout warning).

Mark resolved items in AUDIT.md: oversized views, session timeout,
SAST, alerting rules, data model reference, performance regression
tests.

* docs: update IMPLEMENTATION_MASTERPLAN.md with wave 27 delivery history

Add delivery wave 27 for PRs #914--#924 covering CI/hardening
(SAST, migration validation, performance regression gate, circuit
breaker), frontend decomposition (ReviewView, InboxView,
AutomationChatView), ops (alerting rules), docs (data model ERD),
and session timeout warning. Note ADR-0031 and ADR-0032. Update
wave 26 to cross-reference view decomposition resolution.

* docs: mark 10 issues delivered in ISSUE_EXECUTION_GUIDE.md

Add Stage 7 with all 10 issues from PRs #914--#924 marked as
delivered. Update Stage 6 execution note to reflect view
decomposition is now resolved.

* docs: add ADR-0031 and ADR-0032 to decisions index

ADR-0031: SAST Scanning with Semgrep (from PR #915, CI-01)
ADR-0032: Circuit Breaker for External API Calls (from PR #924, HARD-01)

Note: Both PRs originally created ADR-0031. Renumbered the circuit
breaker ADR to ADR-0032 to resolve the conflict.

* docs: update TESTING_GUIDE.md with new CI gates and test totals

Add migration-validation job to ci-required, SAST scanning and
performance regression gate to ci-extended, and both to ci-nightly.
Update test counts for circuit breaker (23 backend) and session
timeout (19 frontend) tests.

* docs: update CLAUDE.md frontend architecture description

Reflect view decomposition pattern (thin shells + extracted
composables/components) and add examples of view-specific
composables and component directories.

* docs: mark resolved items in EXPANSION_ROADMAP and HARDENING docs

Mark view decomposition (ReviewView, InboxView, AutomationChatView)
and monitoring/alerting setup as resolved in both roadmap files.

* fix: format STATUS.md Last Updated line to pass docs governance check

The governance regex requires the line to end after the date
(YYYY-MM-DD) with only optional whitespace. Move the parenthetical
sweep note to its own line.

* fix: correct OPS-27 and PERF-12 issue numbers across docs

OPS-27 (config validation) is GitHub issue #863, not #858.
PERF-12 (board list pagination) is GitHub issue #848, not #859.
The wrong numbers (#858/#859) belong to FE-17 and FE-18 respectively.

Fixes references in STATUS.md, IMPLEMENTATION_MASTERPLAN.md,
ISSUE_EXECUTION_GUIDE.md, AUDIT.md, HARDENING_AND_PERFORMANCE.md,
TESTING_GUIDE.md, and CONFIGURATION_REFERENCE.md.
@Chris0Jeky deleted the docs/ops-30-monitoring-alerting-rules branch April 23, 2026 22:54