
docs: define monitoring and alerting rules (OPS-30)#914

Merged
Chris0Jeky merged 5 commits into main from docs/ops-30-monitoring-alerting-rules
Apr 22, 2026
Conversation

@Chris0Jeky
Owner

Summary

  • Add docs/ops/ALERTING_RULES.md with 10 comprehensive alerting rules covering API error rate (>1% 5xx = P1), API latency (>2s p95 = P2), worker heartbeat missing (>5min = P1), disk usage (>80% = P2), memory usage (>85% = P2), queue backlog, database connectivity, health endpoint, CPU, and Redis backplane (a PromQL sketch of the error-rate rule appears after this list)
  • Each rule includes metric source, threshold, evaluation window, priority, runbook steps, and escalation triggers
  • Document integration paths for Grafana (with PromQL examples), AWS CloudWatch (with Terraform alarm examples), PagerDuty (escalation policy guidance), and external uptime monitoring
  • Cross-reference from docs/ops/OBSERVABILITY_BASELINE.md and docs/ops/README.md
  • Update the alert threshold baseline section in OBSERVABILITY_BASELINE.md to align with the new authoritative thresholds
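To make the first rule concrete, a minimal PromQL sketch of the 5xx error-rate condition might look like the following. The metric and label names are assumptions (the standard ASP.NET Core `http.server.request.duration` histogram as converted by an OTLP-to-Prometheus exporter), not names verified against TaskdeckTelemetry.cs, so substitute whatever the API actually emits.

# Alert 1 sketch: 5xx responses exceed 1% of all requests over the last 5 minutes (P1).
(
  sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
  /
  sum(rate(http_server_request_duration_seconds_count[5m]))
) > 0.01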

Closes #868

Test plan

  • Verify docs/ops/ALERTING_RULES.md exists and renders correctly
  • Verify all 10 alert rules specify metric, threshold, evaluation window, priority, and runbook
  • Verify cross-references from OBSERVABILITY_BASELINE.md and README.md link correctly
  • Verify PromQL examples reference actual metric names from TaskdeckTelemetry.cs
  • Verify CloudWatch Terraform examples reference metric names matching the OTLP exports
  • Verify thresholds match issue requirements: 5xx >1% (P1), p95 >2s (P2), disk >80% (P2), memory >85% (P2), heartbeat >5min (P1)

Add docs/ops/ALERTING_RULES.md with 10 alert rules covering API error
rate, latency, worker heartbeat, disk, memory, queue backlog, database
connectivity, health endpoint, CPU, and Redis backplane. Each rule
specifies metric source, threshold, evaluation window, priority (P1/P2),
runbook steps, and escalation triggers.

Includes integration guidance for Grafana, AWS CloudWatch, PagerDuty,
and external uptime monitoring with example PromQL queries and Terraform
alarm definitions.

Closes #868

Cross-reference the new alerting rules document from the ops directory
index alongside the existing observability docs.

Update the alert threshold baseline section to match the authoritative
thresholds in ALERTING_RULES.md and add a callout directing operators
to the comprehensive alerting rules document.

Document three known gaps found during adversarial review:
1. OutboundWebhookDeliveryWorker not monitored by health endpoint
2. Health endpoint staleness thresholds differ from alert thresholds
3. No dedicated LLM provider error rate alert

Also clarify that Alert 3 applies to workers with OTLP metric emission
(LlmQueueToProposalWorker and ProposalHousekeepingWorker only).

@Chris0Jeky
Owner Author

Adversarial Self-Review

Issues found and fixed

  1. OutboundWebhookDeliveryWorker gap (fixed): Alert 3 originally listed OutboundWebhookDeliveryWorker as a monitored worker, but the HealthController only checks LlmQueueToProposalWorker and ProposalHousekeepingWorker. The webhook delivery worker reports heartbeats to WorkerHeartbeatRegistry but no OTLP metric is emitted for it. Fixed by clarifying the "Applies to" field in Alert 3 and adding a Known Gaps section documenting this limitation with a follow-up recommendation.

  2. Added Known Gaps section (fixed): The document now explicitly calls out three known limitations:

    • Webhook worker not monitored via OTLP or health endpoint
    • Health endpoint staleness thresholds are tighter than the alert thresholds (intentional, documented why)
    • No dedicated LLM provider error rate alert (failures surface indirectly via queue backlog)

Items reviewed and confirmed correct

  • PromQL metric naming: Verified that OTLP metric names with dots (taskdeck.worker.heartbeat.staleness) are correctly converted to Prometheus naming with underscores and unit suffix (taskdeck_worker_heartbeat_staleness_seconds). All three PromQL examples use correct naming.
  • Threshold alignment with issue requirements: 5xx > 1% (P1), p95 > 2s (P2), disk > 80% (P2), memory > 85% (P2), heartbeat > 5min (P1) -- all match the acceptance criteria of issue #868 (OPS-30: Define monitoring and alerting rules). A p95 latency query sketch appears after this list.
  • OBSERVABILITY_BASELINE.md threshold updates: The old thresholds (5% error rate, 1500ms latency, 30s heartbeat) were intentionally updated to match the new alerting rules. The old thresholds were "suggested initial alerts" that predated a formal alerting rules document. The new doc supersedes them and the baseline now cross-references the authoritative source.
  • CloudWatch Terraform examples: The alarm definitions reference actual AWS metric namespaces and the correct metric names. The math expression alarm for 5xx rate uses the standard ALB metric pattern.
  • Escalation paths: Every P2 alert has an explicit escalation trigger to P1. Every P1 alert has an escalation to incident declaration. No alert is a dead end.
  • Runbook references: All alert runbooks reference specific existing docs (DISASTER_RECOVERY_RUNBOOK.md, health endpoints) rather than vague guidance.
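As a companion illustration of the p95 > 2s rule, a latency query against a histogram-typed metric would use histogram_quantile() over the converted Prometheus series. The metric name below is an assumption (the standard ASP.NET Core request-duration histogram), not one taken from TaskdeckTelemetry.cs:

# P2 latency sketch: API p95 request duration above 2 seconds over a 5-minute window.
histogram_quantile(
  0.95,
  sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le)
) > 2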


@gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive set of alerting rules and operational documentation for Taskdeck, including monitoring thresholds, escalation paths, and integration guides for Grafana, CloudWatch, and PagerDuty. The feedback identifies two inconsistencies between the documentation and the current backend implementation: the OutboundWebhookDeliveryWorker is not yet monitored by the HealthController, and the static queue backlog threshold of 100 items conflicts with the dynamic threshold logic used in the application's health checks.

| **Metric** | `taskdeck.worker.heartbeat.staleness` (per worker name) |
| **Condition** | Heartbeat staleness > 5 minutes (300 seconds) for any registered worker |
| **Evaluation window** | 3 consecutive samples |
| **Applies to** | `LlmQueueToProposalWorker`, `ProposalHousekeepingWorker` (OTLP metrics emitted by health controller). Note: `OutboundWebhookDeliveryWorker` reports heartbeats to the registry but is not yet monitored by the health endpoint or emitted as an OTLP metric — see Known Gaps below. |


Severity: medium

The OutboundWebhookDeliveryWorker is listed here, but the current implementation in HealthController.cs does not monitor this worker or emit the taskdeck.worker.heartbeat.staleness metric for it. Consequently, Alert 3 will not function for this specific worker until the telemetry is added to the backend.

Suggested change
| **Applies to** | `LlmQueueToProposalWorker`, `ProposalHousekeepingWorker` (OTLP metrics emitted by health controller). Note: `OutboundWebhookDeliveryWorker` reports heartbeats to the registry but is not yet monitored by the health endpoint or emitted as an OTLP metric — see Known Gaps below. |
| **Applies to** | `LlmQueueToProposalWorker`, `ProposalHousekeepingWorker` |

Comment thread: docs/ops/ALERTING_RULES.md (Outdated)
| Field | Value |
| --- | --- |
| **Metric** | `taskdeck.automation.queue.backlog` |
| **Condition** | Queue depth > 100 pending items |


Severity: medium

The static threshold of 100 items is inconsistent with the dynamic threshold used in HealthController.cs (line 77), which is Math.Max(_workerSettings.MaxBatchSize * 20, 100). If MaxBatchSize is set to 10 or higher, the health check will remain 'Healthy' while this alert fires, which may cause confusion during incident response.

Suggested change
| **Condition** | Queue depth > 100 pending items |
| **Condition** | Queue depth > 100 pending items (or matches dynamic threshold in `HealthController`) |


@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 77ab8d819f


Comment thread: docs/ops/ALERTING_RULES.md (Outdated)
| **Metric** | `taskdeck.worker.heartbeat.staleness` (per worker name) |
| **Condition** | Heartbeat staleness > 5 minutes (300 seconds) for any registered worker |
| **Evaluation window** | 3 consecutive samples |
| **Applies to** | `LlmQueueToProposalWorker`, `ProposalHousekeepingWorker`, `OutboundWebhookDeliveryWorker` |


P2: Align heartbeat alert scope with emitted metrics

The rule states this alert applies to OutboundWebhookDeliveryWorker, but HealthController.ReadyCheck only records taskdeck.worker.heartbeat.staleness for LlmQueueToProposalWorker and ProposalHousekeepingWorker. In production, configuring this rule as written creates a false sense of coverage: an outbound webhook worker stall will not produce the documented heartbeat metric series, so the alert cannot fire for that worker even though the runbook implies it will.


@Chris0Jeky
Owner Author

Adversarial Review: PR #914 — Alerting Rules (OPS-30)

Overall Assessment

The document is well-structured and comprehensive. However, cross-referencing against the actual codebase reveals several accuracy issues that need correction before merge. I've also reviewed the bot comments and will address those.


Issues Found (will fix)

1. Config path error in Alert 6 runbook
The doc refers to WorkerSettings:MaxBatchSize but the actual appsettings config section is Workers, not WorkerSettings. The correct config path is Workers:MaxBatchSize. (WorkerSettings is only the C# class name.)

2. Queue backlog threshold divergence from HealthController (bot finding -- confirmed valid)
Both gemini-code-assist and chatgpt-codex-connector flagged this. The HealthController uses Math.Max(_workerSettings.MaxBatchSize * 20, 100) (line 77). With the default MaxBatchSize=5, this gives 100, matching the alert threshold. But if operators increase MaxBatchSize to e.g. 10, the health check threshold becomes 200 while the alert still fires at 100. This creates a confusing state where the alert fires but /health/ready reports the queue as healthy. The doc should document this divergence explicitly and recommend aligning the alert threshold with the health check formula.

3. Threshold conflicts with CLOUD_REFERENCE_ARCHITECTURE.md
The cloud reference doc (lines 262-272) defines significantly different thresholds that are now inconsistent with the ALERTING_RULES.md:

  • Latency: Cloud doc says p95 > 300ms (SLO breach) / > 500ms (critical). ALERTING_RULES.md says > 2s (P2) / > 5s (P1). These are 6-10x apart.
  • Worker heartbeat: Cloud doc says > 120s. ALERTING_RULES.md says > 300s. 2.5x apart.
  • 5xx rate: Cloud doc uses absolute count (> 10/min). ALERTING_RULES.md uses percentage (> 1%).

These are not necessarily wrong (different environments may need different thresholds), but the docs should cross-reference each other and explain why the values differ, or the cloud doc thresholds should be updated to match.

4. PromQL examples use incorrect metric types
AutomationQueueBacklog is declared as Histogram<long> and WorkerHeartbeatStalenessSeconds is declared as Histogram<double> in TaskdeckTelemetry.cs. When exported to Prometheus via OpenTelemetry, histograms produce _bucket, _sum, and _count series -- not simple gauge series. The PromQL examples treat them as gauges:

  • max(taskdeck_worker_heartbeat_staleness_seconds) by (taskdeck_worker_name) -- won't work as written against a histogram
  • avg_over_time(taskdeck_automation_queue_backlog[10m]) > 100 -- won't work as written against a histogram

These should either use histogram_quantile() or note that operators may need to adapt based on their OTLP-to-Prometheus exporter configuration (some exporters can export histograms as gauges via delta temporality). At minimum, a caveat is needed.

5. OutboundWebhookDeliveryWorker coverage (bot finding -- already addressed)
Both bots flagged that Alert 3's "Applies to" field originally listed OutboundWebhookDeliveryWorker, but the doc already includes a note saying this worker "reports heartbeats to the registry but is not yet monitored by the health endpoint or emitted as an OTLP metric -- see Known Gaps below." This is correctly documented. The current text appropriately sets expectations. No change needed here.


Minor Issues (will fix)

6. Health controller queue worker staleness formula is understated
Known Gap #2 says the health controller uses QueuePollIntervalSeconds * 3 with "default ~30s". With default QueuePollIntervalSeconds=5, the actual formula gives Math.Max(5 * 3, 30) = 30s, which is accurate. But the parenthetical "default ~30s" is misleading -- it should say "default 30s" (it's exact, not approximate) or better, show the full formula.


Assessment of Bot Comments

  • gemini-code-assist (Alert 3, Applies to): Valid concern but already addressed in the current text with the Known Gaps note. No change needed.
  • gemini-code-assist (Alert 6, queue threshold): Valid. Will fix by adding the dynamic threshold formula reference.
  • chatgpt-codex-connector (Alert 3): Same as gemini's first comment. Already addressed.

Fixes Coming

I will push fixes for issues #1, #2, #4, and #6 above. Issue #3 (cloud doc threshold conflicts) I will address by adding a reconciliation note in ALERTING_RULES.md pointing operators to the cloud doc and explaining the relationship.

- Fix config path: WorkerSettings:MaxBatchSize -> Workers:MaxBatchSize
- Document queue backlog threshold divergence from HealthController's
  dynamic formula Math.Max(MaxBatchSize * 20, 100)
- Fix PromQL examples: metrics are Histograms, not gauges -- use
  _sum/_count series with appropriate caveats
- Add threshold reconciliation section explaining differences with
  CLOUD_REFERENCE_ARCHITECTURE.md alarm stubs
- Fix Known Gap #2: use exact default (30s) instead of approximate (~30s)
  and show the full Math.Max formula

@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 04170c664b


Comment on lines +195 to +198
max(
  taskdeck_worker_heartbeat_staleness_seconds_sum
  / taskdeck_worker_heartbeat_staleness_seconds_count
) by (taskdeck_worker_name)


P1: Use rate-based histogram math for heartbeat alert query

This query divides taskdeck_worker_heartbeat_staleness_seconds_sum by _count directly, which yields a cumulative lifetime average for an OpenTelemetry histogram rather than recent staleness. In HealthController.ReadyCheck, staleness is recorded continuously, so long healthy periods can keep this average low even when a worker has been stale for >300s, causing Alert #3 to miss or significantly delay incidents. Use a windowed calculation (for example rate(sum)/rate(count)) or another recent-value strategy.
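A windowed variant along those lines might look like the following sketch; the exact series names and behavior depend on the OTLP-to-Prometheus exporter configuration:

# Windowed sketch: average staleness observed per worker over the last 5 minutes (P1 at > 300s).
max(
  rate(taskdeck_worker_heartbeat_staleness_seconds_sum[5m])
  /
  rate(taskdeck_worker_heartbeat_staleness_seconds_count[5m])
) by (taskdeck_worker_name) > 300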


Comment on lines +207 to +209
avg_over_time(
  (taskdeck_automation_queue_backlog_sum / taskdeck_automation_queue_backlog_count)[10m:]
) > 100


P1: Fix queue backlog query to avoid lifetime averaging

The backlog expression computes taskdeck_automation_queue_backlog_sum / ..._count and then averages that, but _sum and _count are cumulative histogram counters, so this tracks lifetime average queue depth rather than sustained depth over the last 10 minutes. After long healthy uptime, a real backlog surge can stay below the alert threshold and Alert #6 may never fire when it should. The query should use a recent-window method (for example rate(sum)/rate(count)) or a gauge-style metric.
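A recent-window form of the backlog expression, under the same exporter assumptions, could be sketched as:

# Windowed sketch: average observed queue depth over the last 10 minutes.
(
  rate(taskdeck_automation_queue_backlog_sum[10m])
  /
  rate(taskdeck_automation_queue_backlog_count[10m])
) > 100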


@Chris0Jeky merged commit 52c7400 into main Apr 22, 2026
14 checks passed
@github-project-automation Bot moved this from Pending to Done in Taskdeck Execution Apr 22, 2026
Chris0Jeky added a commit that referenced this pull request Apr 23, 2026
* docs: update STATUS.md and AUDIT.md for 10-PR post-merge sweep

Add delivery wave entry for PRs #914--#924 covering CI/hardening
(SAST, migration validation, performance regression gate, circuit
breaker), frontend decomposition (ReviewView, InboxView,
AutomationChatView), ops (alerting rules), docs (data model ERD),
and UX (session timeout warning).

Mark resolved items in AUDIT.md: oversized views, session timeout,
SAST, alerting rules, data model reference, performance regression
tests.

* docs: update IMPLEMENTATION_MASTERPLAN.md with wave 27 delivery history

Add delivery wave 27 for PRs #914--#924 covering CI/hardening
(SAST, migration validation, performance regression gate, circuit
breaker), frontend decomposition (ReviewView, InboxView,
AutomationChatView), ops (alerting rules), docs (data model ERD),
and session timeout warning. Note ADR-0031 and ADR-0032. Update
wave 26 to cross-reference view decomposition resolution.

* docs: mark 10 issues delivered in ISSUE_EXECUTION_GUIDE.md

Add Stage 7 with all 10 issues from PRs #914--#924 marked as
delivered. Update Stage 6 execution note to reflect view
decomposition is now resolved.

* docs: add ADR-0031 and ADR-0032 to decisions index

ADR-0031: SAST Scanning with Semgrep (from PR #915, CI-01)
ADR-0032: Circuit Breaker for External API Calls (from PR #924, HARD-01)

Note: Both PRs originally created ADR-0031. Renumbered the circuit
breaker ADR to ADR-0032 to resolve the conflict.

* docs: update TESTING_GUIDE.md with new CI gates and test totals

Add migration-validation job to ci-required, SAST scanning and
performance regression gate to ci-extended, and both to ci-nightly.
Update test counts for circuit breaker (23 backend) and session
timeout (19 frontend) tests.

* docs: update CLAUDE.md frontend architecture description

Reflect view decomposition pattern (thin shells + extracted
composables/components) and add examples of view-specific
composables and component directories.

* docs: mark resolved items in EXPANSION_ROADMAP and HARDENING docs

Mark view decomposition (ReviewView, InboxView, AutomationChatView)
and monitoring/alerting setup as resolved in both roadmap files.

* fix: format STATUS.md Last Updated line to pass docs governance check

The governance regex requires the line to end after the date
(YYYY-MM-DD) with only optional whitespace. Move the parenthetical
sweep note to its own line.

* fix: correct OPS-27 and PERF-12 issue numbers across docs

OPS-27 (config validation) is GitHub issue #863, not #858.
PERF-12 (board list pagination) is GitHub issue #848, not #859.
The wrong numbers (#858/#859) belong to FE-17 and FE-18 respectively.

Fixes references in STATUS.md, IMPLEMENTATION_MASTERPLAN.md,
ISSUE_EXECUTION_GUIDE.md, AUDIT.md, HARDENING_AND_PERFORMANCE.md,
TESTING_GUIDE.md, and CONFIGURATION_REFERENCE.md.
@Chris0Jeky deleted the docs/ops-30-monitoring-alerting-rules branch April 23, 2026 22:54