Commit 52c7400
docs: define monitoring and alerting rules (OPS-30) (#914)
* docs: define monitoring and alerting rules (OPS-30)
Add docs/ops/ALERTING_RULES.md with 10 alert rules covering API error
rate, latency, worker heartbeat, disk, memory, queue backlog, database
connectivity, health endpoint, CPU, and Redis backplane. Each rule
specifies metric source, threshold, evaluation window, priority (P1/P2),
runbook steps, and escalation triggers.
Includes integration guidance for Grafana, AWS CloudWatch, PagerDuty,
and external uptime monitoring with example PromQL queries and Terraform
alarm definitions.
Closes #868
* docs: add ALERTING_RULES.md to ops README index
Cross-reference the new alerting rules document from the ops directory
index alongside the existing observability docs.
* docs: update OBSERVABILITY_BASELINE alert thresholds and cross-reference
Update the alert threshold baseline section to match the authoritative
thresholds in ALERTING_RULES.md and add a callout directing operators
to the comprehensive alerting rules document.
* docs: add known gaps section to alerting rules
Document three known gaps found during adversarial review:
1. OutboundWebhookDeliveryWorker not monitored by health endpoint
2. Health endpoint staleness thresholds differ from alert thresholds
3. No dedicated LLM provider error rate alert
Also clarify that Alert 3 applies to workers with OTLP metric emission
(LlmQueueToProposalWorker and ProposalHousekeepingWorker only).
* fix: correct alerting rules accuracy issues from adversarial review
- Fix config path: WorkerSettings:MaxBatchSize -> Workers:MaxBatchSize
- Document queue backlog threshold divergence from HealthController's
dynamic formula Math.Max(MaxBatchSize * 20, 100)
- Fix PromQL examples: metrics are Histograms, not gauges -- use
_sum/_count series with appropriate caveats
- Add threshold reconciliation section explaining differences with
CLOUD_REFERENCE_ARCHITECTURE.md alarm stubs
- Fix Known Gap #2: use exact default (30s) instead of approximate (~30s)
and show the full Math.Max formula
1 parent 5b70f2e commit 52c7400
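For reference, the two computations mentioned in the fix list can be sketched as follows. This is an illustrative Python sketch, not code from the repository: the function names and the sample inputs are hypothetical, and only the formula Math.Max(MaxBatchSize * 20, 100) comes from the commit message. The second helper shows the usual way to get a windowed average from a histogram's _sum and _count series, with the caveat that a mean says nothing about the tail of the distribution.

```python
from typing import Optional


def queue_backlog_threshold(max_batch_size: int) -> int:
    """Python rendering of Math.Max(MaxBatchSize * 20, 100).

    The result scales with the batch size but never drops below 100.
    """
    return max(max_batch_size * 20, 100)


def histogram_mean(sum_delta: float, count_delta: float) -> Optional[float]:
    """Mean observed value over a window, from the increase of a
    histogram's _sum and _count series over that window.

    Returns None when no observations were recorded, so callers do
    not divide by zero or mistake "no data" for a value of 0.
    """
    if count_delta <= 0:
        return None
    return sum_delta / count_delta


if __name__ == "__main__":
    # Hypothetical batch sizes, chosen to show the floor of 100.
    for size in (1, 5, 10):
        print(size, queue_backlog_threshold(size))
    # Hypothetical window: 12.5 seconds total across 50 observations.
    print(histogram_mean(12.5, 50))
```

With these sample inputs, any batch size of 5 or less yields the floor of 100, and larger batch sizes scale linearly (10 gives 200).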
3 files changed
Lines changed: 393 additions & 6 deletions