feat(cdk): add CloudWatch alarms for FanOut + ApprovalMetricsPublisher DLQs (#117)
Ship DLQ-depth CloudWatch alarms (ApproximateNumberOfMessagesVisible >= 1,
5-min Maximum, treatMissingData NOT_BREACHING) for both stream consumers so
poison-pill records no longer accumulate silently during the 14-day
retention window.
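
As a rough illustration of what this configuration means for a single 5-minute period, the sketch below models the evaluation in plain TypeScript. It is a simplification and not CloudWatch's actual evaluator; the function name `evaluateDlqDepth` and the sample-array representation are assumptions made for this example.

```typescript
// Simplified model of one alarm evaluation period (not CloudWatch's real evaluator).
type AlarmState = 'OK' | 'ALARM';

// `samples` holds the ApproximateNumberOfMessagesVisible datapoints observed
// during one 5-minute period. An empty array models a period with no data.
function evaluateDlqDepth(samples: number[], threshold = 1): AlarmState {
  if (samples.length === 0) {
    // treatMissingData NOT_BREACHING: a period without data counts as healthy.
    return 'OK';
  }
  // Maximum statistic: the largest datapoint in the period decides the outcome.
  const maximum = Math.max(...samples);
  // Comparison is ">= threshold", so a single visible message is enough.
  return maximum >= threshold ? 'ALARM' : 'OK';
}
```

Because the statistic is Maximum rather than Average, one message that is visible for even a single datapoint within the period is enough to breach the threshold.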
Alarms transition to ALARM state in the CloudWatch console without a
notification action — SNS wiring is a bounded follow-up once an operational
channel is provisioned.
Also:
- Update §11.5 in CEDAR_HITL_GATES.md to reflect the shipped decision
- Narrow public type to cloudwatch.IAlarm (consumers only need addAlarmAction)
- Add JSDoc matching the errorAlarm precedent in task-orchestrator.ts
- Replace Runbook TODO with actionable diagnostic text and link to #117
- Reference issue #117 in code comments and test titles instead of §11.5
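
For orientation, a minimal CDK sketch of an alarm with the settings described above, plus the deferred SNS wiring, might look like the following. This is not the code from this commit: the helper names `addDlqDepthAlarm` and `wireNotification`, the construct id passed in, and `evaluationPeriods: 1` are all assumptions made for illustration.

```typescript
import { Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import { SnsAction } from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import { Construct } from 'constructs';

// Hypothetical helper: name, signature, and construct id are illustrative only.
export function addDlqDepthAlarm(
  scope: Construct,
  id: string,
  dlq: sqs.IQueue,
): cloudwatch.Alarm {
  return new cloudwatch.Alarm(scope, id, {
    // Queue depth over a 5-minute period, using the Maximum statistic.
    metric: dlq.metricApproximateNumberOfMessagesVisible({
      period: Duration.minutes(5),
      statistic: 'Maximum',
    }),
    threshold: 1,
    comparisonOperator:
      cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
    // Assumption: the commit message does not state the number of periods.
    evaluationPeriods: 1,
    // An empty DLQ that emits no datapoints should not trip the alarm.
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  });
}

// Follow-up wiring for when a notification topic exists (hypothetical helper).
export function wireNotification(
  alarm: cloudwatch.Alarm,
  topic: sns.ITopic,
): void {
  alarm.addAlarmAction(new SnsAction(topic));
}
```

Under these assumptions, the alarm definition stays unchanged when notifications arrive later; only the `addAlarmAction` call is added.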
docs/design/CEDAR_HITL_GATES.md: 13 additions & 8 deletions
@@ -1585,21 +1585,26 @@ Extend `TaskDashboard` (`cdk/src/constructs/task-dashboard.ts`). These are read-
 
 Every `agent_milestone("approval_*")` event carries `trace_id` / `span_id`. A span `hitl.approval_wait` brackets the PreToolUse poll loop: `span.duration = decided_at - created_at`. `hitl.approval_race_loss` emitted when the agent's local timeout fired <5s before a late user decision (useful for tuning).
+**DLQ-depth alarms (shipped):** CloudWatch alarms on `ApproximateNumberOfMessagesVisible >= 1` (5-min period, Maximum statistic, `treatMissingData: NOT_BREACHING`) are deployed for:
+
+- **FanOutConsumer DLQ** — poison-pill DynamoDB Stream records that failed three consecutive Lambda invocations.
+- **ApprovalMetricsPublisher DLQ** — same failure mode for the metrics-publisher consumer.
+
+These alarms transition to `ALARM` state in CloudWatch and appear in the console/dashboard, providing operator visibility into silent record loss. They ship **without an `addAlarmAction` / SNS notification target** — operators must check the CloudWatch Alarms console or configure a subscription manually. This is an intentional intermediate step: alarm state is durable and queryable even without push notifications, and prevents poison records from accumulating silently for the full 14-day DLQ retention window.
+
+**Follow-up — notification channel wiring:** Once an operational notification channel (SNS topic → Slack / PagerDuty / email) is provisioned, add `alarm.addAlarmAction(new SnsAction(topic))` to both alarms. No metric or alarm restructuring is needed.
+
+**Additional alarms (not yet shipped):** The following remain deferred until the notification channel exists (alarm-without-action provides limited value for rate/latency conditions that require human triage):
 
-Operator-facing CloudWatch alarms that would page on:
 - High approval-timeout rate (users not responding, notifications broken)
-…are **out of scope for v1** because the project does not yet have a notification channel (Slack / PagerDuty / SNS topic / email distribution list) configured for operational alerts. Adding alarms without a notification channel produces CloudWatch widgets that nobody sees — no safety benefit.
-
-**Plumbing status (post-Chunk 8):** the supporting metric data now flows as native CloudWatch metrics in namespace `ABCA/Cedar-HITL` via `ApprovalMetricsPublisherFn` (§11.3). Alarm wiring becomes a per-threshold `cloudwatch.Alarm` + `SnsAction`; no additional metric-extraction infra is needed. The remaining gap is the SNS topic + subscriber wiring itself — when that lands, the alarms above are a small bounded follow-up (not a multi-PR metrics build-out as they were pre-Chunk-8).
-
 ---
 
 ## 12. Security model
@@ -2070,7 +2075,7 @@ See §17.18 for the off-hours escalation future-work primitive, and §13.14 for
-- CloudWatch alarm plumbing (§11.5) — deferred until an operational notification channel is available
+- CloudWatch alarm SNS notification wiring (§11.5) — DLQ-depth alarms ship without an action target; add `SnsAction` once a notification channel is provisioned
 - More soft-deny policies in the default set based on real usage
docs/src/content/docs/architecture/Cedar-hitl-gates.md: 13 additions & 8 deletions
@@ -1589,21 +1589,26 @@ Extend `TaskDashboard` (`cdk/src/constructs/task-dashboard.ts`). These are read-
 
 Every `agent_milestone("approval_*")` event carries `trace_id` / `span_id`. A span `hitl.approval_wait` brackets the PreToolUse poll loop: `span.duration = decided_at - created_at`. `hitl.approval_race_loss` emitted when the agent's local timeout fired <5s before a late user decision (useful for tuning).
+**DLQ-depth alarms (shipped):** CloudWatch alarms on `ApproximateNumberOfMessagesVisible >= 1` (5-min period, Maximum statistic, `treatMissingData: NOT_BREACHING`) are deployed for:
+
+- **FanOutConsumer DLQ** — poison-pill DynamoDB Stream records that failed three consecutive Lambda invocations.
+- **ApprovalMetricsPublisher DLQ** — same failure mode for the metrics-publisher consumer.
+
+These alarms transition to `ALARM` state in CloudWatch and appear in the console/dashboard, providing operator visibility into silent record loss. They ship **without an `addAlarmAction` / SNS notification target** — operators must check the CloudWatch Alarms console or configure a subscription manually. This is an intentional intermediate step: alarm state is durable and queryable even without push notifications, and prevents poison records from accumulating silently for the full 14-day DLQ retention window.
+
+**Follow-up — notification channel wiring:** Once an operational notification channel (SNS topic → Slack / PagerDuty / email) is provisioned, add `alarm.addAlarmAction(new SnsAction(topic))` to both alarms. No metric or alarm restructuring is needed.
+
+**Additional alarms (not yet shipped):** The following remain deferred until the notification channel exists (alarm-without-action provides limited value for rate/latency conditions that require human triage):
 
-Operator-facing CloudWatch alarms that would page on:
 - High approval-timeout rate (users not responding, notifications broken)
-…are **out of scope for v1** because the project does not yet have a notification channel (Slack / PagerDuty / SNS topic / email distribution list) configured for operational alerts. Adding alarms without a notification channel produces CloudWatch widgets that nobody sees — no safety benefit.
-
-**Plumbing status (post-Chunk 8):** the supporting metric data now flows as native CloudWatch metrics in namespace `ABCA/Cedar-HITL` via `ApprovalMetricsPublisherFn` (§11.3). Alarm wiring becomes a per-threshold `cloudwatch.Alarm` + `SnsAction`; no additional metric-extraction infra is needed. The remaining gap is the SNS topic + subscriber wiring itself — when that lands, the alarms above are a small bounded follow-up (not a multi-PR metrics build-out as they were pre-Chunk-8).
-
 ---
 
 ## 12. Security model
@@ -2074,7 +2079,7 @@ See §17.18 for the off-hours escalation future-work primitive, and §13.14 for
-- CloudWatch alarm plumbing (§11.5) — deferred until an operational notification channel is available
+- CloudWatch alarm SNS notification wiring (§11.5) — DLQ-depth alarms ship without an action target; add `SnsAction` once a notification channel is provisioned
 - More soft-deny policies in the default set based on real usage