Skip to content

Commit 1f0aa58

Browse files
author
nizar-lahlali
committed
feat(cdk): add CloudWatch alarms for FanOut + ApprovalMetricsPublisher DLQs (#117)
Ship DLQ-depth CloudWatch alarms (ApproximateNumberOfMessagesVisible >= 1, 5-min Maximum, treatMissingData NOT_BREACHING) for both stream consumers so poison-pill records no longer accumulate silently during the 14-day retention window. Alarms transition to ALARM state in the CloudWatch console without a notification action — SNS wiring is a bounded follow-up once an operational channel is provisioned. Also: - Update §11.5 in CEDAR_HITL_GATES.md to reflect the shipped decision - Narrow public type to cloudwatch.IAlarm (consumers only need addAlarmAction) - Add JSDoc matching the errorAlarm precedent in task-orchestrator.ts - Replace Runbook TODO with actionable diagnostic text and link to #117 - Reference issue #117 in code comments and test titles instead of §11.5
1 parent 46e2512 commit 1f0aa58

6 files changed

Lines changed: 38 additions & 24 deletions

File tree

cdk/src/constructs/approval-metrics-publisher-consumer.ts

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -85,7 +85,8 @@ export interface ApprovalMetricsPublisherConsumerProps {
8585
export class ApprovalMetricsPublisherConsumer extends Construct {
8686
public readonly fn: lambda.NodejsFunction;
8787
public readonly dlq: sqs.Queue;
88-
public readonly dlqAlarm: cloudwatch.Alarm;
88+
/** CloudWatch alarm that fires when the DLQ has at least one poison-pill record. */
89+
public readonly dlqAlarm: cloudwatch.IAlarm;
8990

9091
constructor(scope: Construct, id: string, props: ApprovalMetricsPublisherConsumerProps) {
9192
super(scope, id);
@@ -152,7 +153,7 @@ export class ApprovalMetricsPublisherConsumer extends Construct {
152153
filters: [agentMilestoneFilter],
153154
}));
154155

155-
// §11.5: alarm on DLQ depth so poison-pill records don't silently
156+
// #117: alarm on DLQ depth so poison-pill records don't silently
156157
// accumulate without operator visibility.
157158
this.dlqAlarm = new cloudwatch.Alarm(this, 'DlqMessageAlarm', {
158159
metric: this.dlq.metricApproximateNumberOfMessagesVisible({
@@ -164,7 +165,8 @@ export class ApprovalMetricsPublisherConsumer extends Construct {
164165
comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
165166
alarmDescription:
166167
'ApprovalMetricsPublisher DLQ has at least one message; investigate poison records. ' +
167-
'Runbook: TODO — check CloudWatch Logs for the ApprovalMetricsPublisherFn error that caused the DLQ send.',
168+
'Check CloudWatch Logs for the ApprovalMetricsPublisherFn error that caused the DLQ send. ' +
169+
'See: https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/issues/117',
168170
treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
169171
});
170172

cdk/src/constructs/fanout-consumer.ts

Lines changed: 5 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -107,7 +107,8 @@ export interface FanOutConsumerProps {
107107
export class FanOutConsumer extends Construct {
108108
public readonly fn: lambda.NodejsFunction;
109109
public readonly dlq: sqs.Queue;
110-
public readonly dlqAlarm: cloudwatch.Alarm;
110+
/** CloudWatch alarm that fires when the DLQ has at least one poison-pill record. */
111+
public readonly dlqAlarm: cloudwatch.IAlarm;
111112

112113
constructor(scope: Construct, id: string, props: FanOutConsumerProps) {
113114
super(scope, id);
@@ -187,7 +188,7 @@ export class FanOutConsumer extends Construct {
187188
reportBatchItemFailures: true,
188189
}));
189190

190-
// §11.5: alarm on DLQ depth so poison-pill records don't silently
191+
// #117: alarm on DLQ depth so poison-pill records don't silently
191192
// accumulate without operator visibility.
192193
this.dlqAlarm = new cloudwatch.Alarm(this, 'DlqMessageAlarm', {
193194
metric: this.dlq.metricApproximateNumberOfMessagesVisible({
@@ -199,7 +200,8 @@ export class FanOutConsumer extends Construct {
199200
comparisonOperator: cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
200201
alarmDescription:
201202
'FanOutConsumer DLQ has at least one message; investigate poison records. ' +
202-
'Runbook: TODO — check CloudWatch Logs for the FanOutFn error that caused the DLQ send.',
203+
'Check CloudWatch Logs for the FanOutFn error that caused the DLQ send. ' +
204+
'See: https://github.com/aws-samples/sample-autonomous-cloud-coding-agents/issues/117',
203205
treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
204206
});
205207

cdk/test/constructs/approval-metrics-publisher-consumer.test.ts

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -190,7 +190,7 @@ describe('ApprovalMetricsPublisherConsumer', () => {
190190
template.resourceCountIs('AWS::DynamoDB::Table', 1);
191191
});
192192

193-
test('creates a CloudWatch alarm on DLQ ApproximateNumberOfMessagesVisible (§11.5)', () => {
193+
test('creates a CloudWatch alarm on DLQ ApproximateNumberOfMessagesVisible (#117)', () => {
194194
const { template } = createStack();
195195

196196
template.hasResourceProperties('AWS::CloudWatch::Alarm', {

cdk/test/constructs/fanout-consumer.test.ts

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -171,7 +171,7 @@ describe('FanOutConsumer', () => {
171171
}
172172
});
173173

174-
test('creates a CloudWatch alarm on DLQ ApproximateNumberOfMessagesVisible (§11.5)', () => {
174+
test('creates a CloudWatch alarm on DLQ ApproximateNumberOfMessagesVisible (#117)', () => {
175175
const app = new App();
176176
const stack = new Stack(app, 'TestStack');
177177
new FanOutConsumer(stack, 'FanOut', {

docs/design/CEDAR_HITL_GATES.md

Lines changed: 13 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1585,21 +1585,26 @@ Extend `TaskDashboard` (`cdk/src/constructs/task-dashboard.ts`). These are read-
15851585

15861586
Every `agent_milestone("approval_*")` event carries `trace_id` / `span_id`. A span `hitl.approval_wait` brackets the PreToolUse poll loop: `span.duration = decided_at - created_at`. `hitl.approval_race_loss` emitted when the agent's local timeout fired <5s before a late user decision (useful for tuning).
15871587

1588-
### 11.5 CloudWatch alarms — deferred (notification-channel gated)
1588+
### 11.5 CloudWatch alarms
1589+
1590+
**DLQ-depth alarms (shipped):** CloudWatch alarms on `ApproximateNumberOfMessagesVisible >= 1` (5-min period, Maximum statistic, `treatMissingData: NOT_BREACHING`) are deployed for:
1591+
1592+
- **FanOutConsumer DLQ** — poison-pill DynamoDB Stream records that failed three consecutive Lambda invocations.
1593+
- **ApprovalMetricsPublisher DLQ** — same failure mode for the metrics-publisher consumer.
1594+
1595+
These alarms transition to `ALARM` state in CloudWatch and appear in the console/dashboard, providing operator visibility into silent record loss. They ship **without an `addAlarmAction` / SNS notification target** — operators must check the CloudWatch Alarms console or configure a subscription manually. This is an intentional intermediate step: alarm state is durable and queryable even without push notifications, and prevents poison records from accumulating silently for the full 14-day DLQ retention window.
1596+
1597+
**Follow-up — notification channel wiring:** Once an operational notification channel (SNS topic → Slack / PagerDuty / email) is provisioned, add `alarm.addAlarmAction(new SnsAction(topic))` to both alarms. No metric or alarm restructuring is needed.
1598+
1599+
**Additional alarms (not yet shipped):** The following remain deferred until the notification channel exists (alarm-without-action provides limited value for rate/latency conditions that require human triage):
15891600

1590-
Operator-facing CloudWatch alarms that would page on:
15911601
- High approval-timeout rate (users not responding, notifications broken)
15921602
- Tasks stuck in AWAITING_APPROVAL beyond `timeout_s + 60s` (reconciler failure)
15931603
- High approval-write failure rate (DDB throttled or IAM drift)
15941604
- Approval-gate cap hit (suspicious retry loop)
1595-
- Publisher / fanout DLQ non-empty (persistent consumer-side poison pills)
15961605
- `MetricEmitSkipped` sustained > 0 (publisher schema mismatch — agent / publisher version skew)
15971606
- `MetricsPublisherHeartbeat` flat-line (publisher pipeline broken)
15981607

1599-
…are **out of scope for v1** because the project does not yet have a notification channel (Slack / PagerDuty / SNS topic / email distribution list) configured for operational alerts. Adding alarms without a notification channel produces CloudWatch widgets that nobody sees — no safety benefit.
1600-
1601-
**Plumbing status (post-Chunk 8):** the supporting metric data now flows as native CloudWatch metrics in namespace `ABCA/Cedar-HITL` via `ApprovalMetricsPublisherFn` (§11.3). Alarm wiring becomes a per-threshold `cloudwatch.Alarm` + `SnsAction`; no additional metric-extraction infra is needed. The remaining gap is the SNS topic + subscriber wiring itself — when that lands, the alarms above are a small bounded follow-up (not a multi-PR metrics build-out as they were pre-Chunk-8).
1602-
16031608
---
16041609

16051610
## 12. Security model
@@ -2070,7 +2075,7 @@ See §17.18 for the off-hours escalation future-work primitive, and §13.14 for
20702075
**Future work — polish (tracked in §17):**
20712076
- CLI inline streaming prompt (UX research first)
20722077
- `approve --defer` / allowlist revocation (`bgagent revoke-approval`)
2073-
- CloudWatch alarm plumbing (§11.5) — deferred until an operational notification channel is available
2078+
- CloudWatch alarm SNS notification wiring (§11.5) — DLQ-depth alarms ship without an action target; add `SnsAction` once a notification channel is provisioned
20742079
- More soft-deny policies in the default set based on real usage
20752080
- Persistent recent-decision cache (if container-restart telemetry justifies it)
20762081
- Persistent per-minute rate limit (if restart amplification becomes significant)

docs/src/content/docs/architecture/Cedar-hitl-gates.md

Lines changed: 13 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -1589,21 +1589,26 @@ Extend `TaskDashboard` (`cdk/src/constructs/task-dashboard.ts`). These are read-
15891589

15901590
Every `agent_milestone("approval_*")` event carries `trace_id` / `span_id`. A span `hitl.approval_wait` brackets the PreToolUse poll loop: `span.duration = decided_at - created_at`. `hitl.approval_race_loss` emitted when the agent's local timeout fired <5s before a late user decision (useful for tuning).
15911591

1592-
### 11.5 CloudWatch alarms — deferred (notification-channel gated)
1592+
### 11.5 CloudWatch alarms
1593+
1594+
**DLQ-depth alarms (shipped):** CloudWatch alarms on `ApproximateNumberOfMessagesVisible >= 1` (5-min period, Maximum statistic, `treatMissingData: NOT_BREACHING`) are deployed for:
1595+
1596+
- **FanOutConsumer DLQ** — poison-pill DynamoDB Stream records that failed three consecutive Lambda invocations.
1597+
- **ApprovalMetricsPublisher DLQ** — same failure mode for the metrics-publisher consumer.
1598+
1599+
These alarms transition to `ALARM` state in CloudWatch and appear in the console/dashboard, providing operator visibility into silent record loss. They ship **without an `addAlarmAction` / SNS notification target** — operators must check the CloudWatch Alarms console or configure a subscription manually. This is an intentional intermediate step: alarm state is durable and queryable even without push notifications, and prevents poison records from accumulating silently for the full 14-day DLQ retention window.
1600+
1601+
**Follow-up — notification channel wiring:** Once an operational notification channel (SNS topic → Slack / PagerDuty / email) is provisioned, add `alarm.addAlarmAction(new SnsAction(topic))` to both alarms. No metric or alarm restructuring is needed.
1602+
1603+
**Additional alarms (not yet shipped):** The following remain deferred until the notification channel exists (alarm-without-action provides limited value for rate/latency conditions that require human triage):
15931604

1594-
Operator-facing CloudWatch alarms that would page on:
15951605
- High approval-timeout rate (users not responding, notifications broken)
15961606
- Tasks stuck in AWAITING_APPROVAL beyond `timeout_s + 60s` (reconciler failure)
15971607
- High approval-write failure rate (DDB throttled or IAM drift)
15981608
- Approval-gate cap hit (suspicious retry loop)
1599-
- Publisher / fanout DLQ non-empty (persistent consumer-side poison pills)
16001609
- `MetricEmitSkipped` sustained > 0 (publisher schema mismatch — agent / publisher version skew)
16011610
- `MetricsPublisherHeartbeat` flat-line (publisher pipeline broken)
16021611

1603-
…are **out of scope for v1** because the project does not yet have a notification channel (Slack / PagerDuty / SNS topic / email distribution list) configured for operational alerts. Adding alarms without a notification channel produces CloudWatch widgets that nobody sees — no safety benefit.
1604-
1605-
**Plumbing status (post-Chunk 8):** the supporting metric data now flows as native CloudWatch metrics in namespace `ABCA/Cedar-HITL` via `ApprovalMetricsPublisherFn` (§11.3). Alarm wiring becomes a per-threshold `cloudwatch.Alarm` + `SnsAction`; no additional metric-extraction infra is needed. The remaining gap is the SNS topic + subscriber wiring itself — when that lands, the alarms above are a small bounded follow-up (not a multi-PR metrics build-out as they were pre-Chunk-8).
1606-
16071612
---
16081613

16091614
## 12. Security model
@@ -2074,7 +2079,7 @@ See §17.18 for the off-hours escalation future-work primitive, and §13.14 for
20742079
**Future work — polish (tracked in §17):**
20752080
- CLI inline streaming prompt (UX research first)
20762081
- `approve --defer` / allowlist revocation (`bgagent revoke-approval`)
2077-
- CloudWatch alarm plumbing (§11.5) — deferred until an operational notification channel is available
2082+
- CloudWatch alarm SNS notification wiring (§11.5) — DLQ-depth alarms ship without an action target; add `SnsAction` once a notification channel is provisioned
20782083
- More soft-deny policies in the default set based on real usage
20792084
- Persistent recent-decision cache (if container-restart telemetry justifies it)
20802085
- Persistent per-minute rate limit (if restart amplification becomes significant)

0 commit comments

Comments
 (0)