Commit 014ff20

Merge branch 'feat/circuit-breaker' into 'develop'
Feat/circuit breaker See merge request genaiic-reusable-assets/engagement-artifacts/genaiic-idp-accelerator!638
2 parents a88172b + 6756712 commit 014ff20

28 files changed

Lines changed: 2844 additions & 4 deletions

CHANGELOG.md

Lines changed: 21 additions & 0 deletions
@@ -5,6 +5,27 @@ SPDX-License-Identifier: MIT-0
55

66
## [Unreleased]
77

8+
### Added
9+
10+
- **Bedrock circuit breaker** — a CFN-parameterized circuit breaker that pauses new workflow starts when Bedrock is unhealthy and auto-recovers once the service comes back, so transient Bedrock outages no longer burn through SQS retries or leave documents half-processed. Off by default for full backward compatibility.
11+
- New `circuit_breaker_manager` Lambda (`src/lambda/circuit_breaker_manager/`) owns state transitions in the existing `ConcurrencyTable` under `counter_id = "circuit_breaker"`. Three states: `CLOSED` (normal), `OPEN` (block new workflow starts), `HALF_OPEN` (probe traffic allowed).
12+
- **Triggers:** (1) CloudWatch Alarm state changes on Bedrock error metrics fan out via SNS → `circuit_breaker_manager`; (2) an EventBridge-scheduled health check promotes `OPEN → HALF_OPEN` once `RECOVERY_TIMEOUT_SECONDS` has elapsed.
13+
- **Gate:** `queue_processor.check_circuit_breaker()` runs *before* the concurrency-counter increment — cheap filter first. `OPEN` → message is not deleted, SQS redelivers after visibility timeout. `HALF_OPEN` → probe traffic proceeds. DDB errors fail **open** (do not block traffic) and return state `ERROR`.
14+
- **Recovery:** `workflow_tracker.notify_circuit_breaker_success()` transitions `HALF_OPEN → CLOSED` after the first successful workflow completion, conditionally on state still being `HALF_OPEN`.
15+
- **Race-safe DDB writes:** all state transitions (`handle_alarm_event`, `handle_health_check`, `notify_circuit_breaker_success`) use `ConditionExpression` on the expected prior state. A concurrent alarm fan-out or a workflow-completion racing against an alarm can no longer clobber each other's state write; the loser's `ConditionalCheckFailedException` is swallowed as a no-op and side effects (SNS publish, CloudWatch metric, custom error-handler invoke) are skipped so no stale notification is emitted.
16+
- **Operator hooks:** manual `{"action": "reset"}` and `{"action": "get_state"}` invocations on `circuit_breaker_manager`; optional customer Lambda invoked via `ERROR_HANDLER_ARN` when the breaker opens.
17+
- **Metrics & notifications:** `CircuitBreakerOpened` / `CircuitBreakerHalfOpen` / `CircuitBreakerClosed` CloudWatch metrics under the existing `METRIC_NAMESPACE`, and state-change notifications to the existing `AlertsTopic`.
18+
- **New CFN parameters** (all default to off, fully backward-compatible): `EnableCircuitBreaker`, `CircuitBreakerRecoveryTimeoutSeconds`, `CircuitBreakerErrorHandlerArn`. New env vars `CIRCUIT_BREAKER_ENABLED` / `CIRCUIT_BREAKER_ID` on `queue_processor` and `workflow_tracker`.
19+
- **Unit test coverage:** `src/lambda/circuit_breaker_manager/test_index.py`, `src/lambda/queue_processor/test_check_circuit_breaker.py`, `src/lambda/workflow_tracker/test_notify_circuit_breaker.py` cover alarm ALARM/OK per prior-state branch, health-check recovery elapsed vs. not elapsed, counter preservation on OPEN→HALF_OPEN, manual reset/get_state (including `last_error` removal), disabled / item-missing / OPEN / HALF_OPEN / DDB-error gate paths, OPEN-state SQS visibility extension, HALF_OPEN→CLOSED success path, and the ConditionalCheckFailedException race-loss branch.
20+
- **Outage-aware SQS retry backoff:** when the breaker is `OPEN`, `queue_processor` calls `sqs.ChangeMessageVisibility` to push the message invisibility out to `CircuitBreakerRecoveryTimeoutSeconds` (default 300 s). Keeps long Bedrock outages from burning through the source queue's `maxReceiveCount` and landing documents in the DLQ. Failures are non-fatal — messages fall back to the default 30 s visibility timeout.
21+
- **Counter preservation on recovery:** `OPEN → HALF_OPEN` transitions (both alarm-cleared and recovery-timeout paths) now preserve `failure_count` instead of resetting it to 0, so operators can see the cumulative failure count across a full outage. Manual reset still zeros both counters and removes `last_error` for a clean-slate restart.
22+
- **Docs:** `docs/circuit-breaker.md` and `src/lambda/circuit_breaker_manager/README.md`.
23+
- **Web UI visibility & admin controls** — the document list header shows a live status badge (green CLOSED / blue HALF_OPEN / red OPEN with `lastError` tooltip) powered by an AppSync subscription. Clicking the badge opens a details panel (state, `openedAt`, `failureCount`, `recoveryAttempts`, `lastError`). Users in the **Admin** Cognito group additionally see **Pause processing**, **Resume processing**, and **Probe recovery** buttons — each requires a reason that is persisted to DynamoDB and broadcast over the existing `AlertsTopic`. All automatic transitions (alarm fan-out, scheduled health-check probe) also publish to the subscription, so the badge reflects real state within ~1 s across every connected browser. Badge and panel are hidden entirely when `CircuitBreakerEnabled=false`.
24+
- New AppSync query/mutations/subscription: `getCircuitBreakerStatus`, `pauseCircuitBreaker`, `resumeCircuitBreaker`, `probeCircuitBreaker`, `onCircuitBreakerStatusChange` (backed by IAM-only `publishCircuitBreakerStatus` fan-out mutation).
25+
- New resolver Lambda (`nested/appsync/src/lambda/circuit_breaker_resolver/`) handles reads directly from `ConcurrencyTable` and forwards admin mutations to `circuit_breaker_manager` as new `manual_open` / `manual_close` / `manual_probe` actions (reasons recorded in `last_error`). Admin authorization is enforced at both the AppSync schema layer (`@aws_auth(cognito_groups: ["Admin"])`) and in the resolver.
26+
- `HALF_OPEN → CLOSED` recoveries fan out to the UI in real time. `workflow_tracker` async-invokes `circuit_breaker_manager` with a new `broadcast` action after it closes the breaker, so the badge flips from "Circuit: recovering" to "Circuit: closed" without a manual refresh. Previously the DDB write was correct but the UI stayed stuck on recovering because only the manager Lambda published to AppSync.
27+
- Badge label distinguishes `Manual pause by <user>` entries in `last_error` ("Circuit: manually paused") from automatic alarm-triggered opens ("Circuit: Bedrock outage"). Badge is also restyled as a Cloudscape `Button` to match the other header actions and moved to the right of **Release Review**.
28+
829
### Changed
930

1031
- **Replaced DSR with open-source SRT security scanning tool** — Migrated from deprecated internal DSR (Design Security Review) tool to the actively maintained open-source [Sample Security Review Tool (SRT)](https://github.com/aws-samples/sample-security-review-tool). Added automated security scanning in GitLab CI/CD pipeline that runs on merge requests targeting `develop` branch. Pipeline fails if security findings are detected, providing a security gate before production deployments. New Makefile targets: `make srt`, `make srt-setup`, `make srt-scan`, `make srt-fix`. Updated documentation in CLAUDE.md, CONTRIBUTING.md, and scripts/README.md.

docs/README.md

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@ This folder contains detailed documentation on various aspects of the GenAI Inte
7171
- [Monitoring](./monitoring.md) - Monitoring and logging capabilities
7272
- [Reporting Database](./reporting-database.md) - Analytics database for evaluation metrics and metering data
7373
- [Capacity Planning](./capacity-planning.md) - Performance optimization and resource scaling guidance
74+
- [Circuit Breaker](./circuit-breaker.md) - Automatic protection from cascading failures during Bedrock outages
7475
- [Cost Calculator](./cost-calculator.md) - Framework for estimating solution costs
7576

7677
## Planning & Security

docs/circuit-breaker.md

Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
1+
---
2+
title: "Circuit Breaker"
3+
---
4+
5+
Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
6+
SPDX-License-Identifier: MIT-0
7+
8+
# Circuit Breaker
9+
10+
The circuit breaker protects the IDP pipeline from cascading failures when Amazon Bedrock is degraded or unavailable. When Bedrock errors reach a configurable threshold within an alarm period, the circuit breaker **opens** and no new workflows are started. Messages stay in SQS instead of fanning out into Lambda retries that would eventually time out or burn through the Step Functions retry budget. Once Bedrock recovers, the breaker transitions through a **half-open** probe state back to **closed** and normal processing resumes.
11+
12+
## Why it exists
13+
14+
Without the circuit breaker, a Bedrock outage produces this chain:
15+
16+
1. Workflows start normally.
17+
2. Every Bedrock call hits the in-client retry loop (up to 7 attempts, exponential backoff up to 5 minutes).
18+
3. Step Functions retries another 8 times on failure.
19+
4. Individual executions hang for up to 15 minutes before failing.
20+
5. Meanwhile new documents keep getting pulled from SQS and starting more doomed workflows, which waste Lambda concurrency and inflate cost.
21+
22+
The circuit breaker short-circuits that cycle by refusing to start new workflows while Bedrock is unhealthy, so messages stay in the queue and process cleanly after recovery.
23+
24+
## States
25+
26+
| State | Behavior |
|-------|----------|
| **CLOSED** | Normal operation. All requests processed. |
| **OPEN** | Bedrock unavailable. Queue Processor returns messages to SQS for retry. |
| **HALF_OPEN** | Testing recovery. Limited traffic allowed through. First successful workflow closes the breaker; first new alarm reopens it. |
31+
32+
### Transitions
33+
34+
```
┌──────────┐
│  CLOSED  │◄─────────── Successful workflow in HALF_OPEN
└────┬─────┘             OR alarm returns to OK
     │
     │ CloudWatch alarm fires
     ▼
┌──────────┐
│   OPEN   │ ◄── Alarm fires during HALF_OPEN
└────┬─────┘
     │
     │ Recovery timeout OR alarm OK
     ▼
┌───────────┐
│ HALF_OPEN │
└───────────┘
```
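
The diagram can also be read as a lookup table. The following is a minimal Python sketch of that reading, not the manager Lambda's actual code. The event names (`alarm_fired`, `alarm_ok`, `recovery_timeout`, `workflow_succeeded`) and the `next_state` helper are hypothetical, and mapping `alarm_ok` in `HALF_OPEN` to `CLOSED` is one interpretation of the arrow into CLOSED.

```python
from enum import Enum


class State(str, Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


# Hypothetical event names for the triggers shown in the diagram.
TRANSITIONS = {
    (State.CLOSED, "alarm_fired"): State.OPEN,
    (State.HALF_OPEN, "alarm_fired"): State.OPEN,
    (State.OPEN, "recovery_timeout"): State.HALF_OPEN,
    (State.OPEN, "alarm_ok"): State.HALF_OPEN,
    (State.HALF_OPEN, "workflow_succeeded"): State.CLOSED,
    # Assumption: "alarm returns to OK" closes the breaker only from HALF_OPEN.
    (State.HALF_OPEN, "alarm_ok"): State.CLOSED,
}


def next_state(current: State, event: str) -> State:
    """Return the state after `event`; unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((current, event), current)
```

Under these assumptions a successful workflow has no effect while the breaker is `OPEN`; only the recovery timeout or a cleared alarm moves it to `HALF_OPEN` first.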
51+
52+
## Architecture
53+
54+
```
┌─────────────┐    SNS     ┌────────────────────┐    DynamoDB    ┌──────────────┐
│ CloudWatch  │───────────►│  Circuit Breaker   │◄──────────────►│ Concurrency  │
│   Alarm     │            │      Manager       │                │    Table     │
└─────────────┘            └────────────────────┘                └──────────────┘
                                     │                                  ▲
                                     │ SNS                              │
                                     ▼                                  │
                              ┌──────────────┐                          │
                              │ AlertsTopic  │                          │
                              │ (notify ops) │                          │
                              └──────────────┘                          │
                                                                        │
┌─────────────┐                                                         │
│    SQS      │────────────►┌─────────────────┐   check state before    │
│   Queue     │             │ Queue Processor │─────processing──────────┘
└─────────────┘             └─────────────────┘
                                     │
                                     │ if CLOSED or HALF_OPEN
                                     ▼
                            ┌─────────────────┐
                            │ Step Functions  │
                            │    Workflow     │
                            └─────────────────┘
```
79+
80+
- **BedrockServiceOutageAlarm**: CloudWatch MetricMath alarm on Bedrock error metrics (see [Alarm threshold](#alarm-threshold)).
81+
- **CircuitBreakerManager**: Lambda triggered by the alarm's SNS topic and by a 5-minute EventBridge schedule. Manages state transitions and publishes notifications.
82+
- **ConcurrencyTable**: Existing DynamoDB table — circuit breaker state is stored on the `circuit_breaker` partition key.
83+
- **QueueProcessor**: Reads state before starting workflows. If OPEN, it returns without starting the Step Functions execution, leaving the message in SQS.
84+
- **WorkflowTracker**: On a successful workflow completion, if state is HALF_OPEN, transitions to CLOSED.
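
The gate and the success notification described above can be sketched without AWS dependencies. This is an illustrative sketch, not the code in `queue_processor` or `workflow_tracker`: `InMemoryBreakerStore`, the `state` attribute name, the `"DISABLED"` return value, and the treatment of a missing item as `CLOSED` are all assumptions, and `ConditionalCheckFailed` is a local stand-in for DynamoDB's `ConditionalCheckFailedException`.

```python
from typing import Callable, Optional, Tuple


class ConditionalCheckFailed(Exception):
    """Local stand-in for DynamoDB's ConditionalCheckFailedException."""


class InMemoryBreakerStore:
    """Hypothetical stand-in for the `circuit_breaker` item in ConcurrencyTable."""

    def __init__(self, state: Optional[str] = "CLOSED") -> None:
        self.state = state

    def read(self) -> Optional[dict]:
        return None if self.state is None else {"state": self.state}

    def write_if(self, expected: str, new: str) -> None:
        # Mirrors a write guarded by a condition on the expected prior state.
        if self.state != expected:
            raise ConditionalCheckFailed(f"expected {expected}, found {self.state}")
        self.state = new


def check_circuit_breaker(
    enabled: bool, read: Callable[[], Optional[dict]]
) -> Tuple[bool, str]:
    """Return (allow_start, state) for one SQS message."""
    if not enabled:
        return True, "DISABLED"
    try:
        item = read()
    except Exception:
        return True, "ERROR"  # fail open: a state-store error must not block traffic
    if item is None:
        return True, "CLOSED"  # assumption: no item means the breaker never opened
    state = item.get("state", "CLOSED")
    return state != "OPEN", state


def notify_success(store: InMemoryBreakerStore) -> bool:
    """Close the breaker only if it is still HALF_OPEN; losing the race is a no-op."""
    try:
        store.write_if("HALF_OPEN", "CLOSED")
        return True
    except ConditionalCheckFailed:
        return False
```

Returning `False` from `notify_success` corresponds to the race-loss branch: the caller skips notifications and metrics because another writer already changed the state.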
85+
86+
## Alarm threshold
87+
88+
The `BedrockServiceOutageAlarm` uses MetricMath to sum the Bedrock error categories you opt into and compares the total to `CircuitBreakerFailureThreshold`.
89+
90+
Expression:
91+
92+
```
<SU> * FILL(m1, 0) + <Thr> * FILL(m2, 0) + <QL> * FILL(m3, 0)
```
95+
96+
Where each coefficient is `1` if the corresponding trigger is enabled and `0` otherwise. Metrics `m1`/`m2`/`m3` are `BedrockServiceUnavailable`, `BedrockThrottling`, and `BedrockQuotaLimit` under the stack namespace.
97+
98+
| Parameter | Default | Description |
|-----------|---------|-------------|
| `CircuitBreakerEnabled` | `false` | Master switch. Set to `true` to provision the alarm, SNS topic, manager Lambda, and traffic gate. |
| `CircuitBreakerTriggerServiceUnavailable` | `true` | Count 503 `ServiceUnavailableException` errors toward threshold. |
| `CircuitBreakerTriggerThrottling` | `false` | Count `ThrottlingException`, `TooManyRequestsException`, `RequestLimitExceeded`. |
| `CircuitBreakerTriggerQuotaLimit` | `false` | Count `ServiceQuotaExceededException`. |
| `CircuitBreakerFailureThreshold` | `3` | Combined error count per 5-minute period to breach. |
| `CircuitBreakerEvaluationPeriods` | `1` | Consecutive periods that must breach. |
| `CircuitBreakerRecoveryTimeoutSeconds` | `300` | Seconds before automatic OPEN → HALF_OPEN transition. |
| `CircuitBreakerErrorHandlerArn` | (empty) | Optional Lambda ARN invoked on state changes for custom handling. |
108+
109+
**Default behavior when enabled**: 3 or more `ServiceUnavailableException` errors in a single 5-minute window open the breaker. Throttling and quota-limit errors are not counted by default because those usually indicate client-side load issues, not a Bedrock outage. Enable additional triggers to protect against sustained throttling or quota exhaustion.
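
To make the expression concrete, here is a small Python sketch that evaluates it for one 5-minute period. It only mirrors the arithmetic described above; the trigger keys (`service_unavailable`, `throttling`, `quota_limit`) and the helper names are hypothetical, and the real evaluation happens inside CloudWatch.

```python
from typing import Mapping, Optional

# Trigger key (hypothetical) -> metric name from the expression above.
METRICS = {
    "service_unavailable": "BedrockServiceUnavailable",  # m1
    "throttling": "BedrockThrottling",                   # m2
    "quota_limit": "BedrockQuotaLimit",                  # m3
}

# Documented parameter defaults: only ServiceUnavailable counts.
DEFAULT_TRIGGERS = {"service_unavailable": True, "throttling": False, "quota_limit": False}


def alarm_value(
    counts: Mapping[str, float], triggers: Optional[Mapping[str, bool]] = None
) -> float:
    """Sum the enabled error categories, treating missing data as 0 like FILL(m, 0)."""
    triggers = DEFAULT_TRIGGERS if triggers is None else triggers
    total = 0.0
    for trigger, metric in METRICS.items():
        coefficient = 1 if triggers.get(trigger, False) else 0
        total += coefficient * counts.get(metric, 0)
    return total


def breaches(
    counts: Mapping[str, float],
    triggers: Optional[Mapping[str, bool]] = None,
    threshold: float = 3,
) -> bool:
    """True when the combined count reaches the threshold ("3 or more" at the default)."""
    return alarm_value(counts, triggers) >= threshold
```

For example, a period with 3 `BedrockServiceUnavailable` and 10 `BedrockThrottling` data points evaluates to 3 with the defaults, and to 13 once the throttling trigger is enabled.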
110+
111+
## Error categories
112+
113+
The Bedrock client in `idp_common` emits category-specific CloudWatch metrics under the stack namespace whenever it catches a retryable error:
114+
115+
| Metric | Bedrock exception code(s) | Typical cause |
|--------|---------------------------|---------------|
| `BedrockServiceUnavailable` | `ServiceUnavailableException` (503) | Bedrock service degradation or regional outage |
| `BedrockThrottling` | `ThrottlingException`, `TooManyRequestsException`, `RequestLimitExceeded` | Client-side throughput limits reached |
| `BedrockQuotaLimit` | `ServiceQuotaExceededException` | Account quota exhausted for the model |
120+
121+
These metrics are emitted unconditionally, independent of `CircuitBreakerEnabled`, so you can observe Bedrock error rates even when the circuit breaker is disabled.
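
The table amounts to a lookup from exception code to metric name. A minimal sketch of that lookup follows; `metric_for_error_code` is a hypothetical helper, not a function exported by `idp_common`.

```python
from typing import Optional

# Exception code -> metric name, taken from the table above.
ERROR_CODE_TO_METRIC = {
    "ServiceUnavailableException": "BedrockServiceUnavailable",
    "ThrottlingException": "BedrockThrottling",
    "TooManyRequestsException": "BedrockThrottling",
    "RequestLimitExceeded": "BedrockThrottling",
    "ServiceQuotaExceededException": "BedrockQuotaLimit",
}


def metric_for_error_code(code: str) -> Optional[str]:
    """Return the category metric for a Bedrock error code, or None if uncategorized."""
    return ERROR_CODE_TO_METRIC.get(code)
```

With botocore, the code for a caught `ClientError` is available as `err.response["Error"]["Code"]`, so a caller could pass that value straight into the lookup.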
122+
123+
## Enabling the circuit breaker
124+
125+
Set `CircuitBreakerEnabled=true` at deploy time:
126+
127+
```bash
# Enable the breaker and also count throttling errors toward the threshold
aws cloudformation deploy \
  --stack-name my-idp-stack \
  --template-file template.yaml \
  --parameter-overrides \
    CircuitBreakerEnabled=true \
    CircuitBreakerFailureThreshold=3 \
    CircuitBreakerTriggerThrottling=true
```
136+
137+
Or via the IDP CLI / console. When `CircuitBreakerEnabled=false` (the default), none of the circuit breaker resources are provisioned — no alarm, no SNS topic, no manager Lambda — and the Queue Processor skips the state check entirely.
138+
139+
## Tuning guidance
140+
141+
For GovCloud or other environments experiencing intermittent Bedrock outages:
142+
143+
```
CircuitBreakerFailureThreshold: 3
CircuitBreakerEvaluationPeriods: 1           # 5-minute window
CircuitBreakerRecoveryTimeoutSeconds: 300    # 5 minutes before probing
```
148+
149+
For stable environments where you want the breaker as a safety net only:
150+
151+
```
CircuitBreakerFailureThreshold: 5
CircuitBreakerEvaluationPeriods: 2           # 10-minute window
CircuitBreakerRecoveryTimeoutSeconds: 600    # 10 minutes
```
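
The window comments in both profiles follow from the 5-minute alarm period. The sketch below shows that arithmetic together with a recovery-timeout check; the function names are hypothetical, and using `>=` for the elapsed comparison is an assumption about how the health check compares timestamps.

```python
from datetime import datetime, timedelta, timezone

ALARM_PERIOD_SECONDS = 300  # the alarm evaluates 5-minute periods


def detection_window_seconds(evaluation_periods: int) -> int:
    """Time the error count must stay at or above the threshold before the alarm fires."""
    return evaluation_periods * ALARM_PERIOD_SECONDS


def recovery_timeout_elapsed(
    opened_at: datetime, now: datetime, recovery_timeout_seconds: int
) -> bool:
    """True once the breaker has been OPEN long enough to allow a HALF_OPEN probe."""
    return (now - opened_at).total_seconds() >= recovery_timeout_seconds


if __name__ == "__main__":
    opened = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(detection_window_seconds(2))  # safety-net profile
    print(recovery_timeout_elapsed(opened, opened + timedelta(minutes=10), 600))
```

Because the health check runs on a 5-minute schedule, the actual `OPEN → HALF_OPEN` promotion happens on the first scheduled run after the timeout has elapsed, not at the exact second.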
156+
157+
## Web UI
158+
159+
When `CircuitBreakerEnabled=true`, the document list header shows a live status badge that reflects the current breaker state via an AppSync subscription:
160+
161+
| Badge | State | Meaning |
|-------|-------|---------|
| Green "Circuit: closed" | CLOSED | Normal operation |
| Blue "Circuit: recovering" | HALF_OPEN | Probing recovery |
| Red "Circuit: Bedrock outage" | OPEN (automatic) | Opened by `BedrockServiceOutageAlarm`; hover for `lastError` |
| Red "Circuit: manually paused" | OPEN (manual) | Opened via the admin **Pause processing** control; hover for the reason |
167+
168+
When `CircuitBreakerEnabled=false` the badge is hidden entirely.
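
The badge rules can be summarized as a pure function of the breaker state. The sketch below is written in Python for illustration only and is not the UI code; `badge_for` is a hypothetical name, and detecting a manual pause by a `Manual pause by ` prefix in `lastError` is an assumption based on the changelog entry for this feature.

```python
from typing import Optional, Tuple

# Assumed marker that manual pauses leave in lastError.
MANUAL_PAUSE_PREFIX = "Manual pause by "


def badge_for(
    enabled: bool, state: str, last_error: Optional[str] = None
) -> Optional[Tuple[str, str]]:
    """Return (color, label) for the header badge, or None when the badge is hidden."""
    if not enabled:
        return None
    if state == "CLOSED":
        return ("green", "Circuit: closed")
    if state == "HALF_OPEN":
        return ("blue", "Circuit: recovering")
    if state == "OPEN":
        manual = (last_error or "").startswith(MANUAL_PAUSE_PREFIX)
        label = "Circuit: manually paused" if manual else "Circuit: Bedrock outage"
        return ("red", label)
    raise ValueError(f"unknown circuit breaker state: {state}")
```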
169+
170+
Click the badge to open a details panel showing `state`, `openedAt`, `lastCheckedAt`, `failureCount`, `recoveryAttempts`, and `lastError`. Users in the **Admin** Cognito group additionally see three controls:
171+
172+
- **Pause processing** — forces OPEN (available when state is CLOSED or HALF_OPEN). Use before planned Bedrock changes or to quiesce the pipeline.
173+
- **Resume processing** — forces CLOSED and resets failure/recovery counters. Use to clear a stuck OPEN state.
174+
- **Probe recovery** — forces HALF_OPEN (available when state is OPEN). Use to test recovery before the automatic timeout.
175+
176+
Each control requires a **reason** that is persisted to DynamoDB (`lastError` field for pause; also logged) and broadcast over the existing SNS alerts topic. All transitions — including automatic ones from CloudWatch alarms, the scheduled health check, and the `HALF_OPEN → CLOSED` transition triggered by a successful workflow completion — fan out to every connected browser in real time.
177+
178+
Non-admins can view the panel but do not see the control buttons.
179+
180+
## Manual operations
181+
182+
Reset the circuit breaker (force CLOSED):
183+
184+
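```bash
# AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload
aws lambda invoke --function-name <CircuitBreakerManagerFunctionName> \
  --cli-binary-format raw-in-base64-out \
  --payload '{"action": "reset"}' response.json
```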
188+
189+
Check current state:
190+
191+
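```bash
# AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload
aws lambda invoke --function-name <CircuitBreakerManagerFunctionName> \
  --cli-binary-format raw-in-base64-out \
  --payload '{"action": "get_state"}' response.json
```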
195+
196+
Or read state directly from DynamoDB:
197+
198+
```bash
aws dynamodb get-item \
  --table-name <ConcurrencyTable> \
  --key '{"counter_id": {"S": "circuit_breaker"}}'
```
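
The same manual actions can be scripted. The sketch below assumes a boto3 Lambda client is passed in; `invoke_breaker_action` is a hypothetical helper, and the shape of the manager's JSON response is not documented here, so the function simply returns whatever the Lambda sends back.

```python
import json
from typing import Any


def invoke_breaker_action(lambda_client: Any, function_name: str, action: str) -> Any:
    """Invoke the manager Lambda synchronously and return its decoded JSON response."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps({"action": action}).encode("utf-8"),
    )
    # The response payload is a stream; read it fully before decoding.
    return json.loads(response["Payload"].read())


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   client = boto3.client("lambda")
#   print(invoke_breaker_action(client, "<CircuitBreakerManagerFunctionName>", "get_state"))
```

Passing the client in as an argument keeps the helper easy to exercise with a fake client in unit tests.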
203+
204+
## Observability
205+
206+
CloudWatch metrics emitted by the circuit breaker (under the stack namespace):
207+
208+
- `CircuitBreakerOpened` — incremented each time the breaker transitions to OPEN
209+
- `CircuitBreakerHalfOpen` — incremented on transition to HALF_OPEN
210+
- `CircuitBreakerClosed` — incremented on transition to CLOSED
211+
212+
The `AlertsTopic` receives SNS notifications on every state transition so operators can subscribe email, SMS, or PagerDuty endpoints.
