| 1 | +--- |
| 2 | +title: "Circuit Breaker" |
| 3 | +--- |
| 4 | + |
| 5 | +Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. |
| 6 | +SPDX-License-Identifier: MIT-0 |
| 7 | + |
| 8 | +# Circuit Breaker |
| 9 | + |
| 10 | +Protects the IDP pipeline from cascading failures when Amazon Bedrock is degraded or unavailable. When Bedrock errors reach a configurable threshold, the circuit breaker **opens** and the Queue Processor stops starting new workflows. Messages stay in SQS instead of fanning out into Lambda retries that would eventually time out or exhaust the Step Functions retry budget. Once Bedrock recovers, the breaker passes through a **half-open** probe state back to **closed**, and normal processing resumes. |
| 11 | + |
| 12 | +## Why it exists |
| 13 | + |
| 14 | +Without the circuit breaker, a Bedrock outage produces this chain: |
| 15 | + |
| 16 | +1. Workflows start normally. |
| 17 | +2. Every Bedrock call hits the in-client retry loop (up to 7 attempts, exponential backoff up to 5 minutes). |
| 18 | +3. Step Functions retries another 8 times on failure. |
| 19 | +4. Individual executions hang for up to 15 minutes before failing. |
| 20 | +5. Meanwhile, new documents keep being pulled from SQS and start more doomed workflows, which wastes Lambda concurrency and inflates cost. |
| 21 | + |
| 22 | +The circuit breaker short-circuits that cycle by refusing to start new workflows while Bedrock is unhealthy, so messages stay in the queue and process cleanly after recovery. |
| 23 | + |
| 24 | +## States |
| 25 | + |
| 26 | +| State | Behavior | |
| 27 | +|-------|----------| |
| 28 | +| **CLOSED** | Normal operation. All requests processed. | |
| 29 | +| **OPEN** | Bedrock unavailable. Queue Processor returns messages to SQS for retry. | |
| 30 | +| **HALF_OPEN** | Testing recovery. Limited traffic allowed through. First successful workflow closes the breaker; first new alarm reopens it. | |
| 31 | + |
| 32 | +### Transitions |
| 33 | + |
| 34 | +``` |
| 35 | +     ┌──────────┐ |
| 36 | +     │  CLOSED  │◄─────────── Successful workflow in HALF_OPEN |
| 37 | +     └────┬─────┘             OR alarm returns to OK |
| 38 | +          │ |
| 39 | +          │ CloudWatch alarm fires |
| 40 | +          ▼ |
| 41 | +     ┌──────────┐ |
| 42 | +     │   OPEN   │ ◄── Alarm fires during HALF_OPEN |
| 43 | +     └────┬─────┘ |
| 44 | +          │ |
| 45 | +          │ Recovery timeout OR alarm OK |
| 46 | +          ▼ |
| 47 | +     ┌───────────┐ |
| 48 | +     │ HALF_OPEN │ |
| 49 | +     └───────────┘ |
| 50 | +``` |
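The diagram above can be read as a small transition table. The following is a minimal sketch of that reading, not the manager Lambda's actual code: the `State`, `Event`, and `next_state` names are hypothetical, and it assumes that "alarm OK" moves OPEN to HALF_OPEN and HALF_OPEN to CLOSED, which is one interpretation of the two "alarm OK" labels.

```python
from enum import Enum


class State(str, Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class Event(str, Enum):
    # Hypothetical event names for the triggers shown in the diagram.
    ALARM_FIRED = "ALARM_FIRED"
    ALARM_OK = "ALARM_OK"
    RECOVERY_TIMEOUT = "RECOVERY_TIMEOUT"
    WORKFLOW_SUCCEEDED = "WORKFLOW_SUCCEEDED"


# (current state, event) -> next state, as read from the diagram.
TRANSITIONS = {
    (State.CLOSED, Event.ALARM_FIRED): State.OPEN,
    (State.OPEN, Event.RECOVERY_TIMEOUT): State.HALF_OPEN,
    (State.OPEN, Event.ALARM_OK): State.HALF_OPEN,
    (State.HALF_OPEN, Event.WORKFLOW_SUCCEEDED): State.CLOSED,
    (State.HALF_OPEN, Event.ALARM_OK): State.CLOSED,  # assumption, see lead-in
    (State.HALF_OPEN, Event.ALARM_FIRED): State.OPEN,
}


def next_state(state: State, event: Event) -> State:
    """Return the next state; pairs not in the table leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

Keeping the transitions in one table makes it easy to check that every arrow in the diagram has exactly one entry, and that events with no arrow (for example, a successful workflow while CLOSED) are no-ops.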
| 51 | + |
| 52 | +## Architecture |
| 53 | + |
| 54 | +``` |
| 55 | +┌─────────────┐    SNS     ┌────────────────────┐    DynamoDB    ┌──────────────┐ |
| 56 | +│ CloudWatch  │───────────►│  Circuit Breaker   │◄──────────────►│ Concurrency  │ |
| 57 | +│   Alarm     │            │      Manager       │                │    Table     │ |
| 58 | +└─────────────┘            └────────────────────┘                └──────────────┘ |
| 59 | +                                     │                                  ▲ |
| 60 | +                                     │ SNS                              │ |
| 61 | +                                     ▼                                  │ |
| 62 | +                              ┌──────────────┐                          │ |
| 63 | +                              │ AlertsTopic  │                          │ |
| 64 | +                              │ (notify ops) │                          │ |
| 65 | +                              └──────────────┘                          │ |
| 66 | +                                                                        │ |
| 67 | +┌─────────────┐                                                         │ |
| 68 | +│    SQS      │────────────►┌─────────────────┐   check state before    │ |
| 69 | +│   Queue     │             │ Queue Processor │─────processing──────────┘ |
| 70 | +└─────────────┘             └─────────────────┘ |
| 71 | +                                     │ |
| 72 | +                                     │ if CLOSED or HALF_OPEN |
| 73 | +                                     ▼ |
| 74 | +                            ┌─────────────────┐ |
| 75 | +                            │ Step Functions  │ |
| 76 | +                            │    Workflow     │ |
| 77 | +                            └─────────────────┘ |
| 78 | +``` |
| 79 | + |
| 80 | +- **BedrockServiceOutageAlarm**: CloudWatch MetricMath alarm on Bedrock error metrics (see [Alarm threshold](#alarm-threshold)). |
| 81 | +- **CircuitBreakerManager**: Lambda triggered by the alarm's SNS topic and by a 5-minute EventBridge schedule. Manages state transitions and publishes notifications. |
| 82 | +- **ConcurrencyTable**: Existing DynamoDB table. Circuit breaker state is stored in the item whose partition key `counter_id` is `circuit_breaker`. |
| 83 | +- **QueueProcessor**: Reads state before starting workflows. If OPEN, it returns without starting the Step Functions execution, leaving the message in SQS. |
| 84 | +- **WorkflowTracker**: On a successful workflow completion, if state is HALF_OPEN, transitions to CLOSED. |
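The Queue Processor's gate can be sketched as a pure function over the stored breaker item. This is an illustration only: `should_start_workflow` is a hypothetical name, the `state` attribute name is assumed, and treating a missing item as CLOSED is an assumption rather than documented behavior. The sketch also does not model the "limited traffic" aspect of HALF_OPEN.

```python
from typing import Optional


def should_start_workflow(breaker_enabled: bool, breaker_item: Optional[dict]) -> bool:
    """Return True when a Step Functions execution may be started."""
    if not breaker_enabled:
        # CircuitBreakerEnabled=false: the state check is skipped entirely.
        return True
    # Assumption: a missing item or missing "state" attribute means CLOSED.
    state = (breaker_item or {}).get("state", "CLOSED")
    # Only OPEN blocks new workflows; CLOSED and HALF_OPEN let traffic through.
    return state != "OPEN"
```

When the function returns `False`, the caller would return without starting the execution, so the message stays in SQS and is retried later.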
| 85 | + |
| 86 | +## Alarm threshold |
| 87 | + |
| 88 | +The `BedrockServiceOutageAlarm` uses MetricMath to sum the Bedrock error categories you opt into and compares the total to `CircuitBreakerFailureThreshold`. |
| 89 | + |
| 90 | +Expression: |
| 91 | + |
| 92 | +``` |
| 93 | +<SU> * FILL(m1, 0) + <Thr> * FILL(m2, 0) + <QL> * FILL(m3, 0) |
| 94 | +``` |
| 95 | + |
| 96 | +Each coefficient (`<SU>`, `<Thr>`, `<QL>`) is `1` if the corresponding trigger parameter is enabled and `0` otherwise. Metrics `m1`, `m2`, and `m3` are `BedrockServiceUnavailable`, `BedrockThrottling`, and `BedrockQuotaLimit` under the stack namespace; `FILL(m, 0)` treats periods with no data as zero errors. |
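To make the expression concrete, here is a small sketch that evaluates it for one 5-minute period. The function names are hypothetical; `None` stands in for a period with no datapoint, which `FILL(m, 0)` replaces with zero. The keyword defaults mirror the trigger defaults listed in the parameter table.

```python
from typing import Optional


def fill(value: Optional[float], default: float = 0.0) -> float:
    """Mimic FILL(m, default): substitute a default for a missing datapoint."""
    return default if value is None else value


def outage_score(
    m1: Optional[float],  # BedrockServiceUnavailable
    m2: Optional[float],  # BedrockThrottling
    m3: Optional[float],  # BedrockQuotaLimit
    *,
    count_service_unavailable: bool = True,
    count_throttling: bool = False,
    count_quota_limit: bool = False,
) -> float:
    """Sum only the error categories whose trigger is enabled."""
    return (
        int(count_service_unavailable) * fill(m1)
        + int(count_throttling) * fill(m2)
        + int(count_quota_limit) * fill(m3)
    )
```

With the defaults, `outage_score(3, 10, None)` is `3`, because throttling is not counted; enabling the throttling trigger raises the same period to `13`.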
| 97 | + |
| 98 | +| Parameter | Default | Description | |
| 99 | +|-----------|---------|-------------| |
| 100 | +| `CircuitBreakerEnabled` | `false` | Master switch. Set to `true` to provision the alarm, SNS topic, manager Lambda, and traffic gate. | |
| 101 | +| `CircuitBreakerTriggerServiceUnavailable` | `true` | Count 503 `ServiceUnavailableException` errors toward threshold. | |
| 102 | +| `CircuitBreakerTriggerThrottling` | `false` | Count `ThrottlingException`, `TooManyRequestsException`, `RequestLimitExceeded`. | |
| 103 | +| `CircuitBreakerTriggerQuotaLimit` | `false` | Count `ServiceQuotaExceededException`. | |
| 104 | +| `CircuitBreakerFailureThreshold` | `3` | Combined error count per 5-minute period at or above which the alarm breaches. | |
| 105 | +| `CircuitBreakerEvaluationPeriods` | `1` | Consecutive periods that must breach. | |
| 106 | +| `CircuitBreakerRecoveryTimeoutSeconds` | `300` | Seconds before automatic OPEN → HALF_OPEN transition. | |
| 107 | +| `CircuitBreakerErrorHandlerArn` | (empty) | Optional Lambda ARN invoked on state changes for custom handling. | |
| 108 | + |
| 109 | +**Default behavior when enabled**: 3 or more `ServiceUnavailableException` errors in a single 5-minute window open the breaker. Throttling and quota-limit errors are not counted by default because those usually indicate client-side load issues, not a Bedrock outage. Enable additional triggers to protect against sustained throttling or quota exhaustion. |
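The interaction between `CircuitBreakerFailureThreshold` and `CircuitBreakerEvaluationPeriods` can be sketched as follows. This is a simplified model under stated assumptions: `alarm_breaches` is a hypothetical helper, the comparison is "greater than or equal" (inferred from "3 or more" above), and CloudWatch's datapoints-to-alarm and missing-data settings are not modeled.

```python
from typing import Sequence


def alarm_breaches(
    period_totals: Sequence[float],
    threshold: float = 3,
    evaluation_periods: int = 1,
) -> bool:
    """Return True when the most recent N periods all meet the threshold.

    period_totals holds the combined error count per 5-minute period,
    oldest first.
    """
    if len(period_totals) < evaluation_periods:
        return False  # not enough history to evaluate yet
    recent = period_totals[-evaluation_periods:]
    return all(total >= threshold for total in recent)
```

For example, with the defaults a single period containing three errors breaches, while a threshold of 5 over 2 periods needs two consecutive periods of five or more errors.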
| 110 | + |
| 111 | +## Error categories |
| 112 | + |
| 113 | +The Bedrock client in `idp_common` emits category-specific CloudWatch metrics under the stack namespace whenever it catches a retryable error: |
| 114 | + |
| 115 | +| Metric | Bedrock exception code(s) | Typical cause | |
| 116 | +|--------|---------------------------|---------------| |
| 117 | +| `BedrockServiceUnavailable` | `ServiceUnavailableException` (503) | Bedrock service degradation or regional outage | |
| 118 | +| `BedrockThrottling` | `ThrottlingException`, `TooManyRequestsException`, `RequestLimitExceeded` | Client-side throughput limits reached | |
| 119 | +| `BedrockQuotaLimit` | `ServiceQuotaExceededException` | Account quota exhausted for the model | |
| 120 | + |
| 121 | +These metrics are emitted unconditionally, independent of `CircuitBreakerEnabled`, so you can observe Bedrock error rates even when the circuit breaker is disabled. |
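The table above amounts to a lookup from exception code to metric name. A minimal sketch of that lookup, not the `idp_common` implementation, is shown below; `metric_for_error` is a hypothetical helper. With botocore, the code would typically come from `err.response["Error"]["Code"]` on a `ClientError`.

```python
from typing import Optional

# Exception code -> CloudWatch metric name, copied from the table above.
METRIC_BY_ERROR_CODE = {
    "ServiceUnavailableException": "BedrockServiceUnavailable",
    "ThrottlingException": "BedrockThrottling",
    "TooManyRequestsException": "BedrockThrottling",
    "RequestLimitExceeded": "BedrockThrottling",
    "ServiceQuotaExceededException": "BedrockQuotaLimit",
}


def metric_for_error(error_code: str) -> Optional[str]:
    """Return the metric to increment, or None for codes outside the table."""
    return METRIC_BY_ERROR_CODE.get(error_code)
```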
| 122 | + |
| 123 | +## Enabling the circuit breaker |
| 124 | + |
| 125 | +Set `CircuitBreakerEnabled=true` at deploy time: |
| 126 | + |
| 127 | +```bash |
| 128 | +aws cloudformation deploy \ |
| 129 | + --stack-name my-idp-stack \ |
| 130 | + --template-file template.yaml \ |
| 131 | + --parameter-overrides \ |
| 132 | + CircuitBreakerEnabled=true \ |
| 133 | + CircuitBreakerFailureThreshold=3 \ |
| 134 | + CircuitBreakerTriggerThrottling=true |
| 135 | +``` |
| 136 | + |
| 137 | +You can also set these parameters through the IDP CLI or the console. When `CircuitBreakerEnabled=false` (the default), none of the circuit breaker resources (alarm, SNS topic, manager Lambda) are provisioned, and the Queue Processor skips the state check entirely. |
| 138 | + |
| 139 | +## Tuning guidance |
| 140 | + |
| 141 | +For GovCloud or other environments experiencing intermittent Bedrock outages: |
| 142 | + |
| 143 | +``` |
| 144 | +CircuitBreakerFailureThreshold: 3 |
| 145 | +CircuitBreakerEvaluationPeriods: 1 # 5-minute window |
| 146 | +CircuitBreakerRecoveryTimeoutSeconds: 300 # 5 minutes before probing |
| 147 | +``` |
| 148 | + |
| 149 | +For stable environments where you want the breaker as a safety net only: |
| 150 | + |
| 151 | +``` |
| 152 | +CircuitBreakerFailureThreshold: 5 |
| 153 | +CircuitBreakerEvaluationPeriods: 2 # 10-minute window |
| 154 | +CircuitBreakerRecoveryTimeoutSeconds: 600 # 10 minutes |
| 155 | +``` |
| 156 | + |
| 157 | +## Web UI |
| 158 | + |
| 159 | +When `CircuitBreakerEnabled=true`, the document list header shows a live status badge that reflects the current breaker state via an AppSync subscription: |
| 160 | + |
| 161 | +| Badge | State | Meaning | |
| 162 | +|-------|-------|---------| |
| 163 | +| Green "Circuit: closed" | CLOSED | Normal operation | |
| 164 | +| Blue "Circuit: recovering" | HALF_OPEN | Probing recovery | |
| 165 | +| Red "Circuit: Bedrock outage" | OPEN (automatic) | Opened by `BedrockServiceOutageAlarm`; hover for `lastError` | |
| 166 | +| Red "Circuit: manually paused" | OPEN (manual) | Opened via the admin **Pause processing** control; hover for the reason | |
| 167 | + |
| 168 | +When `CircuitBreakerEnabled=false` the badge is hidden entirely. |
| 169 | + |
| 170 | +Click the badge to open a details panel showing `state`, `openedAt`, `lastCheckedAt`, `failureCount`, `recoveryAttempts`, and `lastError`. Users in the **Admin** Cognito group additionally see three controls: |
| 171 | + |
| 172 | +- **Pause processing** — forces OPEN (available when state is CLOSED or HALF_OPEN). Use before planned Bedrock changes or to quiesce the pipeline. |
| 173 | +- **Resume processing** — forces CLOSED and resets failure/recovery counters. Use to clear a stuck OPEN state. |
| 174 | +- **Probe recovery** — forces HALF_OPEN (available when state is OPEN). Use to test recovery before the automatic timeout. |
| 175 | + |
| 176 | +Each control requires a **reason**, which is persisted to DynamoDB (in the `lastError` field for pause; it is also logged) and broadcast over the existing SNS alerts topic. All transitions fan out to every connected browser in real time, including automatic ones from CloudWatch alarms, the scheduled health check, and the `HALF_OPEN → CLOSED` transition triggered by a successful workflow completion. |
| 177 | + |
| 178 | +Non-admins can view the panel but do not see the control buttons. |
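The rules for the three admin controls can be summarized in a short sketch. It is illustrative only: `ADMIN_CONTROLS` and `apply_control` are hypothetical names, and the states from which **Resume processing** is available are an assumption, since the text only says it forces CLOSED.

```python
# control -> (target state, states in which the control is available)
ADMIN_CONTROLS = {
    "pause": ("OPEN", {"CLOSED", "HALF_OPEN"}),
    "probe": ("HALF_OPEN", {"OPEN"}),
    "resume": ("CLOSED", {"OPEN", "HALF_OPEN"}),  # availability is assumed
}


def apply_control(state: str, control: str, reason: str, is_admin: bool) -> str:
    """Validate an admin control and return the state it forces."""
    if not is_admin:
        raise PermissionError("only Admin group members can change breaker state")
    if not reason or not reason.strip():
        raise ValueError("a reason is required for every control")
    target, allowed_from = ADMIN_CONTROLS[control]
    if state not in allowed_from:
        raise ValueError(f"{control} is not available while state is {state}")
    return target
```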
| 179 | + |
| 180 | +## Manual operations |
| 181 | + |
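| 182 | +Reset the circuit breaker (force CLOSED). The `aws lambda invoke` commands below pass raw JSON payloads; on AWS CLI v2, add `--cli-binary-format raw-in-base64-out` so the payload is not read as base64: |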
| 183 | + |
| 184 | +```bash |
| 185 | +aws lambda invoke --function-name <CircuitBreakerManagerFunctionName> \ |
| 186 | + --payload '{"action": "reset"}' response.json |
| 187 | +``` |
| 188 | + |
| 189 | +Check current state: |
| 190 | + |
| 191 | +```bash |
| 192 | +aws lambda invoke --function-name <CircuitBreakerManagerFunctionName> \ |
| 193 | + --payload '{"action": "get_state"}' response.json |
| 194 | +``` |
| 195 | + |
| 196 | +Or read state directly from DynamoDB: |
| 197 | + |
| 198 | +```bash |
| 199 | +aws dynamodb get-item \ |
| 200 | + --table-name <ConcurrencyTable> \ |
| 201 | + --key '{"counter_id": {"S": "circuit_breaker"}}' |
| 202 | +``` |
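The `get-item` output uses DynamoDB's typed JSON (`{"S": ...}`, `{"N": ...}`). The sketch below flattens it into a plain dictionary for scripting. The helper names are hypothetical, only a few scalar types are handled, and the attribute names in the usage example (`state`, `failureCount`) are borrowed from the Web UI panel as an assumption about how the item is stored. With boto3 installed, `boto3.dynamodb.types.TypeDeserializer` offers a complete alternative.

```python
import json
from decimal import Decimal
from typing import Any, Optional


def unmarshal(attr: dict) -> Any:
    """Convert one DynamoDB typed value, e.g. {"S": "OPEN"}, to a Python value."""
    type_tag, value = next(iter(attr.items()))
    if type_tag == "S":
        return value
    if type_tag == "N":
        return Decimal(value)  # DynamoDB transmits numbers as strings
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    raise ValueError(f"unsupported DynamoDB type tag: {type_tag}")


def read_breaker_item(get_item_output: str) -> Optional[dict]:
    """Parse the JSON printed by `aws dynamodb get-item`; None if no item exists."""
    item = json.loads(get_item_output).get("Item")
    if item is None:
        return None
    return {name: unmarshal(attr) for name, attr in item.items()}
```

A usage example with hypothetical item contents:

```python
sample = '{"Item": {"counter_id": {"S": "circuit_breaker"}, "state": {"S": "OPEN"}, "failureCount": {"N": "3"}}}'
print(read_breaker_item(sample)["state"])
```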
| 203 | + |
| 204 | +## Observability |
| 205 | + |
| 206 | +CloudWatch metrics emitted by the circuit breaker (under the stack namespace): |
| 207 | + |
| 208 | +- `CircuitBreakerOpened` — incremented each time the breaker transitions to OPEN |
| 209 | +- `CircuitBreakerHalfOpen` — incremented on transition to HALF_OPEN |
| 210 | +- `CircuitBreakerClosed` — incremented on transition to CLOSED |
| 211 | + |
| 212 | +The `AlertsTopic` receives SNS notifications on every state transition so operators can subscribe email, SMS, or PagerDuty endpoints. |