From 878f2177d2e92b88b5ba4edd90495670ff73451f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?James=20Pether=20S=C3=B6rling?= Date: Wed, 16 Apr 2025 23:54:25 +0200 Subject: [PATCH 01/10] Update README.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: James Pether Sörling --- README.md | 313 +++++++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 252 insertions(+), 61 deletions(-) diff --git a/README.md b/README.md index 8fad1929..20e5c28f 100644 --- a/README.md +++ b/README.md @@ -1,88 +1,279 @@ -# Lambda in Private VPC +# 🚀 Lambda in Private VPC -**Status:** Work in Progress +![License](https://img.shields.io/github/license/Hack23/lambda-in-private-vpc.svg) +[![OpenSSF Scorecard](https://api.securityscorecards.dev/projects/github.com/Hack23/lambda-in-private-vpc/badge)](https://securityscorecards.dev/viewer/?uri=github.com/Hack23/lambda-in-private-vpc) +[![CI/CD](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/main.yml/badge.svg)](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/main.yml) +[![Scorecard Security](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/scorecard.yml/badge.svg?branch=main)](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/scorecard.yml) -This project shows how to build a highly available system that runs in multiple AWS regions at the same time. It uses AWS Resilience Hub to ensure compliance with policies for Recovery Time Objective (RTO) and Recovery Point Objective (RPO), which help to minimize downtime and data loss in case of failures at the application, availability zone, or region level. This ensures high availability and fault tolerance for your applications. +> **Description:** A highly available system spanning **Ireland** and **Frankfurt** AWS regions, enforcing RTO/RPO via AWS Resilience Hub, chaos‑tested by FIS, and fronted by API Gateway with Route 53 failover & WAF protection. 
-## Badges +--- -[![License](https://img.shields.io/github/license/Hack23/lambda-in-private-vpc.svg)](https://github.com/Hack23/lambda-in-private-vpc/raw/master/LICENSE.md) [![OpenSSF Scorecard](https://api.securityscorecards.dev/projects/github.com/Hack23/lambda-in-private-vpc/badge)](https://scorecard.dev/viewer/?uri=github.com/Hack23/lambda-in-private-vpc) -[![Verify and Deploy](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/main.yml/badge.svg)](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/main.yml) -[![Scorecard supply-chain security](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/scorecard.yml/badge.svg?branch=main)](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/scorecard.yml) +## 📋 Table of Contents -## CloudFormation Templates +- [🧠 Project Mindmap](#-project-mindmap) +- [🌐 Overview](#-overview) +- [📐 Architecture](#-architecture) +- [🔗 Network Topology](#-network-topology) +- [🚦 CI/CD Workflow](#-ci-cd-workflow) +- [🚨 Disaster Recovery Flow](#-disaster-recovery-flow) +- [🛡️ Resilience Hub Policy](#️-resilience-hub-policy) +- [🖼️ Screenshots](#️-screenshots) +- [📦 Templates](#-templates) +- [🛠️ Tech Stack](#️-tech-stack) +- [📖 Runbooks](#-runbooks) +- [🔗 References](#-references) +- [📄 License](#-license) -The project includes several AWS CloudFormation templates that automate the creation and management of the necessary AWS resources: +--- -- `app.yml`: This template sets up an application named "lambda-vpc" with a ResilienceHub ResiliencyPolicy. The application includes AWS Lambda functions, API Gateway Rest APIs, and DynamoDB Global Tables. -- `disaster-recovery.yml`: This template sets up a disaster recovery test using AWS Fault Injection Simulator (FIS). The experiments include denying access to Lambda on API Gateway, deleting a DynamoDB table, and recovering a DynamoDB table from a point-in-time recovery (PITR) or a backup. 
-- `template.yml`: This template deploys a Lambda function in a private VPC with internet access. The function can access resources in the VPC and make outbound calls to the internet. -- `route53.yml`: This template sets up DNS records in Amazon Route 53 for two API Gateway Rest APIs. The DNS records are configured for failover routing, which means that if one API becomes unavailable, traffic will be routed to the other API. +## 🧠 Project Mindmap -## Concepts +```mermaid +mindmap + root((Lambda in Private VPC)) + Infra + VPC + Subnets + Endpoints + ACLs & SGs + Compute + LambdaHealth + LambdaCRUD + API + API_Gateway + CustomDomain + Route53 + Resilience + ResilienceHub + FIS + WAFv2 + CI_CD + Linting + SecurityScans + Deploy + Data + DynamoDBGlobal + DeadLetterSNS +``` -Learn more about AWS Resilience Hub concepts and understand the key terms and principles involved in building resilient applications [here](https://docs.aws.amazon.com/resilience-hub/latest/userguide/concepts-terms.html). 
+--- -[Disaster Recovery (DR) Architecture on AWS, Part I: Strategies for Recovery in the Cloud -](https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/) -[Disaster Recovery (DR) Architecture on AWS, Part IV: Multi-site Active/Active](https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/) +## 🌐 Overview -## About Hack23 +Run Lambda inside private subnets in two regions, with: -- Website: [www.hack23.com](https://www.hack23.com/) -- LinkedIn: [in/jamessorling](https://www.linkedin.com/in/jamessorling) +- **No public access**: VPC Endpoints for S3, EC2, DynamoDB +- **Multi-region failover**: Route 53 weighted DNS +- **Resiliency**: AWS Resilience Hub policies & AWS FIS chaos +- **Layer7 Security**: AWS WAFv2 rules +- **CI/CD**: GitHub Actions with CFN lint, cfn-nag, Checkov, ZAP, scoring, and cross‑account deploys -## Runbooks +--- -- [DynamoDB Runbook](https://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/automation-ref-ddb.html) - Automates the management of DynamoDB tables and indexes. -- [Lambda Runbook](https://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/automation-ref-lam.html) - Helps manage Lambda functions, layers, and aliases. -- [Application Bridge Runbook](https://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/automation-ref-abp.html) - Supports management of Amazon App Runner services and custom domains. -- [IAM Runbook](https://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/automation-ref-iam.html) - Facilitates IAM user, group, role, and policy management. 
+## 📐 Architecture -## Architecture Diagrams +```mermaid +flowchart LR + subgraph Ireland [eu-west-1] + VPC1[VPC: 10.1.0.0/16] + Subs1["Private Subnets A/B/C"] + EP1["S3/EC2/DDB Endpoints"] + Lambdas1["Lambda (Health & CRUD)"] + APIGW1["API Gateway"] + end -- ![Infrastructure](cloudformation/template.png) - Depicts the overall infrastructure, including AWS services and components. -- ![DNS Route53](cloudformation/route53.png) - Shows the Route 53 configuration for DNS routing and failover. -- ![Web Application Firewall](cloudformation/waf.png) - Displays the setup of the Web Application Firewall for securing your application. -- ![Disaster Recovery](cloudformation/disaster-recovery.png) - Illustrates the disaster recovery strategy for the application. + subgraph Frankfurt [eu-central-1] + VPC2[VPC: 10.5.0.0/16] + Subs2["Private Subnets A/B/C"] + EP2["S3/EC2/DDB Endpoints"] + Lambdas2["Lambda (Health & CRUD)"] + APIGW2["API Gateway"] + end -## Resilience Hub Screenshots + Ireland --> Subs1 --> Lambdas1 --> EP1 + Subs1 --> APIGW1 + Frankfurt --> Subs2 --> Lambdas2 --> EP2 + Subs2 --> APIGW2 -- ![Resilience Hub Policy](ResilienceHubPolicy.png) - Overview of the policy settings in AWS Resilience Hub. -- ![Application](ResiliencyHub-App.png) - The application setup and components in AWS Resilience Hub. -- ![App Recommendation 1](ResiliencyHub-App-rec1.png) - First set of recommendations for improving application resiliency. -- ![App Recommendation 2](ResiliencyHub-App-rec2.png) - Second set of recommendations for enhancing application resiliency. -- ![Region](ResHub-region.png) - Regional recommendations + APIGW1 -. Failover .-> Route53 + APIGW2 -. 
Failover .-> Route53 + classDef region fill:#f9f,stroke:#333,stroke-width:1px; + class Ireland,Frankfurt region +``` -## Tech Stack -Hack23/lambda-in-private-vpc is built on the following main stack: +--- -- GitHub Actions [GitHub Actions](https://github.com/features/actions) – Continuous Integration +## 🔗 Network Topology -Full tech stack [here](/techstack.md) +```mermaid +graph TD + VPC[VPC
10.1.0.0/16] + subgraph Subnets["Private Subnets"] + S1[10.1.0.0/24] + S2[10.1.1.0/24] + S3[10.1.2.0/24] + end + VPC --> Subnets + Subnets --> ACL[Network ACL] + Subnets --> SG_L[Lambda SG] + subgraph Endpoints["VPC Endpoints"] + EP_S3[S3] + EP_EC2[EC2] + EP_DDB[DynamoDB] + end + Subnets --> Endpoints +``` -## Relevant Links +--- -- [Route53 Application Recovery Controller](https://aws.amazon.com/route53/application-recovery-controller/) - Service for managing and testing application recovery across AWS Regions. -- [Route53 Resolver DNS Firewall](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resolver-dns-firewall.html) - A managed DNS firewall service to protect applications from malicious DNS activity. -- [SLA MAX Calculator](https://github.com/mikaelvesavuori/slamax) and [Cloud SLA](https://github.com/mikaelvesavuori/cloud-sla) - Tools for calculating and comparing cloud service SLAs. +## 🚦 CI/CD Workflow -For more information on AWS service level agreements, visit the [AWS SLA page](https://aws.amazon.com/legal/service-level-agreements/). 
+```mermaid +flowchart TD + A[Push or Dispatch] --> B{Lint & Security} + B --> C[cfn-lint] + B --> D[cfn-nag] + B --> E[Checkov] + B --> F[StandardLint] + F --> G[DependencyReview] + E --> H[Scorecard] + C --> I[ZAP API Scan] + H --> J{Deploy Jobs} + J --> K[Deploy Ireland] + K --> L[Collect Outputs] + L --> M[Deploy Frankfurt] + M --> N[Route53 Stack] + N --> O[DR Stack] + O --> P[ResilienceHub App] + P --> Q[Tag & Release] +``` -## Additional Documentation +--- -- [CHANGELOG.md](CHANGELOG.md) -- [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md) -- [CONTRIBUTING.md](CONTRIBUTING.md) -- [LICENSE.md](LICENSE.md) -- [SECURITY.md](SECURITY.md) -- [AlarmRecommendation-apigateway/alarm/AlarmRecommendation-apigateway-Alarm-172017021075-eu-west-1.json](AlarmRecommendation-apigateway/alarm/AlarmRecommendation-apigateway-Alarm-172017021075-eu-west-1.json) -- [AlarmRecommendation-apigateway/alarm/AlarmRecommendation-apigateway-Alarm-172017021075-eu-west-2.json](AlarmRecommendation-apigateway/alarm/AlarmRecommendation-apigateway-Alarm-172017021075-eu-west-2.json) -- [AlarmRecommendation-apigateway/manifest.json](AlarmRecommendation-apigateway/manifest.json) -- [AlarmRecommendation-apigateway/README.md](AlarmRecommendation-apigateway/README.md) -- [SopRecommendation-apigateway/sop/SopRecommendation-apigateway-Sop-172017021075-eu-west-1.json](SopRecommendation-apigateway/sop/SopRecommendation-apigateway-Sop-172017021075-eu-west-1.json) -- [SopRecommendation-apigateway/manifest.json](SopRecommendation-apigateway/manifest.json) -- [SopRecommendation-apigateway/README.md](SopRecommendation-apigateway/README.md) +## 🚨 Disaster Recovery Flow -## License -This project is licensed under the Apache License 2.0. 
+```mermaid +sequenceDiagram + participant U as User + participant R53 as Route53 + participant GW as API_Gateway + participant L as Lambda + participant D as DynamoDB + U->>R53: GET /v1/healthcheck + R53->>GW: Route to region + GW->>L: Invoke healthcheck + L-->>U: "OK" + Note over FIS: Inject failures + FIS->>GW: Deny invoke + FIS->>D: Delete table + alt Recovery + R53->>GW: Failover to backup + GW->>L: Invoke fallback + end +``` + +--- + +## 🛡️ Resilience Hub Policy + +```mermaid +stateDiagram-v2 + [*] --> Region + Region --> AZ + AZ --> Hardware + Hardware --> Software + Software --> [*] + + state Region { + RTO: 3600s + RPO: 5s + } + state AZ { + RTO: 1s + RPO: 1s + } + state Hardware { + RTO: 1s + RPO: 1s + } + state Software { + RTO: 5400s + RPO: 300s + } +``` + +--- + +## 🖼️ Screenshots + +### Resilience Hub + +![Policy](ResilienceHubPolicy.png) +![App](ResiliencyHub-App.png) +![Rec1](ResiliencyHub-App-rec1.png) +![Rec2](ResiliencyHub-App-rec2.png) +![Region View](ResHub-region.png) + +### Infrastructure Diagrams + +| Diagram | Preview | +|---------------------|------------------------------------------| +| Core Infra | ![Infra](cloudformation/template.png) | +| Route 53 DNS | ![Route53](cloudformation/route53.png) | +| Application Firewall| ![WAF](cloudformation/waf.png) | + +--- + +## 📦 Templates + +```bash +cloudformation/ +├─ template.yml # VPC, subnets, Lambdas, API, DynamoDB +├─ route53.yml # DNS failover records +├─ app.yml # Resilience Hub App & Policy +├─ disaster-recovery.yml # FIS experiments +└─ waf.yml # WAFv2 rules +``` + +--- + +## 🛠️ Tech Stack + +- **Infra as Code:** CloudFormation +- **Serverless:** Lambda (Node.js 20.x) +- **API:** API Gateway (Regional, Custom Domain) +- **Storage:** DynamoDB Global Tables +- **Networking:** VPC, Private Subnets, Endpoints, ACLs, SGs +- **DNS:** Route 53 Weighted Failover +- **Resiliency:** AWS Resilience Hub, FIS +- **Security:** WAFv2, IAM Roles & Policies +- **CI/CD:** GitHub Actions, cfn-lint, 
cfn-nag, Checkov, ZAP, Scorecard + +Details: [techstack.md](./techstack.md) + +--- + +## 📖 Runbooks + +- **DynamoDB** – SSM Automation for tables/indexes +- **Lambda** – SSM Automation for functions & aliases +- **App Runner** – Manage App Runner & domains +- **IAM** – Automate IAM user/group/role operations + +--- + +## 🔗 References + +- DR I: https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/ +- DR IV: https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/ +- Resilience Hub: https://docs.aws.amazon.com/resilience-hub/latest/userguide/ +- Route 53 ARC: https://aws.amazon.com/route53/application-recovery-controller/ +- DNS Firewall: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resolver-dns-firewall.html +- SLA Tools: https://github.com/mikaelvesavuori/slamax | https://github.com/mikaelvesavuori/cloud-sla + +--- + +## 📄 License + +This project is licensed under the **Apache License 2.0**. See [LICENSE.md](LICENSE.md). 
From be140d0ab9ff540719fa99652f038c30f1cf9460 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?James=20Pether=20S=C3=B6rling?= Date: Wed, 16 Apr 2025 23:58:51 +0200 Subject: [PATCH 02/10] Update README.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: James Pether Sörling --- README.md | 33 ++++++++++----------------------- 1 file changed, 10 insertions(+), 23 deletions(-) diff --git a/README.md b/README.md index 20e5c28f..628df6a4 100644 --- a/README.md +++ b/README.md @@ -175,32 +175,19 @@ sequenceDiagram --- -## 🛡️ Resilience Hub Policy +### 🛡️ Resilience Hub Policy ```mermaid stateDiagram-v2 - [*] --> Region - Region --> AZ - AZ --> Hardware - Hardware --> Software - Software --> [*] - - state Region { - RTO: 3600s - RPO: 5s - } - state AZ { - RTO: 1s - RPO: 1s - } - state Hardware { - RTO: 1s - RPO: 1s - } - state Software { - RTO: 5400s - RPO: 300s - } + [*] --> Region + Region: RTO = 3600s\nRPO = 5s + Region --> AZ + AZ: RTO = 1s\nRPO = 1s + AZ --> Hardware + Hardware: RTO = 1s\nRPO = 1s + Hardware --> Software + Software: RTO = 5400s\nRPO = 300s + Software --> [*] ``` --- From 764d4bb4fde02b85818cf3211fda7df1cfe2d505 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?James=20Pether=20S=C3=B6rling?= Date: Thu, 17 Apr 2025 00:05:38 +0200 Subject: [PATCH 03/10] Update README.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: James Pether Sörling --- README.md | 123 ++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 100 insertions(+), 23 deletions(-) diff --git a/README.md b/README.md index 628df6a4..3a99466e 100644 --- a/README.md +++ b/README.md @@ -32,33 +32,110 @@ ```mermaid mindmap root((Lambda in Private VPC)) - Infra - VPC - Subnets - Endpoints - ACLs & SGs - Compute - LambdaHealth - LambdaCRUD - API - API_Gateway - CustomDomain - Route53 - Resilience - ResilienceHub - FIS - WAFv2 - CI_CD - Linting - SecurityScans - Deploy - Data - 
DynamoDBGlobal - DeadLetterSNS + Infra((Infrastructure)) + VPC((VPC)) + Subnets((Subnets)) + Endpoints((VPC Endpoints)) + Networking((ACLs & SGs)) + Compute((Compute)) + HealthLambda((Healthcheck Lambda)) + CrudLambda((CRUD Lambda)) + API((API Layer)) + Gateway((API Gateway)) + Domain((Custom Domain)) + DNS((Route 53 Failover)) + Resilience((Resilience & DR)) + ResHub((AWS Resilience Hub)) + RTO_RPO((RTO & RPO Policies)) + HA((High Availability)) + DR((Disaster Recovery)) + DR_Strategies((Recovery Strategies)) + BackupRestore((Backup & Restore)) + PilotLight((Pilot Light)) + WarmStandby((Warm Standby)) + MultiSite((Multi-site Active-Active)) + BCP((Business Continuity Plan)) + Data((Data)) + DynamoDB((Global Table)) + DLQ((Dead‑Letter SNS)) + Security((Security)) + WAF((AWS WAFv2)) + IAM((IAM Roles & Policies)) + NetworkACL((Network ACLs)) + SecurityGroup((Security Groups)) + CI_CD((CI/CD & Scanning)) + Linting((cfn-lint)) + CNag((cfn-nag)) + Checkov((Checkov)) + ZAP((ZAP API Scan)) + Scorecard((OSSF Scorecard)) + Actions((GitHub Actions)) + Docs((Documentation)) + Runbooks((Runbooks)) + DRPlan((DR Plan)) + BCPPlan((BCP Plan)) + TechStack((Tech Stack)) + + classDef root fill:#ffcc00,stroke:#333,stroke-width:2px; + classDef Infra,Compute,API,Resilience,Data,Security,CI_CD,Docs fill:#00ccff,stroke:#333; + classDef DR_Strategies,RTO_RPO,HA,DR,BCP fill:#ff6666,stroke:#333; + classDef VPC,Subnets,Endpoints,Networking fill:#99ee99,stroke:#333; + classDef HealthLambda,CrudLambda fill:#cc99ff,stroke:#333; + classDef Gateway,Domain,DNS fill:#ff99cc,stroke:#333; + classDef DynamoDB,DLQ fill:#ffcc99,stroke:#333; + classDef WAF,IAM,NetworkACL,SecurityGroup fill:#ff9966,stroke:#333; + classDef Linting,CNag,Checkov,ZAP,Scorecard,Actions fill:#99ccff,stroke:#333; + classDef Runbooks,DRPlan,BCPPlan,TechStack fill:#ccccff,stroke:#333; ``` --- +## 🚧 Disaster Recovery Strategies + +This section outlines the four main AWS disaster recovery patterns supported by this project: + 
+```mermaid +flowchart TB + style DR fill:#f9f,stroke:#333,stroke-width:2px + DR[Disaster Recovery Strategies] + + DR --> BR[Backup & Restore] + DR --> PL[Pilot Light] + DR --> WS[Warm Standby] + DR --> MS[Multi-site Active-Active] + + subgraph BR_Info [Backup & Restore] + direction LR + BR1>Data & Snapshots] + BR2>Restore in New Region] + end + + subgraph PL_Info [Pilot Light] + direction LR + PL1>Minimal Infra Always On] + PL2>Scale Up On Demand] + end + + subgraph WS_Info [Warm Standby] + direction LR + WS1>Scaled-Down Prod Copy] + WS2>Instant Scale to Prod] + end + + subgraph MS_Info [Multi-site Active-Active] + direction LR + MS1>Full Production in All Regions] + MS2>Global Load Balancing] + end +``` + +- **Backup & Restore**: Periodic backups of configuration and data; recovery time depends on restore duration. +- **Pilot Light**: Core components running in standby; scale up non-critical services when needed. +- **Warm Standby**: Fully functional but scaled-down duplicate environment; fast failover. +- **Multi-site Active-Active**: Full environments in all regions; automatic global traffic distribution. 
+ +--- + ## 🌐 Overview Run Lambda inside private subnets in two regions, with: From 618f56803cead4325a2843e26d86e19d5f2b285d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?James=20Pether=20S=C3=B6rling?= Date: Thu, 17 Apr 2025 00:08:58 +0200 Subject: [PATCH 04/10] Update README.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: James Pether Sörling --- README.md | 100 ++++++++++++++++++++++++++++-------------------------- 1 file changed, 52 insertions(+), 48 deletions(-) diff --git a/README.md b/README.md index 3a99466e..e54d2fd6 100644 --- a/README.md +++ b/README.md @@ -31,7 +31,7 @@ ```mermaid mindmap - root((Lambda in Private VPC)) + Root((Lambda in Private VPC)) Infra((Infrastructure)) VPC((VPC)) Subnets((Subnets)) @@ -44,27 +44,27 @@ mindmap Gateway((API Gateway)) Domain((Custom Domain)) DNS((Route 53 Failover)) + Data((Data)) + DynamoDB((Global Table)) + DLQ((Dead‑Letter SNS)) + Security((Security)) + WAF((AWS WAFv2)) + IAM((IAM Roles & Policies)) + NetworkACL((Network ACLs)) + SG((Security Groups)) Resilience((Resilience & DR)) ResHub((AWS Resilience Hub)) RTO_RPO((RTO & RPO Policies)) HA((High Availability)) DR((Disaster Recovery)) - DR_Strategies((Recovery Strategies)) + Strategies((Recovery Strategies)) BackupRestore((Backup & Restore)) PilotLight((Pilot Light)) WarmStandby((Warm Standby)) MultiSite((Multi-site Active-Active)) BCP((Business Continuity Plan)) - Data((Data)) - DynamoDB((Global Table)) - DLQ((Dead‑Letter SNS)) - Security((Security)) - WAF((AWS WAFv2)) - IAM((IAM Roles & Policies)) - NetworkACL((Network ACLs)) - SecurityGroup((Security Groups)) CI_CD((CI/CD & Scanning)) - Linting((cfn-lint)) + Lint((cfn-lint)) CNag((cfn-nag)) Checkov((Checkov)) ZAP((ZAP API Scan)) @@ -75,58 +75,62 @@ mindmap DRPlan((DR Plan)) BCPPlan((BCP Plan)) TechStack((Tech Stack)) - - classDef root fill:#ffcc00,stroke:#333,stroke-width:2px; - classDef Infra,Compute,API,Resilience,Data,Security,CI_CD,Docs 
fill:#00ccff,stroke:#333; - classDef DR_Strategies,RTO_RPO,HA,DR,BCP fill:#ff6666,stroke:#333; - classDef VPC,Subnets,Endpoints,Networking fill:#99ee99,stroke:#333; - classDef HealthLambda,CrudLambda fill:#cc99ff,stroke:#333; - classDef Gateway,Domain,DNS fill:#ff99cc,stroke:#333; - classDef DynamoDB,DLQ fill:#ffcc99,stroke:#333; - classDef WAF,IAM,NetworkACL,SecurityGroup fill:#ff9966,stroke:#333; - classDef Linting,CNag,Checkov,ZAP,Scorecard,Actions fill:#99ccff,stroke:#333; - classDef Runbooks,DRPlan,BCPPlan,TechStack fill:#ccccff,stroke:#333; + +classDef Root fill:#ffdd57,stroke:#333,stroke-width:2px; +classDef Infra,Compute,API,Data,Security,Resilience,CI_CD,Docs fill:#88ccff,stroke:#333,stroke-width:1px; +classDef DR,Strategies,BCP fill:#ff6b6b,stroke:#c92a2a,stroke-width:1px; +classDef RTO_RPO,HA fill:#ffa94d,stroke:#e8590c,stroke-width:1px; +classDef VPC,Subnets,Endpoints,Networking fill:#63e6be,stroke:#228be6; +classDef HealthLambda,CrudLambda fill:#b197fc,stroke:#5f3dc4; +classDef Gateway,Domain,DNS fill:#ff8787,stroke:#c2255c; +classDef DynamoDB,DLQ fill:#ffe066,stroke:#f08c00; +classDef WAF,IAM,NetworkACL,SG fill:#fab005,stroke:#b36200; +classDef Lint,CNag,Checkov,ZAP,Scorecard,Actions fill:#74c0fc,stroke:#364fc7; +classDef Runbooks,DRPlan,BCPPlan,TechStack fill:#d0ebff,stroke:#1c7ed6; ``` --- ## 🚧 Disaster Recovery Strategies -This section outlines the four main AWS disaster recovery patterns supported by this project: - ```mermaid flowchart TB - style DR fill:#f9f,stroke:#333,stroke-width:2px - DR[Disaster Recovery Strategies] - - DR --> BR[Backup & Restore] - DR --> PL[Pilot Light] - DR --> WS[Warm Standby] - DR --> MS[Multi-site Active-Active] - - subgraph BR_Info [Backup & Restore] - direction LR - BR1>Data & Snapshots] - BR2>Restore in New Region] + subgraph DR [Disaster Recovery Strategies] + direction TB + BR([Backup & Restore]) + PL([Pilot Light]) + WS([Warm Standby]) + MS([Multi-site Active-Active]) + end + + subgraph BR_info [Backup & 
Restore] + BR1([Periodic backups of data & configs]) + BR2([Restore in alternate region]) end - subgraph PL_Info [Pilot Light] - direction LR - PL1>Minimal Infra Always On] - PL2>Scale Up On Demand] + subgraph PL_info [Pilot Light] + PL1([Minimal core infra always on]) + PL2([Scale up apps on demand]) end - subgraph WS_Info [Warm Standby] - direction LR - WS1>Scaled-Down Prod Copy] - WS2>Instant Scale to Prod] + subgraph WS_info [Warm Standby] + WS1([Scaled-down prod copy]) + WS2([Instant scale to full capacity]) end - subgraph MS_Info [Multi-site Active-Active] - direction LR - MS1>Full Production in All Regions] - MS2>Global Load Balancing] + subgraph MS_info [Multi-site Active-Active] + MS1([Full production in each region]) + MS2([Global load balancing]) end + + DR --> BR --> BR_info + DR --> PL --> PL_info + DR --> WS --> WS_info + DR --> MS --> MS_info + + classDef DR fill:#fa5252,stroke:#c92a2a,stroke-width:2px; + classDef BR,PL,WS,MS fill:#ff922b,stroke:#b94500,stroke-width:1px; + classDef BR1,BR2,PL1,PL2,WS1,WS2,MS1,MS2 fill:#ffd43b,stroke:#b48c06; ``` - **Backup & Restore**: Periodic backups of configuration and data; recovery time depends on restore duration. 
From 1c7776c4d549333a1f8f31dd66dfef1586dc377e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?James=20Pether=20S=C3=B6rling?= Date: Thu, 17 Apr 2025 00:12:24 +0200 Subject: [PATCH 05/10] Update README.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: James Pether Sörling --- README.md | 115 ++++++++++++++++++++++++++++++++---------------------- 1 file changed, 68 insertions(+), 47 deletions(-) diff --git a/README.md b/README.md index e54d2fd6..d18a27a5 100644 --- a/README.md +++ b/README.md @@ -31,64 +31,85 @@ ```mermaid mindmap - Root((Lambda in Private VPC)) - Infra((Infrastructure)) + root((Lambda in Private VPC)) + Infrastructure((Infrastructure)) VPC((VPC)) Subnets((Subnets)) Endpoints((VPC Endpoints)) - Networking((ACLs & SGs)) + ACLs_SGs((ACLs & SGs)) + FlowLogs((Flow Logs)) Compute((Compute)) - HealthLambda((Healthcheck Lambda)) + HealthLambda((Health Lambda)) CrudLambda((CRUD Lambda)) - API((API Layer)) - Gateway((API Gateway)) - Domain((Custom Domain)) - DNS((Route 53 Failover)) + Concurrency((Reserved Concurrency)) + API_DNS((API & DNS)) + APIGateway((API Gateway)) + CustomDomain((Custom Domain)) + Route53((Route 53 Failover)) + HealthChecks((Health Checks)) Data((Data)) DynamoDB((Global Table)) - DLQ((Dead‑Letter SNS)) + DeadLetter((DeadLetter SNS)) + BackupRestore((Backup & Restore)) Security((Security)) - WAF((AWS WAFv2)) - IAM((IAM Roles & Policies)) - NetworkACL((Network ACLs)) - SG((Security Groups)) - Resilience((Resilience & DR)) - ResHub((AWS Resilience Hub)) - RTO_RPO((RTO & RPO Policies)) + WAF((WAFv2)) + IAM((IAM Roles)) + KMS((KMS Encryption)) + Scans((Security Scans)) + CFNLint((cfn‑lint)) + CFNNag((cfn‑nag)) + Checkov((Checkov)) + ZAP((ZAP API Scan)) + Scorecard((OSSF Scorecard)) + Resilience_DR((Resilience & DR)) + ResHub((Resilience Hub)) + RTO_RPO((RTO & RPO)) HA((High Availability)) - DR((Disaster Recovery)) - Strategies((Recovery Strategies)) - BackupRestore((Backup & Restore)) - 
PilotLight((Pilot Light)) - WarmStandby((Warm Standby)) - MultiSite((Multi-site Active-Active)) - BCP((Business Continuity Plan)) - CI_CD((CI/CD & Scanning)) - Lint((cfn-lint)) - CNag((cfn-nag)) - Checkov((Checkov)) - ZAP((ZAP API Scan)) - Scorecard((OSSF Scorecard)) - Actions((GitHub Actions)) - Docs((Documentation)) + DRStrategies((DR Strategies)) + BR((Backup & Restore)) + PL((Pilot Light)) + WS((Warm Standby)) + MS((Multi‑site Active‑Active)) + BCP((BCP Plan)) + CI_CD((CI/CD)) + Linting((Linting)) + SecurityScans((Security Scans)) + Deploy((Deploy Workflows)) + Ireland((Ireland)) + Frankfurt((Frankfurt)) + AuxStacks((Route53/DR/ResHub)) + Release((Tag & Release)) + Observability((Observability)) + CWLogs((CloudWatch Logs)) + Alarms((Alarms)) + XRay((X‑Ray)) + Documentation((Documentation)) + Readme((README.md)) Runbooks((Runbooks)) - DRPlan((DR Plan)) - BCPPlan((BCP Plan)) - TechStack((Tech Stack)) - -classDef Root fill:#ffdd57,stroke:#333,stroke-width:2px; -classDef Infra,Compute,API,Data,Security,Resilience,CI_CD,Docs fill:#88ccff,stroke:#333,stroke-width:1px; -classDef DR,Strategies,BCP fill:#ff6b6b,stroke:#c92a2a,stroke-width:1px; -classDef RTO_RPO,HA fill:#ffa94d,stroke:#e8590c,stroke-width:1px; -classDef VPC,Subnets,Endpoints,Networking fill:#63e6be,stroke:#228be6; -classDef HealthLambda,CrudLambda fill:#b197fc,stroke:#5f3dc4; -classDef Gateway,Domain,DNS fill:#ff8787,stroke:#c2255c; -classDef DynamoDB,DLQ fill:#ffe066,stroke:#f08c00; -classDef WAF,IAM,NetworkACL,SG fill:#fab005,stroke:#b36200; -classDef Lint,CNag,Checkov,ZAP,Scorecard,Actions fill:#74c0fc,stroke:#364fc7; -classDef Runbooks,DRPlan,BCPPlan,TechStack fill:#d0ebff,stroke:#1c7ed6; + Contributing((CONTRIBUTING.md)) + Changelog((CHANGELOG.md)) + TechStack((techstack.md)) + ArchDocs((Architecture Docs)) + TeamRoles((Team & Roles)) + Maintainer((Maintainers)) + Contributor((Contributors)) + Reviewer((Reviewers)) + IncidentCommander((Incident Commander)) + CostBudget((Cost & Budget)) + 
AWSCosts((AWS Costs)) + DataTransfer((Data Transfer)) + LambdaTime((Lambda Exec Time)) + CFNBudget((CFN Budget Alerts)) + Notifications((Notifications)) + SNSTopics((SNS Topics)) + EmailSubs((Email Subscriptions)) + Slack((Slack Integration)) + +classDef root fill:#ffeb3b,stroke:#333,stroke-width:3px; +classDef Infrastructure,Compute,API_DNS,Data,Security,Resilience_DR,CI_CD,Observability,Documentation,TeamRoles,CostBudget,Notifications fill:#90caf9,stroke:#333,stroke-width:1px; +classDef DRStrategies,BR,PL,WS,MS fill:#e57373,stroke:#c62828,stroke-width:1px; +classDef Scans,CFNLint,CFNNag,Checkov,ZAP,Scorecard fill:#ffb74d,stroke:#e65100,stroke-width:1px; ``` - --- ## 🚧 Disaster Recovery Strategies From 1648d9fae57ca192267c1167621531c09ba32209 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?James=20Pether=20S=C3=B6rling?= Date: Thu, 17 Apr 2025 00:15:41 +0200 Subject: [PATCH 06/10] Update README.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: James Pether Sörling --- README.md | 444 +++++++++++++++++++++++------------------------------- 1 file changed, 192 insertions(+), 252 deletions(-) diff --git a/README.md b/README.md index d18a27a5..72d2bc9f 100644 --- a/README.md +++ b/README.md @@ -1,25 +1,24 @@ # 🚀 Lambda in Private VPC -![License](https://img.shields.io/github/license/Hack23/lambda-in-private-vpc.svg) +[![License](https://img.shields.io/github/license/Hack23/lambda-in-private-vpc.svg)](LICENSE.md) [![OpenSSF Scorecard](https://api.securityscorecards.dev/projects/github.com/Hack23/lambda-in-private-vpc/badge)](https://securityscorecards.dev/viewer/?uri=github.com/Hack23/lambda-in-private-vpc) [![CI/CD](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/main.yml/badge.svg)](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/main.yml) [![Scorecard 
Security](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/scorecard.yml/badge.svg?branch=main)](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/scorecard.yml) -> **Description:** A highly available system spanning **Ireland** and **Frankfurt** AWS regions, enforcing RTO/RPO via AWS Resilience Hub, chaos‑tested by FIS, and fronted by API Gateway with Route 53 failover & WAF protection. +> **Description:** Run AWS Lambda in private subnets across Ireland and Frankfurt, with multi‑region high availability, RTO/RPO enforcement, chaos testing, DNS failover, and WAF protection. --- ## 📋 Table of Contents - [🧠 Project Mindmap](#-project-mindmap) -- [🌐 Overview](#-overview) - [📐 Architecture](#-architecture) - [🔗 Network Topology](#-network-topology) - [🚦 CI/CD Workflow](#-ci-cd-workflow) -- [🚨 Disaster Recovery Flow](#-disaster-recovery-flow) +- [🚨 Disaster Recovery Strategies](#-disaster-recovery-strategies) +- [🔒 Business Continuity Plan](#-business-continuity-plan) - [🛡️ Resilience Hub Policy](#️-resilience-hub-policy) -- [🖼️ Screenshots](#️-screenshots) -- [📦 Templates](#-templates) +- [📦 CloudFormation Templates](#-cloudformation-templates) - [🛠️ Tech Stack](#️-tech-stack) - [📖 Runbooks](#-runbooks) - [🔗 References](#-references) @@ -33,143 +32,61 @@ mindmap root((Lambda in Private VPC)) Infrastructure((Infrastructure)) - VPC((VPC)) - Subnets((Subnets)) - Endpoints((VPC Endpoints)) - ACLs_SGs((ACLs & SGs)) - FlowLogs((Flow Logs)) + VPC + Subnets + Endpoints + ACLs & SGs + FlowLogs Compute((Compute)) - HealthLambda((Health Lambda)) - CrudLambda((CRUD Lambda)) - Concurrency((Reserved Concurrency)) - API_DNS((API & DNS)) - APIGateway((API Gateway)) - CustomDomain((Custom Domain)) - Route53((Route 53 Failover)) - HealthChecks((Health Checks)) + HealthLambda + CrudLambda + API & DNS((API & DNS)) + APIGateway + CustomDomain + Route53 + HealthChecks Data((Data)) - DynamoDB((Global Table)) - DeadLetter((DeadLetter SNS)) - 
BackupRestore((Backup & Restore)) + DynamoDB_GlobalTable + DeadLetter_SNS Security((Security)) - WAF((WAFv2)) - IAM((IAM Roles)) - KMS((KMS Encryption)) - Scans((Security Scans)) - CFNLint((cfn‑lint)) - CFNNag((cfn‑nag)) - Checkov((Checkov)) - ZAP((ZAP API Scan)) - Scorecard((OSSF Scorecard)) + WAFv2 + IAM_Roles + KMS + Scans + CFN_lint + CFN_nag + Checkov + ZAP + Scorecard Resilience_DR((Resilience & DR)) - ResHub((Resilience Hub)) - RTO_RPO((RTO & RPO)) - HA((High Availability)) - DRStrategies((DR Strategies)) - BR((Backup & Restore)) - PL((Pilot Light)) - WS((Warm Standby)) - MS((Multi‑site Active‑Active)) - BCP((BCP Plan)) + ResHub + RTO_RPO + HA + DR_Strategies + BackupRestore + PilotLight + WarmStandby + MultiSite + BCP CI_CD((CI/CD)) - Linting((Linting)) - SecurityScans((Security Scans)) - Deploy((Deploy Workflows)) - Ireland((Ireland)) - Frankfurt((Frankfurt)) - AuxStacks((Route53/DR/ResHub)) - Release((Tag & Release)) + Linting + SecurityScans + Deploy + Ireland + Frankfurt + AuxStacks + Release Observability((Observability)) - CWLogs((CloudWatch Logs)) - Alarms((Alarms)) - XRay((X‑Ray)) - Documentation((Documentation)) - Readme((README.md)) - Runbooks((Runbooks)) - Contributing((CONTRIBUTING.md)) - Changelog((CHANGELOG.md)) - TechStack((techstack.md)) - ArchDocs((Architecture Docs)) - TeamRoles((Team & Roles)) - Maintainer((Maintainers)) - Contributor((Contributors)) - Reviewer((Reviewers)) - IncidentCommander((Incident Commander)) - CostBudget((Cost & Budget)) - AWSCosts((AWS Costs)) - DataTransfer((Data Transfer)) - LambdaTime((Lambda Exec Time)) - CFNBudget((CFN Budget Alerts)) - Notifications((Notifications)) - SNSTopics((SNS Topics)) - EmailSubs((Email Subscriptions)) - Slack((Slack Integration)) - -classDef root fill:#ffeb3b,stroke:#333,stroke-width:3px; -classDef Infrastructure,Compute,API_DNS,Data,Security,Resilience_DR,CI_CD,Observability,Documentation,TeamRoles,CostBudget,Notifications fill:#90caf9,stroke:#333,stroke-width:1px; -classDef 
DRStrategies,BR,PL,WS,MS fill:#e57373,stroke:#c62828,stroke-width:1px; -classDef Scans,CFNLint,CFNNag,Checkov,ZAP,Scorecard fill:#ffb74d,stroke:#e65100,stroke-width:1px; + CW_Logs + Alarms + XRay + Documentation((Docs)) + README + Runbooks + Contributing + Changelog + TechStack ``` ---- - -## 🚧 Disaster Recovery Strategies - -```mermaid -flowchart TB - subgraph DR [Disaster Recovery Strategies] - direction TB - BR([Backup & Restore]) - PL([Pilot Light]) - WS([Warm Standby]) - MS([Multi-site Active-Active]) - end - - subgraph BR_info [Backup & Restore] - BR1([Periodic backups of data & configs]) - BR2([Restore in alternate region]) - end - - subgraph PL_info [Pilot Light] - PL1([Minimal core infra always on]) - PL2([Scale up apps on demand]) - end - - subgraph WS_info [Warm Standby] - WS1([Scaled-down prod copy]) - WS2([Instant scale to full capacity]) - end - - subgraph MS_info [Multi-site Active-Active] - MS1([Full production in each region]) - MS2([Global load balancing]) - end - - DR --> BR --> BR_info - DR --> PL --> PL_info - DR --> WS --> WS_info - DR --> MS --> MS_info - - classDef DR fill:#fa5252,stroke:#c92a2a,stroke-width:2px; - classDef BR,PL,WS,MS fill:#ff922b,stroke:#b94500,stroke-width:1px; - classDef BR1,BR2,PL1,PL2,WS1,WS2,MS1,MS2 fill:#ffd43b,stroke:#b48c06; -``` - -- **Backup & Restore**: Periodic backups of configuration and data; recovery time depends on restore duration. -- **Pilot Light**: Core components running in standby; scale up non-critical services when needed. -- **Warm Standby**: Fully functional but scaled-down duplicate environment; fast failover. -- **Multi-site Active-Active**: Full environments in all regions; automatic global traffic distribution. 
- ---- - -## 🌐 Overview - -Run Lambda inside private subnets in two regions, with: - -- **No public access**: VPC Endpoints for S3, EC2, DynamoDB -- **Multi-region failover**: Route 53 weighted DNS -- **Resiliency**: AWS Resilience Hub policies & AWS FIS chaos -- **Layer7 Security**: AWS WAFv2 rules -- **CI/CD**: GitHub Actions with CFN lint, cfn-nag, Checkov, ZAP, scoring, and cross‑account deploys --- @@ -177,31 +94,28 @@ Run Lambda inside private subnets in two regions, with: ```mermaid flowchart LR - subgraph Ireland [eu-west-1] - VPC1[VPC: 10.1.0.0/16] - Subs1["Private Subnets A/B/C"] - EP1["S3/EC2/DDB Endpoints"] - Lambdas1["Lambda (Health & CRUD)"] - APIGW1["API Gateway"] + subgraph IR [Ireland (eu-west-1)] + V1[VPC 10.1.0.0/16] + S1[Subnets A/B/C] + EP1[Endpoints: S3/EC2/DDB] + L1[Lambdas] + G1[API Gateway] end - - subgraph Frankfurt [eu-central-1] - VPC2[VPC: 10.5.0.0/16] - Subs2["Private Subnets A/B/C"] - EP2["S3/EC2/DDB Endpoints"] - Lambdas2["Lambda (Health & CRUD)"] - APIGW2["API Gateway"] + subgraph FR [Frankfurt (eu-central-1)] + V2[VPC 10.5.0.0/16] + S2[Subnets A/B/C] + EP2[Endpoints: S3/EC2/DDB] + L2[Lambdas] + G2[API Gateway] end - - Ireland --> Subs1 --> Lambdas1 --> EP1 - Subs1 --> APIGW1 - Frankfurt --> Subs2 --> Lambdas2 --> EP2 - Subs2 --> APIGW2 - - APIGW1 -. Failover .-> Route53 - APIGW2 -. Failover .-> Route53 - classDef region fill:#f9f,stroke:#333,stroke-width:1px; - class Ireland,Frankfurt region + V1 --> S1 --> L1 --> EP1 + S1 --> G1 + V2 --> S2 --> L2 --> EP2 + S2 --> G2 + G1 -. Failover .-> DNS[Route 53] + G2 -. Failover .-> DNS + classDef region fill:#e0f7fa,stroke:#006064; + class IR,FR region ``` --- @@ -210,21 +124,19 @@ flowchart LR ```mermaid graph TD - VPC[VPC
10.1.0.0/16] - subgraph Subnets["Private Subnets"] + VPC[VPC: 10.1.0.0/16] + subgraph Private_Subnets S1[10.1.0.0/24] S2[10.1.1.0/24] S3[10.1.2.0/24] end - VPC --> Subnets - Subnets --> ACL[Network ACL] - Subnets --> SG_L[Lambda SG] - subgraph Endpoints["VPC Endpoints"] - EP_S3[S3] - EP_EC2[EC2] - EP_DDB[DynamoDB] - end - Subnets --> Endpoints + VPC --> Private_Subnets + Private_Subnets --> ACL[Network ACL] + Private_Subnets --> SG[Security Groups] + Private_Subnets --> Endpoints{VPC Endpoints} + Endpoints --> EP_S3[S3] + Endpoints --> EP_EC2[EC2] + Endpoints --> EP_DDB[DynamoDB] ``` --- @@ -233,136 +145,164 @@ graph TD ```mermaid flowchart TD - A[Push or Dispatch] --> B{Lint & Security} + A[Dispatch/Push] --> B{Lint & Scan} B --> C[cfn-lint] B --> D[cfn-nag] B --> E[Checkov] B --> F[StandardLint] - F --> G[DependencyReview] - E --> H[Scorecard] - C --> I[ZAP API Scan] - H --> J{Deploy Jobs} - J --> K[Deploy Ireland] - K --> L[Collect Outputs] - L --> M[Deploy Frankfurt] - M --> N[Route53 Stack] - N --> O[DR Stack] - O --> P[ResilienceHub App] - P --> Q[Tag & Release] + B --> G[Scorecard] + G --> H[ZAP API Scan] + H --> I[Configure AWS creds (eu-west-1)] + I --> J[Deploy Core → Ireland] + J --> K[Collect Outputs] + K --> L[Configure AWS creds (eu-central-1)] + L --> M[Deploy Core → Frankfurt] + M --> N[Aux Stacks → Route 53/DR/ResHub] + N --> O[Tag & Release] ``` --- -## 🚨 Disaster Recovery Flow +## 🚨 Disaster Recovery Strategies ```mermaid -sequenceDiagram - participant U as User - participant R53 as Route53 - participant GW as API_Gateway - participant L as Lambda - participant D as DynamoDB - U->>R53: GET /v1/healthcheck - R53->>GW: Route to region - GW->>L: Invoke healthcheck - L-->>U: "OK" - Note over FIS: Inject failures - FIS->>GW: Deny invoke - FIS->>D: Delete table - alt Recovery - R53->>GW: Failover to backup - GW->>L: Invoke fallback +flowchart TB + DR[Disaster Recovery Patterns] + DR --> BR[Backup & Restore] + DR --> PL[Pilot Light] + DR --> WS[Warm 
Standby] + DR --> MS[Multi‑site Active‑Active] + + subgraph Info1 [Backup & Restore] + BR1[Backups & Snapshots] + BR2[Restore in Alt Region] + end + subgraph Info2 [Pilot Light] + PL1[Core Infra On] + PL2[Scale-on-Demand] + end + subgraph Info3 [Warm Standby] + WS1[Scaled-Down Prod] + WS2[Instant Scale] + end + subgraph Info4 [Multi‑site] + MS1[Full Prod Everywhere] + MS2[Global LB] end + DR --> Info1 & Info2 & Info3 & Info4 +``` + +--- + +## 🔒 Business Continuity Plan + +```mermaid +mindmap + root((BCP Plan)) + ImpactAnalysis + Financial + Operational + Reputational + Regulatory + RecoveryObjectives + RTO + RPO + MTTR + Uptime + Strategies + BackupRestore + PilotLight + WarmStandby + MultiSite + Communication + Stakeholders + Channels + Templates + Testing + Quarterly + SemiAnnual + Annual +``` + +```mermaid +timeline + title RTO/RPO & Uptime Targets + section RTO (Recovery Time) + Infra & API : 1h + Core Lambdas : 2h + Data Services : 4h + section RPO (Data Loss) + Transient Logs : 5m + User Data : 15m + Config & State : 1h + section Uptime + App Endpoint : 99.9% + DNS Failover : 99.99% + Monitoring : 24/7 ``` --- -### 🛡️ Resilience Hub Policy +## 🛡️ Resilience Hub Policy ```mermaid stateDiagram-v2 [*] --> Region - Region: RTO = 3600s\nRPO = 5s + Region: RTO=3600s\nRPO=5s Region --> AZ - AZ: RTO = 1s\nRPO = 1s + AZ: RTO=1s\nRPO=1s AZ --> Hardware - Hardware: RTO = 1s\nRPO = 1s + Hardware: RTO=1s\nRPO=1s Hardware --> Software - Software: RTO = 5400s\nRPO = 300s + Software: RTO=5400s\nRPO=300s Software --> [*] ``` --- -## 🖼️ Screenshots - -### Resilience Hub - -![Policy](ResilienceHubPolicy.png) -![App](ResiliencyHub-App.png) -![Rec1](ResiliencyHub-App-rec1.png) -![Rec2](ResiliencyHub-App-rec2.png) -![Region View](ResHub-region.png) - -### Infrastructure Diagrams - -| Diagram | Preview | -|---------------------|------------------------------------------| -| Core Infra | ![Infra](cloudformation/template.png) | -| Route 53 DNS | ![Route53](cloudformation/route53.png) 
| -| Application Firewall| ![WAF](cloudformation/waf.png) | - ---- - -## 📦 Templates +## 📦 CloudFormation Templates -```bash -cloudformation/ -├─ template.yml # VPC, subnets, Lambdas, API, DynamoDB -├─ route53.yml # DNS failover records -├─ app.yml # Resilience Hub App & Policy -├─ disaster-recovery.yml # FIS experiments -└─ waf.yml # WAFv2 rules -``` +- **template.yml** – VPC, subnets, endpoints, Lambdas, API, DynamoDB +- **route53.yml** – Route 53 weighted/failover records +- **app.yml** – AWS Resilience Hub App & Policy +- **disaster-recovery.yml** – AWS FIS experiments +- **waf.yml** – AWS WAFv2 WebACL --- ## 🛠️ Tech Stack -- **Infra as Code:** CloudFormation -- **Serverless:** Lambda (Node.js 20.x) -- **API:** API Gateway (Regional, Custom Domain) -- **Storage:** DynamoDB Global Tables -- **Networking:** VPC, Private Subnets, Endpoints, ACLs, SGs -- **DNS:** Route 53 Weighted Failover -- **Resiliency:** AWS Resilience Hub, FIS -- **Security:** WAFv2, IAM Roles & Policies -- **CI/CD:** GitHub Actions, cfn-lint, cfn-nag, Checkov, ZAP, Scorecard +- AWS CloudFormation +- AWS Lambda (Node.js 20.x) +- Amazon API Gateway (Regional) +- DynamoDB Global Tables +- VPC Endpoints & Security +- AWS Resilience Hub & FIS +- AWS WAFv2 & IAM +- GitHub Actions, cfn-lint, cfn-nag, Checkov, ZAP, Scorecard -Details: [techstack.md](./techstack.md) +Details: [techstack.md](techstack.md) --- ## 📖 Runbooks -- **DynamoDB** – SSM Automation for tables/indexes -- **Lambda** – SSM Automation for functions & aliases -- **App Runner** – Manage App Runner & domains -- **IAM** – Automate IAM user/group/role operations +- **DynamoDB** – AWS Systems Manager +- **Lambda** – AWS Systems Manager +- **App Runner** – AWS App Runner +- **IAM** – AWS IAM Automation --- ## 🔗 References +- AWS Resilience Hub: https://docs.aws.amazon.com/resilience-hub/latest/userguide/ - DR I: https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/ - 
DR IV: https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/ -- Resilience Hub: https://docs.aws.amazon.com/resilience-hub/latest/userguide/ -- Route 53 ARC: https://aws.amazon.com/route53/application-recovery-controller/ -- DNS Firewall: https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resolver-dns-firewall.html/ -- SLA Tools: https://github.com/mikaelvesavuori/slamax | https://github.com/mikaelvesavuori/cloud-sla +- AWS SLA: https://aws.amazon.com/legal/service-level-agreements/ --- ## 📄 License -This project is licensed under the **Apache License 2.0**. See [LICENSE.md](LICENSE.md). +Apache License 2.0 – see [LICENSE.md](LICENSE.md) From 1dfdb8fa1f0f01cae8e721f5fe7f7c252d31e67b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?James=20Pether=20S=C3=B6rling?= Date: Thu, 17 Apr 2025 00:41:29 +0200 Subject: [PATCH 07/10] Update README.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: James Pether Sörling --- README.md | 594 +++++++++++++++++++++++++++++++++++++++--------------- 1 file changed, 426 insertions(+), 168 deletions(-) diff --git a/README.md b/README.md index 72d2bc9f..f634ee0e 100644 --- a/README.md +++ b/README.md @@ -5,50 +5,48 @@ [![CI/CD](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/main.yml/badge.svg)](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/main.yml) [![Scorecard Security](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/scorecard.yml/badge.svg?branch=main)](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/scorecard.yml) -> **Description:** Run AWS Lambda in private subnets across Ireland and Frankfurt, with multi‑region high availability, RTO/RPO enforcement, chaos testing, DNS failover, and WAF protection. 
- ---- +> **Enterprise-grade multi‑region active/active architecture** with automated failover, comprehensive disaster recovery, and strict RTO/RPO enforcement for mission-critical applications. ## 📋 Table of Contents -- [🧠 Project Mindmap](#-project-mindmap) -- [📐 Architecture](#-architecture) -- [🔗 Network Topology](#-network-topology) -- [🚦 CI/CD Workflow](#-ci-cd-workflow) -- [🚨 Disaster Recovery Strategies](#-disaster-recovery-strategies) -- [🔒 Business Continuity Plan](#-business-continuity-plan) -- [🛡️ Resilience Hub Policy](#️-resilience-hub-policy) -- [📦 CloudFormation Templates](#-cloudformation-templates) -- [🛠️ Tech Stack](#️-tech-stack) -- [📖 Runbooks](#-runbooks) -- [🔗 References](#-references) +- [🧠 Project Overview](#-project-overview) +- [📐 Architecture](#-architecture) +- [🔗 Network Topology](#-network-topology) +- [🚦 CI/CD Pipeline](#-cicd-pipeline) +- [🚨 Disaster Recovery Framework](#-disaster-recovery-framework) +- [⏱️ Business Continuity Planning](#️-business-continuity-planning) +- [🔒 Security & Compliance](#-security--compliance) +- [📦 Infrastructure as Code](#-infrastructure-as-code) +- [🛠️ Tech Stack](#️-tech-stack) +- [📖 Runbooks](#-runbooks) +- [🔗 References](#-references) - [📄 License](#-license) ---- +## 🧠 Project Overview -## 🧠 Project Mindmap +This project implements a highly resilient, secure serverless architecture using AWS Lambda in private VPCs across multiple regions. It's designed for enterprise-grade applications requiring stringent security, high availability, and disaster recovery capabilities. 
```mermaid mindmap root((Lambda in Private VPC)) - Infrastructure((Infrastructure)) + Infrastructure VPC Subnets Endpoints ACLs & SGs FlowLogs - Compute((Compute)) + Compute HealthLambda CrudLambda - API & DNS((API & DNS)) + API & DNS APIGateway CustomDomain Route53 HealthChecks - Data((Data)) + Data DynamoDB_GlobalTable DeadLetter_SNS - Security((Security)) + Security WAFv2 IAM_Roles KMS @@ -58,7 +56,7 @@ mindmap Checkov ZAP Scorecard - Resilience_DR((Resilience & DR)) + Resilience_DR ResHub RTO_RPO HA @@ -68,7 +66,7 @@ mindmap WarmStandby MultiSite BCP - CI_CD((CI/CD)) + CI_CD Linting SecurityScans Deploy @@ -76,154 +74,308 @@ mindmap Frankfurt AuxStacks Release - Observability((Observability)) + Observability CW_Logs Alarms XRay - Documentation((Docs)) + Documentation README Runbooks - Contributing - Changelog TechStack ``` ---- - ## 📐 Architecture +The architecture implements a multi-region active/active design with isolated private VPCs, comprehensive security controls, and automated failover capabilities. + ```mermaid -flowchart LR - subgraph IR [Ireland (eu-west-1)] - V1[VPC 10.1.0.0/16] - S1[Subnets A/B/C] - EP1[Endpoints: S3/EC2/DDB] - L1[Lambdas] - G1[API Gateway] - end - subgraph FR [Frankfurt (eu-central-1)] - V2[VPC 10.5.0.0/16] - S2[Subnets A/B/C] - EP2[Endpoints: S3/EC2/DDB] - L2[Lambdas] - G2[API Gateway] - end - V1 --> S1 --> L1 --> EP1 - S1 --> G1 - V2 --> S2 --> L2 --> EP2 - S2 --> G2 - G1 -. Failover .-> DNS[Route 53] - G2 -. 
Failover .-> DNS - classDef region fill:#e0f7fa,stroke:#006064; - class IR,FR region +flowchart TB + subgraph "Ireland (eu-west-1)" + subgraph "Ireland VPC (10.1.0.0/16)" + IPS[Private Subnets] + IEP[VPC Endpoints] + ISG[Security Groups] + + IPS --> IEP + IPS --> ISG + end + + IL[Lambda Functions] --> IPS + IAPI[API Gateway] --> IL + IDT[DynamoDB Table] --> IL + IL --> IDT + end + + subgraph "Frankfurt (eu-central-1)" + subgraph "Frankfurt VPC (10.5.0.0/16)" + FPS[Private Subnets] + FEP[VPC Endpoints] + FSG[Security Groups] + + FPS --> FEP + FPS --> FSG + end + + FL[Lambda Functions] --> FPS + FAPI[API Gateway] --> FL + FDT[DynamoDB Table] --> FL + FL --> FDT + end + + DR[Route 53] --> IAPI + DR --> FAPI + + IDT <-.-> FDT + + WAF[WAF] --> IAPI + WAF --> FAPI + + subgraph "Monitoring & Resilience" + CW[CloudWatch] + AH[AWS Resilience Hub] + XR[X-Ray] + + CW --> IL + CW --> FL + AH --> IL + AH --> FL + XR --> IL + XR --> FL + end + + classDef primary fill:#e1f5fe,stroke:#0277bd,stroke-width:2px; + classDef secondary fill:#fff8e1,stroke:#ff8f00,stroke-width:2px; + classDef security fill:#ffebee,stroke:#c62828,stroke-width:2px; + classDef resilience fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px; + + class IPS,FPS,IEP,FEP,ISG,FSG primary; + class IL,FL,IAPI,FAPI,IDT,FDT secondary; + class WAF,DR security; + class CW,AH,XR resilience; ``` ---- +### Key Architecture Components + +- **Multi-Region Deployment**: Active/active setup in Ireland (eu-west-1) and Frankfurt (eu-central-1) +- **VPC Isolation**: Private subnets with no internet access for enhanced security +- **VPC Endpoints**: Secure AWS service access without internet exposure +- **Global Data Replication**: DynamoDB global tables with multi-region consistency +- **Intelligent Routing**: Route 53 health checks with automated failover +- **Identity & Access**: Fine-grained IAM roles following least privilege principle +- **Application Protection**: WAFv2 rules to protect API endpoints +- **Comprehensive 
Monitoring**: CloudWatch, X-Ray, and custom health checks ## 🔗 Network Topology +Each region implements a secure network topology with private subnets, strict network controls, and comprehensive logging. + ```mermaid -graph TD - VPC[VPC: 10.1.0.0/16] - subgraph Private_Subnets - S1[10.1.0.0/24] - S2[10.1.1.0/24] - S3[10.1.2.0/24] - end - VPC --> Private_Subnets - Private_Subnets --> ACL[Network ACL] - Private_Subnets --> SG[Security Groups] - Private_Subnets --> Endpoints{VPC Endpoints} - Endpoints --> EP_S3[S3] - Endpoints --> EP_EC2[EC2] - Endpoints --> EP_DDB[DynamoDB] +graph TB + subgraph "VPC Architecture" + VPC["VPC (10.1.0.0/16)"] --> PS["Private Subnets"] + + subgraph "Private Subnets" + PS1["Subnet 1 (10.1.0.0/24)"] + PS2["Subnet 2 (10.1.1.0/24)"] + PS3["Subnet 3 (10.1.2.0/24)"] + end + + PS --> ACL["Network ACLs"] + PS --> SG["Security Groups"] + PS --> EP["VPC Endpoints"] + + subgraph "VPC Endpoints" + S3["S3 Gateway"] + DDB["DynamoDB Gateway"] + EC2["EC2 Interface"] + CW["CloudWatch Interface"] + SSM["SSM Interface"] + KMS["KMS Interface"] + end + + FL["Flow Logs"] --> VPC + end + + Lambda["Lambda Functions"] --> PS + Lambda --> SG + Lambda --> EP + + classDef vpc fill:#e3f2fd,stroke:#1565c0,stroke-width:2px; + classDef network fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px; + classDef security fill:#ffebee,stroke:#c62828,stroke-width:2px; + classDef compute fill:#fff8e1,stroke:#ff8f00,stroke-width:2px; + + class VPC,PS,PS1,PS2,PS3 vpc; + class ACL,SG,FL network; + class EP,S3,DDB,EC2,CW,SSM,KMS security; + class Lambda compute; ``` ---- +### Network Security Features + +| Feature | Implementation | Purpose | +|---------|---------------|---------| +| **Private Subnets** | 3 AZs per region | Isolate compute resources from internet | +| **Security Groups** | Stateful, fine-grained | Control traffic at instance level | +| **Network ACLs** | Stateless, subnet-level | Additional layer of network security | +| **VPC Endpoints** | Gateway and Interface | Secure 
AWS service access | +| **Flow Logs** | VPC, subnet, and ENI levels | Network traffic visibility and auditing | +| **Transit Encryption** | TLS for all traffic | Data protection in transit | -## 🚦 CI/CD Workflow +## 🚦 CI/CD Pipeline + +The project implements a comprehensive CI/CD pipeline with security scanning, multi-region deployment, and automated verification. ```mermaid -flowchart TD - A[Dispatch/Push] --> B{Lint & Scan} - B --> C[cfn-lint] - B --> D[cfn-nag] - B --> E[Checkov] - B --> F[StandardLint] - B --> G[Scorecard] - G --> H[ZAP API Scan] - H --> I[Configure AWS creds (eu-west-1)] - I --> J[Deploy Core → Ireland] - J --> K[Collect Outputs] - K --> L[Configure AWS creds (eu-central-1)] - L --> M[Deploy Core → Frankfurt] - M --> N[Aux Stacks → Route 53/DR/ResHub] - N --> O[Tag & Release] +flowchart TB + start[Code Push/Dispatch] --> lint{Security Scanning} + + lint --> cfn["cfn-lint"] + lint --> nag["cfn-nag"] + lint --> chk["Checkov"] + lint --> score["Scorecard"] + lint --> zap["ZAP API Scan"] + + cfn --> configure_irl["Configure AWS (eu-west-1)"] + nag --> configure_irl + chk --> configure_irl + score --> configure_irl + zap --> configure_irl + + configure_irl --> deploy_irl["Deploy Core → Ireland"] + deploy_irl --> outputs["Collect Outputs"] + outputs --> configure_fra["Configure AWS (eu-central-1)"] + configure_fra --> deploy_fra["Deploy Core → Frankfurt"] + deploy_fra --> deploy_aux["Deploy Auxiliary Stacks"] + deploy_aux --> release["Tag & Release"] + + classDef start fill:#e3f2fd,stroke:#1565c0,stroke-width:2px; + classDef security fill:#ffebee,stroke:#c62828,stroke-width:2px; + classDef deploy fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px; + classDef release fill:#fff8e1,stroke:#ff8f00,stroke-width:2px; + + class start start; + class lint,cfn,nag,chk,score,zap security; + class configure_irl,deploy_irl,outputs,configure_fra,deploy_fra,deploy_aux deploy; + class release release; ``` ---- +### Pipeline Features -## 🚨 Disaster Recovery Strategies +- 
**Security-First Approach**: Multiple security scanning tools run before deployment +- **Infrastructure Validation**: Templates are validated before deployment +- **Multi-Region Coordination**: Sequential deployment to ensure proper resource creation +- **Output Management**: Cross-region resource information is shared between deployments +- **Automated Release**: Successful deployments trigger release creation with changelogs +- **Rollback Capability**: Failed deployments automatically roll back to previous state + +## 🚨 Disaster Recovery Framework + +The project implements multiple disaster recovery strategies to achieve resilience against various failure scenarios. ```mermaid flowchart TB - DR[Disaster Recovery Patterns] - DR --> BR[Backup & Restore] - DR --> PL[Pilot Light] - DR --> WS[Warm Standby] - DR --> MS[Multi‑site Active‑Active] - - subgraph Info1 [Backup & Restore] - BR1[Backups & Snapshots] - BR2[Restore in Alt Region] - end - subgraph Info2 [Pilot Light] - PL1[Core Infra On] - PL2[Scale-on-Demand] - end - subgraph Info3 [Warm Standby] - WS1[Scaled-Down Prod] - WS2[Instant Scale] - end - subgraph Info4 [Multi‑site] - MS1[Full Prod Everywhere] - MS2[Global LB] - end - DR --> Info1 & Info2 & Info3 & Info4 + DR["Disaster Recovery Strategies"] --> BR["Backup & Restore"] + DR --> PL["Pilot Light"] + DR --> WS["Warm Standby"] + DR --> MS["Multi-site Active/Active"] + + subgraph "Recovery Capability Evolution" + BR --> |"Evolve to"| PL + PL --> |"Evolve to"| WS + WS --> |"Evolve to"| MS + end + + subgraph "Recovery Metrics" + BR --- BRMetrics["RTO: Hours/Days
<br/>RPO: Hours<br/>Cost: Low"]
+ PL --- PLMetrics["RTO: Hours<br/>RPO: Minutes<br/>Cost: Low-Medium"]
+ WS --- WSMetrics["RTO: Minutes<br/>RPO: Minutes<br/>Cost: Medium-High"]
+ MS --- MSMetrics["RTO: Near-zero<br/>RPO: Near-zero<br/>Cost: High"]
+ end
+
+ subgraph "Implementation"
+ BR --- BRImpl["• Regular backups<br/>• Restore procedures<br/>• Recovery testing"]
+ PL --- PLImpl["• Core infra always running<br/>• Scaled down resources<br/>• Rapid scale-up capacity"]
+ WS --- WSImpl["• Fully functional standby<br/>• Reduced capacity<br/>• Auto-scaling ready"]
+ MS --- MSImpl["• Multiple full deployments<br/>• All regions active<br/>
• Load balancing across regions"] + end + + classDef strategy fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px; + classDef metrics fill:#e3f2fd,stroke:#1565c0,stroke-width:2px; + classDef implementation fill:#fff8e1,stroke:#ff8f00,stroke-width:2px; + + class DR,BR,PL,WS,MS strategy; + class BRMetrics,PLMetrics,WSMetrics,MSMetrics metrics; + class BRImpl,PLImpl,WSImpl,MSImpl implementation; ``` ---- +### DR Strategy Comparison + +| Strategy | Recovery Time | Data Loss | Cost | Implementation | +|----------|---------------|-----------|------|---------------| +| **Backup & Restore** | Hours/Days | Hours | $ | Backups with documented restore procedures | +| **Pilot Light** | Hours | Minutes | $$ | Core infrastructure running with rapid scale-up | +| **Warm Standby** | Minutes | Minutes | $$$ | Scaled-down but functional standby environment | +| **Multi-site Active/Active** | Near-zero | Near-zero | $$$$ | Full production deployment in multiple regions | + +This project implements the **Multi-site Active/Active** approach for maximum resilience and minimal recovery time. -## 🔒 Business Continuity Plan +## ⏱️ Business Continuity Planning + +Business continuity is managed through comprehensive impact analysis, recovery objectives, and compliance documentation. ```mermaid mindmap - root((BCP Plan)) - ImpactAnalysis - Financial - Operational - Reputational - Regulatory - RecoveryObjectives - RTO - RPO - MTTR - Uptime - Strategies - BackupRestore - PilotLight - WarmStandby - MultiSite - Communication - Stakeholders - Channels - Templates - Testing - Quarterly - SemiAnnual - Annual + root((Business
Continuity)) + Impact Analysis + Financial Impact + Revenue loss + Recovery costs + Reputational damage + Operational Impact + Service interruption + Business process disruption + Decision-making capability + Regulatory Impact + Compliance violations + Reporting requirements + Audit considerations + Recovery Objectives + RTO (Recovery Time Objective) + Authentication: < 5 min + API Gateway: < 5 min + Lambda Functions: < 5 min + Data Access: < 5 min + RPO (Recovery Point Objective) + Transaction Data: Near-zero + Configuration Data: < 15 min + Log Data: < 60 min + MTTR (Mean Time To Recovery) + Infrastructure: < 15 min + Application: < 10 min + Data: < 5 min + Resilience Strategies + Active/Active Deployment + Automated Failover + Health Checks + Self-Healing Systems + Testing & Validation + Regular DR Drills + Automated Recovery Tests + Compliance Verification ``` +### Business Impact Analysis + +| Impact Category | Description | Mitigation Strategy | +|-----------------|-------------|---------------------| +| **Financial Impact** | Revenue loss during outages | Multi-region active/active to minimize downtime | +| **Operational Impact** | Business process disruption | Automated failover for service continuity | +| **Reputational Impact** | Customer trust erosion | Transparent monitoring and communication | +| **Regulatory Impact** | Compliance violations | Comprehensive logging and audit trails | + +### Recovery Objectives + ```mermaid timeline title RTO/RPO & Uptime Targets @@ -241,9 +393,61 @@ timeline Monitoring : 24/7 ``` ---- +## 🔒 Security & Compliance -## 🛡️ Resilience Hub Policy +The project implements comprehensive security controls and compliance mechanisms. + +```mermaid +mindmap + root((Security &
Compliance)) + Network Security + Private VPCs + Security Groups + Network ACLs + Flow Logging + VPC Endpoints + Identity & Access + IAM Roles + Least Privilege + Resource Policies + Temporary Credentials + Role Assumption + Data Protection + Encryption at Rest + Encryption in Transit + Key Management (KMS) + Backup Protection + Data Lifecycle + Application Security + Input Validation + Output Encoding + Dependency Scanning + Code Analysis + OWASP Top 10 Mitigations + Compliance Controls + NIST 800-53 + ISO 27001 + GDPR + PCI-DSS + SOC 2 + Security Testing + Static Analysis + Dynamic Testing + Infrastructure Scanning + Penetration Testing + Vulnerability Management +``` + +### Compliance Framework Mapping + +| Framework | Relevant Controls | Implementation | +|-----------|-------------------|----------------| +| **NIST SP 800-53 Rev. 5** | CP-2, CP-7, CP-9, CP-10, CP-4(2) | Multi-region deployment, automated recovery, regular testing | +| **NIST CSF 2.0** | RC.RP, RC.RP-4, PR.DS-9, ID.BE-5 | Recovery processes, RPO/RTO targets, backup protection, resilience requirements | +| **ISO 27001:2022** | A.17.1.x, A.17.2.1, A.12.3.1 | Continuity planning, availability management, information backup | +| **AWS Well-Architected** | REL01-09, SEC01-10 | Resilient architecture, security at all layers | + +### Resilience Hub Policy ```mermaid stateDiagram-v2 @@ -258,51 +462,105 @@ stateDiagram-v2 Software --> [*] ``` ---- +## 📦 Infrastructure as Code + +The project is defined entirely as Infrastructure as Code using AWS CloudFormation. -## 📦 CloudFormation Templates +```mermaid +flowchart TB + core["template.yml
<br/>Core Infrastructure"] --> vpc["VPC & Networking"]
+ core --> lambda["Lambda Functions"]
+ core --> api["API Gateway"]
+ core --> dynamo["DynamoDB"]
+ core --> iam["IAM Roles"]
+ core --> endpoints["VPC Endpoints"]
+
+ subgraph "Additional Templates"
+ route53["route53.yml<br/>DNS Configuration"]
+ resiliencehub["app.yml<br/>Resilience Hub App"]
+ dr["disaster-recovery.yml<br/>DR Components"]
+ waf["waf.yml<br/>
WAF Configuration"] + end + + core --> route53 + core --> resiliencehub + core --> dr + core --> waf + + classDef core fill:#e3f2fd,stroke:#1565c0,stroke-width:2px; + classDef component fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px; + classDef additional fill:#fff8e1,stroke:#ff8f00,stroke-width:2px; + + class core core; + class vpc,lambda,api,dynamo,iam,endpoints component; + class route53,resiliencehub,dr,waf additional; +``` -- **template.yml** – VPC, subnets, endpoints, Lambdas, API, DynamoDB -- **route53.yml** – Route 53 weighted/failover records -- **app.yml** – AWS Resilience Hub App & Policy -- **disaster-recovery.yml** – AWS FIS experiments -- **waf.yml** – AWS WAFv2 WebACL +### CloudFormation Templates ---- +| Template | Purpose | Key Resources | +|----------|---------|--------------| +| **template.yml** | Core infrastructure | VPC, Subnets, Lambda, API Gateway, DynamoDB | +| **route53.yml** | DNS configuration | Route 53 health checks, failover records | +| **app.yml** | Resilience configuration | AWS Resilience Hub app and policy definition | +| **disaster-recovery.yml** | DR testing | AWS FIS experiments for DR validation | +| **waf.yml** | Security rules | WAFv2 WebACL and rule sets | ## 🛠️ Tech Stack -- AWS CloudFormation -- AWS Lambda (Node.js 20.x) -- Amazon API Gateway (Regional) -- DynamoDB Global Tables -- VPC Endpoints & Security -- AWS Resilience Hub & FIS -- AWS WAFv2 & IAM -- GitHub Actions, cfn-lint, cfn-nag, Checkov, ZAP, Scorecard - -Details: [techstack.md](techstack.md) +The project leverages a modern, cloud-native technology stack: ---- +```mermaid +mindmap + root((Technology
Stack)) + Infrastructure + AWS CloudFormation + AWS VPC + AWS Route 53 + AWS KMS + AWS Resilience Hub + Compute + AWS Lambda (Node.js 20.x) + AWS API Gateway + Data + Amazon DynamoDB Global Tables + AWS S3 (for backups) + Security + AWS WAFv2 + AWS IAM + AWS Security Hub + DevOps + GitHub Actions + cfn-lint + cfn-nag + Checkov + ZAP + Scorecard + Monitoring + AWS CloudWatch + AWS X-Ray + Custom Dashboards +``` ## 📖 Runbooks -- **DynamoDB** – AWS Systems Manager -- **Lambda** – AWS Systems Manager -- **App Runner** – AWS App Runner -- **IAM** – AWS IAM Automation +Comprehensive documentation is available for operations and recovery: ---- +| Runbook | Purpose | Implementation | +|---------|---------|----------------| +| **[DynamoDB Runbook](runbooks/dynamodb.md)** | DynamoDB recovery | AWS Systems Manager automation | +| **[Lambda Runbook](runbooks/lambda.md)** | Lambda function recovery | AWS Systems Manager automation | +| **[API Gateway Runbook](runbooks/apigateway.md)** | API Gateway recovery | Recovery procedures | +| **[IAM Runbook](runbooks/iam.md)** | Identity management recovery | IAM automation workflows | ## 🔗 References -- AWS Resilience Hub: https://docs.aws.amazon.com/resilience-hub/latest/userguide/ -- DR I: https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/ -- DR IV: https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/ -- AWS SLA: https://aws.amazon.com/legal/service-level-agreements/ - ---- +- [AWS Resilience Hub Documentation](https://docs.aws.amazon.com/resilience-hub/latest/userguide/) +- [Disaster Recovery on AWS - Part I: Strategies for Recovery in the Cloud](https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/) +- [Disaster Recovery on AWS - Part IV: Multi-site 
Active/Active](https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/) +- [AWS Service Level Agreements](https://aws.amazon.com/legal/service-level-agreements/) +- [AWS Well-Architected Framework - Reliability Pillar](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html) ## 📄 License -Apache License 2.0 – see [LICENSE.md](LICENSE.md) +This project is licensed under the Apache License 2.0 - see [LICENSE.md](LICENSE.md) for details. From 9389e832ce83442f04814390b4f870aeac66f900 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?James=20Pether=20S=C3=B6rling?= Date: Thu, 17 Apr 2025 00:49:25 +0200 Subject: [PATCH 08/10] Update README.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: James Pether Sörling --- README.md | 764 +++++++++++++++++++++--------------------------------- 1 file changed, 290 insertions(+), 474 deletions(-) diff --git a/README.md b/README.md index f634ee0e..4b150420 100644 --- a/README.md +++ b/README.md @@ -5,561 +5,377 @@ [![CI/CD](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/main.yml/badge.svg)](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/main.yml) [![Scorecard Security](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/scorecard.yml/badge.svg?branch=main)](https://github.com/Hack23/lambda-in-private-vpc/actions/workflows/scorecard.yml) -> **Enterprise-grade multi‑region active/active architecture** with automated failover, comprehensive disaster recovery, and strict RTO/RPO enforcement for mission-critical applications. +> **Enterprise-grade multi-region active/active architecture** with near-zero recovery time, comprehensive DNS failover, and AWS Resilience Hub policy compliance for mission-critical applications. 
## 📋 Table of Contents -- [🧠 Project Overview](#-project-overview) -- [📐 Architecture](#-architecture) -- [🔗 Network Topology](#-network-topology) -- [🚦 CI/CD Pipeline](#-cicd-pipeline) -- [🚨 Disaster Recovery Framework](#-disaster-recovery-framework) -- [⏱️ Business Continuity Planning](#️-business-continuity-planning) -- [🔒 Security & Compliance](#-security--compliance) +- [📑 Project Overview](#-project-overview) +- [🏗️ Architecture](#️-architecture) +- [🔐 Network & Security](#-network--security) +- [🧪 Resilience Testing](#-resilience-testing) +- [⏱️ Recovery Objectives](#️-recovery-objectives) +- [🔄 CI/CD Pipeline](#-cicd-pipeline) - [📦 Infrastructure as Code](#-infrastructure-as-code) -- [🛠️ Tech Stack](#️-tech-stack) -- [📖 Runbooks](#-runbooks) -- [🔗 References](#-references) +- [📚 Documentation](#-documentation) - [📄 License](#-license) -## 🧠 Project Overview +## 📑 Project Overview -This project implements a highly resilient, secure serverless architecture using AWS Lambda in private VPCs across multiple regions. It's designed for enterprise-grade applications requiring stringent security, high availability, and disaster recovery capabilities. +This project implements a highly resilient serverless architecture with AWS Lambda functions deployed in private VPCs across multiple AWS regions (Ireland and Frankfurt). It features comprehensive security controls, automated failover mechanisms, and stringent disaster recovery capabilities through AWS Resilience Hub policy enforcement. 
```mermaid mindmap root((Lambda in Private VPC)) Infrastructure - VPC - Subnets - Endpoints - ACLs & SGs - FlowLogs - Compute - HealthLambda - CrudLambda - API & DNS - APIGateway - CustomDomain - Route53 - HealthChecks - Data - DynamoDB_GlobalTable - DeadLetter_SNS + Dual-Region VPCs + Private Subnet Isolation + VPC Endpoints + DNS Firewall + Flow Logs + Compute & API + Lambda Functions + API Gateway + Custom Domain + Route 53 Failover + Health Checks Security - WAFv2 - IAM_Roles - KMS - Scans - CFN_lint - CFN_nag - Checkov - ZAP - Scorecard - Resilience_DR - ResHub - RTO_RPO - HA - DR_Strategies - BackupRestore - PilotLight - WarmStandby - MultiSite - BCP - CI_CD - Linting - SecurityScans - Deploy - Ireland - Frankfurt - AuxStacks - Release - Observability - CW_Logs - Alarms - XRay - Documentation - README - Runbooks - TechStack + Private DNS + WAFv2 Protection + Network ACLs + Security Groups + KMS Encryption + Resilience + Mission-Critical Policy + RTO/RPO Enforcement + Multi-Region Active/Active + Automated Failover + Chaos Engineering + Data Layer + Global Tables + Cross-Region Replication + Point-in-Time Recovery + Dead Letter Queues + CI/CD & Observability + Automated Deployment + Security Scanning + Alarms & Notifications + CloudWatch Monitoring + X-Ray Tracing ``` -## 📐 Architecture +## 🏗️ Architecture -The architecture implements a multi-region active/active design with isolated private VPCs, comprehensive security controls, and automated failover capabilities. +A true active/active multi-region architecture with isolated private subnets, global data replication, and automated failover systems. 
```mermaid flowchart TB - subgraph "Ireland (eu-west-1)" - subgraph "Ireland VPC (10.1.0.0/16)" - IPS[Private Subnets] - IEP[VPC Endpoints] - ISG[Security Groups] - - IPS --> IEP - IPS --> ISG + subgraph "Multi-Region Active/Active Architecture" + subgraph "Ireland (eu-west-1)" + IR_VPC["VPC 10.1.0.0/16"] --> IR_SUBNETS["Private Subnets (3 AZs)"] + IR_SUBNETS --> IR_LAMBDA["Lambda Functions"] + IR_LAMBDA --> IR_DYNAMO["DynamoDB\nGlobal Table"] + IR_LAMBDA --> IR_API["API Gateway"] + IR_API --> IR_DOMAIN["Custom Domain"] + IR_VPC --> IR_FLOW["Flow Logs"] + IR_VPC --> IR_DNS["DNS Firewall"] + IR_SUBNETS --> IR_EP["VPC Endpoints"] end - IL[Lambda Functions] --> IPS - IAPI[API Gateway] --> IL - IDT[DynamoDB Table] --> IL - IL --> IDT - end - - subgraph "Frankfurt (eu-central-1)" - subgraph "Frankfurt VPC (10.5.0.0/16)" - FPS[Private Subnets] - FEP[VPC Endpoints] - FSG[Security Groups] - - FPS --> FEP - FPS --> FSG + subgraph "Frankfurt (eu-central-1)" + FR_VPC["VPC 10.5.0.0/16"] --> FR_SUBNETS["Private Subnets (3 AZs)"] + FR_SUBNETS --> FR_LAMBDA["Lambda Functions"] + FR_LAMBDA --> FR_DYNAMO["DynamoDB\nGlobal Table"] + FR_LAMBDA --> FR_API["API Gateway"] + FR_API --> FR_DOMAIN["Custom Domain"] + FR_VPC --> FR_FLOW["Flow Logs"] + FR_VPC --> FR_DNS["DNS Firewall"] + FR_SUBNETS --> FR_EP["VPC Endpoints"] end - FL[Lambda Functions] --> FPS - FAPI[API Gateway] --> FL - FDT[DynamoDB Table] --> FL - FL --> FDT - end - - DR[Route 53] --> IAPI - DR --> FAPI - - IDT <-.-> FDT - - WAF[WAF] --> IAPI - WAF --> FAPI - - subgraph "Monitoring & Resilience" - CW[CloudWatch] - AH[AWS Resilience Hub] - XR[X-Ray] + IR_DOMAIN -.-> R53["Route 53\nWeighted/Failover"] + FR_DOMAIN -.-> R53 + IR_DYNAMO <--> FR_DYNAMO + + WAF["WAF v2"] --> IR_API + WAF --> FR_API - CW --> IL - CW --> FL - AH --> IL - AH --> FL - XR --> IL - XR --> FL + HC["Health Checks"] --> IR_API + HC --> FR_API + HC -.-> R53 + + REH["AWS Resilience Hub\nMission Critical Policy"] --> IR_LAMBDA + REH --> FR_LAMBDA + REH --> 
IR_DYNAMO + REH --> FR_DYNAMO end - classDef primary fill:#e1f5fe,stroke:#0277bd,stroke-width:2px; - classDef secondary fill:#fff8e1,stroke:#ff8f00,stroke-width:2px; - classDef security fill:#ffebee,stroke:#c62828,stroke-width:2px; - classDef resilience fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px; - - class IPS,FPS,IEP,FEP,ISG,FSG primary; - class IL,FL,IAPI,FAPI,IDT,FDT secondary; - class WAF,DR security; - class CW,AH,XR resilience; + classDef ireland fill:#81c784,stroke:#2e7d32,stroke-width:2px,color:#000; + classDef frankfurt fill:#64b5f6,stroke:#1565c0,stroke-width:2px,color:#000; + classDef security fill:#ef5350,stroke:#c62828,stroke-width:2px,color:#fff; + classDef routing fill:#ffab40,stroke:#f57c00,stroke-width:2px,color:#000; + classDef resilience fill:#ba68c8,stroke:#7b1fa2,stroke-width:2px,color:#fff; + + class IR_VPC,IR_SUBNETS,IR_LAMBDA,IR_DYNAMO,IR_API,IR_DOMAIN,IR_FLOW,IR_DNS,IR_EP ireland; + class FR_VPC,FR_SUBNETS,FR_LAMBDA,FR_DYNAMO,FR_API,FR_DOMAIN,FR_FLOW,FR_DNS,FR_EP frankfurt; + class WAF,HC security; + class R53 routing; + class REH resilience; ``` -### Key Architecture Components - -- **Multi-Region Deployment**: Active/active setup in Ireland (eu-west-1) and Frankfurt (eu-central-1) -- **VPC Isolation**: Private subnets with no internet access for enhanced security -- **VPC Endpoints**: Secure AWS service access without internet exposure -- **Global Data Replication**: DynamoDB global tables with multi-region consistency -- **Intelligent Routing**: Route 53 health checks with automated failover -- **Identity & Access**: Fine-grained IAM roles following least privilege principle -- **Application Protection**: WAFv2 rules to protect API endpoints -- **Comprehensive Monitoring**: CloudWatch, X-Ray, and custom health checks +### Key Components -## 🔗 Network Topology +- **Isolated Private VPCs**: Dedicated VPCs in each region with no internet access +- **Multi-AZ Deployment**: 3 private subnets across availability zones for high availability 
+- **Private Network Controls**: Security groups, NACLs, and flow logs for comprehensive protection +- **Global Data Layer**: DynamoDB global tables with automatic multi-region replication +- **Intelligent Routing**: Route 53 health checks and weighted routing with automatic failover +- **API Gateway**: Regional endpoints with custom domain names and WAF protection -Each region implements a secure network topology with private subnets, strict network controls, and comprehensive logging. +## 🔐 Network & Security ```mermaid -graph TB - subgraph "VPC Architecture" - VPC["VPC (10.1.0.0/16)"] --> PS["Private Subnets"] +graph TD + subgraph "Network Security Architecture" + VPC["VPC (10.1.0.0/16)"] - subgraph "Private Subnets" - PS1["Subnet 1 (10.1.0.0/24)"] - PS2["Subnet 2 (10.1.1.0/24)"] - PS3["Subnet 3 (10.1.2.0/24)"] + subgraph "Private Subnet Security" + NACL["Network ACLs"] --> DENY["Deny RDP (3389)"] + NACL --> ALLOW_OUT["Allow HTTPS Outbound (443)"] + SGFVPC["VPC Endpoint SG"] + SGLMB["Lambda SG"] + + SGFVPC --> ALLOW_SG["Allow HTTPS from Lambda SG"] + SGLMB --> ALLOW_VPC["Allow HTTPS to VPC Endpoints"] end - PS --> ACL["Network ACLs"] - PS --> SG["Security Groups"] - PS --> EP["VPC Endpoints"] + VPC --> SUBNET["Private Subnets"] + SUBNET --> NACL + SUBNET --> SGFVPC + SUBNET --> SGLMB - subgraph "VPC Endpoints" - S3["S3 Gateway"] - DDB["DynamoDB Gateway"] - EC2["EC2 Interface"] - CW["CloudWatch Interface"] - SSM["SSM Interface"] - KMS["KMS Interface"] - end + VPC --> DNS_FW["DNS Firewall"] + DNS_FW --> ALLOW_AWS["Allow *.amazonaws.com"] + DNS_FW --> BLOCK_ALL["Block All Other Domains"] + + VPC --> FLOW["Flow Logs"] + FLOW --> CWLOGS["CloudWatch Logs"] - FL["Flow Logs"] --> VPC + KMS["KMS Encryption"] + KMS --> SNS_ENC["SNS Topic Encryption"] + KMS --> LOGS_ENC["Log Group Encryption"] + + WAF["WAFv2"] --> RULES["AWS Managed Rules"] + RULES --> IP_REP["IP Reputation List"] + RULES --> ANON_IP["Anonymous IP List"] + RULES --> COMMON["Common Rule Set"] + RULES --> 
BAD_IN["Known Bad Inputs"] + RULES --> LINUX["Linux Rule Set"] + RULES --> UNIX["Unix Rule Set"] end - Lambda["Lambda Functions"] --> PS - Lambda --> SG - Lambda --> EP - - classDef vpc fill:#e3f2fd,stroke:#1565c0,stroke-width:2px; - classDef network fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px; - classDef security fill:#ffebee,stroke:#c62828,stroke-width:2px; - classDef compute fill:#fff8e1,stroke:#ff8f00,stroke-width:2px; + classDef vpc fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px; + classDef nacl fill:#e1f5fe,stroke:#0277bd,stroke-width:2px; + classDef sg fill:#fff8e1,stroke:#ff8f00,stroke-width:2px; + classDef dns fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px; + classDef encryption fill:#ffebee,stroke:#c62828,stroke-width:2px; + classDef waf fill:#fce4ec,stroke:#c2185b,stroke-width:2px; - class VPC,PS,PS1,PS2,PS3 vpc; - class ACL,SG,FL network; - class EP,S3,DDB,EC2,CW,SSM,KMS security; - class Lambda compute; + class VPC,SUBNET vpc; + class NACL,DENY,ALLOW_OUT nacl; + class SGFVPC,SGLMB,ALLOW_SG,ALLOW_VPC sg; + class DNS_FW,ALLOW_AWS,BLOCK_ALL dns; + class KMS,SNS_ENC,LOGS_ENC encryption; + class WAF,RULES,IP_REP,ANON_IP,COMMON,BAD_IN,LINUX,UNIX waf; ``` -### Network Security Features - -| Feature | Implementation | Purpose | -|---------|---------------|---------| -| **Private Subnets** | 3 AZs per region | Isolate compute resources from internet | -| **Security Groups** | Stateful, fine-grained | Control traffic at instance level | -| **Network ACLs** | Stateless, subnet-level | Additional layer of network security | -| **VPC Endpoints** | Gateway and Interface | Secure AWS service access | -| **Flow Logs** | VPC, subnet, and ENI levels | Network traffic visibility and auditing | -| **Transit Encryption** | TLS for all traffic | Data protection in transit | +### Security Features -## 🚦 CI/CD Pipeline +- **Private VPC Design**: No internet gateways, isolated subnets +- **DNS Firewall**: Allows only AWS domains, blocks all other outbound DNS queries +- **Network 
ACLs**: Customized ingress/egress rules with RDP blocking +- **Security Groups**: Least-privilege access between Lambda and VPC endpoints +- **Comprehensive WAF Protection**: Six AWS managed rule groups for API security +- **VPC Flow Logs**: Network traffic visibility and auditing with encrypted logs +- **KMS Encryption**: Custom KMS keys for SNS topics and CloudWatch logs +- **IAM Least Privilege**: Detailed IAM roles and policies for all components -The project implements a comprehensive CI/CD pipeline with security scanning, multi-region deployment, and automated verification. +## 🧪 Resilience Testing ```mermaid flowchart TB - start[Code Push/Dispatch] --> lint{Security Scanning} - - lint --> cfn["cfn-lint"] - lint --> nag["cfn-nag"] - lint --> chk["Checkov"] - lint --> score["Scorecard"] - lint --> zap["ZAP API Scan"] - - cfn --> configure_irl["Configure AWS (eu-west-1)"] - nag --> configure_irl - chk --> configure_irl - score --> configure_irl - zap --> configure_irl - - configure_irl --> deploy_irl["Deploy Core → Ireland"] - deploy_irl --> outputs["Collect Outputs"] - outputs --> configure_fra["Configure AWS (eu-central-1)"] - configure_fra --> deploy_fra["Deploy Core → Frankfurt"] - deploy_fra --> deploy_aux["Deploy Auxiliary Stacks"] - deploy_aux --> release["Tag & Release"] - - classDef start fill:#e3f2fd,stroke:#1565c0,stroke-width:2px; - classDef security fill:#ffebee,stroke:#c62828,stroke-width:2px; - classDef deploy fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px; - classDef release fill:#fff8e1,stroke:#ff8f00,stroke-width:2px; - - class start start; - class lint,cfn,nag,chk,score,zap security; - class configure_irl,deploy_irl,outputs,configure_fra,deploy_fra,deploy_aux deploy; - class release release; -``` - -### Pipeline Features - -- **Security-First Approach**: Multiple security scanning tools run before deployment -- **Infrastructure Validation**: Templates are validated before deployment -- **Multi-Region Coordination**: Sequential deployment to 
ensure proper resource creation -- **Output Management**: Cross-region resource information is shared between deployments -- **Automated Release**: Successful deployments trigger release creation with changelogs -- **Rollback Capability**: Failed deployments automatically roll back to previous state - -## 🚨 Disaster Recovery Framework - -The project implements multiple disaster recovery strategies to achieve resilience against various failure scenarios. - -```mermaid -flowchart TB - DR["Disaster Recovery Strategies"] --> BR["Backup & Restore"] - DR --> PL["Pilot Light"] - DR --> WS["Warm Standby"] - DR --> MS["Multi-site Active/Active"] - - subgraph "Recovery Capability Evolution" - BR --> |"Evolve to"| PL - PL --> |"Evolve to"| WS - WS --> |"Evolve to"| MS - end - - subgraph "Recovery Metrics" - BR --- BRMetrics["RTO: Hours/Days
RPO: Hours
Cost: Low"] - PL --- PLMetrics["RTO: Hours
RPO: Minutes
Cost: Low-Medium"] - WS --- WSMetrics["RTO: Minutes
RPO: Minutes
Cost: Medium-High"] - MS --- MSMetrics["RTO: Near-zero
RPO: Near-zero
Cost: High"] - end - - subgraph "Implementation" - BR --- BRImpl["• Regular backups
• Restore procedures
• Recovery testing"] - PL --- PLImpl["• Core infra always running
• Scaled down resources
• Rapid scale-up capacity"] - WS --- WSImpl["• Fully functional standby
• Reduced capacity
• Auto-scaling ready"] - MS --- MSImpl["• Multiple full deployments
• All regions active
• Load balancing across regions"] + subgraph "Fault Injection Experiments" + DR["Disaster Recovery\nTest Framework"] + + DR --> API_FAIL["API Gateway\nLambda Access\nDenial"] + DR --> DDB_DEL["DynamoDB Table\nDeletion"] + DR --> PITR["Point-In-Time\nRecovery"] + DR --> BACKUP["Backup\nRestoration"] + + API_FAIL --> SSM_API["SSM Automation\nDeny Access"] + DDB_DEL --> SSM_DEL["SSM Automation\nDelete Table"] + PITR --> SSM_PITR["SSM Automation\nRestore PITR"] + BACKUP --> SSM_BACK["SSM Automation\nRestore Backup"] + + SSM_API --> MONITOR["Health Check\nMonitoring"] + SSM_DEL --> MONITOR + SSM_PITR --> MONITOR + SSM_BACK --> MONITOR + + MONITOR --> FAILOVER["Automatic\nRoute 53\nFailover"] + MONITOR --> RESTORE["Recovery\nAutomation"] end - - classDef strategy fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px; - classDef metrics fill:#e3f2fd,stroke:#1565c0,stroke-width:2px; - classDef implementation fill:#fff8e1,stroke:#ff8f00,stroke-width:2px; - - class DR,BR,PL,WS,MS strategy; - class BRMetrics,PLMetrics,WSMetrics,MSMetrics metrics; - class BRImpl,PLImpl,WSImpl,MSImpl implementation; -``` - -### DR Strategy Comparison - -| Strategy | Recovery Time | Data Loss | Cost | Implementation | -|----------|---------------|-----------|------|---------------| -| **Backup & Restore** | Hours/Days | Hours | $ | Backups with documented restore procedures | -| **Pilot Light** | Hours | Minutes | $$ | Core infrastructure running with rapid scale-up | -| **Warm Standby** | Minutes | Minutes | $$$ | Scaled-down but functional standby environment | -| **Multi-site Active/Active** | Near-zero | Near-zero | $$$$ | Full production deployment in multiple regions | - -This project implements the **Multi-site Active/Active** approach for maximum resilience and minimal recovery time. - -## ⏱️ Business Continuity Planning -Business continuity is managed through comprehensive impact analysis, recovery objectives, and compliance documentation. - -```mermaid -mindmap - root((Business
Continuity)) - Impact Analysis - Financial Impact - Revenue loss - Recovery costs - Reputational damage - Operational Impact - Service interruption - Business process disruption - Decision-making capability - Regulatory Impact - Compliance violations - Reporting requirements - Audit considerations - Recovery Objectives - RTO (Recovery Time Objective) - Authentication: < 5 min - API Gateway: < 5 min - Lambda Functions: < 5 min - Data Access: < 5 min - RPO (Recovery Point Objective) - Transaction Data: Near-zero - Configuration Data: < 15 min - Log Data: < 60 min - MTTR (Mean Time To Recovery) - Infrastructure: < 15 min - Application: < 10 min - Data: < 5 min - Resilience Strategies - Active/Active Deployment - Automated Failover - Health Checks - Self-Healing Systems - Testing & Validation - Regular DR Drills - Automated Recovery Tests - Compliance Verification + classDef framework fill:#e1bee7,stroke:#8e24aa,stroke-width:2px; + classDef experiment fill:#bbdefb,stroke:#1976d2,stroke-width:2px; + classDef automation fill:#c8e6c9,stroke:#388e3c,stroke-width:2px; + classDef monitoring fill:#ffecb3,stroke:#ffa000,stroke-width:2px; + classDef recovery fill:#ffcdd2,stroke:#d32f2f,stroke-width:2px; + + class DR framework; + class API_FAIL,DDB_DEL,PITR,BACKUP experiment; + class SSM_API,SSM_DEL,SSM_PITR,SSM_BACK automation; + class MONITOR monitoring; + class FAILOVER,RESTORE recovery; ``` -### Business Impact Analysis +### Chaos Engineering Capabilities -| Impact Category | Description | Mitigation Strategy | -|-----------------|-------------|---------------------| -| **Financial Impact** | Revenue loss during outages | Multi-region active/active to minimize downtime | -| **Operational Impact** | Business process disruption | Automated failover for service continuity | -| **Reputational Impact** | Customer trust erosion | Transparent monitoring and communication | -| **Regulatory Impact** | Compliance violations | Comprehensive logging and audit trails | +- **AWS Fault 
Injection Service**: Predefined experiments to simulate failures +- **Lambda Access Denial**: Tests API Gateway resilience during Lambda failures +- **DynamoDB Failure Scenarios**: Table deletion and recovery testing +- **Point-in-Time Recovery**: Automated restore procedures with defined RPOs +- **Backup Restoration**: Complete data recovery from backups +- **SSM Automation Documents**: Pre-defined recovery runbooks for all scenarios -### Recovery Objectives +## ⏱️ Recovery Objectives -```mermaid -timeline - title RTO/RPO & Uptime Targets - section RTO (Recovery Time) - Infra & API : 1h - Core Lambdas : 2h - Data Services : 4h - section RPO (Data Loss) - Transient Logs : 5m - User Data : 15m - Config & State : 1h - section Uptime - App Endpoint : 99.9% - DNS Failover : 99.99% - Monitoring : 24/7 -``` - -## 🔒 Security & Compliance - -The project implements comprehensive security controls and compliance mechanisms. - -```mermaid -mindmap - root((Security &
Compliance)) - Network Security - Private VPCs - Security Groups - Network ACLs - Flow Logging - VPC Endpoints - Identity & Access - IAM Roles - Least Privilege - Resource Policies - Temporary Credentials - Role Assumption - Data Protection - Encryption at Rest - Encryption in Transit - Key Management (KMS) - Backup Protection - Data Lifecycle - Application Security - Input Validation - Output Encoding - Dependency Scanning - Code Analysis - OWASP Top 10 Mitigations - Compliance Controls - NIST 800-53 - ISO 27001 - GDPR - PCI-DSS - SOC 2 - Security Testing - Static Analysis - Dynamic Testing - Infrastructure Scanning - Penetration Testing - Vulnerability Management -``` - -### Compliance Framework Mapping - -| Framework | Relevant Controls | Implementation | -|-----------|-------------------|----------------| -| **NIST SP 800-53 Rev. 5** | CP-2, CP-7, CP-9, CP-10, CP-4(2) | Multi-region deployment, automated recovery, regular testing | -| **NIST CSF 2.0** | RC.RP, RC.RP-4, PR.DS-9, ID.BE-5 | Recovery processes, RPO/RTO targets, backup protection, resilience requirements | -| **ISO 27001:2022** | A.17.1.x, A.17.2.1, A.12.3.1 | Continuity planning, availability management, information backup | -| **AWS Well-Architected** | REL01-09, SEC01-10 | Resilient architecture, security at all layers | - -### Resilience Hub Policy +This architecture achieves stringent recovery time objectives (RTO) and recovery point objectives (RPO) through AWS Resilience Hub policy enforcement. 
```mermaid stateDiagram-v2 - [*] --> Region - Region: RTO=3600s\nRPO=5s - Region --> AZ - AZ: RTO=1s\nRPO=1s - AZ --> Hardware - Hardware: RTO=1s\nRPO=1s - Hardware --> Software - Software: RTO=5400s\nRPO=300s - Software --> [*] + [*] --> MissionCritical + + state MissionCritical { + [*] --> Region + Region: "Regional Failure" + Region: RTO: 3600s (1h) + Region: RPO: 5s + + Region --> AZ + AZ: "AZ Failure" + AZ: RTO: 1s + AZ: RPO: 1s + + AZ --> Hardware + Hardware: "Hardware Failure" + Hardware: RTO: 1s + Hardware: RPO: 1s + + Hardware --> Software + Software: "Software Failure" + Software: RTO: 5400s (90m) + Software: RPO: 300s (5m) + + Software --> [*] + } ``` -## 📦 Infrastructure as Code +### Recovery Metrics + +| Failure Scenario | Recovery Time Objective | Recovery Point Objective | Implementation | +|------------------|--------------------------|--------------------------|----------------| +| **Regional Failure** | 3600s (1 hour) | 5s | Multi-region active/active with Route 53 failover | +| **Availability Zone Failure** | 1s | 1s | Multi-AZ deployment in each region | +| **Hardware Failure** | 1s | 1s | AWS managed infrastructure redundancy | +| **Software Failure** | 5400s (90 min) | 300s (5 min) | Automated recovery procedures and global tables | -The project is defined entirely as Infrastructure as Code using AWS CloudFormation. +## 🔄 CI/CD Pipeline ```mermaid -flowchart TB - core["template.yml
Core Infrastructure"] --> vpc["VPC & Networking"] - core --> lambda["Lambda Functions"] - core --> api["API Gateway"] - core --> dynamo["DynamoDB"] - core --> iam["IAM Roles"] - core --> endpoints["VPC Endpoints"] +flowchart TD + GH_PUSH["GitHub Push/Workflow Dispatch"] --> SEC_SCAN{"Security Scanning"} - subgraph "Additional Templates" - route53["route53.yml
DNS Configuration"] - resiliencehub["app.yml
Resilience Hub App"] - dr["disaster-recovery.yml
DR Components"] - waf["waf.yml
WAF Configuration"] - end + SEC_SCAN --> CFN_LINT["cfn-lint"] + SEC_SCAN --> CFN_NAG["cfn-nag"] + SEC_SCAN --> CHECKOV["Checkov"] + SEC_SCAN --> SCORECARD["Scorecard"] + SEC_SCAN --> ZAP["ZAP API Scan"] + + CFN_LINT & CFN_NAG & CHECKOV & SCORECARD & ZAP --> CONFIG_IR["Configure AWS (eu-west-1)"] - core --> route53 - core --> resiliencehub - core --> dr - core --> waf + CONFIG_IR --> DEPLOY_IR["Deploy Core → Ireland"] + DEPLOY_IR --> OUTPUTS["Collect Outputs"] + OUTPUTS --> CONFIG_FR["Configure AWS (eu-central-1)"] + CONFIG_FR --> DEPLOY_FR["Deploy Core → Frankfurt"] - classDef core fill:#e3f2fd,stroke:#1565c0,stroke-width:2px; - classDef component fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px; - classDef additional fill:#fff8e1,stroke:#ff8f00,stroke-width:2px; + DEPLOY_FR --> DEPLOY_AUX["Deploy Auxiliary Stacks"] + DEPLOY_AUX --> DEPLOY_R53["Route 53 Configuration"] + DEPLOY_AUX --> DEPLOY_WAF["WAF Configuration"] + DEPLOY_AUX --> DEPLOY_RHB["Resilience Hub App"] + DEPLOY_AUX --> DEPLOY_DR["Disaster Recovery Tests"] - class core core; - class vpc,lambda,api,dynamo,iam,endpoints component; - class route53,resiliencehub,dr,waf additional; + DEPLOY_R53 & DEPLOY_WAF & DEPLOY_RHB & DEPLOY_DR --> TAG["Tag & Release"] + + classDef github fill:#f8cecc,stroke:#b85450,stroke-width:2px; + classDef security fill:#d5e8d4,stroke:#82b366,stroke-width:2px; + classDef deploy fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px; + classDef auxiliary fill:#fff2cc,stroke:#d6b656,stroke-width:2px; + classDef release fill:#e1d5e7,stroke:#9673a6,stroke-width:2px; + + class GH_PUSH github; + class SEC_SCAN,CFN_LINT,CFN_NAG,CHECKOV,SCORECARD,ZAP security; + class CONFIG_IR,DEPLOY_IR,OUTPUTS,CONFIG_FR,DEPLOY_FR deploy; + class DEPLOY_AUX,DEPLOY_R53,DEPLOY_WAF,DEPLOY_RHB,DEPLOY_DR auxiliary; + class TAG release; ``` -### CloudFormation Templates +### Automated Workflow -| Template | Purpose | Key Resources | -|----------|---------|--------------| -| **template.yml** | Core infrastructure | VPC, 
Subnets, Lambda, API Gateway, DynamoDB | -| **route53.yml** | DNS configuration | Route 53 health checks, failover records | -| **app.yml** | Resilience configuration | AWS Resilience Hub app and policy definition | -| **disaster-recovery.yml** | DR testing | AWS FIS experiments for DR validation | -| **waf.yml** | Security rules | WAFv2 WebACL and rule sets | +1. **Security Scanning**: Multiple tools scan CloudFormation templates for issues +2. **Sequential Deployment**: Ireland deployment, followed by Frankfurt +3. **Cross-Region Integration**: Output collection and sharing between regions +4. **Auxiliary Resources**: Route 53, WAF, Resilience Hub, and DR test configuration +5. **Automated Release**: Version tagging and release notes generation -## 🛠️ Tech Stack +## 📦 Infrastructure as Code -The project leverages a modern, cloud-native technology stack: +This project is entirely defined as CloudFormation templates with comprehensive resource definitions. -```mermaid -mindmap - root((Technology
Stack)) - Infrastructure - AWS CloudFormation - AWS VPC - AWS Route 53 - AWS KMS - AWS Resilience Hub - Compute - AWS Lambda (Node.js 20.x) - AWS API Gateway - Data - Amazon DynamoDB Global Tables - AWS S3 (for backups) - Security - AWS WAFv2 - AWS IAM - AWS Security Hub - DevOps - GitHub Actions - cfn-lint - cfn-nag - Checkov - ZAP - Scorecard - Monitoring - AWS CloudWatch - AWS X-Ray - Custom Dashboards -``` +### Template Structure + +| Template | Purpose | Key Components | +|----------|---------|----------------| +| **template.yml** | Core Infrastructure | VPCs, Subnets, Lambda, API Gateway, DynamoDB, DNS Firewall, Security Groups | +| **route53.yml** | DNS Configuration | Weighted A/AAAA records, Health check integration, Failover configuration | +| **app.yml** | Resilience Hub | Mission Critical policy definition, RTO/RPO targets, Multi-resource mapping | +| **disaster-recovery.yml** | DR Testing | FIS experiments, SSM automation documents, Recovery procedures | +| **waf.yml** | Security Rules | WAF WebACL with AWS managed rules for API protection | + +### Key Resource Highlights + +- **DNS Firewall**: Allows only AWS domains, blocks all others +- **Private DNS Configuration**: Secure VPC DNS settings +- **Network ACLs**: Custom ingress/egress rules +- **Health Checks**: Route 53 health checks for API endpoints +- **WAF Protection**: Six AWS managed rule groups +- **Global Tables**: Cross-region data replication +- **IAM Roles**: Least privilege principle implementation -## 📖 Runbooks +## 📚 Documentation -Comprehensive documentation is available for operations and recovery: +### Runbooks -| Runbook | Purpose | Implementation | -|---------|---------|----------------| -| **[DynamoDB Runbook](runbooks/dynamodb.md)** | DynamoDB recovery | AWS Systems Manager automation | -| **[Lambda Runbook](runbooks/lambda.md)** | Lambda function recovery | AWS Systems Manager automation | -| **[API Gateway Runbook](runbooks/apigateway.md)** | API Gateway recovery | Recovery 
procedures | -| **[IAM Runbook](runbooks/iam.md)** | Identity management recovery | IAM automation workflows | +- **DynamoDB Recovery**: Automated systems manager runbook +- **Lambda Function Recovery**: Automated systems manager runbook +- **API Gateway Recovery**: Step-by-step recovery procedures +- **IAM Automation**: Identity and access management workflows -## 🔗 References +### References - [AWS Resilience Hub Documentation](https://docs.aws.amazon.com/resilience-hub/latest/userguide/) - [Disaster Recovery on AWS - Part I: Strategies for Recovery in the Cloud](https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/) - [Disaster Recovery on AWS - Part IV: Multi-site Active/Active](https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/) - [AWS Service Level Agreements](https://aws.amazon.com/legal/service-level-agreements/) -- [AWS Well-Architected Framework - Reliability Pillar](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html) ## 📄 License From b462fe1bd987fe5e45fc4039ce47d87b79ce43a5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?James=20Pether=20S=C3=B6rling?= Date: Thu, 17 Apr 2025 00:55:32 +0200 Subject: [PATCH 09/10] Update README.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: James Pether Sörling --- README.md | 606 ++++++++++++++++++++++++++++++------------------------ 1 file changed, 338 insertions(+), 268 deletions(-) diff --git a/README.md b/README.md index 4b150420..77581bb1 100644 --- a/README.md +++ b/README.md @@ -9,373 +9,443 @@ ## 📋 Table of Contents -- [📑 Project Overview](#-project-overview) -- [🏗️ Architecture](#️-architecture) -- [🔐 Network & Security](#-network--security) -- [🧪 Resilience Testing](#-resilience-testing) -- [⏱️ Recovery Objectives](#️-recovery-objectives) -- [🔄 CI/CD Pipeline](#-cicd-pipeline) -- [📦 
Infrastructure as Code](#-infrastructure-as-code) +- [🌟 Project Overview](#-project-overview) +- [🏗️ Architecture Design](#️-architecture-design) +- [🔐 Security & Network Controls](#-security--network-controls) +- [⚡ Resilience Framework](#-resilience-framework) +- [🧪 Chaos Engineering](#-chaos-engineering) +- [🔄 CI/CD Automation](#-cicd-automation) +- [🔧 Infrastructure as Code](#-infrastructure-as-code) - [📚 Documentation](#-documentation) - [📄 License](#-license) -## 📑 Project Overview +## 🌟 Project Overview This project implements a highly resilient serverless architecture with AWS Lambda functions deployed in private VPCs across multiple AWS regions (Ireland and Frankfurt). It features comprehensive security controls, automated failover mechanisms, and stringent disaster recovery capabilities through AWS Resilience Hub policy enforcement. ```mermaid mindmap - root((Lambda in Private VPC)) - Infrastructure - Dual-Region VPCs - Private Subnet Isolation - VPC Endpoints - DNS Firewall - Flow Logs - Compute & API - Lambda Functions - API Gateway - Custom Domain - Route 53 Failover - Health Checks - Security - Private DNS - WAFv2 Protection - Network ACLs - Security Groups - KMS Encryption - Resilience - Mission-Critical Policy - RTO/RPO Enforcement - Multi-Region Active/Active - Automated Failover - Chaos Engineering - Data Layer - Global Tables - Cross-Region Replication - Point-in-Time Recovery - Dead Letter Queues - CI/CD & Observability - Automated Deployment - Security Scanning - Alarms & Notifications - CloudWatch Monitoring - X-Ray Tracing + root((("Lambda in
Private VPC"))) + Infrastructure["🏢 Infrastructure"]:::infra + ["Multi-Region VPCs"] + ["Private Subnets"] + ["VPC Endpoints"] + ["DNS Firewall"] + ["Flow Logs"] + Security["🔒 Security"]:::security + ["Private DNS"] + ["WAF Protection"] + ["Network ACLs"] + ["IAM Least Privilege"] + ["KMS Encryption"] + Resilience["🛡️ Resilience"]:::resilience + ["Mission-Critical Policy"] + ["RTO/RPO Enforcement"] + ["Multi-Region Active/Active"] + ["Automatic Failover"] + ["Chaos Engineering Tests"] + Data["💾 Data Layer"]:::data + ["DynamoDB Global Tables"] + ["Cross-Region Replication"] + ["Point-in-Time Recovery"] + ["Backup/Restore Automation"] + ["Dead Letter Queues"] + Compute["⚙️ Compute & API"]:::compute + ["Lambda Functions"] + ["API Gateway"] + ["Custom Domain"] + ["Route 53 Failover"] + ["Health Checks"] + CI_CD["🔄 CI/CD & Observability"]:::cicd + ["Security Scanning"] + ["Automated Deployment"] + ["CloudWatch Monitoring"] + ["X-Ray Tracing"] + ["Alarm Notifications"] + +classDef infra fill:#388e3c,color:#ffffff,stroke:#1b5e20,stroke-width:2px +classDef security fill:#d32f2f,color:#ffffff,stroke:#b71c1c,stroke-width:2px +classDef resilience fill:#7b1fa2,color:#ffffff,stroke:#4a148c,stroke-width:2px +classDef data fill:#1976d2,color:#ffffff,stroke:#0d47a1,stroke-width:2px +classDef compute fill:#f57c00,color:#ffffff,stroke:#e65100,stroke-width:2px +classDef cicd fill:#5d4037,color:#ffffff,stroke:#3e2723,stroke-width:2px ``` -## 🏗️ Architecture +### Key Resilience Metrics + +- **99.99% Uptime** through multi-region active/active architecture +- **Near-zero RPO** with DynamoDB global tables and cross-region replication +- **Region-level RTO of 1 hour** enforced by AWS Resilience Hub policy +- **Comprehensive security controls** with private VPCs and WAF protection +- **Automated failover** through Route 53 health checks and weighted routing +- **Mission-critical compliance** with industry best practices and standards + +## 🏗️ Architecture Design A true active/active 
multi-region architecture with isolated private subnets, global data replication, and automated failover systems. ```mermaid flowchart TB subgraph "Multi-Region Active/Active Architecture" - subgraph "Ireland (eu-west-1)" - IR_VPC["VPC 10.1.0.0/16"] --> IR_SUBNETS["Private Subnets (3 AZs)"] - IR_SUBNETS --> IR_LAMBDA["Lambda Functions"] - IR_LAMBDA --> IR_DYNAMO["DynamoDB\nGlobal Table"] - IR_LAMBDA --> IR_API["API Gateway"] - IR_API --> IR_DOMAIN["Custom Domain"] - IR_VPC --> IR_FLOW["Flow Logs"] - IR_VPC --> IR_DNS["DNS Firewall"] - IR_SUBNETS --> IR_EP["VPC Endpoints"] + subgraph "Ireland (eu-west-1)":::ireland + IR_VPC["VPC 10.1.0.0/16"] + IR_SUBNETS["Private Subnets (3 AZs)"] + IR_LAMBDA["Lambda Functions"] + IR_DYNAMO["DynamoDB Global Table"] + IR_API["API Gateway"] + IR_DOMAIN["Custom Domain"] + IR_DNS["DNS Firewall"] + IR_EP["VPC Endpoints"] + + IR_VPC --> IR_SUBNETS + IR_SUBNETS --> IR_LAMBDA + IR_LAMBDA --> IR_DYNAMO + IR_LAMBDA --> IR_API + IR_API --> IR_DOMAIN + IR_VPC --> IR_DNS + IR_SUBNETS --> IR_EP end - subgraph "Frankfurt (eu-central-1)" - FR_VPC["VPC 10.5.0.0/16"] --> FR_SUBNETS["Private Subnets (3 AZs)"] - FR_SUBNETS --> FR_LAMBDA["Lambda Functions"] - FR_LAMBDA --> FR_DYNAMO["DynamoDB\nGlobal Table"] - FR_LAMBDA --> FR_API["API Gateway"] - FR_API --> FR_DOMAIN["Custom Domain"] - FR_VPC --> FR_FLOW["Flow Logs"] - FR_VPC --> FR_DNS["DNS Firewall"] - FR_SUBNETS --> FR_EP["VPC Endpoints"] + subgraph "Frankfurt (eu-central-1)":::frankfurt + FR_VPC["VPC 10.5.0.0/16"] + FR_SUBNETS["Private Subnets (3 AZs)"] + FR_LAMBDA["Lambda Functions"] + FR_DYNAMO["DynamoDB Global Table"] + FR_API["API Gateway"] + FR_DOMAIN["Custom Domain"] + FR_DNS["DNS Firewall"] + FR_EP["VPC Endpoints"] + + FR_VPC --> FR_SUBNETS + FR_SUBNETS --> FR_LAMBDA + FR_LAMBDA --> FR_DYNAMO + FR_LAMBDA --> FR_API + FR_API --> FR_DOMAIN + FR_VPC --> FR_DNS + FR_SUBNETS --> FR_EP end - IR_DOMAIN -.-> R53["Route 53\nWeighted/Failover"] + IR_DOMAIN -.-> R53["Route 53 
Weighted/Failover"]:::routing FR_DOMAIN -.-> R53 IR_DYNAMO <--> FR_DYNAMO - WAF["WAF v2"] --> IR_API + WAF["WAF v2"]:::security --> IR_API WAF --> FR_API - HC["Health Checks"] --> IR_API + HC["Health Checks"]:::monitoring --> IR_API HC --> FR_API HC -.-> R53 - REH["AWS Resilience Hub\nMission Critical Policy"] --> IR_LAMBDA + REH["AWS Resilience Hub
Mission Critical Policy"]:::resilience --> IR_LAMBDA REH --> FR_LAMBDA REH --> IR_DYNAMO REH --> FR_DYNAMO end - classDef ireland fill:#81c784,stroke:#2e7d32,stroke-width:2px,color:#000; - classDef frankfurt fill:#64b5f6,stroke:#1565c0,stroke-width:2px,color:#000; - classDef security fill:#ef5350,stroke:#c62828,stroke-width:2px,color:#fff; - classDef routing fill:#ffab40,stroke:#f57c00,stroke-width:2px,color:#000; - classDef resilience fill:#ba68c8,stroke:#7b1fa2,stroke-width:2px,color:#fff; - - class IR_VPC,IR_SUBNETS,IR_LAMBDA,IR_DYNAMO,IR_API,IR_DOMAIN,IR_FLOW,IR_DNS,IR_EP ireland; - class FR_VPC,FR_SUBNETS,FR_LAMBDA,FR_DYNAMO,FR_API,FR_DOMAIN,FR_FLOW,FR_DNS,FR_EP frankfurt; - class WAF,HC security; - class R53 routing; - class REH resilience; + classDef ireland fill:#4CAF50,stroke:#2E7D32,stroke-width:3px,color:#ffffff + classDef frankfurt fill:#2196F3,stroke:#1565C0,stroke-width:3px,color:#ffffff + classDef security fill:#F44336,stroke:#D32F2F,stroke-width:3px,color:#ffffff + classDef routing fill:#FF9800,stroke:#F57C00,stroke-width:3px,color:#ffffff + classDef resilience fill:#9C27B0,stroke:#7B1FA2,stroke-width:3px,color:#ffffff + classDef monitoring fill:#FFC107,stroke:#FFA000,stroke-width:3px,color:#000000 ``` -### Key Components +### Key Architecture Components -- **Isolated Private VPCs**: Dedicated VPCs in each region with no internet access -- **Multi-AZ Deployment**: 3 private subnets across availability zones for high availability -- **Private Network Controls**: Security groups, NACLs, and flow logs for comprehensive protection -- **Global Data Layer**: DynamoDB global tables with automatic multi-region replication -- **Intelligent Routing**: Route 53 health checks and weighted routing with automatic failover -- **API Gateway**: Regional endpoints with custom domain names and WAF protection +| Component | Implementation | Purpose | +|-----------|---------------|---------| +| **Private VPC Infrastructure** | Dedicated VPCs in each region (10.1.0.0/16 
& 10.5.0.0/16) | Network isolation and security | +| **Multi-AZ Deployment** | 3 subnets across availability zones per region | High availability within each region | +| **VPC Endpoints** | Interface & Gateway endpoints for S3, EC2, DynamoDB | Secure AWS service access without internet exposure | +| **DNS Firewall** | Allow *.amazonaws.com, block all others | Control outbound DNS traffic from VPC | +| **API Gateway** | Regional endpoints with custom domain names | Exposing Lambda functions securely | +| **Lambda Functions** | Node.js 20.x with VPC configuration | Serverless compute in private subnets | +| **Global Tables** | DynamoDB with multi-region replication | Consistent data across regions with near-zero RPO | +| **Route 53 Routing** | Weighted records with health check failover | Intelligent traffic distribution across regions | -## 🔐 Network & Security +## 🔐 Security & Network Controls ```mermaid graph TD - subgraph "Network Security Architecture" - VPC["VPC (10.1.0.0/16)"] + subgraph "Comprehensive Security Framework" + VPC["🏢 VPC Security"]:::vpc + NW["🔌 Network Controls"]:::network + IAM["🔑 Identity & Access"]:::iam + DATA["🔒 Data Protection"]:::data + APP["🛡️ Application Security"]:::app - subgraph "Private Subnet Security" - NACL["Network ACLs"] --> DENY["Deny RDP (3389)"] - NACL --> ALLOW_OUT["Allow HTTPS Outbound (443)"] - SGFVPC["VPC Endpoint SG"] - SGLMB["Lambda SG"] - - SGFVPC --> ALLOW_SG["Allow HTTPS from Lambda SG"] - SGLMB --> ALLOW_VPC["Allow HTTPS to VPC Endpoints"] - end - - VPC --> SUBNET["Private Subnets"] - SUBNET --> NACL - SUBNET --> SGFVPC - SUBNET --> SGLMB + VPC --> DNS_FW["DNS Firewall
Allow AWS domains only"] + VPC --> FLOW["Flow Logs
Network traffic auditing"] + VPC --> PDNS["Private DNS
Secure name resolution"] - VPC --> DNS_FW["DNS Firewall"] - DNS_FW --> ALLOW_AWS["Allow *.amazonaws.com"] - DNS_FW --> BLOCK_ALL["Block All Other Domains"] + NW --> NACL["Network ACLs
Stateless filtering"] + NW --> SG["Security Groups
Stateful filtering"] + NW --> DENY["Explicit denials
Block RDP (3389)"] - VPC --> FLOW["Flow Logs"] - FLOW --> CWLOGS["CloudWatch Logs"] + IAM --> ROLES["Fine-grained roles
Least privilege"] + IAM --> POLICY["Resource-based policies"] + IAM --> TEMP["Temporary credentials"] - KMS["KMS Encryption"] - KMS --> SNS_ENC["SNS Topic Encryption"] - KMS --> LOGS_ENC["Log Group Encryption"] + DATA --> KMS["KMS Encryption
Custom keys"] + DATA --> ENC_SNS["Encrypted SNS topics"] + DATA --> ENC_LOG["Encrypted log groups"] - WAF["WAFv2"] --> RULES["AWS Managed Rules"] - RULES --> IP_REP["IP Reputation List"] - RULES --> ANON_IP["Anonymous IP List"] - RULES --> COMMON["Common Rule Set"] - RULES --> BAD_IN["Known Bad Inputs"] - RULES --> LINUX["Linux Rule Set"] - RULES --> UNIX["Unix Rule Set"] + APP --> WAF_IP["WAF IP reputation list"] + APP --> WAF_ANON["WAF Anonymous IP protection"] + APP --> WAF_CRS["WAF Common Rule Set"] + APP --> WAF_BAD["WAF Known Bad Inputs"] + APP --> WAF_OS["WAF OS protection rules"] end - - classDef vpc fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px; - classDef nacl fill:#e1f5fe,stroke:#0277bd,stroke-width:2px; - classDef sg fill:#fff8e1,stroke:#ff8f00,stroke-width:2px; - classDef dns fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px; - classDef encryption fill:#ffebee,stroke:#c62828,stroke-width:2px; - classDef waf fill:#fce4ec,stroke:#c2185b,stroke-width:2px; - - class VPC,SUBNET vpc; - class NACL,DENY,ALLOW_OUT nacl; - class SGFVPC,SGLMB,ALLOW_SG,ALLOW_VPC sg; - class DNS_FW,ALLOW_AWS,BLOCK_ALL dns; - class KMS,SNS_ENC,LOGS_ENC encryption; - class WAF,RULES,IP_REP,ANON_IP,COMMON,BAD_IN,LINUX,UNIX waf; + + classDef vpc fill:#2E7D32,stroke:#1B5E20,stroke-width:2px,color:#FFFFFF + classDef network fill:#1565C0,stroke:#0D47A1,stroke-width:2px,color:#FFFFFF + classDef iam fill:#D32F2F,stroke:#B71C1C,stroke-width:2px,color:#FFFFFF + classDef data fill:#7B1FA2,stroke:#4A148C,stroke-width:2px,color:#FFFFFF + classDef app fill:#F57C00,stroke:#E65100,stroke-width:2px,color:#FFFFFF ``` -### Security Features +### Network Security Features + +| Security Control | Implementation | Details | +|------------------|----------------|---------| +| **Private VPC Design** | No internet gateways or NAT gateways | Complete isolation from public internet | +| **DNS Firewall Rules** | Two rules (Allow AWS, Block All) | Only permits *.amazonaws.com domains | +| **Custom Network ACLs** | 
Inbound/outbound rule sets | Blocks RDP (3389), limits outbound to HTTPS (443) | +| **Security Group Rules** | Precise traffic control | Lambda-to-endpoints only, no other traffic | +| **VPC Flow Logs** | Integration with CloudWatch | Network traffic visibility with encrypted storage | +| **WAF Protection** | Six managed rule groups | IP reputation, anonymous IP, common attacks, Linux/Unix protection | +| **KMS Encryption** | Custom key with automatic rotation | Encrypts SNS topics, CloudWatch logs | +| **IAM Least Privilege** | Scoped down permissions | Specific roles and permissions for each component | -- **Private VPC Design**: No internet gateways, isolated subnets -- **DNS Firewall**: Allows only AWS domains, blocks all other outbound DNS queries -- **Network ACLs**: Customized ingress/egress rules with RDP blocking -- **Security Groups**: Least-privilege access between Lambda and VPC endpoints -- **Comprehensive WAF Protection**: Six AWS managed rule groups for API security -- **VPC Flow Logs**: Network traffic visibility and auditing with encrypted logs -- **KMS Encryption**: Custom KMS keys for SNS topics and CloudWatch logs -- **IAM Least Privilege**: Detailed IAM roles and policies for all components +## ⚡ Resilience Framework -## 🧪 Resilience Testing +The AWS Resilience Hub integration enforces strict recovery time objectives (RTO) and recovery point objectives (RPO) through policy compliance and automated assessment. 
```mermaid -flowchart TB - subgraph "Fault Injection Experiments" - DR["Disaster Recovery\nTest Framework"] +graph TD + subgraph "Mission Critical Resilience Framework" + POLICY["Mission Critical Policy"]:::policy - DR --> API_FAIL["API Gateway\nLambda Access\nDenial"] - DR --> DDB_DEL["DynamoDB Table\nDeletion"] - DR --> PITR["Point-In-Time\nRecovery"] - DR --> BACKUP["Backup\nRestoration"] + subgraph "Failure Domains" + REGION["Regional Failure"]:::region + AZ["AZ Failure"]:::az + HW["Hardware Failure"]:::hardware + SW["Software Failure"]:::software + end - API_FAIL --> SSM_API["SSM Automation\nDeny Access"] - DDB_DEL --> SSM_DEL["SSM Automation\nDelete Table"] - PITR --> SSM_PITR["SSM Automation\nRestore PITR"] - BACKUP --> SSM_BACK["SSM Automation\nRestore Backup"] + POLICY --> REGION + POLICY --> AZ + POLICY --> HW + POLICY --> SW - SSM_API --> MONITOR["Health Check\nMonitoring"] - SSM_DEL --> MONITOR - SSM_PITR --> MONITOR - SSM_BACK --> MONITOR + REGION --> REG_RTO["RTO: 3600s (1h)"]:::rto + REGION --> REG_RPO["RPO: 5s"]:::rpo - MONITOR --> FAILOVER["Automatic\nRoute 53\nFailover"] - MONITOR --> RESTORE["Recovery\nAutomation"] + AZ --> AZ_RTO["RTO: 1s"]:::rto + AZ --> AZ_RPO["RPO: 1s"]:::rpo + + HW --> HW_RTO["RTO: 1s"]:::rto + HW --> HW_RPO["RPO: 1s"]:::rpo + + SW --> SW_RTO["RTO: 5400s (90m)"]:::rto + SW --> SW_RPO["RPO: 300s (5m)"]:::rpo end - - classDef framework fill:#e1bee7,stroke:#8e24aa,stroke-width:2px; - classDef experiment fill:#bbdefb,stroke:#1976d2,stroke-width:2px; - classDef automation fill:#c8e6c9,stroke:#388e3c,stroke-width:2px; - classDef monitoring fill:#ffecb3,stroke:#ffa000,stroke-width:2px; - classDef recovery fill:#ffcdd2,stroke:#d32f2f,stroke-width:2px; - class DR framework; - class API_FAIL,DDB_DEL,PITR,BACKUP experiment; - class SSM_API,SSM_DEL,SSM_PITR,SSM_BACK automation; - class MONITOR monitoring; - class FAILOVER,RESTORE recovery; + subgraph "Implementation Components" + REG_RTO --> MULTI_REG["Multi-region 
active/active"]:::impl + REG_RPO --> DDB_GLOB["DynamoDB global tables"]:::impl + + AZ_RTO & AZ_RPO --> MULTI_AZ["Multi-AZ deployment"]:::impl + + HW_RTO & HW_RPO --> AWS_INFRA["AWS infrastructure redundancy"]:::impl + + SW_RTO --> AUTO_RECOVER["Automated recovery procedures"]:::impl + SW_RPO --> BACKUP_STRAT["Comprehensive backup strategy"]:::impl + end + + classDef policy fill:#7B1FA2,stroke:#4A148C,stroke-width:3px,color:#FFFFFF + classDef region fill:#D32F2F,stroke:#B71C1C,stroke-width:2px,color:#FFFFFF + classDef az fill:#1565C0,stroke:#0D47A1,stroke-width:2px,color:#FFFFFF + classDef hardware fill:#2E7D32,stroke:#1B5E20,stroke-width:2px,color:#FFFFFF + classDef software fill:#F57C00,stroke:#E65100,stroke-width:2px,color:#FFFFFF + classDef rto fill:#FFC107,stroke:#FFA000,stroke-width:2px,color:#000000 + classDef rpo fill:#9C27B0,stroke:#7B1FA2,stroke-width:2px,color:#FFFFFF + classDef impl fill:#607D8B,stroke:#455A64,stroke-width:2px,color:#FFFFFF ``` -### Chaos Engineering Capabilities +### Recovery Time & Point Objectives -- **AWS Fault Injection Service**: Predefined experiments to simulate failures -- **Lambda Access Denial**: Tests API Gateway resilience during Lambda failures -- **DynamoDB Failure Scenarios**: Table deletion and recovery testing -- **Point-in-Time Recovery**: Automated restore procedures with defined RPOs -- **Backup Restoration**: Complete data recovery from backups -- **SSM Automation Documents**: Pre-defined recovery runbooks for all scenarios +| Failure Domain | RTO | RPO | Implementation Strategy | +|----------------|-----|-----|------------------------| +| **Regional** | 3600s (1 hour) | 5s | Multi-region active/active with Route 53 failover, Global Tables | +| **Availability Zone** | 1s | 1s | Multi-AZ deployment with automatic failover | +| **Hardware** | 1s | 1s | AWS managed infrastructure redundancy | +| **Software** | 5400s (90 min) | 300s (5 min) | Automated recovery procedures, backup/restore, chaos testing | -## ⏱️ Recovery 
Objectives +## 🧪 Chaos Engineering -This architecture achieves stringent recovery time objectives (RTO) and recovery point objectives (RPO) through AWS Resilience Hub policy enforcement. +The architecture includes comprehensive disaster recovery testing using AWS Fault Injection Service (FIS) to validate resilience capabilities. ```mermaid -stateDiagram-v2 - [*] --> MissionCritical - - state MissionCritical { - [*] --> Region - Region: "Regional Failure" - Region: RTO: 3600s (1h) - Region: RPO: 5s +flowchart TD + subgraph "Chaos Engineering Framework" + DR["Fault Injection Service
Experiments"]:::framework + + subgraph "API Resilience Tests" + API_FAIL["Lambda Access
Denial"]:::experiment + API_FAIL --> SSM_IAM["IAM Policy
Injection"]:::automation + SSM_IAM --> DENY_LAMBDA["Deny Lambda
Access"]:::action + end - Region --> AZ - AZ: "AZ Failure" - AZ: RTO: 1s - AZ: RPO: 1s + subgraph "Data Layer Tests" + DDB_DEL["DynamoDB
Table Deletion"]:::experiment + DDB_DEL --> SSM_DEL["Table Delete
Automation"]:::automation + + PITR["Point-In-Time
Recovery Test"]:::experiment + PITR --> SSM_PITR["PITR Restore
Automation"]:::automation + + BACKUP["Backup
Restoration Test"]:::experiment + BACKUP --> SSM_BACK["Backup Restore
Automation"]:::automation + end - AZ --> Hardware - Hardware: "Hardware Failure" - Hardware: RTO: 1s - Hardware: RPO: 1s + DR --> API_FAIL + DR --> DDB_DEL + DR --> PITR + DR --> BACKUP - Hardware --> Software - Software: "Software Failure" - Software: RTO: 5400s (90m) - Software: RPO: 300s (5m) + subgraph "Recovery Monitoring" + MONITOR["Health Check
Monitoring"]:::monitoring + FAILOVER["Route 53
Failover"]:::recovery + RESTORE["Recovery
Procedures"]:::recovery + end - Software --> [*] - } + SSM_IAM & SSM_DEL & SSM_PITR & SSM_BACK --> MONITOR + MONITOR --> FAILOVER + MONITOR --> RESTORE + end + + classDef framework fill:#7B1FA2,stroke:#4A148C,stroke-width:3px,color:#FFFFFF + classDef experiment fill:#1565C0,stroke:#0D47A1,stroke-width:2px,color:#FFFFFF + classDef automation fill:#2E7D32,stroke:#1B5E20,stroke-width:2px,color:#FFFFFF + classDef action fill:#F57C00,stroke:#E65100,stroke-width:2px,color:#FFFFFF + classDef monitoring fill:#FFC107,stroke:#FFA000,stroke-width:2px,color:#000000 + classDef recovery fill:#D32F2F,stroke:#B71C1C,stroke-width:2px,color:#FFFFFF ``` -### Recovery Metrics +### Chaos Test Scenarios -| Failure Scenario | Recovery Time Objective | Recovery Point Objective | Implementation | -|------------------|--------------------------|--------------------------|----------------| -| **Regional Failure** | 3600s (1 hour) | 5s | Multi-region active/active with Route 53 failover | -| **Availability Zone Failure** | 1s | 1s | Multi-AZ deployment in each region | -| **Hardware Failure** | 1s | 1s | AWS managed infrastructure redundancy | -| **Software Failure** | 5400s (90 min) | 300s (5 min) | Automated recovery procedures and global tables | +| Test Scenario | Implementation | Success Metrics | Recovery Method | +|---------------|----------------|-----------------|-----------------| +| **API Gateway Lambda Access Denial** | IAM deny policy injection via SSM | Health check recovery time < RTO | Automatic failover to other region | +| **DynamoDB Table Deletion** | Scheduled table deletion via SSM | Table recreation time < RTO | Automated restore from backup or PITR | +| **Point-In-Time Recovery** | SSM automation document execution | Data recovery with RPO validation | Restoration to specified timestamp | +| **Backup Restoration** | SSM automation with backup ARN | Backup validation and integrity check | Full table recovery from backup | +| **Route 53 Health Check Validation** | Health 
check failure trigger | Weighted routing adjustment < RTO | Automatic traffic redistribution | -## 🔄 CI/CD Pipeline +## 🔄 CI/CD Automation ```mermaid -flowchart TD - GH_PUSH["GitHub Push/Workflow Dispatch"] --> SEC_SCAN{"Security Scanning"} +flowchart LR + GH_PUSH["GitHub Push/
Workflow Dispatch"]:::trigger --> SEC_SCAN{"Security
Scanning"}:::security - SEC_SCAN --> CFN_LINT["cfn-lint"] - SEC_SCAN --> CFN_NAG["cfn-nag"] - SEC_SCAN --> CHECKOV["Checkov"] - SEC_SCAN --> SCORECARD["Scorecard"] - SEC_SCAN --> ZAP["ZAP API Scan"] + SEC_SCAN --> CFN_LINT["cfn-lint"]:::scan + SEC_SCAN --> CFN_NAG["cfn-nag"]:::scan + SEC_SCAN --> CHECKOV["Checkov"]:::scan + SEC_SCAN --> SCORECARD["Scorecard"]:::scan + SEC_SCAN --> ZAP["ZAP API
Scan"]:::scan - CFN_LINT & CFN_NAG & CHECKOV & SCORECARD & ZAP --> CONFIG_IR["Configure AWS (eu-west-1)"] + CFN_LINT & CFN_NAG & CHECKOV & SCORECARD & ZAP --> CONFIG_IR["Configure AWS
(eu-west-1)"]:::deploy - CONFIG_IR --> DEPLOY_IR["Deploy Core → Ireland"] - DEPLOY_IR --> OUTPUTS["Collect Outputs"] - OUTPUTS --> CONFIG_FR["Configure AWS (eu-central-1)"] - CONFIG_FR --> DEPLOY_FR["Deploy Core → Frankfurt"] + CONFIG_IR --> DEPLOY_IR["Deploy Core
Ireland"]:::deploy + DEPLOY_IR --> OUTPUTS["Collect
Outputs"]:::deploy + OUTPUTS --> CONFIG_FR["Configure AWS
(eu-central-1)"]:::deploy + CONFIG_FR --> DEPLOY_FR["Deploy Core
Frankfurt"]:::deploy - DEPLOY_FR --> DEPLOY_AUX["Deploy Auxiliary Stacks"] - DEPLOY_AUX --> DEPLOY_R53["Route 53 Configuration"] - DEPLOY_AUX --> DEPLOY_WAF["WAF Configuration"] - DEPLOY_AUX --> DEPLOY_RHB["Resilience Hub App"] - DEPLOY_AUX --> DEPLOY_DR["Disaster Recovery Tests"] + DEPLOY_FR --> DEPLOY_AUX["Deploy
Auxiliary Stacks"]:::aux - DEPLOY_R53 & DEPLOY_WAF & DEPLOY_RHB & DEPLOY_DR --> TAG["Tag & Release"] + DEPLOY_AUX --> DEPLOY_R53["Route 53
Configuration"]:::aux + DEPLOY_AUX --> DEPLOY_WAF["WAF
Configuration"]:::aux + DEPLOY_AUX --> DEPLOY_RHB["Resilience Hub
App"]:::aux + DEPLOY_AUX --> DEPLOY_DR["Disaster
Recovery Tests"]:::aux - classDef github fill:#f8cecc,stroke:#b85450,stroke-width:2px; - classDef security fill:#d5e8d4,stroke:#82b366,stroke-width:2px; - classDef deploy fill:#dae8fc,stroke:#6c8ebf,stroke-width:2px; - classDef auxiliary fill:#fff2cc,stroke:#d6b656,stroke-width:2px; - classDef release fill:#e1d5e7,stroke:#9673a6,stroke-width:2px; + DEPLOY_R53 & DEPLOY_WAF & DEPLOY_RHB & DEPLOY_DR --> TAG["Tag &
Release"]:::release - class GH_PUSH github; - class SEC_SCAN,CFN_LINT,CFN_NAG,CHECKOV,SCORECARD,ZAP security; - class CONFIG_IR,DEPLOY_IR,OUTPUTS,CONFIG_FR,DEPLOY_FR deploy; - class DEPLOY_AUX,DEPLOY_R53,DEPLOY_WAF,DEPLOY_RHB,DEPLOY_DR auxiliary; - class TAG release; + classDef trigger fill:#D32F2F,stroke:#B71C1C,stroke-width:3px,color:#FFFFFF + classDef security fill:#7B1FA2,stroke:#4A148C,stroke-width:2px,color:#FFFFFF + classDef scan fill:#2E7D32,stroke:#1B5E20,stroke-width:2px,color:#FFFFFF + classDef deploy fill:#1565C0,stroke:#0D47A1,stroke-width:2px,color:#FFFFFF + classDef aux fill:#F57C00,stroke:#E65100,stroke-width:2px,color:#FFFFFF + classDef release fill:#9C27B0,stroke:#7B1FA2,stroke-width:2px,color:#FFFFFF ``` -### Automated Workflow +### CI/CD Pipeline Features -1. **Security Scanning**: Multiple tools scan CloudFormation templates for issues -2. **Sequential Deployment**: Ireland deployment, followed by Frankfurt -3. **Cross-Region Integration**: Output collection and sharing between regions -4. **Auxiliary Resources**: Route 53, WAF, Resilience Hub, and DR test configuration -5. **Automated Release**: Version tagging and release notes generation +- **Pre-Commit Security Validation**: Multiple scanning tools analyze infrastructure templates +- **Sequential Multi-Region Deployment**: Ireland (primary) followed by Frankfurt (secondary) +- **Cross-Region Resource Integration**: Output collection and sharing between deployments +- **Auxiliary Resource Configuration**: Route 53, WAF, Resilience Hub, and Disaster Recovery +- **Automated Version Management**: Git tagging and release notes generation +- **Rollback Capability**: Automatic reversal on deployment failures -## 📦 Infrastructure as Code +## 🔧 Infrastructure as Code -This project is entirely defined as CloudFormation templates with comprehensive resource definitions. +This project is entirely defined using CloudFormation templates with comprehensive resource definitions for each component. 
### Template Structure -| Template | Purpose | Key Components | -|----------|---------|----------------| -| **template.yml** | Core Infrastructure | VPCs, Subnets, Lambda, API Gateway, DynamoDB, DNS Firewall, Security Groups | -| **route53.yml** | DNS Configuration | Weighted A/AAAA records, Health check integration, Failover configuration | -| **app.yml** | Resilience Hub | Mission Critical policy definition, RTO/RPO targets, Multi-resource mapping | -| **disaster-recovery.yml** | DR Testing | FIS experiments, SSM automation documents, Recovery procedures | -| **waf.yml** | Security Rules | WAF WebACL with AWS managed rules for API protection | +| Template | Description | Key Resources | +|----------|-------------|---------------| +| **template.yml** | Core Infrastructure | VPCs, Subnets, Lambda Functions, API Gateway, DynamoDB, DNS Firewall, Security Groups, Network ACLs, Flow Logs, KMS Keys | +| **route53.yml** | DNS Configuration | Weighted A/AAAA Records, Health Check Integration, Failover Configuration, Domain Name Integration | +| **app.yml** | Resilience Hub | Mission Critical Policy Definition, RTO/RPO Targets, Multi-Resource Mapping, Assessment Schedule | +| **disaster-recovery.yml** | DR Testing | FIS Experiments, SSM Automation Documents, IAM Roles & Policies, Recovery Procedures, Health Checks | +| **waf.yml** | Security Rules | WAF WebACL, AWS Managed Rule Groups, API Gateway Association | -### Key Resource Highlights +### Notable Infrastructure Features -- **DNS Firewall**: Allows only AWS domains, blocks all others -- **Private DNS Configuration**: Secure VPC DNS settings -- **Network ACLs**: Custom ingress/egress rules -- **Health Checks**: Route 53 health checks for API endpoints -- **WAF Protection**: Six AWS managed rule groups -- **Global Tables**: Cross-region data replication -- **IAM Roles**: Least privilege principle implementation +- **DNS Firewall Integration**: Fully configured Route 53 DNS Firewall allowing only AWS domains +- **Private 
DNS Configuration**: Secure VPC DNS settings with customized resolution +- **Comprehensive Network Controls**: Custom ACLs and security groups with explicit deny rules +- **Health Check System**: Multiple Route 53 health checks for various service components +- **Advanced WAF Protection**: Six AWS managed rule groups including IP reputation and known attacks +- **Global DynamoDB Tables**: Cross-region replication with point-in-time recovery +- **Principle of Least Privilege**: Narrowly scoped IAM roles and permissions for all resources ## 📚 Documentation -### Runbooks +### Comprehensive Runbooks + +- **DynamoDB Recovery Runbook**: Automated Systems Manager procedures for: + - Point-in-Time Recovery + - Backup Restoration + - Table Recreation + - Cross-Region Synchronization + +- **Lambda Function Recovery Runbook**: Procedures covering: + - Version Management + - Provisioned Concurrency Adjustment + - Memory/Execution Time Optimization + - Error Handling and Retry Logic + +- **API Gateway Recovery Runbook**: Workflow documentation for: + - Endpoint Restoration + - Custom Domain Reconfiguration + - WAF Integration Recovery + - Route 53 Health Check Adjustments -- **DynamoDB Recovery**: Automated systems manager runbook -- **Lambda Function Recovery**: Automated systems manager runbook -- **API Gateway Recovery**: Step-by-step recovery procedures -- **IAM Automation**: Identity and access management workflows +- **IAM Automation Runbook**: Procedures for: + - Role and Policy Recovery + - Permission Boundary Enforcement + - Trust Relationship Verification + - Cross-Account Access Management -### References +### Recommended Reference Documentation - [AWS Resilience Hub Documentation](https://docs.aws.amazon.com/resilience-hub/latest/userguide/) -- [Disaster Recovery on AWS - Part I: Strategies for Recovery in the Cloud](https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-i-strategies-for-recovery-in-the-cloud/) -- [Disaster Recovery 
on AWS - Part IV: Multi-site Active/Active](https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/)
-- [AWS Service Level Agreements](https://aws.amazon.com/legal/service-level-agreements/)
+- [Disaster Recovery on AWS - Multi-site Active/Active](https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/)
+- [AWS Well-Architected Framework - Reliability Pillar](https://docs.aws.amazon.com/wellarchitected/latest/reliability-pillar/welcome.html)
+- [AWS Best Practices for DDoS Resiliency](https://d1.awsstatic.com/whitepapers/Security/DDoS_White_Paper.pdf)
+- [Route 53 Application Recovery Controller](https://aws.amazon.com/route53/application-recovery-controller/)
 ## 📄 License
From b8b84a6dc6c68d80ae66fdb9e8f551ab483a9a2d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?James=20Pether=20S=C3=B6rling?=
Date: Thu, 17 Apr 2025 01:01:17 +0200
Subject: [PATCH 10/10] Update README.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Signed-off-by: James Pether Sörling
---
 README.md | 180 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 106 insertions(+), 74 deletions(-)
diff --git a/README.md b/README.md
index 77581bb1..30bae72f 100644
--- a/README.md
+++ b/README.md
@@ -25,50 +25,43 @@ This project implements a highly resilient serverless architecture with AWS Lamb
 ```mermaid
 mindmap
- root((("Lambda in
Private VPC"))) - Infrastructure["🏢 Infrastructure"]:::infra + root((Lambda in Private VPC)) + Infrastructure["🏢 Infrastructure"] ["Multi-Region VPCs"] ["Private Subnets"] ["VPC Endpoints"] ["DNS Firewall"] ["Flow Logs"] - Security["🔒 Security"]:::security + Security["🔒 Security"] ["Private DNS"] ["WAF Protection"] ["Network ACLs"] ["IAM Least Privilege"] ["KMS Encryption"] - Resilience["🛡️ Resilience"]:::resilience + Resilience["🛡️ Resilience"] ["Mission-Critical Policy"] ["RTO/RPO Enforcement"] ["Multi-Region Active/Active"] ["Automatic Failover"] ["Chaos Engineering Tests"] - Data["💾 Data Layer"]:::data + Data["💾 Data Layer"] ["DynamoDB Global Tables"] ["Cross-Region Replication"] ["Point-in-Time Recovery"] ["Backup/Restore Automation"] ["Dead Letter Queues"] - Compute["⚙️ Compute & API"]:::compute + Compute["⚙️ Compute & API"] ["Lambda Functions"] ["API Gateway"] ["Custom Domain"] ["Route 53 Failover"] ["Health Checks"] - CI_CD["🔄 CI/CD & Observability"]:::cicd + CI_CD["🔄 CI/CD & Observability"] ["Security Scanning"] ["Automated Deployment"] ["CloudWatch Monitoring"] ["X-Ray Tracing"] ["Alarm Notifications"] - -classDef infra fill:#388e3c,color:#ffffff,stroke:#1b5e20,stroke-width:2px -classDef security fill:#d32f2f,color:#ffffff,stroke:#b71c1c,stroke-width:2px -classDef resilience fill:#7b1fa2,color:#ffffff,stroke:#4a148c,stroke-width:2px -classDef data fill:#1976d2,color:#ffffff,stroke:#0d47a1,stroke-width:2px -classDef compute fill:#f57c00,color:#ffffff,stroke:#e65100,stroke-width:2px -classDef cicd fill:#5d4037,color:#ffffff,stroke:#3e2723,stroke-width:2px ``` ### Key Resilience Metrics @@ -87,7 +80,7 @@ A true active/active multi-region architecture with isolated private subnets, gl ```mermaid flowchart TB subgraph "Multi-Region Active/Active Architecture" - subgraph "Ireland (eu-west-1)":::ireland + subgraph "Ireland (eu-west-1)" IR_VPC["VPC 10.1.0.0/16"] IR_SUBNETS["Private Subnets (3 AZs)"] IR_LAMBDA["Lambda Functions"] @@ -106,7 +99,7 @@ flowchart TB 
 IR_SUBNETS --> IR_EP
 end
- subgraph "Frankfurt (eu-central-1)":::frankfurt
+ subgraph "Frankfurt (eu-central-1)"
 FR_VPC["VPC 10.5.0.0/16"]
 FR_SUBNETS["Private Subnets (3 AZs)"]
 FR_LAMBDA["Lambda Functions"]
@@ -125,18 +118,18 @@ flowchart TB
 FR_SUBNETS --> FR_EP
 end
- IR_DOMAIN -.-> R53["Route 53 Weighted/Failover"]:::routing
+ IR_DOMAIN -.-> R53["Route 53 Weighted/Failover"]
 FR_DOMAIN -.-> R53
 IR_DYNAMO <--> FR_DYNAMO
- WAF["WAF v2"]:::security --> IR_API
+ WAF["WAF v2"] --> IR_API
 WAF --> FR_API
- HC["Health Checks"]:::monitoring --> IR_API
+ HC["Health Checks"] --> IR_API
 HC --> FR_API
 HC -.-> R53
- REH["AWS Resilience Hub
Mission Critical Policy"]:::resilience --> IR_LAMBDA + REH["AWS Resilience Hub
Mission Critical Policy"] --> IR_LAMBDA REH --> FR_LAMBDA REH --> IR_DYNAMO REH --> FR_DYNAMO @@ -148,6 +141,13 @@ flowchart TB classDef routing fill:#FF9800,stroke:#F57C00,stroke-width:3px,color:#ffffff classDef resilience fill:#9C27B0,stroke:#7B1FA2,stroke-width:3px,color:#ffffff classDef monitoring fill:#FFC107,stroke:#FFA000,stroke-width:3px,color:#000000 + + class IR_VPC,IR_SUBNETS,IR_LAMBDA,IR_DYNAMO,IR_API,IR_DOMAIN,IR_DNS,IR_EP ireland + class FR_VPC,FR_SUBNETS,FR_LAMBDA,FR_DYNAMO,FR_API,FR_DOMAIN,FR_DNS,FR_EP frankfurt + class WAF security + class R53 routing + class REH resilience + class HC monitoring ``` ### Key Architecture Components @@ -168,11 +168,11 @@ flowchart TB ```mermaid graph TD subgraph "Comprehensive Security Framework" - VPC["🏢 VPC Security"]:::vpc - NW["🔌 Network Controls"]:::network - IAM["🔑 Identity & Access"]:::iam - DATA["🔒 Data Protection"]:::data - APP["🛡️ Application Security"]:::app + VPC["🏢 VPC Security"] + NW["🔌 Network Controls"] + IAM["🔑 Identity & Access"] + DATA["🔒 Data Protection"] + APP["🛡️ Application Security"] VPC --> DNS_FW["DNS Firewall
Allow AWS domains only"] VPC --> FLOW["Flow Logs
Network traffic auditing"] @@ -202,6 +202,12 @@ graph TD classDef iam fill:#D32F2F,stroke:#B71C1C,stroke-width:2px,color:#FFFFFF classDef data fill:#7B1FA2,stroke:#4A148C,stroke-width:2px,color:#FFFFFF classDef app fill:#F57C00,stroke:#E65100,stroke-width:2px,color:#FFFFFF + + class VPC,DNS_FW,FLOW,PDNS vpc + class NW,NACL,SG,DENY network + class IAM,ROLES,POLICY,TEMP iam + class DATA,KMS,ENC_SNS,ENC_LOG data + class APP,WAF_IP,WAF_ANON,WAF_CRS,WAF_BAD,WAF_OS app ``` ### Network Security Features @@ -224,13 +230,13 @@ The AWS Resilience Hub integration enforces strict recovery time objectives (RTO ```mermaid graph TD subgraph "Mission Critical Resilience Framework" - POLICY["Mission Critical Policy"]:::policy + POLICY["Mission Critical Policy"] subgraph "Failure Domains" - REGION["Regional Failure"]:::region - AZ["AZ Failure"]:::az - HW["Hardware Failure"]:::hardware - SW["Software Failure"]:::software + REGION["Regional Failure"] + AZ["AZ Failure"] + HW["Hardware Failure"] + SW["Software Failure"] end POLICY --> REGION @@ -238,29 +244,29 @@ graph TD POLICY --> HW POLICY --> SW - REGION --> REG_RTO["RTO: 3600s (1h)"]:::rto - REGION --> REG_RPO["RPO: 5s"]:::rpo + REGION --> REG_RTO["RTO: 3600s (1h)"] + REGION --> REG_RPO["RPO: 5s"] - AZ --> AZ_RTO["RTO: 1s"]:::rto - AZ --> AZ_RPO["RPO: 1s"]:::rpo + AZ --> AZ_RTO["RTO: 1s"] + AZ --> AZ_RPO["RPO: 1s"] - HW --> HW_RTO["RTO: 1s"]:::rto - HW --> HW_RPO["RPO: 1s"]:::rpo + HW --> HW_RTO["RTO: 1s"] + HW --> HW_RPO["RPO: 1s"] - SW --> SW_RTO["RTO: 5400s (90m)"]:::rto - SW --> SW_RPO["RPO: 300s (5m)"]:::rpo + SW --> SW_RTO["RTO: 5400s (90m)"] + SW --> SW_RPO["RPO: 300s (5m)"] end subgraph "Implementation Components" - REG_RTO --> MULTI_REG["Multi-region active/active"]:::impl - REG_RPO --> DDB_GLOB["DynamoDB global tables"]:::impl + REG_RTO --> MULTI_REG["Multi-region active/active"] + REG_RPO --> DDB_GLOB["DynamoDB global tables"] - AZ_RTO & AZ_RPO --> MULTI_AZ["Multi-AZ deployment"]:::impl + AZ_RTO & AZ_RPO --> 
MULTI_AZ["Multi-AZ deployment"] - HW_RTO & HW_RPO --> AWS_INFRA["AWS infrastructure redundancy"]:::impl + HW_RTO & HW_RPO --> AWS_INFRA["AWS infrastructure redundancy"] - SW_RTO --> AUTO_RECOVER["Automated recovery procedures"]:::impl - SW_RPO --> BACKUP_STRAT["Comprehensive backup strategy"]:::impl + SW_RTO --> AUTO_RECOVER["Automated recovery procedures"] + SW_RPO --> BACKUP_STRAT["Comprehensive backup strategy"] end classDef policy fill:#7B1FA2,stroke:#4A148C,stroke-width:3px,color:#FFFFFF @@ -271,6 +277,15 @@ graph TD classDef rto fill:#FFC107,stroke:#FFA000,stroke-width:2px,color:#000000 classDef rpo fill:#9C27B0,stroke:#7B1FA2,stroke-width:2px,color:#FFFFFF classDef impl fill:#607D8B,stroke:#455A64,stroke-width:2px,color:#FFFFFF + + class POLICY policy + class REGION region + class AZ az + class HW hardware + class SW software + class REG_RTO,AZ_RTO,HW_RTO,SW_RTO rto + class REG_RPO,AZ_RPO,HW_RPO,SW_RPO rpo + class MULTI_REG,DDB_GLOB,MULTI_AZ,AWS_INFRA,AUTO_RECOVER,BACKUP_STRAT impl ``` ### Recovery Time & Point Objectives @@ -289,23 +304,23 @@ The architecture includes comprehensive disaster recovery testing using AWS Faul ```mermaid flowchart TD subgraph "Chaos Engineering Framework" - DR["Fault Injection Service
Experiments"]:::framework + DR["Fault Injection Service
Experiments"] subgraph "API Resilience Tests" - API_FAIL["Lambda Access
Denial"]:::experiment - API_FAIL --> SSM_IAM["IAM Policy
Injection"]:::automation - SSM_IAM --> DENY_LAMBDA["Deny Lambda
Access"]:::action + API_FAIL["Lambda Access
Denial"] + API_FAIL --> SSM_IAM["IAM Policy
Injection"] + SSM_IAM --> DENY_LAMBDA["Deny Lambda
Access"] end subgraph "Data Layer Tests" - DDB_DEL["DynamoDB
Table Deletion"]:::experiment - DDB_DEL --> SSM_DEL["Table Delete
Automation"]:::automation + DDB_DEL["DynamoDB
Table Deletion"] + DDB_DEL --> SSM_DEL["Table Delete
Automation"] - PITR["Point-In-Time
Recovery Test"]:::experiment - PITR --> SSM_PITR["PITR Restore
Automation"]:::automation + PITR["Point-In-Time
Recovery Test"] + PITR --> SSM_PITR["PITR Restore
Automation"] - BACKUP["Backup
Restoration Test"]:::experiment - BACKUP --> SSM_BACK["Backup Restore
Automation"]:::automation + BACKUP["Backup
Restoration Test"] + BACKUP --> SSM_BACK["Backup Restore
Automation"] end DR --> API_FAIL @@ -314,9 +329,9 @@ flowchart TD DR --> BACKUP subgraph "Recovery Monitoring" - MONITOR["Health Check
Monitoring"]:::monitoring - FAILOVER["Route 53
Failover"]:::recovery - RESTORE["Recovery
Procedures"]:::recovery + MONITOR["Health Check
Monitoring"] + FAILOVER["Route 53
Failover"] + RESTORE["Recovery
Procedures"] end SSM_IAM & SSM_DEL & SSM_PITR & SSM_BACK --> MONITOR @@ -330,6 +345,13 @@ flowchart TD classDef action fill:#F57C00,stroke:#E65100,stroke-width:2px,color:#FFFFFF classDef monitoring fill:#FFC107,stroke:#FFA000,stroke-width:2px,color:#000000 classDef recovery fill:#D32F2F,stroke:#B71C1C,stroke-width:2px,color:#FFFFFF + + class DR framework + class API_FAIL,DDB_DEL,PITR,BACKUP experiment + class SSM_IAM,SSM_DEL,SSM_PITR,SSM_BACK automation + class DENY_LAMBDA action + class MONITOR monitoring + class FAILOVER,RESTORE recovery ``` ### Chaos Test Scenarios @@ -346,29 +368,29 @@ flowchart TD ```mermaid flowchart LR - GH_PUSH["GitHub Push/
Workflow Dispatch"]:::trigger --> SEC_SCAN{"Security
Scanning"}:::security + GH_PUSH["GitHub Push/
Workflow Dispatch"] --> SEC_SCAN{"Security
Scanning"} - SEC_SCAN --> CFN_LINT["cfn-lint"]:::scan - SEC_SCAN --> CFN_NAG["cfn-nag"]:::scan - SEC_SCAN --> CHECKOV["Checkov"]:::scan - SEC_SCAN --> SCORECARD["Scorecard"]:::scan - SEC_SCAN --> ZAP["ZAP API
Scan"]:::scan + SEC_SCAN --> CFN_LINT["cfn-lint"] + SEC_SCAN --> CFN_NAG["cfn-nag"] + SEC_SCAN --> CHECKOV["Checkov"] + SEC_SCAN --> SCORECARD["Scorecard"] + SEC_SCAN --> ZAP["ZAP API
Scan"] - CFN_LINT & CFN_NAG & CHECKOV & SCORECARD & ZAP --> CONFIG_IR["Configure AWS
(eu-west-1)"]:::deploy + CFN_LINT & CFN_NAG & CHECKOV & SCORECARD & ZAP --> CONFIG_IR["Configure AWS
(eu-west-1)"] - CONFIG_IR --> DEPLOY_IR["Deploy Core
Ireland"]:::deploy - DEPLOY_IR --> OUTPUTS["Collect
Outputs"]:::deploy - OUTPUTS --> CONFIG_FR["Configure AWS
(eu-central-1)"]:::deploy - CONFIG_FR --> DEPLOY_FR["Deploy Core
Frankfurt"]:::deploy + CONFIG_IR --> DEPLOY_IR["Deploy Core
Ireland"] + DEPLOY_IR --> OUTPUTS["Collect
Outputs"] + OUTPUTS --> CONFIG_FR["Configure AWS
(eu-central-1)"] + CONFIG_FR --> DEPLOY_FR["Deploy Core
Frankfurt"] - DEPLOY_FR --> DEPLOY_AUX["Deploy
Auxiliary Stacks"]:::aux + DEPLOY_FR --> DEPLOY_AUX["Deploy
Auxiliary Stacks"] - DEPLOY_AUX --> DEPLOY_R53["Route 53
Configuration"]:::aux - DEPLOY_AUX --> DEPLOY_WAF["WAF
Configuration"]:::aux - DEPLOY_AUX --> DEPLOY_RHB["Resilience Hub
App"]:::aux - DEPLOY_AUX --> DEPLOY_DR["Disaster
Recovery Tests"]:::aux + DEPLOY_AUX --> DEPLOY_R53["Route 53
Configuration"] + DEPLOY_AUX --> DEPLOY_WAF["WAF
Configuration"] + DEPLOY_AUX --> DEPLOY_RHB["Resilience Hub
App"] + DEPLOY_AUX --> DEPLOY_DR["Disaster
Recovery Tests"] - DEPLOY_R53 & DEPLOY_WAF & DEPLOY_RHB & DEPLOY_DR --> TAG["Tag &
Release"]:::release + DEPLOY_R53 & DEPLOY_WAF & DEPLOY_RHB & DEPLOY_DR --> TAG["Tag &
Release"] classDef trigger fill:#D32F2F,stroke:#B71C1C,stroke-width:3px,color:#FFFFFF classDef security fill:#7B1FA2,stroke:#4A148C,stroke-width:2px,color:#FFFFFF @@ -376,6 +398,13 @@ flowchart LR classDef deploy fill:#1565C0,stroke:#0D47A1,stroke-width:2px,color:#FFFFFF classDef aux fill:#F57C00,stroke:#E65100,stroke-width:2px,color:#FFFFFF classDef release fill:#9C27B0,stroke:#7B1FA2,stroke-width:2px,color:#FFFFFF + + class GH_PUSH trigger + class SEC_SCAN security + class CFN_LINT,CFN_NAG,CHECKOV,SCORECARD,ZAP scan + class CONFIG_IR,DEPLOY_IR,OUTPUTS,CONFIG_FR,DEPLOY_FR deploy + class DEPLOY_AUX,DEPLOY_R53,DEPLOY_WAF,DEPLOY_RHB,DEPLOY_DR aux + class TAG release ``` ### CI/CD Pipeline Features @@ -450,3 +479,6 @@ This project is entirely defined using CloudFormation templates with comprehensi ## 📄 License This project is licensed under the Apache License 2.0 - see [LICENSE.md](LICENSE.md) for details. + +--- +*Last updated: 2025-04-16*