4 changes: 3 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
@@ -15,6 +15,7 @@ This repository contains deployable code samples demonstrating how generative AI
| AI Lambda Runtime Migration Assistant | Operations Automation | Discover, assess, and transform Lambda functions running deprecated runtimes using Amazon Bedrock AgentCore and Nova 2 Lite with a React dashboard | [operations-automation/ai-lambda-runtime-migration/](operations-automation/ai-lambda-runtime-migration/README.md) |
| Natural Language Chaos Engineering with AWS FIS | Resilience | Transform natural language descriptions into validated AWS FIS experiment templates with current capabilities and intelligent caching | [resilience/ai-chaos-engineering-with-fis/](resilience/ai-chaos-engineering-with-fis/README.md) |
| Intelligent EKS Incident Investigation with Amazon DevOps Agent | Observability | Automatically detect, investigate, and diagnose EKS infrastructure incidents using Amazon DevOps Agent — reducing mean time to resolution from hours to minutes | [observability/eks-investigation-devops-agent/](observability/eks-investigation-devops-agent/README.md) |
| Intelligent AWS Site-to-Site VPN Tunnel Investigation with Amazon DevOps Agent | Observability | Automatically detect, investigate, and diagnose Site-to-Site VPN tunnel failures with BGP routing using Amazon DevOps Agent — reducing mean time to resolution from hours to minutes | [observability/aws-site-to-site-vpn-tunnel-investigation-devops-agent/](observability/aws-site-to-site-vpn-tunnel-investigation-devops-agent/README.md) |

## Roadmap (Coming Soon)

@@ -36,7 +37,8 @@ operations-automation/
├── anycompany-it-demo-portal/
└── aws-services-lifecycle-tracker/
observability/
└── eks-investigation-devops-agent/
├── eks-investigation-devops-agent/
└── aws-site-to-site-vpn-tunnel-investigation-devops-agent/
resilience/
└── ai-chaos-engineering-with-fis/
shared/
@@ -0,0 +1,14 @@
# Internal project tracking files (contain account IDs and secrets)
PROJECT-STATUS.md
TESTING-PROGRESS.md

# Python
__pycache__/
*.pyc

# CDK
cdk.out*
.venv/

# Keys
*.pem
@@ -0,0 +1,75 @@
# Architecture

![Architecture](architecture.drawio.png)

## Components

### Network Layer

| Component | Details |
|-----------|---------|
| **Cloud VPC** (10.0.0.0/16) | Simulates the AWS-side of a hybrid network. Contains a t3.micro instance as a ping target for throughput monitoring. |
| **On-Prem VPC** (172.16.0.0/16) | Simulates a customer data center. Contains the Customer Gateway instance running Libreswan (IPsec) and GoBGP (BGP) on Amazon Linux 2023. |
| **Site-to-Site VPN** | Two IPsec tunnels (169.254.10.0/30 and 169.254.10.4/30) connecting the VPCs via a Virtual Private Gateway. Supports both BGP (default) and static routing. |
| **Customer Gateway Instance** | t3.micro running AL2023 with Libreswan for IPsec and GoBGP for BGP. `SourceDestCheck` disabled. Inject/rollback scripts installed at `/opt/vpn-demo/`. |

### Monitoring Layer

| Component | Details |
|-----------|---------|
| **Per-tunnel alarms** (×2) | CloudWatch alarms on `TunnelState` metric using `TunnelIpAddress` dimension — a single tunnel failure triggers investigation even if the other tunnel stays up. |
| **Throughput alarm** | Metric math: `(TunnelDataIn + TunnelDataOut) × 8 / 300 < 100 bps`. Detects performance degradation while tunnels remain technically UP. Deployed with actions disabled. |
| **Route-withdrawn alarm** | CloudWatch Logs metric filter on `"WITHDRAWN"` in VPN tunnel logs → custom metric → alarm. Detects BGP route changes. Deployed with actions disabled. |
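| **SNS Topic** | `vpn-demo-tunnel-alarm`. All 4 alarms target this topic; because the throughput and route-withdrawn alarms are deployed with actions disabled, only the two per-tunnel alarms send notifications by default. |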
| **Webhook Lambda** | Receives SNS notifications, constructs an HMAC-signed payload, and sends it to the DevOps Agent webhook endpoint. Only created when webhook URL is provided. |
| **VPN Tunnel Logs** | CloudWatch Logs group `/vpn-demo/tunnel-logs` — receives IKE and BGP logs from both tunnels in JSON format. |
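
The throughput alarm's metric math can be checked by hand. The sketch below reproduces the arithmetic in plain Python, assuming `TunnelDataIn` and `TunnelDataOut` are byte counts summed over a 300-second period (the period is inferred from the `/ 300` in the expression). The function names are illustrative and not part of the deployed code.

```python
PERIOD_SECONDS = 300   # assumed alarm period, inferred from "/ 300"
THRESHOLD_BPS = 100    # threshold stated in the alarm description


def throughput_bps(data_in_bytes: float, data_out_bytes: float,
                   period_seconds: int = PERIOD_SECONDS) -> float:
    """Convert bytes moved during one period into average bits per second."""
    return (data_in_bytes + data_out_bytes) * 8 / period_seconds


def is_degraded(bps: float, threshold_bps: float = THRESHOLD_BPS) -> bool:
    """Mirror the alarm condition: strictly below the threshold."""
    return bps < threshold_bps


# 1,500 bytes in each direction over 5 minutes averages 80 bps.
sample_bps = throughput_bps(1500, 1500)
sample_degraded = is_degraded(sample_bps)
```

At this threshold, roughly 3,750 bytes of combined traffic per five-minute period is enough to keep the alarm out of the `ALARM` state, which is why the Cloud VPC ping target matters for this check.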

### Intelligence Layer

| Component | Details |
|-----------|---------|
| **Amazon DevOps Agent** | Receives webhook alerts, reads VPN tunnel logs and CloudWatch metrics, queries the MCP server for business context, and produces root-cause analysis. |
| **MCP Server** | Lambda behind API Gateway (REST + API key). Implements JSON-RPC 2.0 with 3 tools: `get_service_dependencies`, `get_cost_impact`, `get_compliance_status`. |
| **Operator App** | Web UI for on-demand chat with the agent — used for scenarios like traffic-selector and bgp-route-withdraw where time-scoped prompts produce better results. |
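
As a rough illustration of how a client might call one of the three MCP tools, the sketch below builds a JSON-RPC 2.0 request body. It assumes the server follows the MCP convention of a `tools/call` method with `name` and `arguments` parameters; the `resource_id` argument and the placeholder API key are hypothetical, since the tool input schemas are not shown here.

```python
import json


def build_tool_call(tool_name: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 request that invokes one MCP tool."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",  # assumption: MCP-style method name
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# Hypothetical argument name; the real tool schema may differ.
body = build_tool_call("get_service_dependencies", {"resource_id": "vpn-example"})

# The API Gateway stage expects the API key in the x-api-key header.
headers = {"Content-Type": "application/json", "x-api-key": "<api-key>"}
```

The body and headers would then be sent as a POST to the REST API endpoint created by the MCP server stack.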

## Data Flow

```
1. Failure injected on CGW (e.g. wrong PSK, blocked ports, BGP shutdown)
2. VPN tunnel state changes → CloudWatch metrics + tunnel logs
3. CloudWatch alarm fires (TunnelState < 1, throughput < 100 bps, or WITHDRAWN log)
4. SNS topic receives alarm notification
5. Lambda constructs HMAC-signed webhook payload → DevOps Agent
6. Agent investigates:
   ├── Reads VPN tunnel logs (IKE negotiation errors, BGP state changes)
   ├── Checks CloudWatch metrics (TunnelState, TunnelDataIn/Out)
   └── Queries MCP server (service dependencies, cost impact, compliance)
7. Agent produces root-cause analysis with business context
```
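
Steps 4 and 5 depend on the Lambda unpacking the alarm from the SNS event. The sketch below shows one way that could look, relying on the standard SNS-to-Lambda event shape (`Records[].Sns.Message`) and the standard CloudWatch alarm notification fields. The function name, the alarm name, and the returned keys are illustrative, not taken from the repository.

```python
import json


def extract_alarm(sns_event: dict) -> dict:
    """Pull the CloudWatch alarm fields out of an SNS-triggered Lambda event."""
    # SNS delivers the alarm notification as a JSON string in Sns.Message.
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    return {
        "alarm_name": message["AlarmName"],
        "state": message["NewStateValue"],
        "reason": message["NewStateReason"],
    }


# Hypothetical event, trimmed to the fields used above.
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "AlarmName": "example-tunnel-1-down",
                "NewStateValue": "ALARM",
                "NewStateReason": "Threshold Crossed: TunnelState < 1",
            })
        }
    }]
}

alarm = extract_alarm(sample_event)
```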

## Security Considerations

- **No hardcoded credentials** — VPN PSKs are generated by AWS and configured via SSH at deploy time
- **API key authentication** — MCP server requires `x-api-key` header on all POST requests
- **HMAC-signed webhooks** — Lambda signs payloads with SHA-256 HMAC using the webhook secret
- **Minimal IAM** — webhook Lambda has only `AWSLambdaBasicExecutionRole`; MCP Lambda has the same
- **SSH key required** — CGW access requires the EC2 key pair specified at deployment
- **Security groups** — Cloud VPC allows only ICMP from on-prem + SSH; On-Prem VPC allows IKE (UDP 500/4500) + SSH + ICMP
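
The HMAC signing step can be sketched with the Python standard library. This is a minimal illustration of SHA-256 HMAC over a JSON body, not the Lambda's actual code: the canonical serialization, the payload fields, and the way the signature is transmitted (for example, in a header) are assumptions, and the DevOps Agent webhook may require a different format.

```python
import hashlib
import hmac
import json
from typing import Tuple


def sign_payload(payload: dict, secret: str) -> Tuple[bytes, str]:
    """Serialize the payload and return (body, hex HMAC-SHA256 signature)."""
    # Sign the exact bytes that will be sent, so the receiver can recompute it.
    body = json.dumps(payload, separators=(",", ":"), sort_keys=True).encode("utf-8")
    signature = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(body: bytes, signature: str, secret: str) -> bool:
    """Receiver-side check using a constant-time comparison."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


# Hypothetical payload and secret, for illustration only.
body, signature = sign_payload({"alarm": "example-tunnel-1-down"}, "example-secret")
```

Signing the serialized bytes rather than the dictionary avoids mismatches caused by key ordering or whitespace differences between sender and receiver.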

## CDK Stacks

| Stack | Resources | Tracking |
|-------|-----------|----------|
| `VpnDemoStack-{region}` | 2 VPCs, 2 EC2 instances, VPN connection, SNS topic, webhook Lambda, CloudWatch log group | `(uksb-do9bhieqqh)(tag:vpn-investigation,observability)` |
| `VpnDemoMcpServer-{region}` | Lambda function, API Gateway REST API, API key, usage plan | None (secondary stack) |

Resources created outside CDK (by `deploy-all.sh`, post-deployment):
- 4 CloudWatch alarms (require VPN tunnel IPs from the deployed VPN connection)
- CloudWatch Logs metric filter (requires the log group to exist)
- Libreswan + GoBGP configuration on CGW (requires SSH access to the running instance)