AWS DevOps Agent Demo - AI-Powered Incident Response

A demonstration of AWS DevOps Agent's AI-powered incident investigation in a containerized ECS environment. In the example scenario, automated AI analysis takes about 2 minutes where manual investigation takes about 45. The project includes self-healing infrastructure with automated remediation, monitoring, and root cause analysis.

🎯 What This Project Demonstrates

This is a complete, enterprise-grade DevOps automation platform that shows how AWS DevOps Agent can revolutionize incident response:

🤖 AI-Powered Investigation - Automatic incident analysis with root cause identification
🔄 Self-Healing Infrastructure - Automated remediation via Lambda playbooks
📊 Complete Observability - CloudWatch logs, metrics, alarms, and dashboards
🚀 Zero-Downtime Deployments - ECS Fargate with rolling updates
🔗 Code Correlation - Links incidents to GitHub commits automatically
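⚡ Faster Resolution - Cuts investigation from about 45 minutes to about 2, and end-to-end response by 81% (70 to 13 minutes) in the example timeline under Real-World Benefits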

🏗️ Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                         GitHub Repository                            │
│            (Source Code + Terraform IaC + Workflows)                 │
└────┬──────────────────┬────────────────────┬────────────────────────┘
     │                  │                    │
     │ terraform/**     │ app/**             │ (Code Correlation)
     ▼                  ▼                    ▼
┌─────────────────────────────┐  ┌─────────────────────────────────────┐
│  Terraform Workflow (IaC)    │  │      Deploy Workflow (Application)   │
│  ┌───────────────────────┐  │  │  ┌──────────────────────────────┐   │
│  │ 1. Format & Validate   │  │  │  │  1. Checkout Code            │   │
│  │ 2. terraform init      │  │  │  │  2. Build Docker Image       │   │
│  │ 3. terraform plan      │  │  │  │  3. Push to ECR              │   │
│  │ 4. PR: Comment plan    │  │  │  │  4. Update ECS Task Def      │   │
│  │ 5. terraform apply     │  │  │  │  5. Deploy to ECS (Rolling)  │   │
│  │    (manual trigger)    │  │  │  │  6. Store Metadata in SSM    │   │
│  └──────────┬────────────┘  │  │  └────────────────┬─────────────┘   │
│             │               │  │                   │                 │
│             │ Provisions    │  │                   │                 │
│             │ VPC, ECR, ECS,│  │                   │                 │
│             │ ALB, CW, etc.│  │                   │                 │
└─────────────┼──────────────┘  └───────────────────┼───────────────────┘
              │                                     │
              │                                     │ (Deployment Metadata)
              │                                     ▼
              │              ┌─────────────────────────────────────┐
              │              │   SSM Parameter Store               │
              │              │   - Deployment Timestamps            │
              │              │   - Commit SHA & Messages            │
              │              │   - Image Tags                       │
              │              └─────────────────────────────────────┘
              │
              │ (Provisions: VPC, ECR, ECS, ALB, CW, Lambda, SNS, S3)
              │
              └──────────────────────────┬──────────────────────────────┘
                                         │
                                         │ (Push Image from Deploy workflow)
                                         ▼
┌─────────────────────────────────────────────────────────────────────┐
│  AWS Resources (Provisioned by Terraform)                            │
├─────────────────────────────────────────────────────────────────────┤
│                       Amazon ECR Registry                            │
│                    (Container Image Storage)                         │
│                    - Image Versioning                                │
│                    - Lifecycle Policies                              │
└────────────────────────────┬────────────────────────────────────────┘
                             │
                             │ (Pull Image)
                             ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         AWS VPC (10.0.0.0/16)                        │
│                                                                      │
│  ┌────────────────────────────────────────────────────────────┐    │
│  │              PUBLIC SUBNETS (Multi-AZ)                      │    │
│  │         10.0.0.0/24 (AZ-a) | 10.0.1.0/24 (AZ-b)           │    │
│  │                                                             │    │
│  │  ┌──────────────────────────────────────────────────────┐ │    │
│  │  │      Internet Gateway                                 │ │    │
│  │  └────────────────────┬─────────────────────────────────┘ │    │
│  │                       │                                     │    │
│  │  ┌────────────────────▼─────────────────────────────────┐ │    │
│  │  │   Application Load Balancer (ALB)                    │ │    │
│  │  │   - Health Checks (/health endpoint)                 │ │    │
│  │  │   - Traffic Distribution (Round Robin)               │ │    │
│  │  │   - SSL Termination                                  │ │    │
│  │  └────────────────────┬─────────────────────────────────┘ │    │
│  │                       │                                     │    │
│  │  ┌────────────────────▼─────────────────────────────────┐ │    │
│  │  │   NAT Gateway (AZ-a & AZ-b)                         │ │    │
│  │  │   - Enables private subnet internet access          │ │    │
│  │  │   - For ECR image pulls & CloudWatch                │ │    │
│  │  └──────────────────────────────────────────────────────┘ │    │
│  └────────────────────────┼─────────────────────────────────────┘
│                           │
│  ┌────────────────────────▼─────────────────────────────────────┐
│  │              PRIVATE SUBNETS (Multi-AZ)                      │
│  │        10.0.10.0/24 (AZ-a) | 10.0.11.0/24 (AZ-b)           │
│  │                                                              │
│  │  ┌──────────────────────────────────────────────────────┐  │
│  │  │         ECS Fargate Cluster                          │  │
│  │  │                                                       │  │
│  │  │  ┌──────────────┐      ┌──────────────┐            │  │
│  │  │  │   Task 1     │      │   Task 2     │            │  │
│  │  │  │  (Container) │      │  (Container) │            │  │
│  │  │  │  - Node.js   │      │  - Node.js   │            │  │
│  │  │  │  - Port 3000 │      │  - Port 3000 │            │  │
│  │  │  │  - Health    │      │  - Health    │            │  │
│  │  │  │    Checks    │      │    Checks    │            │  │
│  │  │  └──────┬───────┘      └──────┬───────┘            │  │
│  │  │         │                     │                      │  │
│  │  │         └─────────┬───────────┘                      │  │
│  │  │                   │ Logs & Metrics                   │  │
│  │  └───────────────────┼──────────────────────────────────┘  │
│  └────────────────────────┼─────────────────────────────────────┘
│                           │
└───────────────────────────┼──────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Logs & Metrics                         │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐             │
│  │  Log Groups  │  │   Metrics    │  │   Alarms     │             │
│  │  - Errors    │  │  - CPU       │  │  - CPU High  │             │
│  │  - Access    │  │  - Memory    │  │  - Memory    │             │
│  │  - Health    │  │  - 5XX       │  │  - 5XX       │             │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘             │
│         │                  │                  │                      │
│         └──────────┬───────┴──────────────────┘                      │
│                    │                                                  │
│                    ▼                                                  │
│         ┌──────────────────┐    ┌──────────────────┐  ┌────────┐     │
│         │  Lambda Playbook │    │  SNS Topic       │  │ S3     │     │
│         │  - Auto-Restart  │◄───│  - Email Alerts  │  │ Logs   │     │
│         │  - Auto-Scale    │    │  - Notifications │  │ Export │     │
│         │  - Force Deploy  │    └──────────────────┘  └────────┘     │
│         └──────────┬───────┘                                          │
│                    │                                                  │
│                    │ (Remediation Actions)                            │
│                    ▼                                                  │
│         ┌──────────────────────────────────────────┐                  │
│         │      AWS DevOps Agent (AI)               │                  │
│         │  ┌────────────────────────────────────┐ │                  │
│         │  │ - Log Analysis (CloudWatch)        │ │                  │
│         │  │ - Pattern Detection                │ │                  │
│         │  │ - Root Cause Analysis              │ │                  │
│         │  │ - ECS Task Status Monitoring       │ │                  │
│         │  │ - Code Correlation (GitHub + SSM)  │ │                  │
│         │  │ - Deployment History Analysis      │ │                  │
│         │  └────────────────────────────────────┘ │                  │
│         └──────────┬───────────────────────────────┘                  │
│                    │                                                  │
│                    │ (Reads Deployment Metadata)                      │
│                    ▼                                                  │
│         ┌─────────────────────────────────────┐                      │
│         │   SSM Parameter Store               │                      │
│         │   (Deployment Correlation)          │                      │
│         └─────────────────────────────────────┘                      │
└─────────────────────────────────────────────────────────────────────┘

🔄 Deployment Flow

Terraform Infrastructure (IaC):

Terraform Changes → Push to terraform/** triggers workflow
Plan → terraform plan runs on every push/PR
PR Review → Plan output posted as PR comment for review
Apply → terraform apply runs on manual workflow trigger
Provision → Creates/updates VPC, ECR, ECS, ALB, CloudWatch, Lambda, SNS, S3, IAM
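
The plan-then-apply split above can be sketched in shell. This is only an illustration, not the repository's workflow file: `plan_outcome` and `run_plan` are hypothetical helper names, and the sketch assumes `terraform init` has already been run. It relies on Terraform's `-detailed-exitcode` flag, which makes `terraform plan` exit with 0 (no changes), 1 (error), or 2 (changes present).

```sh
# Map the exit status of `terraform plan -detailed-exitcode` to an outcome.
# Exit codes: 0 = no changes, 1 = error, 2 = changes present.
plan_outcome() {
  case "$1" in
    0) echo "no-changes" ;;
    2) echo "changes-pending" ;;
    *) echo "error" ;;
  esac
}

# Run a plan in directory $1 and print the outcome without applying anything.
# The body runs in a subshell, so the `cd` does not affect the caller.
run_plan() (
  cd "$1" || exit 1
  status=0
  terraform plan -detailed-exitcode -input=false >/dev/null || status=$?
  plan_outcome "$status"
)
```

For example, `run_plan terraform` prints `changes-pending` when a manual `terraform apply` would modify resources.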

Application CI/CD Pipeline:

Code Push → GitHub repository receives changes to app/**
GitHub Actions → Deploy workflow triggers on push to main
Build & Push → Docker image built and pushed to ECR with commit SHA tag
ECS Update → Task definition updated with new image
Rolling Deployment → ECS performs zero-downtime rolling update
Metadata Storage → Deployment info stored in SSM for correlation
DevOps Agent → Can correlate incidents with specific deployments
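
The tagging and metadata steps could look roughly like the sketch below. The helper names and the `/devops-agent-demo/deployments/latest/...` parameter paths are assumptions made for illustration; the actual names used by the deploy workflow are not shown in this README.

```sh
# Build the image reference for a commit: <repository-url>:<commit-sha>.
image_ref() {
  printf '%s:%s\n' "$1" "$2"
}

# Record deployment metadata in SSM Parameter Store.
# $1 = commit SHA, $2 = image tag. Parameter paths are hypothetical.
record_deployment() (
  prefix="/devops-agent-demo/deployments/latest"
  now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  aws ssm put-parameter --name "$prefix/commit-sha" --value "$1" \
    --type String --overwrite >/dev/null &&
  aws ssm put-parameter --name "$prefix/image-tag" --value "$2" \
    --type String --overwrite >/dev/null &&
  aws ssm put-parameter --name "$prefix/deployed-at" --value "$now" \
    --type String --overwrite >/dev/null
)
```

Usage might be `sha=$(git rev-parse HEAD); docker push "$(image_ref "$ECR_REPO" "$sha")"; record_deployment "$sha" "$sha"`. Writing each field to its own parameter avoids hand-building JSON in shell.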

Incident Response Flow:

Alarm Triggers → CloudWatch alarm detects anomaly (CPU/Memory/5XX)
SNS Notification → Alarm sends notification to SNS topic
Lambda Playbook → Automated remediation attempts (restart/scale/deploy)
DevOps Agent → AI investigates by analyzing:
- CloudWatch logs and metrics
- ECS task status and recent changes
- SSM deployment history
- GitHub commit correlation
Root Cause Report → Agent provides analysis with code links and recommendations

✨ Key Features

🔍 Automated Incident Investigation

AI-Powered Analysis - DevOps Agent automatically investigates when alarms trigger
Log Pattern Detection - Identifies error patterns and anomalies in CloudWatch Logs
Code Correlation - Links incidents to specific GitHub commits and deployments
Root Cause Analysis - Provides likely causes with confidence scores
Actionable Recommendations - Suggests remediation steps and rollback commands

🛠️ Self-Healing Infrastructure

Lambda Playbooks - Automated remediation for common issues
Auto-Restart - Restarts ECS services on 5XX error spikes
Auto-Scale - Scales up on high CPU utilization
Health Recovery - Forces new deployments on unhealthy targets
Email Notifications - Reports all automated actions
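
The alarm-to-action mapping described above can be written as a small lookup. This is a sketch of the decision logic only, not the Lambda playbook's code; the alarm-name substrings and the action labels are illustrative assumptions.

```sh
# Choose a remediation action from an alarm name.
# Substrings matched here and the action labels are hypothetical.
remediation_for_alarm() {
  case "$1" in
    *5xx*|*5XX*)             echo "restart-service" ;;
    *cpu*|*CPU*)             echo "scale-up" ;;
    *unhealthy*|*Unhealthy*) echo "force-new-deployment" ;;
    *)                       echo "notify-only" ;;   # e.g. memory: investigate, no automatic action
  esac
}
```

Falling back to `notify-only` keeps unknown alarms from triggering an automated change.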

📊 Complete Observability

5 CloudWatch Alarms - CPU, memory, 5XX errors, unhealthy targets, error count
Custom Dashboard - Real-time visualization of all metrics
Structured Logging - JSON logs with full context
Prometheus Metrics - Application-level metrics collection
S3 Log Export - Long-term log storage and analysis

🚀 Production-Ready Infrastructure

Multi-AZ Deployment - High availability across availability zones
Private Subnets - ECS tasks run in isolated private subnets
Security Groups - Least privilege network access
Auto-Scaling - Fargate with FARGATE_SPOT support
Zero-Downtime - Rolling deployments with health checks

📋 Prerequisites

Before you begin, ensure you have:

AWS Account with admin access
AWS CLI v2.x configured (aws configure)
Terraform v1.0+ installed
Docker v20.x+ installed and running
Git installed
Node.js 18+ (for local development)
~$56-120/month budget (or plan to destroy after testing)
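
A quick way to confirm the tools in this list are installed is sketched below. `missing_tools` is a hypothetical helper, not part of the repository; it checks only that each command is on `PATH`, not its version.

```sh
# Print each required command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Running `missing_tools aws terraform docker git node` prints nothing when everything is available.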

🚀 Quick Start (20 Minutes)

1️⃣ Clone and Configure

git clone https://github.com/VanshShah174/AWS-Devops-Agent.git
cd AWS-Devops-Agent

# Copy example configuration
cp terraform/terraform.tfvars.example terraform/terraform.tfvars

# (Optional) Edit configuration
nano terraform/terraform.tfvars

2️⃣ Deploy Infrastructure

cd terraform
terraform init
terraform apply  # Type 'yes' when prompted

⏱️ This takes 10-15 minutes (NAT Gateway creation is the slowest part)

3️⃣ Build and Deploy Application

cd ..

# Using Makefile (recommended)
make build
make push

# Or manually
cd app
ECR_REPO=$(cd ../terraform && terraform output -raw ecr_repository_url)
# Log in to the registry host (the part of the repository URL before the first "/")
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin "${ECR_REPO%%/*}"
docker build -t devops-agent-demo:latest .
docker tag devops-agent-demo:latest "$ECR_REPO:latest"
docker push "$ECR_REPO:latest"

4️⃣ Wait for ECS Tasks (3-5 minutes)

# Check service status
make status

# Or manually
aws ecs describe-services \
  --cluster devops-agent-demo-dev-cluster \
  --services devops-agent-demo-dev-service \
  --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount}'
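
Rather than re-running the status command by hand, the wait can be scripted. The sketch below polls the same `describe-services` call; the helper names, attempt count, and delay are assumptions, and the cluster and service names are passed in as arguments.

```sh
# Succeed when all desired tasks are running and none are pending.
# $1 = desiredCount, $2 = runningCount, $3 = pendingCount
service_is_ready() {
  [ "$1" -gt 0 ] && [ "$2" -eq "$1" ] && [ "$3" -eq 0 ]
}

# Poll the service until ready: $1 cluster, $2 service,
# $3 max attempts (default 30), $4 seconds between attempts (default 10).
wait_for_service() (
  cluster=$1; service=$2; attempts=${3:-30}; delay=${4:-10}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    counts=$(aws ecs describe-services --cluster "$cluster" --services "$service" \
      --query 'services[0].[desiredCount,runningCount,pendingCount]' \
      --output text) || exit 1
    # --output text prints the three numbers on one line, tab-separated
    set -- $counts
    if service_is_ready "$1" "$2" "$3"; then exit 0; fi
    i=$((i + 1))
    sleep "$delay"
  done
  exit 1
)
```

For example: `wait_for_service devops-agent-demo-dev-cluster devops-agent-demo-dev-service`. The AWS CLI also ships a built-in waiter, `aws ecs wait services-stable`, which covers the common case without a custom loop.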

5️⃣ Verify Application

# Get application URL
make url

# Test health endpoint
curl "$(make -s url)/health"

Expected response:

{"status":"healthy","uptime":123.456,"memory":{...}}
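
Because the load balancer may need a little time before targets pass health checks, a retry loop can be more convenient than a single `curl`. This is a minimal sketch: `body_is_healthy` and `wait_healthy` are hypothetical helpers, and the defaults of 10 attempts and 5 seconds are arbitrary.

```sh
# Succeed if the JSON body reports "status":"healthy".
body_is_healthy() {
  printf '%s' "$1" | grep -q '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Poll <base-url>/health: $1 base URL, $2 attempts (default 10),
# $3 seconds between attempts (default 5).
wait_healthy() (
  url=$1; attempts=${2:-10}; delay=${3:-5}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f fails on HTTP errors; -sS hides progress but keeps error messages
    if body=$(curl -fsS --max-time 5 "$url/health") && body_is_healthy "$body"; then
      exit 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  exit 1
)
```

The function exits non-zero if the endpoint never reports healthy, so it can gate later steps in a script.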

6️⃣ Setup DevOps Agent

# PowerShell (Windows)
.\scripts\setup-devops-agent.ps1

# Bash (Linux/Mac)
chmod +x scripts/setup-agent-space.sh
./scripts/setup-agent-space.sh

7️⃣ Test Incident Response

# Trigger an error spike
make test-error-spike

# Or manually (PowerShell)
.\scripts\trigger-incidents.ps1 -Scenario error-spike

What happens:

Script sends 20 error requests
CloudWatch alarm triggers (2-3 minutes)
Lambda playbook restarts service automatically
DevOps Agent investigates and analyzes
You receive email notifications

8️⃣ View Results

CloudWatch Dashboard:

https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=devops-agent-demo-dev

DevOps Agent Console:

https://console.aws.amazon.com/devopsagent/

🧪 Testing Scenarios

The project includes 7 realistic incident scenarios. Four of them, plus a target that runs every test, are shown here:

Error Spike

make test-error-spike

Triggers 20x 500 errors → High 5XX alarm → Auto-restart service

Memory Leak

make test-memory-leak

Allocates 100MB arrays → Memory alarm → Investigation

CPU Spike

make test-cpu-spike

CPU-intensive operations → CPU alarm → Auto-scale up

Health Check Failure

make test-health-failure

Disables health endpoint → Unhealthy targets alarm → Force new deployment

All Scenarios

make test-all

Runs all test scenarios sequentially

📊 What Gets Deployed

AWS Resources (40+ resources)

Networking:

VPC with public/private subnets (2 AZs)
Internet Gateway
NAT Gateway
Route tables and associations
Security groups

Compute:

ECS Fargate cluster
ECS service with auto-scaling
Task definition with health checks
Application Load Balancer
Target group

Storage:

ECR repository with lifecycle policies
S3 bucket for log exports

Monitoring:

5 CloudWatch alarms
CloudWatch dashboard
Log groups with 7-day retention
Log metric filters
SNS topic for notifications

Automation:

Lambda playbook function
DevOps Agent IAM roles
SSM parameters for configuration

Estimated Monthly Cost:

Standard (2 AZ, 2 tasks): ~$120/month
Optimized (1 AZ, 1 task): ~$56/month

📁 Project Structure

AWS-Devops-Agent/
├── app/                          # Node.js application
│   ├── src/
│   │   └── index.js             # Express API with 8 error endpoints
│   ├── Dockerfile               # Multi-stage Docker build
│   └── package.json             # Dependencies
├── terraform/                    # Infrastructure as Code
│   ├── main.tf                  # Provider configuration
│   ├── vpc.tf                   # Network resources
│   ├── ecs.tf                   # ECS cluster & service
│   ├── alb.tf                   # Load balancer
│   ├── cloudwatch.tf            # Monitoring & alarms
│   ├── devops-agent.tf          # DevOps Agent setup
│   ├── playbook-lambda.tf       # Automated remediation
│   └── s3-logs.tf               # Log storage
├── .github/workflows/            # CI/CD pipelines
│   ├── deploy.yml               # Application deployment
│   └── terraform.yml            # Infrastructure deployment
├── scripts/                      # Automation scripts
│   ├── setup-devops-agent.ps1   # Agent configuration
│   ├── trigger-incidents.ps1    # Test scenarios
│   ├── check-metrics.ps1        # Metrics verification
│   └── verify-agent-monitoring.ps1
├── docs/                         # Comprehensive documentation
│   ├── ARCHITECTURE.md          # System architecture
│   ├── SETUP.md                 # Detailed setup guide
│   ├── TESTING.md               # Testing guide
│   └── FAQ.md                   # Troubleshooting
├── QUICKSTART.md                 # 5-step quick start
├── DEPLOYMENT_FLOW.md            # Visual deployment guide
├── COMPLETE_SYSTEM_FLOW.md       # End-to-end flow
└── README.md                     # This file

🎓 Learning Outcomes

By deploying this project, you'll learn:

AWS Services

✅ Amazon ECS & Fargate (container orchestration)
✅ Application Load Balancer (traffic distribution)
✅ Amazon ECR (container registry)
✅ Amazon CloudWatch (monitoring & alarms)
✅ AWS Lambda (serverless automation)
✅ AWS DevOps Agent (AI-powered incident response)
✅ Amazon VPC (networking & security)
✅ IAM (roles & permissions)

DevOps Practices

✅ Infrastructure as Code (Terraform)
✅ Containerization (Docker)
✅ CI/CD Pipelines (GitHub Actions)
✅ Monitoring & Observability
✅ Automated Incident Response
✅ Self-Healing Systems

Best Practices

✅ Multi-AZ high availability
✅ Security groups & least privilege
✅ Health checks & auto-recovery
✅ Structured logging
✅ Metrics collection
✅ Automated testing

🔧 Available Commands

# Deployment
make init          # Initialize Terraform
make apply         # Deploy infrastructure
make build         # Build Docker image
make push          # Push to ECR
make deploy        # Full deployment

# Testing
make test-error-spike      # Test error spike
make test-memory-leak      # Test memory leak
make test-cpu-spike        # Test CPU spike
make test-health-failure   # Test health failure
make test-all              # Run all tests

# Monitoring
make logs          # Tail CloudWatch logs
make alarms        # Show alarm status
make status        # Application status
make url           # Show application URL

# Maintenance
make cleanup       # Restore healthy state
make destroy       # Delete all resources
make help          # Show all commands

🛡️ Security Best Practices

This project implements enterprise security standards:

✅ Private Subnets - ECS tasks run in isolated private subnets
✅ Security Groups - Least privilege network access
✅ IAM Roles - No hardcoded credentials
✅ Non-Root Container - Docker runs as non-root user
✅ ECR Scanning - Automatic image vulnerability scanning
✅ Encryption - AES256 encryption for ECR and S3
✅ VPC Endpoints - Secure AWS service access (optional)

📈 Real-World Benefits

Before DevOps Agent (Manual Investigation)

1. Alarm triggers at 3 AM                    → 5 min
2. Engineer wakes up and logs in             → 5 min
3. Searches CloudWatch logs                  → 10 min
4. Checks ECS task status                    → 5 min
5. Reviews recent deployments                → 10 min
6. Analyzes metrics manually                 → 10 min
7. Determines root cause                     → 15 min
8. Takes corrective action                   → 10 min
────────────────────────────────────────────────────
Total Time: 70 minutes
Engineer: Tired and frustrated 😫

After DevOps Agent (Automated)

1. Alarm triggers at 3 AM                    → 0 min
2. Lambda playbook restarts service          → 1 min
3. DevOps Agent investigates automatically   → 2 min
4. Engineer reviews complete report          → 5 min
5. Takes action based on recommendations     → 5 min
────────────────────────────────────────────────────
Total Time: 13 minutes
Engineer: Well-rested, confident 😊
Time Saved: 57 minutes (81% reduction)
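
The totals and the percentage above can be checked with shell arithmetic. The step durations are taken directly from the two timelines; `sum_minutes` and `pct_reduction` are helper names introduced only for this check.

```sh
# Sum a list of per-step durations (minutes).
sum_minutes() {
  total=0
  for m in "$@"; do total=$((total + m)); done
  echo "$total"
}

# Percentage reduction from $1 to $2, truncated to a whole percent.
pct_reduction() {
  echo $(( ($1 - $2) * 100 / $1 ))
}

before=$(sum_minutes 5 5 10 5 10 10 15 10)   # manual timeline
after=$(sum_minutes 0 1 2 5 5)               # automated timeline
echo "saved $((before - after)) min, $(pct_reduction "$before" "$after")% reduction"
# prints: saved 57 min, 81% reduction
```

Integer division truncates 81.4% to 81%, matching the figure in the timeline.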

🧹 Cleanup

⚠️ IMPORTANT: Always destroy resources when done to avoid charges!

# Restore application to healthy state
make cleanup

# Destroy all infrastructure
cd terraform
terraform destroy  # Type 'yes' to confirm

Estimated cost if left running (optimized 1-AZ setup to standard 2-AZ setup):

Hourly: ~$0.08-0.17
Daily: ~$1.90-4.00
Monthly: ~$56-120

📚 Documentation

QUICKSTART.md - Fast 5-step deployment
STEP_BY_STEP_GUIDE.md - Detailed beginner guide
DEPLOYMENT_FLOW.md - Visual deployment flow
COMPLETE_SYSTEM_FLOW.md - End-to-end system flow
docs/ARCHITECTURE.md - Architecture deep dive
docs/SETUP.md - Detailed setup with troubleshooting
docs/TESTING.md - Complete testing guide
docs/FAQ.md - Common issues and solutions

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

AWS DevOps Agent team for the amazing AI-powered incident response service
AWS for providing comprehensive cloud services
The open-source community for tools and inspiration

📞 Support

Issues: GitHub Issues
Documentation: Check the docs/ directory
FAQ: See docs/FAQ.md

🎯 Use Cases

This project is perfect for:

Learning - Understand AWS DevOps Agent and ECS deployment
Proof of Concept - Demonstrate DevOps Agent value to stakeholders
Template - Use as a starting point for production applications
Training - Practice incident response and monitoring
Portfolio - Showcase DevOps and cloud engineering skills

🌟 Star History

If you find this project helpful, please consider giving it a star! ⭐

📊 Project Statistics

Total Files: 50+
Lines of Code: ~5,000+
AWS Resources: 40+
Documentation: 15+ guides
Test Scenarios: 7 realistic incidents
Setup Time: 20 minutes
Time Savings: 81% reduction in the example timeline (70 to 13 minutes)

Built with ❤️ by Vansh Shah

Ready to revolutionize your incident response? Get Started →
