A demonstration of AWS DevOps Agent's AI-powered incident investigation capabilities in a containerized ECS environment. It cuts incident investigation from about 45 minutes of manual work to about 2 minutes of automated AI analysis, on self-healing infrastructure with automated remediation, comprehensive monitoring, and intelligent root cause analysis.
This is a complete, enterprise-grade DevOps automation platform that shows how AWS DevOps Agent can speed up incident response:
- AI-Powered Investigation - Automatic incident analysis with root cause identification
- Self-Healing Infrastructure - Automated remediation via Lambda playbooks
- Complete Observability - CloudWatch logs, metrics, alarms, and dashboards
- Zero-Downtime Deployments - ECS Fargate with rolling updates
- Code Correlation - Links incidents to GitHub commits automatically
- About 81% Faster Resolution - In the timeline comparison later in this README, total resolution time drops from 70 minutes to 13 minutes; the investigation step alone drops from about 45 minutes to about 2
GitHub Repository (Source Code + Terraform IaC + Workflows):
- Changes under terraform/** → Terraform Workflow (IaC)
- Changes under app/** → Deploy Workflow (Application)
- Commit history → Code Correlation
Terraform Workflow (IaC):
1. Format & Validate
2. terraform init
3. terraform plan
4. PR: Comment plan
5. terraform apply (manual trigger)
Provisions: VPC, ECR, ECS, ALB, CloudWatch, Lambda, SNS, S3
Deploy Workflow (Application):
1. Checkout Code
2. Build Docker Image
3. Push to ECR
4. Update ECS Task Def
5. Deploy to ECS (Rolling)
6. Store Metadata in SSM
SSM Parameter Store (Deployment Metadata):
- Deployment Timestamps
- Commit SHA & Messages
- Image Tags
AWS Resources (Provisioned by Terraform):
- Amazon ECR Registry (Container Image Storage): image versioning and lifecycle policies; receives images from the Deploy workflow, and ECS pulls images from it
- AWS VPC (10.0.0.0/16)
  - Public subnets (Multi-AZ): 10.0.0.0/24 (AZ-a) | 10.0.1.0/24 (AZ-b)
    - Internet Gateway
    - Application Load Balancer (ALB): health checks (/health endpoint), traffic distribution (round robin), SSL termination
    - NAT Gateway (AZ-a & AZ-b): enables private subnet internet access for ECR image pulls & CloudWatch
  - Private subnets (Multi-AZ): 10.0.10.0/24 (AZ-a) | 10.0.11.0/24 (AZ-b)
    - ECS Fargate Cluster: Task 1 and Task 2, each a Node.js container on port 3000 with health checks; sends logs & metrics to CloudWatch
- CloudWatch Logs & Metrics
  - Log Groups: Errors, Access, Health
  - Metrics: CPU, Memory, 5XX
  - Alarms: CPU High, Memory, 5XX
- SNS Topic: email alerts and notifications; feeds the Lambda Playbook
- Lambda Playbook (Remediation Actions): Auto-Restart, Auto-Scale, Force Deploy
- S3: logs export
- AWS DevOps Agent (AI): Log Analysis (CloudWatch), Pattern Detection, Root Cause Analysis, ECS Task Status Monitoring, Code Correlation (GitHub + SSM), Deployment History Analysis; reads deployment metadata from SSM Parameter Store (Deployment Correlation)
Terraform Infrastructure (IaC):
- Terraform Changes → Push to terraform/** triggers workflow
- Plan → terraform plan runs on every push/PR
- PR Review → Plan output posted as PR comment for review
- Apply → terraform apply runs on manual workflow trigger
- Provision → Creates/updates VPC, ECR, ECS, ALB, CloudWatch, Lambda, SNS, S3, IAM
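The path-based triggers can be pictured as a small routing function. This is an illustrative sketch only: `workflowsFor` and the workflow labels it returns are hypothetical, and in the repository this routing is presumably expressed through GitHub Actions path filters in the workflow files rather than in application code.

```javascript
// Hypothetical helper: mirrors the trigger rules described in this README.
// The function name and the 'terraform' / 'deploy' labels are illustrative.
function workflowsFor(changedFiles, branch) {
  const triggered = new Set();
  for (const file of changedFiles) {
    // terraform/** -> Terraform workflow (plan runs on every push/PR).
    if (file.startsWith('terraform/')) triggered.add('terraform');
    // app/** -> Deploy workflow, which triggers on pushes to main.
    if (file.startsWith('app/') && branch === 'main') triggered.add('deploy');
  }
  return [...triggered].sort();
}

console.log(workflowsFor(['terraform/vpc.tf', 'app/src/index.js'], 'main').join(','));
// prints "deploy,terraform"
console.log(workflowsFor(['app/src/index.js'], 'feature/x').length);
// prints 0
```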
Application CI/CD Pipeline:
- Code Push → GitHub repository receives changes to app/**
- GitHub Actions → Deploy workflow triggers on push to main
- Build & Push → Docker image built and pushed to ECR with commit SHA tag
- ECS Update → Task definition updated with new image
- Rolling Deployment → ECS performs zero-downtime rolling update
- Metadata Storage → Deployment info stored in SSM for correlation
- DevOps Agent → Can correlate incidents with specific deployments
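To make the metadata storage step concrete, here is a minimal sketch of how a deployment record could be turned into SSM parameters. The parameter prefix, the key names, and the `buildDeploymentParameters` helper are assumptions for illustration; the actual names used by the deploy workflow are not shown in this README. Each returned object has the shape of an SSM PutParameter request (Name, Value, Type, Overwrite).

```javascript
// Sketch only: the prefix and key names below are hypothetical, not the
// names used by the repository's deploy workflow.
const DEFAULT_PREFIX = '/devops-agent-demo/dev/deployment';

function buildDeploymentParameters(deployment, prefix = DEFAULT_PREFIX) {
  const { commitSha, commitMessage, imageTag, deployedAt } = deployment;
  const values = {
    timestamp: new Date(deployedAt).toISOString(),
    'commit-sha': commitSha,
    // Keep only the first line of the commit message.
    'commit-message': commitMessage.split('\n')[0],
    'image-tag': imageTag,
  };
  // One PutParameter-shaped request per metadata field.
  return Object.entries(values).map(([key, value]) => ({
    Name: `${prefix}/${key}`,
    Value: value,
    Type: 'String',
    Overwrite: true,
  }));
}

const exampleParams = buildDeploymentParameters({
  commitSha: 'abc1234',
  commitMessage: 'Fix health check timeout\n\nLonger body text',
  imageTag: 'abc1234', // the image is tagged with the commit SHA
  deployedAt: '2024-01-15T10:30:00Z',
});
console.log(exampleParams.map((p) => `${p.Name} = ${p.Value}`).join('\n'));
```

Storing each field under its own key keeps a single lookup cheap; a single JSON-valued parameter would work equally well.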
Incident Response Flow:
- Alarm Triggers → CloudWatch alarm detects anomaly (CPU/Memory/5XX)
- SNS Notification → Alarm sends notification to SNS topic
- Lambda Playbook → Automated remediation attempts (restart/scale/deploy)
- DevOps Agent → AI investigates by analyzing:
  - CloudWatch logs and metrics
  - ECS task status and recent changes
  - SSM deployment history
  - GitHub commit correlation
- Root Cause Report → Agent provides analysis with code links and recommendations
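The deployment-correlation idea in this flow can be illustrated with a simple heuristic: pick the most recent deployment that happened shortly before the alarm. This is a sketch under stated assumptions, not a description of how DevOps Agent works internally; `findSuspectDeployment`, the record fields, and the 60-minute window are all hypothetical.

```javascript
// Hypothetical heuristic: returns the latest deployment that finished within
// `windowMinutes` before the alarm, or null when none qualifies.
function findSuspectDeployment(deployments, alarmTime, windowMinutes = 60) {
  const alarmMs = new Date(alarmTime).getTime();
  const windowMs = windowMinutes * 60 * 1000;
  const candidates = deployments
    .map((d) => ({ ...d, deployedMs: new Date(d.deployedAt).getTime() }))
    // Keep deployments at or before the alarm and inside the window.
    .filter((d) => d.deployedMs <= alarmMs && alarmMs - d.deployedMs <= windowMs)
    // Newest first.
    .sort((a, b) => b.deployedMs - a.deployedMs);
  if (candidates.length === 0) return null;
  const { deployedMs, ...latest } = candidates[0];
  return { ...latest, minutesBeforeAlarm: Math.round((alarmMs - deployedMs) / 60000) };
}

const history = [
  { commitSha: 'a1a1a1a', deployedAt: '2024-01-15T01:00:00Z' },
  { commitSha: 'b2b2b2b', deployedAt: '2024-01-15T02:50:00Z' },
  { commitSha: 'c3c3c3c', deployedAt: '2024-01-15T03:10:00Z' },
];
const suspect = findSuspectDeployment(history, '2024-01-15T03:00:00Z');
console.log(`${suspect.commitSha} deployed ${suspect.minutesBeforeAlarm} min before alarm`);
// prints "b2b2b2b deployed 10 min before alarm"
```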
- AI-Powered Analysis - DevOps Agent automatically investigates when alarms trigger
- Log Pattern Detection - Identifies error patterns and anomalies in CloudWatch Logs
- Code Correlation - Links incidents to specific GitHub commits and deployments
- Root Cause Analysis - Provides likely causes with confidence scores
- Actionable Recommendations - Suggests remediation steps and rollback commands
- Lambda Playbooks - Automated remediation for common issues
- Auto-Restart - Restarts ECS services on 5XX error spikes
- Auto-Scale - Scales up on high CPU utilization
- Health Recovery - Forces new deployments on unhealthy targets
- Email Notifications - Reports all automated actions
- 5 CloudWatch Alarms - CPU, memory, 5XX errors, unhealthy targets, error count
- Custom Dashboard - Real-time visualization of all metrics
- Structured Logging - JSON logs with full context
- Prometheus Metrics - Application-level metrics collection
- S3 Log Export - Long-term log storage and analysis
- Multi-AZ Deployment - High availability across availability zones
- Private Subnets - ECS tasks run in isolated private subnets
- Security Groups - Least privilege network access
- Auto-Scaling - Fargate with FARGATE_SPOT support
- Zero-Downtime - Rolling deployments with health checks
Before you begin, ensure you have:
- AWS Account with admin access
- AWS CLI v2.x configured (aws configure)
- Terraform v1.0+ installed
- Docker v20.x+ installed and running
- Git installed
- Node.js 18+ (for local development)
- ~$56-120/month budget (or plan to destroy after testing)
Clone the repository:

git clone https://github.com/VanshShah174/AWS-Devops-Agent.git
cd AWS-Devops-Agent

Configure Terraform variables:

# Copy example configuration
cp terraform/terraform.tfvars.example terraform/terraform.tfvars
# (Optional) Edit configuration
nano terraform/terraform.tfvars

Deploy the infrastructure:

cd terraform
terraform init
terraform apply # Type 'yes' when prompted

This takes 10-15 minutes (NAT Gateway creation is the slowest part).

Build and push the application image:

cd ..
# Using Makefile (recommended)
make build
make push
# Or manually
cd app
ECR_REPO=$(cd ../terraform && terraform output -raw ecr_repository_url)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $ECR_REPO
docker build -t devops-agent-demo:latest .
docker tag devops-agent-demo:latest $ECR_REPO:latest
docker push $ECR_REPO:latest

Verify the deployment:

# Check service status
make status
# Or manually
aws ecs describe-services \
  --cluster devops-agent-demo-dev-cluster \
  --services devops-agent-demo-dev-service \
  --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount}'

# Get application URL
make url
# Test health endpoint
curl $(make url)/health

Expected response:

{"status":"healthy","uptime":123.456,"memory":{...}}

Set up DevOps Agent:

# PowerShell (Windows)
.\scripts\setup-devops-agent.ps1
# Bash (Linux/Mac)
chmod +x scripts/setup-agent-space.sh
./scripts/setup-agent-space.sh

Trigger a test incident:

# Trigger an error spike
make test-error-spike
# Or manually
.\scripts\trigger-incidents.ps1 -Scenario error-spike

What happens:
- Script sends 20 error requests
- CloudWatch alarm triggers (2-3 minutes)
- Lambda playbook restarts service automatically
- DevOps Agent investigates and analyzes
- You receive email notifications
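For the alarm-to-playbook hand-off in the steps above, a Lambda function subscribed to the SNS topic receives the CloudWatch alarm as a JSON string inside the SNS record. The sketch below shows one way to extract the useful fields. `parseAlarmEvent` and the sample alarm name are hypothetical, and the repository's playbook may be written differently (or in another language); the event shape (`Records[].Sns.Message`, `AlarmName`, `NewStateValue`, `Trigger.MetricName`) follows the standard SNS-to-Lambda and CloudWatch alarm notification formats.

```javascript
// Sketch only: extracts alarm details from an SNS-delivered CloudWatch alarm.
function parseAlarmEvent(event) {
  return event.Records
    .map((record) => {
      // CloudWatch puts the alarm payload in Sns.Message as a JSON string.
      const msg = JSON.parse(record.Sns.Message);
      return {
        alarmName: msg.AlarmName,
        state: msg.NewStateValue,
        reason: msg.NewStateReason,
        metric: msg.Trigger ? msg.Trigger.MetricName : undefined,
        changedAt: msg.StateChangeTime,
      };
    })
    // Only act on transitions into the ALARM state.
    .filter((alarm) => alarm.state === 'ALARM');
}

// Hypothetical sample event for illustration.
const sampleEvent = {
  Records: [{
    Sns: {
      Message: JSON.stringify({
        AlarmName: 'example-high-5xx',
        NewStateValue: 'ALARM',
        NewStateReason: 'Threshold Crossed',
        StateChangeTime: '2024-01-15T03:00:00.000+0000',
        Trigger: { MetricName: 'HTTPCode_Target_5XX_Count' },
      }),
    },
  }],
};
console.log(parseAlarmEvent(sampleEvent)[0].alarmName);
// prints "example-high-5xx"
```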
CloudWatch Dashboard:
https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=devops-agent-demo-dev
DevOps Agent Console:
https://console.aws.amazon.com/devopsagent/
The project includes 7 realistic incident scenarios; these Makefile targets are documented here:
- make test-error-spike - Triggers 20x 500 errors → High 5XX alarm → Auto-restart service
- make test-memory-leak - Allocates 100MB arrays → Memory alarm → Investigation
- make test-cpu-spike - CPU-intensive operations → CPU alarm → Auto-scale up
- make test-health-failure - Disables health endpoint → Unhealthy targets alarm → Force new deployment
- make test-all - Runs all test scenarios sequentially
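The scenario-to-remediation mapping above can be sketched as a small rule table. This is an assumption-laden illustration: `chooseRemediation`, the action labels, and the idea of matching on the alarm name are hypothetical, and the actual Lambda playbook may decide differently. The memory alarm deliberately falls through to investigation only, matching the memory-leak scenario.

```javascript
// Hypothetical rule table: first matching rule wins.
const PLAYBOOK_RULES = [
  { match: /5xx/i, action: 'restart-service' },            // error spike
  { match: /cpu/i, action: 'scale-up' },                   // CPU spike
  { match: /unhealthy/i, action: 'force-new-deployment' }, // health failure
];

function chooseRemediation(alarmName) {
  const rule = PLAYBOOK_RULES.find((r) => r.match.test(alarmName));
  // No automated action (e.g. memory alarms): leave it to investigation.
  return rule ? rule.action : 'investigate-only';
}

// Alarm names below are made up for the example.
console.log(chooseRemediation('example-high-5xx'));     // prints "restart-service"
console.log(chooseRemediation('example-high-cpu'));     // prints "scale-up"
console.log(chooseRemediation('example-high-memory'));  // prints "investigate-only"
```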
Networking:
- VPC with public/private subnets (2 AZs)
- Internet Gateway
- NAT Gateway
- Route tables and associations
- Security groups
Compute:
- ECS Fargate cluster
- ECS service with auto-scaling
- Task definition with health checks
- Application Load Balancer
- Target group
Storage:
- ECR repository with lifecycle policies
- S3 bucket for log exports
Monitoring:
- 5 CloudWatch alarms
- CloudWatch dashboard
- Log groups with 7-day retention
- Log metric filters
- SNS topic for notifications
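Log metric filters work best when the application emits one JSON object per log line. The following is a minimal sketch of such a logger; `logLine` and the field names (`timestamp`, `level`, `message`) are assumptions, not the application's actual log schema. With fields like these, a CloudWatch Logs filter pattern such as `{ $.level = "error" }` could count error lines.

```javascript
// Sketch only: builds one JSON log line with caller-supplied context.
// `now` is injectable so the output is reproducible in tests.
function logLine(level, message, context = {}, now = new Date()) {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    ...context,
  });
}

console.log(logLine('error', 'Simulated failure', { path: '/example', statusCode: 500 }));
```

Writing the line to stdout is enough on Fargate when the task uses a log driver that forwards container output to CloudWatch Logs.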
Automation:
- Lambda playbook function
- DevOps Agent IAM roles
- SSM parameters for configuration
Estimated Monthly Cost:
- Standard (2 AZ, 2 tasks): ~$120/month
- Optimized (1 AZ, 1 task): ~$56/month
aws-devops-agent-demo/
├── app/                              # Node.js application
│   ├── src/
│   │   └── index.js                  # Express API with 8 error endpoints
│   ├── Dockerfile                    # Multi-stage Docker build
│   └── package.json                  # Dependencies
├── terraform/                        # Infrastructure as Code
│   ├── main.tf                       # Provider configuration
│   ├── vpc.tf                        # Network resources
│   ├── ecs.tf                        # ECS cluster & service
│   ├── alb.tf                        # Load balancer
│   ├── cloudwatch.tf                 # Monitoring & alarms
│   ├── devops-agent.tf               # DevOps Agent setup
│   ├── playbook-lambda.tf            # Automated remediation
│   └── s3-logs.tf                    # Log storage
├── .github/workflows/                # CI/CD pipelines
│   ├── deploy.yml                    # Application deployment
│   └── terraform.yml                 # Infrastructure deployment
├── scripts/                          # Automation scripts
│   ├── setup-devops-agent.ps1        # Agent configuration
│   ├── trigger-incidents.ps1         # Test scenarios
│   ├── check-metrics.ps1             # Metrics verification
│   └── verify-agent-monitoring.ps1
├── docs/                             # Comprehensive documentation
│   ├── ARCHITECTURE.md               # System architecture
│   ├── SETUP.md                      # Detailed setup guide
│   ├── TESTING.md                    # Testing guide
│   └── FAQ.md                        # Troubleshooting
├── QUICKSTART.md                     # 5-step quick start
├── DEPLOYMENT_FLOW.md                # Visual deployment guide
├── COMPLETE_SYSTEM_FLOW.md           # End-to-end flow
└── README.md                         # This file
By deploying this project, you'll learn:
- Amazon ECS & Fargate (container orchestration)
- Application Load Balancer (traffic distribution)
- Amazon ECR (container registry)
- Amazon CloudWatch (monitoring & alarms)
- AWS Lambda (serverless automation)
- AWS DevOps Agent (AI-powered incident response)
- Amazon VPC (networking & security)
- IAM (roles & permissions)
- Infrastructure as Code (Terraform)
- Containerization (Docker)
- CI/CD Pipelines (GitHub Actions)
- Monitoring & Observability
- Automated Incident Response
- Self-Healing Systems
- Multi-AZ high availability
- Security groups & least privilege
- Health checks & auto-recovery
- Structured logging
- Metrics collection
- Automated testing
# Deployment
make init # Initialize Terraform
make apply # Deploy infrastructure
make build # Build Docker image
make push # Push to ECR
make deploy # Full deployment
# Testing
make test-error-spike # Test error spike
make test-memory-leak # Test memory leak
make test-cpu-spike # Test CPU spike
make test-health-failure # Test health failure
make test-all # Run all tests
# Monitoring
make logs # Tail CloudWatch logs
make alarms # Show alarm status
make status # Application status
make url # Show application URL
# Maintenance
make cleanup # Restore healthy state
make destroy # Delete all resources
make help # Show all commands

This project implements enterprise security standards:
- Private Subnets - ECS tasks run in isolated private subnets
- Security Groups - Least privilege network access
- IAM Roles - No hardcoded credentials
- Non-Root Container - Docker runs as non-root user
- ECR Scanning - Automatic image vulnerability scanning
- Encryption - AES256 encryption for ECR and S3
- VPC Endpoints - Secure AWS service access (optional)
Without DevOps Agent (manual investigation):
1. Alarm triggers at 3 AM → 5 min
2. Engineer wakes up and logs in → 5 min
3. Searches CloudWatch logs → 10 min
4. Checks ECS task status → 5 min
5. Reviews recent deployments → 10 min
6. Analyzes metrics manually → 10 min
7. Determines root cause → 15 min
8. Takes corrective action → 10 min
--------------------------------------------------
Total Time: 70 minutes
Engineer: Tired and frustrated

With DevOps Agent:
1. Alarm triggers at 3 AM → 0 min
2. Lambda playbook restarts service → 1 min
3. DevOps Agent investigates automatically → 2 min
4. Engineer reviews complete report → 5 min
5. Takes action based on recommendations → 5 min
--------------------------------------------------
Total Time: 13 minutes
Engineer: Well-rested, confident

Time Saved: 57 minutes (81% reduction)
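The totals and the percentage follow directly from the per-step durations listed in the two timelines. A short sketch of the arithmetic, with hypothetical helper names:

```javascript
// Step durations in minutes, copied from the two timelines in this README.
const manualSteps = [5, 5, 10, 5, 10, 10, 15, 10];
const agentSteps = [0, 1, 2, 5, 5];

const sumMinutes = (steps) => steps.reduce((total, m) => total + m, 0);

function compareTimelines(before, after) {
  const beforeTotal = sumMinutes(before);
  const afterTotal = sumMinutes(after);
  const saved = beforeTotal - afterTotal;
  // Percentage reduction relative to the manual timeline, rounded.
  return { before: beforeTotal, after: afterTotal, saved, percent: Math.round((saved / beforeTotal) * 100) };
}

console.log(compareTimelines(manualSteps, agentSteps));
// prints { before: 70, after: 13, saved: 57, percent: 81 }
```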
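# Restore application to healthy state
make cleanup
# Destroy all infrastructure
cd terraform
terraform destroy # Type 'yes' to confirm

Estimated cost if left running:
- Hourly: ~$0.08
- Daily: ~$1.90
- Monthly: ~$56-120
The hourly and daily figures correspond to the low end of the monthly range (the optimized 1 AZ, 1 task configuration).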
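As a quick sanity check on these figures, the sketch below projects an hourly rate to daily and monthly cost and converts a monthly figure back to an hourly rate. It assumes a 30-day month and constant usage; the helper names are hypothetical and the numbers are only the estimates quoted in this README, not AWS pricing.

```javascript
// Round to two decimal places.
const round2 = (n) => Math.round(n * 100) / 100;

// Assumes constant usage and a 30-day month.
function projectCost(hourlyUsd, daysPerMonth = 30) {
  const daily = hourlyUsd * 24;
  return { hourly: hourlyUsd, daily: round2(daily), monthly: round2(daily * daysPerMonth) };
}

function hourlyFromMonthly(monthlyUsd, daysPerMonth = 30) {
  return round2(monthlyUsd / (daysPerMonth * 24));
}

const low = projectCost(0.08);
console.log(`$${low.daily.toFixed(2)}/day, $${low.monthly.toFixed(2)}/month`);
// prints "$1.92/day, $57.60/month"
console.log(`$${hourlyFromMonthly(120).toFixed(2)}/hour`);
// prints "$0.17/hour"
```

So ~$0.08 per hour lines up with the ~$56/month optimized estimate, while the ~$120/month standard setup works out to roughly $0.17 per hour.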
- QUICKSTART.md - Fast 5-step deployment
- STEP_BY_STEP_GUIDE.md - Detailed beginner guide
- DEPLOYMENT_FLOW.md - Visual deployment flow
- COMPLETE_SYSTEM_FLOW.md - End-to-end system flow
- docs/ARCHITECTURE.md - Architecture deep dive
- docs/SETUP.md - Detailed setup with troubleshooting
- docs/TESTING.md - Complete testing guide
- docs/FAQ.md - Common issues and solutions
Contributions are welcome! Please feel free to submit a Pull Request.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- AWS DevOps Agent team for the amazing AI-powered incident response service
- AWS for providing comprehensive cloud services
- The open-source community for tools and inspiration
- Issues: GitHub Issues
- Documentation: Check the docs/ directory
- FAQ: See docs/FAQ.md
This project is perfect for:
- Learning - Understand AWS DevOps Agent and ECS deployment
- Proof of Concept - Demonstrate DevOps Agent value to stakeholders
- Template - Use as a starting point for production applications
- Training - Practice incident response and monitoring
- Portfolio - Showcase DevOps and cloud engineering skills
If you find this project helpful, please consider giving it a star!
- Total Files: 50+
- Lines of Code: ~5,000+
- AWS Resources: 40+
- Documentation: 15+ guides
- Test Scenarios: 7 realistic incidents
- Setup Time: 20 minutes
- Time Savings: about 81% reduction in MTTR (70 minutes to 13 minutes in the timeline comparison above)
Built with ❤️ by Vansh Shah
Ready to revolutionize your incident response? Get started with the setup steps above.
