VanshShah174/AWS-Devops-Agent

AWS DevOps Agent Demo - AI-Powered Incident Response

License: MIT Terraform AWS Node.js

Demonstration of AWS DevOps Agent's AI-powered incident investigation capabilities in a containerized ECS environment. Transform incident response from 45 minutes of manual investigation to 2 minutes of automated AI analysis. Complete self-healing infrastructure with automated remediation, comprehensive monitoring, and intelligent root cause analysis.


🎯 What This Project Demonstrates

This is a complete, enterprise-grade DevOps automation platform that shows how AWS DevOps Agent can revolutionize incident response:

  • 🤖 AI-Powered Investigation - Automatic incident analysis with root cause identification
  • 🔄 Self-Healing Infrastructure - Automated remediation via Lambda playbooks
  • 📊 Complete Observability - CloudWatch logs, metrics, alarms, and dashboards
  • 🚀 Zero-Downtime Deployments - ECS Fargate with rolling updates
  • 🔗 Code Correlation - Links incidents to GitHub commits automatically
  • ⚡ ~81% Faster Resolution - Cuts total response time from about 70 minutes to 13 in the Real-World Benefits comparison

🏗️ Architecture Overview

AWS DevOps Agent Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                         GitHub Repository                            │
│            (Source Code + Terraform IaC + Workflows)                 │
└────┬──────────────────┬────────────────────┬─────────────────────────┘
     │                  │                    │
     │ terraform/**     │ app/**             │ (Code Correlation)
     ▼                  ▼                    ▼
┌──────────────────────────────┐  ┌──────────────────────────────────────┐
│  Terraform Workflow (IaC)    │  │      Deploy Workflow (Application)   │
│  ┌────────────────────────┐  │  │  ┌──────────────────────────────┐    │
│  │ 1. Format & Validate   │  │  │  │  1. Checkout Code            │    │
│  │ 2. terraform init      │  │  │  │  2. Build Docker Image       │    │
│  │ 3. terraform plan      │  │  │  │  3. Push to ECR              │    │
│  │ 4. PR: Comment plan    │  │  │  │  4. Update ECS Task Def      │    │
│  │ 5. terraform apply     │  │  │  │  5. Deploy to ECS (Rolling)  │    │
│  │    (manual trigger)    │  │  │  │  6. Store Metadata in SSM    │    │
│  └──────────┬─────────────┘  │  │  └────────────────┬─────────────┘    │
│             │                │  │                   │                  │
│             │ Provisions     │  │                   │                  │
│             │ VPC, ECR, ECS, │  │                   │                  │
│             │ ALB, CW, etc.  │  │                   │                  │
└─────────────┼────────────────┘  └───────────────────┼──────────────────┘
              │                                       │
              │                                       │ (Deployment Metadata)
              │                                       ▼
              │              ┌─────────────────────────────────────┐
              │              │   SSM Parameter Store               │
              │              │   - Deployment Timestamps           │
              │              │   - Commit SHA & Messages           │
              │              │   - Image Tags                      │
              │              └─────────────────────────────────────┘
              │
              │ (Provisions: VPC, ECR, ECS, ALB, CW, Lambda, SNS, S3)
              │
              └──────────────────────────┬──────────────────────────────┘
                                         │
                                         │ (Push Image from Deploy workflow)
                                         ▼
┌──────────────────────────────────────────────────────────────────────┐
│  AWS Resources (Provisioned by Terraform)                            │
├──────────────────────────────────────────────────────────────────────┤
│                       Amazon ECR Registry                            │
│                    (Container Image Storage)                         │
│                    - Image Versioning                                │
│                    - Lifecycle Policies                              │
└────────────────────────────┬─────────────────────────────────────────┘
                             │
                             │ (Pull Image)
                             ▼
┌──────────────────────────────────────────────────────────────────────┐
│                         AWS VPC (10.0.0.0/16)                        │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │              PUBLIC SUBNETS (Multi-AZ)                       │    │
│  │         10.0.0.0/24 (AZ-a) | 10.0.1.0/24 (AZ-b)              │    │
│  │                                                              │    │
│  │  ┌────────────────────────────────────────────────────────┐  │    │
│  │  │      Internet Gateway                                  │  │    │
│  │  └────────────────────┬───────────────────────────────────┘  │    │
│  │                       │                                      │    │
│  │  ┌────────────────────▼───────────────────────────────────┐  │    │
│  │  │   Application Load Balancer (ALB)                      │  │    │
│  │  │   - Health Checks (/health endpoint)                   │  │    │
│  │  │   - Traffic Distribution (Round Robin)                 │  │    │
│  │  │   - SSL Termination                                    │  │    │
│  │  └────────────────────┬───────────────────────────────────┘  │    │
│  │                       │                                      │    │
│  │  ┌────────────────────▼───────────────────────────────────┐  │    │
│  │  │   NAT Gateway (AZ-a & AZ-b)                            │  │    │
│  │  │   - Enables private subnet internet access             │  │    │
│  │  │   - For ECR image pulls & CloudWatch                   │  │    │
│  │  └────────────────────────────────────────────────────────┘  │    │
│  └───────────────────────┼──────────────────────────────────────┘    │
│                          │                                           │
│  ┌───────────────────────▼──────────────────────────────────────┐    │
│  │              PRIVATE SUBNETS (Multi-AZ)                      │    │
│  │        10.0.10.0/24 (AZ-a) | 10.0.11.0/24 (AZ-b)             │    │
│  │                                                              │    │
│  │  ┌────────────────────────────────────────────────────────┐  │    │
│  │  │         ECS Fargate Cluster                            │  │    │
│  │  │                                                        │  │    │
│  │  │  ┌──────────────┐      ┌──────────────┐                │  │    │
│  │  │  │   Task 1     │      │   Task 2     │                │  │    │
│  │  │  │  (Container) │      │  (Container) │                │  │    │
│  │  │  │  - Node.js   │      │  - Node.js   │                │  │    │
│  │  │  │  - Port 3000 │      │  - Port 3000 │                │  │    │
│  │  │  │  - Health    │      │  - Health    │                │  │    │
│  │  │  │    Checks    │      │    Checks    │                │  │    │
│  │  │  └──────┬───────┘      └──────┬───────┘                │  │    │
│  │  │         │                     │                        │  │    │
│  │  │         └──────────┬──────────┘                        │  │    │
│  │  │                    │ Logs & Metrics                    │  │    │
│  │  └────────────────────┼───────────────────────────────────┘  │    │
│  └───────────────────────┼──────────────────────────────────────┘    │
│                          │                                           │
└──────────────────────────┼───────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────────────┐
│                    CloudWatch Logs & Metrics                         │
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                │
│  │  Log Groups  │  │   Metrics    │  │   Alarms     │                │
│  │  - Errors    │  │  - CPU       │  │  - CPU High  │                │
│  │  - Access    │  │  - Memory    │  │  - Memory    │                │
│  │  - Health    │  │  - 5XX       │  │  - 5XX       │                │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘                │
│         │                 │                 │                        │
│         └──────────┬──────┴─────────────────┘                        │
│                    │                                                 │
│                    ▼                                                 │
│         ┌──────────────────┐    ┌──────────────────┐  ┌────────┐     │
│         │  Lambda Playbook │    │  SNS Topic       │  │ S3     │     │
│         │  - Auto-Restart  │◄───│  - Email Alerts  │  │ Logs   │     │
│         │  - Auto-Scale    │    │  - Notifications │  │ Export │     │
│         │  - Force Deploy  │    └──────────────────┘  └────────┘     │
│         └──────────┬───────┘                                         │
│                    │                                                 │
│                    │ (Remediation Actions)                           │
│                    ▼                                                 │
│         ┌──────────────────────────────────────────┐                 │
│         │      AWS DevOps Agent (AI)               │                 │
│         │  ┌────────────────────────────────────┐  │                 │
│         │  │ - Log Analysis (CloudWatch)        │  │                 │
│         │  │ - Pattern Detection                │  │                 │
│         │  │ - Root Cause Analysis              │  │                 │
│         │  │ - ECS Task Status Monitoring       │  │                 │
│         │  │ - Code Correlation (GitHub + SSM)  │  │                 │
│         │  │ - Deployment History Analysis      │  │                 │
│         │  └────────────────────────────────────┘  │                 │
│         └──────────┬───────────────────────────────┘                 │
│                    │                                                 │
│                    │ (Reads Deployment Metadata)                     │
│                    ▼                                                 │
│         ┌─────────────────────────────────────┐                      │
│         │   SSM Parameter Store               │                      │
│         │   (Deployment Correlation)          │                      │
│         └─────────────────────────────────────┘                      │
└──────────────────────────────────────────────────────────────────────┘

🔄 Deployment Flow

Terraform Infrastructure (IaC):

  1. Terraform Changes → Push to terraform/** triggers workflow
  2. Plan → terraform plan runs on every push/PR
  3. PR Review → Plan output posted as PR comment for review
  4. Apply → terraform apply runs on manual workflow trigger
  5. Provision → Creates/updates VPC, ECR, ECS, ALB, CloudWatch, Lambda, SNS, S3, IAM

Application CI/CD Pipeline:

  1. Code Push → GitHub repository receives changes to app/**
  2. GitHub Actions → Deploy workflow triggers on push to main
  3. Build & Push → Docker image built and pushed to ECR with commit SHA tag
  4. ECS Update → Task definition updated with new image
  5. Rolling Deployment → ECS performs zero-downtime rolling update
  6. Metadata Storage → Deployment info stored in SSM for correlation
  7. DevOps Agent → Can correlate incidents with specific deployments
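
The metadata step (6) could be sketched as below. This is an illustration, not the repository's workflow code: the parameter path and the JSON field names are assumptions made up for the example. Only the `Name`, `Value`, `Type`, and `Overwrite` keys follow the input shape of the SSM `PutParameter` API.

```javascript
// Sketch: build the SSM PutParameter input for one deployment record.
// The parameter path and the record's field names are hypothetical.
function buildDeploymentParameter({ commitSha, commitMessage, imageTag, deployedAt }) {
  // Reject anything that does not look like a short or full Git SHA.
  if (!/^[0-9a-f]{7,40}$/.test(commitSha)) {
    throw new Error(`Unexpected commit SHA: ${commitSha}`);
  }
  const record = {
    commitSha,
    commitMessage,
    imageTag,
    deployedAt: deployedAt.toISOString(),
  };
  return {
    Name: "/devops-agent-demo/dev/deployments/latest", // assumed path
    Value: JSON.stringify(record),
    Type: "String",
    Overwrite: true, // each deploy replaces the previous "latest" record
  };
}

const param = buildDeploymentParameter({
  commitSha: "0123abc",
  commitMessage: "Fix health check timeout",
  imageTag: "0123abc",
  deployedAt: new Date("2025-01-01T00:00:00Z"),
});
console.log(param.Value);
```

Keeping the record as a single JSON string means an investigator needs one read to recover the commit, image tag, and timestamp together.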

Incident Response Flow:

  1. Alarm Triggers → CloudWatch alarm detects anomaly (CPU/Memory/5XX)
  2. SNS Notification → Alarm sends notification to SNS topic
  3. Lambda Playbook → Automated remediation attempts (restart/scale/deploy)
  4. DevOps Agent → AI investigates by analyzing:
    • CloudWatch logs and metrics
    • ECS task status and recent changes
    • SSM deployment history
    • GitHub commit correlation
  5. Root Cause Report → Agent provides analysis with code links and recommendations
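
Steps 2 and 3 can be pictured with a small routing function. This is a sketch under assumptions, not the project's Lambda source: the alarm-name keywords and action names are invented for illustration. What it does rely on is that SNS hands Lambda the CloudWatch alarm as a JSON string in `Records[].Sns.Message`, with `AlarmName` and `NewStateValue` fields.

```javascript
// Sketch: choose a remediation for each CloudWatch alarm delivered via SNS.
// Alarm-name keywords and action names below are hypothetical.
const PLAYBOOK = [
  { match: /5xx/i, action: "restart-service" },
  { match: /cpu/i, action: "scale-up" },
  { match: /unhealthy/i, action: "force-new-deployment" },
];

function chooseActions(snsEvent) {
  return snsEvent.Records.map((record) => {
    // SNS delivers the alarm payload as a JSON-encoded string.
    const alarm = JSON.parse(record.Sns.Message);
    if (alarm.NewStateValue !== "ALARM") {
      // OK and INSUFFICIENT_DATA transitions need no remediation.
      return { alarm: alarm.AlarmName, action: "none" };
    }
    const rule = PLAYBOOK.find((r) => r.match.test(alarm.AlarmName));
    // Alarms without a playbook entry are left for investigation.
    return { alarm: alarm.AlarmName, action: rule ? rule.action : "notify-only" };
  });
}

const sampleEvent = {
  Records: [
    { Sns: { Message: JSON.stringify({ AlarmName: "demo-high-5xx", NewStateValue: "ALARM" }) } },
    { Sns: { Message: JSON.stringify({ AlarmName: "demo-memory-high", NewStateValue: "ALARM" }) } },
    { Sns: { Message: JSON.stringify({ AlarmName: "demo-cpu-high", NewStateValue: "OK" }) } },
  ],
};
console.log(chooseActions(sampleEvent));
```

A real handler would follow each chosen action with the matching ECS or Application Auto Scaling call. Those calls are left out here so the routing logic stays runnable without AWS credentials.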

✨ Key Features

🔍 Automated Incident Investigation

  • AI-Powered Analysis - DevOps Agent automatically investigates when alarms trigger
  • Log Pattern Detection - Identifies error patterns and anomalies in CloudWatch Logs
  • Code Correlation - Links incidents to specific GitHub commits and deployments
  • Root Cause Analysis - Provides likely causes with confidence scores
  • Actionable Recommendations - Suggests remediation steps and rollback commands

🛠️ Self-Healing Infrastructure

  • Lambda Playbooks - Automated remediation for common issues
  • Auto-Restart - Restarts ECS services on 5XX error spikes
  • Auto-Scale - Scales up on high CPU utilization
  • Health Recovery - Forces new deployments on unhealthy targets
  • Email Notifications - Reports all automated actions

📊 Complete Observability

  • 5 CloudWatch Alarms - CPU, memory, 5XX errors, unhealthy targets, error count
  • Custom Dashboard - Real-time visualization of all metrics
  • Structured Logging - JSON logs with full context
  • Prometheus Metrics - Application-level metrics collection
  • S3 Log Export - Long-term log storage and analysis
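
As an illustration of the structured-logging bullet, the sketch below writes one JSON object per line. The field names (`timestamp`, `level`, `message`, and the context keys) are assumptions for the example, not necessarily those used in app/src/index.js.

```javascript
// Sketch: emit one JSON object per log line so each field can be queried.
// Field names are hypothetical; adapt them to the application's own schema.
function logEvent(level, message, context = {}) {
  const line = JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...context, // request-specific fields such as status code or path
  });
  console.log(line);
  return line;
}

const logged = logEvent("error", "Upstream request failed", {
  statusCode: 500,
  path: "/api/orders",
});
```

With logs in this shape, a CloudWatch Logs metric filter using a JSON pattern such as `{ $.level = "error" }` could count error lines and feed an alarm.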

🚀 Production-Ready Infrastructure

  • Multi-AZ Deployment - High availability across availability zones
  • Private Subnets - ECS tasks run in isolated private subnets
  • Security Groups - Least privilege network access
  • Auto-Scaling - Fargate with FARGATE_SPOT support
  • Zero-Downtime - Rolling deployments with health checks

📋 Prerequisites

Before you begin, ensure you have:

  • AWS Account with admin access
  • AWS CLI v2.x configured (aws configure)
  • Terraform v1.0+ installed
  • Docker v20.x+ installed and running
  • Git installed
  • Node.js 18+ (for local development)
  • ~$56-120/month budget (or plan to destroy after testing)

🚀 Quick Start (20 Minutes)

1️⃣ Clone and Configure

git clone https://github.com/VanshShah174/AWS-Devops-Agent.git
cd AWS-Devops-Agent

# Copy example configuration
cp terraform/terraform.tfvars.example terraform/terraform.tfvars

# (Optional) Edit configuration
nano terraform/terraform.tfvars

2️⃣ Deploy Infrastructure

cd terraform
terraform init
terraform apply  # Type 'yes' when prompted

⏱️ This takes 10-15 minutes (NAT Gateway creation is the slowest part)

3️⃣ Build and Deploy Application

cd ..

# Using Makefile (recommended)
make build
make push

# Or manually
cd app
ECR_REPO=$(cd ../terraform && terraform output -raw ecr_repository_url)
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin $ECR_REPO
docker build -t devops-agent-demo:latest .
docker tag devops-agent-demo:latest $ECR_REPO:latest
docker push $ECR_REPO:latest

4️⃣ Wait for ECS Tasks (3-5 minutes)

# Check service status
make status

# Or manually
aws ecs describe-services \
  --cluster devops-agent-demo-dev-cluster \
  --services devops-agent-demo-dev-service \
  --query 'services[0].{desired:desiredCount,running:runningCount,pending:pendingCount}'

5️⃣ Verify Application

# Get application URL
make url

# Test health endpoint
curl "$(make -s url)/health"  # -s stops make from echoing the recipe into the URL

Expected response:

{"status":"healthy","uptime":123.456,"memory":{...}}
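
A handler that returns this shape might look like the following sketch. The project's app uses Express. To stay dependency-free, this example uses Node's built-in `http` module instead, so treat it as an approximation of the endpoint rather than the actual source.

```javascript
// Sketch: a /health endpoint built on Node's standard library only.
const http = require("node:http");

function healthPayload() {
  return {
    status: "healthy",
    uptime: process.uptime(),      // seconds since the process started
    memory: process.memoryUsage(), // rss, heapTotal, heapUsed, external, ...
  };
}

const server = http.createServer((req, res) => {
  if (req.url === "/health") {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end(JSON.stringify(healthPayload()));
  } else {
    res.writeHead(404);
    res.end();
  }
});

// Uncomment to serve on the container port shown in the architecture diagram:
// server.listen(3000);
```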

6️⃣ Setup DevOps Agent

# PowerShell (Windows)
.\scripts\setup-devops-agent.ps1

# Bash (Linux/Mac)
chmod +x scripts/setup-agent-space.sh
./scripts/setup-agent-space.sh

7️⃣ Test Incident Response

# Trigger an error spike
make test-error-spike

# Or manually (PowerShell)
.\scripts\trigger-incidents.ps1 -Scenario error-spike

What happens:

  1. Script sends 20 error requests
  2. CloudWatch alarm triggers (2-3 minutes)
  3. Lambda playbook restarts service automatically
  4. DevOps Agent investigates and analyzes
  5. You receive email notifications
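
One plausible reason for the 2-3 minute delay in step 2 is that a threshold alarm has to see several consecutive breaching periods before it changes state. The sketch below is a simplified model of that evaluation. The threshold, period length, and period count are illustrative assumptions, not the values in cloudwatch.tf.

```javascript
// Sketch: simplified threshold-alarm evaluation over the latest datapoints.
// Threshold and evaluationPeriods are illustrative, not the project's settings.
function evaluateAlarm(datapoints, { threshold, evaluationPeriods }) {
  if (datapoints.length < evaluationPeriods) return "INSUFFICIENT_DATA";
  const recent = datapoints.slice(-evaluationPeriods);
  // The alarm fires only when every recent period breaches the threshold.
  return recent.every((value) => value > threshold) ? "ALARM" : "OK";
}

// 5XX counts per one-minute period: quiet, then a 20-request burst over two periods.
const fiveXxCounts = [0, 0, 12, 8];
const settings = { threshold: 5, evaluationPeriods: 2 };

console.log(evaluateAlarm(fiveXxCounts.slice(0, 3), settings)); // OK: only one breaching period so far
console.log(evaluateAlarm(fiveXxCounts, settings));             // ALARM: two breaching periods in a row
```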

8️⃣ View Results

CloudWatch Dashboard:

https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#dashboards:name=devops-agent-demo-dev

DevOps Agent Console:

https://console.aws.amazon.com/devopsagent/

🧪 Testing Scenarios

The project includes 7 realistic incident scenarios. The four listed here have dedicated make targets:

Error Spike

make test-error-spike

Triggers 20x 500 errors → High 5XX alarm → Auto-restart service

Memory Leak

make test-memory-leak

Allocates 100MB arrays → Memory alarm → Investigation

CPU Spike

make test-cpu-spike

CPU-intensive operations → CPU alarm → Auto-scale up

Health Check Failure

make test-health-failure

Disables health endpoint → Unhealthy targets alarm → Force new deployment

All Scenarios

make test-all

Runs all test scenarios sequentially


📊 What Gets Deployed

AWS Resources (40+ resources)

Networking:

  • VPC with public/private subnets (2 AZs)
  • Internet Gateway
  • NAT Gateway
  • Route tables and associations
  • Security groups

Compute:

  • ECS Fargate cluster
  • ECS service with auto-scaling
  • Task definition with health checks
  • Application Load Balancer
  • Target group

Storage:

  • ECR repository with lifecycle policies
  • S3 bucket for log exports

Monitoring:

  • 5 CloudWatch alarms
  • CloudWatch dashboard
  • Log groups with 7-day retention
  • Log metric filters
  • SNS topic for notifications

Automation:

  • Lambda playbook function
  • DevOps Agent IAM roles
  • SSM parameters for configuration

Estimated Monthly Cost:

  • Standard (2 AZ, 2 tasks): ~$120/month
  • Optimized (1 AZ, 1 task): ~$56/month

📁 Project Structure

aws-devops-agent-demo/
├── app/                          # Node.js application
│   ├── src/
│   │   └── index.js             # Express API with 8 error endpoints
│   ├── Dockerfile               # Multi-stage Docker build
│   └── package.json             # Dependencies
├── terraform/                    # Infrastructure as Code
│   ├── main.tf                  # Provider configuration
│   ├── vpc.tf                   # Network resources
│   ├── ecs.tf                   # ECS cluster & service
│   ├── alb.tf                   # Load balancer
│   ├── cloudwatch.tf            # Monitoring & alarms
│   ├── devops-agent.tf          # DevOps Agent setup
│   ├── playbook-lambda.tf       # Automated remediation
│   └── s3-logs.tf               # Log storage
├── .github/workflows/            # CI/CD pipelines
│   ├── deploy.yml               # Application deployment
│   └── terraform.yml            # Infrastructure deployment
├── scripts/                      # Automation scripts
│   ├── setup-devops-agent.ps1   # Agent configuration
│   ├── trigger-incidents.ps1    # Test scenarios
│   ├── check-metrics.ps1        # Metrics verification
│   └── verify-agent-monitoring.ps1
├── docs/                         # Comprehensive documentation
│   ├── ARCHITECTURE.md          # System architecture
│   ├── SETUP.md                 # Detailed setup guide
│   ├── TESTING.md               # Testing guide
│   └── FAQ.md                   # Troubleshooting
├── QUICKSTART.md                 # 5-step quick start
├── DEPLOYMENT_FLOW.md            # Visual deployment guide
├── COMPLETE_SYSTEM_FLOW.md       # End-to-end flow
└── README.md                     # This file

🎓 Learning Outcomes

By deploying this project, you'll learn:

AWS Services

  • ✅ Amazon ECS & Fargate (container orchestration)
  • ✅ Application Load Balancer (traffic distribution)
  • ✅ Amazon ECR (container registry)
  • ✅ Amazon CloudWatch (monitoring & alarms)
  • ✅ AWS Lambda (serverless automation)
  • ✅ AWS DevOps Agent (AI-powered incident response)
  • ✅ Amazon VPC (networking & security)
  • ✅ IAM (roles & permissions)

DevOps Practices

  • ✅ Infrastructure as Code (Terraform)
  • ✅ Containerization (Docker)
  • ✅ CI/CD Pipelines (GitHub Actions)
  • ✅ Monitoring & Observability
  • ✅ Automated Incident Response
  • ✅ Self-Healing Systems

Best Practices

  • ✅ Multi-AZ high availability
  • ✅ Security groups & least privilege
  • ✅ Health checks & auto-recovery
  • ✅ Structured logging
  • ✅ Metrics collection
  • ✅ Automated testing

🔧 Available Commands

# Deployment
make init          # Initialize Terraform
make apply         # Deploy infrastructure
make build         # Build Docker image
make push          # Push to ECR
make deploy        # Full deployment

# Testing
make test-error-spike      # Test error spike
make test-memory-leak      # Test memory leak
make test-cpu-spike        # Test CPU spike
make test-health-failure   # Test health failure
make test-all              # Run all tests

# Monitoring
make logs          # Tail CloudWatch logs
make alarms        # Show alarm status
make status        # Application status
make url           # Show application URL

# Maintenance
make cleanup       # Restore healthy state
make destroy       # Delete all resources
make help          # Show all commands

🛡️ Security Best Practices

This project implements enterprise security standards:

  • ✅ Private Subnets - ECS tasks run in isolated private subnets
  • ✅ Security Groups - Least privilege network access
  • ✅ IAM Roles - No hardcoded credentials
  • ✅ Non-Root Container - Docker runs as non-root user
  • ✅ ECR Scanning - Automatic image vulnerability scanning
  • ✅ Encryption - AES256 encryption for ECR and S3
  • ✅ VPC Endpoints - Secure AWS service access (optional)

📈 Real-World Benefits

Before DevOps Agent (Manual Investigation)

1. Alarm triggers at 3 AM                    → 5 min
2. Engineer wakes up and logs in             → 5 min
3. Searches CloudWatch logs                  → 10 min
4. Checks ECS task status                    → 5 min
5. Reviews recent deployments                → 10 min
6. Analyzes metrics manually                 → 10 min
7. Determines root cause                     → 15 min
8. Takes corrective action                   → 10 min
────────────────────────────────────────────────────
Total Time: 70 minutes
Engineer: Tired and frustrated 😫

After DevOps Agent (Automated)

1. Alarm triggers at 3 AM                    → 0 min
2. Lambda playbook restarts service          → 1 min
3. DevOps Agent investigates automatically   → 2 min
4. Engineer reviews complete report          → 5 min
5. Takes action based on recommendations     → 5 min
────────────────────────────────────────────────────
Total Time: 13 minutes
Engineer: Well-rested, confident 😊
Time Saved: 57 minutes (81% reduction)
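
The totals and the percentage above can be checked with a few lines of arithmetic. The step durations are copied from the two timelines, and nothing else is assumed.

```javascript
// Worked check of the two timelines (minutes per step, as listed above).
const manualSteps = [5, 5, 10, 5, 10, 10, 15, 10];
const automatedSteps = [0, 1, 2, 5, 5];

const sum = (values) => values.reduce((total, v) => total + v, 0);

const manualTotal = sum(manualSteps);       // 70
const automatedTotal = sum(automatedSteps); // 13
const minutesSaved = manualTotal - automatedTotal;                      // 57
const reductionPct = Math.round((minutesSaved / manualTotal) * 100);    // 81

console.log({ manualTotal, automatedTotal, minutesSaved, reductionPct });
```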

🧹 Cleanup

⚠️ IMPORTANT: Always destroy resources when done to avoid charges!

# Restore application to healthy state
make cleanup

# Destroy all infrastructure
cd terraform
terraform destroy  # Type 'yes' to confirm

Estimated cost if left running:

  • Hourly: ~$0.08-0.17
  • Daily: ~$1.90-4.00
  • Monthly: ~$56-120
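
To relate the monthly estimates to hourly and daily figures, a rough conversion is enough. The sketch assumes a 30-day month and continuous running. Actual AWS billing depends on usage and region.

```javascript
// Rough conversion of a monthly estimate into daily and hourly figures.
// Assumes a 30-day month with resources running around the clock.
function perHourAndDay(monthlyUsd, daysPerMonth = 30) {
  const daily = monthlyUsd / daysPerMonth;
  return {
    hourly: Number((daily / 24).toFixed(2)),
    daily: Number(daily.toFixed(2)),
  };
}

console.log(perHourAndDay(56));  // optimized setup (1 AZ, 1 task)
console.log(perHourAndDay(120)); // standard setup (2 AZ, 2 tasks)
```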

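📚 Documentation

Detailed guides live in docs/ (ARCHITECTURE.md, SETUP.md, TESTING.md, FAQ.md), alongside QUICKSTART.md, DEPLOYMENT_FLOW.md, and COMPLETE_SYSTEM_FLOW.md in the repository root.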


🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.


🙏 Acknowledgments

  • AWS DevOps Agent team for the amazing AI-powered incident response service
  • AWS for providing comprehensive cloud services
  • The open-source community for tools and inspiration

📞 Support


🎯 Use Cases

This project is perfect for:

  • Learning - Understand AWS DevOps Agent and ECS deployment
  • Proof of Concept - Demonstrate DevOps Agent value to stakeholders
  • Template - Use as a starting point for production applications
  • Training - Practice incident response and monitoring
  • Portfolio - Showcase DevOps and cloud engineering skills

🌟 Star History

If you find this project helpful, please consider giving it a star! ⭐


📊 Project Statistics

  • Total Files: 50+
  • Lines of Code: ~5,000+
  • AWS Resources: 40+
  • Documentation: 15+ guides
  • Test Scenarios: 7 realistic incidents
  • Setup Time: 20 minutes
  • Time Savings: ~81% reduction in response time (70 → 13 minutes in the Real-World Benefits comparison)

Built with ❤️ by Vansh Shah

Ready to revolutionize your incident response? Get Started →
