Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. By deliberately injecting failures, teams discover weaknesses before they cause outages. This guide covers chaos engineering principles, patterns, and cloud-specific tooling.
Core Principles
flowchart LR
H[Form hypothesis<br/>steady-state metric] --> SS[Measure steady state]
SS --> INJ[Inject failure<br/>start with smallest blast radius]
INJ --> OBS[Observe:<br/>did the system hold?]
OBS -- yes --> EXP[Expand blast radius<br/>or move to next experiment]
OBS -- no --> FIX[Fix the weakness<br/>add to runbook]
FIX --> H
EXP --> H
The Chaos Engineering Manifesto
Build a hypothesis around steady state - Define what "normal" looks like using measurable metrics
Vary real-world events - Inject failures that mirror real production incidents
Run experiments in production - Test in the environment that matters most
Automate experiments to run continuously - Make chaos testing part of CI/CD
Minimize blast radius - Start small and expand as confidence grows
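The loop in the flowchart and the principles above can be sketched in a few lines of Python. This is a minimal illustration, not a real framework: the `Experiment` class, its field names, and the default blast-radius steps are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """One chaos experiment. All field names are hypothetical."""
    name: str
    steady_state: Callable[[], bool]   # True while the metric is within bounds
    inject: Callable[[float], None]    # receives the blast radius (fraction of targets)
    rollback: Callable[[], None]       # undoes the injected failure


def run_experiment(exp: Experiment, radii=(0.01, 0.05, 0.25)) -> dict:
    """Expand the blast radius step by step, stopping at the first failed check."""
    if not exp.steady_state():
        # Never inject failures into a system that is already unhealthy.
        return {"name": exp.name, "status": "aborted", "max_radius": 0.0}

    max_radius = 0.0
    for radius in radii:
        exp.inject(radius)
        try:
            held = exp.steady_state()
        finally:
            exp.rollback()  # always clean up, even if the check itself raises
        if not held:
            return {
                "name": exp.name,
                "status": "weakness_found",
                "max_radius": max_radius,
                "failed_at": radius,
            }
        max_radius = radius
    return {"name": exp.name, "status": "passed", "max_radius": max_radius}
```

A `weakness_found` result corresponds to the "fix the weakness" branch of the flowchart, and `passed` to the "expand or move on" branch.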
Steady State Metrics
| Category | Metrics | Example Threshold |
| --- | --- | --- |
| Availability | Uptime percentage, success rate | > 99.95% |
| Latency | p50, p95, p99 response time | p99 < 500ms |
| Throughput | Requests per second | > 1,000 RPS |
| Error Rate | 5xx errors, failed transactions | < 0.1% |
| Saturation | CPU, memory, disk, network utilization | < 80% |
| Business Metrics | Orders per minute, sign-ups | Within 2 standard deviations |
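As a rough sketch, a few of these thresholds can be turned into an automated steady-state check. The function names, parameters, and the choice of a nearest-rank percentile are assumptions for illustration; in practice the inputs would come from your monitoring system.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_steady_state(latencies_ms, total_requests, failed_requests,
                       orders_per_min, baseline_orders):
    """Return one pass/fail flag per metric, using the example thresholds above."""
    mean = statistics.mean(baseline_orders)
    stdev = statistics.stdev(baseline_orders)  # sample standard deviation
    return {
        "latency_p99": percentile(latencies_ms, 99) < 500,              # p99 < 500ms
        "error_rate": failed_requests / total_requests < 0.001,         # < 0.1%
        "orders_per_min": abs(orders_per_min - mean) <= 2 * stdev,      # within 2 std devs
    }
```

An experiment would call `check_steady_state` before and during injection and treat any `False` value as a failed hypothesis.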
Failure Injection Patterns
Infrastructure Failures
| Failure Type | Description | What It Tests |
| --- | --- | --- |
| Instance Termination | Kill a VM/container | Auto-scaling, self-healing |
| AZ/Zone Failure | Simulate entire zone outage | Multi-AZ redundancy |
| Region Failure | Simulate entire region outage | Multi-region failover |
| Disk Failure | Corrupt or fill disk | Data durability, alerting |
| Network Partition | Block traffic between services | Timeout handling, circuit breakers |
| DNS Failure | Return errors for DNS queries | DNS failover, caching |
| Clock Skew | Offset system clock | Time-dependent logic, certificates |
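For instance termination, keeping the blast radius small usually comes down to how targets are chosen. The sketch below picks a bounded random subset of instance IDs; `pick_targets` and its parameters are hypothetical, and the actual termination call is left to whatever tool or API you use.

```python
import math
import random


def pick_targets(instances, fraction, min_survivors=1, rng=None):
    """Pick a random subset of instances to terminate.

    At most `fraction` of the fleet is selected (rounded down), and at least
    `min_survivors` instances are always left untouched.
    """
    rng = rng or random.Random()
    limit = min(
        math.floor(len(instances) * fraction),
        len(instances) - min_survivors,
    )
    return rng.sample(instances, max(0, limit))
```

Passing a seeded `random.Random` as `rng` makes a run reproducible, which helps when an experiment needs to be replayed after a fix.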
Application Failures
| Failure Type | Description | What It Tests |
| --- | --- | --- |
| Latency Injection | Add artificial delay to requests | Timeout configuration, user experience |
| Error Injection | Return errors from dependencies | Error handling, fallback logic |
| Memory Leak | Gradually consume memory | OOM handling, auto-scaling |
| CPU Stress | Consume CPU resources | Performance degradation, scaling |
| Thread Pool Exhaustion | Consume all threads | Connection pooling, circuit breakers |
| Certificate Expiry | Use expired certificates | Certificate rotation, monitoring |
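Latency and error injection can be prototyped at the application level with a small decorator before reaching for a dedicated tool. This is a sketch under stated assumptions: `chaos`, `InjectedFault`, and the parameter names are invented for the example, and the `sleep` and `rng` hooks exist only to make the behaviour testable.

```python
import random
import time
from functools import wraps


class InjectedFault(RuntimeError):
    """Raised in place of a real dependency error."""


def chaos(latency_s=0.0, error_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap a function so each call is delayed and fails with a given probability."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)              # latency injection
            if rng.random() < error_rate:     # error injection
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Wrapping a dependency call with, for example, `@chaos(latency_s=0.3, error_rate=0.05)` lets you observe whether callers' timeouts and fallbacks behave as expected.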
Data Failures
| Failure Type | Description | What It Tests |
| --- | --- | --- |
| Database Failover | Force primary to secondary switch | Failover time, connection handling |
| Replication Lag | Introduce replication delay | Read-after-write consistency |
| Data Corruption | Inject corrupt data | Validation, error handling |
| Cache Eviction | Clear cache entirely | Cache miss handling, thundering herd |
| Queue Backlog | Flood message queue | Backpressure, dead letter handling |
Dependency Failures
| Failure Type | Description | What It Tests |
| --- | --- | --- |
| Third-Party API Outage | Block external API calls | Fallback mechanisms, graceful degradation |
| Payment Gateway Failure | Simulate payment provider outage | Retry logic, user notification |
| CDN Failure | Simulate CDN outage | Origin failover, cache headers |
| Authentication Service Down | Block auth service | Session handling, cached tokens |
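Dependency experiments mostly verify that fallbacks and circuit breakers behave as intended. The sketch below is a deliberately small circuit breaker with a fallback, useful as a reference point for what such an experiment should observe. The class and parameter names are assumptions, and a production service would more likely use an established resilience library.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after repeated failures, retries after a cooldown."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()      # open: skip the dependency entirely
            self.opened_at = None      # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0              # success closes the circuit again
        return result
```

During an outage experiment, the expected observation is that calls to the blocked dependency stop after `max_failures` attempts and users receive the fallback response instead of an error.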
Chaos Engineering Tools
Cloud-Native Tools
| Tool | Cloud | Features |
| --- | --- | --- |
| AWS Fault Injection Service (FIS) | AWS | Native AWS chaos - EC2, ECS, EKS, RDS, network |
| Azure Chaos Studio | Azure | Native Azure chaos - VMs, AKS, Cosmos DB, network |
| Google Cloud Fault Injection | GCP | Limited (use third-party or custom) |
Open-Source and Third-Party Tools
| Tool | Type | Supported Targets |
| --- | --- | --- |
| Chaos Monkey (Netflix) | Instance termination | AWS EC2, Kubernetes |
| Litmus Chaos | Kubernetes-native | Pods, nodes, network, DNS, disk |
| Chaos Mesh | Kubernetes-native | Pods, network, I/O, time, JVM |
| Gremlin | Commercial platform | Multi-cloud, VMs, containers, serverless |
| Toxiproxy | Network proxy | TCP connections (latency, timeout, bandwidth) |
| Pumba | Container chaos | Docker containers (kill, pause, network) |
| PowerfulSeal | Kubernetes | Pods, nodes, network policies |
Cloud Implementation
AWS Fault Injection Service (FIS)
| Feature | Details |
| --- | --- |
| Experiment Templates | Pre-built and custom templates |
| Targets | EC2, ECS, EKS, RDS, Network, Systems Manager |
| Actions | Instance stop/terminate, network disruption, API throttle |
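To make the template, target, and action concepts concrete, the sketch below builds the body of a simple FIS experiment template as a plain Python dictionary. The helper name, the `chaos-ready` tag, the target and action labels, and the default duration are assumptions for illustration; the role and alarm ARNs must come from your own account, and the field names should be checked against the current FIS API reference before use.

```python
def build_stop_instances_template(role_arn, alarm_arn,
                                  tag_key="chaos-ready", tag_value="true",
                                  duration="PT5M"):
    """Build an FIS template body that stops one tagged EC2 instance.

    The CloudWatch alarm acts as a stop condition: if it fires, FIS halts
    the experiment. The tag and the target/action labels are illustrative.
    """
    return {
        "description": "Stop one tagged EC2 instance, then restart it",
        "roleArn": role_arn,
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn},
        ],
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",   # smallest blast radius: one instance
            },
        },
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": duration},
                "targets": {"Instances": "oneInstance"},
            },
        },
    }
```

With boto3 installed and credentials configured, a dictionary like this would typically be passed as keyword arguments to `boto3.client("fis").create_experiment_template(...)`. Building it separately keeps the template easy to review and unit test without touching AWS.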