# Chaos Engineering Experiments

Chaos Mesh experiments for demonstrating Kubernetes resilience and self-healing. All experiment YAMLs are in
`k8s/chaos/`.

## Prerequisites

- Chaos Mesh installed in the cluster (`helm install chaos-mesh` — already done)
- Pipeline running with documents flowing through Redis Streams
- Grafana dashboard open to observe the effects

## Experiment 1: Pod Kill (Self-Healing)

Kills classify-worker pods. K8s restarts them within seconds. Redis re-delivers unacknowledged messages — zero data loss,
zero manual intervention.

```bash
# Watch pods (run in a separate terminal)
kubectl get pods -w

# Apply the experiment
kubectl apply -f k8s/chaos/pod-kill.yaml

# Check experiment status
kubectl get podchaos

# Clean up
kubectl delete podchaos pod-kill-classify-worker
```
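
For reference, `k8s/chaos/pod-kill.yaml` likely looks something like the minimal sketch below; the label selector and `mode` are assumptions about how the Deployment is labelled, so check the file in the repo for the real values.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-classify-worker
spec:
  action: pod-kill          # delete the pod; the Deployment recreates it
  mode: one                 # kill one random matching pod per run (assumed)
  selector:
    labelSelectors:
      app: classify-worker  # assumed pod label for the classify-worker Deployment
```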

**What to watch for:**

- The classify-worker pod disappears and a new one starts within seconds
- The pod restart count increments (visible in the Grafana "Pod Restarts" panel)
- Pipeline continues processing after the new pod is ready

## Experiment 2: Network Delay (Resilience)

Injects 500ms of latency (with 100ms jitter) into store-worker pod traffic for 2 minutes. Simulates degraded connectivity
to PostgreSQL or Azure Blob Storage.

```bash
# Apply the experiment
kubectl apply -f k8s/chaos/network-delay.yaml

# Check experiment status
kubectl get networkchaos

# Clean up (or wait 2 minutes for auto-expiry)
kubectl delete networkchaos network-delay-store-worker
```
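
For reference, `k8s/chaos/network-delay.yaml` is expected to look roughly like this sketch (the label selector and `mode` are assumptions; the latency, jitter, and duration come from the description above):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-store-worker
spec:
  action: delay
  mode: all                 # apply to every matching pod (assumed)
  selector:
    labelSelectors:
      app: store-worker     # assumed pod label for the store-worker Deployment
  delay:
    latency: "500ms"
    jitter: "100ms"
  duration: "2m"            # the experiment expires on its own after 2 minutes
```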

**What to watch for:**

- Pipeline slows but does not break
- Redis queue depth increases (store-worker processing is slower)
- Grafana network I/O panel shows the latency effect
- After expiry, throughput returns to normal

## Experiment 3: CPU Stress (KEDA Autoscaling)

Burns 80% CPU on classify-worker pods for 2 minutes. This is the most impressive experiment — it triggers KEDA
autoscaling.

```bash
# Apply the experiment
kubectl apply -f k8s/chaos/cpu-stress.yaml

# Generate load so messages pile up in the queue
# (use the dashboard "Generate" button, or repeat this curl)
curl -X POST http://51.138.91.82/api/generate -H 'Content-Type: application/json' -d '{"count": 10}'

# Watch KEDA scale up workers
kubectl get pods -w

# Check experiment status
kubectl get stresschaos

# Clean up (or wait 2 minutes for auto-expiry)
kubectl delete stresschaos cpu-stress-classify-worker
```
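
For reference, `k8s/chaos/cpu-stress.yaml` probably resembles this sketch (the label selector and stress worker count are assumptions; the 80% load and 2-minute duration come from the description above):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-classify-worker
spec:
  mode: all                 # stress every matching pod (assumed)
  selector:
    labelSelectors:
      app: classify-worker  # assumed pod label for the classify-worker Deployment
  stressors:
    cpu:
      workers: 1            # stress workers per pod (assumed)
      load: 80              # target ~80% CPU load per worker
  duration: "2m"            # the experiment expires on its own after 2 minutes
```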

**What to watch for:**

- Classify-workers slow down and Redis queue depth rises
- KEDA detects the lag and scales up additional classify-worker pods (see the ScaledObject sketch after this list)
- New pods process the backlog, queue depth drops
- After the stress ends and the 60s cooldown elapses, KEDA scales back down to 1 replica

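
The scaling behaviour above is driven by a KEDA `ScaledObject` that scales the classify-worker Deployment on Redis Streams backlog. The real manifest is not shown in this document; the following is only an illustrative sketch, and the Redis address, stream name, consumer group, threshold, and max replica count are assumptions:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: classify-worker-scaler
spec:
  scaleTargetRef:
    name: classify-worker                # assumed Deployment name
  minReplicaCount: 1                     # matches the scale-back-to-1 behaviour above
  maxReplicaCount: 5                     # assumed upper bound
  cooldownPeriod: 60                     # 60s cooldown before scaling back down
  triggers:
    - type: redis-streams
      metadata:
        address: "redis:6379"            # assumed Redis service address
        stream: documents                # assumed stream name
        consumerGroup: classify-workers  # assumed consumer group
        pendingEntriesCount: "10"        # scale up once the pending backlog exceeds this (assumed)
```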
## Recommended Demo Order

1. **Pod Kill** — quick (30s), shows self-healing
2. **CPU Stress** — most visual (2min), shows KEDA autoscaling, generate load while it runs
3. **Network Delay** — optional, shows resilience under degraded conditions

## Cleaning Up All Experiments

```bash
kubectl delete podchaos,networkchaos,stresschaos --all
```