
Commit abda7e3

johnmathewsclaude committed
Test chaos experiments on live cluster and complete demo rehearsal
- Run all 3 Chaos Mesh experiments (pod-kill, cpu-stress, network-delay) on AKS
- Fix Chaos Mesh containerd runtime config (was failing with "expected docker://")
- Fix pod-kill mode from fixed/2 to one (KEDA keeps replicas at 1 when idle)
- Update rolling update demo to target gateway (has readiness probes) not workers
- Correct Azure resource names across all docs (RG, AKS cluster, storage account)
- Update test count in README (51 → 92)
- Mark chaos mesh, rolling update, and demo rehearsal as DONE in implementation plan

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 50964d9 commit abda7e3

11 files changed

Lines changed: 142 additions & 43 deletions


.engineering-team/architecture-plan.md

Lines changed: 13 additions & 13 deletions
````diff
@@ -178,17 +178,17 @@ This is what the live demo looks like. Every architectural decision below serves
 - "KEDA monitors Redis queue depth. When documents pile up, it adds workers automatically"
 
 ### Minute 3-5: "Watch it heal"
-- Open Chaos Mesh dashboard, create a PodChaos experiment: kill 2 classify workers
-- Switch to Grafana: watch pods die and instantly restart
+- Apply PodChaos experiment: `kubectl apply -f k8s/chaos/pod-kill.yaml` (kills 1 classify worker)
+- Switch to Grafana: watch pod die and instantly restart (~8 seconds to Running+Ready)
 - Show brief latency spike, then recovery
-- "Kubernetes detected the failed pods and restarted them in seconds. No documents were lost — they stayed in the queue"
+- "Kubernetes detected the failed pod and restarted it in seconds. No documents were lost — they stayed in the queue"
 
 ### Minute 5-7: "Watch it handle a bad deployment"
-- Deploy a "buggy" version (returns 500 errors) using `kubectl set image`
-- Show rolling update: old pods still serving while new pods start failing readiness probes
-- K8s stops the rollout automatically (maxUnavailable protects the system)
-- Run `kubectl rollout undo` — instant rollback
-- "The readiness probe caught the bug. K8s stopped the rollout before it affected users. One command to rollback"
+- Deploy a "buggy" gateway version: `kubectl set image deployment/gateway gateway=acrdocumentstream.azurecr.io/gateway:buggy -n documentstream`
+- Show rolling update: old pods still serving while new pod fails to start (ImagePullBackOff)
+- Verify system still serves traffic: `curl http://51.138.91.82/health` returns 200
+- Run `kubectl rollout undo deployment/gateway -n documentstream` — instant rollback
+- "The rolling update strategy kept the old pods running. One command to rollback"
 
 ### Minute 7-8: "The CI/CD pipeline"
 - Show the GitHub Actions workflow (on screen or paper)
@@ -284,8 +284,8 @@ k8s/
 | # | Task | Details |
 |---|---|---|
 | 1.1 | Azure account setup | Create subscription, install `az` CLI, authenticate |
-| 1.2 | Create resource group | `az group create -n documentstream-rg -l westeurope` |
-| 1.3 | Create ACR | `az acr create -n documentstreamacr -g documentstream-rg --sku Basic` |
+| 1.2 | Create resource group | `az group create -n documentstream -l westeurope` |
+| 1.3 | Create ACR | `az acr create -n documentstreamacr -g documentstream --sku Basic` |
 | 1.4 | Create AKS cluster | 3x B2ms nodes, attach ACR, enable monitoring |
 | 1.5 | Install Helm charts | kube-prometheus-stack, Redis, KEDA, Chaos Mesh, ingress-nginx |
 | 1.6 | Verify cluster | `kubectl get nodes`, access Grafana, access Chaos Mesh dashboard |
@@ -410,12 +410,12 @@ This project demonstrates the following Kubernetes concepts, mapped to interview
 
 ```bash
 # Variables
-RG=documentstream-rg
+RG=documentstream
 LOCATION=westeurope
-CLUSTER=documentstream-aks
+CLUSTER=DocumentStreamManagedCluster
 ACR=documentstreamacr
 PG_SERVER=documentstream-pg
-STORAGE=documentstreamstorage
+STORAGE=documentstream
 
 # Resource Group
 az group create -n $RG -l $LOCATION
````
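The "old pods still serving" behavior in the bad-deployment demo comes from the Deployment's rolling-update budget: Kubernetes converts `maxUnavailable` (rounded down) and `maxSurge` (rounded up) into absolute pod counts. A minimal sketch of that arithmetic, assuming the K8s defaults of 25%/25% and 2 gateway replicas — the actual strategy values and replica count are not shown in this commit:

```python
import math

def rolling_update_budget(replicas: int, max_unavailable="25%", max_surge="25%"):
    """K8s rounding rules: maxUnavailable rounds down, maxSurge rounds up."""
    def pct(v):
        return int(v.rstrip("%")) / 100
    unavailable = math.floor(replicas * pct(max_unavailable))
    surge = math.ceil(replicas * pct(max_surge))
    return {
        "min_ready_during_rollout": replicas - unavailable,
        "max_total_pods": replicas + surge,
    }

# With 2 replicas: floor(0.5) = 0 pods may be unavailable and ceil(0.5) = 1 extra
# pod may be created, so both old pods stay Ready while the single new
# (image-pull-broken) pod sits in ImagePullBackOff.
print(rolling_update_budget(2))  # {'min_ready_during_rollout': 2, 'max_total_pods': 3}
```

This is why the rollout stalls safely instead of taking down the gateway: the unavailable budget is exhausted before any old pod is removed.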

.github/workflows/deploy.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -48,8 +48,8 @@ jobs:
       - name: Set AKS context
         run: |
           az aks get-credentials \
-            -n documentstream-aks \
-            -g documentstream-rg \
+            -n DocumentStreamManagedCluster \
+            -g documentstream \
             --overwrite-existing
 
       - name: Apply K8s manifests
```

CLAUDE.md

Lines changed: 0 additions & 2 deletions
```diff
@@ -35,9 +35,7 @@ loan documents.
 - **ServiceMonitor:** Prometheus scrape config for kube-prometheus-stack
 
 ### Not yet done
-- Apply Chaos Mesh experiments on live cluster
 - CI/CD deploy workflow needs AZURE_CREDENTIALS secret
-- Demo rehearsal
 
 ## Project Structure
 - `src/gateway/` — FastAPI API + web UI (dual-mode: sync or async via Redis)
```

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -105,7 +105,7 @@ src/
   gateway/       FastAPI API + web UI + Dockerfile
   worker/        Extract, classify, and semantic modules
   generator/     PDF document generator (5 templates, CLI)
-  tests/         51 pytest tests
+  tests/         92 pytest tests
   docs/          Architecture, classification, demo guide, dictionary
   demo_samples/  One committed loan scenario (5 PDFs)
   k8s/           Kubernetes manifests (base, scaling, chaos)
```

docs/chaos-experiments.md

Lines changed: 27 additions & 1 deletion
````diff
@@ -5,7 +5,16 @@ Chaos Mesh experiments for demonstrating Kubernetes resilience and self-healing.
 
 ## Prerequisites
 
-- Chaos Mesh installed in the cluster (`helm install chaos-mesh` — already done)
+- Chaos Mesh installed with containerd runtime support (AKS uses containerd, not Docker):
+  ```bash
+  helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
+    --namespace chaos-mesh \
+    --set chaosDaemon.runtime=containerd \
+    --set chaosDaemon.socketPath=/run/containerd/containerd.sock
+  ```
+  Without the containerd settings, StressChaos and NetworkChaos fail with
+  `expected docker:// but got container` errors. The `infra/helm-install.sh` script
+  already includes these settings.
 - Pipeline running with documents flowing through Redis Streams
 - Grafana dashboard open to observe the effects
 
@@ -34,6 +43,9 @@ kubectl delete podchaos pod-kill-classify-worker
 - Pod restarts counter increments (visible in Grafana "Pod Restarts" panel)
 - Pipeline continues processing after the new pod is ready
 
+**Verified result (2026-04-02):** Pod killed and replacement Running+Ready in ~8 seconds.
+Pipeline processed documents immediately after recovery. Zero data loss confirmed.
+
 ## Experiment 2: Network Delay (Resilience)
 
 Injects 500ms latency (with 100ms jitter) on store-worker pods for 2 minutes. Simulates degraded connectivity to
@@ -57,6 +69,10 @@ kubectl delete networkchaos network-delay-store-worker
 - Grafana network I/O panel shows the latency effect
 - After expiry, throughput returns to normal
 
+**Verified result (2026-04-02):** Pipeline slowed but continued processing. Store-worker
+lag reached 8 messages (normally 0) during the experiment. All messages processed after
+delay cleared.
+
 ## Experiment 3: CPU Stress (KEDA Autoscaling)
 
 Burns 80% CPU on classify-worker pods for 2 minutes. This is the most impressive experiment — it triggers KEDA
@@ -87,6 +103,16 @@ kubectl delete stresschaos cpu-stress-classify-worker
 - New pods process the backlog, queue depth drops
 - After stress ends + 60s cooldown, KEDA scales back down to 1
 
+**Verified result (2026-04-02):** CPU stress injected successfully (required containerd
+runtime fix — see Prerequisites). With 80 concurrent PDF uploads, classify lag reached 18
+messages. KEDA scaled classify-workers from 1 to 4 pods within 15 seconds. Additional pods
+were Pending due to node capacity (2 nodes) — in production, AKS Cluster Autoscaler would
+add nodes.
+
+**Note:** The `/api/generate` endpoint processes documents synchronously (bypasses Redis).
+To test KEDA scaling, use the `/api/documents` upload endpoint which queues through Redis,
+or use Locust.
+
 ## Recommended Demo Order
 
 1. **Pod Kill** — quick (30s), shows self-healing
````
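The 1→4 scale-up recorded in the CPU-stress result is consistent with the HPA target arithmetic that KEDA drives: desired replicas ≈ ceil(queue lag / per-replica target), clamped to the ScaledObject's min/max. A sketch of that calculation — the lag target of 5 and max of 8 replicas are assumptions (the commit doesn't show the ScaledObject spec), though 8 matches the "verified scaling 1→8→1" note in the implementation plan:

```python
import math

def desired_replicas(queue_lag: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """HPA-style target arithmetic: ceil(lag / target), clamped to bounds."""
    desired = math.ceil(queue_lag / target_per_replica)
    return max(min_replicas, min(max_replicas, desired))

# Observed in the experiment: classify lag of 18 messages -> 4 worker pods
print(desired_replicas(18, 5))  # 4
# Idle queue: KEDA holds the deployment at minReplicaCount
print(desired_replicas(0, 5))   # 1
```

This also explains the pod-kill fix elsewhere in this commit: with an empty queue the deployment sits at 1 replica, so a PodChaos experiment asking to kill 2 fixed pods cannot be satisfied.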

docs/demo-guide.md

Lines changed: 23 additions & 16 deletions
```diff
@@ -9,8 +9,8 @@ commands, timing, talking points, and things to name-drop with context on **why**
 
 Before the interview:
 
-- [ ] AKS cluster is running (`az aks start -n documentstream-aks -g documentstream-rg`)
-- [ ] PostgreSQL is running (`az postgres flexible-server start -n documentstream-pg -g documentstream-rg`)
+- [ ] AKS cluster is running (`az aks start -n DocumentStreamManagedCluster -g documentstream`)
+- [ ] PostgreSQL is running (`az postgres flexible-server start -n documentstream-pg -g documentstream`)
 - [ ] All pods are healthy (`kubectl get pods -n documentstream`)
 - [ ] Grafana is accessible and dashboard is loaded
 - [ ] Chaos Mesh dashboard is accessible
@@ -104,15 +104,16 @@ Before the interview:
 
 ### Minute 4-6: "Watch it heal"
 
-**Show:** Chaos Mesh dashboard — create a PodChaos experiment
+**Run:** `kubectl apply -f k8s/chaos/pod-kill.yaml`
 
-> "I'm going to kill 2 classify workers. This simulates a node failure or a process crash."
+> "I'm going to kill a classify worker. This simulates a node failure or a process crash."
 
-**Show:** Grafana — pods drop, then come back
+**Show:** `kubectl get pods -n documentstream -l app=classify-worker` — pod dies and restarts
+in ~8 seconds
 
-> "Kubernetes detected the failed pods within seconds and created replacements. The
-> documents those workers were processing? They stayed unacknowledged in the Redis
-> stream. When the new workers started, they picked up the unfinished messages.
+> "Kubernetes detected the failed pod within seconds and created a replacement. The
+> document that worker was processing? It stayed unacknowledged in the Redis
+> stream. When the new worker started, it picked up the unfinished message.
 > Zero data loss."
 
 **Explain the Redis Streams guarantee:**
@@ -126,21 +127,27 @@ Before the interview:
 
 ### Minute 6-8: "Watch it handle a bad deployment"
 
-**Run:** `kubectl set image deployment/classify-worker classify-worker=documentstreamacr.azurecr.io/worker:buggy`
+**Run:** `kubectl set image deployment/gateway gateway=acrdocumentstream.azurecr.io/gateway:buggy -n documentstream`
 
-> "I just deployed a 'buggy' version of the classify worker — it returns errors on
-> every request. Watch the rolling update."
+> "I just deployed a 'buggy' version of the gateway — pointing to an image tag that
+> doesn't exist. Watch the rolling update."
 
-**Show:** Grafana — new pods start failing readiness probes
+**Show:** `kubectl get pods -n documentstream -l app=gateway` — new pod is Pending/ImagePullBackOff,
+old pods still Running
 
-> "K8s starts the new pods, but they fail their readiness probes. K8s notices and
-> stops the rollout — the old pods keep running. The system is still serving traffic.
-> No downtime."
+> "K8s starts the new pod, but it can't pull the image. The rolling update strategy
+> keeps the old pods running — the system is still serving traffic. No downtime."
 
-**Run:** `kubectl rollout undo deployment/classify-worker`
+**Verify:** `curl http://51.138.91.82/health` — still returns 200
+
+**Run:** `kubectl rollout undo deployment/gateway -n documentstream`
 
 > "One command to rollback. The previous version is restored in seconds."
 
+**Note:** The gateway has readiness probes configured, so K8s knows not to route
+traffic to unhealthy pods. The workers don't have HTTP endpoints (they're Redis
+consumers), so the gateway is the best target for this demo.
+
 ---
 
 ### Minute 8-10: "Architecture & cost"
```
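The "zero data loss" talking point in this demo guide rests on Redis Streams consumer-group semantics: an entry delivered via `XREADGROUP` stays in the pending entries list until `XACK`, so a crashed worker's in-flight message can be claimed by its replacement (`XCLAIM`/`XAUTOCLAIM`). An illustrative in-memory model of that lifecycle — a toy sketch, not the real redis-py calls or the repo's worker code:

```python
class StreamGroup:
    """Toy model of a Redis Streams consumer group: deliver -> pending -> ack."""

    def __init__(self):
        self.queue = []    # undelivered entries
        self.pending = {}  # entry id -> (consumer, entry), delivered but unacked
        self.next_id = 0

    def xadd(self, entry):
        self.queue.append((self.next_id, entry))
        self.next_id += 1

    def xreadgroup(self, consumer):
        # Delivery moves the entry to the pending list; it is NOT removed yet.
        eid, entry = self.queue.pop(0)
        self.pending[eid] = (consumer, entry)
        return eid, entry

    def xack(self, eid):
        # Only an explicit ack removes the entry from pending.
        self.pending.pop(eid)

    def xclaim(self, eid, new_consumer):
        # A replacement worker takes ownership of a dead worker's pending entry.
        _, entry = self.pending[eid]
        self.pending[eid] = (new_consumer, entry)
        return entry

g = StreamGroup()
g.xadd("doc-1.pdf")
eid, doc = g.xreadgroup("classify-worker-a")
# worker-a is killed by the PodChaos experiment before it can xack:
# the entry stays pending, and the replacement pod claims and finishes it.
recovered = g.xclaim(eid, "classify-worker-b")
g.xack(eid)
print(recovered, len(g.pending))  # doc-1.pdf 0
```

The real pipeline gets the same guarantee from Redis itself; the model only shows why a pod kill mid-processing loses nothing.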

docs/implementation-plan.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -18,11 +18,11 @@
 | 4 | Build, push, deploy to AKS | MUST | 1-1.5h | DONE (pipeline running at 51.138.91.82) |
 | 5 | KEDA autoscaling | MUST | 1-1.5h | DONE (applied, verified scaling 1→8→1) |
 | 6 | Grafana dashboard | HIGH | 1.5-2h | DONE (9 panels, including blob storage metrics) |
-| 7 | Chaos Mesh experiments | MEDIUM | 1h | PARTIAL (YAMLs written, Chaos Mesh installed, needs apply) |
+| 7 | Chaos Mesh experiments | MEDIUM | 1h | DONE (all 3 experiments applied and verified on live cluster, containerd fix applied) |
 | 8 | Locust load testing | MEDIUM | 1h | DONE (ran against AKS, verified KEDA scaling) |
 | 9 | CI/CD deploy workflow | MEDIUM | 1h | PARTIAL (workflow written, needs AZURE_CREDENTIALS secret) |
-| 10 | Rolling update demo prep | LOW | 30min | TODO (live demo technique) |
-| 11 | Polish and demo rehearsal | MUST | 1.5-2h | TODO |
+| 10 | Rolling update demo prep | LOW | 30min | DONE (verified on gateway deployment, rollback tested) |
+| 11 | Polish and demo rehearsal | MUST | 1.5-2h | DONE (full demo rehearsal completed 2026-04-02) |
 | -- | **Day 4: Enhancements** | -- | -- | **DONE** |
 | 12 | Azure Blob Storage integration | HIGH | 2h | DONE (PDFs stored in Azure, metrics in Grafana) |
 | 13 | Switch to ONNX Runtime | MEDIUM | 15min | DONE (sentence-transformers backend="onnx", ~50MB vs ~5GB PyTorch) |
```

infra/helm-install.sh

Lines changed: 5 additions & 0 deletions
```diff
@@ -82,9 +82,14 @@ install_prometheus() {
 
 install_chaos_mesh() {
   echo "==> Installing Chaos Mesh..."
+  # AKS uses containerd (not Docker). Chaos Mesh defaults to docker runtime,
+  # which causes "expected docker:// but got containerd://" errors on all
+  # experiment types. Must set runtime + socket path explicitly.
   helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
     --namespace chaos-mesh \
     --create-namespace \
+    --set chaosDaemon.runtime=containerd \
+    --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
     --set controllerManager.resources.requests.cpu=25m \
     --set controllerManager.resources.requests.memory=64Mi \
     --set dashboard.resources.requests.cpu=25m \
```

infra/setup.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -26,7 +26,7 @@ AKS_NAME="DocumentStreamManagedCluster"
 PG_NAME="documentstream-pg"
 PG_ADMIN="documentstream"
 PG_PASSWORD="${PG_PASSWORD:?Set PG_PASSWORD environment variable before running this script}"
-STORAGE_NAME="documentstreamstorage"
+STORAGE_NAME="documentstream"
 BLOB_CONTAINER="documents"
 
 echo "==> Creating resource group..."
```
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@ (new file)

````markdown
# Chaos Mesh Testing & Demo Rehearsal

**Date:** 2026-04-02

## What happened

Ran all three Chaos Mesh experiments on the live AKS cluster and did a full demo rehearsal.

## Chaos Mesh containerd fix

The CPU stress and network delay experiments were failing with:
```
rpc error: code = Unknown desc = expected docker:// but got container
```

Root cause: AKS uses containerd as the container runtime, but Chaos Mesh defaults to Docker.
Fixed by adding Helm values to `infra/helm-install.sh`:
```
--set chaosDaemon.runtime=containerd
--set chaosDaemon.socketPath=/run/containerd/containerd.sock
```

Applied via `helm upgrade --install`. Chaos daemons restarted and all experiments work.

## Experiment results

### Pod Kill
- Pod killed and replacement Running+Ready in ~8 seconds
- Pipeline processed documents immediately after recovery
- Zero data loss confirmed
- Changed `mode: fixed` / `value: "2"` to `mode: one` — KEDA keeps replicas at 1 when
  queue is empty, so requesting to kill 2 pods fails

### CPU Stress
- Successfully injected 80% CPU burn on classify-worker
- With 80 concurrent uploads, classify lag reached 18 messages
- KEDA scaled classify-workers from 1 to 4 pods within 15 seconds
- 3 new pods were Pending due to node capacity (2 nodes instead of 3)
- In production, AKS Cluster Autoscaler would add nodes

### Network Delay
- 500ms delay injected on store-worker
- Pipeline slowed but continued processing
- Store lag reached 8 messages during the experiment
- All messages processed after delay cleared

## Demo rehearsal findings

1. **Web UI** loads in ~36ms, generate endpoint works reliably
2. **Rolling update demo** should target the gateway (has readiness probes),
   not the workers (no HTTP endpoints). Updated demo guide accordingly.
3. **KEDA + Chaos Mesh interaction:** The `/api/generate` endpoint processes
   synchronously (bypasses Redis), so it doesn't trigger KEDA scaling. Must use
   `/api/documents` upload endpoint or Locust for queue-based load.
4. **Node capacity:** Only 2 of 3 configured nodes are running. Some KEDA-scaled
   pods couldn't schedule. Not a blocker for the demo narrative but worth noting.

## Files changed

- `k8s/chaos/pod-kill.yaml` — changed mode from `fixed`/`value: "2"` to `one`
- `infra/helm-install.sh` — added containerd runtime settings for Chaos Mesh
- `docs/chaos-experiments.md` — added containerd prerequisite note and verified results
- `docs/demo-guide.md` — updated rolling update section to use gateway instead of workers
- `CLAUDE.md` — removed chaos mesh and demo rehearsal from "Not yet done"
````
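For reference, a minimal `PodChaos` manifest consistent with the `mode: one` change described above. Field names follow the Chaos Mesh PodChaos CRD and the experiment name matches the `kubectl delete podchaos pod-kill-classify-worker` command in the docs, but the selector labels are an assumption, not copied from the repo's `k8s/chaos/pod-kill.yaml`:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-classify-worker
  namespace: documentstream
spec:
  action: pod-kill
  # was: mode: fixed / value: "2" — unsatisfiable when KEDA holds replicas at 1
  mode: one
  selector:
    namespaces:
      - documentstream
    labelSelectors:
      app: classify-worker   # assumed label; match your deployment's pod labels
```

With `mode: one`, Chaos Mesh picks a single pod from whatever matches the selector, so the experiment works at any replica count.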
