
Commit abda7e3

johnmathewsclaude committed
Test chaos experiments on live cluster and complete demo rehearsal
- Run all 3 Chaos Mesh experiments (pod-kill, cpu-stress, network-delay) on AKS
- Fix Chaos Mesh containerd runtime config (was failing with "expected docker://")
- Fix pod-kill mode from fixed/2 to one (KEDA keeps replicas at 1 when idle)
- Update rolling update demo to target gateway (has readiness probes) not workers
- Correct Azure resource names across all docs (RG, AKS cluster, storage account)
- Update test count in README (51 → 92)
- Mark chaos mesh, rolling update, and demo rehearsal as DONE in implementation plan

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 50964d9 commit abda7e3

11 files changed

Lines changed: 142 additions & 43 deletions


.engineering-team/architecture-plan.md

Lines changed: 13 additions & 13 deletions
````diff
@@ -178,17 +178,17 @@ This is what the live demo looks like. Every architectural decision below serves
 - "KEDA monitors Redis queue depth. When documents pile up, it adds workers automatically"
 
 ### Minute 3-5: "Watch it heal"
-- Open Chaos Mesh dashboard, create a PodChaos experiment: kill 2 classify workers
-- Switch to Grafana: watch pods die and instantly restart
+- Apply PodChaos experiment: `kubectl apply -f k8s/chaos/pod-kill.yaml` (kills 1 classify worker)
+- Switch to Grafana: watch pod die and instantly restart (~8 seconds to Running+Ready)
 - Show brief latency spike, then recovery
-- "Kubernetes detected the failed pods and restarted them in seconds. No documents were lost — they stayed in the queue"
+- "Kubernetes detected the failed pod and restarted it in seconds. No documents were lost — they stayed in the queue"
 
 ### Minute 5-7: "Watch it handle a bad deployment"
-- Deploy a "buggy" version (returns 500 errors) using `kubectl set image`
-- Show rolling update: old pods still serving while new pods start failing readiness probes
-- K8s stops the rollout automatically (maxUnavailable protects the system)
-- Run `kubectl rollout undo` — instant rollback
-- "The readiness probe caught the bug. K8s stopped the rollout before it affected users. One command to rollback"
+- Deploy a "buggy" gateway version: `kubectl set image deployment/gateway gateway=acrdocumentstream.azurecr.io/gateway:buggy -n documentstream`
+- Show rolling update: old pods still serving while new pod fails to start (ImagePullBackOff)
+- Verify system still serves traffic: `curl http://51.138.91.82/health` returns 200
+- Run `kubectl rollout undo deployment/gateway -n documentstream` — instant rollback
+- "The rolling update strategy kept the old pods running. One command to rollback"
 
 ### Minute 7-8: "The CI/CD pipeline"
 - Show the GitHub Actions workflow (on screen or paper)
@@ -284,8 +284,8 @@ k8s/
 | # | Task | Details |
 |---|---|---|
 | 1.1 | Azure account setup | Create subscription, install `az` CLI, authenticate |
-| 1.2 | Create resource group | `az group create -n documentstream-rg -l westeurope` |
-| 1.3 | Create ACR | `az acr create -n documentstreamacr -g documentstream-rg --sku Basic` |
+| 1.2 | Create resource group | `az group create -n documentstream -l westeurope` |
+| 1.3 | Create ACR | `az acr create -n documentstreamacr -g documentstream --sku Basic` |
 | 1.4 | Create AKS cluster | 3x B2ms nodes, attach ACR, enable monitoring |
 | 1.5 | Install Helm charts | kube-prometheus-stack, Redis, KEDA, Chaos Mesh, ingress-nginx |
 | 1.6 | Verify cluster | `kubectl get nodes`, access Grafana, access Chaos Mesh dashboard |
@@ -410,12 +410,12 @@ This project demonstrates the following Kubernetes concepts, mapped to interview
 
 ```bash
 # Variables
-RG=documentstream-rg
+RG=documentstream
 LOCATION=westeurope
-CLUSTER=documentstream-aks
+CLUSTER=DocumentStreamManagedCluster
 ACR=documentstreamacr
 PG_SERVER=documentstream-pg
-STORAGE=documentstreamstorage
+STORAGE=documentstream
 
 # Resource Group
 az group create -n $RG -l $LOCATION
````
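The "old pods still serving" behavior in the bad-deployment demo comes from the Deployment's rolling-update budget: Kubernetes converts `maxUnavailable` (rounded down) and `maxSurge` (rounded up) into absolute pod counts. A minimal sketch of that arithmetic, assuming the K8s defaults of 25%/25% and 2 gateway replicas — the actual strategy values and replica count are not shown in this commit:

```python
import math

def rolling_update_budget(replicas: int, max_unavailable="25%", max_surge="25%"):
    """K8s rounding rules: maxUnavailable rounds down, maxSurge rounds up."""
    def pct(v):
        return int(v.rstrip("%")) / 100
    unavailable = math.floor(replicas * pct(max_unavailable))
    surge = math.ceil(replicas * pct(max_surge))
    return {
        "min_ready_during_rollout": replicas - unavailable,
        "max_total_pods": replicas + surge,
    }

# With 2 replicas: floor(0.5) = 0 pods may be unavailable and ceil(0.5) = 1 extra
# pod may be created, so both old pods stay Ready while the single new
# (image-pull-broken) pod sits in ImagePullBackOff.
print(rolling_update_budget(2))  # {'min_ready_during_rollout': 2, 'max_total_pods': 3}
```

This is why the rollout stalls safely instead of taking down the gateway: the unavailable budget is exhausted before any old pod is removed.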

.github/workflows/deploy.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -48,8 +48,8 @@ jobs:
       - name: Set AKS context
         run: |
           az aks get-credentials \
-            -n documentstream-aks \
-            -g documentstream-rg \
+            -n DocumentStreamManagedCluster \
+            -g documentstream \
             --overwrite-existing
 
       - name: Apply K8s manifests
```

CLAUDE.md

Lines changed: 0 additions & 2 deletions
```diff
@@ -35,9 +35,7 @@ loan documents.
 - **ServiceMonitor:** Prometheus scrape config for kube-prometheus-stack
 
 ### Not yet done
-- Apply Chaos Mesh experiments on live cluster
 - CI/CD deploy workflow needs AZURE_CREDENTIALS secret
-- Demo rehearsal
 
 ## Project Structure
 - `src/gateway/` — FastAPI API + web UI (dual-mode: sync or async via Redis)
```

README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -105,7 +105,7 @@ src/
   gateway/       FastAPI API + web UI + Dockerfile
   worker/        Extract, classify, and semantic modules
   generator/     PDF document generator (5 templates, CLI)
-  tests/         51 pytest tests
+  tests/         92 pytest tests
   docs/          Architecture, classification, demo guide, dictionary
   demo_samples/  One committed loan scenario (5 PDFs)
   k8s/           Kubernetes manifests (base, scaling, chaos)
```

docs/chaos-experiments.md

Lines changed: 27 additions & 1 deletion
````diff
@@ -5,7 +5,16 @@ Chaos Mesh experiments for demonstrating Kubernetes resilience and self-healing.
 
 ## Prerequisites
 
-- Chaos Mesh installed in the cluster (`helm install chaos-mesh` — already done)
+- Chaos Mesh installed with containerd runtime support (AKS uses containerd, not Docker):
+  ```bash
+  helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
+    --namespace chaos-mesh \
+    --set chaosDaemon.runtime=containerd \
+    --set chaosDaemon.socketPath=/run/containerd/containerd.sock
+  ```
+  Without the containerd settings, StressChaos and NetworkChaos fail with
+  `expected docker:// but got container` errors. The `infra/helm-install.sh` script
+  already includes these settings.
 - Pipeline running with documents flowing through Redis Streams
 - Grafana dashboard open to observe the effects
 
@@ -34,6 +43,9 @@ kubectl delete podchaos pod-kill-classify-worker
 - Pod restarts counter increments (visible in Grafana "Pod Restarts" panel)
 - Pipeline continues processing after the new pod is ready
 
+**Verified result (2026-04-02):** Pod killed and replacement Running+Ready in ~8 seconds.
+Pipeline processed documents immediately after recovery. Zero data loss confirmed.
+
 ## Experiment 2: Network Delay (Resilience)
 
 Injects 500ms latency (with 100ms jitter) on store-worker pods for 2 minutes. Simulates degraded connectivity to
@@ -57,6 +69,10 @@ kubectl delete networkchaos network-delay-store-worker
 - Grafana network I/O panel shows the latency effect
 - After expiry, throughput returns to normal
 
+**Verified result (2026-04-02):** Pipeline slowed but continued processing. Store-worker
+lag reached 8 messages (normally 0) during the experiment. All messages processed after
+delay cleared.
+
 ## Experiment 3: CPU Stress (KEDA Autoscaling)
 
 Burns 80% CPU on classify-worker pods for 2 minutes. This is the most impressive experiment — it triggers KEDA
@@ -87,6 +103,16 @@ kubectl delete stresschaos cpu-stress-classify-worker
 - New pods process the backlog, queue depth drops
 - After stress ends + 60s cooldown, KEDA scales back down to 1
 
+**Verified result (2026-04-02):** CPU stress injected successfully (required containerd
+runtime fix — see Prerequisites). With 80 concurrent PDF uploads, classify lag reached 18
+messages. KEDA scaled classify-workers from 1 to 4 pods within 15 seconds. Additional pods
+were Pending due to node capacity (2 nodes) — in production, AKS Cluster Autoscaler would
+add nodes.
+
+**Note:** The `/api/generate` endpoint processes documents synchronously (bypasses Redis).
+To test KEDA scaling, use the `/api/documents` upload endpoint which queues through Redis,
+or use Locust.
+
 ## Recommended Demo Order
 
 1. **Pod Kill** — quick (30s), shows self-healing
````
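The 1→4 scale-up recorded in the CPU-stress result is consistent with the HPA target arithmetic that KEDA drives: desired replicas ≈ ceil(queue lag / per-replica target), clamped to the ScaledObject's min/max. A sketch of that calculation — the lag target of 5 and max of 8 replicas are assumptions (the commit doesn't show the ScaledObject spec), though 8 matches the "verified scaling 1→8→1" note in the implementation plan:

```python
import math

def desired_replicas(queue_lag: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """HPA-style target arithmetic: ceil(lag / target), clamped to bounds."""
    desired = math.ceil(queue_lag / target_per_replica)
    return max(min_replicas, min(max_replicas, desired))

# Observed in the experiment: classify lag of 18 messages -> 4 worker pods
print(desired_replicas(18, 5))  # 4
# Idle queue: KEDA holds the deployment at minReplicaCount
print(desired_replicas(0, 5))   # 1
```

This also explains the pod-kill fix elsewhere in this commit: with an empty queue the deployment sits at 1 replica, so a PodChaos experiment asking to kill 2 fixed pods cannot be satisfied.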

docs/demo-guide.md

Lines changed: 23 additions & 16 deletions
```diff
@@ -9,8 +9,8 @@ commands, timing, talking points, and things to name-drop with context on **why**
 
 Before the interview:
 
-- [ ] AKS cluster is running (`az aks start -n documentstream-aks -g documentstream-rg`)
-- [ ] PostgreSQL is running (`az postgres flexible-server start -n documentstream-pg -g documentstream-rg`)
+- [ ] AKS cluster is running (`az aks start -n DocumentStreamManagedCluster -g documentstream`)
+- [ ] PostgreSQL is running (`az postgres flexible-server start -n documentstream-pg -g documentstream`)
 - [ ] All pods are healthy (`kubectl get pods -n documentstream`)
 - [ ] Grafana is accessible and dashboard is loaded
 - [ ] Chaos Mesh dashboard is accessible
@@ -104,15 +104,16 @@ Before the interview:
 
 ### Minute 4-6: "Watch it heal"
 
-**Show:** Chaos Mesh dashboard — create a PodChaos experiment
+**Run:** `kubectl apply -f k8s/chaos/pod-kill.yaml`
 
-> "I'm going to kill 2 classify workers. This simulates a node failure or a process crash."
+> "I'm going to kill a classify worker. This simulates a node failure or a process crash."
 
-**Show:** Grafana — pods drop, then come back
+**Show:** `kubectl get pods -n documentstream -l app=classify-worker` — pod dies and restarts
+in ~8 seconds
 
-> "Kubernetes detected the failed pods within seconds and created replacements. The
-> documents those workers were processing? They stayed unacknowledged in the Redis
-> stream. When the new workers started, they picked up the unfinished messages.
+> "Kubernetes detected the failed pod within seconds and created a replacement. The
+> document that worker was processing? It stayed unacknowledged in the Redis
+> stream. When the new worker started, it picked up the unfinished message.
 > Zero data loss."
 
 **Explain the Redis Streams guarantee:**
@@ -126,21 +127,27 @@ Before the interview:
 
 ### Minute 6-8: "Watch it handle a bad deployment"
 
-**Run:** `kubectl set image deployment/classify-worker classify-worker=documentstreamacr.azurecr.io/worker:buggy`
+**Run:** `kubectl set image deployment/gateway gateway=acrdocumentstream.azurecr.io/gateway:buggy -n documentstream`
 
-> "I just deployed a 'buggy' version of the classify worker — it returns errors on
-> every request. Watch the rolling update."
+> "I just deployed a 'buggy' version of the gateway — pointing to an image tag that
+> doesn't exist. Watch the rolling update."
 
-**Show:** Grafana — new pods start failing readiness probes
+**Show:** `kubectl get pods -n documentstream -l app=gateway` — new pod is Pending/ImagePullBackOff,
+old pods still Running
 
-> "K8s starts the new pods, but they fail their readiness probes. K8s notices and
-> stops the rollout — the old pods keep running. The system is still serving traffic.
-> No downtime."
+> "K8s starts the new pod, but it can't pull the image. The rolling update strategy
+> keeps the old pods running — the system is still serving traffic. No downtime."
 
-**Run:** `kubectl rollout undo deployment/classify-worker`
+**Verify:** `curl http://51.138.91.82/health` — still returns 200
+
+**Run:** `kubectl rollout undo deployment/gateway -n documentstream`
 
 > "One command to rollback. The previous version is restored in seconds."
 
+**Note:** The gateway has readiness probes configured, so K8s knows not to route
+traffic to unhealthy pods. The workers don't have HTTP endpoints (they're Redis
+consumers), so the gateway is the best target for this demo.
+
 ---
 
 ### Minute 8-10: "Architecture & cost"
```
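The "zero data loss" talking point in this demo guide rests on Redis Streams consumer-group semantics: an entry delivered via `XREADGROUP` stays in the pending entries list until `XACK`, so a crashed worker's in-flight message can be claimed by its replacement (`XCLAIM`/`XAUTOCLAIM`). An illustrative in-memory model of that lifecycle — a toy sketch, not the real redis-py calls or the repo's worker code:

```python
class StreamGroup:
    """Toy model of a Redis Streams consumer group: deliver -> pending -> ack."""

    def __init__(self):
        self.queue = []    # undelivered entries
        self.pending = {}  # entry id -> (consumer, entry), delivered but unacked
        self.next_id = 0

    def xadd(self, entry):
        self.queue.append((self.next_id, entry))
        self.next_id += 1

    def xreadgroup(self, consumer):
        # Delivery moves the entry to the pending list; it is NOT removed yet.
        eid, entry = self.queue.pop(0)
        self.pending[eid] = (consumer, entry)
        return eid, entry

    def xack(self, eid):
        # Only an explicit ack removes the entry from pending.
        self.pending.pop(eid)

    def xclaim(self, eid, new_consumer):
        # A replacement worker takes ownership of a dead worker's pending entry.
        _, entry = self.pending[eid]
        self.pending[eid] = (new_consumer, entry)
        return entry

g = StreamGroup()
g.xadd("doc-1.pdf")
eid, doc = g.xreadgroup("classify-worker-a")
# worker-a is killed by the PodChaos experiment before it can xack:
# the entry stays pending, and the replacement pod claims and finishes it.
recovered = g.xclaim(eid, "classify-worker-b")
g.xack(eid)
print(recovered, len(g.pending))  # doc-1.pdf 0
```

The real pipeline gets the same guarantee from Redis itself; the model only shows why a pod kill mid-processing loses nothing.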

docs/implementation-plan.md

Lines changed: 3 additions & 3 deletions
```diff
@@ -18,11 +18,11 @@
 | 4 | Build, push, deploy to AKS | MUST | 1-1.5h | DONE (pipeline running at 51.138.91.82) |
 | 5 | KEDA autoscaling | MUST | 1-1.5h | DONE (applied, verified scaling 1→8→1) |
 | 6 | Grafana dashboard | HIGH | 1.5-2h | DONE (9 panels, including blob storage metrics) |
-| 7 | Chaos Mesh experiments | MEDIUM | 1h | PARTIAL (YAMLs written, Chaos Mesh installed, needs apply) |
+| 7 | Chaos Mesh experiments | MEDIUM | 1h | DONE (all 3 experiments applied and verified on live cluster, containerd fix applied) |
 | 8 | Locust load testing | MEDIUM | 1h | DONE (ran against AKS, verified KEDA scaling) |
 | 9 | CI/CD deploy workflow | MEDIUM | 1h | PARTIAL (workflow written, needs AZURE_CREDENTIALS secret) |
-| 10 | Rolling update demo prep | LOW | 30min | TODO (live demo technique) |
-| 11 | Polish and demo rehearsal | MUST | 1.5-2h | TODO |
+| 10 | Rolling update demo prep | LOW | 30min | DONE (verified on gateway deployment, rollback tested) |
+| 11 | Polish and demo rehearsal | MUST | 1.5-2h | DONE (full demo rehearsal completed 2026-04-02) |
 | -- | **Day 4: Enhancements** | -- | -- | **DONE** |
 | 12 | Azure Blob Storage integration | HIGH | 2h | DONE (PDFs stored in Azure, metrics in Grafana) |
 | 13 | Switch to ONNX Runtime | MEDIUM | 15min | DONE (sentence-transformers backend="onnx", ~50MB vs ~5GB PyTorch) |
```

infra/helm-install.sh

Lines changed: 5 additions & 0 deletions
```diff
@@ -82,9 +82,14 @@ install_prometheus() {
 
 install_chaos_mesh() {
   echo "==> Installing Chaos Mesh..."
+  # AKS uses containerd (not Docker). Chaos Mesh defaults to docker runtime,
+  # which causes "expected docker:// but got containerd://" errors on all
+  # experiment types. Must set runtime + socket path explicitly.
   helm upgrade --install chaos-mesh chaos-mesh/chaos-mesh \
     --namespace chaos-mesh \
     --create-namespace \
+    --set chaosDaemon.runtime=containerd \
+    --set chaosDaemon.socketPath=/run/containerd/containerd.sock \
     --set controllerManager.resources.requests.cpu=25m \
     --set controllerManager.resources.requests.memory=64Mi \
     --set dashboard.resources.requests.cpu=25m \
```

infra/setup.sh

Lines changed: 1 addition & 1 deletion
```diff
@@ -26,7 +26,7 @@ AKS_NAME="DocumentStreamManagedCluster"
 PG_NAME="documentstream-pg"
 PG_ADMIN="documentstream"
 PG_PASSWORD="${PG_PASSWORD:?Set PG_PASSWORD environment variable before running this script}"
-STORAGE_NAME="documentstreamstorage"
+STORAGE_NAME="documentstream"
 BLOB_CONTAINER="documents"
 
 echo "==> Creating resource group..."
```
Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@ (new file)

````markdown
# Chaos Mesh Testing & Demo Rehearsal

**Date:** 2026-04-02

## What happened

Ran all three Chaos Mesh experiments on the live AKS cluster and did a full demo rehearsal.

## Chaos Mesh containerd fix

The CPU stress and network delay experiments were failing with:
```
rpc error: code = Unknown desc = expected docker:// but got container
```

Root cause: AKS uses containerd as the container runtime, but Chaos Mesh defaults to Docker.
Fixed by adding Helm values to `infra/helm-install.sh`:
```
--set chaosDaemon.runtime=containerd
--set chaosDaemon.socketPath=/run/containerd/containerd.sock
```

Applied via `helm upgrade --install`. Chaos daemons restarted and all experiments work.

## Experiment results

### Pod Kill
- Pod killed and replacement Running+Ready in ~8 seconds
- Pipeline processed documents immediately after recovery
- Zero data loss confirmed
- Changed `mode: fixed` / `value: "2"` to `mode: one` — KEDA keeps replicas at 1 when
  queue is empty, so requesting to kill 2 pods fails

### CPU Stress
- Successfully injected 80% CPU burn on classify-worker
- With 80 concurrent uploads, classify lag reached 18 messages
- KEDA scaled classify-workers from 1 to 4 pods within 15 seconds
- 3 new pods were Pending due to node capacity (2 nodes instead of 3)
- In production, AKS Cluster Autoscaler would add nodes

### Network Delay
- 500ms delay injected on store-worker
- Pipeline slowed but continued processing
- Store lag reached 8 messages during the experiment
- All messages processed after delay cleared

## Demo rehearsal findings

1. **Web UI** loads in ~36ms, generate endpoint works reliably
2. **Rolling update demo** should target the gateway (has readiness probes),
   not the workers (no HTTP endpoints). Updated demo guide accordingly.
3. **KEDA + Chaos Mesh interaction:** The `/api/generate` endpoint processes
   synchronously (bypasses Redis), so it doesn't trigger KEDA scaling. Must use
   `/api/documents` upload endpoint or Locust for queue-based load.
4. **Node capacity:** Only 2 of 3 configured nodes are running. Some KEDA-scaled
   pods couldn't schedule. Not a blocker for the demo narrative but worth noting.

## Files changed

- `k8s/chaos/pod-kill.yaml` — changed mode from `fixed`/`value: "2"` to `one`
- `infra/helm-install.sh` — added containerd runtime settings for Chaos Mesh
- `docs/chaos-experiments.md` — added containerd prerequisite note and verified results
- `docs/demo-guide.md` — updated rolling update section to use gateway instead of workers
- `CLAUDE.md` — removed chaos mesh and demo rehearsal from "Not yet done"
````
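For reference, a minimal `PodChaos` manifest consistent with the `mode: one` change described above. Field names follow the Chaos Mesh PodChaos CRD and the experiment name matches the `kubectl delete podchaos pod-kill-classify-worker` command in the docs, but the selector labels are an assumption, not copied from the repo's `k8s/chaos/pod-kill.yaml`:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-classify-worker
  namespace: documentstream
spec:
  action: pod-kill
  # was: mode: fixed / value: "2" — unsatisfiable when KEDA holds replicas at 1
  mode: one
  selector:
    namespaces:
      - documentstream
    labelSelectors:
      app: classify-worker   # assumed label; match your deployment's pod labels
```

With `mode: one`, Chaos Mesh picks a single pod from whatever matches the selector, so the experiment works at any replica count.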
