@@ -9,8 +9,8 @@ commands, timing, talking points, and things to name-drop with context on **why**
 
 Before the interview:
 
-- [ ] AKS cluster is running (`az aks start -n documentstream-aks -g documentstream-rg`)
-- [ ] PostgreSQL is running (`az postgres flexible-server start -n documentstream-pg -g documentstream-rg`)
+- [ ] AKS cluster is running (`az aks start -n DocumentStreamManagedCluster -g documentstream`)
+- [ ] PostgreSQL is running (`az postgres flexible-server start -n documentstream-pg -g documentstream`)
 - [ ] All pods are healthy (`kubectl get pods -n documentstream`)
 - [ ] Grafana is accessible and dashboard is loaded
 - [ ] Chaos Mesh dashboard is accessible
@@ -104,15 +104,16 @@ Before the interview:
 
 ### Minute 4-6: "Watch it heal"
 
-**Show:** Chaos Mesh dashboard — create a PodChaos experiment
+**Run:** `kubectl apply -f k8s/chaos/pod-kill.yaml`
 
-> "I'm going to kill 2 classify workers. This simulates a node failure or a process crash."
+> "I'm going to kill a classify worker. This simulates a node failure or a process crash."
 
-**Show:** Grafana — pods drop, then come back
+**Show:** `kubectl get pods -n documentstream -l app=classify-worker` — pod dies and restarts
+in ~8 seconds
 
-> "Kubernetes detected the failed pods within seconds and created replacements. The
-> documents those workers were processing? They stayed unacknowledged in the Redis
-> stream. When the new workers started, they picked up the unfinished messages.
+> "Kubernetes detected the failed pod within seconds and created a replacement. The
+> document that worker was processing? It stayed unacknowledged in the Redis
+> stream. When the new worker started, it picked up the unfinished message.
 > Zero data loss."
 
 **Explain the Redis Streams guarantee:**
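The guarantee described above can be illustrated with a toy model of a consumer group's Pending Entries List (PEL). This is only a sketch, not the project's worker code — a real deployment talks to Redis itself with `XADD`, `XREADGROUP`, `XACK`, and `XAUTOCLAIM`; the class and worker names here are made up for illustration:

```python
# Toy model of the Redis Streams consumer-group guarantee.
# A message delivered to a consumer stays in the PEL until it is
# acknowledged; a replacement consumer can claim it after a crash.

class ToyStreamGroup:
    def __init__(self):
        self.undelivered = []   # entries not yet handed to any consumer
        self.pending = {}       # msg_id -> consumer holding it (the PEL)
        self.acked = set()

    def add(self, msg_id):                 # ~ XADD
        self.undelivered.append(msg_id)

    def read_group(self, consumer):        # ~ XREADGROUP '>': next new entry
        if not self.undelivered:
            return None
        msg_id = self.undelivered.pop(0)
        self.pending[msg_id] = consumer    # delivered but not acknowledged
        return msg_id

    def ack(self, msg_id):                 # ~ XACK: remove from the PEL
        self.pending.pop(msg_id, None)
        self.acked.add(msg_id)

    def autoclaim(self, new_consumer):     # ~ XAUTOCLAIM: take over pending work
        claimed = list(self.pending)
        for msg_id in claimed:
            self.pending[msg_id] = new_consumer
        return claimed

stream = ToyStreamGroup()
stream.add("doc-1")
stream.add("doc-2")

stream.read_group("worker-a")               # worker-a starts on doc-1
stream.ack(stream.read_group("worker-b"))   # worker-b finishes doc-2

# worker-a dies before acking: doc-1 is still pending, so the
# replacement worker claims it and can finish the job.
recovered = stream.autoclaim("worker-a-replacement")
print(recovered)  # ['doc-1']
```

This mirrors the demo: killing a worker never deletes its in-flight message, it only leaves the message pending until a surviving or replacement worker claims it.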
@@ -126,21 +127,27 @@ Before the interview:
 
 ### Minute 6-8: "Watch it handle a bad deployment"
 
-**Run:** `kubectl set image deployment/classify-worker classify-worker=documentstreamacr.azurecr.io/worker:buggy`
+**Run:** `kubectl set image deployment/gateway gateway=acrdocumentstream.azurecr.io/gateway:buggy -n documentstream`
 
-> "I just deployed a 'buggy' version of the classify worker — it returns errors on
-> every request. Watch the rolling update."
+> "I just deployed a 'buggy' version of the gateway — pointing to an image tag that
+> doesn't exist. Watch the rolling update."
 
-**Show:** Grafana — new pods start failing readiness probes
+**Show:** `kubectl get pods -n documentstream -l app=gateway` — new pod is Pending/ImagePullBackOff,
+old pods still Running
 
-> "K8s starts the new pods, but they fail their readiness probes. K8s notices and
-> stops the rollout — the old pods keep running. The system is still serving traffic.
-> No downtime."
+> "K8s starts the new pod, but it can't pull the image. The rolling update strategy
+> keeps the old pods running — the system is still serving traffic. No downtime."
 
-**Run:** `kubectl rollout undo deployment/classify-worker`
+**Verify:** `curl http://51.138.91.82/health` — still returns 200
+
+**Run:** `kubectl rollout undo deployment/gateway -n documentstream`
 
 > "One command to rollback. The previous version is restored in seconds."
 
+**Note:** The gateway has readiness probes configured, so K8s knows not to route
+traffic to unhealthy pods. The workers don't have HTTP endpoints (they're Redis
+consumers), so the gateway is the best target for this demo.
+
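For reference, a readiness probe like the one the note mentions is declared on the container spec. This is a minimal sketch, not the repo's actual manifest — the `/health` path matches the curl check above, but the port and thresholds are assumptions:

```yaml
# Hypothetical readinessProbe for the gateway container;
# port and timing values are illustrative, not from k8s/.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```

Until the probe succeeds, the pod is excluded from the Service's endpoints, which is why a failing rollout never receives traffic.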
 ---
 
 ### Minute 8-10: "Architecture & cost"