Commit 7486e6d

johnmathews and claude committed
Add conceptual map learning doc and fix Chaos Mesh pod-kill mode

- learning/260331-conceptual-map.md: three-layer mental model, all levers, data flow, resource types, common command patterns
- k8s/chaos/pod-kill.yaml: fix mode fixed-number→fixed (Chaos Mesh v2 API)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent 68f9096 commit 7486e6d

2 files changed

Lines changed: 291 additions & 1 deletion

k8s/chaos/pod-kill.yaml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ metadata:
       and resume processing without data loss.
 spec:
   action: pod-kill
-  mode: fixed-number
+  mode: fixed
   value: "2"
   selector:
     namespaces:

learning/260331-conceptual-map.md

Lines changed: 290 additions & 0 deletions
@@ -0,0 +1,290 @@
# Conceptual Map: DocumentStream on K8s

**Date:** 2026-03-31

This document maps out all the moving parts, what they do, and how they connect.
Use it as a mental model reference before the interview.

---

## The Big Picture

You have **three layers**:

```
┌─────────────────────────────────────────────────────────┐
│ LAYER 3: Your Application                               │
│ Gateway, Extract/Classify/Store workers, Redis, Postgres│
│ Managed by: kubectl apply -k k8s/base/                  │
├─────────────────────────────────────────────────────────┤
│ LAYER 2: Platform Services (Helm charts)                │
│ KEDA, Prometheus+Grafana, Chaos Mesh, Ingress-nginx     │
│ Managed by: helm upgrade --install                      │
├─────────────────────────────────────────────────────────┤
│ LAYER 1: Infrastructure (Azure)                         │
│ AKS cluster, ACR registry, Nodes (VMs)                  │
│ Managed by: az aks / az acr / Azure Portal              │
└─────────────────────────────────────────────────────────┘
```

Each layer builds on the one below. You can change Layer 3 without touching Layer 2.
You can change Layer 2 without touching Layer 1.

---

## Components and What They Do

### Layer 1: Infrastructure

| Component | What it is | What it does | How you interact |
|---|---|---|---|
| **AKS cluster** | Managed Kubernetes on Azure | Runs the control plane + your nodes | `az aks start/stop/get-credentials` |
| **Nodes** (2x B2s_v2) | Virtual machines | Run your containers. Each has a kubelet agent | `kubectl get nodes` |
| **ACR** | Container image registry | Stores your Docker images | `docker push`, `az acr build` |
| **Control plane** | API server + etcd + scheduler + controllers | Brain of the cluster. You never see these VMs | `kubectl` talks to API server |

### Layer 2: Platform Services

| Component | Installed via | What it does | Namespace |
|---|---|---|---|
| **Redis** | `helm install redis bitnami/redis` | Message broker (Streams) between pipeline stages | documentstream |
| **Ingress-nginx** | `helm install ingress-nginx` | Routes external HTTP traffic into the cluster | ingress-nginx |
| **KEDA** | `helm install keda kedacore/keda` | Autoscales pods based on Redis queue depth | keda |
| **Prometheus + Grafana** | `helm install prometheus` | Collects metrics, displays dashboards | monitoring |
| **Chaos Mesh** | `helm install chaos-mesh` | Injects failures for testing resilience | chaos-mesh |

### Layer 3: Your Application

| Component | K8s resource type | Replicas | What it does |
|---|---|---|---|
| **Gateway** | Deployment + Service + Ingress | 2 | FastAPI app. Receives uploads, publishes to Redis |
| **Extract worker** | Deployment (+ KEDA ScaledObject) | 1-8 | Reads from `raw-docs` stream, extracts text with PyMuPDF |
| **Classify worker** | Deployment (+ KEDA ScaledObject) | 1-8 | Reads from `extracted` stream, runs rule-based + semantic classifiers |
| **Store worker** | Deployment (+ KEDA ScaledObject) | 1-8 | Reads from `classified` stream, writes to PostgreSQL |
| **PostgreSQL** | Deployment + Service + PVC | 1 | Stores document metadata, classifications, vector embeddings |

---

## The Levers You Pull

### 1. Deploying / Updating Your App

**Command:** `kubectl apply -k k8s/base/`

**What it does:** Sends all your YAML manifests to the API server. K8s compares
desired state (your YAML) with actual state (what's running) and reconciles.
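
The desired-versus-actual comparison can be pictured with a small sketch. This is only an illustration of the idea, not how the Kubernetes controllers are implemented: `DeploymentState` and `reconcile` are hypothetical names, and the state is reduced to an image and a replica count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentState:
    """Hypothetical, simplified view of a Deployment: image and replica count only."""
    image: str
    replicas: int


def reconcile(desired: DeploymentState, actual: DeploymentState) -> list[str]:
    """Return the actions needed to move `actual` toward `desired`."""
    actions = []
    if desired.image != actual.image:
        actions.append(f"roll out image {desired.image}")
    if desired.replicas > actual.replicas:
        actions.append(f"start {desired.replicas - actual.replicas} pod(s)")
    elif desired.replicas < actual.replicas:
        actions.append(f"stop {actual.replicas - desired.replicas} pod(s)")
    return actions  # an empty list means the states already match


desired = DeploymentState(image="gateway:v2", replicas=2)  # what the YAML says
actual = DeploymentState(image="gateway:v1", replicas=1)   # what is running
print(reconcile(desired, actual))
```

Applying the same YAML twice produces no actions the second time, which is why re-running `kubectl apply` is safe.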

**When to use:** After changing any YAML in `k8s/base/`: resource limits, replica
counts, env vars, image tags.

**Pattern:** Edit YAML → `kubectl apply` → K8s rolls out changes.

### 2. Scaling

**Automatic (KEDA):**
- `kubectl apply -f k8s/scaling/`: tells KEDA to watch Redis queue depth
- KEDA checks every 15 seconds. If lag > 5 messages, it scales up. If lag = 0 for
  60 seconds, it scales down. Min 1, max 8 replicas.
- `kubectl get hpa -n documentstream`: see current scaling state (KEDA manages an HPA for each ScaledObject)
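
As a rough sketch of the scaling rule above, assuming the HPA-style calculation of one replica per 5 pending messages, clamped to the 1-8 range. The function name and parameters are hypothetical, and the sketch ignores the 15-second polling interval, the 60-second cooldown, and HPA tolerance and stabilization behaviour.

```python
import math


def desired_replicas(lag: int, threshold: int = 5,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """Simplified estimate: one replica per `threshold` pending messages, clamped."""
    wanted = math.ceil(lag / threshold)
    return max(min_replicas, min(max_replicas, wanted))


for lag in (0, 5, 6, 23, 100):
    print(f"lag={lag:>3} -> replicas={desired_replicas(lag)}")
```

Under these assumptions a lag of 6 asks for a second replica, and anything above 35 pending messages already hits the ceiling of 8.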

**Manual override:**
- `kubectl scale deployment/classify-worker -n documentstream --replicas=3`
- The KEDA-managed HPA takes back control at its next evaluation (within about 15 seconds)

**Node-level scaling:** Not configured. Would use Cluster Autoscaler to add nodes
when pods are Pending. Currently fixed at 2 nodes.

### 3. Installing Platform Services

**Command:** `helm upgrade --install <name> <chart> --namespace <ns> --set key=value`

**What it does:** Downloads a chart (bundled templated YAML), renders it with your
`--set` values, and applies the resulting manifests.
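
The render step can be illustrated with a toy version: defaults, an override, and a template. This is an analogy only. Helm uses Go templates and a richer `--set` syntax (types, lists, escaping). `apply_set`, the default values, and the template text are all made up for illustration.

```python
from string import Template


def apply_set(values: dict, assignment: str) -> dict:
    """Apply one `key.path=value` override to a nested dict (simplified --set)."""
    path, _, raw = assignment.partition("=")
    keys = path.split(".")
    node = values
    for key in keys[:-1]:
        node = node.setdefault(key, {})  # walk or create intermediate levels
    node[keys[-1]] = raw
    return values


# Hypothetical chart defaults and template, not taken from any real chart.
defaults = {"replicas": "1", "image": {"tag": "7.2"}}
template = Template("replicas: $replicas\nimage: redis:$tag\n")

values = apply_set(defaults, "image.tag=7.4")
manifest = template.substitute(replicas=values["replicas"], tag=values["image"]["tag"])
print(manifest)
```

The rendered text is what finally gets applied to the cluster; the chart itself never reaches the API server.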

**When to use:** Setting up infrastructure inside the cluster. One-time setup, then
rarely touched.

**Key flags:**
- `--create-namespace`: create the namespace if it doesn't exist
- `--set key=value`: override chart defaults
- `--wait`: block until the release's resources are ready
- `--timeout`: give up after this duration

**See what's installed:** `helm list --all-namespaces`

### 4. Injecting Failures (Chaos Mesh)

**Command:** `kubectl apply -f k8s/chaos/pod-kill.yaml`

**What it does:** Creates a custom resource (an instance of a Chaos Mesh CRD). The
Chaos Mesh operator watches for it and executes the failure injection.
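
To make the `mode: fixed` / `value: "2"` pair from `pod-kill.yaml` concrete, here is a sketch of how a target set might be chosen from the pods matching the selector. It is not Chaos Mesh's code: `pick_targets` and the pod names are hypothetical, only three of the modes are modelled, and capping the count at the number of available pods is an assumption.

```python
import random


def pick_targets(pods: list[str], mode: str, value: str, rng: random.Random) -> list[str]:
    """Choose which matching pods an experiment acts on (simplified)."""
    if mode == "one":
        return [rng.choice(pods)]
    if mode == "all":
        return list(pods)
    if mode == "fixed":
        count = min(int(value), len(pods))  # assumption: never ask for more than exist
        return rng.sample(pods, count)
    raise ValueError(f"unsupported mode: {mode}")


pods = ["classify-worker-a", "classify-worker-b", "classify-worker-c"]  # hypothetical
targets = pick_targets(pods, mode="fixed", value="2", rng=random.Random(0))
print(targets)
```

Note that `value` is a string in the manifest, which is why the sketch converts it with `int()`.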

**Three experiments:**
| File | What it does | Duration |
|---|---|---|
| `pod-kill.yaml` | Kills 2 classify-worker pods | 30s |
| `network-delay.yaml` | Adds 500ms latency to store-worker | 2min |
| `cpu-stress.yaml` | Burns 80% CPU on classify-worker | 2min |

**Clean up:** Experiments auto-expire after their duration. Or:
`kubectl delete podchaos pod-kill-classify-worker -n documentstream`

### 5. Monitoring

**Grafana dashboard:** Port-forward, then open in browser.
```bash
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
# http://localhost:3000 (admin / <password from secret>)
```

**Quick CLI checks:**
```bash
kubectl get all -n documentstream                            # Everything in your namespace
kubectl get hpa -n documentstream                            # KEDA scaling state
kubectl get pods -n documentstream -w                        # Watch pods in real-time
kubectl logs deployment/<name> -n documentstream --tail=30   # Check logs
kubectl describe pod <name> -n documentstream                # Detailed events
```

### 6. Load Testing

**Command:** `uv run locust -f locust/locustfile.py --host http://51.138.91.82`

**What it does:** Runs simulated users from your laptop hitting the cluster.
Web UI at http://localhost:8089.

**The upload task** (weight 3) drives the Redis pipeline, which is what triggers
KEDA scaling. The generate task (weight 1) runs synchronously in the gateway.
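
Task weights are relative, so 3 and 1 mean roughly three uploads for every generate call. The arithmetic, as a small sketch (`task_share` is a hypothetical helper, not part of Locust):

```python
def task_share(weights: dict[str, int]) -> dict[str, float]:
    """Convert relative task weights into the expected fraction of task executions."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


shares = task_share({"upload": 3, "generate": 1})
print(shares)
```

So about 75% of simulated requests feed the Redis pipeline and can push queue lag over the KEDA threshold.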

### 7. Resetting Demo Data

**Command:** `./infra/reset-demo.sh`

**What it does:** Truncates PostgreSQL, deletes Redis streams and status hashes,
and restarts the gateway to clear its in-memory store.

### 8. Building and Deploying New Images

```bash
# Build for AMD64 (required: AKS is x86, Mac is ARM)
docker build --platform linux/amd64 -t acrdocumentstream.azurecr.io/gateway:latest -f src/gateway/Dockerfile .
docker push acrdocumentstream.azurecr.io/gateway:latest

# Restart the pods so K8s pulls the new image
# (:latest defaults to imagePullPolicy: Always unless the manifest overrides it)
kubectl rollout restart deployment/gateway -n documentstream
```

### 9. Cluster Lifecycle

```bash
# Start cluster (3-8 min, then re-fetch credentials)
az aks start -g DocumentStream -n DocumentStreamManagedCluster
az aks get-credentials -n DocumentStreamManagedCluster -g DocumentStream --overwrite-existing

# Stop cluster (saves money: only disk costs remain)
az aks stop -g DocumentStream -n DocumentStreamManagedCluster

# Flush the local DNS cache if needed after restart (macOS)
sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder
```

---

## Data Flow

```
User/Locust
    │
    ▼
Ingress-nginx (51.138.91.82:80)
    │
    ▼
Gateway (FastAPI, 2 replicas)
    │
    ├── POST /api/documents → Redis stream "raw-docs" → return 202
    ├── POST /api/generate  → sync processing in gateway → return results
    ├── GET  /api/documents → read in-memory store → return list
    └── GET  /health        → return {"status":"healthy","mode":"async"}

Redis stream: raw-docs
    │
    ▼
Extract Worker (KEDA-scaled 1-8)
    │  PyMuPDF: extract text from PDF
    │
    ▼
Redis stream: extracted
    │
    ▼
Classify Worker (KEDA-scaled 1-8)
    │  Rule-based: privacy level (Public/Confidential/Secret)
    │  Semantic: environmental impact + industries (sentence-transformers)
    │
    ▼
Redis stream: classified
    │
    ▼
Store Worker (KEDA-scaled 1-8)
    │
    ▼
PostgreSQL (metadata + embeddings + classifications)
```
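
The stage-to-stage handoff in the diagram can be mimicked in memory. The sketch below uses plain queues as stand-ins for the three Redis streams and a list as a stand-in for PostgreSQL. The `extract` and `classify` bodies are placeholders, since the real extraction and classification rules are not shown here, and consumer groups, acknowledgements, and retries are left out.

```python
from collections import deque

# In-memory stand-ins for the three Redis streams named in the diagram.
streams = {"raw-docs": deque(), "extracted": deque(), "classified": deque()}
database = []  # stand-in for PostgreSQL


def extract(doc: dict) -> dict:
    # Placeholder for PDF text extraction.
    return {**doc, "text": doc["content"].strip()}


def classify(doc: dict) -> dict:
    # Toy rule-based privacy check, invented for this example.
    level = "Confidential" if "confidential" in doc["text"].lower() else "Public"
    return {**doc, "privacy": level}


def run_stage(source: str, handler, sink) -> None:
    """Drain one stream, apply the stage's handler, and pass results downstream."""
    while streams[source]:
        sink(handler(streams[source].popleft()))


streams["raw-docs"].append({"id": 1, "content": " Confidential loan agreement "})
run_stage("raw-docs", extract, streams["extracted"].append)
run_stage("extracted", classify, streams["classified"].append)
run_stage("classified", lambda doc: doc, database.append)
print(database)
```

Because each stage only reads from one stream and writes to the next, any stage can be scaled or restarted without the others knowing, which is the property the KEDA and chaos experiments rely on.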

---

## K8s Resource Types You're Using

| Resource | Your files | Purpose |
|---|---|---|
| **Namespace** | `namespace.yaml` | Logical boundary: `documentstream` |
| **ConfigMap** | `configmap.yaml` | Env vars: REDIS_URL, DATABASE_URL, stream names |
| **Deployment** | `gateway-deployment.yaml`, `*-deployment.yaml` | Manages pod replicas and rollouts |
| **Service** | `gateway-service.yaml`, `postgres-deployment.yaml` (contains Service) | Stable DNS name + load balancing to pods |
| **Ingress** | `ingress.yaml` | Routes external HTTP → gateway Service |
| **PersistentVolumeClaim** | `postgres-deployment.yaml` (contains PVC) | 1Gi disk for PostgreSQL data |
| **Kustomization** | `kustomization.yaml` | Lists all resources for `kubectl apply -k` |
| **ScaledObject** (CRD) | `k8s/scaling/*.yaml` | KEDA autoscaling rules per worker |
| **PodChaos** (CRD) | `k8s/chaos/pod-kill.yaml` | Chaos Mesh pod kill experiment |
| **NetworkChaos** (CRD) | `k8s/chaos/network-delay.yaml` | Chaos Mesh network delay experiment |
| **StressChaos** (CRD) | `k8s/chaos/cpu-stress.yaml` | Chaos Mesh CPU stress experiment |

---

## Common Patterns

### "I changed a YAML file, how do I apply it?"
```bash
kubectl apply -k k8s/base/       # for base manifests
kubectl apply -f k8s/scaling/    # for scaling rules
kubectl apply -f k8s/chaos/      # for chaos experiments
```

### "Something is broken, how do I debug?"
```bash
kubectl get pods -n documentstream                  # What's the status?
kubectl logs deployment/<name> -n documentstream    # What's the error?
kubectl describe pod <name> -n documentstream       # Events (scheduling, pulling, OOM)
kubectl get events -n documentstream --sort-by=.metadata.creationTimestamp  # Recent events
```

### "I want to restart something cleanly"
```bash
kubectl rollout restart deployment/<name> -n documentstream
```

### "I want to see what K8s thinks the desired state is"
```bash
kubectl get deployment <name> -n documentstream -o yaml   # Full YAML from etcd
kubectl get deployment <name> -n documentstream -o jsonpath='{.spec.template.spec.containers[0].resources}'  # Specific field
```

### "I want to temporarily force a specific number of replicas"
```bash
kubectl scale deployment/<name> -n documentstream --replicas=3
```

### "I want to see what Helm installed"
```bash
helm list --all-namespaces
helm get values redis -n documentstream   # See config values for a release
```
