# Conceptual Map: DocumentStream on K8s

**Date:** 2026-03-31

This document maps out all the moving parts, what they do, and how they connect.
Use it as a mental model reference before the interview.

---

## The Big Picture

You have **three layers**:

```
┌─────────────────────────────────────────────────────────┐
│ LAYER 3: Your Application                               │
│ Gateway, Extract/Classify/Store workers, Redis, Postgres│
│ Managed by: kubectl apply -k k8s/base/                  │
├─────────────────────────────────────────────────────────┤
│ LAYER 2: Platform Services (Helm charts)                │
│ KEDA, Prometheus+Grafana, Chaos Mesh, Ingress-nginx     │
│ Managed by: helm upgrade --install                      │
├─────────────────────────────────────────────────────────┤
│ LAYER 1: Infrastructure (Azure)                         │
│ AKS cluster, ACR registry, Nodes (VMs)                  │
│ Managed by: az aks / az acr / Azure Portal              │
└─────────────────────────────────────────────────────────┘
```

Each layer builds on the one below. You can change Layer 3 without touching Layer 2,
and Layer 2 without touching Layer 1.

---

## Components and What They Do

### Layer 1: Infrastructure

| Component | What it is | What it does | How you interact |
|---|---|---|---|
| **AKS cluster** | Managed Kubernetes on Azure | Runs the control plane + your nodes | `az aks start/stop/get-credentials` |
| **Nodes** (2x B2s_v2) | Virtual machines | Run your containers. Each has a kubelet agent | `kubectl get nodes` |
| **ACR** | Container image registry | Stores your Docker images | `docker push`, `az acr build` |
| **Control plane** | API server + etcd + scheduler + controllers | Brain of the cluster. You never see these VMs | `kubectl` talks to the API server |

### Layer 2: Platform Services

| Component | Installed via | What it does | Namespace |
|---|---|---|---|
| **Redis** | `helm install redis bitnami/redis` | Message broker (Streams) between pipeline stages | documentstream |
| **Ingress-nginx** | `helm install ingress-nginx` | Routes external HTTP traffic into the cluster | ingress-nginx |
| **KEDA** | `helm install keda kedacore/keda` | Autoscales pods based on Redis queue depth | keda |
| **Prometheus + Grafana** | `helm install prometheus` | Collects metrics, displays dashboards | monitoring |
| **Chaos Mesh** | `helm install chaos-mesh` | Injects failures for testing resilience | chaos-mesh |

### Layer 3: Your Application

| Component | K8s resource type | Replicas | What it does |
|---|---|---|---|
| **Gateway** | Deployment + Service + Ingress | 2 | FastAPI app. Receives uploads, publishes to Redis |
| **Extract worker** | Deployment (+ KEDA ScaledObject) | 1-8 | Reads from `raw-docs` stream, extracts text with PyMuPDF |
| **Classify worker** | Deployment (+ KEDA ScaledObject) | 1-8 | Reads from `extracted` stream, runs rule-based + semantic classifiers |
| **Store worker** | Deployment (+ KEDA ScaledObject) | 1-8 | Reads from `classified` stream, writes to PostgreSQL |
| **PostgreSQL** | Deployment + Service + PVC | 1 | Stores document metadata, classifications, vector embeddings |

---

## The Levers You Pull

### 1. Deploying / Updating Your App

**Command:** `kubectl apply -k k8s/base/`

**What it does:** Sends all your YAML manifests to the API server. K8s compares
desired state (your YAML) with actual state (what's running) and reconciles.

**When to use:** After changing any YAML in `k8s/base/` — resource limits, replica
counts, env vars, image tags.

**Pattern:** Edit YAML → `kubectl apply` → K8s rolls out changes.
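
For orientation, a minimal Deployment manifest for the gateway might look like the sketch below. The image name, namespace, and replica count come from this document; the `app: gateway` label and container port 8000 are assumptions — check the real `gateway-deployment.yaml`.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gateway
  namespace: documentstream
spec:
  replicas: 2                 # matches the 2 gateway replicas above
  selector:
    matchLabels:
      app: gateway            # label is an assumption
  template:
    metadata:
      labels:
        app: gateway
    spec:
      containers:
        - name: gateway
          image: acrdocumentstream.azurecr.io/gateway:latest
          ports:
            - containerPort: 8000   # assumed FastAPI port
```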

### 2. Scaling

**Automatic (KEDA):**
- `kubectl apply -f k8s/scaling/` — tells KEDA to watch Redis queue depth
- KEDA checks every 15 seconds. If lag > 5 messages, scales up. If lag = 0 for
  60 seconds, scales down. Min 1, max 8 replicas.
- `kubectl get hpa -n documentstream` — see current scaling state
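
A ScaledObject under `k8s/scaling/` encoding those numbers would look roughly like this, using KEDA's `redis-streams` trigger. The Redis address, consumer group name, and auth settings here are assumptions — the actual files are the source of truth:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: classify-worker
  namespace: documentstream
spec:
  scaleTargetRef:
    name: classify-worker          # Deployment to scale
  pollingInterval: 15              # check Redis every 15s
  cooldownPeriod: 60               # scale back down after 60s of no lag
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: redis-streams
      metadata:
        address: redis-master.documentstream.svc.cluster.local:6379  # assumed service name
        stream: extracted          # the stream this worker reads
        consumerGroup: classify    # assumed group name
        pendingEntriesCount: "5"   # scale up when lag > 5 messages
```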

**Manual override:**
- `kubectl scale deployment/classify-worker -n documentstream --replicas=3`
- KEDA will take back control when it next evaluates (within 15 seconds)

**Node-level scaling:** Not configured. Would use the Cluster Autoscaler to add nodes
when pods are Pending. Currently fixed at 2 nodes.

### 3. Installing Platform Services

**Command:** `helm upgrade --install <name> <chart> --namespace <ns> --set key=value`

**What it does:** Downloads a chart (bundled, templated YAML), renders it with your
`--set` values, and applies the resulting manifests.

**When to use:** Setting up infrastructure inside the cluster. One-time setup, then
rarely touched.

**Key flags:**
- `--create-namespace` — create the namespace if it doesn't exist
- `--set key=value` — override chart defaults
- `--wait` — block until pods are running
- `--timeout` — give up after this duration

**See what's installed:** `helm list --all-namespaces`
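
Instead of repeating `--set` flags, overrides can also live in a values file passed with `-f`. A sketch for the Redis release — these keys exist in recent bitnami/redis chart versions, but verify against the chart version you actually installed:

```yaml
# values-redis.yaml
# Use with: helm upgrade --install redis bitnami/redis -n documentstream -f values-redis.yaml
architecture: standalone   # single node instead of master + replicas
auth:
  enabled: false           # no password inside the cluster (demo only)
master:
  persistence:
    enabled: false         # ephemeral storage — acceptable for demo data
```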

### 4. Injecting Failures (Chaos Mesh)

**Command:** `kubectl apply -f k8s/chaos/pod-kill.yaml`

**What it does:** Creates a custom resource (an instance of a Chaos Mesh CRD). The
Chaos Mesh operator watches for it and executes the failure injection.

**Three experiments:**

| File | What it does | Duration |
|---|---|---|
| `pod-kill.yaml` | Kills 2 classify-worker pods | 30s |
| `network-delay.yaml` | Adds 500ms latency to store-worker | 2min |
| `cpu-stress.yaml` | Burns 80% CPU on classify-worker | 2min |

**Clean up:** Experiments auto-expire after their duration. Or:
`kubectl delete podchaos pod-kill-classify-worker -n documentstream`
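
For reference, `pod-kill.yaml` plausibly looks like the sketch below — the resource name, action, and duration come from this document, while the `mode`/`value` pair and the label selector are assumptions:

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-classify-worker
  namespace: documentstream
spec:
  action: pod-kill
  mode: fixed              # kill a fixed number of matching pods
  value: "2"               # the 2 pods from the table above
  duration: "30s"
  selector:
    namespaces:
      - documentstream
    labelSelectors:
      app: classify-worker  # assumed label
```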

### 5. Monitoring

**Grafana dashboard:** Port-forward, then open in the browser.
```bash
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
# http://localhost:3000 (admin / <password from secret>)
```

**Quick CLI checks:**
```bash
kubectl get all -n documentstream                           # Everything in your namespace
kubectl get hpa -n documentstream                           # KEDA scaling state
kubectl get pods -n documentstream -w                       # Watch pods in real-time
kubectl logs deployment/<name> -n documentstream --tail=30  # Check logs
kubectl describe pod <name> -n documentstream               # Detailed events
```

### 6. Load Testing

**Command:** `uv run locust -f locust/locustfile.py --host http://51.138.91.82`

**What it does:** Runs simulated users from your laptop hitting the cluster.
Web UI at http://localhost:8089.

**The upload task** (weight 3) drives the Redis pipeline — this is what triggers
KEDA scaling. The generate task (weight 1) runs synchronously in the gateway.
### 7. Resetting Demo Data

**Command:** `./infra/reset-demo.sh`

**What it does:** Truncates the PostgreSQL tables, deletes the Redis streams and
status hashes, and restarts the gateway to clear its in-memory store.

### 8. Building and Deploying New Images

```bash
# Build for AMD64 (required — AKS nodes are x86, the Mac is ARM)
docker build --platform linux/amd64 -t acrdocumentstream.azurecr.io/gateway:latest -f src/gateway/Dockerfile .
docker push acrdocumentstream.azurecr.io/gateway:latest

# Tell K8s to pull the new image (:latest defaults to imagePullPolicy: Always)
kubectl rollout restart deployment/gateway -n documentstream
```

### 9. Cluster Lifecycle

```bash
# Start cluster (3-8 min, then re-fetch credentials)
az aks start -g DocumentStream -n DocumentStreamManagedCluster
az aks get-credentials -n DocumentStreamManagedCluster -g DocumentStream --overwrite-existing

# Stop cluster (saves money — only disk costs remain)
az aks stop -g DocumentStream -n DocumentStreamManagedCluster

# Flush DNS if needed after restart (macOS)
sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder
```

---

## Data Flow

```
User/Locust
  │
  ▼
Ingress-nginx (51.138.91.82:80)
  │
  ▼
Gateway (FastAPI, 2 replicas)
  │
  ├── POST /api/documents → Redis stream "raw-docs" → return 202
  ├── POST /api/generate  → sync processing in gateway → return results
  ├── GET  /api/documents → read in-memory store → return list
  └── GET  /health        → return {"status":"healthy","mode":"async"}

  Redis stream: raw-docs
  │
  ▼
Extract Worker (KEDA-scaled 1-8)
  │  PyMuPDF: extract text from PDF
  │
  ▼
  Redis stream: extracted
  │
  ▼
Classify Worker (KEDA-scaled 1-8)
  │  Rule-based: privacy level (Public/Confidential/Secret)
  │  Semantic: environmental impact + industries (sentence-transformers)
  │
  ▼
  Redis stream: classified
  │
  ▼
Store Worker (KEDA-scaled 1-8)
  │
  ▼
PostgreSQL (metadata + embeddings + classifications)
```

---

## K8s Resource Types You're Using

| Resource | Your files | Purpose |
|---|---|---|
| **Namespace** | `namespace.yaml` | Logical boundary: `documentstream` |
| **ConfigMap** | `configmap.yaml` | Env vars: REDIS_URL, DATABASE_URL, stream names |
| **Deployment** | `gateway-deployment.yaml`, `*-deployment.yaml` | Manages pod replicas and rollouts |
| **Service** | `gateway-service.yaml`, `postgres-deployment.yaml` (contains Service) | Stable DNS name + load balancing to pods |
| **Ingress** | `ingress.yaml` | Routes external HTTP → gateway Service |
| **PersistentVolumeClaim** | `postgres-deployment.yaml` (contains PVC) | 1Gi disk for PostgreSQL data |
| **Kustomization** | `kustomization.yaml` | Lists all resources for `kubectl apply -k` |
| **ScaledObject** (CRD) | `k8s/scaling/*.yaml` | KEDA autoscaling rules per worker |
| **PodChaos** (CRD) | `k8s/chaos/pod-kill.yaml` | Chaos Mesh pod kill experiment |
| **NetworkChaos** (CRD) | `k8s/chaos/network-delay.yaml` | Chaos Mesh network delay experiment |
| **StressChaos** (CRD) | `k8s/chaos/cpu-stress.yaml` | Chaos Mesh CPU stress experiment |
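
Tying these together, `k8s/base/kustomization.yaml` is essentially a list of the files above — a sketch, with the worker deployment filenames left out since only `*-deployment.yaml` is known here:

```yaml
# k8s/base/kustomization.yaml (sketch — check the real file)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: documentstream
resources:
  - namespace.yaml
  - configmap.yaml
  - gateway-deployment.yaml
  - gateway-service.yaml
  - ingress.yaml
  - postgres-deployment.yaml
  # ...plus the worker *-deployment.yaml files
```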

---

## Common Patterns

### "I changed a YAML file, how do I apply it?"
```bash
kubectl apply -k k8s/base/     # for base manifests
kubectl apply -f k8s/scaling/  # for scaling rules
kubectl apply -f k8s/chaos/    # for chaos experiments
```

### "Something is broken, how do I debug?"
```bash
kubectl get pods -n documentstream                # What's the status?
kubectl logs deployment/<name> -n documentstream  # What's the error?
kubectl describe pod <name> -n documentstream     # Events (scheduling, pulling, OOM)
kubectl get events -n documentstream --sort-by=.metadata.creationTimestamp  # Recent events
```

### "I want to restart something cleanly"
```bash
kubectl rollout restart deployment/<name> -n documentstream
```

### "I want to see what K8s thinks the desired state is"
```bash
kubectl get deployment <name> -n documentstream -o yaml  # Full YAML from etcd
kubectl get deployment <name> -n documentstream -o jsonpath='{.spec.template.spec.containers[0].resources}'  # Specific field
```

### "I want to temporarily force a specific number of replicas"
```bash
kubectl scale deployment/<name> -n documentstream --replicas=3
```

### "I want to see what Helm installed"
```bash
helm list --all-namespaces
helm get values redis -n documentstream  # See config values for a release
```