# Chaos Engineering Experiments

Chaos Mesh experiments for demonstrating Kubernetes resilience and self-healing. All experiment YAMLs are in
`k8s/chaos/`.

## Prerequisites

- Chaos Mesh installed in the cluster (`helm install chaos-mesh` — already done)
- Pipeline running with documents flowing through Redis Streams
- Grafana dashboard open to observe the effects

## Experiment 1: Pod Kill (Self-Healing)

Kills classify-worker pods. K8s restarts them within seconds. Redis re-delivers unacknowledged messages — zero data loss,
zero manual intervention.

```bash
# Watch pods (run in a separate terminal)
kubectl get pods -w

# Apply the experiment
kubectl apply -f k8s/chaos/pod-kill.yaml

# Check experiment status
kubectl get podchaos

# Clean up
kubectl delete podchaos pod-kill-classify-worker
```
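
For reference, `k8s/chaos/pod-kill.yaml` likely looks something like the minimal sketch below; the label selector and `mode` are assumptions about how the Deployment is labelled, so check the file in the repo for the real values.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-classify-worker
spec:
  action: pod-kill          # delete the pod; the Deployment recreates it
  mode: one                 # kill one random matching pod per run (assumed)
  selector:
    labelSelectors:
      app: classify-worker  # assumed pod label for the classify-worker Deployment
```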

**What to watch for:**

- The classify-worker pod disappears and a new one starts within seconds
- The pod restart count increments (visible in the Grafana "Pod Restarts" panel)
- Pipeline continues processing after the new pod is ready

## Experiment 2: Network Delay (Resilience)

Injects 500ms of latency (with 100ms jitter) into store-worker pod traffic for 2 minutes. Simulates degraded connectivity
to PostgreSQL or Azure Blob Storage.

```bash
# Apply the experiment
kubectl apply -f k8s/chaos/network-delay.yaml

# Check experiment status
kubectl get networkchaos

# Clean up (or wait 2 minutes for auto-expiry)
kubectl delete networkchaos network-delay-store-worker
```
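
For reference, `k8s/chaos/network-delay.yaml` is expected to look roughly like this sketch (the label selector and `mode` are assumptions; the latency, jitter, and duration come from the description above):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay-store-worker
spec:
  action: delay
  mode: all                 # apply to every matching pod (assumed)
  selector:
    labelSelectors:
      app: store-worker     # assumed pod label for the store-worker Deployment
  delay:
    latency: "500ms"
    jitter: "100ms"
  duration: "2m"            # the experiment expires on its own after 2 minutes
```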

**What to watch for:**

- Pipeline slows but does not break
- Redis queue depth increases (store-worker processing is slower)
- Grafana network I/O panel shows the latency effect
- After expiry, throughput returns to normal

## Experiment 3: CPU Stress (KEDA Autoscaling)

Burns 80% CPU on classify-worker pods for 2 minutes. This is the most impressive experiment — it triggers KEDA
autoscaling.

```bash
# Apply the experiment
kubectl apply -f k8s/chaos/cpu-stress.yaml

# Generate load so messages pile up in the queue
# (use the dashboard "Generate" button, or repeat this curl)
curl -X POST http://51.138.91.82/api/generate -H 'Content-Type: application/json' -d '{"count": 10}'

# Watch KEDA scale up workers
kubectl get pods -w

# Check experiment status
kubectl get stresschaos

# Clean up (or wait 2 minutes for auto-expiry)
kubectl delete stresschaos cpu-stress-classify-worker
```
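
For reference, `k8s/chaos/cpu-stress.yaml` probably resembles this sketch (the label selector and stress worker count are assumptions; the 80% load and 2-minute duration come from the description above):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress-classify-worker
spec:
  mode: all                 # stress every matching pod (assumed)
  selector:
    labelSelectors:
      app: classify-worker  # assumed pod label for the classify-worker Deployment
  stressors:
    cpu:
      workers: 1            # stress workers per pod (assumed)
      load: 80              # target ~80% CPU load per worker
  duration: "2m"            # the experiment expires on its own after 2 minutes
```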

**What to watch for:**

- Classify-workers slow down and Redis queue depth rises
- KEDA detects the lag and scales up additional classify-worker pods (see the ScaledObject sketch after this list)
- New pods process the backlog, queue depth drops
- After the stress ends and the 60s cooldown elapses, KEDA scales back down to 1 replica

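
The scaling behaviour above is driven by a KEDA `ScaledObject` that scales the classify-worker Deployment on Redis Streams backlog. The real manifest is not shown in this document; the following is only an illustrative sketch, and the Redis address, stream name, consumer group, threshold, and max replica count are assumptions:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: classify-worker-scaler
spec:
  scaleTargetRef:
    name: classify-worker                # assumed Deployment name
  minReplicaCount: 1                     # matches the scale-back-to-1 behaviour above
  maxReplicaCount: 5                     # assumed upper bound
  cooldownPeriod: 60                     # 60s cooldown before scaling back down
  triggers:
    - type: redis-streams
      metadata:
        address: "redis:6379"            # assumed Redis service address
        stream: documents                # assumed stream name
        consumerGroup: classify-workers  # assumed consumer group
        pendingEntriesCount: "10"        # scale up once the pending backlog exceeds this (assumed)
```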
## Recommended Demo Order

1. **Pod Kill** — quick (30s), shows self-healing
2. **CPU Stress** — most visual (2min), shows KEDA autoscaling, generate load while it runs
3. **Network Delay** — optional, shows resilience under degraded conditions

## Cleaning Up All Experiments

```bash
kubectl delete podchaos,networkchaos,stresschaos --all
```