Commit 5f3226f

johnmathews and claude committed
Add Azure Blob Storage integration with Prometheus metrics
- Upload PDFs to Azure Blob Storage from both gateway (generate) and store worker
- Add doc_type and blob_url columns to PostgreSQL schema
- Add Prometheus counters (blob uploads count + bytes by doc_type)
- Add /metrics endpoint to gateway, ServiceMonitor for kube-prometheus-stack
- Add 2 Grafana dashboard panels (blob count + size by type)
- Add Azurite emulator to docker-compose for local dev
- Move BLOB_CONNECTION_STRING to K8s Secret (gitignored), deployments use secretRef
- Add chaos experiments runbook (docs/chaos-experiments.md)
- 92 tests passing, lint clean

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 7486e6d commit 5f3226f

19 files changed

Lines changed: 767 additions & 31 deletions
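
Reviewer note on the Prometheus counters: the worker and gateway sources are not part of this excerpt, so the following is only a stdlib sketch of the two counters the commit message describes and of the text format a /metrics endpoint serves. The metric names come from the Grafana panels in this commit; the `LabeledCounter` class, the `record_upload` helper, and the `loan_application` label value are hypothetical stand-ins. The real code presumably uses a Prometheus client library.

```python
from collections import defaultdict


class LabeledCounter:
    """Toy counter keyed by a single doc_type label (hypothetical stand-in)."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(float)

    def inc(self, doc_type, amount=1.0):
        # Counters are monotonic: they may only go up.
        if amount < 0:
            raise ValueError("counters can only increase")
        self._values[doc_type] += amount

    def render(self):
        # Prometheus text exposition: HELP line, TYPE line, one sample per label set.
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        for doc_type in sorted(self._values):
            value = self._values[doc_type]
            lines.append(f'{self.name}{{doc_type="{doc_type}"}} {value}')
        return "\n".join(lines) + "\n"


# Metric names as queried by the new dashboard panels.
BLOB_UPLOADS = LabeledCounter(
    "documentstream_blob_uploads_total", "PDFs uploaded to blob storage"
)
BLOB_BYTES = LabeledCounter(
    "documentstream_blob_bytes_total", "Bytes uploaded to blob storage"
)


def record_upload(doc_type, pdf_bytes):
    """Hypothetical hook: call once after each successful blob upload."""
    BLOB_UPLOADS.inc(doc_type)
    BLOB_BYTES.inc(doc_type, len(pdf_bytes))
```

Calling `record_upload("loan_application", data)` after each upload and returning `BLOB_UPLOADS.render() + BLOB_BYTES.render()` from the metrics route would give Prometheus one count series and one byte series per document type.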

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -33,6 +33,7 @@ htmlcov/
 *.pem
 credentials.json
 *.secret
+k8s/base/secret.yaml

 # OS
 .DS_Store

CLAUDE.md

Lines changed: 13 additions & 9 deletions
@@ -20,7 +20,6 @@ loan documents.
 ### Implemented (Day 2)
 - **Queue:** Redis Streams (pipeline message broker)
 - **Database:** PostgreSQL with pgvector (metadata + embeddings)
-- **Blob storage:** Azure Blob Storage (original PDFs, optional)

 ### Implemented (Day 3)
 - **Autoscaling:** KEDA ScaledObjects (queue-depth based, YAMLs in k8s/scaling/)
@@ -29,24 +28,29 @@ loan documents.
 - **Load testing:** Locust (locust/locustfile.py)
 - **CI/CD:** GitHub Actions deploy workflow (.github/workflows/deploy.yml)

-### Not yet done (needs live AKS cluster)
-- Azure infra provisioning (scripts ready in infra/)
-- Build/push images to ACR and deploy to AKS
-- Import Grafana dashboard, apply KEDA/Chaos manifests
-- End-to-end demo rehearsal
+### Implemented (Day 4)
+- **Blob storage:** Azure Blob Storage (PDFs uploaded on generate, tracked in PostgreSQL)
+- **Custom metrics:** Prometheus counters for blob uploads (count + bytes by doc_type)
+- **Dashboard:** 9 Grafana panels (added blob count/size by doc type)
+- **ServiceMonitor:** Prometheus scrape config for kube-prometheus-stack
+
+### Not yet done
+- Apply Chaos Mesh experiments on live cluster
+- CI/CD deploy workflow needs AZURE_CREDENTIALS secret
+- Demo rehearsal

 ## Project Structure
 - `src/gateway/` — FastAPI API + web UI (dual-mode: sync or async via Redis)
 - `src/worker/` — Extract, classify, semantic, store modules + Redis queue + worker runners
 - `src/generator/` — PDF document generator (5 templates, CLI tool)
 - `demo_samples/` — One complete loan scenario (5 PDFs, committed to git for visibility)
-- `tests/` — All tests (83 tests)
-- `k8s/base/` — Kubernetes base manifests (9 files: namespace, configmap, deployments, service, ingress, kustomization)
+- `tests/` — All tests (92 tests)
+- `k8s/base/` — Kubernetes base manifests (10 files: namespace, configmap, deployments, service, ingress, kustomization, servicemonitor)
 - `k8s/scaling/` — KEDA ScaledObjects for extract, classify, store workers
 - `k8s/chaos/` — Chaos Mesh experiments (pod-kill, network-delay, cpu-stress)
 - `infra/` — Azure setup/teardown/helm-install scripts
 - `locust/` — Locust load test (locustfile.py)
-- `grafana/` — Grafana dashboard JSON (7 panels)
+- `grafana/` — Grafana dashboard JSON (9 panels)
 - `docs/` — Documentation (architecture, classification, demo guide, dictionary, implementation plan)
 - `journal/` — Development journal

docker-compose.yml

Lines changed: 12 additions & 0 deletions
@@ -49,15 +49,21 @@ services:
       context: .
       dockerfile: src/worker/Dockerfile
     command: ["uv", "run", "python", "-m", "worker.store_runner"]
+    ports:
+      - "9102:9102"
     environment:
       - PYTHONPATH=/app/src
       - REDIS_URL=redis://redis:6379
       - DATABASE_URL=postgresql://documentstream:documentstream@postgres:5432/documentstream
+      - BLOB_CONNECTION_STRING=DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==;BlobEndpoint=http://azurite:10000/devstoreaccount1;
+      - BLOB_CONTAINER=documents
     depends_on:
       redis:
         condition: service_started
       postgres:
         condition: service_healthy
+      azurite:
+        condition: service_started

   redis:
     image: redis:7-alpine
@@ -69,6 +75,12 @@ services:
       timeout: 5s
       retries: 3

+  azurite:
+    image: mcr.microsoft.com/azure-storage/azurite
+    ports:
+      - "10000:10000"
+    command: ["azurite-blob", "--blobHost", "0.0.0.0", "--blobPort", "10000"]
+
   postgres:
     image: pgvector/pgvector:pg16
     environment:
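
Reviewer note on BLOB_CONNECTION_STRING: the value added above is a semicolon-separated list of `Key=Value` pairs, and the account key itself ends in `=` padding, so a naive split on every `=` would corrupt it. A minimal sketch of a safe parse follows, assuming nothing about how the project actually reads the variable. `parse_connection_string` is a hypothetical helper and the short key in the usage note is a fake placeholder, not the emulator key.

```python
def parse_connection_string(conn):
    """Split an Azure-style connection string into a dict of its parts."""
    parts = {}
    for segment in conn.split(";"):
        if not segment:
            # A trailing ';' leaves an empty segment; skip it.
            continue
        # Split on the first '=' only, so base64 padding in the value survives.
        key, sep, value = segment.partition("=")
        if not sep:
            raise ValueError(f"malformed segment: {segment!r}")
        parts[key] = value
    return parts
```

For example, parsing `"DefaultEndpointsProtocol=http;AccountName=devstoreaccount1;AccountKey=ZmFrZQ==;BlobEndpoint=http://azurite:10000/devstoreaccount1;"` keeps `ZmFrZQ==` intact under `AccountKey`. A check like this can fail fast at startup when the secret is missing a field.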

docs/chaos-experiments.md

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
+# Chaos Engineering Experiments
+
+Chaos Mesh experiments for demonstrating Kubernetes resilience and self-healing. All experiment YAMLs are in
+`k8s/chaos/`.
+
+## Prerequisites
+
+- Chaos Mesh installed in the cluster (`helm install chaos-mesh` — already done)
+- Pipeline running with documents flowing through Redis Streams
+- Grafana dashboard open to observe the effects
+
+## Experiment 1: Pod Kill (Self-Healing)
+
+Kills classify-worker pods. K8s restarts them within seconds. Redis re-delivers unacknowledged messages — zero data loss,
+zero manual intervention.
+
+```bash
+# Watch pods (run in a separate terminal)
+kubectl get pods -w
+
+# Apply the experiment
+kubectl apply -f k8s/chaos/pod-kill.yaml
+
+# Check experiment status
+kubectl get podchaos
+
+# Clean up
+kubectl delete podchaos pod-kill-classify-worker
+```
+
+**What to watch for:**
+
+- Classify-worker pod disappears and a new one starts within seconds
+- Pod restarts counter increments (visible in Grafana "Pod Restarts" panel)
+- Pipeline continues processing after the new pod is ready
+
+## Experiment 2: Network Delay (Resilience)
+
+Injects 500ms latency (with 100ms jitter) on store-worker pods for 2 minutes. Simulates degraded connectivity to
+PostgreSQL or Azure Blob Storage.
+
+```bash
+# Apply the experiment
+kubectl apply -f k8s/chaos/network-delay.yaml
+
+# Check experiment status
+kubectl get networkchaos
+
+# Clean up (or wait 2 minutes for auto-expiry)
+kubectl delete networkchaos network-delay-store-worker
+```
+
+**What to watch for:**
+
+- Pipeline slows but does not break
+- Redis queue depth increases (store-worker processing is slower)
+- Grafana network I/O panel shows the latency effect
+- After expiry, throughput returns to normal
+
+## Experiment 3: CPU Stress (KEDA Autoscaling)
+
+Burns 80% CPU on classify-worker pods for 2 minutes. This is the most impressive experiment — it triggers KEDA
+autoscaling.
+
+```bash
+# Apply the experiment
+kubectl apply -f k8s/chaos/cpu-stress.yaml
+
+# Generate load so messages pile up in the queue
+# (use the dashboard "Generate" button, or repeat this curl)
+curl -X POST http://51.138.91.82/api/generate -H 'Content-Type: application/json' -d '{"count": 10}'
+
+# Watch KEDA scale up workers
+kubectl get pods -w
+
+# Check experiment status
+kubectl get stresschaos
+
+# Clean up (or wait 2 minutes for auto-expiry)
+kubectl delete stresschaos cpu-stress-classify-worker
+```
+
+**What to watch for:**
+
+- Classify-workers slow down, Redis queue depth rises
+- KEDA detects the lag and scales up additional classify-worker pods
+- New pods process the backlog, queue depth drops
+- After stress ends + 60s cooldown, KEDA scales back down to 1
+
+## Recommended Demo Order
+
+1. **Pod Kill** — quick (30s), shows self-healing
+2. **CPU Stress** — most visual (2min), shows KEDA autoscaling, generate load while it runs
+3. **Network Delay** — optional, shows resilience under degraded conditions
+
+## Cleaning Up All Experiments
+
+```bash
+kubectl delete podchaos,networkchaos,stresschaos --all
+```
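
Reviewer note on Experiment 1: the runbook's "zero data loss" claim depends on messages that were read but never acknowledged staying on record until another worker takes them over. The worker code is not shown in this commit, so the following is only a toy in-memory model of that consumer-group behaviour. `ToyStream` and all of its methods are hypothetical. In real Redis the corresponding commands are roughly XREADGROUP, XACK, and XCLAIM or XAUTOCLAIM, and whether the project's workers claim pending entries this way is an assumption.

```python
from collections import deque


class ToyStream:
    """In-memory model of a stream with a pending-entries list (illustrative only)."""

    def __init__(self):
        self._next_id = 1
        self._undelivered = deque()  # (msg_id, payload) not yet handed to anyone
        self._pending = {}           # msg_id -> (consumer, payload), read but not acked

    def add(self, payload):
        msg_id = self._next_id
        self._next_id += 1
        self._undelivered.append((msg_id, payload))
        return msg_id

    def read(self, consumer):
        """Hand the next new message to a consumer and record it as pending."""
        if not self._undelivered:
            return None
        msg_id, payload = self._undelivered.popleft()
        self._pending[msg_id] = (consumer, payload)
        return msg_id, payload

    def ack(self, msg_id):
        """Acknowledge a message; only now is it dropped from the pending list."""
        self._pending.pop(msg_id, None)

    def claim_pending(self, from_consumer, to_consumer):
        """Reassign everything a dead consumer left unacknowledged."""
        claimed = []
        for msg_id, (owner, payload) in sorted(self._pending.items()):
            if owner == from_consumer:
                self._pending[msg_id] = (to_consumer, payload)
                claimed.append((msg_id, payload))
        return claimed

    def pending_count(self):
        return len(self._pending)
```

If a worker reads a message and its pod is killed before `ack`, the entry stays in the pending list, and a replacement worker that calls `claim_pending` receives it again. The practical consequence is that processing should be idempotent, since a message can be handled more than once.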

docs/implementation-plan.md

Lines changed: 6 additions & 4 deletions
@@ -1,8 +1,8 @@
 # DocumentStream — Implementation Plan

-**Timeline:** 3 days (2026-03-28 to 2026-03-30)
+**Timeline:** 3 days (2026-03-28 to 2026-03-30) + Day 4 enhancements
 **Interview:** After Day 3
-**Last updated:** 2026-03-30
+**Last updated:** 2026-04-01

 ---

@@ -17,12 +17,14 @@
 | 3 | K8s manifests | MUST | 2-2.5h | DONE |
 | 4 | Build, push, deploy to AKS | MUST | 1-1.5h | DONE (pipeline running at 51.138.91.82) |
 | 5 | KEDA autoscaling | MUST | 1-1.5h | DONE (applied, verified scaling 1→8→1) |
-| 6 | Grafana dashboard | HIGH | 1.5-2h | DONE (imported, verified with live traffic) |
+| 6 | Grafana dashboard | HIGH | 1.5-2h | DONE (9 panels, including blob storage metrics) |
 | 7 | Chaos Mesh experiments | MEDIUM | 1h | PARTIAL (YAMLs written, Chaos Mesh installed, needs apply) |
 | 8 | Locust load testing | MEDIUM | 1h | DONE (ran against AKS, verified KEDA scaling) |
-| 9 | CI/CD deploy workflow | MEDIUM | 1h | DONE |
+| 9 | CI/CD deploy workflow | MEDIUM | 1h | PARTIAL (workflow written, needs AZURE_CREDENTIALS secret) |
 | 10 | Rolling update demo prep | LOW | 30min | TODO (live demo technique) |
 | 11 | Polish and demo rehearsal | MUST | 1.5-2h | TODO |
+| -- | **Day 4: Enhancements** | -- | -- | **DONE** |
+| 12 | Azure Blob Storage integration | HIGH | 2h | DONE (PDFs stored in Azure, metrics in Grafana) |

 **If time runs short:** Cut from the bottom. Stages 0-5 + 11 are non-negotiable. Stage 6 (Grafana) is
 the most important "nice to have" because it's the visual centerpiece of the demo. Stages 7-10 can
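
Reviewer note on the "1→8→1" scaling recorded for stage 5: with a queue-depth trigger, the replica count follows the usual Horizontal Pod Autoscaler rule for an average-value target, which is the backlog divided by the per-replica target, rounded up and clamped to the configured bounds. The sketch below illustrates that arithmetic only. The bounds of 1 and 8 are inferred from the table row, the target of 5 messages per replica is a made-up value, and the real ScaledObjects in k8s/scaling/ are not shown in this commit. Tolerance and stabilization windows are ignored.

```python
import math


def desired_replicas(queue_depth, target_per_replica, min_replicas=1, max_replicas=8):
    """Approximate replica count for a queue-depth based autoscaler."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    # Enough replicas that each one handles at most target_per_replica messages.
    raw = math.ceil(queue_depth / target_per_replica)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, raw))
```

With the hypothetical target of 5, a backlog of 12 messages asks for 3 replicas, a backlog of 100 is capped at 8, and an empty queue settles back at the floor of 1.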

grafana/documentstream-dashboard.json

Lines changed: 114 additions & 0 deletions
@@ -565,6 +565,120 @@
       "title": "Network I/O per Pod",
       "type": "timeseries"
     }
+    ,
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "prometheus"
+      },
+      "description": "Total number of PDFs uploaded to Azure Blob Storage, broken down by document type.",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic-by-name"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null }
+            ]
+          }
+        },
+        "overrides": []
+      },
+      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 24 },
+      "id": 8,
+      "options": {
+        "displayMode": "gradient",
+        "minVizHeight": 16,
+        "minVizWidth": 8,
+        "namePlacement": "auto",
+        "orientation": "horizontal",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "showUnfilled": true,
+        "sizing": "auto",
+        "valueMode": "color"
+      },
+      "pluginVersion": "10.4.0",
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "documentstream_blob_uploads_total",
+          "instant": true,
+          "legendFormat": "{{doc_type}}",
+          "range": false,
+          "refId": "A"
+        }
+      ],
+      "title": "Blob Storage — PDF Count by Type",
+      "type": "bargauge"
+    },
+    {
+      "datasource": {
+        "type": "prometheus",
+        "uid": "prometheus"
+      },
+      "description": "Total bytes uploaded to Azure Blob Storage, broken down by document type.",
+      "fieldConfig": {
+        "defaults": {
+          "color": {
+            "mode": "palette-classic-by-name"
+          },
+          "mappings": [],
+          "thresholds": {
+            "mode": "absolute",
+            "steps": [
+              { "color": "green", "value": null }
+            ]
+          },
+          "unit": "bytes"
+        },
+        "overrides": []
+      },
+      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 24 },
+      "id": 9,
+      "options": {
+        "displayMode": "gradient",
+        "minVizHeight": 16,
+        "minVizWidth": 8,
+        "namePlacement": "auto",
+        "orientation": "horizontal",
+        "reduceOptions": {
+          "calcs": ["lastNotNull"],
+          "fields": "",
+          "values": false
+        },
+        "showUnfilled": true,
+        "sizing": "auto",
+        "valueMode": "color"
+      },
+      "pluginVersion": "10.4.0",
+      "targets": [
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "editorMode": "code",
+          "expr": "documentstream_blob_bytes_total",
+          "instant": true,
+          "legendFormat": "{{doc_type}}",
+          "range": false,
+          "refId": "A"
+        }
+      ],
+      "title": "Blob Storage — Total Size by Type",
+      "type": "bargauge"
+    }
   ],
   "refresh": "5s",
   "schemaVersion": 39,

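
Reviewer note on the dashboard panels: the two new panels (ids 8 and 9) sit side by side at y=24, each 12 columns wide, which fills Grafana's 24-column grid exactly. When panels are added to the JSON by hand, a small check like the one below can catch layout slips before import. It is a sketch only. `layout_problems` and `overlaps` are hypothetical helpers, and the check looks only at the `id` and `gridPos` fields.

```python
GRID_COLUMNS = 24  # Grafana dashboards are laid out on a 24-column grid


def overlaps(a, b):
    """True if two gridPos rectangles share any area."""
    return (
        a["x"] < b["x"] + b["w"]
        and b["x"] < a["x"] + a["w"]
        and a["y"] < b["y"] + b["h"]
        and b["y"] < a["y"] + a["h"]
    )


def layout_problems(panels):
    """Return human-readable layout problems for a list of panel dicts."""
    problems = []
    for panel in panels:
        pos = panel["gridPos"]
        if pos["x"] < 0 or pos["x"] + pos["w"] > GRID_COLUMNS:
            problems.append(f"panel {panel['id']} exceeds the {GRID_COLUMNS}-column grid")
    for i, first in enumerate(panels):
        for second in panels[i + 1:]:
            if overlaps(first["gridPos"], second["gridPos"]):
                problems.append(f"panels {first['id']} and {second['id']} overlap")
    return problems
```

Run against the `gridPos` values of the two panels added in this commit, the function returns an empty list. Loading the full dashboard with `json.load` and passing its `panels` array would extend the check to all nine panels.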