Timeline: 3 days (2026-03-28 to 2026-03-30) + Day 4 enhancements. Interview: after Day 3. Last updated: 2026-04-01.
| Stage | What | Priority | Est. | Status |
|---|---|---|---|---|
| -- | Day 1: Foundation | -- | -- | DONE |
| 0 | Tool setup (helm, kustomize) | MUST | 15min | DONE |
| 1 | Azure infrastructure | MUST | 1.5-2h | DONE (AKS + ACR + Helm charts running) |
| 2 | Redis Streams pipeline refactor | MUST | 2.5-3h | DONE |
| 3 | K8s manifests | MUST | 2-2.5h | DONE |
| 4 | Build, push, deploy to AKS | MUST | 1-1.5h | DONE (pipeline running at 51.138.91.82) |
| 5 | KEDA autoscaling | MUST | 1-1.5h | DONE (applied, verified scaling 1→8→1) |
| 6 | Grafana dashboard | HIGH | 1.5-2h | DONE (9 panels, including blob storage metrics) |
| 7 | Chaos Mesh experiments | MEDIUM | 1h | DONE (all 3 experiments applied and verified on live cluster, containerd fix applied) |
| 8 | Locust load testing | MEDIUM | 1h | DONE (ran against AKS, verified KEDA scaling) |
| 9 | CI/CD deploy workflow | MEDIUM | 1h | PARTIAL (workflow written, needs AZURE_CREDENTIALS secret) |
| 10 | Rolling update demo prep | LOW | 30min | DONE (verified on gateway deployment, rollback tested) |
| 11 | Polish and demo rehearsal | MUST | 1.5-2h | DONE (full demo rehearsal completed 2026-04-02) |
| -- | Day 4: Enhancements | -- | -- | DONE |
| 12 | Azure Blob Storage integration | HIGH | 2h | DONE (PDFs stored in Azure, metrics in Grafana) |
| 13 | Switch to ONNX Runtime | MEDIUM | 15min | DONE (direct ONNX Runtime, no PyTorch/sentence-transformers, image ~190MB vs ~3GB) |
If time runs short: Cut from the bottom. Stages 0-5 + 11 are non-negotiable. Stage 6 (Grafana) is
the most important "nice to have" because it's the visual centerpiece of the demo. Stages 7-10 can
be done live during the interview with kubectl apply if needed.
Everything below is built and working.
| Component | Files | Status |
|---|---|---|
| FastAPI gateway (synchronous) | src/gateway/app.py, src/gateway/templates/index.html | DONE |
| PDF generator (5 templates) | src/generator/scenario.py, templates.py, generate.py | DONE |
| Text extractor | src/worker/extract.py | DONE |
| Rule-based classifier | src/worker/classify.py | DONE |
| Semantic classifier | src/worker/semantic.py | DONE |
| Tests (51 passing, 94% coverage) | tests/test_*.py, tests/conftest.py | DONE |
| Gateway Dockerfile | src/gateway/Dockerfile | DONE |
| Docker Compose (gateway + redis + postgres) | docker-compose.yml | DONE |
| CI workflow (lint + test) | .github/workflows/ci.yml | DONE |
| Docker build+push to ghcr.io | .github/workflows/docker.yml | DONE |
| Documentation | docs/architecture.md, classification.md, demo-guide.md, dictionary.md | DONE |
| Demo samples (1 loan, 5 PDFs) | demo_samples/CRE-729976/ | DONE |
| README | README.md | DONE |
Current architecture: PDF upload -> FastAPI -> extract_text() -> classify_text() + classify_semantic() -> return results. All synchronous, all in-memory. Redis and PostgreSQL containers exist in docker-compose but the gateway doesn't connect to them yet.
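In code, the synchronous path amounts to roughly this (a simplified sketch, not the literal handler; import paths follow the worker module layout above, and the function signatures are assumptions):

```python
# Simplified sketch of the current synchronous flow in src/gateway/app.py.
from fastapi import FastAPI, UploadFile

from worker.extract import extract_text        # signatures assumed
from worker.classify import classify_text
from worker.semantic import classify_semantic

app = FastAPI()

@app.post("/api/documents")
async def upload(file: UploadFile):
    pdf_bytes = await file.read()
    text = extract_text(pdf_bytes)             # blocking, in-process
    return {
        "rule_based": classify_text(text),     # both classifiers run inline
        "semantic": classify_semantic(text),
    }
```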
| # | Task | Exit Criteria | Done |
|---|---|---|---|
| 0.1 | brew install helm kustomize | helm version and kustomize version succeed | [ ] |
Start this first -- AKS provisioning takes 5-10 minutes. Write Stage 2 code while it provisions.
| # | Task | Files | Exit Criteria | Done |
|---|---|---|---|---|
| 1.1 | Write Azure setup script | infra/setup.sh | Creates resource group, ACR (Basic), AKS (3x Standard_B2ms, Free tier), PostgreSQL Flexible (Burstable B1ms, pg16), Storage Account (Standard_LRS) + blob container | [ ] |
| 1.2 | Write teardown script | infra/teardown.sh | az group delete for full teardown. Separate stop() and start() functions for cost management | [ ] |
| 1.3 | Write Helm install script | infra/helm-install.sh | Installs 5 charts: ingress-nginx, kube-prometheus-stack, bitnami/redis, kedacore/keda, chaos-mesh. Each in its own namespace | [ ] |
| 1.4 | Run setup.sh | -- | kubectl get nodes shows 3 Ready nodes | [ ] |
| 1.5 | Run helm-install.sh | -- | All Helm releases show deployed status | [ ] |
Dependencies: None. This is the first thing to start.
This is the highest-risk stage. The gateway currently calls worker functions directly
(src/gateway/app.py lines 107-113). This must become: gateway publishes to Redis,
workers consume and process independently.
```
raw-docs    {doc_id, filename, pdf_b64}                  -> extract-group
extracted   {doc_id, filename, text, page_count,         -> classify-group
             word_count, pdf_b64}
classified  {doc_id, filename, text, pdf_b64,            -> store-group
             classification, confidence,
             semantic_privacy, environmental_impact,
             industries, embedding}
```
Redis hash doc:{doc_id} tracks status for gateway polling (queued -> extracting -> classifying -> completed).
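A minimal sketch of the queue utility described in task 2.1 below, assuming the redis-py client; the function names mirror the exit criteria, but exact signatures are illustrative:

```python
# Sketch of src/worker/queue.py (task 2.1), assuming redis-py.
import os
import redis

r = redis.Redis.from_url(
    os.environ.get("REDIS_URL", "redis://localhost:6379"),
    decode_responses=True,
)

def publish(stream: str, fields: dict) -> str:
    """XADD a message; PDF bytes go in as base64-encoded strings."""
    return r.xadd(stream, fields)

def consume(stream: str, group: str, consumer: str, block_ms: int = 5000):
    """XREADGROUP one message, auto-creating the group with MKSTREAM."""
    try:
        r.xgroup_create(stream, group, id="0", mkstream=True)
    except redis.exceptions.ResponseError:
        pass  # BUSYGROUP: group already exists
    resp = r.xreadgroup(group, consumer, {stream: ">"}, count=1, block=block_ms)
    if not resp:
        return None
    _, messages = resp[0]
    return messages[0]  # (message_id, {field: value})

def ack(stream: str, group: str, message_id: str) -> None:
    r.xack(stream, group, message_id)

def set_doc_status(doc_id: str, status: str) -> None:
    # queued -> extracting -> classifying -> completed
    r.hset(f"doc:{doc_id}", "status", status)

def get_doc_status(doc_id: str) -> dict:
    return r.hgetall(f"doc:{doc_id}")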
| # | Task | Files | Exit Criteria | Done |
|---|---|---|---|---|
| 2.1 | Redis Streams utility module | src/worker/queue.py | publish(), consume(), ack(), set_doc_status(), get_doc_status(). Consumer group auto-creation with MKSTREAM. Configurable via env vars (REDIS_URL, stream names). PDF bytes base64-encoded. | [ ] |
| 2.2 | Refactor gateway to dual-mode | src/gateway/app.py | If REDIS_URL is set: upload publishes to raw-docs, returns status: queued; list/get read from the Redis hash. If not set: existing synchronous behavior unchanged. | [ ] |
| 2.3 | Extract worker runner | src/worker/extract_runner.py | Infinite loop: XREADGROUP from raw-docs, call extract_text(), publish to extracted, XACK. SIGTERM graceful shutdown. | [ ] |
| 2.4 | Classify worker runner | src/worker/classify_runner.py | XREADGROUP from extracted, call classify_text() + classify_semantic(), publish to classified, XACK. | [ ] |
| 2.5 | Store worker | src/worker/store.py | PostgreSQL insertion: metadata + classification + vector(384) via pgvector. Optional Azure Blob upload for original PDF. | [ ] |
| 2.6 | Store worker runner | src/worker/store_runner.py | XREADGROUP from classified, call store logic, update status to completed, XACK. | [ ] |
| 2.7 | Database schema | src/worker/schema.sql | CREATE EXTENSION IF NOT EXISTS vector; CREATE TABLE documents(...) with pgvector column. | [ ] |
| 2.8 | Worker Dockerfile | src/worker/Dockerfile | Single image (python:3.13-slim + uv), CMD overridden per K8s deployment. | [ ] |
| 2.9 | Update docker-compose for local pipeline test | docker-compose.yml | Add extract-worker, classify-worker, store-worker services. docker compose up -> upload PDF -> document lands in PostgreSQL. | [ ] |
| 2.10 | Tests for queue module | tests/test_queue.py | Test publish/consume/ack with mocked Redis. | [ ] |
| 2.11 | Verify existing tests still pass | -- | All 51 tests pass (sync fallback preserved). | [ ] |
Dependencies: None for code (2.1-2.8, 2.10-2.11). Task 2.9 needs docker-compose running. Stage 1 is NOT required -- local testing uses docker-compose.
Key design decision: Dual-mode gateway preserves all existing tests. No test refactoring needed.
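For task 2.3, the runner loop might look like the following sketch; it reuses the queue helpers sketched above, and the extractor's signature (taking the base64 payload directly) is an assumption:

```python
# Sketch of src/worker/extract_runner.py (task 2.3): consume raw-docs,
# extract text, publish to extracted, then ack.
import signal

from worker import queue
from worker.extract import extract_text  # existing extractor, signature assumed

RUNNING = True

def _shutdown(signum, frame):
    # Finish the in-flight message, then exit; pairs with
    # terminationGracePeriodSeconds: 30 in the worker Deployments.
    global RUNNING
    RUNNING = False

signal.signal(signal.SIGTERM, _shutdown)

def main():
    while RUNNING:
        msg = queue.consume("raw-docs", "extract-group", consumer="extract-1")
        if msg is None:
            continue  # blocked read timed out; loop again
        msg_id, fields = msg
        queue.set_doc_status(fields["doc_id"], "extracting")
        # Simplified: the real runner also adds page_count and word_count.
        fields["text"] = extract_text(fields["pdf_b64"])
        queue.publish("extracted", fields)
        queue.ack("raw-docs", "extract-group", msg_id)

if __name__ == "__main__":
    main()
```

Unacked messages stay in the pending entries list, so a killed pod's work is re-delivered to another consumer — the property the Chaos Mesh pod-kill demo relies on.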
| # | Task | Files | Exit Criteria | Done |
|---|---|---|---|---|
| 3.1 | Namespace | k8s/base/namespace.yaml | Namespace documentstream | [ ] |
| 3.2 | ConfigMap | k8s/base/configmap.yaml | REDIS_URL, DATABASE_URL, stream names | [ ] |
| 3.3 | Gateway Deployment + Service | k8s/base/gateway-deployment.yaml, k8s/base/gateway-service.yaml | 2 replicas, 128Mi-256Mi mem, 100m-250m CPU, liveness+readiness on /health, ClusterIP port 8000 | [ ] |
| 3.4 | Extract worker Deployment | k8s/base/extract-deployment.yaml | 1 replica, command: python -m worker.extract_runner | [ ] |
| 3.5 | Classify worker Deployment | k8s/base/classify-deployment.yaml | 1 replica, 512Mi mem (sentence-transformers model ~80MB, process ~300-400MB total) | [ ] |
| 3.6 | Store worker Deployment | k8s/base/store-deployment.yaml | 1 replica, env includes database secret | [ ] |
| 3.7 | Ingress | k8s/base/ingress.yaml | NGINX ingress routing to gateway service | [ ] |
| 3.8 | Kustomization | k8s/base/kustomization.yaml | Lists all resources, common labels | [ ] |
Manifest patterns to demonstrate (illustrated in the sketch after this list):
- Resource requests AND limits on every container
- terminationGracePeriodSeconds: 30 on workers (finish in-flight messages before SIGTERM)
- imagePullPolicy: Always
- app label on all pods (used by the KEDA selector and Chaos Mesh targeting)
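A hypothetical fragment of the classify-worker Deployment showing all four patterns; the image name and CPU figures are illustrative, the memory values come from task 3.5 and the OOM mitigation in the risk table:

```yaml
# Sketch of k8s/base/classify-deployment.yaml (task 3.5).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: classify-worker
  namespace: documentstream
spec:
  replicas: 1
  selector:
    matchLabels:
      app: classify-worker
  template:
    metadata:
      labels:
        app: classify-worker            # KEDA selector and Chaos Mesh targeting
    spec:
      terminationGracePeriodSeconds: 30 # finish in-flight messages before SIGTERM
      containers:
        - name: classify-worker
          image: documentstreamacr.azurecr.io/worker:latest
          imagePullPolicy: Always
          command: ["python", "-m", "worker.classify_runner"]
          resources:
            requests:
              memory: "512Mi"           # per task 3.5
              cpu: "250m"
            limits:
              memory: "768Mi"           # per the sentence-transformers OOM mitigation
              cpu: "500m"
```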
Dependencies: Stage 2 code must be written (manifests reference worker runner commands and env vars).
| # | Task | How | Exit Criteria | Done |
|---|---|---|---|---|
| 4.1 | Build images to ACR | az acr build -r documentstreamacr -t gateway:latest and -t worker:latest | Images visible in ACR | [ ] |
| 4.2 | Create K8s secrets | kubectl create secret for PostgreSQL password and Blob connection string | Secret exists in cluster | [ ] |
| 4.3 | Initialize PostgreSQL schema | Run schema.sql against Azure PostgreSQL | documents table exists with vector extension | [ ] |
| 4.4 | Apply manifests | kubectl apply -k k8s/base/ | All pods Running, gateway /health returns 200 via port-forward | [ ] |
| 4.5 | End-to-end test on AKS | Upload PDF via /api/documents or hit /api/generate | Document flows through all stages, lands in PostgreSQL | [ ] |
Dependencies: Stages 1 (Azure running), 2 (code ready), 3 (manifests ready).
| # | Task | Files | Exit Criteria | Done |
|---|---|---|---|---|
| 5.1 | Extract ScaledObject | k8s/scaling/extract-scaledobject.yaml | Redis Streams scaler, pollingInterval: 15, cooldownPeriod: 60, min 1 / max 8, lagCount: "5" | [ ] |
| 5.2 | Classify ScaledObject | k8s/scaling/classify-scaledobject.yaml | Same pattern as 5.1 | [ ] |
| 5.3 | Store ScaledObject | k8s/scaling/store-scaledobject.yaml | Same pattern as 5.1 | [ ] |
| 5.4 | Verify scaling | -- | Generate 50+ docs. kubectl get pods -w shows workers scaling 1 -> 3+. After queue drains (60s), back to 1. | [ ] |
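A sketch of 5.1's ScaledObject using the parameters from the table; the Redis address is an assumption about the bitnami chart's default master Service name:

```yaml
# Sketch of k8s/scaling/extract-scaledobject.yaml (task 5.1).
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: extract-worker-scaler
  namespace: documentstream
spec:
  scaleTargetRef:
    name: extract-worker          # Deployment from Stage 3
  pollingInterval: 15             # seconds between lag checks
  cooldownPeriod: 60              # scale back down 60s after the queue drains
  minReplicaCount: 1
  maxReplicaCount: 8
  triggers:
    - type: redis-streams
      metadata:
        address: redis-master.redis.svc.cluster.local:6379  # assumed service name
        stream: raw-docs
        consumerGroup: extract-group
        lagCount: "5"             # target lag per replica
```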
Dependencies: Stage 4 (pipeline running on AKS).
Day 2 gate: Pipeline running on AKS with KEDA scaling visible.
| # | Task | Files | Exit Criteria | Done |
|---|---|---|---|---|
| 6.1 | Build dashboard from existing metrics | grafana/documentstream-dashboard.json | 6-8 panels using metrics already collected (no custom app instrumentation needed) | [ ] |
| 6.2 | Import dashboard into Grafana | -- | Dashboard visible in Grafana UI, all panels populated | [ ] |
Dashboard panels (all from existing Prometheus metrics):
| Panel | Metric Source |
|---|---|
| Pod count per deployment | kube_deployment_status_replicas (kube-state-metrics) |
| CPU usage per pod | container_cpu_usage_seconds_total (cAdvisor) |
| Memory usage per pod | container_memory_working_set_bytes (cAdvisor) |
| Redis stream lengths | Redis exporter (bitnami/redis chart) |
| Pod restarts (self-healing) | kube_pod_container_status_restarts_total |
| KEDA scaling decisions | keda_metrics_adapter_scaler_metrics_value |
Stretch: Add prometheus-client to gateway for custom metrics (request rate, latency histogram). Optional -- the dashboard above is sufficient for a compelling demo.
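If the stretch goal is picked up, a minimal sketch with prometheus-client might look like this (metric names and the handler shape are hypothetical; kube-prometheus-stack would still need a ServiceMonitor or scrape annotation to collect it):

```python
# Sketch of optional custom metrics in src/gateway/app.py.
from fastapi import FastAPI, UploadFile
from prometheus_client import Counter, Histogram, make_asgi_app

UPLOADS = Counter("documentstream_uploads_total", "PDF uploads received")
LATENCY = Histogram("documentstream_upload_seconds", "Upload handling latency")

app = FastAPI()
app.mount("/metrics", make_asgi_app())  # Prometheus scrapes this endpoint

@app.post("/api/documents")
async def upload(file: UploadFile):
    UPLOADS.inc()
    with LATENCY.time():
        ...  # existing upload handling
```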
Dependencies: Stage 4 (cluster running with metrics flowing).
| # | Task | Files | Demo Value | Done |
|---|---|---|---|---|
| 7.1 | Pod kill experiment | k8s/chaos/pod-kill.yaml | Kill 2 classify-worker pods. K8s restarts in seconds. Redis re-delivers unacked messages. Zero data loss. | [ ] |
| 7.2 | Network delay experiment | k8s/chaos/network-delay.yaml | 500ms latency on store-worker. Pipeline slows but doesn't break. Grafana shows latency spike. | [ ] |
| 7.3 | CPU stress experiment | k8s/chaos/cpu-stress.yaml | 80% CPU burn on classify-worker. KEDA scales up additional pods to compensate. | [ ] |
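A sketch of k8s/chaos/pod-kill.yaml for 7.1, assuming Chaos Mesh's PodChaos CRD and the app label applied in Stage 3:

```yaml
# Sketch of k8s/chaos/pod-kill.yaml (experiment 7.1).
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: classify-pod-kill
  namespace: documentstream
spec:
  action: pod-kill
  mode: fixed
  value: "2"                  # kill exactly 2 matching pods
  selector:
    namespaces:
      - documentstream
    labelSelectors:
      app: classify-worker    # app label from Stage 3
```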
Dependencies: Stage 5 (KEDA must be active to show compensating scale-up for 7.3).
| # | Task | Files | Exit Criteria | Done |
|---|---|---|---|---|
| 8.1 | Write load test scenarios | locust/locustfile.py | Two tasks: (1) upload pre-generated PDF, (2) hit /api/generate?count=1. Ramp 1 to 50 users. | [ ] |
| 8.2 | Run against AKS | -- | Locust UI shows request rate climbing. KEDA scaling visible in kubectl get pods -w. | [ ] |
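A sketch of the locustfile for 8.1; the sample PDF filename and the upload field name are assumptions about the gateway API:

```python
# Sketch of locust/locustfile.py (task 8.1).
from pathlib import Path

from locust import HttpUser, between, task

# One of the pre-generated demo PDFs; exact filename is hypothetical.
PDF_BYTES = Path("demo_samples/CRE-729976/sample.pdf").read_bytes()

class PipelineUser(HttpUser):
    wait_time = between(1, 3)

    @task(3)
    def upload_pdf(self):
        self.client.post(
            "/api/documents",
            files={"file": ("sample.pdf", PDF_BYTES, "application/pdf")},
        )

    @task(1)
    def generate_doc(self):
        self.client.get("/api/generate?count=1")
```

Run it with locust -f locust/locustfile.py --host http://51.138.91.82 and ramp to 50 users from the UI.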
Fallback: A bash for loop with curl is sufficient for the demo if Locust proves problematic.
Dependencies: Stage 4 (pipeline running on AKS).
| # | Task | Files | Exit Criteria | Done |
|---|---|---|---|---|
| 9.1 | Write deploy workflow | .github/workflows/deploy.yml | On push to main: az acr build for both images, kubectl apply -k k8s/base/. Requires AZURE_CREDENTIALS GitHub secret. | [ ] |
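A sketch of the workflow in 9.1; the ACR name matches task 4.1, while the resource group and cluster names are placeholders:

```yaml
# Sketch of .github/workflows/deploy.yml (task 9.1).
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/login@v2
        with:
          creds: ${{ secrets.AZURE_CREDENTIALS }}   # the secret noted as missing above
      - run: az acr build -r documentstreamacr -t gateway:latest -f src/gateway/Dockerfile .
      - run: az acr build -r documentstreamacr -t worker:latest -f src/worker/Dockerfile .
      - run: az aks get-credentials -g documentstream-rg -n documentstream-aks
      - run: kubectl apply -k k8s/base/
```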
Dependencies: Stage 4 (AKS running, images in ACR).
| # | Task | Exit Criteria | Done |
|---|---|---|---|
| 10.1 | Prepare bad image tag | Use nonexistent tag like worker:v999 to trigger ImagePullBackOff. Old pods keep serving (rolling update strategy). kubectl rollout undo restores health. | [ ] |
No new files. This is a live demo technique using existing manifests.
Dependencies: Stage 4 (pipeline running on AKS).
| # | Task | Exit Criteria | Done |
|---|---|---|---|
| 11.1 | Update docs to reflect actual deployment | docs/architecture.md, docs/demo-guide.md, README.md reflect real AKS state | [ ] |
| 11.2 | Dry-run full 8-minute demo | Run every demo step against live cluster with timing | [ ] |
| 11.3 | Prepare printable materials | Architecture diagram, cost breakdown, K8s concepts list | [ ] |
| 11.4 | Stop cluster for cost management | az aks stop + az postgres flexible-server stop. Document restart commands. | [ ] |
Dependencies: All previous stages complete (or explicitly cut).
| Time | Stage | Hours |
|---|---|---|
| Morning start | Stage 0: Tool setup | 0.25 |
| +15min | Stage 1: Azure infra (start AKS, write Stage 2 code while it provisions) | 1.75 |
| +2h | Stage 2: Redis Streams pipeline refactor | 2.5 |
| Break | -- | -- |
| Afternoon | Stage 3: K8s manifests | 2.0 |
| +2h | Stage 4: Build, push, deploy | 1.5 |
| +1.5h | Stage 5: KEDA autoscaling | 1.5 |
Day 2 gate: Pipeline running on AKS. KEDA scaling workers up and down under load.
| Time | Stage | Hours |
|---|---|---|
| Morning start | Stage 6: Grafana dashboard | 2.0 |
| +2h | Stage 7: Chaos Mesh experiments | 1.0 |
| +1h | Stage 8: Locust load testing | 1.0 |
| Break | -- | -- |
| Afternoon | Stage 9: CI/CD deploy workflow | 1.0 |
| +1h | Stage 10: Rolling update demo | 0.5 |
| +30min | Stage 11: Polish + demo rehearsal | 2.0 |
Day 3 gate: Full demo rehearsed end-to-end against live cluster. All docs updated.
| Risk | Mitigation | Fallback |
|---|---|---|
| AKS provisioning slow | Start Stage 1 FIRST. Write Stage 2 code in parallel. | Use Minikube locally for Day 1-2 dev. |
| Redis Streams integration bugs | Test locally with docker compose up before touching AKS. Keep the sync fallback. | Demo with synchronous mode if the pipeline isn't ready. |
| KEDA Redis scaler misconfigured | Check kubectl logs -n keda deployment/keda-operator. Consumer group must exist first. | Fall back to HPA with CPU-based scaling. |
| sentence-transformers OOM | Set memory limit to 768Mi or 1Gi. Model is ~80MB, process ~300-400MB total. | Use rule-based only, skip semantic. |
| Helm chart conflicts | Pin chart versions in helm-install.sh. | Install components one at a time. |
| Running out of time | Stages 0-5 + 11 are MUST. Cut 6-10 from the bottom. | Even without Grafana, kubectl get pods -w shows scaling live. |
Infrastructure (3):
infra/setup.sh, infra/teardown.sh, infra/helm-install.sh
Worker code (7):
src/worker/queue.py, src/worker/extract_runner.py, src/worker/classify_runner.py,
src/worker/store.py, src/worker/store_runner.py, src/worker/schema.sql, src/worker/Dockerfile
K8s base manifests (9):
k8s/base/namespace.yaml, k8s/base/configmap.yaml, k8s/base/gateway-deployment.yaml,
k8s/base/gateway-service.yaml, k8s/base/extract-deployment.yaml,
k8s/base/classify-deployment.yaml, k8s/base/store-deployment.yaml,
k8s/base/ingress.yaml, k8s/base/kustomization.yaml
K8s scaling (3):
k8s/scaling/extract-scaledobject.yaml, k8s/scaling/classify-scaledobject.yaml,
k8s/scaling/store-scaledobject.yaml
K8s chaos (3):
k8s/chaos/pod-kill.yaml, k8s/chaos/network-delay.yaml, k8s/chaos/cpu-stress.yaml
Observability (1):
grafana/documentstream-dashboard.json
Load testing (1):
locust/locustfile.py
CI/CD (1):
.github/workflows/deploy.yml
Tests (1):
tests/test_queue.py