# DocumentStream

A Kubernetes-native document processing pipeline for commercial real estate loan documents.
Demonstrates production K8s patterns, CI/CD, event-driven autoscaling, and data engineering on Azure.

## What It Does

DocumentStream ingests PDF loan documents, extracts text, and classifies them across multiple
dimensions using two complementary approaches:

- **Rule-based classification** -- weighted keyword scoring for privacy levels (Public / Confidential / Secret)
- **Semantic classification** -- sentence-transformer embeddings for environmental impact, industry sectors, and contextual privacy

Documents flow through a pipeline: Upload -> Extract -> Classify -> Store. Currently runs
synchronously via FastAPI; the target architecture uses Redis Streams with KEDA-scaled
Kubernetes workers for each stage.

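In sketch form, the synchronous flow is just function composition over the stages. The function names and bodies below are illustrative placeholders, not DocumentStream's actual modules:

```python
# Illustrative sketch of the synchronous pipeline stages.
# All bodies are toy stand-ins for the real implementations.

def extract(pdf_bytes: bytes) -> str:
    # Real pipeline: PyMuPDF text extraction.
    return pdf_bytes.decode("utf-8", errors="ignore")

def classify(text: str) -> dict:
    # Real pipeline: weighted keywords + embedding similarity.
    level = "Confidential" if "kyc" in text.lower() else "Public"
    return {"classification": level}

def store(result: dict) -> dict:
    # Real pipeline: in-memory today, PostgreSQL + Blob Storage later.
    return result

def process(pdf_bytes: bytes) -> dict:
    return store(classify(extract(pdf_bytes)))

print(process(b"KYC report for client ..."))  # → {'classification': 'Confidential'}
```

In the target architecture each of these functions becomes its own KEDA-scaled worker reading from a Redis Stream instead of being called in-process.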
## Quick Start

```bash
# Prerequisites: Python 3.13, uv (https://docs.astral.sh/uv/)

# Install dependencies
uv sync

# Run tests
make test

# Start local dev stack (gateway + Redis + PostgreSQL)
make dev

# Open the web UI
open http://localhost:8000
```

The web UI shows a dashboard with document stats, classification results, and a file upload form.

### Generate Test Documents

```bash
# Generate 10 loan scenarios (50 PDFs) into generated_docs/
make generate

# Or look at the committed samples
ls demo_samples/CRE-729976/
```

Each loan scenario produces 5 linked PDFs sharing the same client, property, and loan data:
loan application, valuation report, KYC report, contract, and invoice.

## API

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Web UI dashboard |
| `/health` | GET | Liveness probe (status, version, timestamp) |
| `/api/documents` | POST | Upload a PDF for processing |
| `/api/documents` | GET | List documents (filter by classification, limit 1-500) |
| `/api/documents/{id}` | GET | Get a specific document's results |
| `/api/generate` | POST | Generate N loan scenarios for demo/load testing |

### Example: Upload and Classify a Document

```bash
curl -X POST http://localhost:8000/api/documents \
  -F "file=@demo_samples/CRE-729976/kyc_report.pdf"
```

Response includes both rule-based and semantic classification:
```json
{
  "document_id": "...",
  "classification": "Secret",
  "confidence": 0.85,
  "semantic_privacy": "Secret",
  "environmental_impact": "Low",
  "industries": ["Financial Services", "Real Estate"]
}
```

## Classification

### Rule-Based (Privacy)

Weighted keyword scoring assigns a privacy level with confidence and explainability.
Each keyword has a weight (e.g., "KYC" = 4.0, "due diligence" = 3.5). The classifier
returns matched keywords and per-level scores, making decisions auditable.

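A minimal sketch of the weighted-keyword idea, assuming a toy keyword table (the levels here are examples; the real table lives in the worker module):

```python
# Toy weighted keyword scorer: sums keyword weights per privacy level
# and reports the matches, mirroring the explainability described above.
# Keywords and weights are illustrative, not the project's real table.

KEYWORDS = {
    "Secret": {"kyc": 4.0, "due diligence": 3.5},
    "Confidential": {"loan application": 2.0, "valuation": 1.5},
}

def classify_privacy(text: str) -> dict:
    text = text.lower()
    scores, matched = {}, {}
    for level, kws in KEYWORDS.items():
        hits = {kw: w for kw, w in kws.items() if kw in text}
        scores[level] = sum(hits.values())
        matched[level] = sorted(hits)
    best = max(scores, key=scores.get)
    if scores[best] == 0:  # nothing matched -> default level
        return {"classification": "Public", "confidence": 1.0,
                "matched_keywords": [], "scores": scores}
    return {
        "classification": best,
        "confidence": scores[best] / sum(scores.values()),
        "matched_keywords": matched[best],
        "scores": scores,
    }

# classification: 'Secret', matched: ['due diligence', 'kyc']
print(classify_privacy("KYC report and due diligence summary"))
```

Returning the matched keywords alongside the per-level scores is what makes each decision auditable after the fact.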
### Semantic (Environmental Impact + Industries)

Uses `all-MiniLM-L6-v2` (384-dim embeddings) with descriptive anchor texts -- not keyword
lists. This captures meaning: "textile dyeing facility" matches industrial contamination
risk even without the word "contamination" appearing anywhere.

Returns multi-label industry classifications (threshold 0.15) and environmental impact
ratings (None / Low / Medium / High). The document embedding is stored for later
pgvector semantic search.

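The anchor mechanics can be sketched without the model: embed each descriptive anchor once, then keep every label whose cosine similarity to the document embedding clears the threshold. Here 3-dim toy vectors stand in for the real 384-dim `all-MiniLM-L6-v2` embeddings, and the anchor values are made up for illustration:

```python
# Anchor-based multi-label classification sketch. Toy 3-dim vectors
# stand in for real sentence-transformer embeddings so the mechanics
# are visible without the model.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Descriptive anchors -> (toy) embedding vectors.
ANCHORS = {
    "Financial Services": [0.9, 0.1, -0.2],
    "Real Estate": [0.7, 0.6, 0.1],
    "Manufacturing": [-0.2, -0.1, 0.95],
}
THRESHOLD = 0.15  # multi-label cutoff from the section above

def industries(doc_embedding):
    return sorted(
        label for label, anchor in ANCHORS.items()
        if cosine(doc_embedding, anchor) >= THRESHOLD
    )

print(industries([0.8, 0.5, 0.1]))  # → ['Financial Services', 'Real Estate']
```

In the target architecture the stored document embedding also feeds pgvector similarity search (e.g. via its `<=>` cosine-distance operator).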
See [docs/classification.md](docs/classification.md) for the full deep dive.

## Project Structure

```
src/
  gateway/       FastAPI API + web UI + Dockerfile
  worker/        Extract, classify, and semantic modules
  generator/     PDF document generator (5 templates, CLI)
tests/           51 pytest tests
docs/            Architecture, classification, demo guide, dictionary
demo_samples/    One committed loan scenario (5 PDFs)
k8s/             Kubernetes manifests (base, scaling, chaos)
infra/           Azure setup/teardown scripts
locust/          Load testing
grafana/         Dashboard JSON
journal/         Development journal
```

## Commands

| Command | Description |
|---|---|
| `make test` | Run pytest |
| `make test-cov` | Run tests + HTML coverage report |
| `make lint` | Ruff check + format check |
| `make lint-fix` | Auto-fix lint issues |
| `make generate` | Generate 10 scenarios (50 PDFs) |
| `make demo-samples` | Regenerate `demo_samples/` with one fresh scenario |
| `make dev` | Start docker-compose (gateway, Redis, PostgreSQL) |
| `make dev-down` | Tear down docker-compose |
| `make clean` | Remove build artifacts and caches |

## Architecture

### Current (Synchronous)

```
PDF Upload --> FastAPI Gateway --> Extract (PyMuPDF)
                                   --> Classify (rules + semantic)
                                   --> Return results (in-memory)
```

### Target (Kubernetes + Redis Streams)

```
PDF Upload
    |
    v
Redis:raw-docs --> Extract Workers (PyMuPDF)
                    |
                    v
Redis:extracted --> Classify Workers (rules + semantic)
                    |
                    v
Redis:classified --> Store Workers --> PostgreSQL (pgvector)
                                   --> Azure Blob Storage
```

Each stage runs as a separate K8s Deployment. KEDA monitors Redis Stream consumer group
lag and scales workers based on queue depth. See [docs/architecture.md](docs/architecture.md)
for the full design.

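As a sketch, a KEDA `ScaledObject` for one stage might look like the following. Resource names, the stream name, and the threshold are illustrative assumptions, not values taken from the `k8s/` manifests:

```yaml
# Hypothetical KEDA ScaledObject for the classify stage.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: classify-worker-scaler
spec:
  scaleTargetRef:
    name: classify-worker            # the stage's Deployment
  minReplicaCount: 0                 # scale to zero when idle
  maxReplicaCount: 10
  triggers:
    - type: redis-streams
      metadata:
        address: redis:6379
        stream: extracted            # the stage's input stream
        consumerGroup: classify-workers
        pendingEntriesCount: "10"    # scale out as lag grows past this
```

Scaling on consumer-group lag is what lets each stage grow with its actual backlog rather than with CPU.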
## CI/CD

**GitHub Actions workflows:**

- **ci.yml** -- Lint (ruff) + test (pytest with coverage) on every push and PR
- **docker.yml** -- Build and push Docker image to `ghcr.io/johnmathews/documentstream` on push to main

## Key Design Decisions

| Decision | Rationale |
|---|---|
| Redis Streams over Pub/Sub | At-least-once delivery with consumer group acknowledgment; crash-safe |
| KEDA over HPA | Scale on queue depth (actual work), not CPU (misleading for queue workers) |
| Two classifiers | Rules for structured dimensions, semantic for contextual ones |
| pgvector over dedicated vector DB | Keeps architecture simple; Azure PostgreSQL Flexible Server supports it natively |
| Descriptive anchors (not keyword lists) | Embedding model captures meaning, not just string matches |
| Local sentence-transformers | No API dependency, runs anywhere, free |

## Documentation

- [Architecture](docs/architecture.md) -- System design, pipeline flow, K8s target
- [Classification](docs/classification.md) -- Rule-based vs semantic approaches in detail
- [Demo Guide](docs/demo-guide.md) -- Step-by-step demo script with talking points
- [Dictionary](docs/dictionary.md) -- K8s, Azure, and KEDA concepts

## Stack

Python 3.13, FastAPI, PyMuPDF, fpdf2, Faker, sentence-transformers, Redis, PostgreSQL (pgvector),
Docker, GitHub Actions, uv, pytest, ruff. Target: AKS, KEDA, Prometheus, Grafana, Chaos Mesh.