
Commit e1603ff

johnmathewsclaude committed
Add README.md with project overview, quick start, and API docs
Comprehensive README covering the full project: what it does, how to run it, API reference, classification approaches, architecture diagrams, design decisions, and links to detailed docs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 3a2554a commit e1603ff

2 files changed: 207 additions & 0 deletions

README.md (+190)
# DocumentStream

A Kubernetes-native document processing pipeline for commercial real estate loan documents.
Demonstrates production K8s patterns, CI/CD, event-driven autoscaling, and data engineering on Azure.

## What It Does

DocumentStream ingests PDF loan documents, extracts text, and classifies them across multiple
dimensions using two complementary approaches:

- **Rule-based classification** -- weighted keyword scoring for privacy levels (Public / Confidential / Secret)
- **Semantic classification** -- sentence-transformer embeddings for environmental impact, industry sectors, and contextual privacy

Documents flow through a pipeline: Upload -> Extract -> Classify -> Store. Currently runs
synchronously via FastAPI; the target architecture uses Redis Streams with KEDA-scaled
Kubernetes workers for each stage.
## Quick Start

```bash
# Prerequisites: Python 3.13, uv (https://docs.astral.sh/uv/)

# Install dependencies
uv sync

# Run tests
make test

# Start local dev stack (gateway + Redis + PostgreSQL)
make dev

# Open the web UI
open http://localhost:8000
```

The web UI shows a dashboard with document stats, classification results, and a file upload form.

### Generate Test Documents

```bash
# Generate 10 loan scenarios (50 PDFs) into generated_docs/
make generate

# Or look at the committed samples
ls demo_samples/CRE-729976/
```

Each loan scenario produces 5 linked PDFs sharing the same client, property, and loan data:
loan application, valuation report, KYC report, contract, and invoice.
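
The linkage between the five PDFs can be sketched as one shared record per scenario. This is an illustration only -- the field names are assumptions, and plain `random` stands in for the project's Faker-based generator:

```python
import random
from dataclasses import dataclass

# The five document types generated for every scenario (from the list above).
DOC_TYPES = ["loan_application", "valuation_report", "kyc_report", "contract", "invoice"]

@dataclass(frozen=True)
class LoanScenario:
    """One shared data record; all 5 PDFs in a scenario render from it."""
    loan_id: str
    client: str
    property_address: str
    amount: int

def make_scenario(rng: random.Random) -> LoanScenario:
    # Hypothetical fields; the real generator uses Faker for realistic data.
    return LoanScenario(
        loan_id=f"CRE-{rng.randint(100000, 999999)}",
        client=f"Client {rng.randint(1, 999)}",
        property_address=f"{rng.randint(1, 200)} Main St",
        amount=rng.randrange(500_000, 5_000_000, 50_000),
    )

def scenario_docs(scenario: LoanScenario) -> list[str]:
    # Every PDF path is keyed by the shared loan_id, which is what links them.
    return [f"{scenario.loan_id}/{doc}.pdf" for doc in DOC_TYPES]
```

Because every document derives from the same frozen record, fields like client name and loan amount stay consistent across all five PDFs in a scenario.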
## API

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Web UI dashboard |
| `/health` | GET | Liveness probe (status, version, timestamp) |
| `/api/documents` | POST | Upload a PDF for processing |
| `/api/documents` | GET | List documents (filter by classification, limit 1-500) |
| `/api/documents/{id}` | GET | Get a specific document's results |
| `/api/generate` | POST | Generate N loan scenarios for demo/load testing |

### Example: Upload and Classify a Document

```bash
curl -X POST http://localhost:8000/api/documents \
  -F "file=@demo_samples/CRE-729976/kyc_report.pdf"
```

Response includes both rule-based and semantic classification:

```json
{
  "document_id": "...",
  "classification": "Secret",
  "confidence": 0.85,
  "semantic_privacy": "Secret",
  "environmental_impact": "Low",
  "industries": ["Financial Services", "Real Estate"]
}
```
## Classification

### Rule-Based (Privacy)

Weighted keyword scoring assigns a privacy level with confidence and explainability.
Each keyword has a weight (e.g., "KYC" = 4.0, "due diligence" = 3.5). The classifier
returns matched keywords and per-level scores, making decisions auditable.
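
The scoring scheme can be sketched roughly as follows -- the keyword weights and levels here are illustrative, not the project's actual dictionary:

```python
# Illustrative keyword weights; the real dictionary lives in the worker module.
KEYWORDS = {
    "Secret": {"kyc": 4.0, "due diligence": 3.5},
    "Confidential": {"loan application": 2.0, "valuation": 1.5},
}

def classify_privacy(text: str) -> dict:
    """Score each privacy level by summed keyword weights; return the winner
    plus the matched keywords and per-level scores that make it auditable."""
    text_lower = text.lower()
    scores, matched = {}, {}
    for level, keywords in KEYWORDS.items():
        hits = {kw: w for kw, w in keywords.items() if kw in text_lower}
        scores[level] = sum(hits.values())
        matched[level] = sorted(hits)
    if not any(scores.values()):
        # No sensitive keywords at all: default to Public.
        return {"classification": "Public", "confidence": 1.0,
                "scores": scores, "matched": matched}
    best = max(scores, key=scores.get)
    confidence = scores[best] / sum(scores.values())  # share of total weight
    return {"classification": best, "confidence": round(confidence, 2),
            "scores": scores, "matched": matched}
```

Returning `matched` and `scores` alongside the label is what makes each decision explainable after the fact.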
### Semantic (Environmental Impact + Industries)

Uses `all-MiniLM-L6-v2` (384-dim embeddings) with descriptive anchor texts -- not keyword
lists. This captures meaning: "textile dyeing facility" matches industrial contamination
risk even without the word "contamination" appearing anywhere.

Returns multi-label industry classifications (threshold 0.15) and environmental impact
ratings (None / Low / Medium / High). The document embedding is stored for later
pgvector semantic search.
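
The anchor-text mechanism looks roughly like this sketch. A toy bag-of-words embedder stands in for `SentenceTransformer("all-MiniLM-L6-v2").encode(...)` so the example stays self-contained; the anchor texts are invented, and only the 0.15 threshold comes from the description above:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in: real code would return a 384-dim MiniLM vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Descriptive anchor texts, not keyword lists (illustrative examples).
INDUSTRY_ANCHORS = {
    "Real Estate": "commercial property buildings leases valuation",
    "Financial Services": "loans banking credit underwriting risk",
}

def classify_industries(text: str, threshold: float = 0.15) -> list[str]:
    """Multi-label: every anchor whose similarity clears the threshold wins."""
    doc = embed(text)
    return [name for name, anchor in INDUSTRY_ANCHORS.items()
            if cosine(doc, embed(anchor)) >= threshold]
```

With a real embedding model, similarity reflects meaning rather than shared tokens, which is exactly why descriptive anchors beat keyword lists here.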
See [docs/classification.md](docs/classification.md) for the full deep dive.

## Project Structure

```
src/
  gateway/       FastAPI API + web UI + Dockerfile
  worker/        Extract, classify, and semantic modules
  generator/     PDF document generator (5 templates, CLI)
tests/           51 pytest tests
docs/            Architecture, classification, demo guide, dictionary
demo_samples/    One committed loan scenario (5 PDFs)
k8s/             Kubernetes manifests (base, scaling, chaos)
infra/           Azure setup/teardown scripts
locust/          Load testing
grafana/         Dashboard JSON
journal/         Development journal
```
117+
118+
## Commands
119+
120+
| Command | Description |
121+
|---|---|
122+
| `make test` | Run pytest |
123+
| `make test-cov` | Run tests + HTML coverage report |
124+
| `make lint` | Ruff check + format check |
125+
| `make lint-fix` | Auto-fix lint issues |
126+
| `make generate` | Generate 10 scenarios (50 PDFs) |
127+
| `make demo-samples` | Regenerate `demo_samples/` with one fresh scenario |
128+
| `make dev` | Start docker-compose (gateway, Redis, PostgreSQL) |
129+
| `make dev-down` | Tear down docker-compose |
130+
| `make clean` | Remove build artifacts and caches |
## Architecture

### Current (Synchronous)

```
PDF Upload --> FastAPI Gateway --> Extract (PyMuPDF)
                               --> Classify (rules + semantic)
                               --> Return results (in-memory)
```

### Target (Kubernetes + Redis Streams)

```
PDF Upload
    |
    v
Redis:raw-docs --> Extract Workers (PyMuPDF)
    |
    v
Redis:extracted --> Classify Workers (rules + semantic)
    |
    v
Redis:classified --> Store Workers --> PostgreSQL (pgvector)
                                   --> Azure Blob Storage
```

Each stage runs as a separate K8s Deployment. KEDA monitors Redis Stream consumer group
lag and scales workers based on queue depth. See [docs/architecture.md](docs/architecture.md)
for the full design.
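
A KEDA `ScaledObject` for one worker stage might look like this sketch -- the names, stream key, and threshold are illustrative assumptions, not taken from the `k8s/` manifests:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: extract-workers            # hypothetical name
spec:
  scaleTargetRef:
    name: extract-worker           # the stage's Deployment
  minReplicaCount: 0               # scale to zero when the stream is drained
  maxReplicaCount: 10
  triggers:
    - type: redis-streams
      metadata:
        address: redis:6379
        stream: raw-docs           # stream this stage consumes
        consumerGroup: extract
        pendingEntriesCount: "5"   # scale out when consumer-group lag exceeds this
```

One `ScaledObject` per stage lets each queue scale independently of the others, which is the point of splitting the pipeline into separate Deployments.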
## CI/CD

**GitHub Actions workflows:**

- **ci.yml** -- Lint (ruff) + test (pytest with coverage) on every push and PR
- **docker.yml** -- Build and push Docker image to `ghcr.io/johnmathews/documentstream` on push to main
## Key Design Decisions

| Decision | Rationale |
|---|---|
| Redis Streams over Pub/Sub | At-least-once delivery with consumer group acknowledgment; crash-safe |
| KEDA over HPA | Scale on queue depth (actual work), not CPU (misleading for queue workers) |
| Two classifiers | Rules for structured dimensions, semantic for contextual ones |
| pgvector over dedicated vector DB | Keeps architecture simple; PostgreSQL Flexible Server supports it natively |
| Descriptive anchors (not keyword lists) | Embedding model captures meaning, not just string matches |
| Local sentence-transformers | No API dependency, runs anywhere, free |
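
The first row's rationale can be made concrete with a toy in-memory model of consumer-group semantics. This is a simplification for illustration only -- the real workers would use redis-py's `xadd` / `xreadgroup` / `xack`:

```python
class MiniStream:
    """Toy model of a Redis Stream with one consumer group: delivered but
    unacknowledged entries stay pending, so a crashed worker loses nothing
    (unlike Pub/Sub, where an undelivered message is simply gone)."""

    def __init__(self):
        self.entries = []      # (entry_id, payload); kept after delivery
        self.delivered = 0     # index of the next undelivered entry
        self.pending = {}      # entry_id -> payload, delivered but unacked

    def add(self, payload):                      # ~ XADD
        entry_id = len(self.entries)
        self.entries.append((entry_id, payload))
        return entry_id

    def read_group(self):                        # ~ XREADGROUP
        if self.delivered >= len(self.entries):
            return None
        entry_id, payload = self.entries[self.delivered]
        self.delivered += 1
        self.pending[entry_id] = payload         # survives a worker crash
        return entry_id, payload

    def ack(self, entry_id):                     # ~ XACK
        self.pending.pop(entry_id, None)

    def unacked(self):                           # ~ XPENDING: work to retry
        return list(self.pending.items())
```

Anything a worker read but never acked remains visible via the pending list, which is what makes the at-least-once guarantee crash-safe.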
## Documentation

- [Architecture](docs/architecture.md) -- System design, pipeline flow, K8s target
- [Classification](docs/classification.md) -- Rule-based vs semantic approaches in detail
- [Demo Guide](docs/demo-guide.md) -- Step-by-step demo script with talking points
- [Dictionary](docs/dictionary.md) -- K8s, Azure, and KEDA concepts

## Stack

Python 3.13, FastAPI, PyMuPDF, fpdf2, Faker, sentence-transformers, Redis, PostgreSQL (pgvector),
Docker, GitHub Actions, uv, pytest, ruff. Target: AKS, KEDA, Prometheus, Grafana, Chaos Mesh.

journal/260329-add-readme.md (+17)
# Add README.md

Created a comprehensive README.md for the project. The README serves as the primary entry
point for anyone looking at the GitHub repo and covers:

- Project overview and what it does
- Quick start instructions (4 commands from clone to running)
- Full API reference with curl example
- Classification approach (rule-based + semantic) explained concisely
- Project structure as a scannable tree
- All Makefile commands in a table
- Architecture diagrams (current synchronous vs target K8s async)
- CI/CD overview
- Key design decisions with rationale
- Links to all documentation in docs/

Everything points to real code and docs -- no filler or placeholder content.
