DocumentStream — System Architecture

Overview

DocumentStream is a Kubernetes-native document processing pipeline for commercial real estate loan documents. It ingests PDFs, extracts text, classifies them across multiple dimensions (privacy level, environmental impact, industry sectors), stores embeddings for semantic search, and archives originals in blob storage.

The system is designed to demonstrate production K8s patterns: queue-based pipelines, event-driven autoscaling, self-healing, rolling updates, chaos engineering, and full observability.


System Diagram

                        ┌────────────────────────────────────────────────────────┐
                        │                     AKS Cluster                        │
                        │                                                        │
  ┌──────────┐          │  ┌──────────┐    ┌──────────┐    ┌────────────────┐   │
  │  Locust   │─────────│─▶│ Ingress  │───▶│ FastAPI  │───▶│  Redis Stream  │   │
  │ Load Gen  │         │  │ (nginx)  │    │ Gateway  │    │  (raw-docs)    │   │
  └──────────┘          │  └──────────┘    └──────────┘    └───────┬────────┘   │
                        │                                          │            │
  ┌──────────┐          │                                          ▼            │
  │  Chaos   │          │                               ┌────────────────────┐  │
  │  Mesh    │──────────│─▶ (CRDs in cluster)           │  Extract Workers   │  │
  │ Dashboard│          │                               │  (PyMuPDF, KEDA)   │  │
  └──────────┘          │                               └─────────┬──────────┘  │
                        │                                         │             │
  ┌──────────┐          │                    Redis Stream          │             │
  │ Grafana  │◀─────────│── Prometheus ◀──   (extracted)  ◀───────┘             │
  │Dashboard │          │    metrics                │                           │
  └──────────┘          │                           ▼                           │
                        │                ┌────────────────────┐                 │
                        │                │  Classify Workers   │                 │
                        │                │  Rules + Semantic   │                 │
                        │                │  (KEDA-scaled)      │                 │
                        │                └─────────┬──────────┘                 │
                        │                          │                            │
                        │                 Redis Stream                          │
                        │                 (classified)                          │
                        │                          │                            │
                        │                          ▼                            │
                        │                ┌────────────────────┐                 │
                        │                │   Store Workers     │                 │
                        │                │   (KEDA-scaled)     │                 │
                        │                └─────────┬──────────┘                 │
                        │                          │                            │
                        └──────────────────────────┼────────────────────────────┘
                                                   │
                                    ┌──────────────┼──────────────┐
                                    │  Azure Services             │
                                    │              │              │
                                    │     ┌────────▼────────┐     │
                                    │     │  PostgreSQL      │     │
                                    │     │  Flexible Server │     │
                                    │     │  + pgvector      │     │
                                    │     └─────────────────┘     │
                                    │                             │
                                    │     ┌─────────────────┐     │
                                    │     │  Blob Storage    │     │
                                    │     │  (original PDFs) │     │
                                    │     └─────────────────┘     │
                                    └─────────────────────────────┘

Pipeline Flow

Sync mode (no REDIS_URL — local dev, tests)

Upload PDF ──▶ Gateway (FastAPI)
                   │
                   ├── Extract (PyMuPDF)
                   ├── Rule-based classify
                   ├── Semantic classify
                   └── Return results (in-memory)

When REDIS_URL is not set, all processing happens synchronously in the gateway process. Results are stored in memory (no persistence). This mode is used for testing and local development without Docker.

Async mode (REDIS_URL set — docker-compose, K8s)

Upload PDF
    │
    ▼
Redis:raw-docs ──▶ Extract Workers (PyMuPDF)
                        │
                        ▼
               Redis:extracted ──▶ Classify Workers
                                       │
                                  ┌────┴────┐
                                  │         │
                             Rule-based  Semantic
                             (privacy)   (env impact,
                                          industries,
                                          privacy)
                                  │         │
                                  └────┬────┘
                                       │
                                       ▼
                              Redis:classified ──▶ Store Workers
                                                       │
                                                  ┌────┴────┐
                                                  │         │
                                           PostgreSQL   Blob Storage
                                           (metadata,   (original PDF)
                                            embeddings,
                                            classifications)

Each pipeline stage is a separate process (docker-compose service or K8s Deployment) with its own KEDA ScaledObject. KEDA monitors the Redis stream depth for each stage and scales workers independently.
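
Every stage follows the same consume, process, acknowledge loop. The sketch below shows that shape using redis-py directly; the real workers go through the shared helpers in src/worker/queue.py, and the stream, group, and consumer names here are illustrative.

```python
# Sketch of a per-stage worker loop (classify stage shown). Assumes redis-py;
# the actual workers use the queue.py helpers described below.
import json
import os

import redis

r = redis.Redis.from_url(os.environ["REDIS_URL"])
STREAM_IN, STREAM_OUT, GROUP = "extracted", "classified", "classify-workers"

try:
    # Create the consumer group if it doesn't exist yet (MKSTREAM makes this idempotent).
    r.xgroup_create(STREAM_IN, GROUP, id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

while True:
    # Block until a message is available for this consumer.
    resp = r.xreadgroup(GROUP, "worker-1", {STREAM_IN: ">"}, count=1, block=5000)
    for _stream, messages in resp or []:
        for msg_id, fields in messages:
            payload = json.loads(fields[b"data"])
            payload["classified"] = True  # stand-in for the real classification work
            r.xadd(STREAM_OUT, {"data": json.dumps(payload)})  # hand off to the next stage
            r.xack(STREAM_IN, GROUP, msg_id)  # ack only after the handoff succeeds
```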

The gateway detects the mode automatically: if REDIS_URL is set, it publishes to Redis and returns status: queued immediately. Clients poll GET /api/documents/{id} to track progress through the pipeline stages (queued → extracting → classifying → storing → completed).
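
A hedged sketch of that mode switch. The endpoint paths match the gateway table below, but the Redis key layout, payload fields, and helper names are assumptions rather than the actual implementation.

```python
# Illustrative sketch of the gateway's sync/async branch; not the real handler code.
import base64
import os
import uuid

import redis
from fastapi import FastAPI, UploadFile

app = FastAPI()
REDIS_URL = os.getenv("REDIS_URL")
r = redis.Redis.from_url(REDIS_URL) if REDIS_URL else None


@app.post("/api/documents")
async def upload(file: UploadFile):
    doc_id = str(uuid.uuid4())
    pdf_bytes = await file.read()
    if r is not None:
        # Async mode: enqueue for the extract workers and return immediately.
        r.xadd("raw-docs", {"doc_id": doc_id, "pdf": base64.b64encode(pdf_bytes)})
        r.hset(f"doc:{doc_id}", mapping={"status": "queued"})
        return {"id": doc_id, "status": "queued"}
    # Sync mode: run the whole pipeline inline and return results (nothing persisted).
    return {"id": doc_id, "status": "completed", "results": run_inline(pdf_bytes)}


@app.get("/api/documents/{doc_id}")
async def document_status(doc_id: str):
    # Clients poll this to follow queued -> extracting -> classifying -> storing -> completed.
    if r is not None:
        return {k.decode(): v.decode() for k, v in r.hgetall(f"doc:{doc_id}").items()}
    return {"id": doc_id, "status": "completed"}


def run_inline(pdf_bytes: bytes) -> dict:
    return {}  # placeholder for the inline extract + classify path
```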


Component Details

Gateway (src/gateway/)

| Aspect | Detail |
| --- | --- |
| Framework | FastAPI |
| Endpoints | /health, /api/documents, /api/generate, /metrics, / (web UI) |
| K8s role | Single Deployment with a Service and Ingress |
| Scaling | HPA based on CPU (not queue-based — it handles HTTP, not queue work) |

Extract Workers (src/worker/extract.py)

| Aspect | Detail |
| --- | --- |
| Library | PyMuPDF (fitz) |
| Input | Redis stream raw-docs (PDF bytes) |
| Output | Redis stream extracted (full text + metadata) |
| Performance | ~10-50ms per page |
| Scaling | KEDA ScaledObject watching raw-docs stream depth |
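
A minimal sketch of the extraction step; in the real worker the PDF bytes arrive via the raw-docs stream and the exact output fields may differ.

```python
# Sketch of PyMuPDF text extraction; field names are illustrative.
import fitz  # PyMuPDF

def extract(pdf_bytes: bytes) -> dict:
    """Return the full text plus light metadata for the classify stage."""
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return {
            "text": "\n".join(page.get_text() for page in doc),
            "page_count": doc.page_count,
            "metadata": doc.metadata,  # title, author, creation date, ...
        }
```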

Classify Workers (src/worker/classify.py + semantic.py)

| Aspect | Detail |
| --- | --- |
| Rule-based | Weighted keyword scoring → Privacy level (Public/Confidential/Secret) |
| Semantic | ONNX Runtime + HuggingFace tokenizer embedding → Environmental impact, Industry sectors, Privacy |
| Input | Redis stream extracted |
| Output | Redis stream classified |
| Model | all-MiniLM-L6-v2 (384 dimensions, ~100ms per document, ONNX backend) |
| Scaling | KEDA ScaledObject watching extracted stream depth |
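
Both classifiers in sketch form. The keyword weights are invented for illustration, and the ONNX embedding step is elided: classify_semantic just compares a precomputed document embedding against precomputed anchor-text embeddings by cosine similarity.

```python
# Illustrative classifiers; weights, labels, and function names are assumptions.
import numpy as np

PRIVACY_KEYWORDS = {  # hypothetical keyword weights for the rule-based classifier
    "Secret": {"passport": 3.0, "beneficial owner": 2.5, "sanctions": 2.0},
    "Confidential": {"loan amount": 2.0, "interest rate": 1.5},
    "Public": {"invoice": 1.0, "vat": 1.0},
}

def classify_rules(text: str) -> str:
    """Weighted keyword scoring: the privacy level with the highest score wins."""
    lower = text.lower()
    scores = {
        level: sum(weight for kw, weight in kws.items() if kw in lower)
        for level, kws in PRIVACY_KEYWORDS.items()
    }
    return max(scores, key=scores.get)

def classify_semantic(doc_vec: np.ndarray, anchors: dict[str, np.ndarray]) -> str:
    """Pick the anchor text whose embedding is closest in cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(anchors, key=lambda label: cos(doc_vec, anchors[label]))
```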

Store Workers (src/worker/store.py)

| Aspect | Detail |
| --- | --- |
| Database | PostgreSQL with pgvector extension |
| Blob store | Azure Blob Storage (via BLOB_CONNECTION_STRING in K8s Secret) |
| Metrics | Prometheus counters: documentstream_blob_uploads_total, documentstream_blob_bytes_total (by doc_type) |
| Input | Redis stream classified |
| Stored data | Document metadata, doc_type, classification results, vector embedding (384 dims), blob URL |
| Scaling | KEDA ScaledObject watching classified stream depth |
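
A sketch of the store step, assuming psycopg (v3) and azure-storage-blob. The table schema, container name, and connection handling are assumptions, not the actual worker code.

```python
# Illustrative store step: archive the PDF, then persist metadata + embedding.
import json
import os

import psycopg
from azure.storage.blob import BlobServiceClient

def store(doc: dict, pdf_bytes: bytes) -> None:
    # Archive the original PDF in blob storage.
    blob_service = BlobServiceClient.from_connection_string(
        os.environ["BLOB_CONNECTION_STRING"]
    )
    blob = blob_service.get_blob_client(container="documents", blob=f"{doc['id']}.pdf")
    blob.upload_blob(pdf_bytes, overwrite=True)

    # Persist metadata, classifications, and the 384-dim embedding via pgvector.
    with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
        conn.execute(
            """
            INSERT INTO documents (id, doc_type, classifications, embedding, blob_url)
            VALUES (%s, %s, %s, %s::vector, %s)
            """,
            (
                doc["id"],
                doc["doc_type"],
                json.dumps(doc["classifications"]),
                str(doc["embedding"]),  # pgvector accepts '[0.1, 0.2, ...]' text literals
                blob.url,
            ),
        )
```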

Queue Module (src/worker/queue.py)

Shared Redis Streams utilities used by all workers:

| Function | Purpose |
| --- | --- |
| publish() | Add a message to a stream (auto-serializes dicts/lists to JSON) |
| consume() | Read from a consumer group (blocks until messages arrive) |
| ack() | Acknowledge successful processing |
| set_doc_status() / get_doc_status() | Track document progress in Redis hashes |
| encode_pdf() / decode_pdf() | Base64 encode/decode PDF bytes for Redis storage |
| ensure_consumer_group() | Create consumer group with MKSTREAM (idempotent) |
| setup_shutdown_handler() | SIGTERM/SIGINT handling for graceful pod shutdown |
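
A few of these helpers in sketch form. redis-py is assumed and the signatures are illustrative; they may not match the real module exactly.

```python
# Illustrative queue helpers in the spirit of src/worker/queue.py.
import base64
import json
import signal

def publish(r, stream: str, message: dict):
    """XADD with dict/list values serialized to JSON, as described above."""
    fields = {
        k: json.dumps(v) if isinstance(v, (dict, list)) else v
        for k, v in message.items()
    }
    return r.xadd(stream, fields)

def set_doc_status(r, doc_id: str, status: str) -> None:
    # Progress is tracked in a plain Redis hash keyed by document id.
    r.hset(f"doc:{doc_id}", mapping={"status": status})

def encode_pdf(pdf_bytes: bytes) -> str:
    # Stream fields are strings/bytes, so PDFs travel base64-encoded.
    return base64.b64encode(pdf_bytes).decode()

def setup_shutdown_handler(state: dict) -> None:
    """Flip a shared flag on SIGTERM/SIGINT so the worker loop can exit cleanly."""
    def _handler(signum, frame):
        state["stop"] = True
    signal.signal(signal.SIGTERM, _handler)
    signal.signal(signal.SIGINT, _handler)
```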

Infrastructure

Azure Services

| Service | SKU | Purpose | Cost/hour |
| --- | --- | --- | --- |
| AKS | Free tier, 3x Standard_B2ms | Container orchestration | ~€0.25 |
| ACR | Basic | Container image registry | ~€0.007 |
| PostgreSQL Flexible Server | Burstable B1ms | Metadata + pgvector embeddings | ~€0.017 |
| Blob Storage | Hot, LRS | Original PDF archive | ~€0.00 |
| Managed Grafana | Essential (free) | Monitoring dashboards | €0.00 |
| Total | | | ~€0.28/hr |

When stopped (az aks stop + az postgres flexible-server stop): ~€0.01/hr (disk only).

Helm Charts (Day 2)

| Chart | Namespace | Purpose |
| --- | --- | --- |
| kube-prometheus-stack | monitoring | Prometheus + Grafana |
| bitnami/redis | documentstream | Message queue (Redis Streams) |
| kedacore/keda | keda | Event-driven autoscaling |
| chaos-mesh/chaos-mesh | chaos-mesh | Chaos engineering |
| ingress-nginx | ingress-nginx | HTTP routing |

All charts are installed and running on the live AKS cluster.

CI/CD

GitHub Actions workflows:

  • ci.yml — On push to main and PRs: ruff lint, ruff format check, pytest with coverage
  • docker.yml — On push to main: build and push images to ghcr.io/johnmathews/k8s
  • deploy.yml — On push to main (src/ or k8s/ changes): build → push to ACR → deploy to AKS

Data Model

LoanScenario

All documents in a scenario share linked data:

LoanScenario
  ├── loan_id: "CRE-917982"
  ├── client
  │     ├── company_name: "Van der Berg B.V."
  │     ├── registration_number: "KVK-12345678"
  │     ├── contact_name: "Jan de Vries"
  │     └── ...
  ├── property
  │     ├── address: "Herengracht 42"
  │     ├── city: "Amsterdam"
  │     ├── property_type: "Office building"
  │     └── ...
  ├── loan_amount_eur: 5_000_000
  ├── interest_rate_pct: 4.25
  └── dates (chronological)
        ├── application_date
        ├── valuation_date
        ├── kyc_date
        ├── contract_date
        └── invoice_date
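
One way to read this model is as a set of dataclasses; the field names below follow the tree above, but the real generator may organize them differently.

```python
# Hypothetical dataclass view of a LoanScenario; not the actual generator code.
from dataclasses import dataclass
from datetime import date

@dataclass
class Client:
    company_name: str
    registration_number: str
    contact_name: str

@dataclass
class Property:
    address: str
    city: str
    property_type: str

@dataclass
class LoanScenario:
    loan_id: str              # e.g. "CRE-917982"
    client: Client
    property: Property
    loan_amount_eur: int
    interest_rate_pct: float
    application_date: date    # the five dates are generated in chronological order
    valuation_date: date
    kyc_date: date
    contract_date: date
    invoice_date: date
```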

Document Types

| Type | Privacy | Text Volume | Semantic Search |
| --- | --- | --- | --- |
| Loan application | Confidential | Low (~100 words) | No |
| Valuation report | Confidential | High (~900 words) | Yes |
| KYC report | Secret | High (~500 words) | Yes |
| Loan agreement | Secret | High (~1100 words) | Yes |
| Contractor invoice | Public | Low (~100 words) | No |

Sample PDFs for all 5 types are committed in demo_samples/ so the generated content is visible without running the generator. Regenerate with make demo-samples. Bulk generation for stress testing uses make generate (outputs to generated_docs/, gitignored).


Key Design Decisions

  1. Redis Streams over Pub/Sub or Lists. Streams provide at-least-once delivery with consumer group acknowledgment. When a worker pod crashes, its unacknowledged messages are re-delivered rather than lost (see the reclaim sketch after this list).

  2. KEDA over HPA. Standard HPA scales on CPU/memory. KEDA scales on queue depth — the metric that actually matters for a pipeline.

  3. Two classifiers (rules + semantic). Rules handle structured dimensions (privacy level) deterministically. Semantic handles contextual dimensions (environmental impact) that rules can't assess.

  4. pgvector over a dedicated vector DB. Keeps the architecture simple — one database for metadata and vectors. Azure PostgreSQL Flexible Server supports pgvector natively. For production at scale, Azure AI Search would be the enterprise choice.

  5. Descriptive anchor texts over keyword lists. Anchor texts are natural language paragraphs describing each category. The embedding model captures meaning, enabling detection of concepts expressed in different words.

  6. Local ONNX Runtime inference over Azure OpenAI. No API dependency, free, runs anywhere. Direct ONNX Runtime inference (no PyTorch) keeps the worker image at ~190MB. For production, Azure OpenAI text-embedding-3-small would provide higher quality embeddings within Azure's data boundary.
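
Returning to decision 1: the sketch below shows how a worker can reclaim messages that a crashed pod read but never acknowledged, using XAUTOCLAIM via redis-py. The stream, group, and idle threshold are illustrative.

```python
# Illustrative recovery path for at-least-once delivery with consumer groups.
import redis

r = redis.Redis.from_url("redis://localhost:6379/0")

def reclaim_stalled(stream: str, group: str, consumer: str, idle_ms: int = 60_000) -> None:
    """Take over messages another consumer read but never acknowledged."""
    reply = r.xautoclaim(stream, group, consumer, min_idle_time=idle_ms, start_id="0")
    for msg_id, fields in reply[1]:    # reply[1] is the list of reclaimed messages
        handle(fields)                 # stand-in for the real processing step
        r.xack(stream, group, msg_id)  # ack only after reprocessing succeeds

def handle(fields: dict) -> None:
    pass  # placeholder
```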