
Commit 6e94ac2

johnmathewsclaude committed
Switch embedding inference to ONNX Runtime and update all docs
- Replace PyTorch with ONNX Runtime for sentence-transformers inference (~50MB vs ~5GB, faster pod startup for KEDA scaling)
- Update dictionary with ONNX Runtime entry
- Update architecture, classification, demo guide with ONNX details
- Update implementation plan with Stage 13
- Update journal with ONNX switch details

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent dc6cb72 commit 6e94ac2

10 files changed

Lines changed: 210 additions & 81 deletions

CLAUDE.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -11,7 +11,7 @@ loan documents.
 - **PDF generation:** fpdf2 + Faker (nl_NL locale)
 - **Text extraction:** PyMuPDF (fitz)
 - **Rule-based classification:** Weighted keyword scoring (privacy level)
-- **Semantic classification:** sentence-transformers + anchor embeddings (environmental impact, industries)
+- **Semantic classification:** sentence-transformers + ONNX Runtime + anchor embeddings (environmental impact, industries)
 - **Testing:** pytest + coverage
 - **Linting:** ruff
 - **Deps:** uv
```

docs/architecture.md

Lines changed: 6 additions & 4 deletions
```diff
@@ -154,10 +154,10 @@ storing → completed).
 | Aspect | Detail |
 |---|---|
 | Rule-based | Weighted keyword scoring → Privacy level (Public/Confidential/Secret) |
-| Semantic | sentence-transformers embedding → Environmental impact, Industry sectors, Privacy |
+| Semantic | sentence-transformers + ONNX Runtime embedding → Environmental impact, Industry sectors, Privacy |
 | Input | Redis stream `extracted` |
 | Output | Redis stream `classified` |
-| Model | all-MiniLM-L6-v2 (384 dimensions, ~100ms per document) |
+| Model | all-MiniLM-L6-v2 (384 dimensions, ~100ms per document, ONNX backend) |
 | Scaling | KEDA ScaledObject watching `extracted` stream depth |
 
 ### Store Workers (`src/worker/store.py`)
@@ -291,6 +291,8 @@ gitignored).
    language paragraphs describing each category. The embedding model captures
    meaning, enabling detection of concepts expressed in different words.
 
-6. **Local sentence-transformers over Azure OpenAI.** No API dependency, free,
-   runs anywhere. For production, Azure OpenAI text-embedding-3-small would
+6. **Local sentence-transformers with ONNX Runtime over Azure OpenAI.** No API
+   dependency, free, runs anywhere. ONNX Runtime replaces PyTorch for inference,
+   reducing container image size and memory usage (~50MB runtime vs ~5GB for
+   full PyTorch). For production, Azure OpenAI text-embedding-3-small would
    provide higher quality embeddings within Azure's data boundary.
```
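The anchor-embedding decision above can be sketched with toy 3-dimensional vectors. In the real worker each text would become a 384-dimensional all-MiniLM-L6-v2 embedding; the `ANCHORS` values and `classify` helper here are illustrative stand-ins, not code from the repo:

```python
import math

# Toy stand-ins for real anchor embeddings. In the pipeline, each anchor is the
# embedding of a natural-language paragraph describing the category.
ANCHORS = {
    "high_environmental_impact": [0.9, 0.1, 0.0],
    "low_environmental_impact": [0.1, 0.9, 0.1],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def classify(doc_vector: list[float]) -> str:
    """Pick the anchor category whose embedding is most similar to the document."""
    return max(ANCHORS, key=lambda name: cosine(doc_vector, ANCHORS[name]))

print(classify([0.8, 0.2, 0.0]))  # → high_environmental_impact
```

Because comparison happens in embedding space rather than on keywords, a document that never uses the anchor paragraph's exact words can still land near the right anchor.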

docs/classification.md

Lines changed: 5 additions & 2 deletions
```diff
@@ -184,8 +184,11 @@ We use pgvector (PostgreSQL extension) for the demo. For production at a bank,
 
 ### Embedding model
 
-We use `all-MiniLM-L6-v2` (sentence-transformers, 384 dimensions) — free,
-runs locally, no API dependency. For production:
+We use `all-MiniLM-L6-v2` (sentence-transformers, 384 dimensions) with **ONNX Runtime**
+as the inference backend instead of PyTorch. This reduces the runtime from ~5GB (full
+PyTorch) to ~50MB (ONNX Runtime), which means faster container image pulls, faster pod
+startup when KEDA scales workers, and lower memory per pod. Free, runs locally, no API
+dependency. For production:
 
 **Azure OpenAI `text-embedding-3-small`** would be preferred:
 - Higher quality embeddings (1536 dimensions)
```

docs/demo-guide.md

Lines changed: 9 additions & 6 deletions
```diff
@@ -205,14 +205,17 @@ for RAG workloads on Azure.
 > production, if throughput requirements grew beyond what a single Redis instance
 > handles, I'd consider Azure Event Hubs, which is Kafka-compatible and fully managed."
 
-### Sentence-Transformers vs Azure OpenAI
-**When to mention:** When asked about the embedding model.
+### Sentence-Transformers + ONNX Runtime vs Azure OpenAI
+**When to mention:** When asked about the embedding model or container optimization.
 
 **What to say:**
-> "I used sentence-transformers locally — specifically all-MiniLM-L6-v2 — because
-> it runs without API dependencies and is free. For production, I'd use Azure OpenAI's
-> text-embedding-3-small model. The embeddings are higher quality, and the data stays
-> within Azure's boundary — important for a bank's compliance requirements."
+> "I used sentence-transformers locally — specifically all-MiniLM-L6-v2 — with ONNX
+> Runtime as the inference backend instead of PyTorch. ONNX Runtime is about 50MB
+> versus 5GB for full PyTorch, which makes a big difference in a K8s environment —
+> faster image pulls, faster pod startup when KEDA scales up, and lower memory per
+> worker. For production, I'd use Azure OpenAI's text-embedding-3-small model. The
+> embeddings are higher quality, and the data stays within Azure's boundary —
+> important for a bank's compliance requirements."
 
 ### Chaos Engineering
 **When to mention:** When they seem impressed by the self-healing demo.
```

docs/dictionary.md

Lines changed: 13 additions & 0 deletions
````diff
@@ -401,6 +401,19 @@ concept — industrial pollution.
 We use the `all-MiniLM-L6-v2` model (sentence-transformers) which produces 384-dimensional
 vectors. "384-dimensional" means each text becomes a list of 384 numbers.
 
+### ONNX Runtime
+An optimized inference engine for running machine learning models. Instead of using
+full PyTorch (~5GB) to run the embedding model, we export the model to ONNX format
+and run it with ONNX Runtime (~50MB). Same model, same results, much smaller footprint.
+
+sentence-transformers supports this with a single parameter:
+```python
+SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
+```
+
+This is important for K8s because smaller images mean faster pod startup (KEDA can
+scale workers up more quickly) and lower memory usage per pod.
+
 ### Cosine Similarity
 A way to measure how similar two vectors are. Returns a value between -1 and 1:
 - 1.0 = identical meaning
````
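The cosine-similarity definition in the dictionary entry above can be sketched in plain Python (the function name is illustrative, not from the repo):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal → 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite → -1.0
```

Because it measures angle rather than length, two embeddings of different magnitudes but pointing the same way still score 1.0, which is why it suits comparing document embeddings to anchor embeddings.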

docs/implementation-plan.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -25,6 +25,7 @@
 | 11 | Polish and demo rehearsal | MUST | 1.5-2h | TODO |
 | -- | **Day 4: Enhancements** | -- | -- | **DONE** |
 | 12 | Azure Blob Storage integration | HIGH | 2h | DONE (PDFs stored in Azure, metrics in Grafana) |
+| 13 | Switch to ONNX Runtime | MEDIUM | 15min | DONE (sentence-transformers backend="onnx", ~50MB vs ~5GB PyTorch) |
 
 **If time runs short:** Cut from the bottom. Stages 0-5 + 11 are non-negotiable. Stage 6 (Grafana) is
 the most important "nice to have" because it's the visual centerpiece of the demo. Stages 7-10 can
```

journal/260401-azure-blob-storage-and-metrics.md

Lines changed: 9 additions & 0 deletions
```diff
@@ -84,6 +84,15 @@ to the Grafana dashboard, completing the storage layer of the pipeline.
 - `grafana/documentstream-dashboard.json` — 2 new blob storage panels
 - `tests/test_store.py` — 9 new tests (doc_type inference, blob upload, record fields)
 
+### Switch to ONNX Runtime
+
+- Changed `SentenceTransformer(MODEL_NAME)` to `SentenceTransformer(MODEL_NAME, backend="onnx")`
+  in `src/worker/semantic.py`
+- Replaced `sentence-transformers` dependency with `sentence-transformers[onnx]` in `pyproject.toml`
+  (pulls in `onnxruntime` and `optimum`)
+- ONNX Runtime is ~50MB vs ~5GB for full PyTorch — faster image pulls, faster KEDA scale-up,
+  lower memory per classify-worker pod
+
 ## Test Status
 
 - 92 tests passing (up from 83), lint clean
```

pyproject.toml

Lines changed: 1 addition & 1 deletion
```diff
@@ -11,7 +11,7 @@ dependencies = [
     "pymupdf>=1.25.0",
     "redis>=5.2.0",
     "psycopg[binary]>=3.2.0",
-    "sentence-transformers>=4.1.0",
+    "sentence-transformers[onnx]>=4.1.0",
     "jinja2>=3.1.0",
     "python-multipart>=0.0.20",
     "azure-storage-blob>=12.24.0",
```

src/worker/semantic.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -38,7 +38,7 @@
 @lru_cache(maxsize=1)
 def _get_model() -> SentenceTransformer:
     """Load the embedding model (cached, loaded once)."""
-    return SentenceTransformer(MODEL_NAME)
+    return SentenceTransformer(MODEL_NAME, backend="onnx")
 
 
 def embed_text(text: str) -> NDArray[np.float32]:
```
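The `@lru_cache(maxsize=1)` decorator on `_get_model` is what makes the load happen once per process: every later call returns the cached model object. A minimal standalone sketch of the same pattern, with the expensive model load replaced by a dummy string so it runs without sentence-transformers installed (`get_model` and `LOAD_COUNT` are illustrative names, not from the repo):

```python
from functools import lru_cache

LOAD_COUNT = 0  # track how many times the expensive load actually runs

@lru_cache(maxsize=1)
def get_model() -> str:
    """Simulate an expensive model load; cached so it runs at most once."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return "all-MiniLM-L6-v2 (loaded)"

model_a = get_model()      # first call: performs the "load"
model_b = get_model()      # second call: returns the cached object
print(LOAD_COUNT)          # 1 — loaded exactly once
print(model_a is model_b)  # True — same cached instance
```

With `maxsize=1` and no arguments, `lru_cache` behaves as a simple process-level singleton, which matters here because loading the embedding model dominates worker startup time.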
