
Commit 3bb7869

johnmathews and claude committed
Update docs for direct ONNX Runtime inference and CI/CD pipeline
- Remove sentence-transformers/PyTorch references from all docs
- Update image size references (~190MB, not ~3GB or ~50MB)
- Update demo talking points for ONNX approach
- Add CI/CD pipeline and ONNX migration to journal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 24c1596 commit 3bb7869

6 files changed: 53 additions & 45 deletions


docs/architecture.md

Lines changed: 4 additions & 5 deletions
```diff
@@ -154,7 +154,7 @@ storing → completed).
 | Aspect | Detail |
 |---|---|
 | Rule-based | Weighted keyword scoring → Privacy level (Public/Confidential/Secret) |
-| Semantic | sentence-transformers + ONNX Runtime embedding → Environmental impact, Industry sectors, Privacy |
+| Semantic | ONNX Runtime + HuggingFace tokenizer embedding → Environmental impact, Industry sectors, Privacy |
 | Input | Redis stream `extracted` |
 | Output | Redis stream `classified` |
 | Model | all-MiniLM-L6-v2 (384 dimensions, ~100ms per document, ONNX backend) |
@@ -291,8 +291,7 @@ gitignored).
    language paragraphs describing each category. The embedding model captures
    meaning, enabling detection of concepts expressed in different words.
 
-6. **Local sentence-transformers with ONNX Runtime over Azure OpenAI.** No API
-   dependency, free, runs anywhere. ONNX Runtime replaces PyTorch for inference,
-   reducing container image size and memory usage (~50MB runtime vs ~5GB for
-   full PyTorch). For production, Azure OpenAI text-embedding-3-small would
+6. **Local ONNX Runtime inference over Azure OpenAI.** No API dependency, free,
+   runs anywhere. Direct ONNX Runtime inference (no PyTorch) keeps the worker
+   image at ~190MB. For production, Azure OpenAI text-embedding-3-small would
    provide higher quality embeddings within Azure's data boundary.
```
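Editor's note: the table's rule-based row is compact, so here is an illustrative sketch of weighted keyword scoring mapped to a privacy level. The keywords, weights, and thresholds are hypothetical, not taken from the repo:

```python
# Illustrative only — keywords, weights, and thresholds are hypothetical,
# not the repository's actual lists.
WEIGHTED_KEYWORDS = {
    "iban": 3.0,
    "salary": 2.0,
    "confidential": 2.0,
    "address": 1.0,
}

def privacy_level(text: str) -> str:
    lowered = text.lower()
    # Sum the weights of every keyword that appears in the document.
    score = sum(w for kw, w in WEIGHTED_KEYWORDS.items() if kw in lowered)
    if score >= 4.0:
        return "Secret"
    if score >= 2.0:
        return "Confidential"
    return "Public"
```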

docs/classification.md

Lines changed: 7 additions & 7 deletions
```diff
@@ -73,8 +73,8 @@ approach is appropriate — a key interview talking point.
 > below sea level in a polder, near river flood plains, or in a zone
 > requiring active water management infrastructure."
 
-2. **Embed the anchors** at startup using sentence-transformers (all-MiniLM-L6-v2,
-   384 dimensions). Each anchor becomes a 384-number vector that captures its meaning.
+2. **Embed the anchors** at startup using all-MiniLM-L6-v2 via ONNX Runtime (384
+   dimensions). Each anchor becomes a 384-number vector that captures its meaning.
 
 3. **Embed each document** — the full extracted text becomes a vector.
 
@@ -184,11 +184,11 @@ We use pgvector (PostgreSQL extension) for the demo. For production at a bank,
 
 ### Embedding model
 
-We use `all-MiniLM-L6-v2` (sentence-transformers, 384 dimensions) with **ONNX Runtime**
-as the inference backend instead of PyTorch. This reduces the runtime from ~5GB (full
-PyTorch) to ~50MB (ONNX Runtime), which means faster container image pulls, faster pod
-startup when KEDA scales workers, and lower memory per pod. Free, runs locally, no API
-dependency. For production:
+We use `all-MiniLM-L6-v2` (384 dimensions) with **ONNX Runtime** for inference,
+loaded directly via `huggingface-hub` and tokenized with `transformers`. No PyTorch
+dependency — the worker image is ~190MB instead of ~3GB. This means faster container
+image pulls, faster pod startup when KEDA scales workers, and lower memory per pod.
+Free, runs locally, no API dependency. For production:
 
 **Azure OpenAI `text-embedding-3-small`** would be preferred:
 - Higher quality embeddings (1536 dimensions)
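```
Editor's note: for concreteness, a minimal sketch of the anchor-similarity comparison these steps build toward. It assumes L2-normalized embeddings (as the journal notes below describe); the 0.35 threshold is a made-up illustration, not the project's value:

```python
import numpy as np

# Sketch of the anchor-comparison step. Assumes doc and anchor vectors are
# already L2-normalized 384-dim embeddings; the threshold is hypothetical.
def semantic_labels(
    doc_vector: np.ndarray,                  # document embedding
    anchor_vectors: dict[str, np.ndarray],   # category -> anchor embedding
    threshold: float = 0.35,
) -> list[str]:
    # With L2-normalized vectors, the dot product is cosine similarity.
    return [
        name
        for name, anchor in anchor_vectors.items()
        if float(doc_vector @ anchor) >= threshold
    ]
```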

docs/demo-guide.md

Lines changed: 8 additions & 7 deletions
```diff
@@ -215,13 +215,14 @@ for RAG workloads on Azure.
 **When to mention:** When asked about the embedding model or container optimization.
 
 **What to say:**
-> "I used sentence-transformers locally — specifically all-MiniLM-L6-v2 — with ONNX
-> Runtime as the inference backend instead of PyTorch. ONNX Runtime is about 50MB
-> versus 5GB for full PyTorch, which makes a big difference in a K8s environment —
-> faster image pulls, faster pod startup when KEDA scales up, and lower memory per
-> worker. For production, I'd use Azure OpenAI's text-embedding-3-small model. The
-> embeddings are higher quality, and the data stays within Azure's boundary —
-> important for a bank's compliance requirements."
+> "I run all-MiniLM-L6-v2 locally with ONNX Runtime — no PyTorch at all. The
+> worker image is about 190MB instead of 3GB, which makes a big difference in a
+> K8s environment — faster image pulls, faster pod startup when KEDA scales up,
+> and lower memory per worker. I load the ONNX model from HuggingFace Hub and
+> tokenize with the HuggingFace tokenizer, then run inference directly through
+> ONNX Runtime. For production, I'd use Azure OpenAI's text-embedding-3-small
+> model. The embeddings are higher quality, and the data stays within Azure's
+> boundary — important for a bank's compliance requirements."
 
 ### Chaos Engineering
 **When to mention:** When they seem impressed by the self-healing demo.
```

docs/dictionary.md

Lines changed: 7 additions & 10 deletions
```diff
@@ -398,18 +398,15 @@ words. For example, "the site was a textile dyeing factory" and "ground contamin
 industrial chemical processing" would have similar embeddings because they're about the same
 concept — industrial pollution.
 
-We use the `all-MiniLM-L6-v2` model (sentence-transformers) which produces 384-dimensional
-vectors. "384-dimensional" means each text becomes a list of 384 numbers.
+We use the `all-MiniLM-L6-v2` model which produces 384-dimensional vectors.
+"384-dimensional" means each text becomes a list of 384 numbers.
 
 ### ONNX Runtime
-An optimized inference engine for running machine learning models. Instead of using
-full PyTorch (~5GB) to run the embedding model, we export the model to ONNX format
-and run it with ONNX Runtime (~50MB). Same model, same results, much smaller footprint.
-
-sentence-transformers supports this with a single parameter:
-```python
-SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
-```
+An optimized inference engine for running machine learning models. We load the
+ONNX model directly from HuggingFace Hub and run inference with ONNX Runtime +
+a HuggingFace tokenizer — no PyTorch dependency at all. This keeps the worker
+container image at ~190MB instead of ~3GB. Same model, same results, much smaller
+footprint.
 
 This is important for K8s because smaller images mean faster pod startup (KEDA can
 scale workers up more quickly) and lower memory usage per pod.
```
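Editor's note: a hedged sketch of what "load the ONNX model directly from HuggingFace Hub" can look like. The repo id and file name assume the Hub-published ONNX export of all-MiniLM-L6-v2; the project's actual paths may differ:

```python
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# Assumption: the ONNX export ships as onnx/model.onnx in the
# sentence-transformers/all-MiniLM-L6-v2 Hub repo.
model_path = hf_hub_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
    filename="onnx/model.onnx",
)
# CPU provider matches the torch-free, small-image setup described above.
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
```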

docs/implementation-plan.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -25,7 +25,7 @@
 | 11 | Polish and demo rehearsal | MUST | 1.5-2h | DONE (full demo rehearsal completed 2026-04-02) |
 | -- | **Day 4: Enhancements** | -- | -- | **DONE** |
 | 12 | Azure Blob Storage integration | HIGH | 2h | DONE (PDFs stored in Azure, metrics in Grafana) |
-| 13 | Switch to ONNX Runtime | MEDIUM | 15min | DONE (sentence-transformers backend="onnx", ~50MB vs ~5GB PyTorch) |
+| 13 | Switch to ONNX Runtime | MEDIUM | 15min | DONE (direct ONNX Runtime, no PyTorch/sentence-transformers, image ~190MB vs ~3GB) |
 
 **If time runs short:** Cut from the bottom. Stages 0-5 + 11 are non-negotiable. Stage 6 (Grafana) is
 the most important "nice to have" because it's the visual centerpiece of the demo. Stages 7-10 can
```

journal/260402-chaos-mesh-testing-and-demo-rehearsal.md

Lines changed: 26 additions & 15 deletions
```diff
@@ -71,18 +71,29 @@ Applied via `helm upgrade --install`. Chaos daemons restarted and all experiment
 names across all docs, updated implementation plan progress.
 - **CI fix:** `src/worker/store.py` had a ruff format issue (function args on one line).
 
-## Files changed
-
-- `k8s/chaos/pod-kill.yaml` — changed mode from `fixed`/`value: "2"` to `one`
-- `infra/helm-install.sh` — added containerd runtime settings for Chaos Mesh
-- `infra/setup.sh` — corrected storage account name
-- `docs/chaos-experiments.md` — added containerd prerequisite note and verified results
-- `docs/demo-guide.md` — updated rolling update section, removed stale Postgres checklist item
-- `docs/implementation-plan.md` — marked chaos, rolling update, demo rehearsal as DONE
-- `locust/locustfile.py` — reweighted for async pipeline, removed sync generate task
-- `src/worker/store.py` — simplified blob path (filename only, no UUID prefix)
-- `tests/test_store.py` — updated blob path assertions
-- `README.md` — corrected test count (51 → 92)
-- `CLAUDE.md` — removed chaos mesh and demo rehearsal from "Not yet done"
-- `.engineering-team/architecture-plan.md` — updated demo script and Azure resource names
-- `.github/workflows/deploy.yml` — corrected AKS cluster name and resource group
+## CI/CD pipeline
+
+- **docker.yml:** Now builds images once and pushes to both ghcr.io and ACR. ACR push
+  is gated on `ACR_LOGIN_SERVER` variable + `ACR_CLIENT_ID`/`ACR_CLIENT_SECRET` secrets.
+- **deploy.yml:** Triggers via `workflow_run` after docker.yml completes (no duplicate builds).
+  Uses `azure/login@v2` with service principal creds JSON for AKS access.
+- **Secrets configured:** `ACR_CLIENT_ID`, `ACR_CLIENT_SECRET`, `AZURE_TENANT_ID`,
+  `AZURE_SUBSCRIPTION_ID`, `AZURE_CREDENTIALS`, plus `ACR_LOGIN_SERVER` variable.
+- **Full pipeline working:** push → CI (lint+test) + Docker (build+push to ghcr.io+ACR) → Deploy (AKS).
+
+## Remove torch: direct ONNX Runtime inference
+
+Replaced `sentence-transformers` (which requires torch ~2GB) with direct ONNX Runtime
+inference. The worker image dropped from **~3GB to ~190MB** (93% reduction).
+
+- `src/worker/semantic.py` — replaced `SentenceTransformer.encode()` with manual
+  tokenization (HuggingFace `AutoTokenizer`) + ONNX inference + numpy mean pooling
+  + L2 normalization. Same model (all-MiniLM-L6-v2), same 384-dim output.
+- `pyproject.toml` — replaced `sentence-transformers[onnx]` with `onnxruntime`,
+  `transformers`, `huggingface-hub`. Removed 31 packages from lock file.
+- ONNX model downloaded from HuggingFace Hub on first use via `hf_hub_download`.
+
+## Redis OOM fix
+
+Redis was OOMKilled with 128Mi memory limit after accumulating a large backlog from
+Locust testing. Increased to 512Mi limit / 256Mi request via kubectl patch.
```
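Editor's note: the `semantic.py` bullet compresses four steps into one line. A minimal sketch of that torch-free pipeline, assuming the Hub-published ONNX export of all-MiniLM-L6-v2 (the model file path, input/output layout, and CPU provider are assumptions, not lifted from the repo):

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# Sketch of the pipeline described above: HuggingFace tokenization, ONNX
# Runtime inference, numpy mean pooling, L2 normalization. Assumes the ONNX
# export ships as onnx/model.onnx in the model's Hub repo.
MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
session = ort.InferenceSession(
    hf_hub_download(MODEL_ID, "onnx/model.onnx"),
    providers=["CPUExecutionProvider"],
)

def embed(texts: list[str]) -> np.ndarray:
    # Tokenize straight to numpy — no torch tensors anywhere.
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
    # First output: per-token hidden states, shape (batch, seq_len, 384).
    hidden = session.run(None, dict(encoded))[0]
    # Mean pooling over real tokens only (padding masked out via attention mask).
    mask = encoded["attention_mask"][:, :, None].astype(hidden.dtype)
    pooled = (hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)
    # L2-normalize so downstream dot products are cosine similarities.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```

Tokenizing with `return_tensors="np"` keeps the whole path in numpy, which is what makes dropping torch from the image possible.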
