
Commit 3bb7869

johnmathews and claude committed
Update docs for direct ONNX Runtime inference and CI/CD pipeline
- Remove sentence-transformers/PyTorch references from all docs
- Update image size references (~190MB, not ~3GB or ~50MB)
- Update demo talking points for ONNX approach
- Add CI/CD pipeline and ONNX migration to journal

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 24c1596 commit 3bb7869

6 files changed: 53 additions & 45 deletions


docs/architecture.md

Lines changed: 4 additions & 5 deletions
```diff
@@ -154,7 +154,7 @@ storing → completed).
 | Aspect | Detail |
 |---|---|
 | Rule-based | Weighted keyword scoring → Privacy level (Public/Confidential/Secret) |
-| Semantic | sentence-transformers + ONNX Runtime embedding → Environmental impact, Industry sectors, Privacy |
+| Semantic | ONNX Runtime + HuggingFace tokenizer embedding → Environmental impact, Industry sectors, Privacy |
 | Input | Redis stream `extracted` |
 | Output | Redis stream `classified` |
 | Model | all-MiniLM-L6-v2 (384 dimensions, ~100ms per document, ONNX backend) |
@@ -291,8 +291,7 @@ gitignored).
    language paragraphs describing each category. The embedding model captures
    meaning, enabling detection of concepts expressed in different words.
 
-6. **Local sentence-transformers with ONNX Runtime over Azure OpenAI.** No API
-   dependency, free, runs anywhere. ONNX Runtime replaces PyTorch for inference,
-   reducing container image size and memory usage (~50MB runtime vs ~5GB for
-   full PyTorch). For production, Azure OpenAI text-embedding-3-small would
+6. **Local ONNX Runtime inference over Azure OpenAI.** No API dependency, free,
+   runs anywhere. Direct ONNX Runtime inference (no PyTorch) keeps the worker
+   image at ~190MB. For production, Azure OpenAI text-embedding-3-small would
    provide higher quality embeddings within Azure's data boundary.
```
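Editor's note: the table's rule-based row is compact, so here is an illustrative sketch of weighted keyword scoring mapped to a privacy level. The keywords, weights, and thresholds are hypothetical, not taken from the repo:

```python
# Illustrative only — keywords, weights, and thresholds are hypothetical,
# not the repository's actual lists.
WEIGHTED_KEYWORDS = {
    "iban": 3.0,
    "salary": 2.0,
    "confidential": 2.0,
    "address": 1.0,
}

def privacy_level(text: str) -> str:
    lowered = text.lower()
    # Sum the weights of every keyword that appears in the document.
    score = sum(w for kw, w in WEIGHTED_KEYWORDS.items() if kw in lowered)
    if score >= 4.0:
        return "Secret"
    if score >= 2.0:
        return "Confidential"
    return "Public"
```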

docs/classification.md

Lines changed: 7 additions & 7 deletions
```diff
@@ -73,8 +73,8 @@ approach is appropriate — a key interview talking point.
 > below sea level in a polder, near river flood plains, or in a zone
 > requiring active water management infrastructure."
 
-2. **Embed the anchors** at startup using sentence-transformers (all-MiniLM-L6-v2,
-   384 dimensions). Each anchor becomes a 384-number vector that captures its meaning.
+2. **Embed the anchors** at startup using all-MiniLM-L6-v2 via ONNX Runtime (384
+   dimensions). Each anchor becomes a 384-number vector that captures its meaning.
 
 3. **Embed each document** — the full extracted text becomes a vector.
 
@@ -184,11 +184,11 @@ We use pgvector (PostgreSQL extension) for the demo. For production at a bank,
 
 ### Embedding model
 
-We use `all-MiniLM-L6-v2` (sentence-transformers, 384 dimensions) with **ONNX Runtime**
-as the inference backend instead of PyTorch. This reduces the runtime from ~5GB (full
-PyTorch) to ~50MB (ONNX Runtime), which means faster container image pulls, faster pod
-startup when KEDA scales workers, and lower memory per pod. Free, runs locally, no API
-dependency. For production:
+We use `all-MiniLM-L6-v2` (384 dimensions) with **ONNX Runtime** for inference,
+loaded directly via `huggingface-hub` and tokenized with `transformers`. No PyTorch
+dependency — the worker image is ~190MB instead of ~3GB. This means faster container
+image pulls, faster pod startup when KEDA scales workers, and lower memory per pod.
+Free, runs locally, no API dependency. For production:
 
 **Azure OpenAI `text-embedding-3-small`** would be preferred:
 - Higher quality embeddings (1536 dimensions)
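```
Editor's note: for concreteness, a minimal sketch of the anchor-similarity comparison these steps build toward. It assumes L2-normalized embeddings (as the journal notes below describe); the 0.35 threshold is a made-up illustration, not the project's value:

```python
import numpy as np

# Sketch of the anchor-comparison step. Assumes doc and anchor vectors are
# already L2-normalized 384-dim embeddings; the threshold is hypothetical.
def semantic_labels(
    doc_vector: np.ndarray,                  # document embedding
    anchor_vectors: dict[str, np.ndarray],   # category -> anchor embedding
    threshold: float = 0.35,
) -> list[str]:
    # With L2-normalized vectors, the dot product is cosine similarity.
    return [
        name
        for name, anchor in anchor_vectors.items()
        if float(doc_vector @ anchor) >= threshold
    ]
```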

docs/demo-guide.md

Lines changed: 8 additions & 7 deletions
```diff
@@ -215,13 +215,14 @@ for RAG workloads on Azure.
 **When to mention:** When asked about the embedding model or container optimization.
 
 **What to say:**
-> "I used sentence-transformers locally — specifically all-MiniLM-L6-v2 — with ONNX
-> Runtime as the inference backend instead of PyTorch. ONNX Runtime is about 50MB
-> versus 5GB for full PyTorch, which makes a big difference in a K8s environment —
-> faster image pulls, faster pod startup when KEDA scales up, and lower memory per
-> worker. For production, I'd use Azure OpenAI's text-embedding-3-small model. The
-> embeddings are higher quality, and the data stays within Azure's boundary —
-> important for a bank's compliance requirements."
+> "I run all-MiniLM-L6-v2 locally with ONNX Runtime — no PyTorch at all. The
+> worker image is about 190MB instead of 3GB, which makes a big difference in a
+> K8s environment — faster image pulls, faster pod startup when KEDA scales up,
+> and lower memory per worker. I load the ONNX model from HuggingFace Hub and
+> tokenize with the HuggingFace tokenizer, then run inference directly through
+> ONNX Runtime. For production, I'd use Azure OpenAI's text-embedding-3-small
+> model. The embeddings are higher quality, and the data stays within Azure's
+> boundary — important for a bank's compliance requirements."
 
 ### Chaos Engineering
 **When to mention:** When they seem impressed by the self-healing demo.
```

docs/dictionary.md

Lines changed: 7 additions & 10 deletions
```diff
@@ -398,18 +398,15 @@ words. For example, "the site was a textile dyeing factory" and "ground contamin
 industrial chemical processing" would have similar embeddings because they're about the same
 concept — industrial pollution.
 
-We use the `all-MiniLM-L6-v2` model (sentence-transformers) which produces 384-dimensional
-vectors. "384-dimensional" means each text becomes a list of 384 numbers.
+We use the `all-MiniLM-L6-v2` model which produces 384-dimensional vectors.
+"384-dimensional" means each text becomes a list of 384 numbers.
 
 ### ONNX Runtime
-An optimized inference engine for running machine learning models. Instead of using
-full PyTorch (~5GB) to run the embedding model, we export the model to ONNX format
-and run it with ONNX Runtime (~50MB). Same model, same results, much smaller footprint.
-
-sentence-transformers supports this with a single parameter:
-```python
-SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
-```
+An optimized inference engine for running machine learning models. We load the
+ONNX model directly from HuggingFace Hub and run inference with ONNX Runtime +
+a HuggingFace tokenizer — no PyTorch dependency at all. This keeps the worker
+container image at ~190MB instead of ~3GB. Same model, same results, much smaller
+footprint.
 
 This is important for K8s because smaller images mean faster pod startup (KEDA can
 scale workers up more quickly) and lower memory usage per pod.
```
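Editor's note: a hedged sketch of what "load the ONNX model directly from HuggingFace Hub" can look like. The repo id and file name assume the Hub-published ONNX export of all-MiniLM-L6-v2; the project's actual paths may differ:

```python
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# Assumption: the ONNX export ships as onnx/model.onnx in the
# sentence-transformers/all-MiniLM-L6-v2 Hub repo.
model_path = hf_hub_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
    filename="onnx/model.onnx",
)
# CPU provider matches the torch-free, small-image setup described above.
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
```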

docs/implementation-plan.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -25,7 +25,7 @@
 | 11 | Polish and demo rehearsal | MUST | 1.5-2h | DONE (full demo rehearsal completed 2026-04-02) |
 | -- | **Day 4: Enhancements** | -- | -- | **DONE** |
 | 12 | Azure Blob Storage integration | HIGH | 2h | DONE (PDFs stored in Azure, metrics in Grafana) |
-| 13 | Switch to ONNX Runtime | MEDIUM | 15min | DONE (sentence-transformers backend="onnx", ~50MB vs ~5GB PyTorch) |
+| 13 | Switch to ONNX Runtime | MEDIUM | 15min | DONE (direct ONNX Runtime, no PyTorch/sentence-transformers, image ~190MB vs ~3GB) |
 
 **If time runs short:** Cut from the bottom. Stages 0-5 + 11 are non-negotiable. Stage 6 (Grafana) is
 the most important "nice to have" because it's the visual centerpiece of the demo. Stages 7-10 can
```

journal/260402-chaos-mesh-testing-and-demo-rehearsal.md

Lines changed: 26 additions & 15 deletions
```diff
@@ -71,18 +71,29 @@ Applied via `helm upgrade --install`. Chaos daemons restarted and all experiment
 names across all docs, updated implementation plan progress.
 - **CI fix:** `src/worker/store.py` had a ruff format issue (function args on one line).
 
-## Files changed
-
-- `k8s/chaos/pod-kill.yaml` — changed mode from `fixed`/`value: "2"` to `one`
-- `infra/helm-install.sh` — added containerd runtime settings for Chaos Mesh
-- `infra/setup.sh` — corrected storage account name
-- `docs/chaos-experiments.md` — added containerd prerequisite note and verified results
-- `docs/demo-guide.md` — updated rolling update section, removed stale Postgres checklist item
-- `docs/implementation-plan.md` — marked chaos, rolling update, demo rehearsal as DONE
-- `locust/locustfile.py` — reweighted for async pipeline, removed sync generate task
-- `src/worker/store.py` — simplified blob path (filename only, no UUID prefix)
-- `tests/test_store.py` — updated blob path assertions
-- `README.md` — corrected test count (51 → 92)
-- `CLAUDE.md` — removed chaos mesh and demo rehearsal from "Not yet done"
-- `.engineering-team/architecture-plan.md` — updated demo script and Azure resource names
-- `.github/workflows/deploy.yml` — corrected AKS cluster name and resource group
+## CI/CD pipeline
+
+- **docker.yml:** Now builds images once and pushes to both ghcr.io and ACR. ACR push
+  is gated on `ACR_LOGIN_SERVER` variable + `ACR_CLIENT_ID`/`ACR_CLIENT_SECRET` secrets.
+- **deploy.yml:** Triggers via `workflow_run` after docker.yml completes (no duplicate builds).
+  Uses `azure/login@v2` with service principal creds JSON for AKS access.
+- **Secrets configured:** `ACR_CLIENT_ID`, `ACR_CLIENT_SECRET`, `AZURE_TENANT_ID`,
+  `AZURE_SUBSCRIPTION_ID`, `AZURE_CREDENTIALS`, plus `ACR_LOGIN_SERVER` variable.
+- **Full pipeline working:** push → CI (lint+test) + Docker (build+push to ghcr.io+ACR) → Deploy (AKS).
+
+## Remove torch: direct ONNX Runtime inference
+
+Replaced `sentence-transformers` (which requires torch ~2GB) with direct ONNX Runtime
+inference. The worker image dropped from **~3GB to ~190MB** (93% reduction).
+
+- `src/worker/semantic.py` — replaced `SentenceTransformer.encode()` with manual
+  tokenization (HuggingFace `AutoTokenizer`) + ONNX inference + numpy mean pooling
+  + L2 normalization. Same model (all-MiniLM-L6-v2), same 384-dim output.
+- `pyproject.toml` — replaced `sentence-transformers[onnx]` with `onnxruntime`,
+  `transformers`, `huggingface-hub`. Removed 31 packages from lock file.
+- ONNX model downloaded from HuggingFace Hub on first use via `hf_hub_download`.
+
+## Redis OOM fix
+
+Redis was OOMKilled with 128Mi memory limit after accumulating a large backlog from
+Locust testing. Increased to 512Mi limit / 256Mi request via kubectl patch.
```
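Editor's note: the `semantic.py` bullet compresses four steps into one line. A minimal sketch of that torch-free pipeline, assuming the Hub-published ONNX export of all-MiniLM-L6-v2 (the model file path, input/output layout, and CPU provider are assumptions, not lifted from the repo):

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# Sketch of the pipeline described above: HuggingFace tokenization, ONNX
# Runtime inference, numpy mean pooling, L2 normalization. Assumes the ONNX
# export ships as onnx/model.onnx in the model's Hub repo.
MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
session = ort.InferenceSession(
    hf_hub_download(MODEL_ID, "onnx/model.onnx"),
    providers=["CPUExecutionProvider"],
)

def embed(texts: list[str]) -> np.ndarray:
    # Tokenize straight to numpy — no torch tensors anywhere.
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="np")
    # First output: per-token hidden states, shape (batch, seq_len, 384).
    hidden = session.run(None, dict(encoded))[0]
    # Mean pooling over real tokens only (padding masked out via attention mask).
    mask = encoded["attention_mask"][:, :, None].astype(hidden.dtype)
    pooled = (hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)
    # L2-normalize so downstream dot products are cosine similarities.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```

Tokenizing with `return_tensors="np"` keeps the whole path in numpy, which is what makes dropping torch from the image possible.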
