From 243b07e90dcc0e979717703186c3ac9baf2c9bd7 Mon Sep 17 00:00:00 2001
From: Khoa Ngo <ngominhkhoa2006@gmail.com>
Date: Sun, 20 Jul 2025 11:21:10 +0700
Subject: [PATCH 01/12] added documentation

---
 README.md                    |  20 ++
 lorax_deployment_playbook.md | 383 +++++++++++++++++++++++++++++++++++
 server/requirements.txt      |  14 --
 3 files changed, 403 insertions(+), 14 deletions(-)
 create mode 100644 lorax_deployment_playbook.md

diff --git a/README.md b/README.md
index 8998cfbdc..d6d8582a5 100644
--- a/README.md
+++ b/README.md
@@ -16,6 +16,19 @@ _LoRAX: Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs_
 
 LoRAX (LoRA eXchange) is a framework that allows users to serve thousands of fine-tuned models on a single GPU, dramatically reducing the cost of serving without compromising on throughput or latency.
 
+---
+
+**🚀 Start Here: For a Robust & Reliable LoRAX Deployment**
+
+While this `README.md` provides a general overview, setting up a performant LoRAX server depends on specific hardware, software, and environment configuration. For a step-by-step walkthrough that flags the common failure points, see the detailed **[LoRAX Deployment Playbook](lorax_deployment_playbook.md)**. This guide covers:
+
+* **[Bulletproof Host System Setup](lorax_deployment_playbook.md#phase-1-host-setup):** NVIDIA drivers, Docker, `nvidia-container-toolkit`, and crucial user permissions.
+* **[GPU VRAM Considerations](lorax_deployment_playbook.md#phase-2-deploy-lorax):** Understanding LLM memory requirements and selecting compatible models for your hardware.
+* **[Pre-Built vs. Source Deployment](lorax_deployment_playbook.md#phase-2-deploy-lorax):** Choosing the fastest path or building from source with all CUDA kernels.
+* **[Common Pitfalls & Troubleshooting](lorax_deployment_playbook.md#troubleshooting-guide):** Solutions for Hugging Face authentication, model download stalls, and more.
+
+---
+
 ## 📖 Table of contents
 
 - [📖 Table of contents](#-table-of-contents)
@@ -59,6 +72,9 @@ Base models can be loaded in fp16 or quantized with `bitsandbytes`, [GPT-Q](http
 
 Supported adapters include LoRA adapters trained using the [PEFT](https://github.com/huggingface/peft) and [Ludwig](https://ludwig.ai/) libraries. Any of the linear layers in the model can be adapted via LoRA and loaded in LoRAX.
 
+**⚙️ Model Compatibility & VRAM:** Selecting the right model for your GPU's VRAM is crucial. Not all quantized models are plug-and-play due to varying toolchains. For detailed guidance on VRAM limitations and troubleshooting quantized model errors (e.g., `CUDA out of memory`, `RuntimeError`), refer to **[Phase 2: Deploy LoRAX](lorax_deployment_playbook.md#phase-2-deploy-lorax)** in the LoRAX Deployment Playbook.
+
+
 ## 🏃‍♂️ Getting Started
 
 We recommend starting with our pre-built Docker image to avoid compiling custom CUDA kernels and other dependencies.
@@ -72,6 +88,8 @@ The minimum system requirements need to run LoRAX include:
 - Linux OS
 - Docker (for this guide)
 
+**🚨 Critical Setup Note:** Meeting these requirements can be complex. For a step-by-step, verified guide on installing GPU drivers, Docker Engine, and `nvidia-container-toolkit` (including essential user permissions), please follow **[Phase 1: Host Setup](lorax_deployment_playbook.md#phase-1-host-setup)** in the LoRAX Deployment Playbook. Incorrect setup here is the most common cause of deployment failures.
+
 ### Launch LoRAX Server
 
 #### Prerequisites
@@ -80,6 +98,8 @@ Then
  - `sudo systemctl daemon-reload`
  - `sudo systemctl restart docker`
 
+**💡 For the most reliable and fully explained `docker run` command, including critical flags (`-e HUGGING_FACE_HUB_TOKEN`, `--user`), model selection based on GPU VRAM, and troubleshooting common issues like model download stalls or quantized model compatibility, refer to our comprehensive guide: [Phase 2: Deploy LoRAX](lorax_deployment_playbook.md#phase-2-deploy-lorax) and [Phase 3: Test the API](lorax_deployment_playbook.md#phase-3-test-the-api) in the LoRAX Deployment Playbook.**
+
 ```shell
 model=mistralai/Mistral-7B-Instruct-v0.1
 volume=$PWD/data
diff --git a/lorax_deployment_playbook.md b/lorax_deployment_playbook.md
new file mode 100644
index 000000000..ac858d46d
--- /dev/null
+++ b/lorax_deployment_playbook.md
@@ -0,0 +1,383 @@
+# 🚀 LoRAX Deployment Playbook
+
+Welcome to the **LoRAX Deployment Playbook**! This guide is designed for **first-time operators** setting up a **LoRAX server** on a fresh **Ubuntu 22.04** GPU host with **sudo** access. We'll walk you through each step, explain *why* it matters, and provide quick fixes for common issues. Let's get your **LoRAX server** up and running! 🎉
+
+> **Goal:** Deploy a working **LoRAX server** with a chosen model, understand the process, and troubleshoot issues fast.
+
+---
+
+## 📋 Overview
+
+To deploy **LoRAX**, you need these components in order:
+
+1. **GPU Driver** – Verify `nvidia-smi` works on the host.
+2. **Docker Engine** – Ensure the user is in the `docker` group.
+3. **NVIDIA Container Runtime** – Make GPUs accessible inside containers.
+4. **LoRAX Container** – Pull or build the container image.
+5. **Model Files** – Download or cache model files.
+6. **API** – Confirm the server is listening and passes a basic inference test.
+
+> **Quick Sanity Check:** Stop at the first failure in this sequence:
+> - **A.** Run `nvidia-smi` on the host.
+> - **B.** Test GPU access in a container: `docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi`.
+> - **C.** Launch **LoRAX** with `MODEL_ID=gpt2`.
+> - **D.** Test the API with `curl`.
+> - **E.** Scale up to a larger model.
+
+---
+
+## Phase 1: Host Setup
+
+### 1. Verify NVIDIA Driver ✅
+
+Ensure your **NVIDIA driver** is working correctly.
+
+```bash
+nvidia-smi
+```
+
+**Success:** Displays a table with the driver version and GPU details.  
+**Common Failures:**
+- *`command not found`* → Driver not installed or PATH issue.
+- *“NVIDIA-SMI has failed”* → Kernel module mismatch or Secure Boot blocking.
+
+> **Fix:** Reinstall the NVIDIA driver or disable/enroll MOK for Secure Boot.
+
+---
+
+### 2. Install Docker 🐳
+
+Set up **Docker** to run containers on **Ubuntu 22.04**.
+
+```bash
+sudo apt update
+sudo apt install -y ca-certificates curl
+sudo install -m 0755 -d /etc/apt/keyrings
+sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
+sudo chmod a+r /etc/apt/keyrings/docker.asc
+echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
+sudo apt update
+sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
+```
+
+**What This Does:**
+- Updates package metadata.
+- Installs tools for HTTPS repositories.
+- Sets up Docker’s GPG key and repository.
+- Installs **Docker Engine**, CLI, and plugins.
+
+**Success:** Run `docker --version` and `systemctl status docker` (should show *active (running)*).  
+**Common Failures:**
+- GPG/repo errors (“NO_PUBKEY”, “Unsigned”) → Key issue; redo key setup.
+- Architecture mismatch on non-x86 hosts.
+
+> **Fix:** Re-run key download steps and `apt update`.
+
+---
+
+### 3. Install NVIDIA Container Toolkit 🔧
+
+Enable GPU access inside **Docker containers**.
+
+```bash
+curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
+curl -s -L https://nvidia.github.io/libnvidia-container/ubuntu22.04/libnvidia-container.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list > /dev/null
+sudo apt update
+sudo apt install -y nvidia-container-toolkit
+sudo nvidia-ctk runtime configure --runtime=docker
+sudo systemctl restart docker
+```
+
+**What This Does:**
+- Adds the NVIDIA Container Toolkit repository.
+- Installs the toolkit and configures Docker to use NVIDIA GPUs.
+
+**Success:** Check `/etc/docker/daemon.json` for `runtimes.nvidia`. Test with a CUDA container (Step 5).  
+**Common Failures:**
+- `nvidia-ctk: command not found` → Installation failed; redo apt steps.
+- “Could not select device driver” → Runtime misconfigured; re-run configure and restart.
+
+> **Fix:** Re-run the toolkit installation and configuration steps.
+
+---
+
+### 4. Add User to Docker Group 👤
+
+Allow running **Docker** commands without `sudo`.
+
+```bash
+sudo usermod -aG docker $USER
+newgrp docker
+```
+
+**Success:** `groups` shows `docker`; `docker ps` works without `sudo`.  
+**Common Failure:** Commands still require `sudo` → Log out and back in.
+
+> **Tip:** Log out and log back in to apply group changes.
+
+---
+
+### 5. Verify GPU in Container 🖥️
+
+Confirm GPUs are accessible inside a **Docker container**.
+
+```bash
+docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
+```
+
+**Success:** Displays a table similar to `nvidia-smi` on the host.  
+**Common Failures:**
+- “Unknown runtime specified nvidia” → Toolkit setup incomplete (redo Step 3).
+- “CUDA driver version insufficient” → Host driver outdated; update it.
+- “Could not select device driver” → Runtime misconfigured; redo Step 3.
+
+> **Fix:** Revisit NVIDIA Container Toolkit setup or update the host driver.
+
+---
+
+## Phase 2: Deploy LoRAX
+
+Choose one deployment path:
+- **(A) Pre-built Image** – Fastest option, recommended for most users.
+- **(B) Build from Source** – Only for custom changes or unreleased patches.
+
+### Option A: Pre-built Image 🎉
+
+#### 1. Pull the LoRAX Image
+
+```bash
+docker pull ghcr.io/predibase/lorax:main
+```
+
+
+**Success:** Image downloads successfully.  
+**Common Failure:** Network timeout → Retry or check connectivity.
+
+> **Tip:** This is a public image, so no authentication issues are expected.
+
+---
+
+#### 2. Choose Your Model 📊
+
+Start with **`gpt2`** for a quick test (it’s small and fast). Larger models require careful **VRAM** planning to avoid `CUDA out of memory` errors.
+
+| **Model** | **Params** | **VRAM (FP16/BF16)** | **Notes** |
+|-----------|------------|-----------------------|-----------|
+| `gpt2` | 0.1B | ~0.5 GB | Perfect for testing; fits any GPU. |
+| `bigcode/starcoder2-3b` | 3B | ~6–7 GB | Works on 8 GB VRAM GPUs. |
+| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | ~14–15 GB | Needs 16–24 GB VRAM. |
+| `meta-llama/Meta-Llama-3-8B-Instruct` | 8B | ~16 GB | Tight on 16 GB; better with 24 GB. |
+| `TheBloke/Mistral-7B-Instruct-v0.2-GPTQ` | 7B (4-bit) | ~8–10 GB | Quantized; fits 12–16 GB VRAM. |
+| `meta-llama/Llama-2-13b-chat-hf` | 13B | ~26 GB | Exceeds a 24 GB GPU in FP16; use a larger GPU, multiple GPUs, or quantization. |
+| `meta-llama/Meta-Llama-3-70B-Instruct` | 70B | 135–140 GB | Needs multi-GPU or heavy quantization. |
+
+> **VRAM Tips:**
+> - Keep **10–15% VRAM free** for KV cache and overhead.
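+> - **6–8 GB GPUs**: Stick to `gpt2` or 3B-class models such as `bigcode/starcoder2-3b`.
+> - **12–16 GB GPUs**: Comfortable for 4-bit quantized 7B; FP16 7B–8B is tight even at 16 GB.
+> - **24 GB+ GPUs**: Comfortable for FP16 7B–8B or multi-instance setups; 13B in FP16 (~26 GB) still needs quantization or more memory.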
+
+---
+
+#### 3. Run the LoRAX Container
+
+```bash
+MODEL_ID="gpt2"
+SHARDED_MODEL="false"
+PORT=80
+
+docker run --rm \
+  --name lorax \
+  --gpus all \
+  -e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
+  -e TRANSFORMERS_CACHE=/data \
+  -v "$HOME/lorax_model_cache":/data \
+  -v "$HOME/lorax_outlines_cache":/root/.cache/outlines \
+  --user "$(id -u):$(id -g)" \
+  -p "${PORT}:80" \
+  ghcr.io/predibase/lorax:main \
+  --model-id "$MODEL_ID" \
+  --sharded "$SHARDED_MODEL"
+```
+
+
+**What This Does:**
+- Starts the **LoRAX container** with GPU access.
+- Mounts model cache to persist downloads.
+- Maps port **80** (container) to your chosen **host port**.
+- Loads the specified **model** (start with `gpt2`).
+
+**Success:** Logs show model download/cache hit and “Model loaded”; health endpoint responds.  
+**Common Failures:**
+- Stalled download → Network or Hugging Face rate limits.
+- `CUDA out of memory` → Model too large for GPU VRAM.
+
+> **Fix for Stalled Downloads:**
+> 1. Visit the model’s Hugging Face page (e.g., `https://huggingface.co/<model_id>/tree/main`).
+> 2. Note the commit hash from the URL or “Files and Versions.”
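+> 3. Create the cache path using the Hugging Face hub layout: `$HOME/lorax_model_cache/models--<org>--<model>/snapshots/<commit_hash>/` (for example, `models--mistralai--Mistral-7B-Instruct-v0.3`), and write the commit hash to `models--<org>--<model>/refs/main` so the `main` revision resolves.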
+> 4. Download all model files (config, tokenizer, `.safetensors`, etc.) to that directory.
+> 5. Re-run the container; it should use the cached files.
+
+---
+
+### Option B: Build from Source 🛠️
+
+Use this if you need custom changes or unreleased patches.
+
+#### 1. Clone the Repository
+
+```bash
+git clone https://github.com/predibase/lorax.git
+cd lorax
+```
+
+#### 2. Initialize Submodules (if needed)
+
+```bash
+git submodule update --init --recursive
+```
+
+
+#### 3. Build the Image
+
+```bash
+docker build -t my-lorax-server -f Dockerfile .
+```
+
+
+**Common Failures:**
+- Build stalls → Add `--network=host` to the build command.
+- Version conflicts → Adjust base image or dependencies.
+
+#### 4. Run the Container
+
+Use the same `docker run` command as in Option A, replacing `ghcr.io/predibase/lorax:main` with `my-lorax-server`.
+
+**Common Failures:**
+- “Exec format error” → Image built for wrong architecture.
+- Immediate exit → Library mismatch; rebuild with compatible CUDA base.
+
+---
+
+## Phase 3: Test the API
+
+Once logs show the server is ready, test the **LoRAX API**.
+
+```bash
+curl -i http://localhost:80/health
+```
+
+
+**Example Inference:**
+
+```bash
+curl -X POST http://localhost:80/generate \
+  -H 'Content-Type: application/json' \
+  -d '{"inputs": "Hello", "parameters": {"max_new_tokens": 32}}'
+```
+
+
+**Success:** Returns JSON with generated text.  
+**Common Failures:**
+- Connection refused → Container not running or wrong port (`docker ps`).
+- 404 → Wrong endpoint; check the API docs served at `/docs`.
+- 500 → Model not loaded or OOM (`docker logs lorax`).
+
+> **Fix:** Check logs with `docker logs lorax` and verify port mapping.
+
+---
+
+## Phase 4: Performance & Scaling Tips
+
+- **Concurrency:** Increase only after single-request stability (KV cache can cause OOM).
+- **Tuning Options:** Adjust `--max-concurrent-requests`, batching, or tensor parallelization (if supported).
+- **Monitor GPUs:**
+
+```bash
+watch -n1 nvidia-smi
+```
+
+
+---
+
+## Troubleshooting Guide
+
+**Format:** [Stage] Symptom → Cause → Fix
+
+- **[Host]** `nvidia-smi` fails → Driver issue → Check `dmesg | grep -i nvidia | tail -n5`; reinstall driver or fix Secure Boot.
+- **[Container]** “Could not select device driver” → Runtime misconfigured → Verify `/etc/docker/daemon.json`; redo toolkit setup.
+- **[Docker]** Cache permission denied → Root-owned files → Run `sudo chown -R $(id -u):$(id -g) $HOME/lorax_model_cache`.
+- **[Model Load]** CUDA OOM → Model too large → Check `nvidia-smi`; use smaller/quantized model.
+- **[Model Load]** Download stalls → Network issue → Use manual download workaround.
+- **[Model Load]** `RuntimeError: weight not found` → Quantized model incompatibility → Try FP16 or a different quantized model.
+- **[API]** 404 on generate → Wrong route → Check the routes listed at `http://localhost:80/docs`; adjust client.
+- **[API]** 500 error → OOM or bad params → Check `docker logs --tail 100 lorax | grep -i error`; reduce `max_new_tokens`.
+- **[Performance]** Slow first call → Warmup overhead → Send a short warmup prompt.
+- **[Performance]** Low GPU usage (<30%) → Small batches → Enable batching or increase concurrency.
+- **[Stability]** Exit code 137 → Host OOM → Check `dmesg | tail`; reduce model size.
+
+---
+
+## 🧠 Decision Matrix
+
+| **Situation** | **Action** |
+|---------------|------------|
+| `nvidia-smi` broken | Fix driver first. |
+| Container `nvidia-smi` fails | Fix NVIDIA runtime config. |
+| `gpt2` fails to load | Check environment/image. |
+| `gpt2` works, larger model fails | Address VRAM/quantization issues. |
+| API fails | Check routes, params, or logs. |
+| API slow | Optimize concurrency or use smaller model. |
+
+---
+
+## 🧹 Cleanup & Reset
+
+```bash
+docker stop lorax
+docker system prune -f
+sudo chown -R "$(id -u):$(id -g)" "$HOME/lorax_model_cache"  # take ownership first so root-owned files can be deleted
+rm -rf "$HOME/lorax_model_cache"/*
+```
+
+
+---
+
+## 📜 Quick Command Recap
+
+```bash
+nvidia-smi
+docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
+docker pull ghcr.io/predibase/lorax:main
+MODEL_ID="gpt2"; docker run --rm --gpus all -v "$HOME/lorax_model_cache":/data -p 80:80 ghcr.io/predibase/lorax:main --model-id "$MODEL_ID" --sharded false
+curl -i http://localhost:80/health
+```
+
+
+---
+
+## 🌟 Next Steps
+
+- **Monitoring:** Add logging/metrics with Prometheus or parse stdout.
+- **Security:** Set up a reverse proxy (nginx/traefik) with TLS for public access.
+- **Automation:** Create health/warmup scripts (e.g., systemd or Docker Compose).
+- **Reliability:** Add watchdog with `Restart=on-failure` (systemd or Docker policies).
+
+---
+
+**Happy Deploying!** 🎉
+
diff --git a/server/requirements.txt b/server/requirements.txt
index 036f3be8a..c808e2032 100644
--- a/server/requirements.txt
+++ b/server/requirements.txt
@@ -32,19 +32,6 @@ mpmath==1.3.0 ; python_version >= "3.9" and python_version < "4.0"
 multidict==6.1.0 ; python_version >= "3.9" and python_version < "4.0"
 networkx==3.2.1 ; python_version >= "3.9" and python_version < "4.0"
 numpy==1.26.4 ; python_version >= "3.9" and python_version < "4.0"
-nvidia-cublas-cu12==12.1.3.1 ; platform_system == "Linux" and platform_machine == "x86_64" and python_version >= "3.9" and python_version < "4.0"
-nvidia-cuda-cupti-cu12==12.1.105 ; platform_system == "Linux" and platform_machine == "x86_64" and python_version >= "3.9" and python_version < "4.0"
-nvidia-cuda-nvrtc-cu12==12.1.105 ; platform_system == "Linux" and platform_machine == "x86_64" and python_version >= "3.9" and python_version < "4.0"
-nvidia-cuda-runtime-cu12==12.1.105 ; platform_system == "Linux" and platform_machine == "x86_64" and python_version >= "3.9" and python_version < "4.0"
-nvidia-cudnn-cu12==9.1.0.70 ; platform_system == "Linux" and platform_machine == "x86_64" and python_version >= "3.9" and python_version < "4.0"
-nvidia-cufft-cu12==11.0.2.54 ; platform_system == "Linux" and platform_machine == "x86_64" and python_version >= "3.9" and python_version < "4.0"
-nvidia-curand-cu12==10.3.2.106 ; platform_system == "Linux" and platform_machine == "x86_64" and python_version >= "3.9" and python_version < "4.0"
-nvidia-cusolver-cu12==11.4.5.107 ; platform_system == "Linux" and platform_machine == "x86_64" and python_version >= "3.9" and python_version < "4.0"
-nvidia-cusparse-cu12==12.1.0.106 ; platform_system == "Linux" and platform_machine == "x86_64" and python_version >= "3.9" and python_version < "4.0"
-nvidia-ml-py==12.570.86 ; python_version >= "3.9" and python_version < "4.0"
-nvidia-nccl-cu12==2.20.5 ; platform_system == "Linux" and platform_machine == "x86_64" and python_version >= "3.9" and python_version < "4.0"
-nvidia-nvjitlink-cu12==12.8.61 ; platform_system == "Linux" and platform_machine == "x86_64" and python_version >= "3.9" and python_version < "4.0"
-nvidia-nvtx-cu12==12.1.105 ; platform_system == "Linux" and platform_machine == "x86_64" and python_version >= "3.9" and python_version < "4.0"
 opentelemetry-api==1.21.0 ; python_version >= "3.9" and python_version < "4.0"
 opentelemetry-exporter-otlp-proto-common==1.21.0 ; python_version >= "3.9" and python_version < "4.0"
 opentelemetry-exporter-otlp-proto-grpc==1.21.0 ; python_version >= "3.9" and python_version < "4.0"
@@ -74,7 +61,6 @@ stanford-stk==0.7.1 ; python_version >= "3.9" and python_version < "4.0" and sys
 sympy==1.13.3 ; python_version >= "3.9" and python_version < "4.0"
 tiktoken==0.5.2 ; python_version >= "3.9" and python_version < "4.0"
 tokenizers==0.21.0 ; python_version >= "3.9" and python_version < "4.0"
-torch==2.6.0 ; python_version >= "3.9" and python_version < "4.0"
 tqdm==4.67.1 ; python_version >= "3.9" and python_version < "4.0"
 transformers==4.49.0 ; python_version >= "3.9" and python_version < "4.0"
 triton==3.0.0 ; python_version >= "3.9" and sys_platform == "linux" and python_version < "4.0" or python_version >= "3.9" and python_version < "3.13" and platform_machine == "x86_64" and platform_system == "Linux"

From f6f7b0ecdf8124b62822fe67f788339b596087a2 Mon Sep 17 00:00:00 2001
From: Khoa Ngo <ngominhkhoa2006@gmail.com>
Date: Mon, 21 Jul 2025 11:06:22 +0700
Subject: [PATCH 02/12] edit deployment_playbook.md

---
 lorax_deployment_playbook.md | 409 +++++++++++++++++++++++++----------
 1 file changed, 296 insertions(+), 113 deletions(-)

diff --git a/lorax_deployment_playbook.md b/lorax_deployment_playbook.md
index ac858d46d..7e1055daf 100644
--- a/lorax_deployment_playbook.md
+++ b/lorax_deployment_playbook.md
@@ -28,7 +28,9 @@ To deploy **LoRAX**, you need these components in order:
 
 ## Phase 1: Host Setup
 
-### 1. Verify NVIDIA Driver ✅
+Before diving into installations, let's quickly check if your system already has the necessary components. Run the `Check` command for each step. If it passes, you can **skip** the corresponding installation section. If it fails, expand the "Installation Guide" to proceed.
+
+### 1. Check NVIDIA Driver ✅
 
 Ensure your **NVIDIA driver** is working correctly.
 
@@ -39,23 +41,62 @@ nvidia-smi
 **Success:** Displays a table with the driver version and GPU details.  
 **Common Failures:**
 - *`command not found`* → Driver not installed or PATH issue.
-- *“NVIDIA-SMI has failed”* → Kernel module mismatch or Secure Boot blocking.
+- *"NVIDIA-SMI has failed"* → Kernel module mismatch or Secure Boot blocking.
+
+<details>
+<summary>Click to expand: NVIDIA Driver Installation Guide</summary>
+
+Installing NVIDIA drivers can be complex and varies greatly by OS and GPU. **We strongly recommend following the official NVIDIA documentation for your specific GPU and Linux distribution.** Example: [NVIDIA Drivers Downloads](https://www.nvidia.com/Download/index.aspx).
 
-> **Fix:** Reinstall the NVIDIA driver or disable/enroll MOK for Secure Boot.
+</details>
 
 ---
 
-### 2. Install Docker 🐳
+### 2. Check Docker Engine Installation 🐳
+
+Run this command to check if Docker is installed and running:
+
+```bash
+# Check if we're inside a containerized environment where Docker can't run
+if grep -qa 'docker\|lxc' /proc/1/cgroup || [ -f /.dockerenv ]; then
+    echo "⚠️  Detected: This environment is containerized (Docker/LXC)."
+    echo "You CANNOT start Docker inside a container on most cloud GPU providers."
+    echo "👉  If you need full Docker access, deploy on a bare-metal or privileged VM."
+    echo "The script will exit in 10 seconds. Hit Ctrl+C to abort immediately."
+    sleep 10
+    echo "Exiting script."
+    exit 0
+fi
+
+if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
+    echo "Docker Engine: Installed and running. ✅"
+else
+    echo "Docker Engine: NOT detected or NOT running. ❌"
+fi
+```
+
+> **Outcome:** If you see "Docker Engine: Installed and running. ✅", you can skip the installation below.
+
+<details>
+<summary>Click to expand: Install Docker Engine</summary>
 
 Set up **Docker** to run containers on **Ubuntu 22.04**.
 
 ```bash
+sudo apt-get purge -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
+sudo apt-get autoremove -y --purge
+sudo rm -rf /var/lib/docker /var/lib/containerd  # WARNING: deletes ALL existing images, containers, and volumes
+
 sudo apt update
 sudo apt install -y ca-certificates curl
 sudo install -m 0755 -d /etc/apt/keyrings
 sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
 sudo chmod a+r /etc/apt/keyrings/docker.asc
-echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
+echo \
+  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
+  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
+  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
+
 sudo apt update
 sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
 ```
@@ -63,29 +104,66 @@ sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin d
 **What This Does:**
 - Updates package metadata.
 - Installs tools for HTTPS repositories.
-- Sets up Docker’s GPG key and repository.
+- Sets up Docker's GPG key and repository.
 - Installs **Docker Engine**, CLI, and plugins.
 
 **Success:** Run `docker --version` and `systemctl status docker` (should show *active (running)*).  
 **Common Failures:**
-- GPG/repo errors (“NO_PUBKEY”, “Unsigned”) → Key issue; redo key setup.
+- GPG/repo errors ("NO_PUBKEY", "Unsigned") → Key issue; redo key setup.
 - Architecture mismatch on non-x86 hosts.
 
 > **Fix:** Re-run key download steps and `apt update`.
 
+</details>
+
 ---
 
-### 3. Install NVIDIA Container Toolkit 🔧
+### 3. Check NVIDIA Container Toolkit 🔧
+
+Run this command to verify GPU access within a container (requires Docker and Toolkit):
+
+```bash
+docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
+```
+
+> **Outcome:** If you see GPU details (similar to `nvidia-smi` on host), you can skip the installation below.
+> **Common Failures:** "Unknown runtime specified nvidia" or "Could not select device driver" means the Toolkit is not correctly installed or configured.
+
+<details>
+<summary>Click to expand: Install NVIDIA Container Toolkit</summary>
 
 Enable GPU access inside **Docker containers**.
 
 ```bash
-curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
-curl -s -L https://nvidia.github.io/libnvidia-container/ubuntu22.04/libnvidia-container.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list > /dev/null
-sudo apt update
-sudo apt install -y nvidia-container-toolkit
+# Non-interactive NVIDIA Container Toolkit install for Ubuntu 22.04
+set -euo pipefail
+
+# -- CRITICAL CHECKS --
+[[ "$(lsb_release -rs)" = "22.04" ]] || echo "[WARNING] Not Ubuntu 22.04; the repository path below may not match your release."
+command -v docker >/dev/null || { echo "[FATAL] Docker not found."; exit 1; }
+
+# -- FORCE OVERWRITE EXISTING GPG KEY --
+sudo rm -f /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
+
+# -- ADD REPO & KEY (no prompt) --
+curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
+  sudo gpg --yes --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
+
+curl -fsSL https://nvidia.github.io/libnvidia-container/ubuntu22.04/libnvidia-container.list \
+| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' \
+| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list > /dev/null
+
+# -- INSTALL --
+sudo apt-get update
+sudo apt-get install -y nvidia-container-toolkit
+
+# -- CONFIGURE --
 sudo nvidia-ctk runtime configure --runtime=docker
 sudo systemctl restart docker
+
+# -- SANITY TEST --
+docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi \
+|| echo "[FATAL] Docker can't see your GPU. Drivers likely broken. Try 'nvidia-smi' on host."
 ```
 
 **What This Does:**
@@ -95,13 +173,26 @@ sudo systemctl restart docker
 **Success:** Check `/etc/docker/daemon.json` for `runtimes.nvidia`. Test with a CUDA container (Step 5).  
 **Common Failures:**
 - `nvidia-ctk: command not found` → Installation failed; redo apt steps.
-- “Could not select device driver” → Runtime misconfigured; re-run configure and restart.
+- "Could not select device driver" → Runtime misconfigured; re-run configure and restart.
 
 > **Fix:** Re-run the toolkit installation and configuration steps.
 
+</details>
+
 ---
 
-### 4. Add User to Docker Group 👤
+### 4. Check User in Docker Group 👤
+
+Run this command to check if your user is already in the 'docker' group:
+
+```bash
+groups | grep -q docker && echo "User is in the docker group." || echo "User is NOT in the docker group. Permissions needed."
+```
+
+> **Outcome:** If you see "User is in the docker group.", you can skip the steps below.
+
+<details>
+<summary>Click to expand: Add User to Docker Group</summary>
 
 Allow running **Docker** commands without `sudo`.
 
@@ -115,23 +206,63 @@ newgrp docker
 
 > **Tip:** Log out and log back in to apply group changes.
 
+</details>
+
 ---
 
-### 5. Verify GPU in Container 🖥️
+### 5. Hugging Face Authentication 🔑
+
+Some models on Hugging Face require authentication to download. This applies to "gated" models such as Mistral and Llama, whose access terms you must accept on the model page before downloading. You'll need a **Hugging Face Hub Token** to access these models.
+
+**What is a Hugging Face Hub Token?**
+A personal access token that acts like a password for programmatic access to Hugging Face. It allows LoRAX to download models on your behalf.
 
-Confirm GPUs are accessible inside a **Docker container**.
+Run this command to check if your `HUGGING_FACE_HUB_TOKEN` is already set as an environment variable:
 
 ```bash
-docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
+if [ -n "$HUGGING_FACE_HUB_TOKEN" ]; then
+    echo "HUGGING_FACE_HUB_TOKEN is set. ✅"
+else
+    echo "HUGGING_FACE_HUB_TOKEN is NOT set. ❌"
+fi
 ```
 
-**Success:** Displays a table similar to `nvidia-smi` on the host.  
-**Common Failures:**
-- “Unknown runtime specified nvidia” → Toolkit setup incomplete (redo Step 3).
-- “CUDA driver version insufficient” → Host driver outdated; update it.
-- “Could not select device driver” → Runtime misconfigured; redo Step 3.
+> **Outcome:** If you see "HUGGING_FACE_HUB_TOKEN is set. ✅", you can skip the manual setup steps below.
+
+<details>
+<summary>Click to expand: Set up HUGGING_FACE_HUB_TOKEN</summary>
+
+#### Get Your Hugging Face Token
+
+1. **Visit the token page:** Go to [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
+2. **Generate a new token:**
+   - Click "New token"
+   - Give it a name (e.g., "LoRAX Deployment")
+   - Select "Read" role (sufficient for downloading models)
+   - Click "Generate token"
+3. **Copy the token:** It will look like `hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx`
+4. **Request model access:** For gated models, visit their Hugging Face page and click "Request access" (e.g., [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3))
+
+#### Set the Environment Variable
+
+Add the token to your shell configuration so it's available for Docker:
+
+```bash
+# Append the export to ~/.bashrc (use ~/.zshrc instead if you run zsh)
+echo "export HUGGING_FACE_HUB_TOKEN='hf_YOUR_TOKEN_HERE'" >> ~/.bashrc
+
+# Reload your shell configuration
+source ~/.bashrc  # or source ~/.zshrc if using zsh
+
+# Verify it's set without printing the secret
+[ -n "$HUGGING_FACE_HUB_TOKEN" ] && echo "HUGGING_FACE_HUB_TOKEN is set"
+```
+
+> **Important:** Replace `hf_YOUR_TOKEN_HERE` with your actual token. The Docker container will pick up this environment variable when passed with the `-e` flag.
+
+> **Note:** For public models like `gpt2`, you don't need a token, but having one set up allows you to easily switch to gated models later.
 
-> **Fix:** Revisit NVIDIA Container Toolkit setup or update the host driver.
+</details>
 
 ---
 
@@ -141,6 +272,36 @@ Choose one deployment path:
 - **(A) Pre-built Image** – Fastest option, recommended for most users.
 - **(B) Build from Source** – Only for custom changes or unreleased patches.
 
+---
+
+### Common Failures during Container Launch
+
+<details>
+<summary>Click to expand: Common Failures during Container Launch</summary>
+
+These issues can occur when attempting to run *any* LoRAX Docker container, regardless of whether it's pre-built or from source.
+
+* **`docker: Error response from daemon: Conflict. The container name "/lorax" is already in use...`**: This means a container named `lorax` is already running or exists from a previous session. You need to stop and remove it first.
+    ```bash
+    docker stop lorax # Stop the running container
+    docker rm lorax   # Remove the stopped container (optional, if --rm was not used or failed previously)
+    ```
+    Then, re-run your `docker run` command.
+* **`docker: invalid reference format`, `--gpus: command not found`, etc.**: You likely copied the `docker run` command incorrectly. Ensure there are **no spaces** after the backslash `\` at the end of each line, and copy the entire block at once.
+* **`CUDA out of memory`** → The model you are trying to load is too large for your GPU's VRAM. Refer to the [GPU VRAM vs. Model Size Compatibility](#gpu-vram-vs-model-size-compatibility) table and choose a smaller or more heavily quantized model.
+* **Stalled model download** → Indicates a network issue or Hugging Face rate limit when downloading the model weights inside the container.
+    > **Fix for Stalled Downloads:**
+    > 1.  Visit the model’s Hugging Face page (e.g., `https://huggingface.co/<model_id>/tree/main`).
+    > 2.  Note the commit hash from the URL or “Files and Versions.”
+    > 3.  Create the cache path on your host: `$HOME/lorax_model_cache/<model_id>/snapshots/<commit_hash>/`.
+    > 4.  Download all model files (config, tokenizer, `.safetensors`, etc.) to that directory.
+    > 5.  Re-run the container; it should now use the cached files.
+* **`RuntimeError: weight not found`** or **`TypeError`** → Model or quantization incompatibility with the pre-built image. For broader model compatibility, custom configurations, or support for a wider range of quantized models, please proceed with [Option B: Build from Source](#option-b-build-from-source-🛠️).
+
+</details>
+
+---
+
 ### Option A: Pre-built Image 🎉
 
 #### 1. Pull the LoRAX Image
@@ -159,69 +320,48 @@ docker pull ghcr.io/predibase/lorax:main
 
 #### 2. Choose Your Model 📊
 
-Start with **`gpt2`** for a quick test (it’s small and fast). Larger models require careful **VRAM** planning to avoid `CUDA out of memory` errors.
-
-| **Model** | **Params** | **VRAM (FP16/BF16)** | **Notes** |
-|-----------|------------|-----------------------|-----------|
-| `gpt2` | 0.1B | ~0.5 GB | Perfect for testing; fits any GPU. |
-| `bigcode/starcoder2-3b` | 3B | ~6–7 GB | Works on 8 GB VRAM GPUs. |
-| `mistralai/Mistral-7B-Instruct-v0.3` | 7B | ~14–15 GB | Needs 16–24 GB VRAM. |
-| `meta-llama/Meta-Llama-3-8B-Instruct` | 8B | ~16 GB | Tight on 16 GB; better with 24 GB. |
-| `TheBloke/Mistral-7B-Instruct-v0.3-GPTQ` | 7B (4-bit) | ~8–10 GB | Quantized; fits 12–16 GB VRAM. |
-| `meta-llama/Meta-Llama-3-13B-Instruct` | 13B | ~26 GB | Requires 24–26 GB VRAM. |
-| `meta-llama/Meta-Llama-3-70B-Instruct` | 70B | 135–140 GB | Needs multi-GPU or heavy quantization. |
+**Critical Compatibility Note:** Due to internal versioning and optimization, the `ghcr.io/predibase/lorax:main` pre-built Docker image is currently **only consistently compatible with `mistralai/Mistral-7B-Instruct-v0.1`**. Attempts to load other models (including `gpt2`, `bigcode/starcoder2-3b`, and quantized models) may fail with `TypeError`, `RuntimeError: weight ... does not exist`, or other internal loading errors. For broader model compatibility, custom configurations, or wider quantization support, proceed with **Option B: Build from Source**.
 
-> **VRAM Tips:**
-> - Keep **10–15% VRAM free** for KV cache and overhead.
-> - **6–8 GB GPUs**: Stick to `gpt2` or quantized 7B models.
-> - **12–16 GB GPUs**: Comfortable for 7B; tight for 8B.
-> - **24 GB+ GPUs**: Suitable for 13B or multi-instance setups.
+For `mistralai/Mistral-7B-Instruct-v0.1`, a GPU with **16-24 GB VRAM is recommended** to ensure smooth operation and sufficient KV cache.
 
 ---
 
 #### 3. Run the LoRAX Container
 
 ```bash
-MODEL_ID="gpt2"
-SHARDED_MODEL="false"
-PORT=80
-
-docker run --rm 
-
---name lorax 
-
---gpus all 
-
--e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" 
-
--e TRANSFORMERS_CACHE=/data 
-
--v "$HOME/lorax_model_cache":/data 
-
--v "$HOME/lorax_outlines_cache":/root/.cache/outlines 
-
---user "$(id -u):$(id -g)" 
-
--p ${PORT}:80 
-
-ghcr.io/predibase/lorax:main 
-
---model-id "$MODEL_ID" 
-
---sharded "$SHARDED_MODEL"
+# Define your variables (MODEL_ID is set to the only supported model)
+MODEL_ID="mistralai/Mistral-7B-Instruct-v0.1"
+SHARDED_MODEL="false" # Set to 'true' for sharded (multi-GPU) models like 70B
+PORT=80 # Host port to access the LoRAX server
+
+docker run --rm \
+  --name lorax \
+  --gpus all \
+  -e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
+  -e TRANSFORMERS_CACHE=/data \
+  -v "$HOME/lorax_model_cache":/data \
+  -v "$HOME/lorax_outlines_cache":/root/.cache/outlines \
+  --user "$(id -u):$(id -g)" \
+  -p ${PORT}:80 \
+  ghcr.io/predibase/lorax:main \
+  --model-id "$MODEL_ID" \
+  --sharded "$SHARDED_MODEL"
 ```
 
+<details>
+<summary>Click to expand: Explanation of Docker Run Flags</summary>
 
 **What This Does:**
-- Starts the **LoRAX container** with GPU access.
-- Mounts model cache to persist downloads.
+- Starts the **LoRAX container** named `lorax` with GPU access.
+- Mounts model cache to persist downloads between container restarts.
 - Maps port **80** (container) to your chosen **host port**.
-- Loads the specified **model** (start with `gpt2`).
+- Loads the specified **model** (for the pre-built image, `mistralai/Mistral-7B-Instruct-v0.1`).
+- Uses your Hugging Face token for authenticated model downloads.
+
+</details>
 
 **Success:** Logs show model download/cache hit and “Model loaded”; health endpoint responds.  
-**Common Failures:**
-- Stalled download → Network or Hugging Face rate limits.
-- `CUDA out of memory` → Model too large for GPU VRAM.
+**Common Failures:** Refer to [Common Failures during Container Launch](#common-failures-during-container-launch)
 
 > **Fix for Stalled Downloads:**
 > 1. Visit the model’s Hugging Face page (e.g., `https://huggingface.co/<model_id>/tree/main`).
@@ -234,23 +374,40 @@ ghcr.io/predibase/lorax:main
 
 ### Option B: Build from Source 🛠️
 
-Use this if you need custom changes or unreleased patches.
+Use this if you need custom changes or unreleased patches, or if you want to run models other than `mistralai/Mistral-7B-Instruct-v0.1`.
 
-#### 1. Clone the Repository
+#### GPU VRAM vs. Model Size Compatibility
 
-```bash
-git clone https://github.com/predibase/lorax.git
-cd lorax
-```
+When building from source, you gain the flexibility to choose a wider range of models. Use the following table as a guide for VRAM compatibility:
 
-#### 2. Initialize Submodules (if needed)
+| **Model** | **Params** | **VRAM (FP16/BF16)** | **Notes** |
+|-----------|------------|-----------------------|-----------|
+| `gpt2` | 0.1B | ~0.5 GB | Perfect for testing; fits any GPU. |
+| `bigcode/starcoder2-3b` | 3B | ~6–7 GB | Works on 8 GB VRAM GPUs. |
+| `mistralai/Mistral-7B-Instruct-v0.1` | 7B | ~14–15 GB | Needs 16–24 GB VRAM. |
+| `meta-llama/Meta-Llama-3-8B-Instruct` | 8B | ~16 GB | Tight on 16 GB; better with 24 GB. |
+| `meta-llama/Llama-2-13b-chat-hf` | 13B | ~26 GB | Does not fit a 24 GB GPU in FP16; needs more VRAM or quantization. |
+| `meta-llama/Meta-Llama-3-70B-Instruct` | 70B | 135–140 GB | Needs multi-GPU or heavy quantization. |
+
+> **VRAM Tips:**
+> - Keep **10–15% VRAM free** for KV cache and overhead.
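+> - **6–8 GB GPUs**: Stick to `gpt2` or, on 8 GB, `bigcode/starcoder2-3b`.
+> - **12–16 GB GPUs**: 16 GB fits 7B in FP16 but is tight for 8B; 12 GB needs a quantized 7B.
+> - **24 GB+ GPUs**: Comfortable for 7B–8B or multi-instance setups of smaller models; 13B in FP16 (~26 GB) needs more.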
+
+#### 1. Clone the LoRAX Repository (Including all necessary Submodules)
+
+**Problem:** To build LoRAX from source, you need not only the main repository but also its nested external dependencies, which are managed as Git submodules (e.g., `flashinfer` for custom CUDA kernels). Skipping this can lead to "No such file or directory" errors during the build.
+
+**Action:** First, clone the main repository, then immediately initialize and update all its submodules.
 
 ```bash
+git clone https://github.com/predibase/lorax.git
+cd lorax
 git submodule update --init --recursive
 ```
 
-
-#### 3. Build the Image
+#### 2. Build the Image
 
 ```bash
 docker build -t my-lorax-server -f Dockerfile .
@@ -275,47 +432,52 @@ Use the same `docker run` command as in Option A, replacing `ghcr.io/predibase/l
 
 Once logs show the server is ready, test the **LoRAX API**.
 
-```bash
-curl http://localhost:80/
-```
-
-
 **Example Inference:**
 
 ```bash
-curl -X POST http://localhost:80/generate 
-
--H 'Content-Type: application/json' 
+curl 127.0.0.1:80/generate \
+    -X POST \
+    -d '{
+        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
+        "parameters": {
+            "max_new_tokens": 64
+        }
+    }' \
+    -H 'Content-Type: application/json'
+```  
+
+If you're using a base model that supports LoRA adapters (like Mistral-7B) and have an adapter ID, you can test prompting a specific fine-tuned adapter.
 
--d '{"prompt":"Hello","max_tokens":32}'
+```bash
+curl 127.0.0.1:80/generate \
+    -X POST \
+    -d '{
+        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
+        "parameters": {
+            "max_new_tokens": 64,
+            "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"
+        }
+    }' \
+    -H 'Content-Type: application/json'
 ```
 
+> **Note:** Replace `vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k` with an `adapter_id` that is compatible with
+> your chosen base model.
 
-**Success:** Returns JSON with generated text.  
-**Common Failures:**
-- Connection refused → Container not running or wrong port (`docker ps`).
-- 404 → Wrong endpoint; check root docs.
-- 500 → Model not loaded or OOM (`docker logs lorax`).
-
-> **Fix:** Check logs with `docker logs lorax` and verify port mapping.
-
----
-
-## Phase 4: Performance & Scaling Tips
-
-- **Concurrency:** Increase only after single-request stability (KV cache can cause OOM).
-- **Tuning Options:** Adjust `--max-concurrent-requests`, batching, or tensor parallelization (if supported).
-- **Monitor GPUs:**
+**Success:** The response is JSON containing the generated text.
+<details>
+<summary>Click to expand: Common Failures during API Test</summary>
 
-```bash
-watch -n1 nvidia-smi
-```
+**Common Failures:** Refer to [Common Failures during Container Launch](#common-failures-during-container-launch) and the **[API]** entries in the [Troubleshooting Guide](#troubleshooting-guide).
 
+</details>
 
----
 
 ## Troubleshooting Guide
 
+<details>
+<summary>Click to expand: Comprehensive Troubleshooting Guide</summary>
+
 **Format:** [Stage] Symptom → Cause → Fix
 
 - **[Host]** `nvidia-smi` fails → Driver issue → Check `dmesg | grep -i nvidia | tail -n5`; reinstall driver or fix Secure Boot.
@@ -323,30 +485,40 @@ watch -n1 nvidia-smi
 - **[Docker]** Cache permission denied → Root-owned files → Run `sudo chown -R $(id -u):$(id -g) $HOME/lorax_model_cache`.
 - **[Model Load]** CUDA OOM → Model too large → Check `nvidia-smi`; use smaller/quantized model.
 - **[Model Load]** Download stalls → Network issue → Use manual download workaround.
-- **[Model Load]** `RuntimeError: weight not found` → Quantized model incompatibility → Try FP16 or a different quantized model.
+- **[Model Load]** `RuntimeError: weight not found` or `TypeError` → Model or quantization incompatibility with the pre-built image → Use Option B: Build from Source for broader model, configuration, and quantization support.
 - **[API]** 404 on generate → Wrong route → Check `curl http://localhost:80/`; adjust client.
 - **[API]** 500 error → OOM or bad params → Check `docker logs --tail 100 lorax | grep -i error`; reduce `max_tokens`.
 - **[Performance]** Slow first call → Warmup overhead → Send a short warmup prompt.
 - **[Performance]** Low GPU usage (<30%) → Small batches → Enable batching or increase concurrency.
 - **[Stability]** Exit code 137 → Host OOM → Check `dmesg | tail`; reduce model size.
 
+</details>
+
 ---
 
 ## 🧠 Decision Matrix
 
+<details>
+<summary>Click to expand: Quick Decision Matrix</summary>
+
 | **Situation** | **Action** |
 |---------------|------------|
 | `nvidia-smi` broken | Fix driver first. |
 | Container `nvidia-smi` fails | Fix NVIDIA runtime config. |
-| `gpt2` fails to load | Check environment/image. |
-| `gpt2` works, larger model fails | Address VRAM/quantization issues. |
+| `gpt2` fails to load | Check environment/image. If you need broader model compatibility, proceed with Option B: Build from Source. |
+| `gpt2` works, larger model fails | Address VRAM/quantization issues or use Option B for more models. |
 | API fails | Check routes, params, or logs. |
 | API slow | Optimize concurrency or use smaller model. |
 
+</details>
+
 ---
 
 ## 🧹 Cleanup & Reset
 
+<details>
+<summary>Click to expand: Cleanup & Reset Your Environment</summary>
+
 ```bash
 docker stop lorax
 docker system prune -f
@@ -354,16 +526,22 @@ rm -rf $HOME/lorax_model_cache/*
 sudo chown -R $(id -u):$(id -g) $HOME/lorax_model_cache
 ```
 
+</details>
 
 ---
 
 ## 📜 Quick Command Recap
 
 ```bash
+# Check GPU access
 nvidia-smi
 docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
+
+# Pull and run LoRAX
 docker pull ghcr.io/predibase/lorax:main
-MODEL_ID="gpt2"; docker run --rm --gpus all -v "$HOME/lorax_model_cache":/data -p 80:80 ghcr.io/predibase/lorax:main --model-id "$MODEL_ID" --sharded false
+MODEL_ID="mistralai/Mistral-7B-Instruct-v0.1"; docker run --rm --name lorax --gpus all -e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" -e TRANSFORMERS_CACHE=/data -v "$HOME/lorax_model_cache":/data -v "$HOME/lorax_outlines_cache":/root/.cache/outlines --user "$(id -u):$(id -g)" -p 80:80 ghcr.io/predibase/lorax:main --model-id "$MODEL_ID" --sharded false
+
+# Test the API
 curl http://localhost:80/
 ```
 
@@ -372,11 +550,16 @@ curl http://localhost:80/
 
 ## 🌟 Next Steps
 
+<details>
+<summary>Click to expand: Beyond Basic Deployment (Next Steps)</summary>
+
 - **Monitoring:** Add logging/metrics with Prometheus or parse stdout.
 - **Security:** Set up a reverse proxy (nginx/traefik) with TLS for public access.
 - **Automation:** Create health/warmup scripts (e.g., systemd or Docker Compose).
 - **Reliability:** Add watchdog with `Restart=on-failure` (systemd or Docker policies).
 
+</details>
+
 ---
 
 **Happy Deploying!** 🎉

From 97d3e995e516c391282516397dcf8a89926a38a8 Mon Sep 17 00:00:00 2001
From: Khoa Ngo <ngominhkhoa2006@gmail.com>
Date: Mon, 21 Jul 2025 11:36:07 +0700
Subject: [PATCH 03/12] speed up dockerfile through using all cpu cores

---
 Dockerfile                   | 23 +++++++++++++----------
 lorax_deployment_playbook.md |  2 +-
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index 0988daf58..ed616935e 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -86,42 +86,45 @@ RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-ins
     ninja-build cmake \
     && rm -rf /var/lib/apt/lists/*
 
+# Add this for robust parallel builds
+ENV MAX_JOBS=$(nproc)
+
 # Build Flash Attention CUDA kernels
 FROM kernel-builder as flash-att-builder
 WORKDIR /usr/src
 COPY server/Makefile-flash-att Makefile
-RUN make build-flash-attention
+RUN make build-flash-attention -j$(nproc)
 
 # Build Flash Attention v2 CUDA kernels
 FROM kernel-builder as flash-att-v2-builder
 WORKDIR /usr/src
 COPY server/Makefile-flash-att-v2 Makefile
-RUN make build-flash-attention-v2-cuda
+RUN make build-flash-attention-v2-cuda -j$(nproc)
 
 # Build Transformers exllama kernels
 FROM kernel-builder as exllama-kernels-builder
 WORKDIR /usr/src
 COPY server/exllama_kernels/ .
-RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" python setup.py build
+RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" MAX_JOBS=$(nproc) python setup.py build
 
 # Build Transformers exllama kernels
 FROM kernel-builder as exllamav2-kernels-builder
 WORKDIR /usr/src
 COPY server/exllamav2_kernels/ .
-RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" python setup.py build
+RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" MAX_JOBS=$(nproc) python setup.py build
 
 # Build Transformers awq kernels
 FROM kernel-builder as awq-kernels-builder
 WORKDIR /usr/src
 COPY server/Makefile-awq Makefile
-RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" make build-awq
+RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" make build-awq -j$(nproc)
 
 # Build Transformers CUDA kernels
 FROM kernel-builder as custom-kernels-builder
 WORKDIR /usr/src
 COPY server/custom_kernels/ .
 # Build specific version of transformers
-RUN python setup.py build
+RUN MAX_JOBS=$(nproc) python setup.py build
 
 # Build vllm CUDA kernels
 FROM kernel-builder as vllm-builder
@@ -136,14 +139,14 @@ RUN ln -s "$(pwd)/cmake-3.30.0-linux-x86_64/bin/cmake" /usr/local/bin/cmake
 ENV TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX"
 COPY server/Makefile-vllm Makefile
 # Build specific version of vllm
-RUN make build-vllm-cuda
+RUN make build-vllm-cuda -j$(nproc)
 
 # Build megablocks kernels
 FROM kernel-builder as megablocks-kernels-builder
 WORKDIR /usr/src
 COPY server/Makefile-megablocks Makefile
 ENV TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
-RUN make build-megablocks
+RUN make build-megablocks -j$(nproc)
 
 # Build punica CUDA kernels
 FROM kernel-builder as punica-builder
@@ -151,14 +154,14 @@ WORKDIR /usr/src
 COPY server/punica_kernels/ .
 # Build specific version of punica
 ENV TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
-RUN python setup.py build
+RUN MAX_JOBS=$(nproc) python setup.py build
 
 # Build eetq kernels
 FROM kernel-builder as eetq-kernels-builder
 WORKDIR /usr/src
 COPY server/Makefile-eetq Makefile
 # Build specific version of transformers
-RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" make build-eetq
+RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" make build-eetq -j$(nproc)
 
 # LoRAX base image
 FROM nvidia/cuda:12.4.0-base-ubuntu22.04 as base
diff --git a/lorax_deployment_playbook.md b/lorax_deployment_playbook.md
index 7e1055daf..9ee9f1d33 100644
--- a/lorax_deployment_playbook.md
+++ b/lorax_deployment_playbook.md
@@ -402,7 +402,7 @@ When building from source, you gain the flexibility to choose a wider range of m
 **Action:** First, clone the main repository, then immediately initialize and update all its submodules.
 
 ```bash
-git clone https://github.com/predibase/lorax.git
+git clone https://github.com/minhkhoango/lorax.git
 cd lorax
 git submodule update --init --recursive
 ```

From 38ed39c1c179f046c14dbcf4085dd97f0ca7b125 Mon Sep 17 00:00:00 2001
From: Khoa Ngo <ngominhkhoa2006@gmail.com>
Date: Mon, 21 Jul 2025 12:02:28 +0700
Subject: [PATCH 04/12] speed up dockerfile through using all cpu cores

---
 Dockerfile | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index ed616935e..c0a2fd8d1 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -86,45 +86,45 @@ RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-ins
     ninja-build cmake \
     && rm -rf /var/lib/apt/lists/*
 
-# Add this for robust parallel builds
-ENV MAX_JOBS=$(nproc)
+# Add this for robust parallel builds, adjusted for memory constraints
+ENV MAX_JOBS=16
 
 # Build Flash Attention CUDA kernels
 FROM kernel-builder as flash-att-builder
 WORKDIR /usr/src
 COPY server/Makefile-flash-att Makefile
-RUN make build-flash-attention -j$(nproc)
+RUN make build-flash-attention -j${MAX_JOBS}
 
 # Build Flash Attention v2 CUDA kernels
 FROM kernel-builder as flash-att-v2-builder
 WORKDIR /usr/src
 COPY server/Makefile-flash-att-v2 Makefile
-RUN make build-flash-attention-v2-cuda -j$(nproc)
+RUN make build-flash-attention-v2-cuda -j${MAX_JOBS}
 
 # Build Transformers exllama kernels
 FROM kernel-builder as exllama-kernels-builder
 WORKDIR /usr/src
 COPY server/exllama_kernels/ .
-RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" MAX_JOBS=$(nproc) python setup.py build
+RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" python setup.py build
 
 # Build Transformers exllama kernels
 FROM kernel-builder as exllamav2-kernels-builder
 WORKDIR /usr/src
 COPY server/exllamav2_kernels/ .
-RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" MAX_JOBS=$(nproc) python setup.py build
+RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" python setup.py build
 
 # Build Transformers awq kernels
 FROM kernel-builder as awq-kernels-builder
 WORKDIR /usr/src
 COPY server/Makefile-awq Makefile
-RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" make build-awq -j$(nproc)
+RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" make build-awq -j${MAX_JOBS}
 
 # Build Transformers CUDA kernels
 FROM kernel-builder as custom-kernels-builder
 WORKDIR /usr/src
 COPY server/custom_kernels/ .
 # Build specific version of transformers
-RUN MAX_JOBS=$(nproc) python setup.py build
+RUN python setup.py build
 
 # Build vllm CUDA kernels
 FROM kernel-builder as vllm-builder
@@ -139,14 +139,14 @@ RUN ln -s "$(pwd)/cmake-3.30.0-linux-x86_64/bin/cmake" /usr/local/bin/cmake
 ENV TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX"
 COPY server/Makefile-vllm Makefile
 # Build specific version of vllm
-RUN make build-vllm-cuda -j$(nproc)
+RUN make build-vllm-cuda -j${MAX_JOBS}
 
 # Build megablocks kernels
 FROM kernel-builder as megablocks-kernels-builder
 WORKDIR /usr/src
 COPY server/Makefile-megablocks Makefile
 ENV TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
-RUN make build-megablocks -j$(nproc)
+RUN make build-megablocks -j${MAX_JOBS}
 
 # Build punica CUDA kernels
 FROM kernel-builder as punica-builder
@@ -154,14 +154,14 @@ WORKDIR /usr/src
 COPY server/punica_kernels/ .
 # Build specific version of punica
 ENV TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
-RUN MAX_JOBS=$(nproc) python setup.py build
+RUN python setup.py build
 
 # Build eetq kernels
 FROM kernel-builder as eetq-kernels-builder
 WORKDIR /usr/src
 COPY server/Makefile-eetq Makefile
 # Build specific version of transformers
-RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" make build-eetq -j$(nproc)
+RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" make build-eetq -j${MAX_JOBS}
 
 # LoRAX base image
 FROM nvidia/cuda:12.4.0-base-ubuntu22.04 as base

From b194ee0277a53b927d30fc8cd1e19bf1bbd22db5 Mon Sep 17 00:00:00 2001
From: Khoa Ngo <ngominhkhoa2006@gmail.com>
Date: Mon, 21 Jul 2025 12:43:22 +0700
Subject: [PATCH 05/12] speed up dockerfile through using all cpu cores

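A rough way to pick a `MAX_JOBS` value before editing the `Dockerfile` is sketched below. It is illustrative only: `suggest_max_jobs` is a hypothetical helper, and the 4 GB-per-job budget is an assumption chosen to be roughly consistent with the "96GB+ allows 16, 24, or 32 jobs" guidance in this patch, not a measured figure. Monitoring RAM during the build is still the real check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: suggest a MAX_JOBS value from CPU cores and RAM.
# Arguments: <cores> <ram_gb> [gb_per_job]
# gb_per_job defaults to 4, an assumed per-compile-job memory budget.
suggest_max_jobs() {
  local cores="$1" mem_gb="$2" gb_per_job="${3:-4}"
  local by_mem=$(( mem_gb / gb_per_job ))
  # Take the smaller of the CPU-bound and memory-bound limits.
  local jobs=$(( cores < by_mem ? cores : by_mem ))
  # Never suggest fewer than one job.
  if (( jobs < 1 )); then
    jobs=1
  fi
  echo "$jobs"
}

# Usage on a Linux host (not executed here):
#   mem_gb="$(awk '/MemTotal/ {print int($2 / 1024 / 1024)}' /proc/meminfo)"
#   suggest_max_jobs "$(nproc)" "$mem_gb"
```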
---
 Dockerfile                   | 11 +++++++++--
 lorax_deployment_playbook.md | 15 ++++++++++++++-
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index c0a2fd8d1..fba6207e6 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -86,8 +86,15 @@ RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-ins
     ninja-build cmake \
     && rm -rf /var/lib/apt/lists/*
 
-# Add this for robust parallel builds, adjusted for memory constraints
-ENV MAX_JOBS=16
+# This environment variable controls the number of parallel compilation jobs.
+# It is set to a conservative value (2) by default for stability on machines
+# with limited RAM.
+#
+# If you have more RAM (e.g., 96GB+), you can increase this value
+# (e.g., to 16, 24, or 32) to significantly speed up the build.
+# Always monitor RAM usage (htop) to avoid Out-Of-Memory (OOM) crashes.
+ENV MAX_JOBS=2
+# If you encounter OOM errors even with this value, try reducing it to 1.
 
 # Build Flash Attention CUDA kernels
 FROM kernel-builder as flash-att-builder
diff --git a/lorax_deployment_playbook.md b/lorax_deployment_playbook.md
index 9ee9f1d33..b012ae847 100644
--- a/lorax_deployment_playbook.md
+++ b/lorax_deployment_playbook.md
@@ -402,7 +402,7 @@ When building from source, you gain the flexibility to choose a wider range of m
 **Action:** First, clone the main repository, then immediately initialize and update all its submodules.
 
 ```bash
-git clone https://github.com/minhkhoango/lorax.git
+git clone -b feat/deployment-playbook-enhancements https://github.com/minhkhoango/lorax.git
 cd lorax
 git submodule update --init --recursive
 ```
@@ -418,6 +418,19 @@ docker build -t my-lorax-server -f Dockerfile .
 - Build stalls → Add `--network=host` to the build command.
 - Version conflicts → Adjust base image or dependencies.
 
+> **Important Note on Build Parallelism (`MAX_JOBS`) & Memory:**
+> Building custom CUDA kernels from source is a memory-intensive process. The `Dockerfile` is configured with `ENV MAX_JOBS=2` as a **very conservative default** for parallel compilation. This value aims to provide the highest stability and prevent Out-Of-Memory (OOM) crashes on a wide range of hardware, including instances with limited RAM relative to CPU cores.
+>
+> * **To Optimize for Faster Builds (Recommended):**
+>     If you have significantly more RAM (e.g., 96GB or more) and want to speed up compilation, you can safely **increase `MAX_JOBS`**.
+>     1.  **Open the `Dockerfile`** in your cloned `lorax` directory using your preferred text editor (e.g., `nano Dockerfile` or `code Dockerfile`).
+>     2.  **Find the line:** `ENV MAX_JOBS=2` (it will be surrounded by comments explaining its purpose)
+>     3.  **Change the value** to a higher number (e.g., `16`, `24`, or `32`). *Always monitor your RAM usage (`htop`) during the build to avoid crashes.*
+>     4.  **Save the `Dockerfile`** and restart your build command (`docker build -t my-lorax-server -f Dockerfile .`).
+>
+> * **If your build still crashes with an OOM error:**
+>     This indicates you have very limited RAM or other processes are consuming it. You **must reduce `MAX_JOBS` further**. Edit the `Dockerfile` as described above and change the value to `1`. Then, restart the build.
+
 #### 4. Run the Container
 
 Use the same `docker run` command as in Option A, replacing `ghcr.io/predibase/lorax:main` with `my-lorax-server`.

From 33f8cefe4c95f1835bc200daefa5a4d97726264d Mon Sep 17 00:00:00 2001
From: Khoa Ngo <ngominhkhoa2006@gmail.com>
Date: Mon, 21 Jul 2025 16:22:08 +0700
Subject: [PATCH 06/12] update make-vllm, speed up dockerfile, and edit
 playbook

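The playbook text added in this patch suggests retrying `docker pull` when a provider throttles image pulls. A minimal sketch of such a retry loop with exponential backoff follows. `retry` is a hypothetical helper, and the attempt count and initial delay shown in the usage comment are arbitrary assumptions, not values taken from Vast.ai or Docker documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to <attempts> times,
# doubling the wait between attempts (exponential backoff).
# Arguments: <attempts> <initial_delay_seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  until "$@"; do
    if (( n >= attempts )); then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    n=$(( n + 1 ))
  done
}

# Usage (not executed here):
#   retry 5 10 docker pull ghcr.io/predibase/lorax:main
```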
---
 Dockerfile                   | 41 ++++++++++------------
 lorax_deployment_playbook.md | 66 +++++++++++++++++++++++++++++++++++-
 server/Makefile-vllm         |  2 +-
 3 files changed, 83 insertions(+), 26 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index fba6207e6..2d17f01dc 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -48,14 +48,19 @@ ARG INSTALL_CHANNEL=pytorch
 ARG TARGETPLATFORM
 
 ENV PATH /opt/conda/bin:$PATH
+# For build-time CUDA memory resilience
+ENV PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
 
 RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
     build-essential \
     ca-certificates \
     ccache \
     curl \
-    git && \
-    rm -rf /var/lib/apt/lists/*
+    git \
+    ninja-build \
+    cmake \
+    wget \
+    && rm -rf /var/lib/apt/lists/*
 
 # Install conda
 # translating Docker's TARGETPLATFORM into mamba arches
@@ -80,19 +85,14 @@ RUN case ${TARGETPLATFORM} in \
 # CUDA kernels builder image
 FROM pytorch-install as kernel-builder
 
-ARG MAX_JOBS=2
-
-RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
-    ninja-build cmake \
-    && rm -rf /var/lib/apt/lists/*
-
-# This environment variable controls the number of parallel compilation jobs.
+# This environment variable controls the number of parallel compilation jobs for CUDA kernels.
 # It is set to a conservative value (2) by default for stability on machines
-# with limited RAM.
+# with limited RAM relative to CPU cores, preventing Out-Of-Memory (OOM) crashes during build.
 #
-# If you have more RAM (e.g., 96GB+), you can increase this value
-# (e.g., to 16, 24, or 32) to significantly speed up the build.
-# Always monitor RAM usage (htop) to avoid Out-Of-Memory (OOM) crashes.
+# You can adjust this value to optimize build speed based on your system's RAM:
+# - If you have more RAM (e.g., 96GB+), you can increase this value (e.g., to 16, 24, or 32)
+#   to significantly speed up the build. Always monitor RAM usage (htop) to avoid OOM crashes.
+# - If you encounter OOM errors even with this value, try reducing it further to 1.
 ENV MAX_JOBS=2
 # If you encounter OOM errors even with this value, try reducing it to 1.
 
@@ -112,13 +112,13 @@ RUN make build-flash-attention-v2-cuda -j$(MAX_JOBS)
 FROM kernel-builder as exllama-kernels-builder
 WORKDIR /usr/src
 COPY server/exllama_kernels/ .
-RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" python setup.py build
+RUN MAX_JOBS=${MAX_JOBS} TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" python setup.py build
 
 # Build Transformers exllama kernels
 FROM kernel-builder as exllamav2-kernels-builder
 WORKDIR /usr/src
 COPY server/exllamav2_kernels/ .
-RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" python setup.py build
+RUN MAX_JOBS=${MAX_JOBS} TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" python setup.py build
 
 # Build Transformers awq kernels
 FROM kernel-builder as awq-kernels-builder
@@ -131,18 +131,11 @@ FROM kernel-builder as custom-kernels-builder
 WORKDIR /usr/src
 COPY server/custom_kernels/ .
 # Build specific version of transformers
-RUN python setup.py build
+RUN MAX_JOBS=${MAX_JOBS} python setup.py build
 
 # Build vllm CUDA kernels
 FROM kernel-builder as vllm-builder
 WORKDIR /usr/src
-RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
-    wget \
-    && rm -rf /var/lib/apt/lists/*
-RUN DEBIAN_FRONTEND=noninteractive apt purge -y --auto-remove cmake
-RUN wget 'https://github.com/Kitware/CMake/releases/download/v3.30.0/cmake-3.30.0-linux-x86_64.tar.gz'
-RUN tar xzvf 'cmake-3.30.0-linux-x86_64.tar.gz'
-RUN ln -s "$(pwd)/cmake-3.30.0-linux-x86_64/bin/cmake" /usr/local/bin/cmake
 ENV TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX"
 COPY server/Makefile-vllm Makefile
 # Build specific version of vllm
@@ -161,7 +154,7 @@ WORKDIR /usr/src
 COPY server/punica_kernels/ .
 # Build specific version of punica
 ENV TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
-RUN python setup.py build
+RUN MAX_JOBS=${MAX_JOBS} python setup.py build
 
 # Build eetq kernels
 FROM kernel-builder as eetq-kernels-builder
diff --git a/lorax_deployment_playbook.md b/lorax_deployment_playbook.md
index b012ae847..f30ddb21e 100644
--- a/lorax_deployment_playbook.md
+++ b/lorax_deployment_playbook.md
@@ -179,6 +179,45 @@ docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi \
 
 </details>
 
+#### Advanced NVIDIA Container Toolkit Configuration & Troubleshooting
+Some cloud environments or specific Docker/NVIDIA driver versions may require additional configuration for GPU access inside containers.
+
+* **Cgroup V2 Compatibility (`"Failed to initialize NVML: Unknown Error"`)**:
+    * **Problem:** On systems using `cgroup v2` (common in newer Linux distributions), rootless Docker or certain container runtime setups can conflict with NVIDIA's cgroup requirements, leading to errors like `"Failed to initialize NVML: Unknown Error"` or `"no such file or directory"` when trying to read cgroup events.
+    * **Solution:** You might need to adjust kernel parameters or `nvidia-container-runtime` settings.
+        * **Kernel Parameter:** Add `systemd.unified_cgroup_hierarchy=false` to your kernel boot parameters (e.g., in GRUB, then reboot).
+        * **Runtime Config:** Ensure `no-cgroups = false` is set in `/etc/nvidia-container-runtime/config.toml`.
+* **Device Symlink Issues (`/dev/char` missing)**:
+    * **Problem:** In some cases, necessary device symlinks for NVIDIA GPUs (`/dev/char/...`) might be missing, preventing containers from accessing the GPU.
+    * **Solution:** Use `nvidia-ctk system create-dev-char-symlinks --create-all` to create them, and ensure accompanying udev rules are in place.
+* **Essential Docker Daemon Configuration (`daemon.json`)**:
+    * **Problem:** Incorrect Docker daemon configuration can prevent GPUs from being exposed to containers.
+    * **Solution:** Verify your `/etc/docker/daemon.json` includes the following, then restart Docker (`sudo systemctl restart docker`):
+        ```json
+        {
+          "runtimes": {
+            "nvidia": {
+              "path": "/usr/bin/nvidia-container-runtime",
+              "runtimeArgs": []
+            }
+          },
+          "default-runtime": "nvidia",
+          "node-generic-resources": ["gpu=GPU-{uuid}"]
+        }
+        ```
+
+#### Cloud Platform-Specific Pitfalls
+Deployment on various cloud GPU providers can introduce unique challenges.
+
+* **Vast.ai Specific Problems:**
+    * **Docker Image Pull Throttling:** Frequent `"image pull is throttled"` errors can occur due to rate limits.
+        * **Solution:** Implement retry strategies for `docker pull`, or consider using pre-configured Vast.ai instances or custom images with required dependencies pre-installed.
+    * **Environment Variable Issues (`$DISPLAY`)**: Some GUI-dependent tools or environment checks might fail if `$DISPLAY` is not exported. This is less common for LoRAX but can impact debugging.
+* **RunPod Deployment Challenges:**
+    * **Network Configuration:** Ensure proper port mapping and understanding of shared volumes for model caching.
+    * **Secret Management:** HuggingFace tokens must be properly configured as secrets or environment variables for secure access.
+    * **Resource Allocation:** Always double-check adequate GPU memory allocation for the specific model you intend to load.
+
 ---
 
 ### 4. Check User in Docker Group 👤
@@ -413,7 +452,6 @@ git submodule update --init --recursive
 docker build -t my-lorax-server -f Dockerfile .
 ```
 
-
 **Common Failures:**
 - Build stalls → Add `--network=host` to the build command.
 - Version conflicts → Adjust base image or dependencies.
@@ -421,6 +459,9 @@ docker build -t my-lorax-server -f Dockerfile .
 > **Important Note on Build Parallelism (`MAX_JOBS`) & Memory:**
 > Building custom CUDA kernels from source is a memory-intensive process. The `Dockerfile` is configured with `ENV MAX_JOBS=2` as a **very conservative default** for parallel compilation. This value aims to provide the highest stability and prevent Out-Of-Memory (OOM) crashes on a wide range of hardware, including instances with limited RAM relative to CPU cores.
 >
+> **Advanced Build-Time Memory Management:**
+> For systems with very limited RAM or during memory-intensive CUDA kernel compilations, setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` can help PyTorch manage memory more flexibly during the build process, potentially reducing Out-Of-Memory (OOM) crashes. This setting is now automatically applied via an environment variable in the `Dockerfile` for relevant build stages.
+>
 > * **To Optimize for Faster Builds (Recommended):**
 >     If you have significantly more RAM (e.g., 96GB or more) and want to speed up compilation, you can safely **increase `MAX_JOBS`**.
 >     1.  **Open the `Dockerfile`** in your cloned `lorax` directory using your preferred text editor (e.g., `nano Dockerfile` or `code Dockerfile`).
@@ -439,6 +480,21 @@ Use the same `docker run` command as in Option A, replacing `ghcr.io/predibase/l
 - “Exec format error” → Image built for wrong architecture.
 - Immediate exit → Library mismatch; rebuild with compatible CUDA base.
 
+### Model Compatibility Beyond Mistral-7B (Build from Source)
+
+If you attempt to load a model other than `mistralai/Mistral-7B-Instruct-v0.1` and encounter errors such as `TypeError: TensorParallelColumnLinear.load_multi()` or `RuntimeError: weight ... does not exist`, these errors typically indicate version incompatibilities between PEFT, Transformers, and TGI components. The root issue is that the `fan_in_fan_out` parameter conflicts with TGI's tensor parallel implementations, and `TensorParallelColumnLinear` expects certain `base_layer` attributes that may not be present in all model variants or library versions.
+
+To attempt compatibility with a different model (e.g., `gpt2`):
+
+1. The `vLLM` inference engine version is crucial. In LoRAX, `vLLM` is pinned to a specific Git commit for stability. To change it, you need to **edit `server/Makefile-vllm`**.
+2. Rebuild the Docker image after making any changes.
+3. If you encounter errors related to missing weights or quantization, check the model's compatibility with the current `transformers` and `vLLM` versions.
+4. Change the commit hash (e.g., `766435e660a786933392eb8ef0a873bc38cf0c8b`) to **`9985d06add07a4cc691dc54a7e34f54205c04d40`** (a `vLLM 0.7.3+` version known for broader compatibility, including `gpt2`).
+
+* **Potential `transformers` version adjustments:** If changing the `vLLM` commit doesn't resolve the issue, you *might* also need to modify the `transformers` version in `server/requirements.txt`. Research suggests `Transformers 4.49.0+` provides stable `gpt2` support with `vLLM 0.7.3+`. **Avoid `Transformers 4.48.x` with `vLLM 0.7.2` due to known Qwen model compatibility issues.**
+
+* **Using `--model-impl transformers`:** For certain models, particularly `gpt2`, if issues persist, you may need to add the `--model-impl transformers` flag to your `lorax-launcher` command to explicitly force the Transformers backend for inference.
+
 ---
 
 ## Phase 3: Test the API
@@ -575,5 +631,13 @@ curl http://localhost:80/
 
 ---
 
+## Lessons Learned
+
+- **Kernel Cohesion:** Kernel upgrades must be cohesive. Partial migrations (e.g., AWQ v0.0.4 → v0.0.6) exposed ABI mismatches, so coordinated version bumps are now mandated. This is why pinned versions such as the `vLLM` commit are so critical.
+- **Memory Virtualization:** Early investment in memory virtualization (like the v0.9 memory pool overhaul) is crucial for reducing runtime OOM errors.
+- **Schema-Driven API Design:** Tightened Pydantic models prevented API inconsistencies and improved reliability.
+
+---
+
 **Happy Deploying!** 🎉
 
diff --git a/server/Makefile-vllm b/server/Makefile-vllm
index 4c92391b3..7fc5adeb9 100644
--- a/server/Makefile-vllm
+++ b/server/Makefile-vllm
@@ -4,7 +4,7 @@ vllm-cuda:
 	git clone https://github.com/vllm-project/vllm.git vllm
 
 build-vllm-cuda: vllm-cuda
-	cd vllm && git fetch && git checkout 766435e660a786933392eb8ef0a873bc38cf0c8b
+	cd vllm && git fetch && git checkout 9985d06add07a4cc691dc54a7e34f54205c04d40
 	cd vllm && python setup.py build
 
 install-vllm-cuda: build-vllm-cuda

From 21a7e25d4f0d9a7f43d9524bb69e3719ce0d5220 Mon Sep 17 00:00:00 2001
From: Khoa Ngo <ngominhkhoa2006@gmail.com>
Date: Tue, 22 Jul 2025 08:51:02 +0700
Subject: [PATCH 07/12] update playbook

---
 Dockerfile                   |  16 +-
 lorax_deployment_playbook.md | 285 ++++++++++++++---------------------
 server/Makefile-vllm         |   2 +-
 3 files changed, 126 insertions(+), 177 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index 2d17f01dc..b7e75b83a 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -58,10 +58,18 @@ RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-ins
     curl \
     git \
     ninja-build \
-    cmake \
     wget \
     && rm -rf /var/lib/apt/lists/*
 
+# Install a newer CMake version from the Kitware APT repository
+RUN apt-get update && \
+    DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends software-properties-common && \
+    wget -O - https://apt.kitware.com/keys/kitware-archive-latest.asc 2>/dev/null | gpg --dearmor - | tee /usr/share/keyrings/kitware-archive-keyring.gpg >/dev/null && \
+    echo "deb [signed-by=/usr/share/keyrings/kitware-archive-keyring.gpg] https://apt.kitware.com/ubuntu/ $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/kitware.list >/dev/null && \
+    apt-get update && \
+    rm -f /etc/apt/sources.list.d/cmake.list && \
+    apt-get install -y --no-install-recommends cmake
+
 # Install conda
 # translating Docker's TARGETPLATFORM into mamba arches
 RUN case ${TARGETPLATFORM} in \
@@ -95,7 +103,7 @@ FROM pytorch-install as kernel-builder
 # - If you encounter OOM errors even with this value, try reducing it further to 1.
 ENV MAX_JOBS=2
 # If you encounter OOM errors even with this value, try reducing it to 1.
-
+RUN pip install setuptools_scm --no-cache-dir
 # Build Flash Attention CUDA kernels
 FROM kernel-builder as flash-att-builder
 WORKDIR /usr/src
@@ -151,8 +159,10 @@ RUN make build-megablocks -j$(MAX_JOBS)
 # Build punica CUDA kernels
 FROM kernel-builder as punica-builder
 WORKDIR /usr/src
-COPY server/punica_kernels/ .
+COPY server/punica_kernels/ ./server/punica_kernels/
 # Build specific version of punica
+COPY flashinfer/ ./server/punica_kernels/third_party/flashinfer/
+COPY cutlass/ ./server/punica_kernels/third_party/cutlass/
 ENV TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
 RUN MAX_JOBS=$(MAX_JOBS) python setup.py build
 
diff --git a/lorax_deployment_playbook.md b/lorax_deployment_playbook.md
index f30ddb21e..06ff52ddf 100644
--- a/lorax_deployment_playbook.md
+++ b/lorax_deployment_playbook.md
@@ -20,7 +20,7 @@ To deploy **LoRAX**, you need these components in order:
 > **Quick Sanity Check:** Stop at the first failure in this sequence:
 > - **A.** Run `nvidia-smi` on the host.
 > - **B.** Test GPU access in a container: `docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi`.
-> - **C.** Launch **LoRAX** with `MODEL_ID=gpt2`.
+> - **C.** Launch **LoRAX** with `MODEL_ID=mistralai/Mistral-7B-Instruct-v0.1` (the pre-built image is recommended for this check).
 > - **D.** Test the API with `curl`.
 > - **E.** Scale up to a larger model.
 
@@ -37,12 +37,15 @@ Ensure your **NVIDIA driver** is working correctly.
 ```bash
 nvidia-smi
 ```
+**Success:** Displays a table with the driver version and GPU details.
+<details>
+<summary>Click to expand: Common Failures & Troubleshooting</summary>
 
-**Success:** Displays a table with the driver version and GPU details.  
-**Common Failures:**
 - *`command not found`* → Driver not installed or PATH issue.
 - *"NVIDIA-SMI has failed"* → Kernel module mismatch or Secure Boot blocking.
 
+</details>
+
 <details>
 <summary>Click to expand: NVIDIA Driver Installation Guide</summary>
 
@@ -57,25 +60,21 @@ Installing NVIDIA drivers can be complex and varies greatly by OS and GPU. **We
 Run this command to check if Docker is installed and running:
 
 ```bash
-# Check if we're inside a containerized environment where Docker can't run
-if grep -qa 'docker\|lxc' /proc/1/cgroup || [ -f /.dockerenv ]; then
-    echo "⚠️  Detected: This environment is containerized (Docker/LXC)."
-    echo "You CANNOT start Docker inside a container on most cloud GPU providers."
-    echo "👉  If you need full Docker access, deploy on a bare-metal or privileged VM."
-    echo "The script will exit in 10 seconds. Hit Ctrl+C to abort immediately."
-    sleep 10
-    echo "Exiting script."
-    exit 0
-fi
-
 if command -v docker >/dev/null 2>&1 && docker info >/dev/null 2>&1; then
     echo "Docker Engine: Installed and running. ✅"
 else
     echo "Docker Engine: NOT detected or NOT running. ❌"
 fi
 ```
+**Success:** `Docker Engine: Installed and running. ✅`
+<details>
+<summary>Click to expand: Common Failures & Troubleshooting</summary>
 
-> **Outcome:** If you see "Docker is installed and running.", you can skip the installation below.
+- `Docker Engine: NOT detected or NOT running. ❌` → Docker is not installed or the daemon is not running; follow the installation steps below.
+- *GPG/repo errors ("NO_PUBKEY", "Unsigned")* → Key issue; redo key setup.
+- *Architecture mismatch* on non-x86 hosts.
+
+</details>
 
 <details>
 <summary>Click to expand: Install Docker Engine</summary>
@@ -125,9 +124,13 @@ Run this command to verify GPU access within a container (requires Docker and To
 ```bash
 docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
 ```
+**Success:** Displays GPU details (similar to `nvidia-smi` on host).
+<details>
+<summary>Click to expand: Common Failures & Troubleshooting</summary>
 
-> **Outcome:** If you see GPU details (similar to `nvidia-smi` on host), you can skip the installation below.
-> **Common Failures:** "Unknown runtime specified nvidia" or "Could not select device driver" means the Toolkit is not correctly installed or configured.
+- *"Unknown runtime specified nvidia"* or *"Could not select device driver"* → Toolkit not correctly installed or configured.
+
+</details>
 
 <details>
 <summary>Click to expand: Install NVIDIA Container Toolkit</summary>
@@ -178,46 +181,6 @@ docker run --rm --gpus all nvidia/cuda:12.3.0-base-ubuntu22.04 nvidia-smi \
 > **Fix:** Re-run the toolkit installation and configuration steps.
 
 </details>
-
-#### Advanced NVIDIA Container Toolkit Configuration & Troubleshooting
-Some cloud environments or specific Docker/NVIDIA driver versions may require additional configuration for GPU access inside containers.
-
-* **Cgroup V2 Compatibility (`"Failed to initialize NVML: Unknown Error"`)**:
-    * **Problem:** On systems using `cgroup v2` (common in newer Linux distributions), rootless Docker or certain container runtime setups can conflict with NVIDIA's cgroup requirements, leading to errors like `"Failed to initialize NVML: Unknown Error"` or `"no such file or directory"` when trying to read cgroup events.
-    * **Solution:** You might need to adjust kernel parameters or `nvidia-container-runtime` settings.
-        * **Kernel Parameter:** Add `systemd.unified_cgroup_hierarchy=false` to your kernel boot parameters (e.g., in GRUB, then reboot).
-        * **Runtime Config:** Ensure `no-cgroups = false` is set in `/etc/nvidia-container-runtime/config.toml`.
-* **Device Symlink Issues (`/dev/char` missing)**:
-    * **Problem:** In some cases, necessary device symlinks for NVIDIA GPUs (`/dev/char/...`) might be missing, preventing containers from accessing the GPU.
-    * **Solution:** Use `nvidia-ctk system create-dev-char-symlinks --create-all` to create them, and ensure accompanying udev rules are in place.
-* **Essential Docker Daemon Configuration (`daemon.json`)**:
-    * **Problem:** Incorrect Docker daemon configuration can prevent GPUs from being exposed to containers.
-    * **Solution:** Verify your `/etc/docker/daemon.json` includes the following, then restart Docker (`sudo systemctl restart docker`):
-        ```json
-        {
-          "runtimes": {
-            "nvidia": {
-              "path": "/usr/bin/nvidia-container-runtime",
-              "runtimeArgs": []
-            }
-          },
-          "default-runtime": "nvidia",
-          "node-generic-resources": ["gpu=GPU-{uuid}"]
-        }
-        ```
-
-#### Cloud Platform-Specific Pitfalls
-Deployment on various cloud GPU providers can introduce unique challenges.
-
-* **Vast.ai Specific Problems:**
-    * **Docker Image Pull Throttling:** Frequent `"image pull is throttled"` errors can occur due to rate limits.
-        * **Solution:** Implement retry strategies for `docker pull`, or consider using pre-configured Vast.ai instances or custom images with required dependencies pre-installed.
-    * **Environment Variable Issues (`$DISPLAY`)**: Some GUI-dependent tools or environment checks might fail if `$DISPLAY` is not exported. This is less common for LoRAX but can impact debugging.
-* **RunPod Deployment Challenges:**
-    * **Network Configuration:** Ensure proper port mapping and understanding of shared volumes for model caching.
-    * **Secret Management:** HuggingFace tokens must be properly configured as secrets or environment variables for secure access.
-    * **Resource Allocation:** Always double-check adequate GPU memory allocation for the specific model you intend to load.
-
 ---
 
 ### 4. Check User in Docker Group 👤
@@ -227,8 +190,14 @@ Run this command to check if your user is already in the 'docker' group:
 ```bash
 groups | grep -q docker && echo "User is in the docker group." || echo "User is NOT in the docker group. Permissions needed."
 ```
+**Success:** `User is in the docker group.`
+<details>
+<summary>Click to expand: Common Failures & Troubleshooting</summary>
 
-> **Outcome:** If you see "User is in the docker group.", you can skip the steps below.
+- `User is NOT in the docker group. Permissions needed.` → Add your user to the `docker` group using the steps below.
+- *Commands still require `sudo`* → Log out and back in.
+
+</details>
 
 <details>
 <summary>Click to expand: Add User to Docker Group</summary>
@@ -265,8 +234,13 @@ else
     echo "HUGGING_FACE_HUB_TOKEN is NOT set. ❌"
 fi
 ```
+**Success:** `HUGGING_FACE_HUB_TOKEN is set. ✅`
+<details>
+<summary>Click to expand: Common Failures & Troubleshooting</summary>
+
+- `HUGGING_FACE_HUB_TOKEN is NOT set. ❌` → Token missing or not exported correctly.
 
-> **Outcome:** If you see "HUGGING_FACE_HUB_TOKEN is set. ✅", you can skip the manual setup steps below.
+</details>
 
 <details>
 <summary>Click to expand: Set up HUGGING_FACE_HUB_TOKEN</summary>
@@ -313,34 +287,6 @@ Choose one deployment path:
 
 ---
 
-### Common Failures during Container Launch
-
-<details>
-<summary>Click to expand: Common Failures during Container Launch</summary>
-
-These issues can occur when attempting to run *any* LoRAX Docker container, regardless of whether it's pre-built or from source.
-
-* **`docker: Error response from daemon: Conflict. The container name "/lorax" is already in use...`**: This means a container named `lorax` is already running or exists from a previous session. You need to stop and remove it first.
-    ```bash
-    docker stop lorax # Stop the running container
-    docker rm lorax   # Remove the stopped container (optional, if --rm was not used or failed previously)
-    ```
-    Then, re-run your `docker run` command.
-* **`docker: invalid reference format`, `--gpus: command not found`, etc.**: You likely copied the `docker run` command incorrectly. Ensure there are **no spaces** after the backslash `\` at the end of each line, and copy the entire block at once.
-* **`CUDA out of memory`** → The model you are trying to load is too large for your GPU's VRAM. Refer to the [GPU VRAM vs. Model Size Compatibility](#option-b-build-from-source-🛠️) table and choose a smaller or more quantized model.
-* **Stalled model download** → Indicates a network issue or Hugging Face rate limit when downloading the model weights inside the container.
-    > **Fix for Stalled Downloads:**
-    > 1.  Visit the model’s Hugging Face page (e.g., `https://huggingface.co/<model_id>/tree/main`).
-    > 2.  Note the commit hash from the URL or “Files and Versions.”
-    > 3.  Create the cache path on your host: `$HOME/lorax_model_cache/<model_id>/snapshots/<commit_hash>/`.
-    > 4.  Download all model files (config, tokenizer, `.safetensors`, etc.) to that directory.
-    > 5.  Re-run the container; it should now use the cached files.
-* **`RuntimeError: weight not found`** or **`TypeError`** → Model or quantization incompatibility with the pre-built image. For broader model compatibility, custom configurations, or support for a wider range of quantized models, please proceed with [Option B: Build from Source](#option-b-build-from-source-🛠️).
-
-</details>
-
----
-
 ### Option A: Pre-built Image 🎉
 
 #### 1. Pull the LoRAX Image
@@ -349,7 +295,6 @@ These issues can occur when attempting to run *any* LoRAX Docker container, rega
 docker pull ghcr.io/predibase/lorax:main
 ```
 
-
 **Success:** Image downloads successfully.  
 **Common Failure:** Network timeout → Retry or check connectivity.
 
@@ -363,12 +308,9 @@ docker pull ghcr.io/predibase/lorax:main
 
 For `mistralai/Mistral-7B-Instruct-v0.1`, a GPU with **16-24 GB VRAM is recommended** to ensure smooth operation and sufficient KV cache.
 
----
-
-#### 3. Run the LoRAX Container
+#### 3. Run the Container
 
 ```bash
-# Define your variables (MODEL_ID is set to the only supported model)
 MODEL_ID="mistralai/Mistral-7B-Instruct-v0.1"
 SHARDED_MODEL="false" # Set to 'true' for sharded (multi-GPU) models like 70B
 PORT=80 # Host port to access the LoRAX server
@@ -391,49 +333,25 @@ docker run --rm \
 <summary>Click to expand: Explanation of Docker Run Flags</summary>
 
 **What This Does:**
-- Starts the **LoRAX container** named `lorax` with GPU access.
-- Mounts model cache to persist downloads between container restarts.
-- Maps port **80** (container) to your chosen **host port**.
-- Loads the specified **model** (now only `mistralai/Mistral-7B-Instruct-v0.1`).
-- Uses your Hugging Face token for authenticated model downloads.
+- `docker run --rm --name lorax`: Starts a new container, removes it on exit, and names it `lorax`.
+- `--gpus all`: Grants the container access to all available GPUs.
+- `-e HUGGING_FACE_HUB_TOKEN`: Passes your Hugging Face authentication token.
+- `-v "$HOME/lorax_model_cache":/data`: Mounts a local directory for persistent model caching.
+- `-v "$HOME/lorax_outlines_cache":/root/.cache/outlines`: Mounts cache for Outlines library.
+- `--user "$(id -u):$(id -g)"`: Runs the container process as your host user for permission consistency.
+- `-p ${PORT}:80`: Maps the container's internal port 80 to your specified host port.
+- `ghcr.io/predibase/lorax:main`: Specifies the Docker image to use.
+- `--model-id "$MODEL_ID"`: Sets the Hugging Face model to load.
+- `--sharded "$SHARDED_MODEL"`: Configures for multi-GPU sharding if set to `true`.
 
 </details>
 
-**Success:** Logs show model download/cache hit and “Model loaded”; health endpoint responds.  
-**Common Failures:** Refer to [Common Failures during Container Launch](#common-failures-during-container-launch)
-
-> **Fix for Stalled Downloads:**
-> 1. Visit the model’s Hugging Face page (e.g., `https://huggingface.co/<model_id>/tree/main`).
-> 2. Note the commit hash from the URL or “Files and Versions.”
-> 3. Create the cache path: `$HOME/lorax_model_cache/<model_id>/snapshots/<commit_hash>/`.
-> 4. Download all model files (config, tokenizer, `.safetensors`, etc.) to that directory.
-> 5. Re-run the container; it should use the cached files.
-
 ---
 
 ### Option B: Build from Source 🛠️
 
 Use this if you need custom changes or unreleased patches, or if you want to run models other than `mistralai/Mistral-7B-Instruct-v0.1`.
 
-#### GPU VRAM vs. Model Size Compatibility
-
-When building from source, you gain the flexibility to choose a wider range of models. Use the following table as a guide for VRAM compatibility:
-
-| **Model** | **Params** | **VRAM (FP16/BF16)** | **Notes** |
-|-----------|------------|-----------------------|-----------|
-| `gpt2` | 0.1B | ~0.5 GB | Perfect for testing; fits any GPU. |
-| `bigcode/starcoder2-3b` | 3B | ~6–7 GB | Works on 8 GB VRAM GPUs. |
-| `mistralai/Mistral-7B-Instruct-v0.1` | 7B | ~14–15 GB | Needs 16–24 GB VRAM. |
-| `meta-llama/Meta-Llama-3-8B-Instruct` | 8B | ~16 GB | Tight on 16 GB; better with 24 GB. |
-| `meta-llama/Meta-Llama-3-13B-Instruct` | 13B | ~26 GB | Requires 24–26 GB VRAM. |
-| `meta-llama/Meta-Llama-3-70B-Instruct` | 70B | 135–140 GB | Needs multi-GPU or heavy quantization. |
-
-> **VRAM Tips:**
-> - Keep **10–15% VRAM free** for KV cache and overhead.
-> - **6–8 GB GPUs**: Stick to quantized 7B models.
-> - **12–16 GB GPUs**: Comfortable for 7B; tight for 8B.
-> - **24 GB+ GPUs**: Suitable for 13B or multi-instance setups.
-
 #### 1. Clone the LoRAX Repository (Including all necessary Submodules)
 
 **Problem:** To build LoRAX from source, you need not only the main repository but also its nested external dependencies, which are managed as Git submodules (e.g., `flashinfer` for custom CUDA kernels). Skipping this can lead to "No such file or directory" errors during the build.
@@ -449,6 +367,7 @@ git submodule update --init --recursive
 #### 2. Build the Image
 
 ```bash
+export DOCKER_BUILDKIT=1
 docker build -t my-lorax-server -f Dockerfile .
 ```
 
@@ -456,6 +375,9 @@ docker build -t my-lorax-server -f Dockerfile .
 - Build stalls → Add `--network=host` to the build command.
 - Version conflicts → Adjust base image or dependencies.
 
+<details>
+<summary>Click to expand: Advanced Build-Time Optimizations & Troubleshooting (MAX_JOBS, OOM)</summary>
+
 > **Important Note on Build Parallelism (`MAX_JOBS`) & Memory:**
 > Building custom CUDA kernels from source is a memory-intensive process. The `Dockerfile` is configured with `ENV MAX_JOBS=2` as a **very conservative default** for parallel compilation. This value aims to provide the highest stability and prevent Out-Of-Memory (OOM) crashes on a wide range of hardware, including instances with limited RAM relative to CPU cores.
 >
@@ -472,28 +394,70 @@ docker build -t my-lorax-server -f Dockerfile .
 > * **If your build still crashes with an OOM error:**
 >     This indicates you have very limited RAM or other processes are consuming it. You **must reduce `MAX_JOBS` further**. Edit the `Dockerfile` as described above and change the value to `1`. Then, restart the build.
 
-#### 4. Run the Container
+</details>
 
-Use the same `docker run` command as in Option A, replacing `ghcr.io/predibase/lorax:main` with `my-lorax-server`.
+#### 3. Choose Your Model
 
-**Common Failures:**
-- “Exec format error” → Image built for wrong architecture.
-- Immediate exit → Library mismatch; rebuild with compatible CUDA base.
+Refer to the compatibility table below to select a model that fits your hardware and requirements.
+
+---
+
+| **Model** | **Params** | **VRAM (FP16/BF16)** | **Notes** |
+|-----------|------------|-----------------------|-----------|
+| `mistralai/Mistral-7B-Instruct-v0.1` | 7B | ~14–15 GB | Needs 16–24 GB VRAM. |
+| `meta-llama/Meta-Llama-3-8B-Instruct` | 8B | ~16 GB | Tight on 16 GB; better with 24 GB. |
+| `meta-llama/Meta-Llama-3-70B-Instruct` | 70B | 135–140 GB | Needs multi-GPU or heavy quantization. |
+| `mistralai/Mixtral-8x7B-Instruct-v0.1` | 8x7B (MoE) | ~90–100 GB | **Disk required: ~130 GB.** All experts must be resident in VRAM even though only a subset is routed per token; requires heavy quantization (e.g., Q8_0) or multiple H100s/A100s. |
+
+> **VRAM Tips:**
+> - Keep **10–15% VRAM free** for KV cache and overhead.
+> - **6–8 GB GPUs**: Stick to quantized 7B models.
+> - **12–16 GB GPUs**: Comfortable for 7B; tight for 8B.
+> - **24 GB+ GPUs**: Suitable for 13B or multi-instance setups.
+> - **MoE models (e.g., Mixtral 8x7B)**: VRAM scales with the total parameter count, not the parameters active per token, and the disk footprint is large. A full 8x7B in FP16/BF16 needs roughly 90–100 GB of VRAM (well above 48 GB) and around **130 GB of disk space for the weights**. Consider heavy quantization (e.g., Q8_0) or a multi-GPU system such as multiple H100s.
+
+<details>
+<summary>Click to expand: Troubleshooting Model Compatibility (Build from Source)</summary>
 
 ### Model Compatibility Beyond Mistral-7B (Build from Source)
 
 If you attempt to load a model other than `mistralai/Mistral-7B-Instruct-v0.1` and encounter errors such as `TypeError: TensorParallelColumnLinear.load_multi()` or `RuntimeError: weight ... does not exist`, these errors typically indicate version incompatibilities between PEFT, Transformers, and TGI components. The root issue is that the `fan_in_fan_out` parameter conflicts with TGI's tensor parallel implementations, and `TensorParallelColumnLinear` expects certain `base_layer` attributes that may not be present in all model variants or library versions.
 
+- **Note:** `--trust-remote-code` is a boolean flag. If your model requires it, pass it bare (`--trust-remote-code`), not with a value (`--trust-remote-code=True`).
+- **`ImportError: No module named 'msgspec'` (for Qwen or other vLLM-dependent models):** `vLLM` may require the `msgspec` Python library. Add `msgspec` to `server/requirements.txt` and rebuild your Docker image with `--no-cache`.
+- **`TypeError` for `gpt2` (fan_in_fan_out):** This is a specific API mismatch between LoRAX's custom `FlashGPT2` modeling and the `vLLM` version. Ensure the `vLLM` commit in `server/Makefile-vllm` is `9985d06add07a4cc691dc54a7e34f54205c04d40` (the stable `0.7.3+` version) or a later compatible version like `0.8.2+`, and rebuild. The `--model-impl transformers` flag does *not* exist in `lorax-launcher`.
+
 To attempt compatibility with a different model (e.g., `gpt2`):
 
 1. The `vLLM` inference engine version is crucial. In LoRAX, `vLLM` is pinned to a specific Git commit for stability. To change it, you need to **edit `server/Makefile-vllm`**.
 2. Rebuild the Docker image after making any changes.
 3. If you encounter errors related to missing weights or quantization, check the model's compatibility with the current `transformers` and `vLLM` versions.
-4. Change the commit hash (e.g., `766435e660a786933392eb8ef0a873bc38cf0c8b`) to **`9985d06add07a4cc691dc54a7e34f54205c04d40`** (a `vLLM 0.7.3+` version known for broader compatibility, including `gpt2`).
+4. Change the commit hash (e.g., `766435e660a786933392eb8ef0a873bc38cf0c8b`) to **`9985d06add07a4cc691dc54a7e34f54205c04d40`** (a `vLLM 0.7.3+` version known for broader compatibility, including `gpt2`), or try a later compatible version such as `0.8.2+`.
 
 * **Potential `transformers` version adjustments:** If changing the `vLLM` commit doesn't resolve the issue, you *might* also need to modify the `transformers` version in `server/requirements.txt`. Research suggests `Transformers 4.49.0+` provides stable `gpt2` support with `vLLM 0.7.3+`. **Avoid `Transformers 4.48.x` with `vLLM 0.7.2` due to known Qwen model compatibility issues.**
 
-* **Using `--model-impl transformers`:** For certain models, particularly `gpt2`, if issues persist, you may need to add the `--model-impl transformers` flag to your `lorax-launcher` command to explicitly force the Transformers backend for inference.
+</details>
+
+#### 4. Run the Container
+
+```bash
+MODEL_ID="mistralai/Mistral-7B-Instruct-v0.1"
+SHARDED_MODEL="false" # Set to 'true' for sharded (multi-GPU) models like 70B
+PORT=80 # Host port to access the LoRAX server
+
+docker run --rm \
+  --name lorax \
+  --gpus all \
+  -e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
+  -e TRANSFORMERS_CACHE=/data \
+  -v "$HOME/lorax_model_cache":/data \
+  -v "$HOME/lorax_outlines_cache":/root/.cache/outlines \
+  --user "$(id -u):$(id -g)" \
+  -p ${PORT}:80 \
+  my-lorax-server \
+  --model-id "$MODEL_ID" \
+  --sharded "$SHARDED_MODEL"
+```
 
 ---
 
@@ -506,12 +470,7 @@ Once logs show the server is ready, test the **LoRAX API**.
 ```bash
 curl 127.0.0.1:80/generate \
     -X POST \
-    -d '{
-        "inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]",
-        "parameters": {
-            "max_new_tokens": 64
-        }
-    }' \
+    -d '{ "inputs": "[INST] What LLM model are you? [/INST]", "parameters": { "max_new_tokens": 64 } }' \
     -H 'Content-Type: application/json'
 ```  
 
@@ -537,7 +496,7 @@ your chosen base model.
 <details>
 <summary>Click to expand: Common Failures during API Test</summary>
 
-**Common Failures:** Refer to [Common Failures during Container Launch](#common-failures-during-container-launch)
+**Common Failures:** Refer to the Comprehensive Troubleshooting Guide below.
 
 </details>
 
@@ -554,9 +513,8 @@ your chosen base model.
 - **[Docker]** Cache permission denied → Root-owned files → Run `sudo chown -R $(id -u):$(id -g) $HOME/lorax_model_cache`.
 - **[Model Load]** CUDA OOM → Model too large → Check `nvidia-smi`; use smaller/quantized model.
 - **[Model Load]** Download stalls → Network issue → Use manual download workaround.
-- **[Model Load]** `RuntimeError: weight not found` or **`TypeError`** → Model or quantization incompatibility with the pre-built image. For broader model compatibility, custom configurations, or support for a wider range of quantized models, proceed with Option B: Build from Source.
-- **[API]** 404 on generate → Wrong route → Check `curl http://localhost:80/`; adjust client.
-- **[API]** 500 error → OOM or bad params → Check `docker logs --tail 100 lorax | grep -i error`; reduce `max_tokens`.
+- **[Model Load]** `RuntimeError: weight not found` or **`TypeError`** → Model or quantization incompatibility with the pre-built image. For detailed fixes, see the "Troubleshooting Model Compatibility (Build from Source)" section above.
+- **[Download]** `UserWarning: Not enough free disk space` or `No space left on device` during model download/caching → Mounted model cache directory is out of space → Check `df -h $HOME/lorax_model_cache`, then remove unused model folders with `rm -rf`; move the cache to a larger disk if needed.
 - **[Performance]** Slow first call → Warmup overhead → Send a short warmup prompt.
 - **[Performance]** Low GPU usage (<30%) → Small batches → Enable batching or increase concurrency.
 - **[Stability]** Exit code 137 → Host OOM → Check `dmesg | tail`; reduce model size.
@@ -565,24 +523,6 @@ your chosen base model.
 
 ---
 
-## 🧠 Decision Matrix
-
-<details>
-<summary>Click to expand: Quick Decision Matrix</summary>
-
-| **Situation** | **Action** |
-|---------------|------------|
-| `nvidia-smi` broken | Fix driver first. |
-| Container `nvidia-smi` fails | Fix NVIDIA runtime config. |
-| `gpt2` fails to load | Check environment/image. If you need broader model compatibility, proceed with Option B: Build from Source. |
-| `gpt2` works, larger model fails | Address VRAM/quantization issues or use Option B for more models. |
-| API fails | Check routes, params, or logs. |
-| API slow | Optimize concurrency or use smaller model. |
-
-</details>
-
----
-
 ## 🧹 Cleanup & Reset
 
 <details>
@@ -606,12 +546,19 @@ sudo chown -R $(id -u):$(id -g) $HOME/lorax_model_cache
 nvidia-smi
 docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
 
-# Pull and run LoRAX
-docker pull ghcr.io/predibase/lorax:main
-MODEL_ID="mistralai/Mistral-7B-Instruct-v0.1"; docker run --rm --name lorax --gpus all -e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" -e TRANSFORMERS_CACHE=/data -v "$HOME/lorax_model_cache":/data -v "$HOME/lorax_outlines_cache":/root/.cache/outlines --user "$(id -u):$(id -g)" -p 80:80 ghcr.io/predibase/lorax:main --model-id "$MODEL_ID" --sharded false
+# Pull and run LoRAX (Pre-built Image)
+MODEL_ID="mistralai/Mistral-7B-Instruct-v0.1"; \
+docker run --rm --name lorax --gpus all -e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
+  -e TRANSFORMERS_CACHE=/data -v "$HOME/lorax_model_cache":/data \
+  -v "$HOME/lorax_outlines_cache":/root/.cache/outlines \
+  --user "$(id -u):$(id -g)" -p 80:80 \
+  ghcr.io/predibase/lorax:main --model-id "$MODEL_ID" --sharded false
 
 # Test the API
-curl http://localhost:80/
+curl 127.0.0.1:80/generate \
+    -X POST \
+    -d '{ "inputs": "[INST] What LLM model are you? [/INST]", "parameters": { "max_new_tokens": 64 } }' \
+    -H 'Content-Type: application/json'
 ```
 
 
@@ -631,13 +578,5 @@ curl http://localhost:80/
 
 ---
 
-## Lessons Learned
-
-- **Kernel Cohesion:** Kernel upgrades Must Be Cohesive: Partial migrations (e.g., AWQ v0.0.4 → v0.0.6) exposed ABI mismatches; coordinated bump strategies are now mandated. This directly explains why versions like `vLLM` commits are so critical.
-- **Memory Virtualization:** Early investment in memory virtualization (like v0.9 memory pool overhaul) is crucial for runtime OOM reduction.
-- **Schema-Driven API Design:** Tightened Pydantic models prevented API inconsistencies and improved reliability.
-
----
-
 **Happy Deploying!** 🎉
 
diff --git a/server/Makefile-vllm b/server/Makefile-vllm
index 7fc5adeb9..4baa87120 100644
--- a/server/Makefile-vllm
+++ b/server/Makefile-vllm
@@ -4,7 +4,7 @@ vllm-cuda:
 	git clone https://github.com/vllm-project/vllm.git vllm
 
 build-vllm-cuda: vllm-cuda
-	cd vllm && git fetch && git checkout 9985d06add07a4cc691dc54a7e34f54205c04d40
+	cd vllm && git fetch --tags && git checkout v0.7.3
 	cd vllm && python setup.py build
 
 install-vllm-cuda: build-vllm-cuda

From 3f7b7181874921aa6dc736af061d4d011c0a8db7 Mon Sep 17 00:00:00 2001
From: Khoa Ngo <ngominhkhoa2006@gmail.com>
Date: Tue, 22 Jul 2025 10:01:01 +0700
Subject: [PATCH 08/12] Fix Dockerfile

---
 Dockerfile | 2 --
 1 file changed, 2 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index b7e75b83a..c6bf13576 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -161,8 +161,6 @@ FROM kernel-builder as punica-builder
 WORKDIR /usr/src
 COPY server/punica_kernels/ ./server/punica_kernels/
 # Build specific version of punica
-COPY flashinfer/ ./server/punica_kernels/third_party/flashinfer/
-COPY cutlass/ ./server/punica_kernels/third_party/cutlass/
 ENV TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
 RUN MAX_JOBS=$(MAX_JOBS) python setup.py build
 

From 64f4aa51b95810ef238340ec1adc8c3f603bc7f0 Mon Sep 17 00:00:00 2001
From: Khoa Ngo <ngominhkhoa2006@gmail.com>
Date: Tue, 22 Jul 2025 17:31:40 +0700
Subject: [PATCH 09/12] update

---
 Dockerfile                   |  62 +++++++----
 lorax_deployment_playbook.md | 204 +++++++++++++----------------------
 2 files changed, 115 insertions(+), 151 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index c6bf13576..3bcc39c48 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -1,10 +1,13 @@
 # Rust builder
+
 FROM lukemathwalker/cargo-chef:latest-rust-1.83 AS chef
+
 WORKDIR /usr/src
 
 ARG CARGO_REGISTRIES_CRATES_IO_PROTOCOL=sparse
 
 FROM chef as planner
+
 COPY Cargo.toml Cargo.toml
 COPY rust-toolchain.toml rust-toolchain.toml
 COPY proto proto
@@ -48,6 +51,7 @@ ARG INSTALL_CHANNEL=pytorch
 ARG TARGETPLATFORM
 
 ENV PATH /opt/conda/bin:$PATH
+
 # For build-time CUDA memory resilience
 ENV PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
 
@@ -73,10 +77,10 @@ RUN apt-get update && \
 # Install conda
 # translating Docker's TARGETPLATFORM into mamba arches
 RUN case ${TARGETPLATFORM} in \
-    "linux/arm64")  MAMBA_ARCH=aarch64  ;; \
-    *)              MAMBA_ARCH=x86_64   ;; \
+    "linux/arm64") MAMBA_ARCH=aarch64 ;; \
+    *) MAMBA_ARCH=x86_64 ;; \
     esac && \
-    curl -fsSL -v -o ~/mambaforge.sh -O  "https://github.com/conda-forge/miniforge/releases/download/${MAMBA_VERSION}/Mambaforge-${MAMBA_VERSION}-Linux-${MAMBA_ARCH}.sh"
+    curl -fsSL -v -o ~/mambaforge.sh -O "https://github.com/conda-forge/miniforge/releases/download/${MAMBA_VERSION}/Mambaforge-${MAMBA_VERSION}-Linux-${MAMBA_ARCH}.sh"
 RUN chmod +x ~/mambaforge.sh && \
     bash ~/mambaforge.sh -b -p /opt/conda && \
     rm ~/mambaforge.sh
@@ -84,9 +88,9 @@ RUN chmod +x ~/mambaforge.sh && \
 # Install pytorch
 # On arm64 we exit with an error code
 RUN case ${TARGETPLATFORM} in \
-    "linux/arm64")  exit 1 ;; \
-    *)              /opt/conda/bin/conda update -y conda &&  \
-    /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -c "${CUDA_CHANNEL}" -y "python=${PYTHON_VERSION}" "pytorch=$PYTORCH_VERSION" "pytorch-cuda=$(echo $CUDA_VERSION | cut -d'.' -f 1-2)"  ;; \
+    "linux/arm64") exit 1 ;; \
+    *) /opt/conda/bin/conda update -y conda && \
+    /opt/conda/bin/conda install -c "${INSTALL_CHANNEL}" -c "${CUDA_CHANNEL}" -y "python=${PYTHON_VERSION}" "pytorch=$PYTORCH_VERSION" "pytorch-cuda=$(echo $CUDA_VERSION | cut -d'.' -f 1-2)" ;; \
     esac && \
     /opt/conda/bin/conda clean -ya
 
@@ -99,43 +103,51 @@ FROM pytorch-install as kernel-builder
 #
 # You can adjust this value to optimize build speed based on your system's RAM:
 # - If you have more RAM (e.g., 96GB+), you can increase this value (e.g., to 16, 24, or 32)
-#   to significantly speed up the build. Always monitor RAM usage (htop) to avoid OOM crashes.
+# to significantly speed up the build. Always monitor RAM usage (htop) to avoid OOM crashes.
 # - If you encounter OOM errors even with this value, try reducing it further to 1.
 ENV MAX_JOBS=2
 # If you encounter OOM errors even with this value, try reducing it to 1.
+
 RUN pip install setuptools_scm --no-cache-dir
+
 # Build Flash Attention CUDA kernels
 FROM kernel-builder as flash-att-builder
+
 WORKDIR /usr/src
 COPY server/Makefile-flash-att Makefile
 RUN make build-flash-attention -j$(MAX_JOBS)
 
 # Build Flash Attention v2 CUDA kernels
 FROM kernel-builder as flash-att-v2-builder
+
 WORKDIR /usr/src
 COPY server/Makefile-flash-att-v2 Makefile
 RUN make build-flash-attention-v2-cuda -j$(MAX_JOBS)
 
 # Build Transformers exllama kernels
 FROM kernel-builder as exllama-kernels-builder
+
 WORKDIR /usr/src
 COPY server/exllama_kernels/ .
 RUN MAX_JOBS=$(MAX_JOBS) TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" python setup.py build
 
 # Build Transformers exllama kernels
 FROM kernel-builder as exllamav2-kernels-builder
+
 WORKDIR /usr/src
 COPY server/exllamav2_kernels/ .
 RUN MAX_JOBS=$(MAX_JOBS) TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" python setup.py build
 
 # Build Transformers awq kernels
 FROM kernel-builder as awq-kernels-builder
+
 WORKDIR /usr/src
 COPY server/Makefile-awq Makefile
 RUN TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX" make build-awq -j$(MAX_JOBS)
 
 # Build Transformers CUDA kernels
 FROM kernel-builder as custom-kernels-builder
+
 WORKDIR /usr/src
 COPY server/custom_kernels/ .
 # Build specific version of transformers
@@ -143,6 +155,7 @@ RUN MAX_JOBS=$(MAX_JOBS) python setup.py build
 
 # Build vllm CUDA kernels
 FROM kernel-builder as vllm-builder
+
 WORKDIR /usr/src
 ENV TORCH_CUDA_ARCH_LIST="7.0 7.5 8.0 8.6 8.9 9.0+PTX"
 COPY server/Makefile-vllm Makefile
@@ -151,6 +164,7 @@ RUN make build-vllm-cuda -j$(MAX_JOBS)
 
 # Build megablocks kernels
 FROM kernel-builder as megablocks-kernels-builder
+
 WORKDIR /usr/src
 COPY server/Makefile-megablocks Makefile
 ENV TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
@@ -159,13 +173,15 @@ RUN make build-megablocks -j$(MAX_JOBS)
 # Build punica CUDA kernels
 FROM kernel-builder as punica-builder
 WORKDIR /usr/src
-COPY server/punica_kernels/ ./server/punica_kernels/
+
+COPY server/punica_kernels/ .
 # Build specific version of punica
 ENV TORCH_CUDA_ARCH_LIST="8.0;8.6+PTX"
 RUN MAX_JOBS=$(MAX_JOBS) python setup.py build
 
 # Build eetq kernels
 FROM kernel-builder as eetq-kernels-builder
+
 WORKDIR /usr/src
 COPY server/Makefile-eetq Makefile
 # Build specific version of transformers
@@ -196,32 +212,36 @@ RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-ins
 COPY --from=pytorch-install /opt/conda /opt/conda
 
 # Copy build artifacts from flash attention builder
-COPY --from=flash-att-builder /usr/src/flash-attention/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
-COPY --from=flash-att-builder /usr/src/flash-attention/csrc/layer_norm/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
-COPY --from=flash-att-builder /usr/src/flash-attention/csrc/rotary/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
+COPY --from=flash-att-builder      /usr/src/flash-attention/build/lib.linux-x86_64-cpython-310         /opt/conda/lib/python3.10/site-packages
+COPY --from=flash-att-builder      /usr/src/flash-attention/csrc/layer_norm/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
+COPY --from=flash-att-builder      /usr/src/flash-attention/csrc/rotary/build/lib.linux-x86_64-cpython-310     /opt/conda/lib/python3.10/site-packages
 
 # Copy build artifacts from flash attention v2 builder
-COPY --from=flash-att-v2-builder /usr/src/flash-attention-v2/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
+COPY --from=flash-att-v2-builder   /usr/src/flash-attention-v2/build/lib.linux-x86_64-cpython-310      /opt/conda/lib/python3.10/site-packages
 
 # Copy build artifacts from custom kernels builder
-COPY --from=custom-kernels-builder /usr/src/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
+COPY --from=custom-kernels-builder /usr/src/build/lib.linux-x86_64-cpython-310                         /opt/conda/lib/python3.10/site-packages
+
 # Copy build artifacts from exllama kernels builder
-COPY --from=exllama-kernels-builder /usr/src/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
+COPY --from=exllama-kernels-builder   /usr/src/build/lib.linux-x86_64-cpython-310                     /opt/conda/lib/python3.10/site-packages
+
 # Copy build artifacts from exllamav2 kernels builder
-COPY --from=exllamav2-kernels-builder /usr/src/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
+COPY --from=exllamav2-kernels-builder /usr/src/build/lib.linux-x86_64-cpython-310                     /opt/conda/lib/python3.10/site-packages
+
 # Copy build artifacts from awq kernels builder
-COPY --from=awq-kernels-builder /usr/src/llm-awq/awq/kernels/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
+COPY --from=awq-kernels-builder        /usr/src/llm-awq/awq/kernels/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
+
 # Copy builds artifacts from vllm builder
-COPY --from=vllm-builder /usr/src/vllm/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
+COPY --from=vllm-builder               /usr/src/vllm/build/lib.linux-x86_64-cpython-310               /opt/conda/lib/python3.10/site-packages
 
 # Copy builds artifacts from punica builder
-COPY --from=punica-builder /usr/src/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
+COPY --from=punica-builder             /usr/src/build/lib.linux-x86_64-cpython-310                    /opt/conda/lib/python3.10/site-packages
 
 # Copy build artifacts from megablocks builder
-COPY --from=megablocks-kernels-builder /usr/src/megablocks/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
+COPY --from=megablocks-kernels-builder /usr/src/megablocks/build/lib.linux-x86_64-cpython-310         /opt/conda/lib/python3.10/site-packages
 
 # Copy build artifacts from eetq builder
-COPY --from=eetq-kernels-builder /usr/src/eetq/build/lib.linux-x86_64-cpython-310 /opt/conda/lib/python3.10/site-packages
+COPY --from=eetq-kernels-builder       /usr/src/eetq/build/lib.linux-x86_64-cpython-310               /opt/conda/lib/python3.10/site-packages
 
 # Install flash-attention dependencies
 RUN pip install einops --no-cache-dir
@@ -249,7 +269,6 @@ RUN apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-ins
     g++ \
     && rm -rf /var/lib/apt/lists/*
 
-
 # Final image
 FROM base
 LABEL source="https://github.com/predibase/lorax"
@@ -261,7 +280,6 @@ RUN chmod +x entrypoint.sh
 COPY sync.sh sync.sh
 RUN chmod +x sync.sh
 
-
 RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
     unzip awscliv2.zip && \
     sudo ./aws/install && \
diff --git a/lorax_deployment_playbook.md b/lorax_deployment_playbook.md
index 06ff52ddf..9a20cee76 100644
--- a/lorax_deployment_playbook.md
+++ b/lorax_deployment_playbook.md
@@ -281,170 +281,78 @@ echo $HUGGING_FACE_HUB_TOKEN
 
 ## Phase 2: Deploy LoRAX
 
-Choose one deployment path:
-- **(A) Pre-built Image** – Fastest option, recommended for most users.
-- **(B) Build from Source** – Only for custom changes or unreleased patches.
-
----
+You can deploy LoRAX either by pulling the **pre-built image** or by **building from source**. Both methods now support the same set of models:
+- `meta-llama/Llama-3.2-3B-Instruct`
+- `mistralai/Mistral-7B-Instruct-v0.1`
+- `meta-llama/Meta-Llama-3-8B-Instruct`
 
-### Option A: Pre-built Image 🎉
+Choose your deployment path:
+- **(A) Pre-built Image** – Fastest option, recommended for most users.
+- **(B) Build from Source** – For custom changes or unreleased patches.
 
-#### 1. Pull the LoRAX Image
+### 1. (Option A) Pull the Pre-built Image
 
 ```bash
 docker pull ghcr.io/predibase/lorax:main
 ```
 
-**Success:** Image downloads successfully.  
-**Common Failure:** Network timeout → Retry or check connectivity.
-
-> **Tip:** This is a public image, so no authentication issues are expected.
-
----
-
-#### 2. Choose Your Model 📊
-
-**Critical Compatibility Note:** Due to internal versioning and optimization, the `ghcr.io/predibase/lorax:main` pre-built Docker image is **only consistently compatible with `mistralai/Mistral-7B-Instruct-v0.1`** at this time. Attempts to load other models (including `gpt2`, `starcoder2-3b`, or any other quantized models) may result in `TypeError`, `RuntimeError: weight ... does not exist`, or other internal loading failures. For broader model compatibility, custom configurations, or support for a wider range of quantized models, please proceed with **Option B: Build from Source**.
-
-For `mistralai/Mistral-7B-Instruct-v0.1`, a GPU with **16-24 GB VRAM is recommended** to ensure smooth operation and sufficient KV cache.
-
-#### 3. Run the Container
-
-```bash
-MODEL_ID="mistralai/Mistral-7B-Instruct-v0.1"
-SHARDED_MODEL="false" # Set to 'true' for sharded (multi-GPU) models like 70B
-PORT=80 # Host port to access the LoRAX server
-
-docker run --rm \
-  --name lorax \
-  --gpus all \
-  -e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
-  -e TRANSFORMERS_CACHE=/data \
-  -v "$HOME/lorax_model_cache":/data \
-  -v "$HOME/lorax_outlines_cache":/root/.cache/outlines \
-  --user "$(id -u):$(id -g)" \
-  -p ${PORT}:80 \
-  ghcr.io/predibase/lorax:main \
-  --model-id "$MODEL_ID" \
-  --sharded "$SHARDED_MODEL"
-```
-
-<details>
-<summary>Click to expand: Explanation of Docker Run Flags</summary>
-
-**What This Does:**
-- `docker run --rm --name lorax`: Starts a new container, removes it on exit, and names it `lorax`.
-- `--gpus all`: Grants the container access to all available GPUs.
-- `-e HUGGING_FACE_HUB_TOKEN`: Passes your Hugging Face authentication token.
-- `-v "$HOME/lorax_model_cache":/data`: Mounts a local directory for persistent model caching.
-- `-v "$HOME/lorax_outlines_cache":/root/.cache/outlines`: Mounts cache for Outlines library.
-- `--user "$(id -u):$(id -g)"`: Runs the container process as your host user for permission consistency.
-- `-p ${PORT}:80`: Maps the container's internal port 80 to your specified host port.
-- `ghcr.io/predibase/lorax:main`: Specifies the Docker image to use.
-- `--model-id "$MODEL_ID"`: Sets the Hugging Face model to load.
-- `--sharded "$SHARDED_MODEL"`: Configures for multi-GPU sharding if set to `true`.
-
-</details>
-
----
+### 1. (Option B) Build the Image from Source
 
-### Option B: Build from Source 🛠️
-
-Use this if you need custom changes or unreleased patches, or if you want to run models other than `mistralai/Mistral-7B-Instruct-v0.1`.
-
-#### 1. Clone the LoRAX Repository (Including all necessary Submodules)
-
-**Problem:** To build LoRAX from source, you need not only the main repository but also its nested external dependencies, which are managed as Git submodules (e.g., `flashinfer` for custom CUDA kernels). Skipping this can lead to "No such file or directory" errors during the build.
-
-**Action:** First, clone the main repository, then immediately initialize and update all its submodules.
+> **Tip: Dramatically Speed Up Builds!**
+> 
+> By default, the Dockerfile sets `MAX_JOBS=2` to prevent out-of-memory (OOM) errors on machines with limited RAM. If you have a lot of RAM (e.g., 64GB, 96GB, or more), you can **dramatically speed up the build** by increasing this value.
+>
+> **How to do it:**
+> 1. Open the `Dockerfile` in your editor.
+> 2. Find the line:
+>    ```Dockerfile
+>    ENV MAX_JOBS=2
+>    ```
+> 3. Change `2` to a higher number (e.g., `16`, `24`, or `32`).
+> 4. Save the file and rebuild the image.
+>
+> **Warning:** Always monitor your RAM usage (e.g., with `htop`) during the build. If you run out of memory, reduce `MAX_JOBS` and try again.
 
 ```bash
 git clone -b feat/deployment-playbook-enhancements https://github.com/minhkhoango/lorax.git
 cd lorax
 git submodule update --init --recursive
-```
-
-#### 2. Build the Image
-
-```bash
 export DOCKER_BUILDKIT=1
 docker build -t my-lorax-server -f Dockerfile .
 ```
 
-**Common Failures:**
-- Build stalls → Add `--network=host` to the build command.
-- Version conflicts → Adjust base image or dependencies.
-
-<details>
-<summary>Click to expand: Advanced Build-Time Optimizations & Troubleshooting (MAX_JOBS, OOM)</summary>
-
-> **Important Note on Build Parallelism (`MAX_JOBS`) & Memory:**
-> Building custom CUDA kernels from source is a memory-intensive process. The `Dockerfile` is configured with `ENV MAX_JOBS=2` as a **very conservative default** for parallel compilation. This value aims to provide the highest stability and prevent Out-Of-Memory (OOM) crashes on a wide range of hardware, including instances with limited RAM relative to CPU cores.
->
-> **Advanced Build-Time Memory Management:**
-> For systems with very limited RAM or during memory-intensive CUDA kernel compilations, setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` can help PyTorch manage memory more flexibly during the build process, potentially reducing Out-Of-Memory (OOM) crashes. This setting is now automatically applied via an environment variable in the `Dockerfile` for relevant build stages.
->
-> * **To Optimize for Faster Builds (Recommended):**
->     If you have significantly more RAM (e.g., 96GB or more) and want to speed up compilation, you can safely **increase `MAX_JOBS`**.
->     1.  **Open the `Dockerfile`** in your cloned `lorax` directory using your preferred text editor (e.g., `nano Dockerfile` or `code Dockerfile`).
->     2.  **Find the line:** `ENV MAX_JOBS=2` (it will be surrounded by comments explaining its purpose)
->     3.  **Change the value** to a higher number (e.g., `16`, `24`, or `32`). *Always monitor your RAM usage (`htop`) during the build to avoid crashes.*
->     4.  **Save the `Dockerfile`** and restart your build command (`docker build -t my-lorax-server -f Dockerfile .`).
->
-> * **If your build still crashes with an OOM error:**
->     This indicates you have very limited RAM or other processes are consuming it. You **must reduce `MAX_JOBS` further**. Edit the `Dockerfile` as described above and change the value to `1`. Then, restart the build.
-
-</details>
-
-#### 3. Choose Your Model
+---
 
-Refer to the compatibility table below to select a model that fits your hardware and requirements.
+### 2. Choose Your Model & Run the Container
 
----
+Refer to the table below to select a model that fits your hardware and requirements:
 
 | **Model** | **Params** | **VRAM (FP16/BF16)** | **Notes** |
 |-----------|------------|-----------------------|-----------|
+| `meta-llama/Llama-3.2-3B-Instruct` | 3B | ~7 GB | Good for 8 GB+ GPUs. |
 | `mistralai/Mistral-7B-Instruct-v0.1` | 7B | ~14–15 GB | Needs 16–24 GB VRAM. |
 | `meta-llama/Meta-Llama-3-8B-Instruct` | 8B | ~16 GB | Tight on 16 GB; better with 24 GB. |
-| `meta-llama/Meta-Llama-3-70B-Instruct` | 70B | 135–140 GB | Needs multi-GPU or heavy quantization. |
-| `mistralai/Mixtral-8x7B-Instruct-v0.1` | 8x7B (MoE) | ~90-100 GB (FP16/BF16) | **Disk Required: ~130 GB.** Often runs via expert routing; requires heavy quantization (e.g., Q8_0) or multiple H100s/A100s. |
 
 > **VRAM Tips:**
 > - Keep **10–15% VRAM free** for KV cache and overhead.
-> - **6–8 GB GPUs**: Stick to quantized 7B models.
+> - **6–8 GB GPUs**: Stick to quantized or smaller models.
 > - **12–16 GB GPUs**: Comfortable for 7B; tight for 8B.
 > - **24 GB+ GPUs**: Suitable for 13B or multi-instance setups.
-> - **MoE Models (e.g., Mixtral 8x7B)**: These models consume VRAM differently, and also have significant disk footprint. A full 8x7B in FP16/BF16 will require significantly more than 48GB VRAM (closer to 90-100GB), and around **130 GB of disk space for the weights**. Consider heavy quantization (e.g., Q8_0) or multi-GPU systems like multiple H100s for deployment.
-
-<details>
-<summary>Click to expand: Troubleshooting Model Compatibility (Build from Source)</summary>
-
-### Model Compatibility Beyond Mistral-7B (Build from Source)
-
-If you attempt to load a model other than `mistralai/Mistral-7B-Instruct-v0.1` and encounter errors such as `TypeError: TensorParallelColumnLinear.load_multi()` or `RuntimeError: weight ... does not exist`, these errors typically indicate version incompatibilities between PEFT, Transformers, and TGI components. The root issue is that the `fan_in_fan_out` parameter conflicts with TGI's tensor parallel implementations, and `TensorParallelColumnLinear` expects certain `base_layer` attributes that may not be present in all model variants or library versions.
 
-- **Note:** If your model requires `--trust-remote-code`, this is a flag and should be passed as `--trust-remote-code` (no value, not `--trust-remote-code=True`).
-- **`ImportError: No module named 'msgspec'` (for Qwen or other vLLM-dependent models):** `vLLM` may require the `msgspec` Python library. Add `msgspec` to `server/requirements.txt` and rebuild your Docker image with `--no-cache`.
-- **`TypeError` for `gpt2` (fan_in_fan_out):** This is a specific API mismatch between LoRAX's custom `FlashGPT2` modeling and the `vLLM` version. Ensure the `vLLM` commit in `server/Makefile-vllm` is `9985d06add07a4cc691dc54a7e34f54205c04d40` (the stable `0.7.3+` version) or a later compatible version like `0.8.2+`, and rebuild. The `--model-impl transformers` flag does *not* exist in `lorax-launcher`.
+#### Run the Container
 
-To attempt compatibility with a different model (e.g., `gpt2`):
-
-1. The `vLLM` inference engine version is crucial. In LoRAX, `vLLM` is pinned to a specific Git commit for stability. To change it, you need to **edit `server/Makefile-vllm`**.
-2. Rebuild the Docker image after making any changes.
-3. If you encounter errors related to missing weights or quantization, check the model's compatibility with the current `transformers` and `vLLM` versions.
-4. Change the commit hash (e.g., `766435e660a786933392eb8ef0a873bc38cf0c8b`) to **`9985d06add07a4cc691dc54a7e34f54205c04d40`** (a `vLLM 0.7.3+` version known for broader compatibility, including `gpt2`), or try a later compatible version such as `0.8.2+`.
-
-* **Potential `transformers` version adjustments:** If changing the `vLLM` commit doesn't resolve the issue, you *might* also need to modify the `transformers` version in `server/requirements.txt`. Research suggests `Transformers 4.49.0+` provides stable `gpt2` support with `vLLM 0.7.3+`. **Avoid `Transformers 4.48.x` with `vLLM 0.7.2` due to known Qwen model compatibility issues.**
-
-</details>
-
-#### 4. Run the Container
+Set your model and image name, then start the container:
 
 ```bash
-MODEL_ID="mistralai/Mistral-7B-Instruct-v0.1"
+MODEL_ID="mistralai/Mistral-7B-Instruct-v0.1" # or meta-llama/Llama-3.2-3B-Instruct, meta-llama/Meta-Llama-3-8B-Instruct
 SHARDED_MODEL="false" # Set to 'true' for sharded (multi-GPU) models like 70B
 PORT=80 # Host port to access the LoRAX server
 
+# For pre-built image:
+IMAGE_NAME="ghcr.io/predibase/lorax:main"
+# For source-built image:
+# IMAGE_NAME="my-lorax-server"
+
 docker run --rm \
   --name lorax \
   --gpus all \
@@ -454,11 +362,28 @@ docker run --rm \
   -v "$HOME/lorax_outlines_cache":/root/.cache/outlines \
   --user "$(id -u):$(id -g)" \
   -p ${PORT}:80 \
-  my-lorax-server \
+  "$IMAGE_NAME" \
   --model-id "$MODEL_ID" \
   --sharded "$SHARDED_MODEL"
 ```
 
+<details>
+<summary>Click to expand: Explanation of Docker Run Flags</summary>
+
+**What This Does:**
+- `docker run --rm --name lorax`: Starts a new container, removes it on exit, and names it `lorax`.
+- `--gpus all`: Grants the container access to all available GPUs.
+- `-e HUGGING_FACE_HUB_TOKEN`: Passes your Hugging Face authentication token.
+- `-v "$HOME/lorax_model_cache":/data`: Mounts a local directory for persistent model caching.
+- `-v "$HOME/lorax_outlines_cache":/root/.cache/outlines`: Mounts cache for Outlines library.
+- `--user "$(id -u):$(id -g)"`: Runs the container process as your host user for permission consistency.
+- `-p ${PORT}:80`: Maps the container's internal port 80 to your specified host port.
+- `$IMAGE_NAME`: Specifies the Docker image to use (pre-built or source-built).
+- `--model-id "$MODEL_ID"`: Sets the Hugging Face model to load.
+- `--sharded "$SHARDED_MODEL"`: Configures for multi-GPU sharding if set to `true`.
+
+</details>
+
 ---
 
 ## Phase 3: Test the API
@@ -521,6 +446,27 @@ your chosen base model.
 
 </details>
 
+<!-- Inserted section: Model Compatibility Beyond Mistral-7B (Build from Source) troubleshooting bullets -->
+
+<details>
+<summary>Model Compatibility Beyond Mistral-7B (Build from Source)</summary>
+
+**Common Issues & Solutions:**
+
+* **`TypeError: TensorParallelColumnLinear.load_multi() got an unexpected keyword argument 'fan_in_fan_out'` (for `gpt2`):**
+    * **Cause:** This error is specific to `gpt2`'s `Conv1D` layer architecture and an API mismatch with the `vLLM` integration in LoRAX's custom modeling.
+    * **Fix:** Pin `vLLM` to a compatible tag or commit in `server/Makefile-vllm` (e.g., the `v0.7.3` tag, or commit `9985d06add07a4cc691dc54a7e34f54205c04d40` if you specifically need it), then rebuild your Docker image. The `--model-impl transformers` flag, a workaround in some TGI contexts, is not supported by `lorax-launcher`.
+
+* **`ImportError: No module named 'msgspec'` (for `Qwen` models or others using newer `vLLM` features):**
+    * **Cause:** The `vLLM` version integrated in your build may require the `msgspec` Python library, which is not a default dependency.
+    * **Fix:** Add `msgspec` to your `server/requirements.txt` file and rebuild your Docker image with `--no-cache` to ensure the new dependency is installed.
+
+* **`RuntimeError: weight transformer.wte.weight does not exist` (for `bigcode/starcoder2-3b`):**
+    * **Cause:** The weight names or layout in the `bigcode/starcoder2-3b` checkpoint do not match what LoRAX's `FlashSantacoderModel` expects when it loads the model.
+    * **Fix:** This often requires deeper debugging of the model's weight structure or changes within `lorax_server/models/custom_modeling/flash_santacoder_modeling.py`. Consider this model a known edge case that may require specific code adjustments beyond standard dependency management.
+
+</details>
+
 ---
 
 ## 🧹 Cleanup & Reset

From 9583042b5a0b050a214450d91be897ed00dd381e Mon Sep 17 00:00:00 2001
From: Khoa Ngo <ngominhkhoa2006@gmail.com>
Date: Tue, 22 Jul 2025 21:05:40 +0700
Subject: [PATCH 10/12] add msgspec to requirements.txt

---
 lorax_deployment_playbook.md | 32 +++++++++++++++++++++-----------
 server/requirements.txt      |  1 +
 2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/lorax_deployment_playbook.md b/lorax_deployment_playbook.md
index 9a20cee76..7f99c3f56 100644
--- a/lorax_deployment_playbook.md
+++ b/lorax_deployment_playbook.md
@@ -298,25 +298,35 @@ docker pull ghcr.io/predibase/lorax:main
 
 ### 1. (Option B) Build the Image from Source
 
-> **Tip: Dramatically Speed Up Builds!**
+Want to build LoRAX from source for custom changes or the latest patches? Follow these steps:
+
+```bash
+# 1. Clone the repository (if you haven't already)
+git clone -b feat/deployment-playbook-enhancements https://github.com/minhkhoango/lorax.git
+cd lorax
+# 2. Initialize submodules
+git submodule update --init --recursive
+```
+
+> **Tip: Speed Up Your Build!**
 > 
-> By default, the Dockerfile sets `MAX_JOBS=2` to prevent out-of-memory (OOM) errors on machines with limited RAM. If you have a lot of RAM (e.g., 64GB, 96GB, or more), you can **dramatically speed up the build** by increasing this value.
+> By default, the Dockerfile uses `MAX_JOBS=2` to avoid out-of-memory (OOM) errors on machines with limited RAM. If you have a lot of RAM (e.g., 64GB, 96GB, or more), you can **dramatically speed up the build** by increasing this value.
 >
-> **How to do it:**
-> 1. Open the `Dockerfile` in your editor.
-> 2. Find the line:
+> **How to adjust build speed:**
+> 1. Open your `Dockerfile` at the root of your cloned repository (`~/lorax/Dockerfile`) in your editor.
+> 2. Locate the line:
 >    ```Dockerfile
 >    ENV MAX_JOBS=2
 >    ```
-> 3. Change `2` to a higher number (e.g., `16`, `24`, or `32`).
-> 4. Save the file and rebuild the image.
+>    (This line is typically found around line 90 in the `Dockerfile` within the `kernel-builder` stage, but verify its exact location).
+> 3. Change `2` to a higher number (e.g., `16`, `24`, or `32`) if your system has enough RAM.
+> 4. Save your `Dockerfile` and rebuild the image.
 >
-> **Warning:** Always monitor your RAM usage (e.g., with `htop`) during the build. If you run out of memory, reduce `MAX_JOBS` and try again.
+> *Not sure how much RAM you have? Run `htop` or `free -h` in your terminal. If you run out of memory during the build, lower `MAX_JOBS` and try again.*
+
+Now, build your Docker image:
 
 ```bash
-git clone -b feat/deployment-playbook-enhancements https://github.com/minhkhoango/lorax.git
-cd lorax
-git submodule update --init --recursive
 export DOCKER_BUILDKIT=1
 docker build -t my-lorax-server -f Dockerfile .
 ```
diff --git a/server/requirements.txt b/server/requirements.txt
index c808e2032..7e0f7c37a 100644
--- a/server/requirements.txt
+++ b/server/requirements.txt
@@ -29,6 +29,7 @@ jmespath==1.0.1 ; python_version >= "3.9" and python_version < "4.0"
 loguru==0.6.0 ; python_version >= "3.9" and python_version < "4.0"
 markupsafe==3.0.2 ; python_version >= "3.9" and python_version < "4.0"
 mpmath==1.3.0 ; python_version >= "3.9" and python_version < "4.0"
+msgspec
 multidict==6.1.0 ; python_version >= "3.9" and python_version < "4.0"
 networkx==3.2.1 ; python_version >= "3.9" and python_version < "4.0"
 numpy==1.26.4 ; python_version >= "3.9" and python_version < "4.0"

From 5f0081b1ab6808e54ab38571026b22171328594a Mon Sep 17 00:00:00 2001
From: Khoa Ngo <ngominhkhoa2006@gmail.com>
Date: Tue, 22 Jul 2025 21:47:00 +0700
Subject: [PATCH 11/12] revert back to origin

---
 server/Makefile-vllm    | 2 +-
 server/requirements.txt | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/server/Makefile-vllm b/server/Makefile-vllm
index 4baa87120..4c92391b3 100644
--- a/server/Makefile-vllm
+++ b/server/Makefile-vllm
@@ -4,7 +4,7 @@ vllm-cuda:
 	git clone https://github.com/vllm-project/vllm.git vllm
 
 build-vllm-cuda: vllm-cuda
-	cd vllm && git fetch --tags && git checkout v0.7.3
+	cd vllm && git fetch && git checkout 766435e660a786933392eb8ef0a873bc38cf0c8b
 	cd vllm && python setup.py build
 
 install-vllm-cuda: build-vllm-cuda
diff --git a/server/requirements.txt b/server/requirements.txt
index 7e0f7c37a..c808e2032 100644
--- a/server/requirements.txt
+++ b/server/requirements.txt
@@ -29,7 +29,6 @@ jmespath==1.0.1 ; python_version >= "3.9" and python_version < "4.0"
 loguru==0.6.0 ; python_version >= "3.9" and python_version < "4.0"
 markupsafe==3.0.2 ; python_version >= "3.9" and python_version < "4.0"
 mpmath==1.3.0 ; python_version >= "3.9" and python_version < "4.0"
-msgspec
 multidict==6.1.0 ; python_version >= "3.9" and python_version < "4.0"
 networkx==3.2.1 ; python_version >= "3.9" and python_version < "4.0"
 numpy==1.26.4 ; python_version >= "3.9" and python_version < "4.0"

From dd7ab6f50e30fadbfee23ed388a56ec79d6122be Mon Sep 17 00:00:00 2001
From: Khoa Ngo <ngominhkhoa2006@gmail.com>
Date: Wed, 23 Jul 2025 09:44:03 +0700
Subject: [PATCH 12/12] final commit before PR

---
 lorax_deployment_playbook.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/lorax_deployment_playbook.md b/lorax_deployment_playbook.md
index 7f99c3f56..d31da4390 100644
--- a/lorax_deployment_playbook.md
+++ b/lorax_deployment_playbook.md
@@ -302,6 +302,10 @@ Want to build LoRAX from source for custom changes or the latest patches? Follow
 
 ```bash
 # 1. Clone the repository (if you haven't already)
+#    NOTE: This guide clones a fork branch of the LoRAX repository that
+#    includes fixes for common on-premise deployment issues (e.g., build-time
+#    dependencies and submodule initialization). Once these fixes are
+#    merged upstream, you can use the official `predibase/lorax.git` repository.
 git clone -b feat/deployment-playbook-enhancements https://github.com/minhkhoango/lorax.git
 cd lorax
 # 2. Initialize submodules
@@ -354,7 +358,7 @@ Refer to the table below to select a model that fits your hardware and requireme
 Set your desired model and image name (see below):
 
 ```bash
-MODEL_ID="mistralai/Mistral-7B-Instruct-v0.1" # or meta-llama/Llama-3.2-3B-Instruct, meta-llama/Meta-Llama-3-8B-Instruct
+MODEL_ID="meta-llama/Llama-3.2-3B-Instruct" # or mistralai/Mistral-7B-Instruct-v0.1, meta-llama/Meta-Llama-3-8B-Instruct
 SHARDED_MODEL="false" # Set to 'true' for sharded (multi-GPU) models like 70B
 PORT=80 # Host port to access the LoRAX server