|
| 1 | +# Issue: Qwen3-8B OOM on 48GB Mac |
| 2 | + |
| 3 | +## Problem |
| 4 | + |
| 5 | +Running Qwen3-8B-Q4_K_M.gguf (4.7GB on disk) on a 48GB Mac fails with OOM during weight loading, both via kllama and the unified skainet CLI. |
| 6 | + |
| 7 | +## Root Cause |
| 8 | + |
| 9 | +The current loading path uses `DEQUANTIZE_TO_FP32`, which expands every 4-bit weight to a 32-bit float, a nominal 8x increase: |
| 10 | + |
| 11 | +| Component | Size | |
| 12 | +|--------------------------|-----------| |
| 13 | +| Quantized weights (disk) | 4.7 GB | |
| 14 | +| Dequantized FP32 weights | ~37-40 GB | |
| 15 | +| KV cache (2048 context) | 512 MB | |
| 16 | +| Embeddings, norms | ~1 GB | |
| 17 | +| JVM + tokenizer | ~2 GB | |
| 18 | +| **Total** | **~41 GB** | |
| 19 | + |
| 20 | +The ~41 GB total nominally fits in 48 GB, but it leaves too little headroom: the JVM also needs temporary buffers during dequantization, so loading runs out of memory. |
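
The arithmetic behind the table can be reproduced with a short script. This is a rough sketch in Python, chosen only for brevity; it is not part of the codebase, and the function names are made up for illustration. It assumes the nominal 4 bits per weight. Q4_K_M also stores per-block scales, so its effective bits per weight are somewhat higher and the real expansion is somewhat below 8x.

```python
# Back-of-envelope memory budget for the FP32 loading path.
# All inputs are the estimates from the table above, in GB.

def fp32_size_gb(quantized_gb: float, bits_per_weight: float = 4.0) -> float:
    """Size of the weights after expanding each one to a 32-bit float."""
    return quantized_gb * 32.0 / bits_per_weight


def steady_state_gb(weights_gb: float, kv_cache_gb: float = 0.5,
                    embeddings_gb: float = 1.0, jvm_gb: float = 2.0) -> float:
    """Weights plus the fixed overheads listed in the table."""
    return weights_gb + kv_cache_gb + embeddings_gb + jvm_gb


fp32_weights = fp32_size_gb(4.7)        # nominal 8x expansion of the 4.7 GB file
total = steady_state_gb(fp32_weights)
headroom = 48.0 - total                 # left for the OS and temporary buffers

print(f"FP32 weights: {fp32_weights:.1f} GB")
print(f"Total:        {total:.1f} GB")
print(f"Headroom:     {headroom:.1f} GB")
```

The headroom figure is what has to cover the operating system, other processes, and any transient buffers allocated while tensors are being converted.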
| 21 | + |
| 22 | +## What Already Exists in the Codebase |
| 23 | + |
| 24 | +### 1. NATIVE_OPTIMIZED quant policy (best option) |
| 25 | + |
| 26 | +`QuantPolicy.NATIVE_OPTIMIZED` keeps weights in quantized form and uses SIMD-accelerated matmul kernels. `MemSegWeightConverter` converts raw Q4/Q8 bytes to 64-byte-aligned MemorySegment-backed tensors for Vector API dispatch. |
| 27 | + |
| 28 | +- Memory: ~5GB for the 8B model (vs 40GB with FP32) |
| 29 | +- Speed: 1-3 tok/s (proven on Qwen2/3 via kqwen runner) |
| 30 | +- Already works for Qwen2/3 in kllama Main.kt (the `isQwen` path) |
| 31 | + |
| 32 | +**Why it doesn't work today for the 8B:** The kllama `isQwen` path loads with `NATIVE_OPTIMIZED`, but it then creates a `LlamaRuntime`, which still transposes weight matrices to FP32 during init (`LlamaRuntime.kt:74`). The transpose step allocates FP32 copies of the weights, which undoes the memory savings of keeping them quantized. |
| 33 | + |
| 34 | +### 2. Lazy per-layer dequantization (Apertus pattern) |
| 35 | + |
| 36 | +`ApertusQuantizedRuntime` keeps weights quantized and dequantizes one projection at a time during `runLayer()`: |
| 37 | + |
| 38 | +``` |
| 39 | +Resident: ~3.5GB (quantized) + ~100MB (norms/embeddings) |
| 40 | +Per-layer temp: ~50MB (one projection, discarded after matmul) |
| 41 | +``` |
| 42 | + |
| 43 | +This is similar in spirit to llama.cpp, which also keeps weights quantized in memory. It is not yet available for the LLaMA/Qwen runtimes. |
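
The lazy pattern described above can be sketched as follows. This is a minimal illustration in Python, not the `ApertusQuantizedRuntime` code: the block format is a simplified symmetric 8-bit scheme rather than Q4_K_M, and `LazyProjection`, `quantize_row`, `dequantize_row`, and `run_layer` are hypothetical names. The point it shows is that the FP32 copy of a projection exists only for the duration of one matmul.

```python
from typing import List, Sequence, Tuple

BLOCK_SIZE = 32                    # values per quantization block (illustrative)
Block = Tuple[float, List[int]]    # (scale, 8-bit quantized values)


def quantize_row(row: Sequence[float]) -> List[Block]:
    """Symmetric 8-bit block quantization, a stand-in for a real GGUF format."""
    blocks = []
    for start in range(0, len(row), BLOCK_SIZE):
        chunk = row[start:start + BLOCK_SIZE]
        scale = max(abs(v) for v in chunk) / 127.0 or 1.0  # avoid a zero scale
        blocks.append((scale, [round(v / scale) for v in chunk]))
    return blocks


def dequantize_row(blocks: List[Block]) -> List[float]:
    """Expand one quantized row back to floats."""
    return [scale * q for scale, qs in blocks for q in qs]


class LazyProjection:
    """Keeps a weight matrix quantized; floats exist only inside matvec()."""

    def __init__(self, matrix: Sequence[Sequence[float]]):
        self.rows = [quantize_row(r) for r in matrix]

    def matvec(self, x: Sequence[float]) -> List[float]:
        # Temporary float copy of this one projection; it becomes garbage
        # as soon as the method returns.
        weights = [dequantize_row(b) for b in self.rows]
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def run_layer(projections: List[LazyProjection], x: List[float]) -> List[float]:
    # A real layer also has attention, norms, and residuals; chaining the
    # projections here only illustrates "dequantize, matmul, discard".
    for proj in projections:
        x = proj.matvec(x)
    return x


proj = LazyProjection([[1.0, -2.0, 0.5, 0.0], [0.25, 0.25, 0.25, 0.25]])
y = run_layer([proj], [1.0, 1.0, 1.0, 1.0])
print(y)  # close to [-0.5, 1.0], up to quantization error
```

Peak memory is then the quantized weights plus one projection's worth of floats, at the cost of repeating the dequantization on every forward pass.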
| 44 | + |
| 45 | +### 3. Memory-mapped loading (F32 only) |
| 46 | + |
| 47 | +`MmapLlamaLoader` maps the GGUF file via `MappedByteBuffer` for zero-copy tensor access. It only works for F32 models: Q4 models must be dequantized into a separate buffer, which defeats the zero-copy benefit. |
| 48 | + |
| 49 | +## Proposed Solutions (ordered by effort) |
| 50 | + |
| 51 | +### Solution A: Fix NATIVE_OPTIMIZED path for 8B models (small effort) |
| 52 | + |
| 53 | +The kllama Main.kt Qwen path already loads with `NATIVE_OPTIMIZED`. The problem is that the `LlamaRuntime` constructor transposes weights to FP32. To fix it, apply step 1 or step 2, then verify step 3: |
| 54 | + |
| 55 | +1. Skip the transpose for quantized tensors in `LlamaRuntime` init |
| 56 | +2. Or use `OptimizedLLMRuntime`, which doesn't transpose (the DSL path) |
| 57 | +3. Ensure the SIMD matmul kernels handle the Q4_K_M format (Q4_K dispatch exists in `MemSegWeightConverter`) |
| 58 | + |
| 59 | +**Expected result:** 8B Q4 loads in ~5GB, runs at 1-3 tok/s. |
| 60 | + |
| 61 | +**Files to change:** |
| 62 | +- `llm-inference/llama/.../LlamaRuntime.kt` -- skip transpose for quantized MemSeg tensors |
| 63 | +- Or migrate Qwen path in `kllama/cli/Main.kt` to `OptimizedLLMRuntime` + `llamaNetwork()` |
| 64 | + |
| 65 | +### Solution B: Port lazy dequant from Apertus to LLaMA (medium effort) |
| 66 | + |
| 67 | +Port the `ApertusQuantizedRuntime` pattern to a `LlamaQuantizedRuntime`: |
| 68 | + |
| 69 | +1. Store projections as `QuantizedTensor` (quantized bytes + metadata) |
| 70 | +2. In `runLayer()`, dequantize one weight matrix at a time, matmul, discard |
| 71 | +3. Keep embeddings and norms as FP32 (small, need element access) |
| 72 | + |
| 73 | +**Expected result:** 8B Q4 loads in ~5GB, runs at ~1 tok/s (dequant overhead per layer). |
| 74 | + |
| 75 | +**Files to create:** |
| 76 | +- `llm-inference/llama/.../LlamaQuantizedRuntime.kt` (new, based on Apertus pattern) |
| 77 | +- `llm-runtime/kllama/.../LlamaQuantizedWeights.kt` (new, mixed storage) |
| 78 | + |
| 79 | +### Solution C: SIMD-native matmul without dequantization (larger effort, best perf) |
| 80 | + |
| 81 | +The SIMD backend (`skainet-backend-cpu`) already has Q4/Q8 matmul kernels via the Vector API. The issue is that the runtime layer doesn't call them directly. Changes needed in skainet core: |
| 82 | + |
| 83 | +1. `skainet-backend-cpu`: Ensure `matmul(FP32, Q4_K)` kernel exists and dispatches correctly |
| 84 | +2. `LlamaRuntime` or `OptimizedLLMRuntime`: Accept mixed-precision weight tensors (Q4 weights, FP32 activations) |
| 85 | +3. Skip the `MemSegWeightConverter` step entirely — use raw quantized MemorySegments |
| 86 | + |
| 87 | +**Expected result:** 8B Q4 loads in ~5GB, runs at 2-5 tok/s (no dequant overhead). |
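
What "matmul without dequantization" means for a single output element can be sketched as below. This is a scalar Python illustration of the idea, not the `skainet-backend-cpu` kernel: the `(scale, ints)` block layout is a simplified assumption (real Q4_K blocks are more involved), and `quantized_dot` is a hypothetical name. A SIMD kernel would vectorize the inner loop.

```python
from typing import List, Sequence, Tuple

Block = Tuple[float, List[int]]  # (scale, quantized values), illustrative layout


def quantized_dot(blocks: Sequence[Block], x: Sequence[float]) -> float:
    """Dot product of one quantized weight row with an FP32 activation vector.

    No float copy of the row is ever materialized: each block's integers are
    multiplied against the activations, and the scale is applied once per block.
    """
    acc = 0.0
    offset = 0
    for scale, qs in blocks:
        partial = 0.0
        for i, q in enumerate(qs):
            partial += q * x[offset + i]
        acc += scale * partial
        offset += len(qs)
    return acc


# Two blocks that encode the row [1.0, -2.0, 1.0, 2.0].
row_blocks = [(0.5, [2, -4]), (0.25, [4, 8])]
activations = [1.0, 1.0, 2.0, 0.5]
result = quantized_dot(row_blocks, activations)
print(result)
```

Because the scale is factored out of the inner loop, the per-element work is a single multiply-add on the raw quantized value, which is where the expected speedup over lazy dequantization comes from.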
| 88 | + |
| 89 | +**Files to change (in skainet core):** |
| 90 | +- `skainet-backend-cpu`: Q4_K matmul kernel (may already exist) |
| 91 | +- `skainet-lang-core`: Mixed-precision tensor support in matmul dispatch |
| 92 | + |
| 93 | +### Solution D: Memory-mapped quantized tensors (largest effort) |
| 94 | + |
| 95 | +Extend `MmapLlamaLoader` to support quantized formats: |
| 96 | + |
| 97 | +1. Map the GGUF file to virtual memory |
| 98 | +2. Create quantized tensor views that reference mmap regions |
| 99 | +3. Dequantize on-the-fly during matmul (like lazy dequant but zero-copy from disk) |
| 100 | + |
| 101 | +**Expected result:** Load time near-zero, ~5GB virtual (OS manages paging). |
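
Steps 1 and 2 can be illustrated with a small Python sketch that uses the standard `mmap` module as a stand-in for `MappedByteBuffer`. It is an assumption-laden toy: the file is a throwaway 16-byte file rather than a GGUF file, the offset and length are hard-coded where a real loader would read them from the file header, and `map_file` and `tensor_view` are hypothetical helper names.

```python
import mmap
import os
import tempfile


def map_file(path: str) -> mmap.mmap:
    """Map the whole file read-only; the OS pages data in on first access."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def tensor_view(mapping: mmap.mmap, offset: int, length: int) -> memoryview:
    """Zero-copy view of `length` bytes starting at `offset` in the mapping."""
    return memoryview(mapping)[offset:offset + length]


# Demo with a throwaway file standing in for a model file.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(bytes(range(16)))

mapping = map_file(path)
view = tensor_view(mapping, offset=4, length=4)
tensor_bytes = bytes(view)  # copied here only so the contents can be inspected

# Views must be released before the mapping can be closed.
view.release()
mapping.close()
os.remove(path)

print(list(tensor_bytes))
```

Creating the view allocates nothing proportional to the tensor size; the quantized bytes stay in the page cache and are only touched when a kernel reads them.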
| 102 | + |
| 103 | +**Files to change:** |
| 104 | +- `llm-inference/llama/.../MmapLlamaLoader.kt` -- extend to Q4/Q8 formats |
| 105 | +- Requires `skainet-io-core` changes for mmap quantized tensor views |
| 106 | + |
| 107 | +## Recommended Path |
| 108 | + |
| 109 | +**Start with Solution A**: it's the smallest change and uses code that already works for Qwen2/3. The `NATIVE_OPTIMIZED` + `MemSegWeightConverter` path is proven; the only known blocker is the `LlamaRuntime` constructor transposing weights to FP32. |
| 110 | + |
| 111 | +If that's not enough, **add Solution B** (lazy dequant), which gives the most control over memory at a known performance cost. |
| 112 | + |
| 113 | +Solution C is the long-term goal (best performance) but requires skainet core changes. |