Commit ced73dd

Merge pull request #51 from SKaiNET-developers/feature/49-tool-calling-pipeline-oom
Feature/49 tool calling pipeline oom
2 parents f9110fb + b1c5457 commit ced73dd

13 files changed

Lines changed: 579 additions & 60 deletions

ISSUE-skainet-8b-oom.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
# Issue: Qwen3-8B OOM on 48GB Mac

## Problem

Running Qwen3-8B-Q4_K_M.gguf (4.7GB on disk) on a 48GB Mac fails with an out-of-memory (OOM) error during weight loading, both via kllama and via the unified skainet CLI.

## Root Cause

The current loading path uses `DEQUANTIZE_TO_FP32`, which expands the Q4 weights roughly 8x (each roughly 4-bit weight becomes a 32-bit float):

| Component                | Size          |
|--------------------------|---------------|
| Quantized weights (disk) | 4.7 GB        |
| Dequantized FP32 weights | ~37-40 GB     |
| KV cache (2048 context)  | 512 MB        |
| Embeddings, norms        | ~1 GB         |
| JVM + tokenizer          | ~2 GB         |
| **Total (in memory)**    | **~41-44 GB** |

A footprint of ~41-44 GB barely fits in 48GB, and the JVM needs additional headroom for temporary buffers during dequantization, so loading runs out of memory.

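The budget in the table can be checked with a few lines of arithmetic. The following is a minimal Python sketch (Python is used here only for illustration; the project itself runs on the JVM). All figures are copied from the table above, and the function and constant names are hypothetical.

```python
# Back-of-the-envelope memory budget for the DEQUANTIZE_TO_FP32 path.
# All sizes are in GB and are copied from the table above.

PHYSICAL_RAM_GB = 48.0
FP32_WEIGHTS_GB = (37.0, 40.0)   # low/high estimate after dequantization

# Overheads that exist regardless of how the weights are stored.
OVERHEAD_GB = {
    "kv_cache_2048_ctx": 0.5,    # 512 MB
    "embeddings_norms": 1.0,
    "jvm_tokenizer": 2.0,
}


def expanded_size_gb(disk_gb: float, factor: float = 8.0) -> float:
    """Size after expanding quantized weights by the given factor."""
    return disk_gb * factor


def total_footprint_gb(weights_gb: float) -> float:
    """FP32 weights plus all fixed overheads."""
    return weights_gb + sum(OVERHEAD_GB.values())


def headroom_gb(weights_gb: float) -> float:
    """RAM left for temporary buffers, the OS, and other processes."""
    return PHYSICAL_RAM_GB - total_footprint_gb(weights_gb)


if __name__ == "__main__":
    print(f"4.7 GB x 8 = {expanded_size_gb(4.7):.1f} GB of FP32 weights")
    for w in FP32_WEIGHTS_GB:
        print(f"weights {w:.0f} GB -> total {total_footprint_gb(w):.1f} GB, "
              f"headroom {headroom_gb(w):.1f} GB")
```

With these inputs the total lands between 40.5 and 43.5 GB, which leaves only 4.5 to 7.5 GB for everything else on the machine, including the temporary buffers used while dequantizing.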
## What Already Exists in the Codebase

### 1. NATIVE_OPTIMIZED quant policy (best option)

`QuantPolicy.NATIVE_OPTIMIZED` keeps weights in quantized form and uses SIMD-accelerated matmul kernels. `MemSegWeightConverter` converts raw Q4/Q8 bytes to 64-byte-aligned MemorySegment-backed tensors for Vector API dispatch.

- Memory: ~5GB for the 8B model (vs ~40GB with FP32)
- Speed: 1-3 tok/s (proven on Qwen2/3 via the kqwen runner)
- Already works for Qwen2/3 in kllama Main.kt (the `isQwen` path)

**Why it doesn't work today for the 8B:** The kllama `isQwen` path loads with `NATIVE_OPTIMIZED` but then creates a `LlamaRuntime`, which still transposes weight matrices to FP32 during init (`LlamaRuntime.kt:74`). That transpose step allocates FP32 copies of the weights, which undoes the memory savings.

### 2. Lazy per-layer dequantization (Apertus pattern)

`ApertusQuantizedRuntime` keeps weights quantized and dequantizes one projection at a time during `runLayer()`:

```
Resident: ~3.5GB (quantized) + ~100MB (norms/embeddings)
Per-layer temp: ~50MB (one projection, discarded after matmul)
```

This is the llama.cpp approach. It is not yet available for the LLaMA/Qwen runtimes.

### 3. Memory-mapped loading (F32 only)

`MmapLlamaLoader` maps the GGUF file via `MappedByteBuffer` for zero-copy tensor access. It only works for F32 models: Q4 models need dequantization, which defeats the zero-copy benefit.

## Proposed Solutions (ordered by effort)

### Solution A: Fix NATIVE_OPTIMIZED path for 8B models (small effort)

The kllama Main.kt Qwen path already loads with `NATIVE_OPTIMIZED`. The problem is that the `LlamaRuntime` constructor transposes weights to FP32. Possible fixes:

1. Skip the transpose for quantized tensors in `LlamaRuntime` init
2. Or use `OptimizedLLMRuntime`, which doesn't transpose (the DSL path)
3. In either case, ensure the SIMD matmul kernels handle the Q4_K_M format (Q4_K dispatch exists in `MemSegWeightConverter`)

**Expected result:** 8B Q4 loads in ~5GB, runs at 1-3 tok/s.

**Files to change:**
- `llm-inference/llama/.../LlamaRuntime.kt` -- skip transpose for quantized MemSeg tensors
- Or migrate Qwen path in `kllama/cli/Main.kt` to `OptimizedLLMRuntime` + `llamaNetwork()`

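Step 1 of Solution A amounts to a type check before the transpose. The following Python sketch illustrates only that dispatch; `Fp32Tensor`, `QuantizedTensor`, and `prepare_weight` are hypothetical stand-ins and not the actual `LlamaRuntime` types. It also assumes the matmul kernel can consume the untransposed quantized layout, which is exactly what step 3 asks to verify.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Fp32Tensor:
    """Dense row-major FP32 matrix (hypothetical stand-in)."""
    rows: int
    cols: int
    data: List[float]


@dataclass
class QuantizedTensor:
    """Opaque quantized payload plus its logical shape (hypothetical stand-in)."""
    rows: int
    cols: int
    raw: bytes


Tensor = Union[Fp32Tensor, QuantizedTensor]


def transpose_fp32(t: Fp32Tensor) -> Fp32Tensor:
    """Materialize a transposed FP32 copy. This is the allocation to avoid for Q4."""
    out = [0.0] * (t.rows * t.cols)
    for r in range(t.rows):
        for c in range(t.cols):
            out[c * t.rows + r] = t.data[r * t.cols + c]
    return Fp32Tensor(t.cols, t.rows, out)


def prepare_weight(t: Tensor) -> Tensor:
    """Transpose dense weights; pass quantized weights through untouched."""
    if isinstance(t, QuantizedTensor):
        return t  # no FP32 copy is allocated for quantized weights
    return transpose_fp32(t)
```

The design point is that the quantized branch returns the same object, so no new buffer is allocated for it during init.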
### Solution B: Port lazy dequant from Apertus to LLaMA (medium effort)

Port the `ApertusQuantizedRuntime` pattern to a `LlamaQuantizedRuntime`:

1. Store projections as `QuantizedTensor` (quantized bytes + metadata)
2. In `runLayer()`, dequantize one weight matrix at a time, run the matmul, then discard the FP32 copy
3. Keep embeddings and norms as FP32 (small, need element access)

**Expected result:** 8B Q4 loads in ~5GB, runs at ~1 tok/s (dequant overhead per layer).

**Files to create:**
- `llm-inference/llama/.../LlamaQuantizedRuntime.kt` (new, based on Apertus pattern)
- `llm-runtime/kllama/.../LlamaQuantizedWeights.kt` (new, mixed storage)

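The lazy-dequant steps listed for Solution B can be sketched in a few lines. This is a simplified Python illustration, not the `ApertusQuantizedRuntime` code: it uses a toy 8-bit block format with one scale per block instead of the real Q4_K layout, and every name in it (`QuantizedMatrix`, `run_layer`, `BLOCK`) is hypothetical.

```python
from dataclasses import dataclass
from typing import List

BLOCK = 4  # values per quantization block (toy size; real formats use larger blocks)


@dataclass
class QuantizedMatrix:
    rows: int
    cols: int
    scales: List[float]  # one FP32 scale per block
    q: bytes             # signed 8-bit values stored as raw bytes, row-major


def quantize(rows: int, cols: int, values: List[float]) -> QuantizedMatrix:
    """Quantize a flat row-major matrix block by block."""
    scales, q = [], bytearray()
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        scale = amax / 127.0 if amax else 1.0
        scales.append(scale)
        q.extend(round(v / scale) & 0xFF for v in block)  # two's complement byte
    return QuantizedMatrix(rows, cols, scales, bytes(q))


def dequantize(m: QuantizedMatrix) -> List[float]:
    """Expand ONE matrix to FP32; the caller discards it after use."""
    out = []
    for i, b in enumerate(m.q):
        signed = b - 256 if b > 127 else b
        out.append(signed * m.scales[i // BLOCK])
    return out


def matvec(dense: List[float], rows: int, cols: int, x: List[float]) -> List[float]:
    return [sum(dense[r * cols + c] * x[c] for c in range(cols)) for r in range(rows)]


def run_layer(projections: List[QuantizedMatrix], x: List[float]) -> List[float]:
    """Apply projections in sequence, holding at most one FP32 matrix at a time."""
    for w in projections:
        dense = dequantize(w)                 # temporary FP32 copy of one matrix
        x = matvec(dense, w.rows, w.cols, x)
        del dense                             # drop it before the next projection
    return x
```

Resident memory stays at the quantized size plus one temporary matrix. The cost is that every matrix is re-expanded on every forward pass, which is the per-layer overhead behind the ~1 tok/s estimate.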
### Solution C: SIMD-native matmul without dequantization (larger effort, best perf)

The SIMD backend (`skainet-backend-cpu`) already has Q4/Q8 matmul kernels via the Vector API. The issue is that the runtime layer doesn't use them directly. Changes needed in skainet core:

1. `skainet-backend-cpu`: Ensure a `matmul(FP32, Q4_K)` kernel exists and dispatches correctly
2. `LlamaRuntime` or `OptimizedLLMRuntime`: Accept mixed-precision weight tensors (Q4 weights, FP32 activations)
3. Skip the `MemSegWeightConverter` step entirely and use raw quantized MemorySegments

**Expected result:** 8B Q4 loads in ~5GB, runs at 2-5 tok/s (no dequant overhead).

**Files to change (in skainet core):**
- `skainet-backend-cpu`: Q4_K matmul kernel (may already exist)
- `skainet-lang-core`: Mixed-precision tensor support in matmul dispatch

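The difference between Solution C and lazy dequantization is that no FP32 copy of the weight matrix is ever built: the kernel multiplies the stored integers by the activations and applies each block's scale once. The Python sketch below shows that idea on a toy 8-bit block format. It is scalar code with hypothetical names, not the Vector API kernel and not the Q4_K layout.

```python
from typing import List, Sequence, Tuple

BLOCK = 4  # values per block (toy size; real formats use larger blocks)

QuantRow = Tuple[List[float], List[int]]  # (per-block scales, integer weights)


def quantize_row(values: Sequence[float]) -> QuantRow:
    """Quantize one weight row to small integers with one scale per block."""
    scales: List[float] = []
    q: List[int] = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        scale = amax / 127.0 if amax else 1.0
        scales.append(scale)
        q.extend(round(v / scale) for v in block)
    return scales, q


def dot_quantized(row: QuantRow, x: Sequence[float]) -> float:
    """Dot product of a quantized row with FP32 activations, block by block."""
    scales, q = row
    total = 0.0
    for b, scale in enumerate(scales):
        start = b * BLOCK
        acc = 0.0
        for qi, xi in zip(q[start:start + BLOCK], x[start:start + BLOCK]):
            acc += qi * xi        # integer weight times FP32 activation
        total += scale * acc      # the scale is applied once per block
    return total


def matvec_quantized(rows: Sequence[QuantRow], x: Sequence[float]) -> List[float]:
    """Mixed-precision matvec: quantized weights, FP32 activations."""
    return [dot_quantized(row, x) for row in rows]
```

In a SIMD kernel the inner loop over a block is what gets vectorized; the temporary state is one accumulator per block rather than a dequantized matrix.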
### Solution D: Memory-mapped quantized tensors (largest effort)

Extend `MmapLlamaLoader` to support quantized formats:

1. Map the GGUF file to virtual memory
2. Create quantized tensor views that reference mmap regions
3. Dequantize on the fly during matmul (like lazy dequant, but zero-copy from disk)

**Expected result:** Load time near-zero, ~5GB virtual (OS manages paging).

**Files to change:**
- `llm-inference/llama/.../MmapLlamaLoader.kt` -- extend to Q4/Q8 formats
- Requires `skainet-io-core` changes for mmap quantized tensor views

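The three steps of Solution D map onto a small amount of code. The sketch below uses Python's standard `mmap` module as a stand-in for `MappedByteBuffer`. The file layout (a float32 scale followed by int8 values per tensor) is a made-up toy format, not GGUF, and the class and function names are hypothetical.

```python
import mmap
import os
import struct
import tempfile


def write_demo_file(path, tensors):
    """Write tensors back to back as [float32 scale][int8 values]; return an index."""
    index = {}
    with open(path, "wb") as f:
        for name, (scale, values) in tensors.items():
            index[name] = (f.tell(), len(values))  # (byte offset, value count)
            f.write(struct.pack("<f", scale))
            f.write(struct.pack(f"<{len(values)}b", *values))
    return index


class MappedQuantizedFile:
    """Read-only mapping of a weights file; tensors are views into the mapping."""

    def __init__(self, path, index):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._view = memoryview(self._map)
        self._index = index

    def raw_view(self, name):
        """Zero-copy slice covering one tensor; release it before close()."""
        offset, count = self._index[name]
        return self._view[offset:offset + 4 + count]

    def dot(self, name, x):
        """Dequantize on the fly while accumulating; no FP32 tensor is stored."""
        offset, count = self._index[name]
        (scale,) = struct.unpack_from("<f", self._view, offset)
        acc = 0.0
        for i in range(count):
            (q,) = struct.unpack_from("<b", self._view, offset + 4 + i)
            acc += q * x[i]
        return scale * acc

    def close(self):
        self._view.release()
        self._map.close()
        self._file.close()


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    path = os.path.join(folder, "demo.bin")
    index = write_demo_file(path, {"w": (0.5, [2, -4, 6, 8])})
    weights = MappedQuantizedFile(path, index)
    print(weights.dot("w", [1.0, 1.0, 1.0, 1.0]))
    weights.close()
    os.remove(path)
    os.rmdir(folder)
```

Opening the mapping does not read the file; pages are faulted in by the OS when `dot` first touches them, which is why load time can be close to zero while the virtual size matches the file size.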
## Recommended Path

**Start with Solution A**: it's the smallest change and uses code that already works for Qwen2/3. The `NATIVE_OPTIMIZED` + `MemSegWeightConverter` path is proven; the only blocker is `LlamaRuntime`'s constructor transposing weights to FP32.

If that's not enough, **add Solution B** (lazy dequant), which gives the most control over memory at a known performance cost.

Solution C is the long-term goal (best performance) but requires skainet core changes.

docs/.docker/Dockerfile

Lines changed: 11 additions & 9 deletions
@@ -10,26 +10,28 @@ RUN apk add --no-cache chromium font-noto
 ENV PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \
 PUPPETEER_SKIP_DOWNLOAD=true
 
-WORKDIR /antora
-
-# Install Antora + extensions + mermaid-cli in one layer
-RUN npm i --save-exact \
+# Install Antora + extensions to /opt/antora (not /antora which gets volume-mounted)
+WORKDIR /opt/antora
+RUN npm init -y && npm i --save-exact \
 @antora/cli@3.1 \
 @antora/site-generator@3.1 \
 asciidoctor-kroki@0.18 \
 @mermaid-js/mermaid-cli@11 \
 && npm cache clean --force
 
-# Mermaid-cli config: use installed Chromium, no sandbox (container)
+# Make installed modules visible when workdir is the mounted project
+ENV NODE_PATH=/opt/antora/node_modules
+
+# Mermaid-cli config
 RUN echo '{ \
 "executablePath": "/usr/bin/chromium-browser", \
 "args": ["--no-sandbox", "--disable-gpu", "--disable-dev-shm-usage"] \
-}' > /antora/puppeteer-config.json
+}' > /opt/antora/puppeteer-config.json
 
-# Pre-generate a simple diagram to warm up and verify the stack works
+# Verify mermaid works
 RUN echo 'graph TD; A-->B;' > /tmp/test.mmd \
-&& npx mmdc -i /tmp/test.mmd -o /tmp/test.svg -p /antora/puppeteer-config.json \
+&& npx mmdc -i /tmp/test.mmd -o /tmp/test.svg -p /opt/antora/puppeteer-config.json \
 && rm /tmp/test.mmd /tmp/test.svg
 
-ENTRYPOINT ["npx", "antora"]
+ENTRYPOINT ["/opt/antora/node_modules/.bin/antora"]
 CMD ["--stacktrace", "antora-playbook.yml"]

docs/antora-playbook.yml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ site:
 
 content:
   sources:
-  - url: .
+  - url: /antora
     start_path: docs
     branches: HEAD
 
docs/modules/ROOT/nav.adoc

Lines changed: 1 addition & 0 deletions
@@ -23,3 +23,4 @@
 * xref:explanation/pipeline-design.adoc[Pipeline Design Decisions]
 * xref:explanation/dsl-vs-handcoded.adoc[DSL Networks vs Hand-Coded Runtimes]
 * xref:explanation/tokenizer-internals.adoc[Tokenizer Internals]
+* xref:explanation/weight-quantization.adoc[Weight Quantization and Numeric Representation]
