# r/LocalLLaMA Post — 2026-03-31

## Title

TurboQuant.cpp — Pure C inference engine with 3.8x KV cache compression. Runs Gemma 3 4B at 32K context using 1.2 GB KV instead of 4.4 GB.

## Body

We built a C inference engine from scratch focused on one thing llama.cpp doesn't do: **compressing the KV cache**.

At short contexts, KV memory doesn't matter much. But at 32K+ tokens, it becomes the dominant memory cost — often larger than the model weights themselves.

**The numbers (Gemma 3 4B):**

```
Context       llama.cpp KV (FP16)   TurboQuant KV (Q4)       Saved
───────────   ───────────────────   ──────────────────   ─────────
4K tokens                  544 MB               145 MB      399 MB
32K tokens               4,352 MB             1,156 MB    3,196 MB
128K tokens             17,408 MB             4,624 MB   12,784 MB
```
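
For context, the FP16 column is just the standard KV-size arithmetic. Here's a quick back-of-envelope in C; it assumes Gemma 3 4B's published config (34 layers, 4 KV heads, head dim 256), ignores sliding-window savings, and the ~4.25 bits per element on the Q4 line is simply back-solved from the ratio in the table:

```c
#include <stdio.h>

/* Back-of-envelope KV sizing. Config values assumed from the public
 * Gemma 3 4B card (34 layers, 4 KV heads, head_dim 256); sliding-window
 * layers would lower the FP16 number in practice. */
int main(void) {
    const double n_layers = 34, n_kv_heads = 4, head_dim = 256;
    const double n_tokens = 32 * 1024;
    /* one K vector and one V vector per layer per token */
    const double elems = 2 * n_layers * n_kv_heads * head_dim * n_tokens;

    printf("FP16 KV: %6.0f MiB\n", elems * 2.0 / (1024 * 1024));        /* 16 bits/elem    -> 4352 */
    printf("Q4   KV: %6.0f MiB\n", elems * (4.25 / 8) / (1024 * 1024)); /* ~4.25 bits/elem -> 1156 */
    return 0;
}
```

That lands on the 4,352 MB / 1,156 MB row above.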

3.8x compression with verified output quality (per-layer exact match against PyTorch).

**Speed is competitive, not the selling point:**
- Single-thread Q4: 51.1 tok/s (llama.cpp: 50.7 tok/s) on Qwen3.5-0.8B
- Same ballpark. We're not claiming to be faster.

**What's different:**
- 3.8x KV cache compression (TurboQuant/PolarQuant/QJL algorithms from ICLR 2026; rough Q4 sketch after this list)
- 3 models: Gemma 3 4B, Qwen3.5-0.8B, Gemma 3 270M
- Pure C, zero dependencies, ~1MB binary
- Multi-architecture: DeltaNet hybrid (Qwen) + sliding window (Gemma)
- Gemma 4 ready (same architecture family)
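
If you want to picture what "Q4 KV" means concretely: the snippet below is **not** the TurboQuant/PolarQuant/QJL math (that lives in the repo and the papers), just a minimal sketch of plain absmax block quantization to 4 bits, so the memory layout is obvious:

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Illustration only: absmax block quantization of 32 floats to 4 bits each,
 * packed two values per byte plus one scale per block. The real
 * TurboQuant/PolarQuant/QJL kernels are more involved than this. */
#define QK 32

typedef struct {
    float scale;        /* per-block scale (an engine would store this as FP16) */
    uint8_t q[QK / 2];  /* two 4-bit values per byte */
} block_q4;

static void quantize_block_q4(const float *x, block_q4 *b) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    b->scale = amax / 7.0f;                          /* map [-amax, amax] onto [-7, 7] */
    float inv = b->scale > 0.0f ? 1.0f / b->scale : 0.0f;
    for (int i = 0; i < QK; i += 2) {
        int lo = (int)lroundf(x[i]     * inv) + 8;   /* shift into the unsigned range [1, 15] */
        int hi = (int)lroundf(x[i + 1] * inv) + 8;
        b->q[i / 2] = (uint8_t)(lo | (hi << 4));
    }
}

static void dequantize_block_q4(const block_q4 *b, float *x) {
    for (int i = 0; i < QK; i += 2) {
        x[i]     = ((int)(b->q[i / 2] & 0x0F) - 8) * b->scale;
        x[i + 1] = ((int)(b->q[i / 2] >> 4)   - 8) * b->scale;
    }
}

int main(void) {
    float x[QK], y[QK];
    for (int i = 0; i < QK; i++) x[i] = sinf(0.37f * i);   /* stand-in K/V activations */
    block_q4 b;
    quantize_block_q4(x, &b);
    dequantize_block_q4(&b, y);
    printf("x[5] = % .4f   roundtrip = % .4f\n", x[5], y[5]);
    return 0;
}
```

One float scale per 32 values like this works out to ~5 bits per element (4.5 with an FP16 scale); hitting 3.8x against FP16 means squeezing closer to ~4.2 bits per element.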

**Quick start:**
```bash
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
bash scripts/quickstart.sh "What is deep learning?"
```

Built in 2 days. 9,000 lines of C. 20 test suites. First release: v0.1.0.

The KV compression matters most for long context on limited RAM — exactly the scenario local LLM users care about.

GitHub: https://github.com/quantumaikr/TurboQuant.cpp

---

## Posting Notes

- **Flair**: `New Model` or `Resource`
- **Best time**: UTC Tue-Thu 1-3 PM (US East morning)
- **Expected questions**:
  - "What about quality degradation?" → 0.999 cosine similarity, per-layer PyTorch match
  - "vs llama.cpp?" → Same speed, different value prop (KV compression)
  - "Only 3 models?" → Multi-arch engine, more coming. Gemma 4 ready.
  - "Q4 KV vs FP16 isn't fair" → Both are valid choices. We offer the option llama.cpp doesn't.