# TurboQuant.cpp v0.2 — Every Claim Now Has a Number

We shipped V cache quantization and a full validation suite. Here's what changed.

## What v0.2 adds

**V quantization.** Keys were already 1-bit. Now values are Q4 or Q2.

```
Gemma 3 4B — total K+V per token:

  FP16 baseline:    136.00 KB
  1-bit K + Q4 V:    27.62 KB (4.9x compression)
  1-bit K + Q2 V:    19.12 KB (7.1x compression)
```

At 32K context, that's up to 3.7 GB saved vs FP16 (in Q2 mode). "Paris" still comes out as "Paris."
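
The ratios and the 32K-context savings are plain arithmetic on the per-token sizes in the table above; a quick sanity check:

```python
# Sanity-check the compression table: per-token K+V sizes in KB,
# taken directly from the measured numbers above.
fp16_kb = 136.00        # FP16 baseline, total K+V per token
q4_kb = 27.62           # 1-bit K + Q4 V
q2_kb = 19.12           # 1-bit K + Q2 V

print(f"Q4 compression: {fp16_kb / q4_kb:.1f}x")   # 4.9x
print(f"Q2 compression: {fp16_kb / q2_kb:.1f}x")   # 7.1x

# Savings at a 32K (32768-token) context, in GB.
ctx = 32768
saved_gb = (fp16_kb - q2_kb) * ctx / (1024 ** 2)
print(f"Saved at 32K, Q2 mode: {saved_gb:.2f} GB")  # ~3.65 GB
```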

**Validation.** We found a NEON bug, fixed it, then validated everything:

- 14 tests comparing every NEON path against scalar reference
- 5 tests proving Lloyd-Max codebook centroids match theory within 0.001
- 8 tests measuring attention score distribution preservation
- 29 edge-case tests (NaN, Inf, single token, zero dim, 10K keys)
- ASan + UBSan clean on all 26 test suites
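
As a flavor of what the codebook tests verify: Lloyd's algorithm on a unit Gaussian should converge to the textbook optimal 2-bit centroids, roughly ±0.4528 and ±1.5104. A minimal self-contained sketch of that theory check, not the library's code:

```python
import numpy as np

# Lloyd's algorithm for the optimal 4-level (2-bit) scalar quantizer of N(0,1),
# using a dense grid as a stand-in for the continuous Gaussian density.
x = np.linspace(-6, 6, 20001)
pdf = np.exp(-x**2 / 2)                        # unnormalized N(0,1) density

centroids = np.array([-1.5, -0.5, 0.5, 1.5])   # initial guess
for _ in range(200):
    # Assign each grid point to its nearest centroid...
    idx = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)
    # ...then move each centroid to the conditional mean of its cell.
    centroids = np.array([np.average(x[idx == k], weights=pdf[idx == k])
                          for k in range(4)])

print(np.round(centroids, 4))   # textbook optimum: +/-0.4528, +/-1.5104
```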

## The numbers that matter

| What | Measured | How to reproduce |
|------|----------|------------------|
| Attention cosine (1-bit) | 0.634 | `test_attention_distribution` |
| Theoretical limit (2/pi) | 0.637 | established in the JL literature |
| Random K cosine | 0.089 | `test_attention_distribution` |
| Codebook MSE vs optimal | < 1.18x | `test_codebook_theory` |
| RHT overhead | 147 ns/vec | `bench_kv_overhead` |
| 1-bit attention | 1.2 ns/key | `bench_kv_overhead` |

The 1-bit cosine of 0.634 matches 2/pi ≈ 0.637. This is not a deficiency — it's the information-theoretic maximum for sign-only quantization. Our implementation reaches the theoretical wall.
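
One simple model that produces exactly the 2/pi correlation: sign-quantize both sides of a Gaussian dot product. The Monte Carlo sketch below assumes i.i.d. Gaussian coordinates (the regime a random rotation such as the RHT is designed to induce); it is an illustration of the theory, not the library's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 64

# Gaussian queries/keys model the coordinates after a random rotation.
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))

exact = np.einsum('nd,nd->n', q, k)                     # true attention logits
onebit = np.einsum('nd,nd->n', np.sign(q), np.sign(k))  # sign-only scores

corr = np.corrcoef(exact, onebit)[0, 1]
print(f"corr = {corr:.3f}, 2/pi = {2 / np.pi:.3f}")     # both ~0.637
```

The closed form is short: per coordinate, E[q·sign(q)] = sqrt(2/pi), so the covariance of the two score estimates is d·(2/pi) while each has variance d, giving correlation 2/pi regardless of dimension.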

## What we fixed

- **Q4 dequant NEON bug**: Nibble interleaving was wrong, causing 300x worse MSE. Found by testing, fixed with `vzip_u8`.
- **QJL sign bias**: `>= 0.0f` → `> 0.0f` across 11 call sites (CPU/CUDA/Metal).
- **Norm overflow**: Large vectors could overflow `sum += x*x`. Added max-abs rescaling.
- **Thread safety**: Mutex guards on global workspace realloc.
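
To see how a nibble-order bug inflates MSE without crashing anything: if the unpack order doesn't mirror the pack order, every adjacent pair of Q4 codes comes out transposed, so the values are all valid codes but in the wrong positions. The packing convention below (even index in the low nibble) is an assumption for illustration, not necessarily TurboQuant's actual layout:

```python
import numpy as np

rng = np.random.default_rng(1)
codes = rng.integers(0, 16, size=1024).astype(np.uint8)  # 4-bit codes

# Pack: even-index code in the low nibble, odd-index code in the high nibble.
packed = (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

# Correct unpack mirrors the packing order.
good = np.empty_like(codes)
good[0::2] = packed & 0x0F
good[1::2] = packed >> 4

# Buggy unpack swaps the nibbles: adjacent pairs come out transposed.
bad = np.empty_like(codes)
bad[0::2] = packed >> 4
bad[1::2] = packed & 0x0F

print("correct MSE:", ((good.astype(int) - codes.astype(int)) ** 2).mean())
print("swapped MSE:", ((bad.astype(int) - codes.astype(int)) ** 2).mean())
```

For uniform random codes the swapped-pair MSE lands around 42 against an exact 0 for the correct unpack, which is why the bug was so visible once a scalar-reference test existed.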
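
The norm fix uses the classic max-abs rescaling trick: divide by the largest magnitude before squaring so the partial sums stay in range, then multiply the scale back at the end. A float32 sketch of the idea (illustrative, not the library's implementation):

```python
import numpy as np

def l2_norm_safe(x: np.ndarray) -> np.float32:
    """L2 norm with max-abs rescaling to avoid overflow in the squared sum."""
    m = np.abs(x).max()
    if m == 0:
        return np.float32(0.0)
    y = x / m                      # now every |y_i| <= 1, so y*y cannot overflow
    return m * np.sqrt((y * y).sum())

x = np.full(1024, 1e20, dtype=np.float32)   # (1e20)^2 = 1e40 > float32 max

naive = np.sqrt((x * x).sum())              # inf: the squares overflow
safe = l2_norm_safe(x)                      # 1e20 * sqrt(1024) = 3.2e21

print(naive, safe)
```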

## What's honest

- 7.1x is total K+V, not K-only. Previous "10.7x" was K-only — now clearly labeled.
- With V quantization (Q4/Q2), outputs diverge from baseline. They remain coherent and factually correct, but are not byte-identical.
- The 30/30 byte-identical result applies to K-only mode (V stays FP16).
- 1-bit attention cosine = 0.634, not 0.99. This is optimal for 1 bit. Want higher? Use 3-bit (0.918).

## Try it

```bash
git clone https://github.com/quantumaikr/TurboQuant.cpp && cd TurboQuant.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release -DTQ_BUILD_TESTS=ON
cmake --build build -j$(nproc)
ctest --test-dir build   # 26/26 should pass
./build/tq_run gemma3-4b.tqm -p "1+1=" -j 6 -n 5 -T 0.0 -k turbo_kv_1b -v q4 -M
```

---

[GitHub](https://github.com/quantumaikr/TurboQuant.cpp) | [Release Notes](../RELEASE_NOTES.md) | [Paper](https://arxiv.org/abs/2504.19874)