|
4 | 4 |
|
5 | 5 | []() |
6 | 6 | []() |
7 | | -[]() |
| 7 | +[]() |
8 | 8 |
|
9 | 9 | ### Up to 7.1x total K+V compression. Quality preserved. |
10 | 10 |
|
@@ -121,7 +121,7 @@ Multi-architecture: Qwen3.5 (DeltaNet hybrid) + Gemma 3 (sliding window). Gemma |
121 | 121 | - **Faithful ICLR 2026 implementation** — RHT + Lloyd-Max + QJL residual |
122 | 122 | - **Multi-architecture** — Qwen3.5 (DeltaNet) + Gemma 3 (sliding window + GeGLU) |
123 | 123 | - **NEON vectorized** — matmul, attention, Hamming distance, FP16 conversion |
124 | | -- **25 test suites** — KV roundtrip, attention accuracy, codebook, Q2 weights, NEON consistency, attention distribution |
| 124 | +- **26 test suites** — KV roundtrip, attention accuracy, codebook, Q2 weights, NEON consistency, attention distribution |
125 | 125 |
|
126 | 126 | --- |
127 | 127 |
|
@@ -185,27 +185,31 @@ inference to catch memory errors. No leaks or undefined behavior detected. |
185 | 185 |
|
186 | 186 | **Q: "Byte-identical output just means K doesn't matter, right?"** |
187 | 187 |
|
188 | | -No. Replacing K with random values produces garbage output immediately. TurboQuant preserves inner product ranking -- verified via attention score cosine similarity > 0.99 (uniform_4b), > 0.92 (turbo_kv_3b), and > 0.63 (turbo_kv_1b) across 32 keys averaged over 10 trials. Random keys average < 0.09 cosine. See `tests/test_attention_distribution.cpp`. |
| 188 | +No. Replacing K with random values produces garbage immediately (cosine < 0.09). TurboQuant preserves inner product ranking -- measured attention score cosine: uniform_4b = 0.996, turbo_kv_3b = 0.918, turbo_kv_1b = 0.634 (10-trial avg, 32 keys). The 1-bit cosine of 0.634 matches the information-theoretic limit of 2/pi = 0.637 for sign quantization -- this is mathematically optimal, not a deficiency. See `tests/test_attention_distribution.cpp`. |
189 | 189 |
|
190 | 190 | **Q: "How is this different from llama.cpp's Q4 KV?"** |
191 | 191 |
|
192 | | -llama.cpp uses uniform min-max quantization. TurboQuant uses RHT + Lloyd-Max codebook optimized for the post-rotation Gaussian distribution. At 2-bit, uniform quantization achieves 0.96 attention cosine, while TurboQuant 3-bit (2-bit codebook + 1-bit QJL) achieves 0.92 with provably unbiased inner product estimation via the QJL residual correction term. The mathematical guarantee matters more at scale. |
| 192 | +llama.cpp uses uniform min-max quantization. TurboQuant uses RHT + Lloyd-Max codebook optimized for the post-rotation Gaussian distribution. The Lloyd-Max centroids are verified against theory (MSE within 1.18x of information-theoretic optimal, tested in `tests/test_codebook_theory.cpp`). The QJL residual provides provably unbiased inner product estimation -- the mathematical guarantee matters at scale. |
193 | 193 |
|
194 | 194 | **Q: "What about perplexity?"** |
195 | 195 |
|
196 | | -Attention score distribution is preserved with Spearman rank correlation > 0.90 (turbo_kv_3b) and > 0.63 (turbo_kv_1b). Greedy decode matches up to ~120 tokens. Full perplexity benchmarks on standard datasets are in progress. |
| 196 | +Attention score distribution is preserved: Spearman rank correlation = 0.990 (uniform_4b), 0.900 (turbo_kv_3b), 0.632 (turbo_kv_1b). Greedy decode matches up to ~120 tokens. The 1-bit cosine of 0.634 = 2/pi is the theoretical maximum for sign-only quantization (proven in JL literature). Full perplexity on standard datasets is in progress. |
197 | 197 |
|
198 | 198 | **Q: "Is the NEON code correct?"** |
199 | 199 |
|
200 | | -All NEON paths are verified against scalar reference implementations in `tests/test_neon_scalar.cpp` and `tests/test_simd_neon.cpp`. ASan + UBSan pass on all 25 test suites with zero errors. |
| 200 | +Every NEON path (Q4 dequant, RHT butterfly, matmul, RMSNorm, RoPE, Hamming attention) is verified against scalar reference in `tests/test_neon_scalar.cpp`. The Q4 dequant had a nibble-interleaving bug that was caught and fixed. ASan + UBSan pass on all 26 test suites with zero errors. NaN/Inf/edge-case inputs tested in `tests/test_edge_cases.cpp` (29 cases). |
| 201 | + |
| 202 | +**Q: "What about thread safety?"** |
| 203 | + |
| 204 | +Global workspaces (Q8 quantization buffer, sampler probability index) are mutex-protected to prevent concurrent realloc races. The thread pool uses a single dispatch mutex. Concurrent multi-context usage is safe at the API level. |
201 | 205 |
|
202 | 206 | **Q: "Only 4B model -- what about 8B+?"** |
203 | 207 |
|
204 | 208 | Architecture is model-size independent. Gemma 3 4B and Qwen3.5 0.8B use the same code path. 8B support is planned (Llama 3.1 8B architecture support in progress). |
205 | 209 |
|
206 | 210 | **Q: "RHT overhead?"** |
207 | 211 |
|
208 | | -RHT is O(d log d) per vector. Measured overhead: 103 ns per 128-dim vector. Compared to matmul cost (~1ms per layer), RHT is negligible. Full quantization timing: uniform_4b = 217 ns, turbo_kv_1b = 649 ns, turbo_kv_3b = 11710 ns per vector. See `bench/bench_kv_overhead.cpp`. |
| 212 | +RHT is O(d log d) per vector, NEON-vectorized. Measured: 147 ns per 128-dim vector. Full quantization: uniform_4b = 148 ns, turbo_kv_1b = 659 ns, turbo_kv_3b = 11066 ns per vector. 1-bit attention: 1.2 ns/key (XOR+popcount). Compared to matmul (~1ms/layer), all overhead is negligible. See `bench/bench_kv_overhead.cpp`. |
209 | 213 |
|
210 | 214 | --- |
211 | 215 |
|
|
0 commit comments