# TurboQuant Paper Deep Analysis & Implementation Gap Assessment

**Paper**: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
**Authors**: Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni (Google Research / Google DeepMind)
**Published**: arXiv:2504.19874, April 2025 (accepted at ICLR 2026)

---

## 1. Paper Core Algorithm

TurboQuant is a **two-stage** vector quantization algorithm:

### Stage 1: TurboQuant_mse (MSE-optimal quantizer)

**Algorithm 1 — Quantize:**
1. Generate a random rotation matrix **Pi** (orthogonal, d x d)
2. Pre-compute **codebook** centroids c_1...c_{2^b} that minimize MSE for the Beta distribution
3. Rotate the input: **y** = **Pi** . **x**
4. For each coordinate j, find the nearest centroid: idx_j = argmin_k |y_j - c_k|
5. Output: idx (an array of b-bit integers, one per coordinate)

**Algorithm 1 — DeQuantize:**
1. Replace each idx_j with its centroid: y_tilde_j = c_{idx_j}
2. Rotate back: **x_tilde** = **Pi**^T . **y_tilde**

**Key insight**: After random rotation, each coordinate of a unit-norm vector follows a Beta distribution that converges to N(0, 1/d) in high dimensions. This allows **independent scalar quantization per coordinate** with near-optimal MSE.
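
For concreteness, here is a minimal C sketch of Stage 1 following the description above. All names are hypothetical (they are not this repo's functions), and the rotation is stubbed with the identity; a real implementation would apply a seeded random orthogonal matrix or a randomized Hadamard transform.

```c
#include <math.h>
#include <stddef.h>

/* Identity stand-ins for the random rotation Pi and its inverse; a real
 * implementation would use a seeded random orthogonal matrix or an RHT. */
static void rotate(const float *in, float *out, size_t d)
{ for (size_t j = 0; j < d; j++) out[j] = in[j]; }
static void rotate_inv(const float *in, float *out, size_t d)
{ for (size_t j = 0; j < d; j++) out[j] = in[j]; }

/* Algorithm 1 -- Quantize: rotate, then snap each coordinate to the
 * nearest of the 2^b precomputed MSE-optimal centroids in codebook[]. */
static void tq_mse_quantize(const float *x, float *y /* scratch, len d */,
                            unsigned char *idx, const float *codebook,
                            int n_levels, size_t d)
{
    rotate(x, y, d);                              /* y = Pi . x */
    for (size_t j = 0; j < d; j++) {
        int best = 0;
        float best_err = fabsf(y[j] - codebook[0]);
        for (int k = 1; k < n_levels; k++) {      /* argmin_k |y_j - c_k| */
            float err = fabsf(y[j] - codebook[k]);
            if (err < best_err) { best_err = err; best = k; }
        }
        idx[j] = (unsigned char)best;
    }
}

/* Algorithm 1 -- DeQuantize: replace indices by centroids, rotate back. */
static void tq_mse_dequantize(const unsigned char *idx, float *y /* scratch */,
                              float *x_tilde, const float *codebook, size_t d)
{
    for (size_t j = 0; j < d; j++) y[j] = codebook[idx[j]];
    rotate_inv(y, x_tilde, d);                    /* x_tilde = Pi^T . y_tilde */
}
```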

### Stage 2: TurboQuant_prod (Inner-product optimal quantizer)

**Algorithm 2 — Quantize:**
1. Apply TurboQuant_mse with bit-width **b-1** (one bit less)
2. Compute the residual: **r** = **x** - DeQuant_mse(Quant_mse(**x**))
3. Apply QJL (a 1-bit sign hash) to the residual: qjl = sign(**S** . **r**)
4. Store: (idx, qjl, ||**r**||_2)

**Algorithm 2 — DeQuantize:**
1. x_tilde_mse = DeQuant_mse(idx)
2. x_tilde_qjl = sqrt(pi/2) / d * gamma * **S**^T . qjl, where gamma = ||**r**||_2
3. Output: x_tilde_mse + x_tilde_qjl

**Key insight**: MSE-optimal quantizers are **biased** for inner product estimation. The QJL residual correction is **unbiased**; combining the two gives optimal inner product distortion.
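
A matching sketch of Stage 2, assuming the Stage-1 routines above have already produced the reconstruction x_mse. Here **S** is a d x d row-major matrix of i.i.d. N(0,1) entries, and all names are again hypothetical.

```c
#include <math.h>
#include <stddef.h>

/* Stage 2 sketch: r = x - x_mse, qjl = sign(S . r), plus ||r||_2.
 * Returns ||r||_2; qjl holds one sign per output coordinate (a byte
 * here, bit-packed in a real implementation). */
static float tq_qjl_residual(const float *x, const float *x_mse,
                             const float *S, size_t d,
                             unsigned char *qjl, float *r /* scratch */)
{
    float norm2 = 0.0f;
    for (size_t j = 0; j < d; j++) {              /* r = x - x_mse */
        r[j] = x[j] - x_mse[j];
        norm2 += r[j] * r[j];
    }
    for (size_t i = 0; i < d; i++) {              /* qjl = sign(S . r) */
        float dot = 0.0f;
        for (size_t j = 0; j < d; j++) dot += S[i * d + j] * r[j];
        qjl[i] = dot >= 0.0f;
    }
    return sqrtf(norm2);
}

/* DeQuantize: x_tilde = x_mse + sqrt(pi/2)/d * ||r|| * S^T . (+/-1 signs) */
static void tq_prod_dequantize(const float *x_mse, const unsigned char *qjl,
                               const float *S, float r_norm, size_t d,
                               float *x_tilde)
{
    float scale = sqrtf((float)M_PI / 2.0f) / (float)d * r_norm;
    for (size_t j = 0; j < d; j++) {
        float acc = 0.0f;
        for (size_t i = 0; i < d; i++)
            acc += S[i * d + j] * (qjl[i] ? 1.0f : -1.0f);
        x_tilde[j] = x_mse[j] + scale * acc;
    }
}
```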

---

## 2. Theoretical Guarantees

### MSE Distortion (Theorem 1)
- D_mse <= (sqrt(3)*pi/2) * (1/4^b) for any bit-width b >= 0
- For b=1,2,3,4: D_mse ~ 0.36, 0.117, 0.03, 0.009

### Inner Product Distortion (Theorem 2)
- **Unbiased**: E[<y, x_tilde>] = <y, x> (exact)
- D_prod <= (sqrt(3)*pi^2 * ||y||^2) / d * (1/4^b)
- For b=1,2,3,4: D_prod ~ 1.57/d, 0.56/d, 0.18/d, 0.047/d

### Lower Bound (Theorem 3)
- D_mse >= 1/4^b (information-theoretic)
- TurboQuant is within a factor of **sqrt(3)*pi/2 ~ 2.7** of optimal
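
A quick arithmetic check of the gap between the Theorem 1 upper bound and the Theorem 3 lower bound (the per-b distortions listed above are the paper's, and sit between the two):

```c
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Theorem 1 upper bound vs Theorem 3 lower bound on D_mse. */
    double c = sqrt(3.0) * M_PI / 2.0;   /* ~2.72: gap factor to optimal */
    for (int b = 1; b <= 4; b++) {
        double ub = c / pow(4.0, b);     /* upper bound (Theorem 1) */
        double lb = 1.0 / pow(4.0, b);   /* lower bound (Theorem 3) */
        printf("b=%d  upper=%.4f  lower=%.4f  ratio=%.2f\n", b, ub, lb, ub / lb);
    }
    return 0;
}
```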

### KV Cache Results
- **3.5 bits/channel**: absolute quality neutrality (no degradation)
- **2.5 bits/channel**: marginal quality degradation
- **4x compression** with perfect Needle-in-a-Haystack recall (score 0.997, identical to full precision)
- Outperforms KIVI, SnapKV, PyramidKV on LongBench-E

---

## 3. Paper's Outlier Treatment Strategy

The paper relies on a strategy that is not obvious from the abstract:

> "Our strategy of splitting channels into **outlier and non-outlier sets**, and applying two independent instances of TurboQuant to each, allocating higher bit precision to outliers."

**2.5-bit setup**: 3-bit outlier channels plus 2-bit regular channels; an equal split over a 128-channel head gives (64*3 + 64*2)/128 = **2.5 effective bits**
**3.5-bit setup**: the same scheme shifted up a bit per group, e.g. (64*4 + 64*3)/128 = 3.5 effective bits

This is a **mixed-precision** approach where outlier channels get more bits, as the helper below illustrates.
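
A trivial helper makes the accounting explicit (the equal split is illustrative, not necessarily the paper's exact configuration):

```c
/* Average bits/channel for a two-group mixed-precision split. */
static double effective_bits(int n_outlier, int b_outlier,
                             int n_regular, int b_regular)
{
    return (double)(n_outlier * b_outlier + n_regular * b_regular)
         / (double)(n_outlier + n_regular);
}

/* effective_bits(64, 3, 64, 2) == 2.5   -> the 2.5-bit configuration
 * effective_bits(64, 4, 64, 3) == 3.5   -> the 3.5-bit configuration */
```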

---

## 4. Gap Analysis: Paper vs Current Implementation

### 4.1 Random Rotation (CRITICAL GAP)

**Paper**: Uses a random orthogonal matrix **Pi** (d x d) to rotate input vectors before scalar quantization. This is the **foundation** of the algorithm — it converts any worst-case input into a near-Gaussian distribution that enables optimal scalar quantization.

**Our implementation**: `tq_rht.c` implements the Walsh-Hadamard Transform (WHT) with random sign flips. This is a **fast approximation** of a random rotation (O(d log d) vs O(d^2)), which is acceptable for practical use. However:

| Aspect | Paper | Our Code | Gap |
|--------|-------|----------|-----|
| Rotation type | Full random orthogonal **Pi** | Walsh-Hadamard + random signs | Acceptable (WHT is standard practice) |
| Applied to KV cache? | Yes, before quantization | **tq_rht.c exists but NOT wired into KV quantization pipeline** | **CRITICAL: RHT is implemented but unused in the engine** |
| Pre-compute | Generate once, reuse | Seed-based deterministic | OK |

**Action**: Wire `tq_rht_transform()` into the KV cache quantization path (before `tq_uniform_4b_quantize` or `tq_polar_quantize`), roughly as sketched below.
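
A hypothetical wiring sketch; the two prototypes below are assumed signatures, and the real functions in `tq_rht.c` / `tq_uniform.c` may well differ:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed signatures -- the real prototypes in tq_rht.c / tq_uniform.c
 * may differ; treat this whole unit as pseudocode-with-types. */
extern void tq_rht_transform(const float *in, float *out, size_t d, uint64_t seed);
extern void tq_uniform_4b_quantize(const float *in, void *block, size_t d);

/* Rotate-then-quantize on the KV write path (sketch). */
static void kv_store_quantized(const float *kv, float *scratch,
                               void *block, size_t d, uint64_t seed)
{
    tq_rht_transform(kv, scratch, d, seed);     /* NEW: flatten outliers first */
    tq_uniform_4b_quantize(scratch, block, d);  /* then quantize as before     */
}
```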

### 4.2 Codebook Design (SIGNIFICANT GAP)

**Paper**: Solves the **continuous k-means optimization** (Eq. 4) for the Beta density f_X(x) to find optimal centroids. For b=1: centroids = {+/- sqrt(2/(pi*d))}; for b=2: centroids = {+/- 0.453/sqrt(d), +/- 1.51/sqrt(d)}.

**Our implementation**: Uses **uniform min-max quantization** (`tq_uniform.c`): scale = (max-min)/levels, q = round(x/scale). This is the simplest possible quantizer.

| Aspect | Paper | Our Code | Gap |
|--------|-------|----------|-----|
| Quantizer type | **Optimal Lloyd-Max** for Beta/Gaussian distribution | Uniform min-max | **SIGNIFICANT: ~20-30% worse MSE** |
| Centroids | Pre-computed optimal for each bit-width | Uniformly spaced | Missing |
| Distribution-aware | Yes (tuned for post-rotation Gaussian) | No (data-agnostic) | Key gap |

**Action**: Implement the optimal codebook (Lloyd-Max centroids for the Gaussian) as a lookup table, as sketched after this list. For high d, the Gaussian centroids are well known:
- b=1: {-0.7979, +0.7979} (scaled by 1/sqrt(d))
- b=2: {-1.510, -0.4528, +0.4528, +1.510} (scaled by 1/sqrt(d))
- b=3: 8 centroids from standard tables
- b=4: 16 centroids from standard tables
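
A sketch of such lookup tables, using the classic Lloyd-Max values for the unit Gaussian (Max 1960); the 16-level b=4 table is filled from the same sources, and everything is scaled by 1/sqrt(d) at load time:

```c
#include <math.h>

/* Lloyd-Max centroids for the unit Gaussian N(0,1), from the classic
 * tables (Max 1960). For post-rotation coordinates of a unit-norm
 * vector (~ N(0, 1/d)), scale each table by 1/sqrt(d) at load time. */
static const float lm_b1[2] = { -0.7979f, 0.7979f };
static const float lm_b2[4] = { -1.5104f, -0.4528f, 0.4528f, 1.5104f };
static const float lm_b3[8] = { -2.1520f, -1.3439f, -0.7560f, -0.2451f,
                                 0.2451f,  0.7560f,  1.3439f,  2.1520f };

/* Nearest-centroid index in a sorted codebook (a linear scan is fine
 * for <= 16 levels; a branchless binary search or SIMD compare also works). */
static int lm_encode(float yj, const float *cb, int n_levels)
{
    int best = 0;
    float best_err = fabsf(yj - cb[0]);
    for (int k = 1; k < n_levels; k++) {
        float err = fabsf(yj - cb[k]);
        if (err < best_err) { best_err = err; best = k; }
    }
    return best;
}
```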

### 4.3 Two-Stage Quantization (SIGNIFICANT GAP)

**Paper**: Stage 1 (MSE quantizer, b-1 bits) + Stage 2 (QJL on residual, 1 bit) = b total bits. This produces **unbiased** inner product estimates.

**Our implementation**: `tq_turbo.c` does implement the two-stage pattern:
```c
tq_polar_quantize_ref(src, &block->polar, dim); // Stage 1
// compute residual
tq_qjl_quantize_ref(residual, &block->residual, dim); // Stage 2
```

But there are gaps:

| Aspect | Paper | Our Code | Gap |
|--------|-------|----------|-----|
| Stage 1 quantizer | Optimal Lloyd-Max after rotation | PolarQuant (atan2-based, NOT rotation-based) | **WRONG algorithm** |
| Residual computation | r = x - DeQuant_mse(Quant_mse(x)) | r = src - dequantized_polar | Correct structure |
| QJL implementation | sign(**S** . **r**) with Gaussian **S** | sign(random_projection) with Rademacher | Acceptable (Rademacher is simpler) |
| Norm storage | \|\|r\|\|_2 stored explicitly | Stored in block | OK |
| DeQuant formula | x_mse + sqrt(pi/2)/d * gamma * **S**^T . qjl | Different reconstruction | Needs verification |

**Critical issue**: Our "PolarQuant" uses atan2-based polar coordinates (angle + radius), which is a **completely different algorithm** from the paper's rotation + scalar quantization. The paper's "PolarQuant" reference [28] is the same group's earlier work, but the TurboQuant paper supersedes it with the rotation-based approach.
| 137 | +
|
| 138 | +### 4.4 QJL Implementation |
| 139 | +
|
| 140 | +**Paper**: Q_qjl(x) = sign(**S** . x), where **S** has i.i.d. N(0,1) entries. DeQuant: sqrt(pi/2)/d * **S**^T . z. |
| 141 | +
|
| 142 | +**Our implementation**: Uses Rademacher (+1/-1) random entries instead of Gaussian. This is a valid simplification (both satisfy JL property), but the dequantization formula may differ. |
| 143 | +
|
| 144 | +| Aspect | Paper | Our Code | Gap | |
| 145 | +|--------|-------|----------|-----| |
| 146 | +| Random matrix | Gaussian N(0,1) | Rademacher (+1/-1) | Acceptable | |
| 147 | +| Quantize | sign(**S** . x) | sign(random_projection . x) | OK | |
| 148 | +| DeQuant scale | sqrt(pi/2) / d | Needs verification | Check | |
| 149 | +| Bias correction | Provably unbiased | Unverified | Test needed | |
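
A minimal Monte Carlo smoke test for that last row, applying the QJL estimator to a raw vector rather than a residual (fixed x and y, a fresh Gaussian sketch per trial; the trial mean should converge to the exact inner product):

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define D      64
#define TRIALS 2000

static float frand(void) { return 2.0f * rand() / (float)RAND_MAX - 1.0f; }

/* Box-Muller standard normal for the Gaussian sketch S. */
static float nrand(void)
{
    float u = (rand() + 1.0f) / ((float)RAND_MAX + 2.0f);
    float v = (rand() + 1.0f) / ((float)RAND_MAX + 2.0f);
    return sqrtf(-2.0f * logf(u)) * cosf(2.0f * (float)M_PI * v);
}

int main(void)
{
    float x[D], y[D];
    static float S[D * D];
    for (int j = 0; j < D; j++) { x[j] = frand(); y[j] = frand(); }

    float exact = 0.0f, xnorm = 0.0f;
    for (int j = 0; j < D; j++) { exact += x[j] * y[j]; xnorm += x[j] * x[j]; }
    xnorm = sqrtf(xnorm);

    double mean = 0.0;
    for (int t = 0; t < TRIALS; t++) {
        for (int i = 0; i < D * D; i++) S[i] = nrand();   /* fresh sketch */
        float acc = 0.0f;
        for (int i = 0; i < D; i++) {
            float sx = 0.0f, sy = 0.0f;
            for (int j = 0; j < D; j++) {
                sx += S[i * D + j] * x[j];
                sy += S[i * D + j] * y[j];
            }
            acc += (sx >= 0.0f ? sy : -sy);               /* <sign(Sx), Sy> */
        }
        mean += sqrt(M_PI / 2.0) / D * xnorm * acc;       /* one estimate  */
    }
    mean /= TRIALS;
    printf("exact %.4f   qjl estimate (mean of %d trials) %.4f\n",
           exact, TRIALS, mean);
    return 0;
}
```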
| 150 | +
|
| 151 | +### 4.5 KV Cache Integration (CRITICAL GAP) |
| 152 | +
|
| 153 | +**Paper**: Applied to KV cache quantization in LLM inference. Specifically: |
| 154 | +- Quantize K (keys) and V (values) separately |
| 155 | +- Apply outlier detection: split channels into outlier/non-outlier |
| 156 | +- Different bit allocation per group |
| 157 | +- Applied **online** during generation (not offline) |
| 158 | +- Tested on Llama-3.1-8B and Ministral-7B at 4K-104K context |
| 159 | +
|
| 160 | +**Our implementation**: The KV cache quantization (`src/cache/`) uses `tq_uniform_4b` (simple min-max Q4) — **not the TurboQuant algorithm at all**. The sophisticated quantization types (polar, qjl, turbo) exist in `src/core/` but are **not connected to the inference engine's KV cache**. |
| 161 | +
|
| 162 | +| Aspect | Paper | Our Code | Gap | |
| 163 | +|--------|-------|----------|-----| |
| 164 | +| KV quantization method | TurboQuant (rotation + Lloyd-Max + QJL) | Uniform min-max Q4 | **CRITICAL: Not using TurboQuant for KV** | |
| 165 | +| Outlier channels | Mixed-precision (3-bit outliers + 2-bit regular) | `tq_mixed.c` exists but not in engine | Not wired | |
| 166 | +| K/V asymmetry | Separate treatment | Config flag exists | Partial | |
| 167 | +| Online quantization | During generation | During generation | OK | |
| 168 | +
|
| 169 | +### 4.6 Attention Computation |
| 170 | +
|
| 171 | +**Paper**: For inner-product TurboQuant, attention scores are computed as: |
| 172 | +``` |
| 173 | +<y, Q^-1(Q(x))> = <y, x_mse> + ||r|| * <y, Q_qjl^-1(Q_qjl(r))> |
| 174 | +``` |
| 175 | +
|
| 176 | +**Our implementation**: Integer Q4×Q8 attention using vdotq_s32 — optimized for uniform quantization, not for the two-stage TurboQuant scheme. |
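
A useful property of this decomposition: if Sy = **S** . y is computed once per query, the per-key QJL correction reduces to a sign-masked sum. A hypothetical sketch:

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical two-stage score: y is the query, x_mse the dequantized
 * Stage-1 key, qjl the d residual sign bits (one per byte here), and
 * r_norm = ||r||_2. Sy = S . y is computed once per query and reused
 * across every cached key. */
static float tq_prod_score(const float *y, const float *x_mse,
                           const unsigned char *qjl, float r_norm,
                           const float *Sy, size_t d)
{
    float base = 0.0f, corr = 0.0f;
    for (size_t j = 0; j < d; j++)
        base += y[j] * x_mse[j];              /* <y, x_mse>       */
    for (size_t i = 0; i < d; i++)
        corr += qjl[i] ? Sy[i] : -Sy[i];      /* <sign(S.r), S.y> */
    return base + sqrtf((float)M_PI / 2.0f) / (float)d * r_norm * corr;
}
```

Because the sign bits merely select +Sy[i] or -Sy[i], the correction is an add/subtract reduction, which maps onto the same NEON-style kernels the integer path already uses.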
| 177 | +
|
| 178 | +--- |
| 179 | +
|
| 180 | +## 5. Implementation Priority: What to Fix |
| 181 | +
|
| 182 | +### Priority 1: Wire RHT into KV Cache (High impact, Low effort) |
| 183 | +
|
| 184 | +The Random Hadamard Transform is already implemented (`tq_rht.c`) but not used in the KV path. Adding it before quantization would improve quality significantly by making the input distribution more uniform. |
| 185 | +
|
| 186 | +``` |
| 187 | +Before: KV_fp16 → uniform_4b_quantize → stored |
| 188 | +After: KV_fp16 → RHT_transform → optimal_quantize → stored |
| 189 | + Attention: dequant → RHT_inverse → attention_score |
| 190 | +``` |
| 191 | +
|
| 192 | +### Priority 2: Optimal Codebook (High impact, Medium effort) |
| 193 | +
|
| 194 | +Replace uniform quantization with Lloyd-Max optimal centroids for the post-rotation Gaussian distribution. This is a lookup table — the centroids are precomputed constants. |
| 195 | +
|
| 196 | +For 4-bit (16 levels) Gaussian quantizer, the optimal centroids and boundaries are well-known from quantization theory. This alone can reduce MSE by **20-30%** vs uniform. |
| 197 | +
|
| 198 | +### Priority 3: True TurboQuant Two-Stage (High impact, High effort) |
| 199 | +
|
| 200 | +Implement the actual paper algorithm: |
| 201 | +1. Apply RHT |
| 202 | +2. Scalar quantize with optimal codebook (b-1 bits) |
| 203 | +3. Compute residual |
| 204 | +4. Apply QJL on residual (1 bit) |
| 205 | +5. Store: indices + qjl_signs + residual_norm |
| 206 | +
|
| 207 | +This would make TurboQuant.cpp a **faithful implementation** of the paper, not just named after it. |
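
One possible packed layout for that stored tuple (sizes are illustrative for d=128, b=4, i.e. 3-bit Stage-1 indices plus one QJL bit per coordinate; this is not the repo's actual format):

```c
#include <stdint.h>

/* Illustrative packed block for d=128, b=4 (3-bit Stage-1 + 1-bit QJL):
 * 48 B indices + 16 B signs + 4 B norm = 68 B vs 256 B fp16, i.e. about
 * 4.25 effective bits/channel including the norm overhead. */
typedef struct {
    uint8_t idx[48];   /* 128 x 3-bit Stage-1 codeword indices, bit-packed */
    uint8_t qjl[16];   /* 128 x 1-bit residual signs, bit-packed           */
    float   r_norm;    /* ||r||_2, consumed by the unbiased QJL estimator  */
} tq_turbo_block_d128_b4;
```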
| 208 | +
|
| 209 | +### Priority 4: Mixed-Precision Outlier Channels (Medium impact, Medium effort) |
| 210 | +
|
| 211 | +Split KV channels into outlier (high-variance) and non-outlier groups. Allocate 3 bits to outliers, 2 bits to others. This is what the paper does for their 2.5-bit configuration. |
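
A straightforward sketch of the top-K-by-variance detection the roadmap below proposes (API names are hypothetical):

```c
#include <stddef.h>

/* Top-K-by-variance outlier marking (sketch). O(k*n) selection is fine
 * for per-head channel counts of ~128. */
static void mark_outliers(const float *chan_var /* per-channel variance */,
                          size_t n_channels, size_t k,
                          unsigned char *is_outlier /* out, len n_channels */)
{
    for (size_t c = 0; c < n_channels; c++) is_outlier[c] = 0;
    for (size_t picked = 0; picked < k && picked < n_channels; picked++) {
        size_t best = 0;
        float best_var = -1.0f;
        for (size_t c = 0; c < n_channels; c++)
            if (!is_outlier[c] && chan_var[c] > best_var) {
                best_var = chan_var[c];
                best = c;
            }
        is_outlier[best] = 1;   /* claim the next-highest-variance channel */
    }
}
```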
| 212 | +
|
| 213 | +--- |
| 214 | +
|
| 215 | +## 6. Quantitative Impact Estimates |
| 216 | +
|
| 217 | +| Improvement | MSE Reduction | Inner Product Error | Effort | |
| 218 | +|-------------|---------------|---------------------|--------| |
| 219 | +| RHT pre-rotation | ~15-25% | ~15-25% | 2-3 hours | |
| 220 | +| Optimal codebook | ~20-30% | ~20-30% | 4-6 hours | |
| 221 | +| Two-stage (MSE + QJL) | ~40-50% | **unbiased** (vs biased) | 8-12 hours | |
| 222 | +| Outlier mixed-precision | ~10-20% | ~10-20% | 4-6 hours | |
| 223 | +| **Combined** | **~60-70%** | **near-optimal** | 20-30 hours | |
| 224 | +
|
| 225 | +Current uniform Q4 achieves ~3.8x compression. |
| 226 | +Paper's TurboQuant at 3.5 bits achieves ~4.5x compression with **zero quality degradation**. |
| 227 | +At 2.5 bits: ~6.4x compression with **marginal** quality degradation. |
| 228 | +
|
| 229 | +--- |
| 230 | +
|
| 231 | +## 7. Paper's Key Numbers for Reference |
| 232 | +
|
| 233 | +### LongBench-E (Table 1, Llama-3.1-8B-Instruct) |
| 234 | +
|
| 235 | +| Method | KV Size (bits) | Average Score | |
| 236 | +|--------|---------------|---------------| |
| 237 | +| Full Cache | 16 | 50.06 | |
| 238 | +| KIVI | 3 | 48.50 | |
| 239 | +| KIVI | 5 | 50.16 | |
| 240 | +| PolarQuant | 3.9 | 49.78 | |
| 241 | +| **TurboQuant** | **2.5** | **49.44** | |
| 242 | +| **TurboQuant** | **3.5** | **50.06** | |
| 243 | +
|
| 244 | +At 3.5 bits, TurboQuant matches full precision (50.06 = 50.06). |
| 245 | +At 2.5 bits, TurboQuant still outperforms KIVI at 3 bits. |
| 246 | +
|
| 247 | +### Needle-in-a-Haystack (Figure 4) |
| 248 | +
|
| 249 | +| Method | Score | |
| 250 | +|--------|-------| |
| 251 | +| Full Precision | 0.997 | |
| 252 | +| **TurboQuant** | **0.997** | |
| 253 | +| PolarQuant | 0.995 | |
| 254 | +| KIVI | 0.981 | |
| 255 | +| PyramidKV | 0.895 | |
| 256 | +| SnapKV | 0.858 | |
| 257 | +
|
| 258 | +TurboQuant achieves **identical** performance to full precision at 4x compression. |
| 259 | +
|
| 260 | +### Quantization Speed (Table 2) |
| 261 | +
|
| 262 | +| Method | d=200 | d=1536 | d=3072 | |
| 263 | +|--------|-------|--------|--------| |
| 264 | +| Product Quantization | 37.04s | 239.75s | 494.42s | |
| 265 | +| RabitQ | 597.25s | 2267.59s | 3957.19s | |
| 266 | +| **TurboQuant** | **0.0007s** | **0.0013s** | **0.0021s** | |
| 267 | +
|
| 268 | +TurboQuant is **100,000x faster** than alternatives — crucial for online KV cache quantization. |
| 269 | +
|
| 270 | +--- |
| 271 | +
|
| 272 | +## 8. Recommended Implementation Roadmap |
| 273 | +
|
| 274 | +### Phase 1: Foundation (Days 1-2) |
| 275 | +- [ ] Implement Gaussian Lloyd-Max codebook as static lookup tables (b=1,2,3,4) |
| 276 | +- [ ] Wire RHT into KV cache quantization path |
| 277 | +- [ ] Add `TQ_TYPE_TURBOQUANT_MSE` that uses rotation + optimal scalar quantization |
| 278 | +- [ ] Benchmark MSE improvement vs current uniform |
| 279 | +
|
| 280 | +### Phase 2: Two-Stage (Days 3-4) |
| 281 | +- [ ] Implement residual computation after MSE quantization |
| 282 | +- [ ] Apply QJL on residual with correct dequantization scale (sqrt(pi/2)/d) |
| 283 | +- [ ] Add `TQ_TYPE_TURBOQUANT_PROD` for unbiased inner product |
| 284 | +- [ ] Verify unbiasedness with statistical tests |
| 285 | +
|
| 286 | +### Phase 3: Mixed-Precision (Days 5-6) |
| 287 | +- [ ] Implement outlier channel detection (top-K variance channels) |
| 288 | +- [ ] Allocate 3 bits to outliers, 2 bits to regular (2.5-bit config) |
| 289 | +- [ ] Allocate 4 bits to outliers, 3 bits to regular (3.5-bit config) |
| 290 | +- [ ] Benchmark on LongBench-E equivalent tasks |
| 291 | +
|
| 292 | +### Phase 4: Integration (Days 7-8) |
| 293 | +- [ ] Replace `uniform_4b` as default KV cache type with `turboquant_3.5b` |
| 294 | +- [ ] Update benchmarks with true TurboQuant numbers |
| 295 | +- [ ] Compare against paper's reported results |
| 296 | +- [ ] Update README with "faithful paper implementation" claim |
| 297 | +
|

---

## 9. Conclusion

**Current state**: TurboQuant.cpp is named after the paper but uses **uniform min-max quantization** for the KV cache, not the actual TurboQuant algorithm. The core algorithms (polar, qjl, turbo) exist in `src/core/` but are **not connected to the inference engine**.

**Impact of fixing**: Implementing the true TurboQuant algorithm would:
1. Reduce the KV cache to **2.5-3.5 bits** (vs the current 4 bits) — **12.5-37.5% fewer KV bits**
2. Achieve **zero quality degradation** at 3.5 bits (vs measurable degradation at the current 4 bits)
3. Make TurboQuant.cpp a **faithful reference implementation** of the ICLR 2026 paper
4. Provide a unique, defensible differentiation that no other C inference engine has

This is the **single highest-impact improvement** available to the project.