# TurboQuant Paper Deep Analysis & Implementation Gap Assessment

**Paper**: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
**Authors**: Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni (Google Research / Google DeepMind)
**Published**: arXiv 2504.19874, April 2025 (accepted to ICLR 2026)

---
## 1. Paper Core Algorithm

TurboQuant is a **two-stage** vector quantization algorithm:

### Stage 1: TurboQuant_mse (MSE-optimal quantizer)

**Algorithm 1 (Quantize):**
1. Generate a random orthogonal rotation matrix **Pi** (d x d)
2. Pre-compute **codebook** centroids c_1...c_{2^b} that minimize MSE for the Beta distribution
3. Rotate the input: **y** = **Pi** . **x**
4. For each coordinate j: find the nearest centroid idx_j = argmin_k |y_j - c_k|
5. Output: idx (an array of b-bit integers, one per coordinate)

**Algorithm 1 (DeQuantize):**
1. Replace each idx_j with its centroid: y_tilde_j = c_{idx_j}
2. Rotate back: **x_tilde** = **Pi**^T . **y_tilde**

**Key insight**: After random rotation, each coordinate of a unit-norm vector follows a Beta distribution that converges to N(0, 1/d) in high dimensions. This allows **independent scalar quantization per coordinate** with near-optimal MSE.
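To make the rotate-then-scalar-quantize structure concrete, here is a minimal C sketch of Algorithm 1, assuming the rotation **Pi** has already been applied by the caller (e.g., via an RHT). The function names and signatures are illustrative, not the repo's API:

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Algorithm 1, Quantize (sketch): y is the already-rotated input,
 * codebook holds the 2^b precomputed centroids, sorted ascending. */
static void tq_mse_quantize_sketch(const float *y, size_t d,
                                   const float *codebook, int b,
                                   uint8_t *idx_out) {
    const size_t levels = (size_t)1 << b;
    for (size_t j = 0; j < d; j++) {
        size_t best = 0;
        float best_dist = fabsf(y[j] - codebook[0]);
        for (size_t k = 1; k < levels; k++) {          /* nearest centroid */
            float dist = fabsf(y[j] - codebook[k]);
            if (dist < best_dist) { best_dist = dist; best = k; }
        }
        idx_out[j] = (uint8_t)best;                    /* b-bit code */
    }
}

/* Algorithm 1, DeQuantize (sketch): map codes back to centroids;
 * the caller then applies Pi^T to recover x_tilde. */
static void tq_mse_dequantize_sketch(const uint8_t *idx, size_t d,
                                     const float *codebook, float *y_tilde) {
    for (size_t j = 0; j < d; j++)
        y_tilde[j] = codebook[idx[j]];
}
```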
### Stage 2: TurboQuant_prod (Inner-product optimal quantizer)

**Algorithm 2 (Quantize):**
1. Apply TurboQuant_mse with bit-width **b-1** (one bit less)
2. Compute the residual: **r** = **x** - DeQuant_mse(Quant_mse(**x**))
3. Apply QJL (a 1-bit sign hash) to the residual: qjl = sign(**S** . **r**)
4. Store: (idx, qjl, ||**r**||_2)

**Algorithm 2 (DeQuantize):**
1. x_tilde_mse = DeQuant_mse(idx)
2. x_tilde_qjl = sqrt(pi/2) / d * gamma * **S**^T . qjl, where gamma = ||**r**||_2 is the stored residual norm
3. Output: x_tilde_mse + x_tilde_qjl

**Key insight**: MSE-optimal quantizers are **biased** for inner product estimation. The QJL residual correction is **unbiased**; combining the two gives optimal inner product distortion.
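A hedged sketch of Algorithm 2's quantize step, reusing `tq_mse_quantize_sketch` from above. As there, the input is assumed already rotated; **S** is a caller-supplied dense d x d projection, and the codebook passed in must be the 2^(b-1)-entry Stage 1 table (all names illustrative):

```c
/* Algorithm 2, Quantize (sketch): Stage 1 at b-1 bits, then one QJL
 * sign bit per row of S on the residual. Returns ||r||_2, which is
 * stored alongside the codes and sign bits. Assumes d <= 1024. */
static float tq_prod_quantize_sketch(const float *x, size_t d,
                                     const float *codebook, int b,
                                     const float *S,
                                     uint8_t *idx_out, uint8_t *qjl_out) {
    float r[1024];
    tq_mse_quantize_sketch(x, d, codebook, b - 1, idx_out);   /* Stage 1 */
    for (size_t j = 0; j < d; j++)
        r[j] = x[j] - codebook[idx_out[j]];   /* r = x - DeQuant(Quant(x)) */
    for (size_t i = 0; i < d; i++) {          /* Stage 2: qjl = sign(S.r) */
        float dot = 0.0f;
        for (size_t j = 0; j < d; j++) dot += S[i * d + j] * r[j];
        qjl_out[i] = dot >= 0.0f ? 1 : 0;
    }
    float norm_sq = 0.0f;                     /* stored residual norm */
    for (size_t j = 0; j < d; j++) norm_sq += r[j] * r[j];
    return sqrtf(norm_sq);
}
```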
---

## 2. Theoretical Guarantees

### MSE Distortion (Theorem 1)
- D_mse <= (sqrt(3)*pi/2) * (1/4^b) for any bit-width b >= 0
- For b = 1, 2, 3, 4: D_mse ~ 0.36, 0.117, 0.03, 0.009

### Inner Product Distortion (Theorem 2)
- **Unbiased**: E[<y, x_tilde>] = <y, x> (exact)
- D_prod <= (sqrt(3)*pi^2 * ||y||^2) / d * (1/4^b)
- For b = 1, 2, 3, 4: D_prod ~ 1.57/d, 0.56/d, 0.18/d, 0.047/d

### Lower Bound (Theorem 3)
- D_mse >= 1/4^b (information-theoretic)
- TurboQuant is within a factor of **sqrt(3)*pi/2 ~ 2.7** of optimal
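The same three bounds restated compactly in LaTeX (nothing new here, just the inequalities above in one display):

```latex
\begin{aligned}
\text{(Thm.~1)}\quad & D_{\mathrm{mse}} \;\le\; \frac{\sqrt{3}\,\pi}{2}\cdot 4^{-b} \\
\text{(Thm.~2)}\quad & \mathbb{E}\!\left[\langle y,\tilde{x}\rangle\right] = \langle y,x\rangle,
\qquad D_{\mathrm{prod}} \;\le\; \frac{\sqrt{3}\,\pi^{2}\,\lVert y\rVert^{2}}{d}\cdot 4^{-b} \\
\text{(Thm.~3)}\quad & D_{\mathrm{mse}} \;\ge\; 4^{-b}
\;\;\Longrightarrow\;\; \text{approximation factor} \le \frac{\sqrt{3}\,\pi}{2} \approx 2.7
\end{aligned}
```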
### KV Cache Results
- **3.5 bits/channel**: absolute quality neutrality (no degradation)
- **2.5 bits/channel**: marginal quality degradation
- **4x compression** with perfect Needle-in-a-Haystack recall (score 0.997 vs 0.997 at full precision)
- Outperforms KIVI, SnapKV, and PyramidKV on LongBench-E

---

## 3. Paper's Outlier Treatment Strategy

The paper relies on a strategy that is not obvious from the abstract:

> "Our strategy of splitting channels into **outlier and non-outlier sets**, and applying two independent instances of TurboQuant to each, allocating higher bit precision to outliers."

**2.5-bit setup**: 3-bit outlier channels mixed with 2-bit regular channels; for the average over 128 channels to land at 2.5 bits, the split must be even: (64*3 + 64*2)/128 = **2.5 effective bits**
**3.5-bit setup**: a different bit allocation (e.g., 4-bit outliers and 3-bit regular channels) yielding 3.5 effective bits

This is a **mixed-precision** approach where outlier channels get more bits.
---

## 4. Gap Analysis: Paper vs Current Implementation

### 4.1 Random Rotation (CRITICAL GAP)

**Paper**: Uses a random orthogonal matrix **Pi** (d x d) to rotate input vectors before scalar quantization. This is the **foundation** of the algorithm: it converts any worst-case input into a near-Gaussian distribution that enables optimal scalar quantization.

**Our implementation**: `tq_rht.c` implements the Walsh-Hadamard Transform (WHT) with random sign flips. This is a **fast approximation** of a random rotation (O(d log d) vs O(d^2)), which is acceptable for practical use. However:

| Aspect | Paper | Our Code | Gap |
|--------|-------|----------|-----|
| Rotation type | Full random orthogonal **Pi** | Walsh-Hadamard + random signs | Acceptable (WHT is standard practice) |
| Applied to KV cache? | Yes, before quantization | **tq_rht.c exists but NOT wired into the KV quantization pipeline** | **CRITICAL: RHT is implemented but unused in the engine** |
| Pre-compute | Generate once, reuse | Seed-based deterministic | OK |

**Action**: Wire `tq_rht_transform()` into the KV cache quantization path (before `tq_uniform_4b_quantize` or `tq_polar_quantize`).
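A sketch of what that wiring could look like on the KV write path. `tq_rht_transform` and `tq_uniform_4b_quantize` exist per the table above, but their exact signatures here are assumptions for illustration only:

```c
/* Hypothetical KV-write path with RHT pre-rotation (sketch; the tq_*
 * signatures are assumptions, not the repo's actual declarations). */
static void kv_store_quantized_sketch(const float *kv_row, size_t d,
                                      uint64_t rht_seed, void *kv_block) {
    float rotated[1024];                    /* assumes d <= 1024 */

    /* 1. Rotate: Walsh-Hadamard with seeded random signs (tq_rht.c). */
    tq_rht_transform(kv_row, rotated, d, rht_seed);

    /* 2. Quantize the now near-Gaussian coordinates. */
    tq_uniform_4b_quantize(rotated, kv_block, d);

    /* Read path (not shown): dequantize, then apply the inverse RHT
     * with the same seed before computing attention scores. */
}
```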
### 4.2 Codebook Design (SIGNIFICANT GAP)

**Paper**: Solves the **continuous k-means optimization** (Eq. 4) for the Beta distribution f_X(x) to find the optimal centroids. For b=1: centroids = {+/- sqrt(2/(pi*d))}; for b=2: centroids = {+/- 0.453/sqrt(d), +/- 1.51/sqrt(d)}.

**Our implementation**: Uses **uniform min-max quantization** (`tq_uniform.c`): scale = (max - min)/levels, q = round(x/scale). This is the simplest possible quantizer.

| Aspect | Paper | Our Code | Gap |
|--------|-------|----------|-----|
| Quantizer type | **Optimal Lloyd-Max** for Beta/Gaussian distribution | Uniform min-max | **SIGNIFICANT: ~20-30% worse MSE** |
| Centroids | Pre-computed optimal for each bit-width | Uniformly spaced | Missing |
| Distribution-aware | Yes (tuned for post-rotation Gaussian) | No (data-agnostic) | Key gap |

**Action**: Implement the optimal codebook (Lloyd-Max centroids for a Gaussian) as a lookup table; see the sketch after this list. For high d, the Gaussian centroids are well known:
- b=1: {-0.7979, +0.7979} (scaled by 1/sqrt(d))
- b=2: {-1.510, -0.4528, +0.4528, +1.510} (scaled by 1/sqrt(d))
- b=3: 8 centroids from standard tables
- b=4: 16 centroids from standard tables
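A sketch of the lookup-table approach. The b=1 and b=2 values are the ones listed above; the b=3 and b=4 tables should be transcribed from standard Lloyd-Max references (or re-derived) and verified before use:

```c
#include <math.h>

/* Lloyd-Max codebooks for a unit-variance Gaussian; scale by 1/sqrt(d)
 * at runtime, since post-RHT coordinates are approximately N(0, 1/d). */
static const float tq_lm_b1[2] = { -0.7979f, +0.7979f };
static const float tq_lm_b2[4] = { -1.510f, -0.4528f, +0.4528f, +1.510f };
/* TODO: tq_lm_b3[8] and tq_lm_b4[16] from standard tables, verified. */

/* Nearest-centroid search over a small sorted codebook. */
static inline int tq_lm_index(float y, const float *cb, int levels) {
    int best = 0;
    float best_dist = fabsf(y - cb[0]);
    for (int k = 1; k < levels; k++) {
        float dist = fabsf(y - cb[k]);
        if (dist < best_dist) { best_dist = dist; best = k; }
    }
    return best;
}
```

Because the codebook is sorted, the linear scan could be replaced by precomputed decision boundaries (midpoints between adjacent centroids) for constant-time lookup.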
### 4.3 Two-Stage Quantization (SIGNIFICANT GAP)

**Paper**: Stage 1 (MSE quantizer, b-1 bits) + Stage 2 (QJL on residual, 1 bit) = b total bits. This produces **unbiased** inner product estimates.

**Our implementation**: `tq_turbo.c` does implement the two-stage pattern:

```c
tq_polar_quantize_ref(src, &block->polar, dim);        // Stage 1
// compute residual: src minus the dequantized Stage 1 output
tq_qjl_quantize_ref(residual, &block->residual, dim);  // Stage 2
```

But there are gaps:

| Aspect | Paper | Our Code | Gap |
|--------|-------|----------|-----|
| Stage 1 quantizer | Optimal Lloyd-Max after rotation | PolarQuant (atan2-based, NOT rotation-based) | **WRONG algorithm** |
| Residual computation | r = x - DeQuant_mse(Quant_mse(x)) | r = src - dequantized_polar | Correct structure |
| QJL implementation | sign(**S** . **r**) with Gaussian **S** | sign(random_projection) with Rademacher | Acceptable (Rademacher is simpler) |
| Norm storage | \|\|r\|\|_2 stored explicitly | Stored in block | OK |
| DeQuant formula | x_mse + sqrt(pi/2)/d * gamma * **S**^T . qjl | Different reconstruction | Needs verification |

**Critical issue**: Our "PolarQuant" uses atan2-based polar coordinates (angle + radius), which is a **completely different algorithm** from the paper's rotation + scalar quantization. The paper's "PolarQuant" reference [28] is the same group's earlier work, but the TurboQuant paper supersedes it with the rotation-based approach.
### 4.4 QJL Implementation

**Paper**: Q_qjl(x) = sign(**S** . x), where **S** has i.i.d. N(0,1) entries. DeQuant: sqrt(pi/2)/d * **S**^T . z.

**Our implementation**: Uses Rademacher (+1/-1) random entries instead of Gaussian. This is a valid simplification (both satisfy the JL property), but the dequantization formula may differ.

| Aspect | Paper | Our Code | Gap |
|--------|-------|----------|-----|
| Random matrix | Gaussian N(0,1) | Rademacher (+1/-1) | Acceptable |
| Quantize | sign(**S** . x) | sign(random_projection . x) | OK |
| DeQuant scale | sqrt(pi/2) / d | Needs verification | Check |
| Bias correction | Provably unbiased | Unverified | Test needed |
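For reference, a sketch of the paper's Gaussian-**S** variant with the quoted dequantization scale; this is what a verification test would compare the Rademacher implementation against (names illustrative, square d x d projection assumed):

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* QJL quantize: z = sign(S . x), one bit per row of S (d x d,
 * row-major, i.i.d. N(0,1) entries in the paper). */
static void tq_qjl_quantize_sketch(const float *S, const float *x,
                                   size_t d, uint8_t *z_out) {
    for (size_t i = 0; i < d; i++) {
        float dot = 0.0f;
        for (size_t j = 0; j < d; j++) dot += S[i * d + j] * x[j];
        z_out[i] = dot >= 0.0f ? 1 : 0;
    }
}

/* QJL dequantize: x_hat = sqrt(pi/2)/d * S^T . z, with z in {-1,+1}.
 * In the two-stage scheme this is further multiplied by the stored
 * residual norm gamma = ||r||_2 (Algorithm 2 above). */
static void tq_qjl_dequantize_sketch(const float *S, const uint8_t *z,
                                     size_t d, float *x_hat) {
    const float scale = sqrtf((float)(M_PI / 2.0)) / (float)d;
    for (size_t j = 0; j < d; j++) {
        float acc = 0.0f;
        for (size_t i = 0; i < d; i++)
            acc += S[i * d + j] * (z[i] ? 1.0f : -1.0f);   /* (S^T z)_j */
        x_hat[j] = scale * acc;
    }
}
```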
### 4.5 KV Cache Integration (CRITICAL GAP)

**Paper**: Applied to KV cache quantization in LLM inference. Specifically:
- Quantize K (keys) and V (values) separately
- Apply outlier detection: split channels into outlier/non-outlier sets
- Use a different bit allocation per group
- Quantize **online** during generation (not offline)
- Tested on Llama-3.1-8B and Ministral-7B at 4K-104K context

**Our implementation**: The KV cache quantization (`src/cache/`) uses `tq_uniform_4b` (simple min-max Q4), **not the TurboQuant algorithm at all**. The sophisticated quantization types (polar, qjl, turbo) exist in `src/core/` but are **not connected to the inference engine's KV cache**.

| Aspect | Paper | Our Code | Gap |
|--------|-------|----------|-----|
| KV quantization method | TurboQuant (rotation + Lloyd-Max + QJL) | Uniform min-max Q4 | **CRITICAL: Not using TurboQuant for KV** |
| Outlier channels | Mixed-precision (3-bit outliers + 2-bit regular) | `tq_mixed.c` exists but not in engine | Not wired |
| K/V asymmetry | Separate treatment | Config flag exists | Partial |
| Online quantization | During generation | During generation | OK |

### 4.6 Attention Computation

**Paper**: For inner-product TurboQuant, attention scores are computed as:

```
<y, Q^-1(Q(x))> = <y, x_mse> + ||r|| * <y, Q_qjl^-1(Q_qjl(r))>
```

**Our implementation**: Integer Q4×Q8 attention using `vdotq_s32`, optimized for uniform quantization, not for the two-stage TurboQuant scheme.
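A sketch of that decomposed score against one quantized key, using the identity <y, S^T z> = <S y, z> so the correction costs one projection of the query. It reuses the includes and conventions of the earlier sketches; all names are illustrative:

```c
/* score = <q, k_mse> + sqrt(pi/2)/d * ||r|| * <S.q, z>   (sketch) */
static float tq_turbo_score_sketch(const float *q,        /* rotated query */
                                   const uint8_t *idx,    /* Stage 1 codes */
                                   const float *codebook, /* (b-1)-bit table */
                                   const float *S,        /* QJL projection */
                                   const uint8_t *z,      /* QJL sign bits */
                                   float r_norm, size_t d) {
    float score = 0.0f;                        /* biased MSE part */
    for (size_t j = 0; j < d; j++)
        score += q[j] * codebook[idx[j]];

    float corr = 0.0f;                         /* unbiased QJL correction */
    for (size_t i = 0; i < d; i++) {
        float dot = 0.0f;
        for (size_t j = 0; j < d; j++) dot += S[i * d + j] * q[j];
        corr += dot * (z[i] ? 1.0f : -1.0f);   /* <S q, z> */
    }
    return score + sqrtf((float)(M_PI / 2.0)) / (float)d * r_norm * corr;
}
```

In practice the projection of q would be computed once per query and reused across all cached keys that share the same **S**.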
---

## 5. Implementation Priority: What to Fix

### Priority 1: Wire RHT into KV Cache (High impact, Low effort)

The Random Hadamard Transform is already implemented (`tq_rht.c`) but not used in the KV path. Adding it before quantization would improve quality significantly by smoothing the input distribution toward the near-Gaussian shape the quantizer expects (see the wiring sketch in Section 4.1).

```
Before:    KV_fp16 → uniform_4b_quantize → stored
After:     KV_fp16 → RHT_transform → optimal_quantize → stored
Attention: dequant → RHT_inverse → attention_score
```
### Priority 2: Optimal Codebook (High impact, Medium effort)

Replace uniform quantization with Lloyd-Max optimal centroids for the post-rotation Gaussian distribution. This is a lookup table: the centroids are precomputed constants (see the table sketch in Section 4.2).

For a 4-bit (16-level) Gaussian quantizer, the optimal centroids and boundaries are well known from quantization theory. This alone can reduce MSE by an estimated **20-30%** vs uniform.

### Priority 3: True TurboQuant Two-Stage (High impact, High effort)

Implement the actual paper algorithm:
1. Apply RHT
2. Scalar quantize with the optimal codebook (b-1 bits)
3. Compute the residual
4. Apply QJL to the residual (1 bit)
5. Store: indices + qjl_signs + residual_norm (a possible block layout is sketched below)

This would make TurboQuant.cpp a **faithful implementation** of the paper, not just named after it.
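One possible layout for the stored triple in step 5; this is a hypothetical struct for a head dimension of 128, not the repo's existing block format:

```c
#include <stdint.h>

/* Hypothetical two-stage KV block (d = 128 channels). A real layout
 * would bit-pack the Stage 1 codes; they are left unpacked here. */
typedef struct {
    uint8_t  idx[128];       /* Stage 1 codes, one (b-1)-bit code/channel */
    uint8_t  qjl_signs[16];  /* 128 QJL sign bits, packed 8 per byte */
    float    r_norm;         /* ||r||_2 of the Stage 1 residual */
    uint64_t rht_seed;       /* seed of the shared RHT sign pattern */
} tq_turbo_block_sketch_t;
```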
### Priority 4: Mixed-Precision Outlier Channels (Medium impact, Medium effort)

Split KV channels into outlier (high-variance) and non-outlier groups. Allocate 3 bits to outliers and 2 bits to the rest; this is what the paper does for its 2.5-bit configuration. A detection sketch follows.
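A minimal sketch of the detection step, selecting the top-K channels by a per-channel variance estimate (how that estimate is maintained online is left to the cache; all names are illustrative):

```c
#include <stdbool.h>
#include <stddef.h>

/* Mark the top_k highest-variance channels as outliers. Simple
 * O(n_channels * top_k) selection, fine for n_channels ~ 128. */
static void tq_select_outliers_sketch(const float *chan_var,
                                      size_t n_channels, size_t top_k,
                                      bool *is_outlier) {
    for (size_t c = 0; c < n_channels; c++) is_outlier[c] = false;
    for (size_t picked = 0; picked < top_k && picked < n_channels; picked++) {
        size_t best = n_channels;              /* sentinel: none found */
        float best_var = -1.0f;
        for (size_t c = 0; c < n_channels; c++) {
            if (!is_outlier[c] && chan_var[c] > best_var) {
                best_var = chan_var[c];
                best = c;
            }
        }
        if (best == n_channels) break;
        is_outlier[best] = true;
    }
}
```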
---

## 6. Quantitative Impact Estimates

| Improvement | Est. MSE Reduction | Est. Inner-Product Error Reduction | Effort |
|-------------|--------------------|------------------------------------|--------|
| RHT pre-rotation | ~15-25% | ~15-25% | 2-3 hours |
| Optimal codebook | ~20-30% | ~20-30% | 4-6 hours |
| Two-stage (MSE + QJL) | ~40-50% | **unbiased** (vs biased) | 8-12 hours |
| Outlier mixed-precision | ~10-20% | ~10-20% | 4-6 hours |
| **Combined** | **~60-70%** | **near-optimal** | 20-30 hours |

Current uniform Q4 achieves ~3.8x compression.
The paper's TurboQuant at 3.5 bits achieves ~4.5x compression with **zero quality degradation**.
At 2.5 bits: ~6.4x compression with **marginal** quality degradation.
---

## 7. Paper's Key Numbers for Reference

### LongBench-E (Table 1, Llama-3.1-8B-Instruct)

| Method | KV Size (bits) | Average Score |
|--------|----------------|---------------|
| Full Cache | 16 | 50.06 |
| KIVI | 3 | 48.50 |
| KIVI | 5 | 50.16 |
| PolarQuant | 3.9 | 49.78 |
| **TurboQuant** | **2.5** | **49.44** |
| **TurboQuant** | **3.5** | **50.06** |

At 3.5 bits, TurboQuant matches full precision (50.06 vs 50.06).
At 2.5 bits, TurboQuant still outperforms KIVI at 3 bits.
### Needle-in-a-Haystack (Figure 4)

| Method | Score |
|--------|-------|
| Full Precision | 0.997 |
| **TurboQuant** | **0.997** |
| PolarQuant | 0.995 |
| KIVI | 0.981 |
| PyramidKV | 0.895 |
| SnapKV | 0.858 |

TurboQuant achieves **identical** performance to full precision at 4x compression.
### Quantization Speed (Table 2)

| Method | d=200 | d=1536 | d=3072 |
|--------|-------|--------|--------|
| Product Quantization | 37.04s | 239.75s | 494.42s |
| RabitQ | 597.25s | 2267.59s | 3957.19s |
| **TurboQuant** | **0.0007s** | **0.0013s** | **0.0021s** |

TurboQuant is **four to six orders of magnitude faster** than the alternatives in this table (e.g., ~50,000x faster than Product Quantization at d=200), which is crucial for online KV cache quantization.
---

## 8. Recommended Implementation Roadmap

### Phase 1: Foundation (Days 1-2)
- [ ] Implement the Gaussian Lloyd-Max codebook as static lookup tables (b=1,2,3,4)
- [ ] Wire RHT into the KV cache quantization path
- [ ] Add `TQ_TYPE_TURBOQUANT_MSE` that uses rotation + optimal scalar quantization
- [ ] Benchmark MSE improvement vs the current uniform quantizer

### Phase 2: Two-Stage (Days 3-4)
- [ ] Implement residual computation after MSE quantization
- [ ] Apply QJL to the residual with the correct dequantization scale (sqrt(pi/2)/d)
- [ ] Add `TQ_TYPE_TURBOQUANT_PROD` for unbiased inner products
- [ ] Verify unbiasedness with statistical tests

### Phase 3: Mixed-Precision (Days 5-6)
- [ ] Implement outlier channel detection (top-K variance channels)
- [ ] Allocate 3 bits to outliers, 2 bits to regular channels (2.5-bit config)
- [ ] Allocate 4 bits to outliers, 3 bits to regular channels (3.5-bit config)
- [ ] Benchmark on LongBench-E-equivalent tasks

### Phase 4: Integration (Days 7-8)
- [ ] Replace `uniform_4b` with `turboquant_3.5b` as the default KV cache type
- [ ] Update benchmarks with true TurboQuant numbers
- [ ] Compare against the paper's reported results
- [ ] Update the README with the "faithful paper implementation" claim

---
## 9. Conclusion

**Current state**: TurboQuant.cpp is named after the paper but uses **uniform min-max quantization** for the KV cache, not the actual TurboQuant algorithm. The core algorithms (polar, qjl, turbo) exist in `src/core/` but are **not connected to the inference engine**.

**Impact of fixing**: Implementing the true TurboQuant algorithm would:
1. Reduce the KV cache to **2.5-3.5 bits** (vs the current 4 bits), lifting compression from ~3.8x to **~4.5-6.4x**
2. Achieve **zero quality degradation** at 3.5 bits (vs measurable degradation at the current 4 bits)
3. Make TurboQuant.cpp a **faithful reference implementation** of the ICLR 2026 paper
4. Provide a unique, defensible differentiation that no other C inference engine has

This is the **single highest-impact improvement** possible for the project.
