@@ -76,6 +76,58 @@
simple deployment, and theoretical guarantees matter most, while PQ or OPQ may
still win empirically when a learned vector codebook can exploit dataset-specific
structure.

### Comparison to HIGGS

HIGGS [12] (Malinovskii et al., 2024) is a data-free quantization method for LLM
weight matrices that shares TurboQuant's core idea — Hadamard rotation followed by
MSE-optimal grid quantization — but targets a different application domain and makes
different design trade-offs:

|                      | TurboQuant | HIGGS |
| -------------------- | ---------- | ----- |
| Application domain   | ANN embedding search (per-vector, online) | LLM weight quantization (per-layer, offline) |
| Rotation             | 3-round SORF (HD₃·HD₂·HD₁): high-quality random orthogonal approximation | Single RHT (H·D): one Hadamard × random diagonal signs |
| Target distribution  | Beta marginal (1-x²)^((d-3)/2) on unit sphere | Approximate Gaussian N(0,1) |
| Quantization grid    | Max-Lloyd centroids (scalar, p=1), analytically derived for Beta | CLVQ grids (Pagès & Printems 2003), supporting vector quantization p∈{1,2,4} |
| Error metric         | Pure MSE (reconstruction error) | MSE + Hessian-weighted per-layer coefficients αₗ (Linearity Theorem) |
| Calibration data     | None | None for quantization; small calibration set for αₗ estimation |
| Non-uniform bitwidth | No (uniform across all vectors) | Yes (DP solver for per-layer bit allocation) |
| Distance computation | Quantized-domain scan kernel (PDX layout, SIMD over 64 vectors) | GPU matrix multiply (FLUTE kernel) |
| Norm storage         | Explicit per-block norms for distance computation | Per-group scales folded into weight reconstruction |

**Key design differences explained:**

- **Rotation depth.** TurboQuant normalizes to the unit sphere first, so
  coordinates must follow the specific Beta marginal for Max-Lloyd centroids to
  be optimal — this requires a high-quality random orthogonal approximation
  (3-round SORF). HIGGS operates on raw (group-normalized) weights and only
  needs approximate Gaussianity, so a single RHT suffices.
- **VQ dimension.** HIGGS's CLVQ grids support multi-dimensional vector
  quantization (p>1), where groups of p coordinates are quantized jointly to an
  optimal multi-dimensional grid. At 3-4 bits, p=2 or p=4 achieves better
  rate-distortion than scalar (p=1) quantization by exploiting residual
  correlations between coordinates. TurboQuant is currently scalar-only (p=1);
  p>1 would require changes to the PDX scan kernel (per-subvector codebook
  lookup instead of per-coordinate). See Future work for discussion.
- **Error metric.** HIGGS's Linearity Theorem (perplexity increase ≈ Σ αₗ·tₗ²)
  enables Hessian-aware optimization specific to LLM inference. For ANN search,
  MSE is the natural metric — it directly bounds distance distortion — and
  non-uniform bit allocation has no analogue (all vectors share the same
  encoding).
- **Beta vs. Gaussian at high d.** As d grows, the Beta distribution
  (1-x²)^((d-3)/2) concentrates and becomes approximately Gaussian with
  variance ~1/d. At d=256+, the practical difference between Beta-optimal and
  Gaussian-optimal grids shrinks. Whether Gaussian grids (simpler: one grid per
  bitwidth, no dimension dependence) match Beta Max-Lloyd for ANN recall is an
  empirical question — see Experimental plan.

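The rotation-depth trade-off is small enough to sketch directly. The following is an illustrative Python mock-up (not the Vortex Rust kernel) using a textbook fast Walsh-Hadamard transform: one RHT round applies H·D with fresh random signs, and a SORF-style transform simply chains three such rounds. Both are orthogonal up to the 1/√n scaling, so norms are preserved exactly.

```python
import math
import random

def fwht(v):
    """In-place fast Walsh-Hadamard transform (unnormalized); len(v) must be a power of two."""
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2

def rht(x, signs):
    """One randomized Hadamard round: (1/sqrt(n)) * H @ diag(signs) @ x (orthogonal)."""
    v = [s * xi for s, xi in zip(signs, x)]
    fwht(v)
    inv = 1.0 / math.sqrt(len(v))
    return [vi * inv for vi in v]

def sorf3(x, sign_rounds):
    """Three chained RHT rounds (HD3.HD2.HD1), SORF-style; HIGGS uses a single round."""
    for signs in sign_rounds:
        x = rht(x, signs)
    return x

rng = random.Random(0)
d = 64
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
rounds = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(3)]
y = sorf3(x, rounds)

# Orthogonality check: the rotation preserves the norm (up to float rounding).
norm = lambda v: math.sqrt(sum(t * t for t in v))
print(abs(norm(y) - norm(x)) < 1e-9)
```

A single-round RHT is `rht(x, rounds[0])`; the 3× cost difference is exactly the three FWHT passes.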
**Domain mismatch.** Comparisons of TurboQuant vs. HIGGS on LLM perplexity
benchmarks are misleading: HIGGS's Hessian-aware optimization naturally dominates
for that task, but TurboQuant was never designed for LLM weight quantization. The
relevant comparison is ANN recall@k on embedding datasets, where TurboQuant's
block decomposition, PDX scan layout, and per-vector encode/decode are the
critical features.

### Current Vortex implementation

The [current implementation][current-impl] (Rust, in the `vortex-tensor` crate,
@@ -908,6 +960,31 @@
maximizes per-block quality and minimizes norm count. Experiments may show that
smaller B with more pruning checkpoints yields better end-to-end scan
performance despite higher per-block overhead.

### Gaussian-optimal vs. Beta-optimal grids

HIGGS [12] demonstrates that Gaussian-optimal grids (computed via CLVQ for N(0,1))
work well after a single Hadamard rotation. Since the Beta marginal converges to
Gaussian at high d, test whether Gaussian grids can replace Beta Max-Lloyd centroids
for ANN search:

- **Grid comparison**: At B ∈ {64, 128, 256, 512} and b ∈ {2, 3, 4, 5, 8},
  compare ANN recall@k and normalized MSE for (a) Beta Max-Lloyd centroids at
  B-dim, (b) Gaussian-optimal scalar grids (Normal Float style), and
  (c) CLVQ-computed Gaussian grids. Report the crossover point where the grids
  become practically equivalent.
- **Rotation depth**: If Gaussian grids match Beta Max-Lloyd at a given B, test
  whether 1-round RHT (H·D with random signs) achieves comparable quality to
  3-round SORF. A single round would reduce rotation cost by ~3× and simplify
  the transform. Test at B ∈ {64, 128, 256, 512} on the benchmarking datasets.
- **Simplification potential**: If Gaussian grids + 1-round RHT match quality at
  B ≥ 256, this eliminates the dimension-dependent centroid computation (one grid
  per bitwidth, shared across all block sizes) and reduces rotation overhead.
  This would be a significant implementation simplification for Stage 2+.

The expectation is that at B=256+ the difference is negligible, but at B=64-128
the Beta-optimal grids may still win due to stronger non-Gaussian effects. Results
should inform whether the centroid computation strategy changes in Phase 2.

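A minimal Python sketch of one cell of this experiment, under simplifying assumptions (b=4 scalar grid, brute-force nearest-centroid search, Monte Carlo MSE only, no recall@k): build the Gaussian-optimal Lloyd-Max grid with exact conditional-mean updates via `math.erf`, then measure its MSE on √d-scaled unit-sphere coordinates, whose marginal is the Beta-type density (1-x²)^((d-3)/2).

```python
import math
import random

def phi(x):   # standard normal pdf
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):   # standard normal cdf (math.erf handles +/-inf)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lloyd_max_gaussian(bits, iters=200):
    """Lloyd-Max scalar quantizer for N(0,1): alternate boundary/centroid updates."""
    n = 1 << bits
    c = [-3.0 + 6.0 * (i + 0.5) / n for i in range(n)]   # uniform initial grid
    for _ in range(iters):
        t = [-math.inf] + [(c[i] + c[i + 1]) / 2 for i in range(n - 1)] + [math.inf]
        for i in range(n):
            mass = Phi(t[i + 1]) - Phi(t[i])
            pa = phi(t[i]) if math.isfinite(t[i]) else 0.0
            pb = phi(t[i + 1]) if math.isfinite(t[i + 1]) else 0.0
            c[i] = (pa - pb) / mass   # E[X | t_i < X <= t_{i+1}] for N(0,1)
    return c

def quantize(x, grid):
    return min(grid, key=lambda ci: (ci - x) ** 2)

rng = random.Random(1)
c4 = lloyd_max_gaussian(4)   # 16-level Gaussian-optimal grid (b=4)

def sphere_coord_mse(d, grid, n=5000):
    """MSE of the Gaussian grid on sqrt(d)-scaled unit-sphere coordinates."""
    err = 0.0
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        r = math.sqrt(sum(t * t for t in g))
        x = math.sqrt(d) * g[0] / r   # Beta-marginal sample rescaled to variance ~1
        err += (quantize(x, grid) - x) ** 2
    return err / n

mses = {d: sphere_coord_mse(d, c4) for d in (64, 256)}
for d, m in mses.items():
    print(d, round(m, 4))
```

The full experiment would swap in Beta Max-Lloyd centroids and CLVQ grids for the same samples and compare recall@k, not just MSE.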
### QJL strategy comparison (if pursued)

- Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim SORF QJL
@@ -992,6 +1069,43 @@
In all cases, MSE-only is the recommended starting point. QJL should only be
added if experiments demonstrate clear recall@k improvements for the target
workload.

## Future work: Multi-dimensional vector quantization (p>1)

HIGGS [12] demonstrates that vector quantization with dimension p>1 (quantizing
groups of p coordinates jointly to an optimal multi-dimensional grid) achieves
better rate-distortion than scalar quantization (p=1) at the same bit budget. For
TurboQuant, this would mean replacing the per-coordinate Max-Lloyd centroid lookup
with a per-subvector codebook lookup, where each group of p rotated coordinates
maps to one of n codewords in a p-dimensional CLVQ grid.

**Benefits:**

- Improved rate-distortion: at 3-4 bits, p=2 or p=4 captures residual
  correlations between coordinates that scalar quantization misses.
- Simpler centroid computation: CLVQ grids for Gaussian inputs are computed once
  per (n, p) pair and reused across all block sizes (no dimension dependence).

**Costs and constraints:**

- **Distance kernel redesign.** The PDX scan kernel (Stage 3) is built around
  per-coordinate centroid lookups with a (2^b)²-entry distance table. At p=2
  with b=4 bits per coordinate, the codebook has 2^(4×2)=256 entries, and the
  distance table grows to 256×256=64K entries (256 KB) — too large for L1 but
  still within a typical L2, versus the current 1 KB at b=4 scalar. At p=4 the
  table is infeasible; alternative distance strategies (asymmetric distance
  computation, partial codebook scans) would be needed.
- **GPU shared memory.** HIGGS notes that the total grid size 2^(b×p) must fit in
  GPU shared memory (~2^10 points is the practical limit), constraining (b, p)
  pairs.
- **PDX layout interaction.** The current "1 dim × 64 vecs" PDX layout assumes
  per-coordinate independence. At p>1, the layout would need to group p
  consecutive dimensions together per lookup, changing the transpose structure.

**Recommendation:** Evaluate p=2 VQ experimentally after Stage 3 (PDX) is
validated. Compare ANN recall@k at matched bit budgets: p=1 at b bits vs. p=2 at
b bits. If p=2 shows a meaningful recall improvement (>2% recall@10), design the
kernel changes as a Stage 4 extension. CLVQ grids for p=2 can be precomputed
offline using the Pagès & Printems (2003) algorithm [12].

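The table-size arithmetic above is easy to check. A tiny sketch, assuming 4-byte (f32) table entries as in the scalar kernel:

```python
def sizes(b, p, entry_bytes=4):
    """Codebook and distance-table sizes for b bits per coordinate, VQ dimension p."""
    codewords = 2 ** (b * p)           # points in the p-dimensional grid
    table_entries = codewords ** 2     # all-pairs codeword distance table
    return codewords, table_entries, table_entries * entry_bytes

for b, p in [(4, 1), (4, 2), (4, 4)]:
    cw, te, by = sizes(b, p)
    print(f"b={b} p={p}: {cw} codewords, table {te} entries = {by / 1024:.0f} KiB")
```

At (4, 1) this gives the current 1 KB table; at (4, 2) the 256 KB table; at (4, 4) the table explodes to 2^32 entries (16 GiB), which is why asymmetric distance computation would be required there.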
## Future work: GPU decode and fused distance computation

The B-dim block structure maps naturally to GPU tile sizes and tensor cores.
@@ -1181,6 +1295,10 @@
IEEE Trans. PAMI 36(4):744-755, 2014.

[11] Jääsaari, E., Hyvönen, V., Ceccarello, M., Roos, T. and Aumüller, M.
"VIBE: Vector Index Benchmark for Embeddings." arXiv:2505.17810, May 2025.

[12] Malinovskii, V., Panferov, A., Ilin, I., Guo, H., Richtárik, P. and
Alistarh, D. "Pushing the Limits of Large Language Model Quantization via the
Linearity Theorem." arXiv:2411.17525, November 2024.

## Appendix A: Reference implementation bugs and Theorem 1 constant

### Reference implementation bugs