Commit ef7e874

incorporate HIGGS into RFC 33 (#43)
Signed-off-by: Will Manning <will@willmanning.io>
1 parent 3655eac commit ef7e874

File tree

1 file changed: +118 −0 lines changed

proposed/0033-block-turboquant.md

@@ -76,6 +76,58 @@ simple deployment, and theoretical guarantees matter most, while PQ or OPQ may
still win empirically when a learned vector codebook can exploit dataset-specific
structure.

### Comparison to HIGGS

HIGGS [12] (Malinovskii et al., 2024) is a data-free quantization method for LLM
weight matrices that shares TurboQuant's core idea — Hadamard rotation followed by
MSE-optimal grid quantization — but targets a different application domain and makes
different design trade-offs:

|                      | TurboQuant                                                               | HIGGS                                                                      |
| -------------------- | ------------------------------------------------------------------------ | -------------------------------------------------------------------------- |
| Application domain   | ANN embedding search (per-vector, online)                                | LLM weight quantization (per-layer, offline)                               |
| Rotation             | 3-round SORF (HD₃·HD₂·HD₁): high-quality random orthogonal approximation | Single RHT (H·D): one Hadamard × random diagonal signs                     |
| Target distribution  | Beta marginal (1-x²)^((d-3)/2) on unit sphere                            | Approximate Gaussian N(0,1)                                                |
| Quantization grid    | Max-Lloyd centroids (scalar, p=1), analytically derived for Beta         | CLVQ grids (Pagès & Printems 2003), supports vector quantization p∈{1,2,4} |
| Error metric         | Pure MSE (reconstruction error)                                          | MSE + Hessian-weighted per-layer coefficients αₗ (Linearity Theorem)       |
| Calibration data     | None                                                                     | None for quantization; small calibration set for αₗ estimation             |
| Non-uniform bitwidth | No (uniform across all vectors)                                          | Yes (DP solver for per-layer bit allocation)                               |
| Distance computation | Quantized-domain scan kernel (PDX layout, SIMD over 64 vectors)          | GPU matrix multiply (FLUTE kernel)                                         |
| Norm storage         | Explicit per-block norms for distance computation                        | Per-group scales folded into weight reconstruction                         |

**Key design differences explained:**

- **Rotation depth.** TurboQuant normalizes to the unit sphere first, so
  coordinates must follow the specific Beta marginal for Max-Lloyd centroids to
  be optimal — this requires a high-quality random orthogonal approximation
  (3-round SORF). HIGGS operates on raw (group-normalized) weights and only
  needs approximate Gaussianity, so a single RHT suffices.
- **VQ dimension.** HIGGS's CLVQ grids support multi-dimensional vector
  quantization (p>1), where groups of p coordinates are quantized jointly to an
  optimal multi-dimensional grid. At 3-4 bits, p=2 or p=4 achieves better
  rate-distortion than scalar (p=1) quantization by exploiting residual
  correlations between coordinates. TurboQuant is currently scalar-only (p=1);
  p>1 would require changes to the PDX scan kernel (per-subvector codebook
  lookup instead of per-coordinate). See Future work for discussion.
- **Error metric.** HIGGS's Linearity Theorem (perplexity increase ≈ Σ αₗ·tₗ²)
  enables Hessian-aware optimization specific to LLM inference. For ANN search,
  MSE is the natural metric — it directly bounds distance distortion — and
  non-uniform bit allocation has no analogue (all vectors share the same
  encoding).
- **Beta vs. Gaussian at high d.** As d grows, the Beta distribution
  (1-x²)^((d-3)/2) concentrates and becomes approximately Gaussian with
  variance ~1/d. At d=256+, the practical difference between Beta-optimal and
  Gaussian-optimal grids shrinks. Whether Gaussian grids (simpler: one grid per
  bitwidth, no dimension dependence) match Beta Max-Lloyd for ANN recall is an
  empirical question — see Experimental plan.
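
The rotation-depth contrast can be sketched concretely. Below is a minimal
pure-Python illustration — not the Vortex implementation, and `fwht`, `rht`, and
`sorf` are hypothetical names for this sketch — of one RHT round (H·D, as in
HIGGS) versus a 3-round SORF composition, both built on the fast Walsh-Hadamard
transform:

```python
import math
import random

def fwht(x):
    """In-place fast Walsh-Hadamard butterfly, then orthonormal scaling.

    len(x) must be a power of two; with the 1/sqrt(n) scaling the transform
    is orthonormal, so it preserves the L2 norm.
    """
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def rht(x, signs):
    """One Randomized Hadamard Transform round: H · D, D = diag(random signs)."""
    return fwht([s * v for s, v in zip(signs, x)])

def sorf(x, sign_rounds):
    """SORF: compose several RHT rounds (TurboQuant uses 3)."""
    for signs in sign_rounds:
        x = rht(x, signs)
    return x

rng = random.Random(0)
d = 64
x = [rng.gauss(0, 1) for _ in range(d)]
signs3 = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(3)]
y1 = rht(list(x), signs3[0])  # HIGGS-style: single round
y3 = sorf(list(x), signs3)    # TurboQuant-style: 3 rounds
```

Both variants are orthonormal (norm-preserving) and cost O(d log d) per round;
the practical difference is how closely the composed transform approximates a
uniformly random orthogonal matrix, which is what TurboQuant's Beta-marginal
assumption on the unit sphere relies on.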

**Domain mismatch.** Comparisons of TurboQuant vs. HIGGS on LLM perplexity
benchmarks are misleading: HIGGS's Hessian-aware optimization naturally dominates
for that task, but TurboQuant was never designed for LLM weight quantization. The
relevant comparison is ANN recall@k on embedding datasets, where TurboQuant's
block decomposition, PDX scan layout, and per-vector encode/decode are the
critical features.

### Current Vortex implementation

The [current implementation][current-impl] (Rust, in the `vortex-tensor` crate,
@@ -908,6 +960,31 @@ maximizes per-block quality and minimizes norm count. Experiments may show that
smaller B with more pruning checkpoints yields better end-to-end scan
performance despite higher per-block overhead.

### Gaussian-optimal vs. Beta-optimal grids

HIGGS [12] demonstrates that Gaussian-optimal grids (computed via CLVQ for N(0,1))
work well after a single Hadamard rotation. Since the Beta marginal converges to
Gaussian at high d, test whether Gaussian grids can replace Beta Max-Lloyd centroids
for ANN search:

- **Grid comparison**: At B ∈ {64, 128, 256, 512} and b ∈ {2, 3, 4, 5, 8},
  compare ANN recall@k and normalized MSE for (a) Beta Max-Lloyd centroids at
  B-dim, (b) Gaussian-optimal scalar grids (Normal Float style), and
  (c) CLVQ-computed Gaussian grids. Report the crossover point where the grids
  become practically equivalent.
- **Rotation depth**: If Gaussian grids match Beta Max-Lloyd at a given B, test
  whether 1-round RHT (H·D with random signs) achieves comparable quality to
  3-round SORF. A single round would reduce rotation cost by ~3× and simplify
  the transform. Test at B ∈ {64, 128, 256, 512} on the benchmarking datasets.
- **Simplification potential**: If Gaussian grids + 1-round RHT match quality at
  B ≥ 256, this eliminates the dimension-dependent centroid computation (one grid
  per bitwidth, shared across all block sizes) and reduces rotation overhead.
  This would be a significant implementation simplification for Stage 2+.

The expectation is that at B=256+ the difference is negligible, but at B=64-128
the Beta-optimal grids may still win due to stronger non-Gaussian effects. Results
should inform whether the centroid computation strategy changes in Phase 2.
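
A scaled-down version of the grid comparison can be prototyped before committing
to the full experimental matrix. The sketch below is illustrative only (pure
Python, small sample counts; `lloyd_1d` is a hypothetical helper, not Vortex
code): it trains a 2-bit scalar Max-Lloyd grid on N(0, 1/d) samples and on
unit-sphere coordinate samples at d = 256, then measures the excess MSE from
applying the Gaussian-trained grid to the sphere marginal:

```python
import math
import random

def lloyd_1d(samples, k, iters=30):
    """Scalar Max-Lloyd: alternate nearest-centroid assignment / mean update."""
    samples = sorted(samples)
    lo, hi = samples[0], samples[-1]
    cents = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for s in samples:
            j = min(range(k), key=lambda i: (s - cents[i]) ** 2)
            buckets[j].append(s)
        # Keep the old centroid if a cell happens to be empty.
        cents = [sum(b) / len(b) if b else c for b, c in zip(buckets, cents)]
    return cents

def mse(samples, cents):
    return sum(min((s - c) ** 2 for c in cents) for s in samples) / len(samples)

rng = random.Random(0)
d, n, k = 256, 4000, 4  # block size, sample count, 2-bit (4-level) grid
gauss = [rng.gauss(0, 1 / math.sqrt(d)) for _ in range(n)]
# One coordinate of a uniform random unit vector follows the Beta-type marginal.
sphere = []
for _ in range(n):
    v = [rng.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in v))
    sphere.append(v[0] / norm)
g_grid = lloyd_1d(gauss, k)   # Gaussian-trained grid
s_grid = lloyd_1d(sphere, k)  # sphere-marginal-trained grid
# Ratio near 1.0 means the Gaussian grid is practically interchangeable.
excess = mse(sphere, g_grid) / mse(sphere, s_grid)
```

If the ratio stays near 1 at the target block sizes, the Gaussian-grid
simplification is worth pursuing; rerunning the same script at d = 64 may show
the larger gap this section predicts for small B.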

### QJL strategy comparison (if pursued)

- Per-block Gaussian QJL vs. per-block SORF QJL vs. full-dim SORF QJL
@@ -992,6 +1069,43 @@ In all cases, MSE-only is the recommended starting point. QJL should only be
added if experiments demonstrate clear recall@k improvements for the target
workload.

## Future work: Multi-dimensional vector quantization (p>1)

HIGGS [12] demonstrates that vector quantization with dimension p>1 (quantizing
groups of p coordinates jointly to an optimal multi-dimensional grid) achieves
better rate-distortion than scalar quantization (p=1) at the same bit budget. For
TurboQuant, this would mean replacing the per-coordinate Max-Lloyd centroid lookup
with a per-subvector codebook lookup, where each group of p rotated coordinates
maps to one of n codewords in a p-dimensional CLVQ grid.

**Benefits:**

- Improved rate-distortion: at 3-4 bits, p=2 or p=4 captures residual
  correlations between coordinates that scalar quantization misses.
- Simpler centroid computation: CLVQ grids for Gaussian inputs are computed once
  per (n, p) pair and reused across all block sizes (no dimension dependence).

**Costs and constraints:**

- **Distance kernel redesign.** The PDX scan kernel (Stage 3) is built around
  per-coordinate centroid lookups with a (2^b)²-entry distance table. At p=2
  with b=4 bits per coordinate, the codebook has 2^(4×2)=256 entries, and the
  distance table becomes 256×256=64K entries (256 KB) — still fits in L1/L2 but
  much larger than the current 1 KB at b=4 scalar. At p=4 the table is
  infeasible; alternative distance strategies (asymmetric distance computation,
  partial codebook scans) would be needed.
- **GPU shared memory.** HIGGS notes that the total grid size 2^(b×p) must fit in
  GPU shared memory (~2^10 points is the practical limit), constraining (b, p)
  pairs.
- **PDX layout interaction.** The current "1 dim × 64 vecs" PDX layout assumes
  per-coordinate independence. At p>1, the layout would need to group p
  consecutive dimensions together per lookup, changing the transpose structure.

**Recommendation:** Evaluate p=2 VQ experimentally after Stage 3 (PDX) is
validated. Compare ANN recall@k at matched bit budgets: p=1 at b bits vs. p=2 at
b bits. If p=2 shows meaningful recall improvement (>2% recall@10), design the
kernel changes as a Stage 4 extension. CLVQ grids for p=2 can be precomputed
offline using the Pagès & Printems (2003) algorithm [12].
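
The (b, p) feasibility arithmetic above is easy to check mechanically. A small
sketch (`vq_table_sizes` is a hypothetical helper; 4-byte f32 table entries are
an assumption, not specified by the RFC):

```python
def vq_table_sizes(b, p, entry_bytes=4):
    """Codebook and symmetric distance-table sizes for p-dim VQ at b bits/coord."""
    codewords = 2 ** (b * p)        # points in the p-dimensional grid
    table_entries = codewords ** 2  # all-pairs codeword distance table
    return codewords, table_entries, table_entries * entry_bytes

# Scalar baseline (b=4, p=1): 16 codewords, 256-entry table, 1 KiB.
assert vq_table_sizes(4, 1) == (16, 256, 1024)
# p=2 at b=4: 256 codewords, 64K entries, 256 KiB -- fits L2, far from tiny.
assert vq_table_sizes(4, 2) == (256, 65536, 262144)
# p=4 at b=4: 65536 codewords, 2^32 entries -- the symmetric table is infeasible.
codewords, entries, _ = vq_table_sizes(4, 4)
```

This makes the constraint explicit: symmetric distance tables are only viable
while 2^(b×p) stays small, after which asymmetric distance computation is the
natural fallback.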

## Future work: GPU decode and fused distance computation

The B-dim block structure maps naturally to GPU tile sizes and tensor cores.
@@ -1181,6 +1295,10 @@ IEEE Trans. PAMI 36(4):744-755, 2014.
[11] Jääsaari, E., Hyvönen, V., Ceccarello, M., Roos, T. and Aumüller, M.
"VIBE: Vector Index Benchmark for Embeddings." arXiv:2505.17810, May 2025.

[12] Malinovskii, V., Panferov, A., Ilin, I., Guo, H., Richtárik, P. and
Alistarh, D. "Pushing the Limits of Large Language Model Quantization via the
Linearity Theorem." arXiv:2411.17525, November 2024.

## Appendix A: Reference implementation bugs and Theorem 1 constant

### Reference implementation bugs
