Commit fac7e16

prettier
Signed-off-by: Will Manning <will@willmanning.io>
1 parent 6f25eee commit fac7e16

1 file changed

Lines changed: 44 additions & 41 deletions

proposed/0033-block-turboquant.md

@@ -218,11 +218,11 @@ by the NormVector encoding (PR #7251).
218218
The per-block norms are stored as a single `FixedSizeListArray<F>` with
219219
`list_size = num_blocks`, where `F` matches or widens the input element type:
220220

221-
| Input dtype | Norm dtype | Rationale |
222-
|-------------|-----------|-----------|
223-
| f16 | f32 | f16 has insufficient range/precision for norms |
224-
| f32 | f32 | Same type |
225-
| f64 | f64 | Preserve full precision |
221+
| Input dtype | Norm dtype | Rationale |
222+
| ----------- | ---------- | ---------------------------------------------- |
223+
| f16 | f32 | f16 has insufficient range/precision for norms |
224+
| f32 | f32 | Same type |
225+
| f64 | f64 | Preserve full precision |
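
As a minimal sketch of the widening rule in the table, per-block norms can be accumulated in the widened type. This assumes NumPy and an input dimension that is a multiple of the block size; the helper name `block_norms` and the `NORM_DTYPE` mapping are hypothetical and not taken from the proposal's implementation.

```python
import numpy as np

# Hypothetical mapping that mirrors the table: input dtype -> norm dtype.
NORM_DTYPE = {
    np.dtype(np.float16): np.float32,
    np.dtype(np.float32): np.float32,
    np.dtype(np.float64): np.float64,
}

def block_norms(vectors, block_size=64):
    """Return an (n, num_blocks) array of per-block L2 norms.

    Assumes vectors has shape (n, d) with d divisible by block_size;
    straggler blocks are not handled in this sketch.
    """
    n, d = vectors.shape
    if d % block_size != 0:
        raise ValueError("d must be a multiple of block_size in this sketch")
    norm_dtype = NORM_DTYPE[vectors.dtype]
    # Widen before squaring so that f16 inputs do not overflow.
    blocks = vectors.astype(norm_dtype).reshape(n, d // block_size, block_size)
    return np.sqrt(np.sum(blocks * blocks, axis=-1))
```

With f16 input whose entries are 300, squaring in f16 would overflow (300² = 90000 exceeds the f16 maximum of 65504), whereas the f32 accumulation above yields a norm of 2400 per 64-element block.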
226226

227227
Norms are stored as plain child arrays — TurboQuant does not apply any secondary
228228
encoding to them. The cascading compressor treats norms like any other float
@@ -245,6 +245,7 @@ With block decomposition and externalized norms, quantized cosine similarity
245245
requires more work than the current single-block approach.
246246

247247
**Quantized dot product:**
248+
248249
```
249250
<a, b> ≈ Σ_k ‖aₖ‖ · ‖bₖ‖ · Σ_j centroids[code_aₖ[j]] · centroids[code_bₖ[j]]
250251
```
@@ -253,6 +254,7 @@ Per-block: compute the unit-norm quantized dot product (sum over 64 centroid
253254
products), then weight by both vectors' block norms.
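
The per-block computation described above can be sketched as follows. This is an illustrative NumPy version, not the proposal's kernel: the function name `quantized_dot` and the array shapes (`(k, B)` codes, `(k,)` norms, a 1-D centroid table) are assumptions, and the QJL correction is not modeled.

```python
import numpy as np

def quantized_dot(codes_a, codes_b, norms_a, norms_b, centroids):
    """Approximate <a, b> from block codes and block norms.

    codes_*: integer arrays of shape (k, B) indexing into centroids.
    norms_*: float arrays of shape (k,) holding per-block L2 norms.
    centroids: 1-D float array (the shared scalar codebook).
    """
    # Unit-norm quantized dot product per block: sum over the B
    # centroid products within each block.
    unit_dots = np.sum(centroids[codes_a] * centroids[codes_b], axis=-1)
    # Weight each block by both vectors' block norms, then sum over blocks.
    return float(np.sum(norms_a * norms_b * unit_dots))
```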
254255

255256
**Quantized cosine similarity:**
257+
256258
```
257259
cos(a, b) ≈ <a, b> / (‖a‖ · ‖b‖)
258260
= Σ_k ‖aₖ‖ · ‖bₖ‖ · unit_dotₖ
@@ -334,6 +336,7 @@ In a second stage, we transpose the code storage from row-major to
334336
dimension-major within groups of 64 vectors, following the PDX layout [4].
335337

336338
**Row-major (Stage 1):**
339+
337340
```
338341
Vector 0: [b0_d0 b0_d1 ... b0_d63 | b1_d0 ... b1_d63 | ... ]
339342
Vector 1: [b0_d0 b0_d1 ... b0_d63 | b1_d0 ... b1_d63 | ... ]
@@ -344,6 +347,7 @@ All codes for a single vector are contiguous. Good for per-vector decode, bad
344347
for vectorized scans.
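
A rough sketch of the row-major to dimension-major transposition within 64-vector chunks, assuming NumPy and a vector count that is a multiple of the chunk size. The names `to_pdx` and `from_pdx` are hypothetical, and the real layout may pack sub-byte codes differently.

```python
import numpy as np

def to_pdx(codes, chunk=64):
    """Transpose row-major codes (n, d) into (n // chunk, d, chunk).

    After the transpose, the codes of one dimension for all vectors in a
    chunk are contiguous, which is what a vectorized scan wants.
    """
    n, d = codes.shape
    if n % chunk != 0:
        raise ValueError("n must be a multiple of chunk in this sketch")
    return np.ascontiguousarray(codes.reshape(n // chunk, chunk, d).transpose(0, 2, 1))

def from_pdx(pdx):
    """Undo to_pdx, returning row-major codes of shape (n, d)."""
    num_chunks, d, chunk = pdx.shape
    return np.ascontiguousarray(pdx.transpose(0, 2, 1)).reshape(num_chunks * chunk, d)
```

Here `pdx[c, j, v]` holds the code for dimension `j` of vector `c * chunk + v`, so a scan over one dimension reads 64 adjacent codes.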
345348

346349
**PDX (Stage 2), within each 64-vector chunk:**
350+
347351
```
348352
Block 0, dim 0: [v0 v1 v2 ... v63]
349353
Block 0, dim 1: [v0 v1 v2 ... v63]
@@ -450,18 +454,18 @@ in row-major or PDX-transposed layout.
450454
For f32 input at dimension d with bit width b (QJL, so b-1 MSE bits + 1 QJL
451455
bit), k = ⌈d/64⌉ blocks, B = 64:
452456

453-
| Component | Bits per vector |
454-
|-----------|----------------|
455-
| MSE codes | k × B × (b-1) |
456-
| QJL signs | k × B × 1 |
457-
| Block norms | k × norm_bits |
458-
| QJL residual norms | k × norm_bits |
457+
| Component | Bits per vector |
458+
| ------------------ | --------------- |
459+
| MSE codes | k × B × (b-1) |
460+
| QJL signs | k × B × 1 |
461+
| Block norms | k × norm_bits |
462+
| QJL residual norms | k × norm_bits |
459463

460-
| Component | Shared bits |
461-
|-----------|-------------|
462-
| Centroids | 2^(b-1) × 32 |
463-
| MSE rotation signs | k × 3 × B |
464-
| QJL rotation signs | k × 3 × B |
464+
| Component | Shared bits |
465+
| ------------------ | ------------ |
466+
| Centroids | 2^(b-1) × 32 |
467+
| MSE rotation signs | k × 3 × B |
468+
| QJL rotation signs | k × 3 × B |
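
The two tables translate directly into a storage estimate. The sketch below only evaluates the formulas as written, assuming f32 input, 32-bit norms, and no further compression of any component; the function names are hypothetical.

```python
import math

def block_tq_bits(d, b, block=64, norm_bits=32, centroid_bits=32):
    """Return (bits per vector, shared bits) from the tables' formulas."""
    k = math.ceil(d / block)
    per_vector = (
        k * block * (b - 1)  # MSE codes
        + k * block * 1      # QJL signs
        + k * norm_bits      # block norms
        + k * norm_bits      # QJL residual norms
    )
    shared = (
        2 ** (b - 1) * centroid_bits  # centroids
        + k * 3 * block               # MSE rotation signs
        + k * 3 * block               # QJL rotation signs
    )
    return per_vector, shared

def compression_ratio(d, b, n_vectors, input_bits=32):
    """Uncompressed size divided by estimated block-TurboQuant size."""
    per_vector, shared = block_tq_bits(d, b)
    return (n_vectors * d * input_bits) / (n_vectors * per_vector + shared)
```

For d=768, b=5 this gives 4608 bits per vector plus 5120 shared bits, and a ratio of roughly 5.3× over 1000 vectors.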
465469

466470
### Example: f32, d=768, b=5 (4-bit MSE + 1-bit QJL), 1000 vectors, k=12
467471

@@ -487,10 +491,10 @@ bit), k = ⌈d/64⌉ blocks, B = 64:
487491

488492
### Comparison: current single-SRHT TurboQuant
489493

490-
| Dimension | Current ratio | Block ratio | Delta |
491-
|-----------|--------------|-------------|-------|
492-
| 768 (padded→1024) | 4.7× | 5.3× | +13% better |
493-
| 1024 (no padding) | 6.3× | 5.3× | -16% worse |
494+
| Dimension | Current ratio | Block ratio | Delta |
495+
| ----------------- | ------------- | ----------- | ----------- |
496+
| 768 (padded→1024) | 4.7× | 5.3× | +13% better |
497+
| 1024 (no padding) | 6.3× | 5.3× | -16% worse |
494498

495499
For power-of-2 dimensions, the block approach uses more storage due to per-block
496500
norms and QJL residual norms (k × 2 × 32 bits/vector overhead). At d=1024 with
@@ -506,12 +510,12 @@ modest overhead for power-of-2 dimensions.
506510

507511
With k blocks at d-dimensional input, encoding requires:
508512

509-
| Operation | Per-block | Total (k blocks) |
510-
|-----------|-----------|-------------------|
511-
| SRHT (3-round, B=64) | ~384 FLOPs | k × 384 |
512-
| Centroid lookup (B=64) | 64 binary searches | k × 64 |
513-
| QJL SRHT | ~384 FLOPs | k × 384 |
514-
| Norm computation | 64 FMA + sqrt | k × 65 |
513+
| Operation | Per-block | Total (k blocks) |
514+
| ---------------------- | ------------------ | ---------------- |
515+
| SRHT (3-round, B=64) | ~384 FLOPs | k × 384 |
516+
| Centroid lookup (B=64) | 64 binary searches | k × 64 |
517+
| QJL SRHT | ~384 FLOPs | k × 384 |
518+
| Norm computation | 64 FMA + sqrt | k × 65 |
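
For intuition about the transform cost, the following sketch counts the add/subtract operations in a single fast Walsh-Hadamard pass, which is the core of an SRHT. It is illustrative only: random sign flips and multiple rounds are not modeled, and `fwht` is a hypothetical helper rather than the proposal's implementation.

```python
import numpy as np

def fwht(x):
    """Normalized fast Walsh-Hadamard transform with an operation count.

    Returns (transformed vector, number of add/subtract operations).
    The input length must be a power of two.
    """
    x = np.array(x, dtype=np.float64)
    n = x.size
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    ops = 0
    h = 1
    while h < n:
        # One butterfly stage: n / 2 additions and n / 2 subtractions.
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
            ops += 2 * h
        h *= 2
    # Scaling by 1/sqrt(n) makes the transform orthogonal.
    return x / np.sqrt(n), ops
```

For a 64-element block, one pass performs 64 × log₂(64) = 384 additions and subtractions.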
515519

516520
For d=768, k=12: total ~20K FLOPs per vector (SRHT + QJL SRHT) vs. current
517521
single-SRHT approach: 2 × ~10K = ~20K FLOPs. Encode throughput should be
@@ -521,12 +525,12 @@ comparable — the per-block overhead is offset by smaller per-block transforms.
521525

522526
Decoding is dominated by k inverse rotations + codebook lookups:
523527

524-
| Operation | Per-block | Total (k blocks) |
525-
|-----------|-----------|-------------------|
526-
| Codebook lookup | 64 table reads | k × 64 |
527-
| Inverse SRHT | ~384 FLOPs | k × 384 |
528-
| QJL inverse SRHT | ~384 FLOPs | k × 384 |
529-
| Denormalize | 64 multiplies | k × 64 |
528+
| Operation | Per-block | Total (k blocks) |
529+
| ---------------- | -------------- | ---------------- |
530+
| Codebook lookup | 64 table reads | k × 64 |
531+
| Inverse SRHT | ~384 FLOPs | k × 384 |
532+
| QJL inverse SRHT | ~384 FLOPs | k × 384 |
533+
| Denormalize | 64 multiplies | k × 64 |
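
The MSE part of the decode path in the table (codebook lookup, inverse rotation, denormalize) can be sketched as below. The inverse rotation is passed in as a callable because its details are outside this sketch; the name `decode_blocks` and the array shapes are assumptions, and the QJL inverse SRHT step is not modeled.

```python
import numpy as np

def decode_blocks(codes, norms, centroids, inverse_rotation=None):
    """Reconstruct a vector from (k, B) codes and (k,) block norms.

    inverse_rotation, if given, maps one length-B unit-norm block back
    to the original coordinate system (hypothetical callable).
    """
    # Codebook lookup: one table read per coordinate.
    unit = centroids[codes]
    if inverse_rotation is not None:
        unit = np.stack([inverse_rotation(row) for row in unit])
    # Denormalize: scale each block by its stored norm, then flatten.
    return (unit * norms[:, None]).reshape(-1)
```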
530534

531535
For d=768: ~10K FLOPs decode vs. current ~10K (single 1024-dim inverse SRHT).
532536
Similar throughput.
@@ -541,14 +545,13 @@ memory bandwidth utilization.
541545
### Benchmarking plan
542546

543547
Stage 1 should include benchmarks comparing:
548+
544549
1. Encode throughput: block TQ vs. current TQ at d=128, 768, 1024
545550
2. Decode throughput: same dimensions
546551
3. Quantized cosine similarity throughput: block vs. current
547552
4. L2 norm readthrough latency: O(k) block norms vs. O(1) current
548553

549-
Stage 2 should benchmark:
550-
5. PDX scan throughput vs. row-major scan at d=768, 1024
551-
6. Full decompression from PDX layout (includes un-transpose overhead)
554+
Stage 2 should benchmark: 5. PDX scan throughput vs. row-major scan at d=768, 1024 6. Full decompression from PDX layout (includes un-transpose overhead)
552555

553556
## Phasing
554557

@@ -580,6 +583,7 @@ The current TurboQuant test suite validates specific behaviors that will change:
580583
more children to manage.
581584

582585
New tests needed:
586+
583587
- SRHT quality at d=64: coordinate distribution vs. analytical Beta at 3, 4, 5 rounds
584588
- Practical MSE comparison: d=64 blocks vs. d=768 single-rotation at same bit width
585589
- Straggler block handling: dense rotation, separate centroids
@@ -590,15 +594,14 @@ New tests needed:
590594
## References
591595

592596
[1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. "TurboQuant: Online
593-
Vector Quantization with Near-optimal Distortion Rate." arXiv:2504.19874,
594-
April 2025.
597+
Vector Quantization with Near-optimal Distortion Rate." arXiv:2504.19874,
598+
April 2025.
595599

596600
[2] Ailon, N. and Chazelle, B. "The Fast Johnson-Lindenstrauss Transform and
597-
Approximate Nearest Neighbors." SIAM Journal on Computing, 39(1):302-322,
598-
2009.
601+
Approximate Nearest Neighbors." SIAM Journal on Computing, 39(1):302-322, 2009.
599602

600603
[3] Tropp, J.A. "Improved Analysis of the Subsampled Randomized Hadamard
601-
Transform." Advances in Adaptive Data Analysis, 3(1-2):115-126, 2011.
604+
Transform." Advances in Adaptive Data Analysis, 3(1-2):115-126, 2011.
602605

603606
[4] Kuffo, L., Krippner, E. and Boncz, P. "PDX: A Data Layout for Vector
604-
Similarity Search." Proceedings of SIGMOD '25. arXiv:2503.04422, March 2025.
607+
Similarity Search." Proceedings of SIGMOD '25. arXiv:2503.04422, March 2025.
