Commit 22fe48b (1 parent: 8a65474)

RFC 33: more details on PDX, better description of baseline + incremental work (#37)

Follow-up to #33 #34 #35 #36. Signed-off-by: Will Manning <will@willmanning.io>

1 file changed: proposed/0033-block-turboquant.md (216 additions, 36 deletions)
@@ -45,13 +45,52 @@ MSE stage — O(d²) storage and O(d²) per-vector. For the QJL stage, the paper
uses a random Gaussian projection matrix S with i.i.d. N(0,1) entries (not an
orthogonal rotation); this distinction matters for the unbiasedness proof.

### Current Vortex implementation

Our [current implementation][current-impl] (Rust, in the `vortex-tensor` crate)
implements TurboQuant as a Vortex array encoding that compresses
`FixedSizeList<float>` arrays — the storage format of `Vector` and
`FixedShapeTensor` extension types. Key design choices and characteristics:

**Rotation.** Instead of the paper's O(d²) QR rotation, we use a 3-round
Structured Orthogonal Random Features (SORF) transform `HD₃·HD₂·HD₁` [5] for
both the MSE rotation and the QJL projection, giving O(d) storage (3d sign bits,
bitpacked) and O(d log d) per-vector work. The rotation signs are stored as a
bitpacked child array rather than recomputed from a seed at decode time. The
3-round SORF was introduced for kernel approximation [5] and approximates a
random orthogonal matrix. It is distinct from the single-round SRHT (`R·H·D`)
analyzed by Tropp [3] and the FJLT (`P·H·D`) of Ailon-Chazelle [2], both of
which are dimensionality-reducing projections rather than rotation
approximations.
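The structure of the transform can be sketched as a fast Walsh–Hadamard transform interleaved with diagonal sign flips. This is an illustrative standalone version, not the `vortex-tensor` code; `fwht` and `sorf_rotate` are names introduced here, and signs are taken as an `i8` slice rather than a bitpacked array.

```rust
// In-place fast Walsh–Hadamard transform; x.len() must be a power of 2.
fn fwht(x: &mut [f32]) {
    let n = x.len();
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(2 * h) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
}

// 3-round SORF HD₃·HD₂·HD₁: each round applies a ±1 diagonal (D) then a
// normalized Hadamard (H/√d), so the product approximates a random rotation.
fn sorf_rotate(x: &mut [f32], signs: &[i8]) {
    let d = x.len();
    assert!(d.is_power_of_two() && signs.len() == 3 * d);
    let scale = 1.0 / (d as f32).sqrt();
    for round in 0..3 {
        for (xi, s) in x.iter_mut().zip(&signs[round * d..(round + 1) * d]) {
            *xi *= *s as f32; // diagonal sign matrix D
        }
        fwht(x);
        for xi in x.iter_mut() {
            *xi *= scale; // keep each round orthogonal (norm-preserving)
        }
    }
}
```

Because each round is orthogonal, the composed transform preserves L2 norms exactly (up to float error), which is what lets norms be stored once per vector and factored out of the quantized domain.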

**Centroids.** Max-Lloyd centroids are computed via numerical integration
(trapezoid rule, 1000 points per interval) of the marginal Beta distribution at
the padded dimension, using the `HalfIntExponent` type for exact
integer/half-integer exponent arithmetic. Centroids are cached in a global
`DashMap` keyed by `(dimension, bit_width)` and stored as a shared
`PrimitiveArray<f32>` child.
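The core step is computing a centroid as the conditional mean of an interval under the marginal density `(1-x²)^((d-3)/2)` (see the DCT discussion below). A minimal sketch, using plain `f64` arithmetic in place of `HalfIntExponent` and a name (`interval_centroid`) of our choosing; the unnormalized density suffices because normalization cancels in the ratio:

```rust
// Max-Lloyd centroid of [a, b]: ∫ x·p(x) dx / ∫ p(x) dx, approximated with
// the trapezoid rule (the implementation uses 1000 points per interval).
fn interval_centroid(a: f64, b: f64, d: u32, points: usize) -> f64 {
    // Unnormalized marginal of a uniform unit vector's coordinate at dim d.
    let pdf = |x: f64| (1.0 - x * x).max(0.0).powf((d as f64 - 3.0) / 2.0);
    let h = (b - a) / points as f64;
    let (mut mass, mut moment) = (0.0, 0.0);
    for i in 0..points {
        let (x0, x1) = (a + i as f64 * h, a + (i + 1) as f64 * h);
        mass += 0.5 * h * (pdf(x0) + pdf(x1));
        moment += 0.5 * h * (x0 * pdf(x0) + x1 * pdf(x1));
    }
    moment / mass
}
```

At d = 3 the exponent is 0 and the marginal is uniform, so the centroid of an interval is simply its midpoint — a convenient sanity check for the integrator.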

**Array structure.** The `TurboQuantArray` stores up to 7 child slots: codes
(`FixedSizeListArray<u8>`, one per vector, list_size = padded_dim), norms
(`PrimitiveArray<f32>`), centroids (shared), MSE rotation signs (shared,
bitpacked), and optionally 3 QJL children (signs, residual norms, QJL rotation
signs). Codes are stored as u8 centroid indices; the cascade compressor
(BitPacked encoding) handles packing to the actual bit width on disk.

**Compute pushdowns.** Slice and take propagate to per-row children (codes,
norms) while sharing rotation signs and centroids. Quantized cosine similarity
and dot product operate directly on codes and centroids without decompression.
L2 norm returns the stored norm directly (O(1) readthrough).

**Compression scheme.** `TurboQuantScheme` implements the `Scheme` trait for the
BtrBlocks cascading compressor. It matches `Vector` and `FixedShapeTensor`
extension arrays with non-nullable float elements and dimension ≥ 3, using the
default config (5-bit QJL = 4-bit MSE + 1-bit QJL, seed 42).

**Input handling.** All float types (f16, f32, f64) are converted to f32 before
quantization. Per-vector L2 norms are computed and stored as f32. Non-power-of-2
dimensions are zero-padded to the next power of 2 for SORF compatibility. The
minimum dimension is 3 (d=2 causes a singularity in the Beta distribution
exponent).
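The padding step is trivial but worth pinning down, since padded_dim drives every size in the layout. A one-function sketch (`pad_to_pow2` is our name, not the crate's API):

```rust
// Zero-pad a vector to the next power of two for SORF compatibility.
// d = 768 pads to 1024 (the 33% overhead Stage 2 eliminates).
fn pad_to_pow2(v: &[f32]) -> Vec<f32> {
    let padded = v.len().next_power_of_two();
    let mut out = v.to_vec();
    out.resize(padded, 0.0); // zero-fill the straggler dimensions
    out
}
```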

### Reference implementation bugs

@@ -105,13 +144,48 @@ The SORF requires power-of-2 input dimension. For non-power-of-2 dimensions

### PDX

PDX [4] is a data layout for vector similarity search. The paper (SIGMOD '25)
describes a dimension-major layout within fixed-size blocks of 64 vectors,
enabling the compiler to auto-vectorize the inner distance loop over vectors
rather than dimensions, achieving on average 2× speedups over SIMD-optimized
row-major kernels on modern CPUs. The block size of 64 is empirically optimal
across AVX-512, AVX2, and NEON architectures [4].

**PDX implementation evolution.** The [open-source implementation][pdx-impl]
has evolved beyond the paper in several ways relevant to this RFC:

- **8-bit scalar quantization** (`IndexPDXIVFTreeSQ8`): Maps floats to 0-255 via
  linear min-max scaling. The int8 layout differs from float32: dimensions are
  packed in groups of 4 ("4 dims × 16 vecs") to leverage hardware dot-product
  instructions (VPDPBUSD on x86, UDOT/SDOT on ARM) that process 4 byte pairs
  per operation. This is a different tiling than the paper's "1 dim × 64 vecs."
- **ADSampling with random rotation**: The pruner applies a random orthogonal
  rotation (QR of Gaussian, or DCT when FFTW is available) to the entire
  collection as a preprocessing step. This makes coordinates approximately
  independent, enabling dimension-by-dimension hypothesis testing for early
  pruning. The rotation serves a similar purpose to TurboQuant's rotation —
  making the coordinate distribution known — but for pruning rather than
  quantization.
- **Dimension zones**: Consecutive dimensions are grouped into zones; at query
  time, zones are ranked by "distance-to-means" and the most discriminative
  zones are scanned first, enabling faster pruning.
- **Future: 1-bit vectors** are mentioned as planned.
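The SQ8 scalar quantization in the first bullet is just linear min-max scaling to the u8 range. An illustrative sketch (our names, not the PDX implementation's):

```rust
// PDX-style 8-bit scalar quantization: map floats in [min, max] to 0-255.
fn sq8_quantize(v: &[f32], min: f32, max: f32) -> Vec<u8> {
    // Degenerate range collapses to code 0.
    let scale = if max > min { 255.0 / (max - min) } else { 0.0 };
    v.iter()
        .map(|&x| ((x - min) * scale).round().clamp(0.0, 255.0) as u8)
        .collect()
}
```

Unlike TurboQuant's Max-Lloyd codes, these codes are linear in the input, which is what makes direct VPDPBUSD/UDOT arithmetic on them possible.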

**Implications for our design.** The PDX paper's float32 layout ("1 dim × 64
vecs") maps cleanly to our quantized-code scan kernel, where the inner loop
gathers from a centroid-product distance table over 64 vectors. However, if we
pursue direct int8 arithmetic (b_mse=8 with linear centroids, see GPU section),
the "4 dims × 16 vecs" int8 layout from the PDX implementation may be more
appropriate, as it enables hardware dot-product instructions.
180+
181+
Additionally, ADSampling's dimension-pruning approach is complementary to
182+
TurboQuant's block structure: when scanning with block decomposition, the pruner
183+
could skip entire TQ blocks (B dimensions at a time) if the partial distance
184+
already exceeds the candidate threshold. This combines the storage efficiency of
185+
quantization with the computational savings of early termination.
186+

[pdx-impl]: https://github.com/cwida/PDX
## Proposal

### Block size strategy
@@ -148,40 +222,68 @@ divides d. This eliminates stragglers entirely for common embedding dimensions:

### Stage 1: MSE-only TurboQuant (immediate — split from current PR)
Split the [current PR][current-impl] to extract and merge the MSE-only subset.
The QJL code can be preserved on a separate branch for Phase 4.

**Changes vs. current PR:**

| Aspect         | Current PR                                  | Stage 1                                               |
| -------------- | ------------------------------------------- | ----------------------------------------------------- |
| QJL support    | Full (encode, decode, QJL slots, QJL tests) | **Removed**                                           |
| Array slots    | 7 (4 MSE + 3 QJL)                           | **4** (codes, norms, centroids, rotation_signs)       |
| Scheme default | 5-bit QJL (4-bit MSE + 1-bit QJL)           | **5-bit MSE-only** (32 centroids)                     |
| Norms dtype    | Always f32                                  | **Same-or-wider**: f64 for f64 input, f32 for f32/f16 |
| Metadata       | `has_qjl: bool`                             | **Removed** (always MSE-only)                         |

**Unchanged from current PR:** SORF rotation, Max-Lloyd centroids,
zero-padding for non-power-of-2, slice/take/scalar_at pushdowns, quantized
cosine similarity and dot product, compression scheme integration, minimum dim=3.

**Added to metadata (for forward compat):** `block_size: u32` (always =
padded_dim), `num_blocks: u32` (always = 1), `is_pdx: bool` (always = false).
These fields are inert in Stage 1 but enable Stage 2/3 decoders to read
Stage 1 files.

This is a complete, useful encoding for all dimensions. Power-of-2 dimensions
have zero padding waste; non-power-of-2 dimensions have the padding overhead
described above.
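As a sketch, the Stage 1 metadata could look like the following (field names are assumed here, mirroring the fields above; the crate's actual definition may differ):

```rust
// Hypothetical Stage 1 metadata carrying the forward-compat fields.
#[derive(Debug, Clone, PartialEq)]
struct TurboQuantMetadata {
    dimension: u32,  // logical (unpadded) vector dimension
    b_mse: u8,       // MSE bits per dimension (2^b_mse centroids)
    block_size: u32, // Stage 1: always = padded_dim
    num_blocks: u32, // Stage 1: always 1
    is_pdx: bool,    // Stage 1: always false
}
```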

### Stage 2: Block decomposition

For non-power-of-2 dimensions, split into blocks of size B (as determined by the
table above). Each full block gets an independent B-dim SORF rotation.

**Changes vs. Stage 1:**

| Aspect                | Stage 1                              | Stage 2                                                                      |
| --------------------- | ------------------------------------ | ---------------------------------------------------------------------------- |
| Block count           | k = 1 (single block at padded_dim)   | **k = d/B** (multiple blocks, no padding)                                    |
| SORF dimension        | padded_dim (e.g., 1024 for d=768)    | **B** (e.g., 256 for d=768)                                                  |
| Rotation signs        | Single set, len = 3 × padded_dim     | **k sets**, len = k × 3 × B                                                  |
| Centroids             | Computed for padded_dim distribution | **Computed for B-dim distribution** (different codebook!)                    |
| Norms child           | `PrimitiveArray<F>`, 1 per vector    | **`PrimitiveArray<F>` (k=1) or `FixedSizeListArray<F>` (k>1)**, same dtype F |
| Codes list_size       | padded_dim                           | **k × B** (= d for no-straggler dims)                                        |
| Scheme compress()     | Pad → single SORF → quantize         | **Choose B → split → per-block normalize/rotate/quantize**                   |
| Quantized dot product | Single sum over padded_dim centroids | **Per-block weighted sum** (Σ_k norm_a_k · norm_b_k · unit_dot_k)            |
| L2 norm readthrough   | O(1) — return stored norm            | **O(k)** — compute √(Σ_k norm_k²)                                            |
| Zero-padding waste    | Up to 33% (768→1024)                 | **Zero** for common dims                                                     |

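The two per-block formulas in the table can be sketched directly. Function names and signatures here are illustrative, not the crate's API; `unit_dots[i]` stands in for the dot product of the two unit-norm decoded blocks, and `norms_*` for the stored per-block norms:

```rust
// Per-block weighted dot product: Σ_k norm_a_k · norm_b_k · unit_dot_k.
fn blockwise_dot(norms_a: &[f32], norms_b: &[f32], unit_dots: &[f32]) -> f32 {
    norms_a
        .iter()
        .zip(norms_b)
        .zip(unit_dots)
        .map(|((na, nb), ud)| na * nb * ud)
        .sum()
}

// O(k) L2 norm readthrough: √(Σ_k norm_k²).
fn l2_norm_readthrough(norms: &[f32]) -> f32 {
    norms.iter().map(|n| n * n).sum::<f32>().sqrt()
}
```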
**Unchanged from Stage 1:** SORF construction (3-round HD), Max-Lloyd algorithm,
f32 internal quantization, slice/take semantics (per-row data sliced, shared
data cloned), bitpacked rotation sign storage, compression scheme trait.

**For power-of-2 dimensions**: B = d, k = 1. The encoding produces an identical
wire format to Stage 1 (single norm, single SORF, single codes block). Files
written by a Stage 2 encoder with k=1 are therefore fully readable by Stage 1
decoders.

**Key design properties:**

- **Self-contained.** The TurboQuant array handles block splitting, per-block
  normalization, rotation, and quantization internally. No parent cooperation
  is needed.
- **One shared centroid set** for all blocks at the same B-dim distribution.
- **Per-block SORF rotation signs.** Each block's SORF is independent (different
  seed). Signs are 3 × B bits per block.

#### Norm architecture

@@ -248,6 +350,23 @@ B × B random orthogonal matrix (QR of Gaussian). Storage at B=256: 256 KB per
block. For d=768 with k=3: 768 KB total. Amortizes for large columns (100K+
vectors). Each block must have an **independent** rotation matrix.

**Why not DCT?** The PDX implementation [pdx-impl] uses DCT (via FFTW) as a fast
rotation for ADSampling. DCT is O(B log B) and invertible, but it is a **fixed
structured transform**, not a random rotation — it does not produce the Beta
marginal distribution `(1-x²)^((d-3)/2)` that TurboQuant's Max-Lloyd centroids
are optimized for. ADSampling only needs approximate coordinate independence
(for hypothesis-testing pruning), so DCT suffices there. TurboQuant needs a
specific known marginal distribution, so only random orthogonal rotations (QR or
SORF) are suitable.

**Shared rotation with ADSampling.** Both TurboQuant and ADSampling apply a
random orthogonal rotation to make coordinates independent. If we integrate
ADSampling-style dimension pruning (see Stage 3), the same rotation could serve
both purposes: producing the Beta distribution for quantization AND enabling
hypothesis-testing for early pruning. This would avoid rotating the data twice
and is a natural future optimization when combining block-TurboQuant with
PDX-style scans.

#### Quantized-domain operations

All quantized operations read per-block norms from the internal child array:
@@ -305,6 +424,21 @@ x̃ = concat(x̂₀, ..., x̂ₖ₋₁)
Transpose code storage from row-major to dimension-major within groups of 64
vectors [4]. The 64-vector group size is independent of B.

**Changes vs. Stage 2:**

| Aspect                 | Stage 2                                          | Stage 3                                                           |
| ---------------------- | ------------------------------------------------ | ----------------------------------------------------------------- |
| Codes layout           | Row-major (all codes for one vector contiguous)  | **Dimension-major within 64-vector chunks**                       |
| Metadata               | `is_pdx = false`                                 | **`is_pdx = true`**                                               |
| Distance kernel        | Per-vector loop with per-element centroid lookup | **SIMD-friendly 64-vector inner loop with distance-table lookup** |
| Decode path            | Direct inverse SORF per vector                   | **Un-transpose 64-vector chunk first**, then inverse SORF         |
| QJL signs (if present) | Row-major                                        | **Also transposed** (same PDX layout as codes)                    |

**Unchanged from Stage 2:** Block size B, centroid computation, norm storage,
SORF rotation, all encoding logic (the PDX transpose is applied after encoding).
The encode path produces row-major codes then transposes; the decode path
un-transposes then decodes.
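The transpose step itself is mechanical. A sketch (our names; an assumption is that a partial final chunk is packed with its own, smaller vector stride rather than padded to 64):

```rust
// Within each chunk of up to 64 vectors, restore codes from row-major
// (vector, dim) order to dimension-major ("1 dim × 64 vecs") order.
const PDX_CHUNK: usize = 64;

fn pdx_transpose(codes: &[u8], num_vecs: usize, dim: usize) -> Vec<u8> {
    assert_eq!(codes.len(), num_vecs * dim);
    let mut out = vec![0u8; codes.len()];
    for (chunk_idx, chunk) in codes.chunks(PDX_CHUNK * dim).enumerate() {
        let vecs_in_chunk = chunk.len() / dim;
        let base = chunk_idx * PDX_CHUNK * dim;
        for v in 0..vecs_in_chunk {
            for d in 0..dim {
                // row-major (v, d) → dimension-major (d, v) within the chunk
                out[base + d * vecs_in_chunk + v] = chunk[v * dim + d];
            }
        }
    }
    out
}
```

Running the same permutation in reverse gives the un-transpose used on the decode path.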

Within each 64-vector chunk, codes are stored dimension-major:

```
@@ -348,17 +482,47 @@ for tq_block in 0..k {
}
```

**Int8 layout variant.** The PDX implementation [pdx-impl] uses a different
tiling for int8 data: "4 dims × 16 vecs" to leverage VPDPBUSD/UDOT hardware
dot-product instructions. For TurboQuant codes at b_mse ≤ 8, codes are u8
centroid indices (not linear values), so VPDPBUSD doesn't apply directly — we
need the distance-table-lookup path shown above. However, if we support a linear
quantization mode (b_mse=8 with uniform centroids), the "4 dims × 16 vecs"
layout could enable direct hardware dot-product on the codes, bypassing the
lookup table entirely. This is a potential Stage 3 optimization to evaluate.

**ADSampling integration.** The PDX dimension-pruning approach (ADSampling [4])
is complementary to TurboQuant's block structure. During a scan, the pruner
could evaluate partial distances after each TQ block (B dimensions) and skip
remaining blocks if the partial L2 distance already exceeds the candidate
threshold. This requires the per-block norm weighting to happen at block
boundaries (as shown in the kernel above), which our design already provides.
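The control flow of that block-level early termination can be sketched shape-only; `partial_dist` stands in for the per-block weighted distance contribution, and the signature is ours, not a proposed API:

```rust
// Accumulate per-block partial distances, pruning the candidate as soon
// as the running total exceeds the current threshold.
fn scan_with_pruning<F: Fn(usize) -> f32>(
    num_blocks: usize,
    threshold: f32,
    partial_dist: F,
) -> Option<f32> {
    let mut dist = 0.0;
    for b in 0..num_blocks {
        dist += partial_dist(b); // this block's weighted contribution
        if dist > threshold {
            return None; // pruned: cannot beat the current candidate
        }
    }
    Some(dist)
}
```

This works for L2 because each block's contribution is non-negative, so the partial sum is a valid lower bound on the full distance.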

**Open design questions:**

- Slice/take on PDX-transposed codes: produce row-major (simpler) or preserve
  PDX (aligned 64-vector slices only)?
- Is PDX a property of the encoding or a separate layout layer?
- How does the compressor see the transposed codes?
- Should we support the "4 dims × 16 vecs" int8 layout variant alongside the
  "1 dim × 64 vecs" float-style layout?

### QJL correction (deferred — experimental)

Based on community findings [8], QJL is deferred to after the MSE stages are
validated.

**Changes vs. MSE-only (if pursued):**

| Aspect                 | MSE-only                         | MSE + QJL                                                       |
| ---------------------- | -------------------------------- | --------------------------------------------------------------- |
| Bit budget             | All b bits → MSE (2^b centroids) | b-1 bits MSE + 1 bit QJL (2^(b-1) centroids)                    |
| Inner product estimate | Biased (MSE quantization noise)  | Unbiased (QJL correction, Theorem 2 [1])                        |
| Additional children    | None                             | QJL signs, QJL residual norms, QJL projection params            |
| Encode cost            | SORF only                        | SORF + QJL projection (O(B²) for Gaussian, O(B log B) for SORF) |
| Decode cost            | Inverse SORF only                | Inverse SORF + QJL inverse projection                           |

If pursued, four strategies should be compared:

| Strategy             | Theoretical           | Speed            | Storage         |
| -------------------- | --------------------- | ---------------- | --------------- |
@@ -377,9 +541,24 @@ bit widths, so QJL may not be worth the complexity.

## Array layout

### Stage 1 (MSE-only single block)

```
TurboQuantArray
├── metadata: { dimension, b_mse, block_size (= padded_dim),
│               num_blocks (= 1), is_pdx (= false) }
│
│   # Per-row children
├── codes: FixedSizeListArray<u8>           # list_size = padded_dim
├── norms: PrimitiveArray<F>                # len = num_rows (F = f64 for f64, f32 otherwise)
│
│   # Shared children
├── centroids: PrimitiveArray<f32>          # len = 2^b_mse
├── mse_rotation_signs: PrimitiveArray<u8>  # len = 3 × padded_dim (bitpacked)
```

Same structure as the [current PR][current-impl] minus the 3 QJL slots, plus
the forward-compatible metadata fields and dtype-matching norms.

### Stage 2 (block decomposition)

@@ -535,7 +714,8 @@ Stage 1 files without migration.
**Norms are always internal children.** The TurboQuant array is self-contained —
it stores norms as a child slot, not in a parent encoding. This means:

- Stage 1: norms child is `PrimitiveArray<F>`, one norm per vector (F = f64 for
  f64 input, f32 otherwise).
- Stage 2 with k=1 (power-of-2 dims): same as Stage 1, identical wire format.
- Stage 2 with k>1: norms child is `FixedSizeListArray<F>`, k norms per vector.
