@@ -45,13 +45,52 @@ MSE stage — O(d²) storage and O(d²) per-vector. For the QJL stage, the paper
 uses a random Gaussian projection matrix S with i.i.d. N(0,1) entries (not an
 orthogonal rotation); this distinction matters for the unbiasedness proof.
 
-Our [current implementation][current-impl] substitutes a 3-round Structured
-Orthogonal Random Features (SORF) transform `HD₃·HD₂·HD₁` [5] for both the MSE
-rotation and the QJL projection, giving O(d) storage and O(d log d) per-vector.
-The 3-round SORF construction was introduced for kernel approximation [5] and
-approximates a random orthogonal matrix. Note that this is distinct from the
-single-round SRHT (`R·H·D`) analyzed by Tropp [3] and the FJLT (`P·H·D`) of
-Ailon-Chazelle [2], both of which are dimensionality-reducing projections.
+### Current Vortex implementation
+
+Our [current implementation][current-impl] (Rust, in the `vortex-tensor` crate)
+implements TurboQuant as a Vortex array encoding that compresses
+`FixedSizeList<float>` arrays — the storage format of `Vector` and
+`FixedShapeTensor` extension types. Key design choices and characteristics:
+
+**Rotation.** Instead of the paper's O(d²) QR rotation, we use a 3-round
+Structured Orthogonal Random Features (SORF) transform `HD₃·HD₂·HD₁` [5] for
+both the MSE rotation and the QJL projection, giving O(d) storage (3d sign bits,
+bitpacked) and O(d log d) per-vector. The rotation signs are stored as a
+bitpacked child array rather than recomputed from a seed at decode time. The
+3-round SORF was introduced for kernel approximation [5] and approximates a
+random orthogonal matrix. It is distinct from the single-round SRHT (`R·H·D`)
+analyzed by Tropp [3] and the FJLT (`P·H·D`) of Ailon-Chazelle [2], both of
+which are dimensionality-reducing projections rather than rotation
+approximations.
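As a sketch of the transform's mechanics (not the `vortex-tensor` code — `fwht` and `sorf_rotate` are illustrative names, and real sign bits would come from a seeded RNG rather than hand-picked values): each round multiplies by a diagonal of ±1 signs (D) and then applies a fast Walsh–Hadamard transform (H), giving O(d log d) work and O(d) sign storage per round.

```rust
/// Unnormalized in-place fast Walsh–Hadamard transform; len must be a power of 2.
fn fwht(x: &mut [f32]) {
    let n = x.len();
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(2 * h) {
            for j in i..i + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
}

/// One SORF rotation: rounds of sign-flip (D) + Hadamard (H), each scaled by
/// 1/sqrt(d) so that every H·D factor is orthonormal.
fn sorf_rotate(x: &mut [f32], sign_rounds: &[Vec<f32>]) {
    let scale = 1.0 / (x.len() as f32).sqrt();
    for signs in sign_rounds {
        for (xi, s) in x.iter_mut().zip(signs) {
            *xi *= *s;
        }
        fwht(x);
        for xi in x.iter_mut() {
            *xi *= scale;
        }
    }
}

fn main() {
    let mut x = vec![0.5f32, -1.0, 2.0, 0.25];
    let before: f32 = x.iter().map(|v| v * v).sum::<f32>().sqrt();
    let signs = vec![
        vec![1.0, -1.0, -1.0, 1.0],
        vec![-1.0, 1.0, -1.0, 1.0],
        vec![1.0, 1.0, -1.0, -1.0],
    ];
    sorf_rotate(&mut x, &signs);
    let after: f32 = x.iter().map(|v| v * v).sum::<f32>().sqrt();
    // An (approximate) rotation preserves the L2 norm.
    assert!((before - after).abs() < 1e-4);
}
```

Because each H·D factor is orthonormal, the composed transform preserves norms exactly (up to floating-point error), which is what lets norms be stored once per vector and factored out of the quantized codes.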
+
+**Centroids.** Max-Lloyd centroids are computed via numerical integration
+(trapezoid rule, 1000 points per interval) of the marginal Beta distribution at
+the padded dimension, using the `HalfIntExponent` type for exact
+integer/half-integer exponent arithmetic. Centroids are cached in a global
+`DashMap` keyed by `(dimension, bit_width)` and stored as a shared
+`PrimitiveArray<f32>` child.
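The centroid-update step that Max-Lloyd iterates can be sketched as a trapezoid-rule conditional mean over one quantization interval. `marginal_pdf` and `interval_centroid` are hypothetical names, and the real code uses `HalfIntExponent` for the exponent rather than a floating-point `powf`:

```rust
/// Marginal density of one coordinate of a random unit vector in R^d, up to
/// normalization: p(x) ∝ (1 - x²)^((d-3)/2). At d = 2 the exponent is -1/2,
/// which blows up as |x| → 1 — the singularity behind the dim ≥ 3 rule.
fn marginal_pdf(x: f64, d: u32) -> f64 {
    (1.0 - x * x).max(0.0).powf((d as f64 - 3.0) / 2.0)
}

/// Max-Lloyd centroid update for one interval [a, b]: the conditional mean
/// ∫x·p(x)dx / ∫p(x)dx, approximated with the trapezoid rule.
fn interval_centroid(a: f64, b: f64, d: u32, points: usize) -> f64 {
    let h = (b - a) / points as f64;
    let (mut mass, mut moment) = (0.0f64, 0.0f64);
    for i in 0..points {
        let (x0, x1) = (a + i as f64 * h, a + (i + 1) as f64 * h);
        let (p0, p1) = (marginal_pdf(x0, d), marginal_pdf(x1, d));
        mass += 0.5 * (p0 + p1) * h;
        moment += 0.5 * (x0 * p0 + x1 * p1) * h;
    }
    moment / mass
}

fn main() {
    // Symmetric interval → centroid at (numerically) zero.
    assert!(interval_centroid(-0.5, 0.5, 16, 1000).abs() < 1e-9);
    // The centroid always lies inside its interval.
    let c = interval_centroid(0.1, 0.3, 16, 1000);
    assert!(c > 0.1 && c < 0.3);
}
```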
+
+**Array structure.** The `TurboQuantArray` stores up to 7 child slots: codes
+(`FixedSizeListArray<u8>`, one per vector, list_size = padded_dim), norms
+(`PrimitiveArray<f32>`), centroids (shared), MSE rotation signs (shared,
+bitpacked), and optionally 3 QJL children (signs, residual norms, QJL rotation
+signs). Codes are stored as u8 centroid indices; the cascade compressor
+(BitPacked encoding) handles packing to the actual bit width on disk.
+
+**Compute pushdowns.** Slice and take propagate to per-row children (codes,
+norms) while sharing rotation signs and centroids. Quantized cosine similarity
+and dot product operate directly on codes and centroids without decompression.
+L2 norm returns the stored norm directly (O(1) readthrough).
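The decode-free dot product reduces to one centroid-table lookup per code pair, scaled by the stored norms. A minimal sketch, assuming unit-vector centroids and ignoring the QJL correction (`quantized_dot` is an illustrative free function, not the pushdown API):

```rust
/// Quantized-domain dot product: uses only u8 codes, the shared centroid
/// table, and the stored per-vector norms — no decompression.
fn quantized_dot(
    codes_a: &[u8],
    norm_a: f32,
    codes_b: &[u8],
    norm_b: f32,
    centroids: &[f32],
) -> f32 {
    let unit_dot: f32 = codes_a
        .iter()
        .zip(codes_b)
        .map(|(&ca, &cb)| centroids[ca as usize] * centroids[cb as usize])
        .sum();
    // Norms were factored out before rotation/quantization; restore them here.
    norm_a * norm_b * unit_dot
}

fn main() {
    // 1-bit toy codebook: centroid 0 = -0.5, centroid 1 = +0.5.
    let centroids = [-0.5f32, 0.5];
    // Opposite signs in half the dims → unit dot cancels to zero.
    assert_eq!(quantized_dot(&[0, 1, 1, 0], 2.0, &[0, 1, 0, 1], 3.0, &centroids), 0.0);
    assert_eq!(quantized_dot(&[1, 1], 1.0, &[1, 1], 1.0, &centroids), 0.5);
}
```

Cosine similarity follows the same shape, dividing out the reconstructed vector norms instead of multiplying the stored ones back in.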
+
+**Compression scheme.** `TurboQuantScheme` implements the `Scheme` trait for the
+BtrBlocks cascading compressor. It matches `Vector` and `FixedShapeTensor`
+extension arrays with non-nullable float elements and dimension ≥ 3, using the
+default config (5-bit QJL = 4-bit MSE + 1-bit QJL, seed 42).
+
+**Input handling.** All float types (f16, f32, f64) are converted to f32 before
+quantization. Per-vector L2 norms are computed and stored as f32. Non-power-of-2
+dimensions are zero-padded to the next power of 2 for SORF compatibility. The
+minimum dimension is 3 (d=2 causes a singularity in the Beta distribution
+exponent).
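The padding step is trivial but worth making concrete, since it is the source of the overhead discussed later (zero-padding does not change the L2 norm, so the stored norms are unaffected). A sketch, with `pad_to_pow2` as an illustrative name:

```rust
/// Zero-pad a vector to the next power-of-2 length (SORF requires one).
/// The original dimension is kept in metadata so decode can truncate.
fn pad_to_pow2(v: &[f32]) -> Vec<f32> {
    let mut out = v.to_vec();
    out.resize(v.len().next_power_of_two(), 0.0);
    out
}

fn main() {
    assert_eq!(pad_to_pow2(&[1.0; 768]).len(), 1024); // ~33% padding overhead
    assert_eq!(pad_to_pow2(&[1.0; 512]).len(), 512); // powers of 2 pass through
}
```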
 
 ### Reference implementation bugs
 
@@ -105,13 +144,48 @@ The SORF requires power-of-2 input dimension. For non-power-of-2 dimensions
 
 ### PDX
 
-PDX [4] is a data layout for vector similarity search that stores dimensions in
-a vertical (dimension-major) layout within fixed-size blocks of 64 vectors. This
-enables the compiler to auto-vectorize the inner distance loop over vectors
+PDX [4] is a data layout for vector similarity search. The paper (SIGMOD '25)
+describes a dimension-major layout within fixed-size blocks of 64 vectors,
+enabling the compiler to auto-vectorize the inner distance loop over vectors
 rather than dimensions, achieving on average 2× speedups over SIMD-optimized
 row-major kernels on modern CPUs. The block size of 64 is empirically optimal
 across AVX-512, AVX2, and NEON architectures [4].
 
+**PDX implementation evolution.** The [open-source implementation][pdx-impl]
+has evolved beyond the paper in several ways relevant to this RFC:
+
+- **8-bit scalar quantization** (`IndexPDXIVFTreeSQ8`): Maps floats to 0-255 via
+  linear min-max scaling. The int8 layout differs from float32: dimensions are
+  packed in groups of 4 ("4 dims × 16 vecs") to leverage hardware dot-product
+  instructions (VPDPBUSD on x86, UDOT/SDOT on ARM) that process 4 byte pairs
+  per operation. This is a different tiling than the paper's "1 dim × 64 vecs."
+- **ADSampling with random rotation**: The pruner applies a random orthogonal
+  rotation (QR of Gaussian, or DCT when FFTW is available) to the entire
+  collection as a preprocessing step. This makes coordinates approximately
+  independent, enabling dimension-by-dimension hypothesis testing for early
+  pruning. The rotation serves a similar purpose to TurboQuant's rotation —
+  making the coordinate distribution known — but for pruning rather than
+  quantization.
+- **Dimension zones**: Consecutive dimensions are grouped into zones; at query
+  time, zones are ranked by "distance-to-means" and the most discriminative
+  zones are scanned first, enabling faster pruning.
+- **Future: 1-bit vectors** are mentioned as planned.
+
+**Implications for our design.** The PDX paper's float32 layout ("1 dim × 64
+vecs") maps cleanly to our quantized-code scan kernel, where the inner loop
+gathers from a centroid-product distance table over 64 vectors. However, if we
+pursue direct int8 arithmetic (b_mse=8 with linear centroids, see GPU section),
+the "4 dims × 16 vecs" int8 layout from the PDX implementation may be more
+appropriate, as it enables hardware dot-product instructions.
+
+Additionally, ADSampling's dimension-pruning approach is complementary to
+TurboQuant's block structure: when scanning with block decomposition, the pruner
+could skip entire TQ blocks (B dimensions at a time) if the partial distance
+already exceeds the candidate threshold. This combines the storage efficiency of
+quantization with the computational savings of early termination.
+
+[pdx-impl]: https://github.com/cwida/PDX
+
 ## Proposal
 
 ### Block size strategy
@@ -148,40 +222,68 @@ divides d. This eliminates stragglers entirely for common embedding dimensions:
 
 ### Stage 1: MSE-only TurboQuant (immediate — split from current PR)
 
-Split the [current PR][current-impl] to extract and merge the MSE-only subset
-(removing QJL encoding, QJL array slots, and QJL-specific tests). The QJL code
-can be preserved on a separate branch for Phase 4. The MSE-only encoding
-provides:
+Split the [current PR][current-impl] to extract and merge the MSE-only subset.
+The QJL code can be preserved on a separate branch for Phase 4.
+
+**Changes vs. current PR:**
+
+| Aspect         | Current PR                                  | Stage 1                                               |
+| -------------- | ------------------------------------------- | ----------------------------------------------------- |
+| QJL support    | Full (encode, decode, QJL slots, QJL tests) | **Removed**                                           |
+| Array slots    | 7 (4 MSE + 3 QJL)                           | **4** (codes, norms, centroids, rotation_signs)       |
+| Scheme default | 5-bit QJL (4-bit MSE + 1-bit QJL)           | **5-bit MSE-only** (32 centroids)                     |
+| Norms dtype    | Always f32                                  | **Same-or-wider**: f64 for f64 input, f32 for f32/f16 |
+| Metadata       | `has_qjl: bool`                             | **Removed** (always MSE-only)                         |
-- SORF-based random rotation at the padded dimension
-- Max-Lloyd scalar quantization with shared centroids
-- Per-vector norm storage (single f32, regardless of input dtype — the
-  dtype-matching norm behavior described in Stage 2 is a later change)
-- Slice, take, scalar_at compute pushdowns
-- Quantized-domain cosine similarity and dot product
-- File format integration via the compression scheme
+**Unchanged from current PR:** SORF rotation, Max-Lloyd centroids,
+zero-padding for non-power-of-2, slice/take/scalar_at pushdowns, quantized
+cosine similarity and dot product, compression scheme integration, minimum dim=3.
 
-This is a complete, useful encoding for power-of-2 dimensions. For non-power-of-2
-dimensions it has the padding overhead described above.
+**Added to metadata (for forward compat):** `block_size: u32` (always =
+padded_dim), `num_blocks: u32` (always = 1), `is_pdx: bool` (always = false).
+These fields are inert in Stage 1 but enable Stage 2/3 decoders to read
+Stage 1 files.
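A sketch of what that metadata could look like as a struct — field names are illustrative, not the serialized format:

```rust
/// Hypothetical shape of the Stage 1 metadata, including the inert
/// forward-compat fields described above.
struct TurboQuantMetadata {
    dimension: u32,
    b_mse: u8,
    block_size: u32, // Stage 1: always the padded dimension
    num_blocks: u32, // Stage 1: always 1
    is_pdx: bool,    // Stage 1: always false
}

fn stage1_metadata(dimension: u32, b_mse: u8) -> TurboQuantMetadata {
    TurboQuantMetadata {
        dimension,
        b_mse,
        block_size: dimension.next_power_of_two(),
        num_blocks: 1,
        is_pdx: false,
    }
}

fn main() {
    let m = stage1_metadata(768, 5);
    assert_eq!(m.block_size, 1024);
    assert_eq!(m.num_blocks, 1);
    assert!(!m.is_pdx);
}
```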
+
+This is a complete, useful encoding for all dimensions. Power-of-2 dimensions
+have zero padding waste; non-power-of-2 dimensions have the padding overhead
+described above.
 
 ### Stage 2: Block decomposition
 
 For non-power-of-2 dimensions, split into blocks of size B (as determined by the
 table above). Each full block gets an independent B-dim SORF rotation.
-**Key properties:**
+**Changes vs. Stage 1:**
+
+| Aspect                | Stage 1                              | Stage 2                                                                      |
+| --------------------- | ------------------------------------ | ---------------------------------------------------------------------------- |
+| Block count           | k = 1 (single block at padded_dim)   | **k = d/B** (multiple blocks, no padding)                                    |
+| SORF dimension        | padded_dim (e.g., 1024 for d=768)    | **B** (e.g., 256 for d=768)                                                  |
+| Rotation signs        | Single set, len = 3 × padded_dim     | **k sets**, len = k × 3 × B                                                  |
+| Centroids             | Computed for padded_dim distribution | **Computed for B-dim distribution** (different codebook!)                    |
+| Norms child           | `PrimitiveArray<F>`, 1 per vector    | **`PrimitiveArray<F>` (k=1) or `FixedSizeListArray<F>` (k>1)**, same dtype F |
+| Codes list_size       | padded_dim                           | **k × B** (= d for no-straggler dims)                                        |
+| Scheme compress()     | Pad → single SORF → quantize         | **Choose B → split → per-block normalize/rotate/quantize**                   |
+| Quantized dot product | Single sum over padded_dim centroids | **Per-block weighted sum** (Σ_k norm_a_k · norm_b_k · unit_dot_k)            |
+| L2 norm readthrough   | O(1) — return stored norm            | **O(k)** — compute √(Σ_k norm_k²)                                            |
+| Zero-padding waste    | Up to 33% (768→1024)                 | **Zero** for common dims                                                     |
+
+**Unchanged from Stage 1:** SORF construction (3-round HD), Max-Lloyd algorithm,
+f32 internal quantization, slice/take semantics (per-row data sliced, shared
+data cloned), bitpacked rotation sign storage, compression scheme trait.
+
+**For power-of-2 dimensions**: B = d, k = 1. The encoding produces an identical
+wire format to Stage 1 (single norm, single SORF, single codes block). A Stage 2
+encoder writing k=1 data is fully backward-compatible with Stage 1 decoders.
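The two per-block reductions from the table can be written out directly. A minimal sketch with hypothetical free functions, where `unit_dots` stands for the quantized per-block unit-vector dot products:

```rust
/// Per-block weighted dot product: Σ_k ‖a_k‖ · ‖b_k‖ · ⟨â_k, b̂_k⟩.
fn blockwise_dot(norms_a: &[f32], norms_b: &[f32], unit_dots: &[f32]) -> f32 {
    norms_a
        .iter()
        .zip(norms_b)
        .zip(unit_dots)
        .map(|((na, nb), ud)| na * nb * ud)
        .sum()
}

/// O(k) L2-norm readthrough: √(Σ_k ‖x_k‖²), since blocks are disjoint
/// coordinate ranges of the same vector.
fn blockwise_l2(norms: &[f32]) -> f32 {
    norms.iter().map(|n| n * n).sum::<f32>().sqrt()
}

fn main() {
    // Two blocks: 1·3·0.5 + 2·4·0.25 = 3.5
    assert_eq!(blockwise_dot(&[1.0, 2.0], &[3.0, 4.0], &[0.5, 0.25]), 3.5);
    // 3-4-5 across two blocks.
    assert_eq!(blockwise_l2(&[3.0, 4.0]), 5.0);
}
```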
+
+**Key design properties:**
 
 - **Self-contained.** The TurboQuant array handles block splitting, per-block
-  normalization, rotation, and quantization internally. It accepts arbitrary
-  (non-unit-norm) input vectors and stores per-block norms as internal children.
-  No parent cooperation is needed — the array can decode without any parent
-  context.
-- **One shared centroid set** for all blocks. All blocks use the same B-dim
-  marginal distribution, so a single Max-Lloyd codebook serves every block.
+  normalization, rotation, and quantization internally. No parent cooperation
+  is needed.
+- **One shared centroid set** for all blocks at the same B-dim distribution.
 - **Per-block SORF rotation signs.** Each block's SORF is independent (different
   seed). Signs are 3 × B bits per block.
-- **For power-of-2 dimensions**: B = d, k = 1. The encoding is functionally
-  identical to Stage 1 (single norm, single SORF rotation, no block splitting).
 
 
 #### Norm architecture
@@ -248,6 +350,23 @@ B × B random orthogonal matrix (QR of Gaussian). Storage at B=256: 256 KB per
 block. For d=768 with k=3: 768 KB total. Amortizes for large columns (100K+
 vectors). Each block must have an **independent** rotation matrix.
 
+**Why not DCT?** The PDX implementation [pdx-impl] uses DCT (via FFTW) as a fast
+rotation for ADSampling. DCT is O(B log B) and invertible, but it is a **fixed
+structured transform**, not a random rotation — it does not produce the Beta
+marginal distribution `(1-x²)^((d-3)/2)` that TurboQuant's Max-Lloyd centroids
+are optimized for. ADSampling only needs approximate coordinate independence
+(for hypothesis-testing pruning), so DCT suffices there. TurboQuant needs a
+specific known marginal distribution, so only random orthogonal rotations (QR or
+SORF) are suitable.
+
+**Shared rotation with ADSampling.** Both TurboQuant and ADSampling apply a
+random orthogonal rotation to make coordinates independent. If we integrate
+ADSampling-style dimension pruning (see Stage 3), the same rotation could serve
+both purposes: producing the Beta distribution for quantization AND enabling
+hypothesis-testing for early pruning. This would avoid rotating the data twice
+and is a natural future optimization when combining block-TurboQuant with
+PDX-style scans.
+
 #### Quantized-domain operations
 
 All quantized operations read per-block norms from the internal child array:
@@ -305,6 +424,21 @@ x̃ = concat(x̂₀, ..., x̂ₖ₋₁)
 Transpose code storage from row-major to dimension-major within groups of 64
 vectors [4]. The 64-vector group size is independent of B.
 
+**Changes vs. Stage 2:**
+
+| Aspect                 | Stage 2                                          | Stage 3                                                           |
+| ---------------------- | ------------------------------------------------ | ----------------------------------------------------------------- |
+| Codes layout           | Row-major (all codes for one vector contiguous)  | **Dimension-major within 64-vector chunks**                       |
+| Metadata               | `is_pdx = false`                                 | **`is_pdx = true`**                                               |
+| Distance kernel        | Per-vector loop with per-element centroid lookup | **SIMD-friendly 64-vector inner loop with distance-table lookup** |
+| Decode path            | Direct inverse SORF per vector                   | **Un-transpose 64-vector chunk first**, then inverse SORF         |
+| QJL signs (if present) | Row-major                                        | **Also transposed** (same PDX layout as codes)                    |
+
+**Unchanged from Stage 2:** Block size B, centroid computation, norm storage,
+SORF rotation, all encoding logic (the PDX transpose is applied after encoding).
+The encode path produces row-major codes then transposes; the decode path
+un-transposes then decodes.
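The transpose itself can be sketched in a few lines. `pdx_transpose` is a hypothetical helper, and the tail-chunk handling (chunks with fewer than 64 vectors packed at their actual width) is an assumption, not a settled design:

```rust
/// Transpose row-major codes into dimension-major layout within 64-vector
/// chunks: inside each chunk, all vectors' values for one dimension are
/// contiguous, which is what the 64-wide scan kernel iterates over.
fn pdx_transpose(codes: &[u8], dim: usize) -> Vec<u8> {
    const CHUNK: usize = 64;
    let n = codes.len() / dim;
    let mut out = vec![0u8; codes.len()];
    for base in (0..n).step_by(CHUNK) {
        let rows = CHUNK.min(n - base); // tail chunk may be short
        for v in 0..rows {
            for d in 0..dim {
                out[base * dim + d * rows + v] = codes[(base + v) * dim + d];
            }
        }
    }
    out
}

fn main() {
    // Two 3-dim vectors [1,2,3] and [4,5,6] → per-dimension runs.
    assert_eq!(pdx_transpose(&[1, 2, 3, 4, 5, 6], 3), vec![1, 4, 2, 5, 3, 6]);
}
```

The inverse (un-transpose) is the same loop with the two index expressions swapped, so decode adds only a copy pass per 64-vector chunk.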
+
 Within each 64-vector chunk, codes are stored dimension-major:
 
 ```
@@ -348,17 +482,47 @@ for tq_block in 0..k {
 }
 ```
 
+**Int8 layout variant.** The PDX implementation [pdx-impl] uses a different
+tiling for int8 data: "4 dims × 16 vecs" to leverage VPDPBUSD/UDOT hardware
+dot-product instructions. For TurboQuant codes at b_mse ≤ 8, codes are u8
+centroid indices (not linear values), so VPDPBUSD doesn't apply directly — we
+need the distance-table-lookup path shown above. However, if we support a linear
+quantization mode (b_mse=8 with uniform centroids), the "4 dims × 16 vecs"
+layout could enable direct hardware dot-product on the codes, bypassing the
+lookup table entirely. This is a potential Stage 3 optimization to evaluate.
+
+**ADSampling integration.** The PDX dimension-pruning approach (ADSampling [4])
+is complementary to TurboQuant's block structure. During a scan, the pruner
+could evaluate partial distances after each TQ block (B dimensions) and skip
+remaining blocks if the partial L2 distance already exceeds the candidate
+threshold. This requires the per-block norm weighting to happen at block
+boundaries (as shown in the kernel above), which our design already provides.
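A deliberately simplified, deterministic sketch of the block-skip idea (real ADSampling prunes via a statistical hypothesis test with an error bound, not a hard comparison; `scan_with_pruning` is an illustrative name):

```rust
/// Accumulate per-block partial squared L2 distances; bail out as soon as the
/// running total exceeds the current candidate threshold, skipping the
/// remaining TQ blocks for this vector.
fn scan_with_pruning(partial_d2_per_block: &[f32], threshold: f32) -> Option<f32> {
    let mut acc = 0.0f32;
    for &p in partial_d2_per_block {
        acc += p;
        if acc > threshold {
            return None; // pruned: cannot enter the top-k
        }
    }
    Some(acc)
}

fn main() {
    assert_eq!(scan_with_pruning(&[1.0, 2.0, 3.0], 10.0), Some(6.0));
    // Pruned after the second block — the third is never touched.
    assert_eq!(scan_with_pruning(&[5.0, 6.0, 1.0], 10.0), None);
}
```

Because squared L2 distance decomposes as a sum of non-negative per-block terms, the partial sum is a valid lower bound at every block boundary, which is what makes the early exit safe.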
+
 **Open design questions:**
 
 - Slice/take on PDX-transposed codes: produce row-major (simpler) or preserve
   PDX (aligned 64-vector slices only)?
 - Is PDX a property of the encoding or a separate layout layer?
 - How does the compressor see the transposed codes?
+- Should we support the "4 dims × 16 vecs" int8 layout variant alongside the
+  "1 dim × 64 vecs" float-style layout?
 
 ### QJL correction (deferred — experimental)
 
 Based on community findings [8], QJL is deferred to after the MSE stages are
-validated. If pursued, four strategies should be compared:
+validated.
+
+**Changes vs. MSE-only (if pursued):**
+
+| Aspect                 | MSE-only                         | MSE + QJL                                                       |
+| ---------------------- | -------------------------------- | --------------------------------------------------------------- |
+| Bit budget             | All b bits → MSE (2^b centroids) | b-1 bits MSE + 1 bit QJL (2^(b-1) centroids)                    |
+| Inner product estimate | Biased (MSE quantization noise)  | Unbiased (QJL correction, Theorem 2 [1])                        |
+| Additional children    | None                             | QJL signs, QJL residual norms, QJL projection params            |
+| Encode cost            | SORF only                        | SORF + QJL projection (O(B²) for Gaussian, O(B log B) for SORF) |
+| Decode cost            | Inverse SORF only                | Inverse SORF + QJL inverse projection                           |
+
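For intuition only: QJL-style 1-bit corrections build on sign sketches, whose classic SimHash estimator recovers the angle between two vectors from the fraction of disagreeing sketch bits. The sketch below illustrates that estimator family, not the specific unbiased estimator of Theorem 2 [1]; `sign_sketch` and `angle_estimate` are hypothetical names:

```rust
use std::f32::consts::PI;

/// 1-bit sketch of x: the sign of its projection onto each random direction.
fn sign_sketch(x: &[f32], directions: &[Vec<f32>]) -> Vec<bool> {
    directions
        .iter()
        .map(|r| r.iter().zip(x).map(|(ri, xi)| ri * xi).sum::<f32>() >= 0.0)
        .collect()
}

/// SimHash-style estimate: E[disagreement fraction] = angle / π.
fn angle_estimate(sa: &[bool], sb: &[bool]) -> f32 {
    let disagree = sa.iter().zip(sb).filter(|(a, b)| a != b).count();
    PI * disagree as f32 / sa.len() as f32
}

fn main() {
    let dirs = vec![vec![1.0, 0.3], vec![-0.7, 2.0], vec![0.2, -0.9]];
    let sa = sign_sketch(&[1.0, 0.0], &dirs);
    // Identical vectors → identical sketches → estimated angle 0.
    assert_eq!(angle_estimate(&sa, &sa), 0.0);
    // Opposite vectors flip every sign bit → estimate near π.
    let sb = sign_sketch(&[-1.0, 0.0], &dirs);
    assert!(angle_estimate(&sa, &sb) > 3.0);
}
```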
+If pursued, four strategies should be compared:
 
 | Strategy | Theoretical | Speed | Storage |
 | -------------------- | --------------------- | ---------------- | --------------- |
@@ -377,9 +541,24 @@ bit widths, so QJL may not be worth the complexity.
 
 ## Array layout
 
-### Stage 1 (single block, current)
+### Stage 1 (MSE-only single block)
+
+```
+TurboQuantArray
+├── metadata: { dimension, b_mse, block_size (= padded_dim),
+│               num_blocks (= 1), is_pdx (= false) }
+│
+│   # Per-row children
+├── codes: FixedSizeListArray<u8>          # list_size = padded_dim
+├── norms: PrimitiveArray<F>               # len = num_rows (F = f64 for f64, f32 otherwise)
+│
+│   # Shared children
+├── centroids: PrimitiveArray<f32>         # len = 2^b_mse
+├── mse_rotation_signs: PrimitiveArray<u8> # len = 3 × padded_dim (bitpacked)
+```
 
-Identical to the [current PR][current-impl] array structure.
+Same structure as the [current PR][current-impl] minus the 3 QJL slots, plus
+the forward-compatible metadata fields and dtype-matching norms.
 
 
 ### Stage 2 (block decomposition)
@@ -535,7 +714,8 @@ Stage 1 files without migration.
 **Norms are always internal children.** The TurboQuant array is self-contained —
 it stores norms as a child slot, not in a parent encoding. This means:
 
-- Stage 1: norms child is `PrimitiveArray<f32>`, one norm per vector.
+- Stage 1: norms child is `PrimitiveArray<F>`, one norm per vector (F = f64 for
+  f64 input, f32 otherwise).
 - Stage 2 with k=1 (power-of-2 dims): same as Stage 1, identical wire format.
 - Stage 2 with k>1: norms child is `FixedSizeListArray<F>`, k norms per vector.
 