22
33**Authors:** Will Manning
44**Status:** Proposal
5- **Date:** 2026-04-03
5+ **Date:** 2026-04-02
66
77## Summary
88
@@ -14,7 +14,7 @@ in three stages:
14142. **Block decomposition** (next): for non-power-of-2 dimensions, split into
1515 blocks of size B = the largest power-of-2 ≥ 64 that divides d. For
1616 power-of-2 dimensions, B = d (single block, same as current). Per-block
17- norms externalized.
17+ norms stored as internal children.
18183. **PDX layout** (later): within each block, transpose codes into groups of
1919 64 vectors for SIMD scan performance.
2020
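The block-size rule above can be sketched in a few lines. This is a minimal illustration, not the proposal's implementation: the names `choose_block_size`, `num_blocks`, and `MIN_BLOCK` are hypothetical, and returning `None` when no power of two ≥ 64 divides d is an assumption made here, since this excerpt does not say how such dimensions are handled.

```python
from typing import Optional

MIN_BLOCK = 64  # smallest block size allowed by the rule "power-of-2 >= 64"


def choose_block_size(d: int) -> Optional[int]:
    """Return the largest power of two >= MIN_BLOCK that divides d.

    Returns None when no such divisor exists (assumed behavior for this sketch).
    """
    if d <= 0:
        raise ValueError("dimension must be positive")
    # d & -d isolates the lowest set bit of d, which is the largest
    # power of two that divides d.
    largest_pow2_divisor = d & -d
    return largest_pow2_divisor if largest_pow2_divisor >= MIN_BLOCK else None


def num_blocks(d: int) -> Optional[int]:
    """Number of blocks k = d / B, or None when no valid B exists."""
    b = choose_block_size(d)
    return None if b is None else d // b
```

For a power-of-2 dimension the function returns d itself, so k = 1 and the layout collapses to the single-block case.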
@@ -171,22 +171,27 @@ table above). Each full block gets an independent B-dim SORF rotation.
171171
172172**Key properties:**
173173
174+ - **Self-contained.** The TurboQuant array handles block splitting, per-block
175+ normalization, rotation, and quantization internally. It accepts arbitrary
176+ (non-unit-norm) input vectors and stores per-block norms as internal children.
177+ No parent cooperation is needed: the array can decode without any parent
178+ context.
174179- **One shared centroid set** for all blocks. All blocks use the same B-dim
175180 marginal distribution, so a single Max-Lloyd codebook serves every block.
176- - **Unit-norm assumption.** The TurboQuant array operates only on pre-normalized
177- sub-vectors. Per-block norms are externalized, following the pattern explored
178- in PR #7251 (closed; concept will need reimplementation).
179181- **Per-block SORF rotation signs.** Each block's SORF is independent (different
180182 seed). Signs are 3 × B bits per block.
181183- **For power-of-2 dimensions**: B = d, k = 1. The encoding is functionally
182- identical to Stage 1. The norm remains a single value per vector (not a
183- FixedSizeList with list_size=1). Norm externalization is optional for k=1 and
184- can be deferred to when it provides concrete benefit (e.g., GPU decode).
184+ identical to Stage 1 (single norm, single SORF rotation, no block splitting).
185185
186186#### Norm architecture
187187
188- The per-block norms are stored as a `FixedSizeListArray<F>` with
189- `list_size = num_blocks`, where `F` matches or widens the input element type:
188+ Per-block norms are stored as an **internal child** of the TurboQuant array:
189+
190+ - For k = 1 (power-of-2 dims): `PrimitiveArray<F>` with len = num_rows
191+ (identical to Stage 1's single-norm layout).
192+ - For k > 1: `FixedSizeListArray<F>` with list_size = k, len = num_rows.
193+
194+ The norm dtype `F` matches or widens the input element type:
190195
191196| Input dtype | Norm dtype | Rationale |
192197| ----------- | ---------- | ---------------------------------------------- |
@@ -232,8 +237,9 @@ The actual MSE may depend on block dimension B: at larger B the coordinate
232237distribution is more concentrated (variance ~ 1/B), giving the Max-Lloyd
233238quantizer more to exploit. See Experimental plan.
234239
235- **SORF approximation.** The 3-round SORF `HD₃·HD₂·HD₁` [5] provides
236- 3 × log₂(B) butterfly stages per round (18 at B=64, 24 at B=256, 27 at B=512).
240+ **SORF approximation.** The 3-round SORF `HD₃·HD₂·HD₁` [5] provides log₂(B)
241+ butterfly stages per round × 3 rounds = 3·log₂(B) total (18 at B=64, 24 at
242+ B=256, 27 at B=512).
237243This is a rough heuristic for mixing quality; [5] does not analyze convergence
238244rate as a function of rounds × dimension. Empirical validation is needed.
239245
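To make the stage count concrete, here is a minimal pure-Python sketch of a 3-round `HD₃·HD₂·HD₁` transform that also counts butterfly stages. It assumes each `H` is a normalized Walsh-Hadamard transform and each `D` is a diagonal ±1 sign matrix drawn from a seeded generator. The function names (`fwht`, `sorf`, `random_signs`) are illustrative and are not taken from the PR.

```python
import math
import random


def fwht(x):
    """In-place unnormalized fast Walsh-Hadamard transform.

    len(x) must be a power of two. Returns the number of butterfly
    stages executed, which is log2(len(x)).
    """
    n = len(x)
    h = 1
    stages = 0
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
        stages += 1
    return stages


def random_signs(n, seed, rounds=3):
    """One ±1 sign vector of length n per round (3 × n sign bits in total)."""
    rng = random.Random(seed)
    return [[rng.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(rounds)]


def sorf(x, signs):
    """Apply H·D3·H·D2·H·D1 to x; return (rotated vector, total stages)."""
    y = list(x)
    scale = 1.0 / math.sqrt(len(y))  # makes each Hadamard round orthonormal
    total_stages = 0
    for round_signs in signs:  # D1 is applied first, then D2, then D3
        y = [s * v for s, v in zip(round_signs, y)]
        total_stages += fwht(y)
        y = [v * scale for v in y]
    return y, total_stages
```

Because every round is orthonormal, the transform preserves the vector's norm, which is why per-block norms can be stored separately from the rotated, quantized coordinates.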
@@ -244,7 +250,7 @@ vectors). Each block must have an **independent** rotation matrix.
244250
245251#### Quantized-domain operations
246252
247- All quantized operations require per-block norms:
253+ All quantized operations read per-block norms from the internal child array:
248254
249255- **L2 distance**: `‖a-b‖² = Σ_k ‖aₖ‖² + Σ_k ‖bₖ‖² - 2·Σ_k ‖aₖ‖·‖bₖ‖·
250256unit_dotₖ`. Primary ANN metric; reuses per-block dot product and norms.
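The L2 identity above can be checked numerically. The sketch below is illustrative only: it uses exact unit sub-vectors as a stand-in for the quantized per-block dot product, so it verifies the algebra and says nothing about quantization error. The helper names are hypothetical.

```python
import math


def split_blocks(v, block_size):
    """Split v into consecutive blocks of block_size elements."""
    return [v[i:i + block_size] for i in range(0, len(v), block_size)]


def norm_and_unit(block):
    """Return (norm, unit vector); a zero block maps to a zero unit vector."""
    n = math.sqrt(sum(x * x for x in block))
    if n == 0.0:
        return 0.0, [0.0] * len(block)
    return n, [x / n for x in block]


def blockwise_l2_squared(a, b, block_size):
    """‖a-b‖² assembled from per-block norms and per-block unit dot products."""
    total = 0.0
    for block_a, block_b in zip(split_blocks(a, block_size),
                                split_blocks(b, block_size)):
        na, ua = norm_and_unit(block_a)
        nb, ub = norm_and_unit(block_b)
        # In the quantized domain this dot product would be estimated from
        # codes and centroids; here it is computed exactly.
        unit_dot = sum(x * y for x, y in zip(ua, ub))
        total += na * na + nb * nb - 2.0 * na * nb * unit_dot
    return total
```

Only the `unit_dot` term depends on the codes; the two norm sums come straight from the norms child.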
@@ -279,8 +285,9 @@ for i in 0..k:
279285 else:
280286 cᵢ[j] = 0
281287
282- Store: codes (k × B per vector), block_norms (k per vector),
283- centroids (2^b_mse, shared), SORF signs (k × 3 × B, shared)
288+ Store (all as internal children):
289+ codes (k × B per vector), norms (k per vector),
290+ centroids (2^b_mse, shared), SORF signs (k × 3 × B, shared)
284291```
285292
286293#### Decoding algorithm
@@ -289,7 +296,7 @@ Store: codes (k × B per vector), block_norms (k per vector),
289296for i in 0..k:
290297 r̂ᵢ[j] = centroids[cᵢ[j]]
291298 ûᵢ = SORF⁻¹ᵢ(r̂ᵢ)
292- x̂ᵢ = nᵢ × ûᵢ
299+ x̂ᵢ = nᵢ × ûᵢ (nᵢ read from internal norms child)
293300x̃ = concat(x̂₀, ..., x̂ₖ₋₁)
294301```
295302
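The decoding loop above translates fairly directly into code. The following is a sketch under stated assumptions: codes are a flat list of k × B centroid indices, norms hold one value per block, and each block's inverse rotation is passed in as a callable (a hypothetical interface chosen for illustration, not the array's actual API).

```python
from typing import Callable, List, Sequence

Rotation = Callable[[List[float]], List[float]]


def decode_vector(
    codes: Sequence[int],
    norms: Sequence[float],
    centroids: Sequence[float],
    block_size: int,
    inverse_rotations: Sequence[Rotation],
) -> List[float]:
    """Reconstruct one vector from its per-block codes and norms."""
    k = len(norms)
    if len(codes) != k * block_size:
        raise ValueError("codes length must equal num_blocks * block_size")
    if len(inverse_rotations) != k:
        raise ValueError("need one inverse rotation per block")

    out: List[float] = []
    for i in range(k):
        block_codes = codes[i * block_size:(i + 1) * block_size]
        r_hat = [centroids[c] for c in block_codes]   # centroid lookup
        u_hat = inverse_rotations[i](r_hat)           # undo the block's SORF
        out.extend(norms[i] * u for u in u_hat)       # rescale by block norm
    return out
```

Everything the function needs (codes, norms, centroids, rotation signs) lives inside the array, which is the point of storing norms as an internal child.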
@@ -350,7 +357,7 @@ for tq_block in 0..k {
350357
351358### QJL correction (deferred — experimental)
352359
353- Based on community findings [7], QJL is deferred to after the MSE stages are
360+ Based on community findings [8], QJL is deferred to after the MSE stages are
354361validated. If pursued, four strategies should be compared:
355362
356363| Strategy | Theoretical | Speed | Storage |
@@ -377,18 +384,17 @@ Identical to the [current PR][current-impl] array structure.
377384### Stage 2 (block decomposition)
378385
379386```
380- TurboQuantArray (operates on unit-norm B-dim sub-vectors)
387+ TurboQuantArray (self-contained, handles blocks internally)
381388├── metadata: { dimension, b_mse, block_size, num_blocks, is_pdx }
382389│
383- │ # Per-row children
390+ │ # Per-row children (sliced/taken on row operations)
384391├── codes: FixedSizeListArray<u8> # list_size = k × B
392+ ├── norms: PrimitiveArray<F> # len = num_rows (k=1)
393+ │ or FixedSizeListArray<F> # list_size = k (k>1)
385394│
386- │ # Shared children
395+ │ # Shared children (cloned on row operations, not sliced)
387396├── centroids: PrimitiveArray<f32> # len = 2^b_mse
388397├── mse_rotation_signs: PrimitiveArray<u8> # len = k × 3 × B
389-
390- Externalized:
391- ├── block_norms: FixedSizeListArray<F> # list_size = k
392398```
393399
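As a rough aid for reading the layout above, the child sizes can be derived from the metadata alone. The sketch below mirrors the metadata fields shown in the diagram; the `TurboQuantMeta` dataclass and the `child_shapes` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurboQuantMeta:
    # Field names follow the metadata line in the diagram above.
    dimension: int
    b_mse: int
    block_size: int
    num_blocks: int
    is_pdx: bool = False


def child_shapes(meta: TurboQuantMeta) -> dict:
    """Derive per-row and shared child sizes from metadata."""
    k, block = meta.num_blocks, meta.block_size
    return {
        # Per-row children (sliced/taken on row operations)
        "codes_list_size": k * block,
        "norms_kind": "PrimitiveArray" if k == 1 else "FixedSizeListArray",
        "norms_per_row": k,
        # Shared children (cloned on row operations, not sliced)
        "centroids_len": 2 ** meta.b_mse,
        "rotation_signs_len": k * 3 * block,
    }
```

For example, d = 768 with B = 256 and 4-bit codes gives k = 3, a codes list size of 768, 16 centroids, and 2304 rotation sign entries.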
394400## Compression ratio
@@ -467,10 +473,10 @@ to merge MSE-only (no QJL). This is a complete encoding for all dimensions
467473(with padding for non-power-of-2).
468474
469475** Phase 2** — Block decomposition: Add block splitting for non-power-of-2
470- dimensions. Externalize norms. B = largest power-of-2 ≥ 64 dividing d. The
471- `TurboQuantScheme::compress()` method must be updated to: (a) choose B based on
472- d, (b) split input into blocks, (c) normalize per-block, (d) encode each block,
473- and (e) store per-block norms in the parent encoding layer.
476+ dimensions. B = largest power-of-2 ≥ 64 dividing d. Per-block norms are stored as
477+ internal children. The `TurboQuantScheme::compress()` method must be updated to:
478+ (a) choose B based on d, (b) split input into blocks, (c) normalize per-block,
479+ (d) encode each block, and (e) store per-block norms as an internal child array.
474480
475481** Phase 3** — PDX layout: Dimension-major code transposition within 64-vector
476482chunks. Distance computation kernels.
@@ -515,6 +521,44 @@ At b=8, codes are raw int8 indices. Direct int8 tensor core GEMM requires
515521approximately linear centroids (sacrificing Max-Lloyd optimality); viable for
516522ANN ranking but not reconstruction.
517523
524+ ## Migration and compatibility
525+
526+ TurboQuant has not shipped yet, so there are no existing files to migrate. We
527+ can design the metadata for forward compatibility from day one.
528+
529+ **Strategy: single array ID, versioned metadata.** All stages use the same array
530+ ID (`vortex.turboquant`). The metadata includes `block_size`, `num_blocks`, and
531+ `is_pdx` fields from Stage 1 onward. Stage 1 always writes `num_blocks=1,
532+ is_pdx=false`, but the fields exist so that Stage 2 and 3 decoders can read
533+ Stage 1 files without migration.
534+
535+ **Norms are always internal children.** The TurboQuant array is self-contained;
536+ it stores norms as a child slot, not in a parent encoding. This means:
537+
538+ - Stage 1: norms child is `PrimitiveArray<f32>`, one norm per vector.
539+ - Stage 2 with k=1 (power-of-2 dims): same as Stage 1, identical wire format.
540+ - Stage 2 with k>1: norms child is `FixedSizeListArray<F>`, k norms per vector.
541+
542+ The decoder distinguishes k=1 from k>1 by reading `num_blocks` from metadata.
543+ A k=1 decoder is backward-compatible with Stage 1 files. A k>1 decoder is a new
544+ code path that only applies to files written by Stage 2+.
545+
546+ **Stage 3 (PDX) is additive.** The `is_pdx` flag in metadata tells the decoder
547+ whether codes are row-major or dimension-major. Stage 1/2 files have
548+ `is_pdx=false`; Stage 3 files have `is_pdx=true`. The decoder un-transposes
549+ PDX files on read if needed. No migration required.
550+
551+ **Incremental shipping:**
552+
553+ | Stage | Ships to users? | Reads Stage 1 files? | Notes |
554+ | ------------ | ---------------- | ---------------------- | ----------------------------------- |
555+ | 1 (MSE-only) | Yes, immediately | N/A (first version) | New encoding, no backcompat concern |
556+ | 2 (blocks) | Yes | Yes (k=1 is identical) | k>1 files need Stage 2+ decoder |
557+ | 3 (PDX) | Yes | Yes (is_pdx=false) | PDX files need Stage 3 decoder |
558+
559+ Each stage is independently shippable. Users can upgrade incrementally. Files
560+ written by earlier stages are always readable by later decoders.
561+
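The compatibility rules in this section reduce to a small decision on two metadata fields. The sketch below encodes the table above; the function names are hypothetical, and the "minimum decoder stage" framing is an interpretation of the table rather than something the proposal defines.

```python
def required_decoder_stage(num_blocks: int, is_pdx: bool) -> int:
    """Minimum decoder stage able to read a file with this metadata."""
    if is_pdx:
        return 3  # dimension-major codes need the PDX-aware decoder
    if num_blocks > 1:
        return 2  # k > 1 needs the block-decomposition code path
    return 1      # k = 1, row-major: identical to the Stage 1 wire format


def can_read(decoder_stage: int, num_blocks: int, is_pdx: bool) -> bool:
    """True when a decoder of the given stage can read the file."""
    return decoder_stage >= required_decoder_stage(num_blocks, is_pdx)
```

Under this reading, later decoders always accept files from earlier stages, while an older decoder can reject a newer file by inspecting metadata alone, before touching any child array.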
518562## References
519563
520564[ 1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. "TurboQuant: Online