Commit 8a65474

further revisions to rfc 33 (#36)
follow ups to #33 #34 #35 based on further review

Signed-off-by: Will Manning <will@willmanning.io>
1 parent 607c876 commit 8a65474

1 file changed

+71
-27
lines changed

proposed/0033-block-turboquant.md

Lines changed: 71 additions & 27 deletions
Original file line number | Diff line number | Diff line change
@@ -2,7 +2,7 @@
22

33
**Authors:** Will Manning
44
**Status:** Proposal
5-
**Date:** 2026-04-03
5+
**Date:** 2026-04-02
66

77
## Summary
88

@@ -14,7 +14,7 @@ in three stages:
1414
2. **Block decomposition** (next): for non-power-of-2 dimensions, split into
1515
blocks of size B = the largest power-of-2 ≥ 64 that divides d. For
1616
power-of-2 dimensions, B = d (single block, same as current). Per-block
17-
norms externalized.
17+
norms stored as internal children.
1818
3. **PDX layout** (later): within each block, transpose codes into groups of
1919
64 vectors for SIMD scan performance.
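The block-size rule in stage 2 can be sketched as a small helper. This is a minimal illustration, not the RFC's implementation: the function names are hypothetical, and returning `None` when no power of two ≥ 64 divides d is an assumption, since the text above does not say what happens in that case.

```rust
/// Pick the block size B for a vector dimension `d`.
///
/// Assumption (not stated in the RFC text): returns `None` when `d` is not a
/// power of two and no power of two >= 64 divides it.
fn choose_block_size(d: usize) -> Option<usize> {
    if d == 0 {
        return None;
    }
    if d.is_power_of_two() {
        // Power-of-2 dimensions use a single block: B = d, k = 1.
        return Some(d);
    }
    // The largest power of two dividing d is 2^(number of trailing zero bits).
    let b = 1usize << d.trailing_zeros();
    if b >= 64 {
        Some(b)
    } else {
        None
    }
}

/// Number of blocks k = d / B, when a valid block size exists.
fn num_blocks(d: usize) -> Option<usize> {
    choose_block_size(d).map(|b| d / b)
}
```

For example, d = 768 = 3 × 256 gives B = 256 and k = 3, while d = 1024 stays a single block.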
2020

@@ -171,22 +171,27 @@ table above). Each full block gets an independent B-dim SORF rotation.
171171

172172
**Key properties:**
173173

174+
- **Self-contained.** The TurboQuant array handles block splitting, per-block
175+
normalization, rotation, and quantization internally. It accepts arbitrary
176+
(non-unit-norm) input vectors and stores per-block norms as internal children.
177+
No parent cooperation is needed — the array can decode without any parent
178+
context.
174179
- **One shared centroid set** for all blocks. All blocks use the same B-dim
175180
marginal distribution, so a single Max-Lloyd codebook serves every block.
176-
- **Unit-norm assumption.** The TurboQuant array operates only on pre-normalized
177-
sub-vectors. Per-block norms are externalized, following the pattern explored
178-
in PR #7251 (closed; concept will need reimplementation).
179181
- **Per-block SORF rotation signs.** Each block's SORF is independent (different
180182
seed). Signs are 3 × B bits per block.
181183
- **For power-of-2 dimensions**: B = d, k = 1. The encoding is functionally
182-
identical to Stage 1. The norm remains a single value per vector (not a
183-
FixedSizeList with list_size=1). Norm externalization is optional for k=1 and
184-
can be deferred to when it provides concrete benefit (e.g., GPU decode).
184+
identical to Stage 1 (single norm, single SORF rotation, no block splitting).
185185

186186
#### Norm architecture
187187

188-
The per-block norms are stored as a `FixedSizeListArray<F>` with
189-
`list_size = num_blocks`, where `F` matches or widens the input element type:
188+
Per-block norms are stored as an **internal child** of the TurboQuant array:
189+
190+
- For k = 1 (power-of-2 dims): `PrimitiveArray<F>` with len = num_rows
191+
(identical to Stage 1's single-norm layout).
192+
- For k > 1: `FixedSizeListArray<F>` with list_size = k, len = num_rows.
193+
194+
The norm dtype `F` matches or widens the input element type:
190195

191196
| Input dtype | Norm dtype | Rationale |
192197
| ----------- | ---------- | ---------------------------------------------- |
@@ -232,8 +237,9 @@ The actual MSE may depend on block dimension B: at larger B the coordinate
232237
distribution is more concentrated (variance ~1/B), giving the Max-Lloyd
233238
quantizer more to exploit. See Experimental plan.
234239

235-
**SORF approximation.** The 3-round SORF `HD₃·HD₂·HD₁` [5] provides
236-
3 × log₂(B) butterfly stages per round (18 at B=64, 24 at B=256, 27 at B=512).
240+
**SORF approximation.** The 3-round SORF `HD₃·HD₂·HD₁` [5] provides log₂(B)
241+
butterfly stages per round × 3 rounds = 3·log₂(B) total (18 at B=64, 24 at
242+
B=256, 27 at B=512).
237243
This is a rough heuristic for mixing quality — [5] does not analyze convergence
238244
rate as a function of rounds × dimension. Empirical validation is needed.
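To make the stage count concrete, here is a rough sketch of a 3-round rotation built from a normalized fast Walsh-Hadamard transform and ±1 sign diagonals. It assumes `HD₃·HD₂·HD₁` means applying D₁ then H, then D₂ then H, then D₃ then H; the function names and the representation of signs as `f32` values of ±1.0 are illustrative choices, not the RFC's storage format (which packs signs as bits).

```rust
/// In-place fast Walsh-Hadamard transform, scaled by 1/sqrt(n) so it is orthonormal.
/// Runs log2(n) butterfly stages; `x.len()` must be a power of two.
fn fwht_normalized(x: &mut [f32]) {
    let n = x.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for start in (0..n).step_by(2 * h) {
            for j in start..start + h {
                let (a, b) = (x[j], x[j + h]);
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for v in x.iter_mut() {
        *v *= scale;
    }
}

/// Forward rotation: for each round, multiply by the sign diagonal, then apply H.
fn sorf_forward(x: &mut [f32], signs: &[Vec<f32>; 3]) {
    for d in signs.iter() {
        for (v, s) in x.iter_mut().zip(d) {
            *v *= *s;
        }
        fwht_normalized(x);
    }
}

/// Inverse rotation: normalized H and each sign diagonal are their own
/// inverses, so undo the rounds in reverse order.
fn sorf_inverse(y: &mut [f32], signs: &[Vec<f32>; 3]) {
    for d in signs.iter().rev() {
        fwht_normalized(y);
        for (v, s) in y.iter_mut().zip(d) {
            *v *= *s;
        }
    }
}

/// Total butterfly stages across the three rounds: 3 * log2(B).
fn total_butterfly_stages(b: usize) -> u32 {
    assert!(b.is_power_of_two());
    3 * b.trailing_zeros()
}
```

Because every factor is orthonormal, the rotation preserves the vector norm, which is why the per-block norm can be stored separately from the rotated unit vector.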
239245

@@ -244,7 +250,7 @@ vectors). Each block must have an **independent** rotation matrix.
244250

245251
#### Quantized-domain operations
246252

247-
All quantized operations require per-block norms:
253+
All quantized operations read per-block norms from the internal child array:
248254

249255
- **L2 distance**: `‖a-b‖² = Σ_k ‖aₖ‖² + Σ_k ‖bₖ‖² - 2·Σ_k ‖aₖ‖·‖bₖ‖·
250256
unit_dotₖ`. Primary ANN metric; reuses per-block dot product and norms.
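The L2 identity above can be sketched directly. This hypothetical helper takes the per-block norms of both vectors and the per-block dot products of their unit vectors as plain slices; how `unit_dotₖ` is computed from codes and centroids is outside this sketch.

```rust
/// Squared L2 distance from per-block norms and per-block unit-vector dot
/// products: sum ||a_k||^2 + sum ||b_k||^2 - 2 * sum ||a_k|| * ||b_k|| * unit_dot_k.
fn l2_squared(norms_a: &[f32], norms_b: &[f32], unit_dots: &[f32]) -> f32 {
    assert_eq!(norms_a.len(), norms_b.len());
    assert_eq!(norms_a.len(), unit_dots.len());
    let mut sum_a = 0.0f32;
    let mut sum_b = 0.0f32;
    let mut cross = 0.0f32;
    for ((&na, &nb), &dot) in norms_a.iter().zip(norms_b).zip(unit_dots) {
        sum_a += na * na;
        sum_b += nb * nb;
        cross += na * nb * dot;
    }
    sum_a + sum_b - 2.0 * cross
}
```

With block norms (3, 4) on both sides, orthogonal blocks (unit dots of 0) give 25 + 25 = 50, and identical blocks (unit dots of 1) give 0.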
@@ -279,8 +285,9 @@ for i in 0..k:
279285
else:
280286
cᵢ[j] = 0
281287
282-
Store: codes (k × B per vector), block_norms (k per vector),
283-
centroids (2^b_mse, shared), SORF signs (k × 3 × B, shared)
288+
Store (all as internal children):
289+
codes (k × B per vector), norms (k per vector),
290+
centroids (2^b_mse, shared), SORF signs (k × 3 × B, shared)
284291
```
285292

286293
#### Decoding algorithm
@@ -289,7 +296,7 @@ Store: codes (k × B per vector), block_norms (k per vector),
289296
for i in 0..k:
290297
r̂ᵢ[j] = centroids[cᵢ[j]]
291298
ûᵢ = SORF⁻¹ᵢ(r̂ᵢ)
292-
x̂ᵢ = nᵢ × ûᵢ
299+
x̂ᵢ = nᵢ × ûᵢ (nᵢ read from internal norms child)
293300
x̃ = concat(x̂₀, ..., x̂ₖ₋₁)
294301
```
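A minimal Rust sketch of the decode loop above, under stated assumptions: codes are `u8` centroid indices laid out block after block, norms hold one `f32` per block, and the per-block inverse rotation is passed in as a closure (a hypothetical interface, standing in for `SORF⁻¹ᵢ`).

```rust
/// Decode one vector: look up centroids, invert each block's rotation, then
/// rescale by that block's norm and concatenate.
///
/// `inverse_rotation(i, block)` is a caller-supplied stand-in for the
/// inverse SORF of block `i`; it is an assumption of this sketch.
fn decode_vector<F>(
    codes: &[u8],
    norms: &[f32],
    centroids: &[f32],
    block_size: usize,
    inverse_rotation: F,
) -> Vec<f32>
where
    F: Fn(usize, &mut [f32]),
{
    assert!(block_size > 0);
    assert_eq!(codes.len(), norms.len() * block_size);
    let mut out = Vec::with_capacity(codes.len());
    for (i, (block_codes, &n)) in codes.chunks_exact(block_size).zip(norms).enumerate() {
        // Centroid lookup for every coordinate of the block.
        let mut block: Vec<f32> = block_codes
            .iter()
            .map(|&c| centroids[c as usize])
            .collect();
        // Undo the block's rotation, then restore the original scale.
        inverse_rotation(i, &mut block);
        out.extend(block.iter().map(|u| n * u));
    }
    out
}
```

Passing a no-op closure makes it easy to check the lookup and rescaling steps in isolation before wiring in a real rotation.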
295302

@@ -350,7 +357,7 @@ for tq_block in 0..k {
350357

351358
### QJL correction (deferred — experimental)
352359

353-
Based on community findings [7], QJL is deferred to after the MSE stages are
360+
Based on community findings [8], QJL is deferred to after the MSE stages are
354361
validated. If pursued, four strategies should be compared:
355362

356363
| Strategy | Theoretical | Speed | Storage |
@@ -377,18 +384,17 @@ Identical to the [current PR][current-impl] array structure.
377384
### Stage 2 (block decomposition)
378385

379386
```
380-
TurboQuantArray (operates on unit-norm B-dim sub-vectors)
387+
TurboQuantArray (self-contained, handles blocks internally)
381388
├── metadata: { dimension, b_mse, block_size, num_blocks, is_pdx }
382389
383-
│ # Per-row children
390+
│ # Per-row children (sliced/taken on row operations)
384391
├── codes: FixedSizeListArray<u8> # list_size = k × B
392+
├── norms: PrimitiveArray<F> # len = num_rows (k=1)
393+
│ or FixedSizeListArray<F> # list_size = k (k>1)
385394
386-
│ # Shared children
395+
│ # Shared children (cloned on row operations, not sliced)
387396
├── centroids: PrimitiveArray<f32> # len = 2^b_mse
388397
├── mse_rotation_signs: PrimitiveArray<u8> # len = k × 3 × B
389-
390-
Externalized:
391-
├── block_norms: FixedSizeListArray<F> # list_size = k
392398
```
393399

394400
## Compression ratio
@@ -467,10 +473,10 @@ to merge MSE-only (no QJL). This is a complete encoding for all dimensions
467473
(with padding for non-power-of-2).
468474

469475
**Phase 2** — Block decomposition: Add block splitting for non-power-of-2
470-
dimensions. Externalize norms. B = largest power-of-2 ≥ 64 dividing d. The
471-
`TurboQuantScheme::compress()` method must be updated to: (a) choose B based on
472-
d, (b) split input into blocks, (c) normalize per-block, (d) encode each block,
473-
and (e) store per-block norms in the parent encoding layer.
476+
dimensions. B = largest power-of-2 ≥ 64 dividing d. Per-block norms stored as
477+
internal children. The `TurboQuantScheme::compress()` method must be updated to:
478+
(a) choose B based on d, (b) split input into blocks, (c) normalize per-block,
479+
(d) encode each block, and (e) store per-block norms as an internal child array.
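Steps (b) and (c) can be sketched as follows. This is an illustration rather than the `TurboQuantScheme::compress()` code: the function name is hypothetical, and leaving a zero-norm block as all zeros is an assumption of this sketch.

```rust
/// Split `x` into blocks of `block_size`, returning (per-block norms,
/// concatenated unit-norm blocks).
///
/// Assumption: a block with zero norm is emitted as all zeros rather than
/// being divided by zero.
fn split_and_normalize(x: &[f32], block_size: usize) -> (Vec<f32>, Vec<f32>) {
    assert!(block_size > 0 && x.len() % block_size == 0);
    let mut norms = Vec::with_capacity(x.len() / block_size);
    let mut units = Vec::with_capacity(x.len());
    for block in x.chunks_exact(block_size) {
        let n = block.iter().map(|v| v * v).sum::<f32>().sqrt();
        norms.push(n);
        if n > 0.0 {
            units.extend(block.iter().map(|v| v / n));
        } else {
            units.extend(std::iter::repeat(0.0f32).take(block_size));
        }
    }
    (norms, units)
}
```

The returned norms are what step (e) would store as the internal child, one value per block per row.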
474480

475481
**Phase 3** — PDX layout: Dimension-major code transposition within 64-vector
476482
chunks. Distance computation kernels.
@@ -515,6 +521,44 @@ At b=8, codes are raw int8 indices. Direct int8 tensor core GEMM requires
515521
approximately linear centroids (sacrificing Max-Lloyd optimality); viable for
516522
ANN ranking but not reconstruction.
517523

524+
## Migration and compatibility
525+
526+
TurboQuant has not shipped yet, so there are no existing files to migrate. We
527+
can design the metadata for forward compatibility from day one.
528+
529+
**Strategy: single array ID, versioned metadata.** All stages use the same array
530+
ID (`vortex.turboquant`). The metadata includes `block_size`, `num_blocks`, and
531+
`is_pdx` fields from Stage 1 onward. Stage 1 always writes `num_blocks=1,
532+
is_pdx=false`, but the fields exist so that Stage 2 and 3 decoders can read
533+
Stage 1 files without migration.
534+
535+
**Norms are always internal children.** The TurboQuant array is self-contained —
536+
it stores norms as a child slot, not in a parent encoding. This means:
537+
538+
- Stage 1: norms child is `PrimitiveArray<f32>`, one norm per vector.
539+
- Stage 2 with k=1 (power-of-2 dims): same as Stage 1, identical wire format.
540+
- Stage 2 with k>1: norms child is `FixedSizeListArray<F>`, k norms per vector.
541+
542+
The decoder distinguishes k=1 from k>1 by reading `num_blocks` from metadata.
543+
A k=1 decoder is backward-compatible with Stage 1 files. A k>1 decoder is a new
544+
code path that only applies to files written by Stage 2+.
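The k=1 versus k>1 dispatch might look roughly like the sketch below. `NormsChild` is a hypothetical stand-in built on plain `Vec<f32>`, not the Vortex `PrimitiveArray` or `FixedSizeListArray` types, and it fixes the norm dtype to `f32` for brevity.

```rust
/// Hypothetical stand-in for the norms child; not the Vortex array types.
enum NormsChild {
    /// k = 1: one norm per row (same layout as Stage 1).
    Single(Vec<f32>),
    /// k > 1: k norms per row, stored flat with stride k.
    PerBlock { k: usize, values: Vec<f32> },
}

impl NormsChild {
    /// Choose the layout from the `num_blocks` metadata field.
    fn from_metadata(num_blocks: usize, values: Vec<f32>) -> Self {
        assert!(num_blocks >= 1 && values.len() % num_blocks == 0);
        if num_blocks == 1 {
            NormsChild::Single(values)
        } else {
            NormsChild::PerBlock { k: num_blocks, values }
        }
    }

    /// Number of rows (vectors) covered by this child.
    fn num_rows(&self) -> usize {
        match self {
            NormsChild::Single(v) => v.len(),
            NormsChild::PerBlock { k, values } => values.len() / k,
        }
    }

    /// The norms for row `r`: a slice of length 1 or k.
    fn row(&self, r: usize) -> &[f32] {
        match self {
            NormsChild::Single(v) => &v[r..r + 1],
            NormsChild::PerBlock { k, values } => &values[r * k..(r + 1) * k],
        }
    }
}
```

Returning a slice in both cases lets downstream decode code treat k=1 as the degenerate case of the per-block path.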
545+
546+
**Stage 3 (PDX) is additive.** The `is_pdx` flag in metadata tells the decoder
547+
whether codes are row-major or dimension-major. Stage 1/2 files have
548+
`is_pdx=false`; Stage 3 files have `is_pdx=true`. The decoder un-transposes
549+
PDX files on read if needed. No migration required.
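As an illustration of what un-transposing involves, the sketch below converts one chunk of codes between row-major and dimension-major order. The function names are hypothetical; the RFC uses chunks of 64 vectors, but the chunk length is a parameter here, and how a partial final chunk is handled is not covered.

```rust
/// Row-major (vector after vector) to dimension-major (all vectors' code for
/// dimension 0, then dimension 1, ...) within one chunk of `n_vecs` vectors.
fn to_pdx_chunk(row_major: &[u8], n_vecs: usize, dim: usize) -> Vec<u8> {
    assert_eq!(row_major.len(), n_vecs * dim);
    let mut out = vec![0u8; row_major.len()];
    for v in 0..n_vecs {
        for j in 0..dim {
            out[j * n_vecs + v] = row_major[v * dim + j];
        }
    }
    out
}

/// Inverse of `to_pdx_chunk`: dimension-major back to row-major.
fn from_pdx_chunk(pdx: &[u8], n_vecs: usize, dim: usize) -> Vec<u8> {
    assert_eq!(pdx.len(), n_vecs * dim);
    let mut out = vec![0u8; pdx.len()];
    for v in 0..n_vecs {
        for j in 0..dim {
            out[v * dim + j] = pdx[j * n_vecs + v];
        }
    }
    out
}
```

A decoder that sees `is_pdx=true` would apply the inverse per chunk before running the row-major decode path.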
550+
551+
**Incremental shipping:**
552+
553+
| Stage | Ships to users? | Reads Stage 1 files? | Notes |
554+
| ------------ | ---------------- | ---------------------- | ----------------------------------- |
555+
| 1 (MSE-only) | Yes, immediately | N/A (first version) | New encoding, no backcompat concern |
556+
| 2 (blocks) | Yes | Yes (k=1 is identical) | k>1 files need Stage 2+ decoder |
557+
| 3 (PDX) | Yes | Yes (is_pdx=false) | PDX files need Stage 3 decoder |
558+
559+
Each stage is independently shippable. Users can upgrade incrementally. Files
560+
written by earlier stages are always readable by later decoders.
561+
518562
## References
519563

520564
[1] Zandieh, A., Daliri, M., Hadian, M. and Mirrokni, V. "TurboQuant: Online
