@@ -358,7 +358,7 @@ child encodings.
 In the initial implementation, block decomposition is embedded inside
 `TurboQuantArray` — all blocks use TQ MSE-only encoding with independent SORF
 rotations, and TQ-specific children (centroids, rotation signs) are stored
-alongside the blocks. However, the *concept* of block decomposition is
+alongside the blocks. However, the _concept_ of block decomposition is
 encoding-agnostic: a future refactor could extract it into a general-purpose
 `BlockDecomposedFSLArray` that wraps k independently-encoded child arrays. This
 matters for straggler-block support (see below), where the straggler may use a
@@ -370,17 +370,17 @@ power-of-2 TQ array with an independent B-dim SORF rotation.
 
 **Changes vs. Stage 1 (with TQ blocks):**
 
-| Aspect | Stage 1 | Stage 2 |
-| --------------------- | ------------------------------------------- | ---------------------------------------------------------------------------- |
-| Block count | k = 1 (single power-of-2 block) | **k = d/B** (multiple blocks) |
-| SORF dimension | padded_dim (next power-of-2 ≥ dim) | **B** (e.g., 256 for d=768) |
-| Rotation signs | `FSL`, len = R, element dim = padded_dim | **`FSL`, len = k × R**, element dim = B |
-| Centroids | Computed for padded_dim distribution | **Computed for B-dim distribution** (different codebook!) |
-| Norms child | `PrimitiveArray<F>`, 1 per vector | **`PrimitiveArray<F>` (k=1) or `FixedSizeListArray<F>` (k>1)**, same dtype F |
-| Codes list_size | padded_dim | **k × B** (= d) |
-| Scheme compress() | Single SORF → quantize | **Choose B → split → per-block normalize/rotate/quantize** |
-| Quantized dot product | Single sum over padded_dim centroids | **Per-block weighted sum** (Σ_k norm_a_k · norm_b_k · unit_dot_k) |
-| L2 norm readthrough | O(1) — return stored norm | **O(k)** — compute √(Σ_k norm_k²) |
+| Aspect | Stage 1 | Stage 2 |
+| --------------------- | ---------------------------------------- | ---------------------------------------------------------------------------- |
+| Block count | k = 1 (single power-of-2 block) | **k = d/B** (multiple blocks) |
+| SORF dimension | padded_dim (next power-of-2 ≥ dim) | **B** (e.g., 256 for d=768) |
+| Rotation signs | `FSL`, len = R, element dim = padded_dim | **`FSL`, len = k × R**, element dim = B |
+| Centroids | Computed for padded_dim distribution | **Computed for B-dim distribution** (different codebook!) |
+| Norms child | `PrimitiveArray<F>`, 1 per vector | **`PrimitiveArray<F>` (k=1) or `FixedSizeListArray<F>` (k>1)**, same dtype F |
+| Codes list_size | padded_dim | **k × B** (= d) |
+| Scheme compress() | Single SORF → quantize | **Choose B → split → per-block normalize/rotate/quantize** |
+| Quantized dot product | Single sum over padded_dim centroids | **Per-block weighted sum** (Σ_k norm_a_k · norm_b_k · unit_dot_k) |
+| L2 norm readthrough | O(1) — return stored norm | **O(k)** — compute √(Σ_k norm_k²) |
 
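The per-block identities in the Stage 2 column can be checked without any quantization: splitting two vectors into blocks, keeping one norm per block, and recombining reproduces the full dot product and the full L2 norm. The sketch below is illustrative Python, not the `TurboQuantArray` implementation. The helper names `split_blocks`, `blockwise_dot`, and `blockwise_l2_norm` are hypothetical, and the SORF rotation and centroid lookup are omitted, so `unit_dot` here is exact rather than a quantized estimate.

```python
import math


def split_blocks(v, block_dim):
    """Split v into consecutive blocks; return per-block norms and unit vectors."""
    assert len(v) % block_dim == 0, "block_dim must divide the vector length"
    norms, units = [], []
    for start in range(0, len(v), block_dim):
        block = v[start:start + block_dim]
        norm = math.sqrt(sum(x * x for x in block))
        norms.append(norm)
        # An all-zero block has no direction; store zeros for it.
        units.append([x / norm if norm > 0 else 0.0 for x in block])
    return norms, units


def blockwise_dot(norms_a, units_a, norms_b, units_b):
    """Sum over blocks of norm_a_k * norm_b_k * unit_dot_k."""
    total = 0.0
    for na, ua, nb, ub in zip(norms_a, units_a, norms_b, units_b):
        unit_dot = sum(x * y for x, y in zip(ua, ub))
        total += na * nb * unit_dot
    return total


def blockwise_l2_norm(norms):
    """O(k) readthrough: sqrt of the sum of squared per-block norms."""
    return math.sqrt(sum(n * n for n in norms))


# Toy example: d = 6 split into k = 3 blocks of B = 2.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
b = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
norms_a, units_a = split_blocks(a, 2)
norms_b, units_b = split_blocks(b, 2)

full_dot = sum(x * y for x, y in zip(a, b))
full_norm_a = math.sqrt(sum(x * x for x in a))

print(blockwise_dot(norms_a, units_a, norms_b, units_b), full_dot)
print(blockwise_l2_norm(norms_a), full_norm_a)
```

With quantized codes, `unit_dot` would be an estimate from the centroids, so the recombined dot product becomes approximate, while the norm readthrough still depends only on the stored per-block norms.
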
 **Unchanged from Stage 1:** SORF construction (R-round HD, default R=3),
 Max-Lloyd algorithm, f32 internal quantization, slice/take semantics (per-row
@@ -731,12 +731,12 @@ validated.
 
 If pursued, four strategies should be compared:
 
-| Strategy | Theoretical | Speed | Storage |
-| -------------------- | --------------------- | ---------------- | --------------- |
-| Per-block Gaussian | Correct (Lemma 4 [1]) | O(B²)/block | k×B²×4 bytes |
-| Per-block SORF | Approximate | O(B log B)/block | k×R×B bits |
-| Full-dim SORF | Approximate | O(d log d) total | R×d bits |
-| MSE-only (no QJL) | N/A | 0 | None |
+| Strategy | Theoretical | Speed | Storage |
+| ------------------ | --------------------- | ---------------- | ------------ |
+| Per-block Gaussian | Correct (Lemma 4 [1]) | O(B²)/block | k×B²×4 bytes |
+| Per-block SORF | Approximate | O(B log B)/block | k×R×B bits |
+| Full-dim SORF | Approximate | O(d log d) total | R×d bits |
+| MSE-only (no QJL) | N/A | 0 | None |
 
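To give a feel for the Storage column, the sketch below evaluates the three formulas for one example configuration. The values d=768, B=256, k=3, and R=3 are borrowed from examples elsewhere in this document; the function names are hypothetical, and the Gaussian case assumes 4-byte (f32) matrix entries, as the `×4 bytes` factor in the table implies.

```python
def gaussian_storage_bytes(k, block_dim, bytes_per_entry=4):
    """Per-block Gaussian: k dense B x B matrices."""
    return k * block_dim * block_dim * bytes_per_entry


def per_block_sorf_storage_bits(k, rounds, block_dim):
    """Per-block SORF: R sign vectors of length B for each of k blocks."""
    return k * rounds * block_dim


def full_dim_sorf_storage_bits(rounds, d):
    """Full-dim SORF: R sign vectors of length d (formula as given in the table)."""
    return rounds * d


k, block_dim, rounds = 3, 256, 3
d = k * block_dim  # 768

gaussian_bytes = gaussian_storage_bytes(k, block_dim)
per_block_bits = per_block_sorf_storage_bits(k, rounds, block_dim)
full_dim_bits = full_dim_sorf_storage_bits(rounds, d)

print(f"Per-block Gaussian: {gaussian_bytes} bytes")
print(f"Per-block SORF:     {per_block_bits} bits ({per_block_bits // 8} bytes)")
print(f"Full-dim SORF:      {full_dim_bits} bits ({full_dim_bits // 8} bytes)")
```

Whenever k×B = d, the two SORF formulas give the same storage, so the choice between them rests on speed and approximation quality rather than size.
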
 The paper's QJL uses Gaussian S (not SORF); Lemma 4 [1] is proved specifically
 for Gaussian. SORF for QJL is an additional approximation (the
@@ -820,19 +820,19 @@ replace 32 with 64 in the norms row — ratios decrease accordingly):
 
 **At b_mse=8 (default, near-lossless):**
 
-| d | B | k | Per-vec bits | Ratio | Notes |
-| ------------- | ---- | --- | ----------------------- | ----- | ------------------------ |
-| 768 | 256 | 3 | 3×256×8 + 3×32 = 6240 | 3.9× | Block decomp; no padding |
-| 1024 | 1024 | 1 | 1024×8 + 32 = 8224 | 4.0× | Single block (= current) |
-| 768 (padded)| 1024 | 1 | 1024×8 + 32 = 8224 | 3.0× | Padded; 33% overhead |
+| d | B | k | Per-vec bits | Ratio | Notes |
+| ------------ | ---- | --- | --------------------- | ----- | ------------------------ |
+| 768 | 256 | 3 | 3×256×8 + 3×32 = 6240 | 3.9× | Block decomp; no padding |
+| 1024 | 1024 | 1 | 1024×8 + 32 = 8224 | 4.0× | Single block (= current) |
+| 768 (padded) | 1024 | 1 | 1024×8 + 32 = 8224 | 3.0× | Padded; 33% overhead |
 
 **At b_mse=5 (32 centroids):**
 
-| d | B | k | Per-vec bits | Ratio | Notes |
-| ------------- | ---- | --- | ----------------------- | ----- | ------------------------ |
-| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; no padding |
-| 1024 | 1024 | 1 | 1024×5 + 32 = 5152 | 6.4× | Single block (= current) |
-| 768 (padded)| 1024 | 1 | 1024×5 + 32 = 5152 | 4.8× | Padded; 33% overhead |
+| d | B | k | Per-vec bits | Ratio | Notes |
+| ------------ | ---- | --- | --------------------- | ----- | ------------------------ |
+| 768 | 256 | 3 | 3×256×5 + 3×32 = 3936 | 6.2× | Block decomp; no padding |
+| 1024 | 1024 | 1 | 1024×5 + 32 = 5152 | 6.4× | Single block (= current) |
+| 768 (padded) | 1024 | 1 | 1024×5 + 32 = 5152 | 4.8× | Padded; 33% overhead |
 
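The table entries follow from a single formula: k×B×b_mse code bits plus k norms per vector, compared against the uncompressed vector. The sketch below reproduces them, assuming 32-bit norms and an f32 (32 bits per element) uncompressed baseline, which is the baseline that matches the ratios shown. The function names are hypothetical.

```python
def per_vector_bits(block_dim, num_blocks, b_mse, norm_bits=32):
    """Code bits for all blocks plus one stored norm per block."""
    return num_blocks * block_dim * b_mse + num_blocks * norm_bits


def compression_ratio(d, block_dim, num_blocks, b_mse,
                      input_bits=32, norm_bits=32):
    """Uncompressed bits (d elements) divided by compressed bits per vector."""
    compressed = per_vector_bits(block_dim, num_blocks, b_mse, norm_bits)
    return d * input_bits / compressed


# (label, d, B, k) rows from the tables above.
configs = [
    ("768 block decomp", 768, 256, 3),
    ("1024 single block", 1024, 1024, 1),
    ("768 padded", 768, 1024, 1),
]

for b_mse in (8, 5):
    for label, d, block_dim, k in configs:
        bits = per_vector_bits(block_dim, k, b_mse)
        ratio = compression_ratio(d, block_dim, k, b_mse)
        print(f"b_mse={b_mse} {label}: {bits} bits, {ratio:.1f}x")
```

In the padded row the numerator stays at d=768 elements while the codes cover 1024 coordinates, which is where the 33% overhead comes from.
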
 Block decomposition improves the compression ratio at both bit widths. At b=8
 for d=768: from ~3.0× (padded) to ~3.9× (block decomp). At b=5 for d=768: from
@@ -986,7 +986,7 @@ For common model dimensions, the most promising configurations are:
 | ---------------------- | --------------------------- | -------------------------------------------------------------------------- |
 | 512, 1024, 2048, 4096 | Single-block MSE-only + PDX | B=d, no decomposition needed. Same as current TQ but with PDX scan layout. |
 | 768, 1536, 3072 | 3-block MSE-only + PDX | B=256 or 512. No padding waste. 3 blocks, shared centroids. |
-| No qualifying B (rare) | Padded single-block | Internal zero-padding to next power-of-2, single SORF. |
+| No qualifying B (rare) | Padded single-block | Internal zero-padding to next power-of-2, single SORF. |
 
 In all cases, MSE-only is the recommended starting point. QJL should only be
 added if experiments demonstrate clear recall@k improvements for the target
@@ -1121,11 +1121,11 @@ TurboQuant.
 
 **Incremental shipping:**
 
-| Stage | Ships to users? | Reads prior stage files? | Notes |
-| --------- | ---------------- | --------------------------- | ---------------------------------- |
-| 1 (MSE) | Yes | N/A (first stable version) | Single block, variable SORF rounds |
-| 2 (blocks) | Yes | Yes (k=1 is identical) | k>1 files need Stage 2+ decoder |
-| 3 (PDX) | Yes | Yes (FSL codes still work) | PDX codes need PDXArray registered |
+| Stage | Ships to users? | Reads prior stage files? | Notes |
+| ---------- | --------------- | -------------------------- | ---------------------------------- |
+| 1 (MSE) | Yes | N/A (first stable version) | Single block, variable SORF rounds |
+| 2 (blocks) | Yes | Yes (k=1 is identical) | k>1 files need Stage 2+ decoder |
+| 3 (PDX) | Yes | Yes (FSL codes still work) | PDX codes need PDXArray registered |
 
 Each stage is independently shippable. Users can upgrade incrementally. Files
 written by earlier stages are always readable by later decoders.