Commit 334d31e
perf[turboquant]: restore fast SIMD-friendly decode by expanding stored signs
The bit-packed apply_inverse_srht_from_bits path introduced a ~20%
decode throughput regression vs the original f32 sign multiply path,
because per-element bit extraction + conditional negate is hard for
the compiler to autovectorize.
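
To illustrate the difference, here is a minimal sketch of the two inner-loop shapes. It is not the code from this commit: the function names, the LSB-first bit order, and the "bit set means negate" convention are all assumptions made for the example.

```rust
/// Bit-packed path: extract one bit per element, then negate conditionally.
/// Assumption: bit i (LSB-first within each byte) set means "negate element i".
fn apply_signs_from_bits(data: &mut [f32], bits: &[u8]) {
    for (i, x) in data.iter_mut().enumerate() {
        // Shift and mask to pull out the sign bit for this element.
        let negate = (bits[i / 8] >> (i % 8)) & 1 == 1;
        if negate {
            *x = -*x;
        }
    }
}

/// Expanded path: a plain elementwise multiply by +1.0 or -1.0.
fn apply_signs_f32(data: &mut [f32], signs: &[f32]) {
    for (x, s) in data.iter_mut().zip(signs) {
        *x *= *s;
    }
}
```

Both functions compute the same result. The second has no per-element shift, mask, or branch, which is the loop shape compilers usually vectorize most readily.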
Fix: expand the stored BoolArray signs into f32 ±1.0 vectors once at
decode start via RotationMatrix::from_bool_array(), then use the
original inverse_rotate() with its SIMD-friendly apply_signs() inner
loop. The expansion costs 3 × padded_dim × 4 bytes of temporary
memory (12KB for dim=1024), amortized over all rows.
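
The one-time expansion and its memory cost can be sketched as follows. This is an illustration only, not the body of RotationMatrix::from_bool_array(): the helper names and bit convention are hypothetical, and reading the factor of 3 as three sign vectors is an assumption.

```rust
/// Expand packed sign bits into f32 +1.0 / -1.0 values, done once per decode.
/// Assumption: bit i (LSB-first within each byte) set means a negative sign.
fn expand_signs(bits: &[u8], len: usize) -> Vec<f32> {
    (0..len)
        .map(|i| if (bits[i / 8] >> (i % 8)) & 1 == 1 { -1.0 } else { 1.0 })
        .collect()
}

/// Temporary bytes needed for `vectors` expanded f32 sign vectors.
fn expanded_bytes(vectors: usize, padded_dim: usize) -> usize {
    vectors * padded_dim * std::mem::size_of::<f32>()
}
```

With `vectors = 3` and `padded_dim = 1024`, `expanded_bytes` gives 3 × 1024 × 4 = 12288 bytes, matching the 12KB figure above.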
We still store signs as 1-bit BoolArray on disk (32x space savings),
but recover full autovectorized throughput at decode time.
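
The storage-side ratio can be checked with a small packing sketch. The `pack_signs` helper is hypothetical and does not use the BoolArray API; it only shows why one bit per sign is 32x smaller than one f32 per sign.

```rust
/// Pack f32 signs into one bit each (LSB-first); a set bit marks a negative sign.
fn pack_signs(signs: &[f32]) -> Vec<u8> {
    let mut bits = vec![0u8; (signs.len() + 7) / 8];
    for (i, s) in signs.iter().enumerate() {
        if s.is_sign_negative() {
            bits[i / 8] |= 1u8 << (i % 8);
        }
    }
    bits
}
```

For 1024 signs this yields 128 bytes, against 4096 bytes for the same signs held as f32.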
The apply_inverse_srht_from_bits function is retained (with tests) for
potential future use with explicit SIMD bit-extraction intrinsics.
Signed-off-by: Will Manning <will@spiraldb.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Will Manning <will@willmanning.io>
1 parent 2d84cbf commit 334d31e
1 file changed
Lines changed: 8 additions & 14 deletions