Commit 5c83b84
perf: add SIMD-accelerated u8 L2 and cosine distance kernels (#6517)
## Summary
- Add hand-written AVX2 and AVX-512 VNNI backends for u8 squared L2
distance (`Σ(a-b)²`) in new `l2_u8.rs`
- Add fused single-pass u8 cosine distance kernel in new `cosine_u8.rs`:
computes `dot(a,b)`, `‖a‖²`, `‖b‖²` in the same loop, at least halving
memory traffic vs the previous 2-3 pass approach
- Wire both into the `L2 for u8` and `Cosine for u8` trait impls
- Add benchmarks comparing scalar vs SIMD for both kernels
### Algorithmic approach (adapted from [NumKong](https://github.com/ashvardanian/NumKong))
**L2 (AVX2):** Saturating subtraction for `|a-b|`, zero-extend u8→i16,
`VPMADDWD(diff, diff)` to square and accumulate into i32. 32
elements/iter.
**L2 (AVX-512 VNNI):** Same abs-diff approach with `VPDPWSSD` for fused
square-accumulate. 64 elements/iter.
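A minimal sketch of the AVX2 L2 steps described above, not the code in `l2_u8.rs`. The function names (`l2_u8_scalar`, `l2_u8_avx2`, `l2_u8`) are hypothetical, and it assumes vectors short enough that the i32 lane accumulators cannot overflow (true for the 1024-dim case benchmarked here).

```rust
/// Portable reference: sum of squared differences, accumulated in u32.
pub fn l2_u8_scalar(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x.abs_diff(y) as u32;
            d * d
        })
        .sum()
}

/// Hypothetical AVX2 backend: 32 elements per iteration.
/// Assumes the input is short enough that the i32 lanes do not overflow.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn l2_u8_avx2(a: &[u8], b: &[u8]) -> u32 {
    use std::arch::x86_64::*;
    let n = a.len().min(b.len());
    let chunks = n / 32;
    unsafe {
        let mut acc = _mm256_setzero_si256();
        for i in 0..chunks {
            let va = _mm256_loadu_si256(a.as_ptr().add(i * 32) as *const __m256i);
            let vb = _mm256_loadu_si256(b.as_ptr().add(i * 32) as *const __m256i);
            // |a-b| = sat(a-b) | sat(b-a): one of the two saturating results is zero.
            let diff = _mm256_or_si256(_mm256_subs_epu8(va, vb), _mm256_subs_epu8(vb, va));
            // Zero-extend each 16-byte half from u8 to i16.
            let lo = _mm256_cvtepu8_epi16(_mm256_castsi256_si128(diff));
            let hi = _mm256_cvtepu8_epi16(_mm256_extracti128_si256::<1>(diff));
            // VPMADDWD(diff, diff): square and pairwise-add into i32 lanes.
            acc = _mm256_add_epi32(acc, _mm256_madd_epi16(lo, lo));
            acc = _mm256_add_epi32(acc, _mm256_madd_epi16(hi, hi));
        }
        // Horizontal sum of the eight i32 lanes.
        let mut lanes = [0i32; 8];
        _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, acc);
        let head: u32 = lanes.iter().map(|&v| v as u32).sum();
        // Scalar tail for the remaining n % 32 elements.
        head + l2_u8_scalar(&a[chunks * 32..n], &b[chunks * 32..n])
    }
}

/// Picks the AVX2 path when the CPU supports it, otherwise the scalar path.
pub fn l2_u8(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len());
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was verified at runtime just above.
            return unsafe { l2_u8_avx2(a, b) };
        }
    }
    l2_u8_scalar(a, b)
}
```

The saturating-subtract trick avoids widening before the subtraction, so the absolute difference is computed on all 32 bytes at once and only the squaring step runs at 16-bit width.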
**Cosine (AVX2):** Zero-extend both vectors to i16, triple `VPMADDWD`
per half (a·b, a·a, b·b). 32 elements/iter, single pass.
**Cosine (AVX-512 VNNI):** Same three-accumulator approach with
`VPDPWSSD`. 64 elements/iter.
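The single-pass structure is easiest to see in scalar form. The sketch below is a hypothetical model of the fused kernel (the name `cosine_u8_fused` and the zero-norm convention are assumptions, not taken from `cosine_u8.rs`); the SIMD backends would keep the same three accumulators and replace the loop body with widened multiply-adds.

```rust
/// Scalar model of a fused cosine distance: one pass over both inputs,
/// three running sums (a·b, a·a, b·b).
pub fn cosine_u8_fused(a: &[u8], b: &[u8]) -> f32 {
    assert_eq!(a.len(), b.len());
    let (mut ab, mut aa, mut bb) = (0u64, 0u64, 0u64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as u64, y as u64);
        ab += x * y; // dot(a, b)
        aa += x * x; // ‖a‖²
        bb += y * y; // ‖b‖²
    }
    // Assumption: treat a zero-norm input as maximally distant.
    if aa == 0 || bb == 0 {
        return 1.0;
    }
    let similarity = ab as f64 / ((aa as f64).sqrt() * (bb as f64).sqrt());
    1.0 - similarity as f32
}
```

Each element of `a` and `b` is loaded once and feeds all three sums, which is where the memory-traffic saving over separate dot-product and norm passes comes from.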
Both kernels use `OnceLock`-based runtime CPU dispatch, falling back to
portable scalar on non-x86 platforms.
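One way `OnceLock`-based dispatch can look is sketched below. All names (`L2Fn`, `l2_scalar_ref`, `l2_avx2_stub`, `select_kernel`, `l2_dispatch`) are hypothetical, and the AVX2 entry is a stand-in that forwards to the scalar code so the example stays portable.

```rust
use std::sync::OnceLock;

/// Signature shared by every backend.
type L2Fn = fn(&[u8], &[u8]) -> u32;

/// Portable scalar backend.
fn l2_scalar_ref(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x.abs_diff(y) as u32;
            d * d
        })
        .sum()
}

/// Placeholder for an AVX2 backend; a real one would use intrinsics.
#[cfg(target_arch = "x86_64")]
fn l2_avx2_stub(a: &[u8], b: &[u8]) -> u32 {
    l2_scalar_ref(a, b)
}

/// Runs the CPU feature check and returns the best available backend.
fn select_kernel() -> L2Fn {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            return l2_avx2_stub;
        }
    }
    l2_scalar_ref
}

/// Public entry point: the feature check runs once, on first call.
pub fn l2_dispatch(a: &[u8], b: &[u8]) -> u32 {
    static KERNEL: OnceLock<L2Fn> = OnceLock::new();
    (KERNEL.get_or_init(select_kernel))(a, b)
}
```

Caching a function pointer means later calls pay only an atomic load and an indirect call rather than repeating feature detection.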
### Benchmarks
*1M × 1024-dim u8 vectors.*
**x86_64 — AMD Ryzen 5 4500 6-Core (AVX2, no AVX-512)**
| Kernel | Scalar | SIMD | Speedup |
|--------|--------|------|---------|
| L2(u8) | 73.5 ms | 58.2 ms | **1.26x** |
| Cosine(u8) | 122.2 ms | 82.1 ms | **1.49x** |
L2 auto-vectorization baseline was 91.5 ms, so SIMD is 1.57x faster than
that path.
**aarch64 — Apple Silicon M3 Max (no AVX2, scalar fallback)**
| Kernel | Scalar | SIMD (dispatch) |
|--------|--------|-----------------|
| L2(u8) | 26.8 ms | 27.3 ms |
| Cosine(u8) | 90.1 ms | 90.4 ms |
On aarch64 the SIMD path falls through to scalar (no AVX2), so the two
columns match to within measurement noise (under 2%), which confirms no
regression on non-x86 platforms. AVX-512 VNNI systems (Ice Lake+, Zen 4+)
should see larger gains.
## Test plan
- [x] All 11 new tests pass: SIMD backends verified against scalar
reference across 18 vector sizes (0–4097), boundary values (0/255),
alternating patterns, random seeds
- [x] All 63 existing lance-linalg tests pass (no regressions)
- [x] Clippy clean, fmt clean
- [x] Benchmarked on x86_64 AVX2 (AMD Ryzen 5 4500) — L2 1.26x, Cosine
1.49x faster
- [ ] Verify on AVX-512 VNNI system for additional speedup data
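The reference-comparison checks in the test plan could be structured roughly as follows. This is an illustrative harness, not the tests in this commit: `check_kernel`, `l2_reference`, and `fill_pseudo_random` are hypothetical names, and the size list a caller passes in is its own choice rather than the 18 sizes used here.

```rust
/// Scalar reference used as ground truth.
fn l2_reference(a: &[u8], b: &[u8]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = x.abs_diff(y) as u32;
            d * d
        })
        .sum()
}

/// Small LCG so the sketch needs no external crates.
fn fill_pseudo_random(seed: u64, len: usize) -> Vec<u8> {
    let mut state = seed;
    (0..len)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (state >> 56) as u8
        })
        .collect()
}

/// Compares `kernel` with the reference on boundary, alternating, and
/// pseudo-random inputs for every length in `sizes`.
pub fn check_kernel(kernel: fn(&[u8], &[u8]) -> u32, sizes: &[usize]) -> Result<(), String> {
    for &n in sizes {
        let cases: [(Vec<u8>, Vec<u8>); 3] = [
            // Boundary values: all 0 against all 255.
            (vec![0u8; n], vec![255u8; n]),
            // Alternating 0/255 against the inverse pattern.
            (
                (0..n).map(|i| if i % 2 == 0 { 0 } else { 255 }).collect(),
                (0..n).map(|i| if i % 2 == 0 { 255 } else { 0 }).collect(),
            ),
            // Pseudo-random data from two different seeds.
            (fill_pseudo_random(1, n), fill_pseudo_random(2, n)),
        ];
        for (case, (a, b)) in cases.iter().enumerate() {
            let (got, want) = (kernel(a, b), l2_reference(a, b));
            if got != want {
                return Err(format!("len {n}, case {case}: got {got}, want {want}"));
            }
        }
    }
    Ok(())
}
```

Lengths just below, at, and just above the SIMD width (for example 31, 32, 33) are the ones most likely to expose tail-handling bugs.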
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 65ac541 commit 5c83b84
7 files changed
Lines changed: 699 additions & 2 deletions
File tree
- rust/lance-linalg
- benches
- src
- distance