Commit 521d23f
feat(simd_avx2): add U8x32 — native AVX2 byte vector (round-3 W1)
The keystone for the cosmetic-SIMD sweep that agent #11 audited on PR #142.
That audit found 8 confirmed cosmetic SIMD wrappers in hpc/byte_scan.rs,
hpc/palette_codec.rs, and hpc/aabb.rs — `#[target_feature(enable = "avx2")]`
decorating scalar bodies that gave zero speedup over plain scalar. The
root cause: there was no `U8x32` type in the polyfill, so consumers
couldn't write SIMD byte code at AVX2's natural width (32 bytes = one
__m256i ymm register).
This PR adds U8x32 with real __m256i storage and about 26 polyfill methods
mirroring `simd_avx512::U8x64`:
  Constructors:  splat, from_slice, from_array, to_array, copy_to_slice
  Reductions:    reduce_sum (wrap-add), reduce_min, reduce_max, sum_bytes_u64
  Min/max:       simd_min, simd_max (_mm256_min_epu8, _mm256_max_epu8)
  Compare→mask:  cmpeq_mask → u32, cmpgt_mask → u32 (unsigned via xor 0x80),
                 movemask → u32 (matches _mm256_movemask_epi8 width)
  Saturating:    saturating_add, saturating_sub (_mm256_adds/subs_epu8)
  Avg:           pairwise_avg (_mm256_avg_epu8, round-up)
  Shifts:        shr_epi16, shl_epi16 (16-bit lane shifts via _mm256_srl/sll_epi16)
  Shuffles:      shuffle_bytes (within-128-bit-lane, _mm256_shuffle_epi8)
                 permute_bytes (cross-lane, scalar fallback: AVX2 has no
                 native cross-lane byte permute; matches U8x64's behavior
                 on AVX-512F-without-VBMI hosts)
                 unpack_lo_epi8, unpack_hi_epi8 (_mm256_unpacklo/hi_epi8)
  Conditional:   mask_blend (_mm256_blendv_epi8, MSB-driven, NOT bitmask)
  LUT:           nibble_popcount_lut
Plus operator and trait impls: BitAnd, BitOr, BitXor, Add (wrapping),
Sub (wrapping), Debug, Default. About 26 methods in all.
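Two entries in the list above are easy to misread: the unsigned compare built
from a signed one, and the MSB-driven blend. The sketch below is a portable
scalar model of those semantics only; it is not the code in this commit. The
function names, the array-based signatures, and the blend argument order
(select `b` where the selector's top bit is set) are assumptions made for
illustration.

```rust
/// Scalar model of an unsigned per-byte "a > b" compare that returns a u32
/// bitmask. AVX2 only offers a signed byte greater-than, so both operands are
/// XORed with 0x80 first, which maps unsigned order onto signed order.
/// (Hypothetical helper; the real method operates on U8x32 values.)
fn cmpgt_mask_model(a: &[u8; 32], b: &[u8; 32]) -> u32 {
    let mut mask = 0u32;
    for i in 0..32 {
        let sa = (a[i] ^ 0x80) as i8;
        let sb = (b[i] ^ 0x80) as i8;
        if sa > sb {
            mask |= 1u32 << i; // bit i corresponds to byte lane i
        }
    }
    mask
}

/// Scalar model of an MSB-driven blend: lane i takes b[i] when the top bit of
/// sel[i] is set and a[i] otherwise. The selector is a byte vector, not a
/// bitmask. (Argument order is an assumption.)
fn mask_blend_model(a: &[u8; 32], b: &[u8; 32], sel: &[u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = if sel[i] & 0x80 != 0 { b[i] } else { a[i] };
    }
    out
}
```

With `a[0] = 200` and `b[0] = 100`, a plain signed compare would see 200 as
negative and report "not greater"; the XOR step is what makes lane 0 set its
mask bit. In the blend model, a selector byte of 0x7F leaves the lane at `a`,
because only the top bit is consulted.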
Re-exported from `crate::simd::U8x32` for both AVX-512 and AVX2 build
tiers — U8x32 is the natural AVX2 byte width and is needed regardless
of whether AVX-512's U8x64 is the consumer's preferred width.
Soundness model matches the rest of simd_avx2.rs: `_mm256_*` intrinsics
are wrapped in `unsafe { }` blocks inside safe `pub fn`, trusting that
AVX2 is the compile target (x86-64-v3 is project baseline). The codebase
uses this pattern already in the AVX2 popcount at simd_avx2.rs:357.
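The pattern described above (a safe `pub fn` with the intrinsics inside an
`unsafe` block) can be sketched roughly as follows. This is an illustration,
not the code at simd_avx2.rs:357: the function name `max_u8x32` and the
array-based signature are hypothetical, and the `cfg` gate with a scalar
fallback is an addition so the sketch also builds where AVX2 is not enabled at
compile time. The commit message states only that AVX2 is trusted to be the
compile target.

```rust
/// Lane-wise unsigned max of two 32-byte arrays (hypothetical helper).
/// Safe to call: the cfg below only selects this body when AVX2 is enabled
/// for the whole build.
#[cfg(all(target_arch = "x86_64", target_feature = "avx2"))]
pub fn max_u8x32(a: &[u8; 32], b: &[u8; 32]) -> [u8; 32] {
    use core::arch::x86_64::{
        __m256i, _mm256_loadu_si256, _mm256_max_epu8, _mm256_storeu_si256,
    };
    let mut out = [0u8; 32];
    // SAFETY: AVX2 is statically enabled (see cfg), and the unaligned
    // load/store each touch exactly the 32 bytes of a valid array.
    unsafe {
        let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
        let vr = _mm256_max_epu8(va, vb);
        _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, vr);
    }
    out
}

/// Scalar fallback with the same result, used when AVX2 is not a
/// compile-time feature (an assumption of this sketch, not of the commit).
#[cfg(not(all(target_arch = "x86_64", target_feature = "avx2")))]
pub fn max_u8x32(a: &[u8; 32], b: &[u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = a[i].max(b[i]);
    }
    out
}
```

The trade-off in this model is that soundness rests on the build
configuration rather than on a runtime check: a binary built for x86-64-v3
and run on a CPU without AVX2 would fault on the first intrinsic.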
Test coverage:
- 18 new tests in `mod u8x32_tests` covering: roundtrip, sum/min/max
reductions, unsigned cmp masks (incl. high-byte > 127 to verify the
XOR-0x80 unsigned trick), saturating add/sub clamps, pairwise_avg
round-up, shr_epi16 nibble extraction, permute_bytes reverse,
mask_blend per-MSB selection, nibble_popcount_lut via shuffle_bytes.
- All 18 pass. Total test count 1786 → 1804 with no regressions.
clippy --features rayon -- -D warnings: clean.
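For readers unfamiliar with the last test item, the following is a scalar
model of how a nibble popcount table combines with an in-lane byte shuffle.
It mirrors the documented semantics of `_mm256_shuffle_epi8` (indices are
taken modulo 16 within each 128-bit lane; an index with its top bit set
yields 0). The function names are hypothetical, and the assumption that the
LUT holds the same 16-entry table in both lanes is mine, not something the
commit message states.

```rust
/// Scalar model of a within-128-bit-lane byte shuffle: output byte i reads
/// from the same 16-byte half of `table` that i lives in.
fn shuffle_bytes_model(table: &[u8; 32], idx: &[u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for i in 0..32 {
        let lane_base = i & 16; // 0 for bytes 0..16, 16 for bytes 16..32
        out[i] = if idx[i] & 0x80 != 0 {
            0 // top bit set zeroes the lane
        } else {
            table[lane_base + (idx[i] & 0x0F) as usize]
        };
    }
    out
}

/// Per-byte popcount via two table lookups: one for the low nibble and one
/// for the high nibble, summed lane-wise.
fn popcount_bytes_model(v: &[u8; 32]) -> [u8; 32] {
    const NIBBLE: [u8; 16] = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];
    // Assumed LUT layout: the 16-entry table repeated in both lanes.
    let mut lut = [0u8; 32];
    lut[..16].copy_from_slice(&NIBBLE);
    lut[16..].copy_from_slice(&NIBBLE);

    let mut lo = [0u8; 32];
    let mut hi = [0u8; 32];
    for i in 0..32 {
        lo[i] = v[i] & 0x0F;
        hi[i] = v[i] >> 4;
    }
    let pl = shuffle_bytes_model(&lut, &lo);
    let ph = shuffle_bytes_model(&lut, &hi);

    let mut out = [0u8; 32];
    for i in 0..32 {
        out[i] = pl[i] + ph[i]; // at most 4 + 4 = 8, so no overflow
    }
    out
}
```

Because the shuffle cannot cross the 128-bit boundary, the table has to be
present in both halves; a table stored only in the low lane would give wrong
counts for bytes 16..32.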
Companion: this PR unblocks the round-3 consumer fleet which will
rewrite byte_find_all_avx2 / pack_indices / aabb_intersect_batch_sse41
and friends to use `crate::simd::U8x32` instead of `#[target_feature]`
wrappers around scalar code. Each consumer rewrite ships as its own PR
in the next wave.
1 parent fd11845, commit 521d23f
2 files changed
Lines changed: 525 additions & 0 deletions