Commit 521d23f

feat(simd_avx2): add U8x32 — native AVX2 byte vector (round-3 W1)
The keystone for the cosmetic-SIMD sweep agent #11 audited on PR #142.
That audit found 8 confirmed cosmetic SIMD wrappers in hpc/byte_scan.rs,
hpc/palette_codec.rs, and hpc/aabb.rs: `#[target_feature(enable = "avx2")]`
decorating scalar bodies that gave zero speedup over plain scalar. The root
cause: there was no `U8x32` type in the polyfill, so consumers couldn't write
SIMD byte code at AVX2's natural width (32 bytes = one __m256i ymm register).

This PR adds U8x32 with real __m256i storage and 26 polyfill methods
mirroring `simd_avx512::U8x64`:

- Constructors: splat, from_slice, from_array, to_array, copy_to_slice
- Reductions: reduce_sum (wrap-add), reduce_min, reduce_max, sum_bytes_u64
- Min/max: simd_min, simd_max (_mm256_min_epu8, _mm256_max_epu8)
- Compare→mask: cmpeq_mask → u32, cmpgt_mask → u32 (unsigned via xor 0x80),
  movemask → u32 (matches _mm256_movemask_epi8 width)
- Saturating: saturating_add, saturating_sub (_mm256_adds/subs_epu8)
- Avg: pairwise_avg (_mm256_avg_epu8, round-up)
- Shifts: shr_epi16, shl_epi16 (16-bit lane shifts via _mm256_srl/sll_epi16)
- Shuffles:
  - shuffle_bytes (within-128-bit-lane, _mm256_shuffle_epi8)
  - permute_bytes (cross-lane, scalar fallback; AVX2 has no native cross-lane
    byte permute; matches U8x64's behavior on AVX-512F-without-VBMI hosts)
  - unpack_lo_epi8, unpack_hi_epi8 (_mm256_unpacklo/hi_epi8)
- Conditional: mask_blend (_mm256_blendv_epi8, MSB-driven, NOT bitmask)
- LUT: nibble_popcount_lut

Plus operators and traits: BitAnd, BitOr, BitXor, Add (wrapping),
Sub (wrapping), Debug, Default. All ~26 methods.

Re-exported from `crate::simd::U8x32` for both AVX-512 and AVX2 build tiers.
U8x32 is the natural AVX2 byte width and is needed regardless of whether
AVX-512's U8x64 is the consumer's preferred width.

Soundness model matches the rest of simd_avx2.rs: `_mm256_*` intrinsics are
wrapped in `unsafe { }` blocks inside safe `pub fn`, trusting that AVX2 is
the compile target (x86-64-v3 is project baseline). The codebase uses this
pattern already in the AVX2 popcount at simd_avx2.rs:357.

Test coverage:
- 18 new tests in `mod u8x32_tests` covering: roundtrip, sum/min/max
  reductions, unsigned cmp masks (incl. high-byte > 127 to verify the
  XOR-0x80 unsigned trick), saturating add/sub clamps, pairwise_avg round-up,
  shr_epi16 nibble extraction, permute_bytes reverse, mask_blend per-MSB
  selection, nibble_popcount_lut via shuffle_bytes.
- All 18 pass. Total test count 1786 → 1804 with no regressions.
- clippy --features rayon -- -D warnings: clean.

Companion: this PR unblocks the round-3 consumer fleet which will rewrite
byte_find_all_avx2 / pack_indices / aabb_intersect_batch_sse41 and friends to
use `crate::simd::U8x32` instead of `#[target_feature]` wrappers around
scalar code. Each consumer rewrite ships as its own PR in the next wave.
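
To make the "unsigned via xor 0x80" note concrete, here is a minimal standalone sketch of how an unsigned byte greater-than mask can be built on AVX2, which only offers a signed byte compare (`_mm256_cmpgt_epi8`). It is not the crate's actual `cmpgt_mask`; the function names below are hypothetical. Unlike the commit's soundness model, the sketch does not assume AVX2 is the compile target: it uses runtime detection and a scalar fallback so it runs on any host.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: bit i is set when a[i] > b[i], comparing as unsigned bytes.
fn cmpgt_mask_scalar(a: &[u8; 32], b: &[u8; 32]) -> u32 {
    let mut mask = 0u32;
    for i in 0..32 {
        if a[i] > b[i] {
            mask |= 1u32 << i;
        }
    }
    mask
}

/// AVX2 path (hypothetical name). Flipping the top bit of every byte maps the
/// unsigned range 0..=255 onto the signed range -128..=127 while preserving
/// order, so the signed compare then gives the unsigned answer.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn cmpgt_mask_avx2(a: &[u8; 32], b: &[u8; 32]) -> u32 {
    // SAFETY: both pointers reference 32 readable bytes, loadu has no
    // alignment requirement, and the caller guarantees AVX2 is available.
    unsafe {
        let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
        let bias = _mm256_set1_epi8(0x80u8 as i8);
        let gt = _mm256_cmpgt_epi8(_mm256_xor_si256(va, bias), _mm256_xor_si256(vb, bias));
        // One bit per byte lane (the lane's MSB), 32 bits in total.
        _mm256_movemask_epi8(gt) as u32
    }
}

/// Safe entry point: uses AVX2 when the CPU reports it, scalar otherwise.
pub fn cmpgt_mask(a: &[u8; 32], b: &[u8; 32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just verified at runtime.
            return unsafe { cmpgt_mask_avx2(a, b) };
        }
    }
    cmpgt_mask_scalar(a, b)
}
```

Without the XOR, a lane holding 200 would be read as -56 by the signed compare and would wrongly lose against 100, which is the case the "high-byte > 127" tests are described as covering.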
1 parent fd11845 commit 521d23f

2 files changed

Lines changed: 525 additions & 0 deletions

File tree

src/simd.rs

Lines changed: 8 additions & 0 deletions
@@ -265,6 +265,14 @@ pub use crate::simd_avx2::{
     I32x16, I64x8, I8x64, U16x32, U32x16, U64x8, U8x64,
 };
 
+// U8x32 — native AVX2 byte width (one __m256i = 32 bytes). Available on
+// both AVX-512 and AVX2 builds: it's the natural width for byte-level
+// AVX2 ops, and on AVX-512 builds it's the half-register companion to
+// U8x64. Lives in simd_avx2.rs (single source of truth) and is re-exported
+// from both tier branches.
+#[cfg(target_arch = "x86_64")]
+pub use crate::simd_avx2::{u8x32, U8x32};
+
 // ============================================================================
 // Non-x86: scalar fallback types with identical API
 // ============================================================================
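
The re-export above follows a facade pattern: one public name, backed by a native type on x86_64 and by a scalar type with the same API elsewhere. The sketch below illustrates that pattern in isolation. The module names `native` and `fallback` and the `U8x16` type are hypothetical stand-ins, not the crate's code. A 16-byte SSE2 vector is used rather than `U8x32` because SSE2 is part of the x86_64 baseline, so the safe wrappers here do not depend on any assumption about the compile target.

```rust
// Hypothetical modules for illustration; the real crate layout is only partly
// visible in the diff.
#[cfg(target_arch = "x86_64")]
mod native {
    use std::arch::x86_64::*;

    /// 16-byte vector stored in an SSE2 register.
    #[derive(Clone, Copy)]
    pub struct U8x16(__m128i);

    impl U8x16 {
        pub fn splat(v: u8) -> Self {
            // SAFETY: SSE2 is always available on x86_64.
            Self(unsafe { _mm_set1_epi8(v as i8) })
        }

        pub fn to_array(self) -> [u8; 16] {
            let mut out = [0u8; 16];
            // SAFETY: `out` is 16 writable bytes; storeu needs no alignment.
            unsafe { _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, self.0) };
            out
        }

        pub fn saturating_add(self, rhs: Self) -> Self {
            // SAFETY: SSE2 is always available on x86_64.
            Self(unsafe { _mm_adds_epu8(self.0, rhs.0) })
        }
    }
}

#[cfg(not(target_arch = "x86_64"))]
mod fallback {
    /// Scalar stand-in exposing the same methods as the native type.
    #[derive(Clone, Copy)]
    pub struct U8x16([u8; 16]);

    impl U8x16 {
        pub fn splat(v: u8) -> Self {
            Self([v; 16])
        }

        pub fn to_array(self) -> [u8; 16] {
            self.0
        }

        pub fn saturating_add(self, rhs: Self) -> Self {
            let mut out = [0u8; 16];
            for i in 0..16 {
                out[i] = self.0[i].saturating_add(rhs.0[i]);
            }
            Self(out)
        }
    }
}

// Consumers import one name; cfg selects the backing implementation.
#[cfg(target_arch = "x86_64")]
pub use native::U8x16;
#[cfg(not(target_arch = "x86_64"))]
pub use fallback::U8x16;
```

Call sites such as `U8x16::splat(200).saturating_add(U8x16::splat(100))` compile unchanged on either architecture, which is the property the "identical API" comment in the diff refers to.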