feat: add xsimd::get<I>() for compile-time element extraction
Adds a new public API `xsimd::get<I>(batch)` that extracts a compile-time
indexed lane from a batch. Unlike the runtime `batch::get(i)`, the index is
a template parameter so each arch can dispatch to the best single-op path.
Design per architecture (objdump-verified, pure -march flags, no reliance
on compiler optimization):
- SSE2: `first` for I==0; 32/64-bit (int, float, double) go through
`swizzle + first` so the xsimd permute API emits the shuffle; 8/16-bit
stay on `psrldq + movd` because sse2 swizzle expands to 2 ops for
broadcast-to-lane-0 (pshuflw/pshufhw + unpck) while srli keeps it at 1.
- SSE4.1: native `pextrb/w/d/q` for integer (1 op); float override removed
so it falls through to sse2's swizzle path (equivalent 1-op codegen).
- AVX/AVX2: half-extract + delegate to sse4_1 (1 op low half, 2 ops upper
half — hardware lower bound).
- AVX-512F: `valignd`/`valignq` rotate + extract for float/double — 1 op
for every I, including upper half (was 2). Integer keeps the extract +
pextr* split (2 ops, optimal).
- NEON/NEON64: native per-lane `mov`/`umov v.X[I]` (1 op).
- RVV: skip `vslidedown` when I==0.
Tests build `array_type { xsimd::get<Is>(res)... }` via pack expansion,
compare against the reference array, and verify that reloading the extracted
values reproduces the original batch.
Verified on sse2, sse4.1, avx2, avx-512 (sde), aarch64 (qemu), rvv (qemu).