feat: add mulhi, mullo, and mulhilo for integer batches
Adds three integer-multiplication primitives exposed via the public API:
- mullo(x, y): low half of the lane-wise product (equivalent to x * y)
- mulhi(x, y): high half of the lane-wise product
- mulhilo(x, y): returns {mulhi, mullo} as a pair
Native kernels are provided for:
- NEON (vmull_* + vshrn for 8/16/32-bit; software path for 64-bit)
- SVE (svmulh_x)
- RVV (rvvmulh)
- SSE2 (mulhi_epi16 / mulhi_epu16)
- SSE4.1 (mul_epi32/mul_epu32 + blend for 32-bit; shared 64-bit core)
- AVX2 (mulhi_epi16/epu16, mul_epi32/mul_epu32 + blend; shared 64-bit core)
- AVX-512F (shared 64-bit core)
- AVX-512BW (mulhi_epi16/epu16)
The 64-bit x86 cores share a single implementation in common/xsimd_common_arithmetic.hpp:
mulhi_u64_core and mulhi_i64_core express the ll/lh/hl/hh decomposition with
xsimd batch operators (&, >>, +, -, bitwise_cast) plus an arch-specific
widening-mul functor (_mm*_mul_epu32). This eliminates three copies of the
same 64x64 -> hi software path and unifies the signed fixup into a single
arithmetic-shift-by-63 pattern (which maps to vpsraq on AVX-512 and is
emulated via bitwise_rshift on SSE4.1/AVX2).
The generic fallback in common dispatches per lane type through mulhi_helper:
it uses a wider native integer for <=32-bit types and a software
split-and-multiply (or __int128 when available) for 64-bit lanes.