Feat/mulhilo#1334
Conversation
Some toolchains (notably certain GCC builds) define shift and mul immediate intrinsics as macros that apply a textual C-style cast to their operand. That cast does not traverse the multi-level alias inheritance of simd_register (e.g. avx512bw -> avx512dq -> avx512cd -> avx512f), so a batch<T, ISA> fails to convert to its native register type in those contexts. Declare the conversion operator on batch itself so the native type is always one member-lookup away.
Adds three integer-multiplication primitives exposed via the public API:
- mullo(x, y): low half of the lane-wise product (equivalent to x * y)
- mulhi(x, y): high half of the lane-wise product
- mulhilo(x, y): returns {mulhi, mullo} as a pair
Native kernels are provided for:
- NEON (vmull_* + vshrn for 8/16/32-bit; software path for 64-bit)
- SVE (svmulh_x)
- RVV (rvvmulh)
- SSE2 (mulhi_epi16 / mulhi_epu16)
- SSE4.1 (mul_epi32/mul_epu32 + blend for 32-bit; shared 64-bit core)
- AVX2 (mulhi_epi16/epu16, mul_epi32/mul_epu32 + blend; shared 64-bit core)
- AVX-512F (shared 64-bit core)
- AVX-512BW (mulhi_epi16/epu16)
The 64-bit x86 cores share a single implementation in common/xsimd_common_arithmetic.hpp:
mulhi_u64_core and mulhi_i64_core express the ll/lh/hl/hh decomposition with
xsimd batch operators (&, >>, +, -, bitwise_cast) plus an arch-specific
widening-mul functor (_mm*_mul_epu32). This eliminates three copies of the
same 64x64 -> hi software path and unifies the signed-fixup to a single
arithmetic-shift-by-63 pattern (maps to vpsraq on AVX-512, emulated on
SSE4.1/AVX2 via bitwise_rshift).
The generic fallback in common dispatches per-type through mulhi_helper,
using a wider native integer for <=32-bit types and software split-and-
multiply (or __int128 when available) for 64-bit.
|
@AntoinePrv I optimized the cases I use in my RNG but it might be optimized further. What do you think? |
| /* Redeclare the conversion operator at the most-derived level. Some | ||
| * compilers fail to invoke the conversion inherited from | ||
| * types::simd_register when a batch is fed to an intrinsic defined as | ||
| * a macro (e.g. certain GCC shift/mullo imm intrinsics), because the | ||
| * textual C-style cast inside the macro does not traverse the alias | ||
| * inheritance chain. Declaring the operator here makes it visible on | ||
| * the batch type directly. */ | ||
| XSIMD_INLINE operator register_type() const noexcept | ||
| { | ||
| return this->data; | ||
| } | ||
|
|
There was a problem hiding this comment.
I've been wondering if instead of conversion operator, we should rather aim for a method batch::raw() -> register_type (that we roll out progressively).
I have also found that relying on conversions makes it a bit harder to follow what is going on, and is definitely the sort of things where we see differences between compilers. What do you think?
CC @serge-sans-paille @JohanMabille
| * Computes the low N bits of the 2N-bit product of integer batches \c x and \c y. | ||
| * Equivalent to ``mul(x, y)`` — the low half of the full product is identical for | ||
| * both signed and unsigned interpretations. |
There was a problem hiding this comment.
Can we extend this a little bit, it was not immediately clear to me. Perhaps adding something like
The function behaves as if it computes the per-slot product of x and y using double the input width as an intermediate representation (e.g. 64 bits for 32-bit inputs), then returning the lower half of the result (the lower 32 bits in this example).
Some toolchains (notably certain GCC builds) define shift and mul
immediate intrinsics as macros that apply a textual C-style cast to
their operand. That cast does not traverse the multi-level alias
inheritance of simd_register (e.g. avx512bw -> avx512dq -> avx512cd
-> avx512f), so a batch<T, ISA> fails to convert to its native
register type in those contexts.
Declare the conversion operator on batch itself so the native type is
always one member-lookup away.