
Feat/mulhilo#1334

Draft
DiamonDinoia wants to merge 2 commits into xtensor-stack:master from DiamonDinoia:feat/mulhilo

Conversation

@DiamonDinoia
Contributor

Some toolchains (notably certain GCC builds) define shift and mul
immediate intrinsics as macros that apply a textual C-style cast to
their operand. That cast does not traverse the multi-level alias
inheritance of simd_register (e.g. avx512bw -> avx512dq -> avx512cd
-> avx512f), so a batch<T, ISA> fails to convert to its native
register type in those contexts.

Declare the conversion operator on batch itself so the native type is
always one member-lookup away.

Adds three integer-multiplication primitives exposed via the public API:
  - mullo(x, y): low half of the lane-wise product (equivalent to x * y)
  - mulhi(x, y): high half of the lane-wise product
  - mulhilo(x, y): returns {mulhi, mullo} as a pair
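For reference, the intended per-lane semantics can be modelled in scalar code (a minimal sketch, not xsimd code; the function names here are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Scalar model of one 32-bit lane: widen to 64 bits, multiply, then take
// the low or high half of the full 64-bit product.
inline uint32_t mullo32(uint32_t x, uint32_t y)
{
    return static_cast<uint32_t>(static_cast<uint64_t>(x) * y); // low half
}

inline uint32_t mulhi_u32(uint32_t x, uint32_t y)
{
    return static_cast<uint32_t>((static_cast<uint64_t>(x) * y) >> 32); // high half
}

inline int32_t mulhi_i32(int32_t x, int32_t y)
{
    // signed lanes widen with sign extension before the multiply
    return static_cast<int32_t>((static_cast<int64_t>(x) * static_cast<int64_t>(y)) >> 32);
}

inline std::pair<uint32_t, uint32_t> mulhilo_u32(uint32_t x, uint32_t y)
{
    return { mulhi_u32(x, y), mullo32(x, y) }; // {hi, lo}, matching the order above
}
```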

Native kernels are provided for:
  - NEON (vmull_* + vshrn for 8/16/32-bit; software path for 64-bit)
  - SVE (svmulh_x)
  - RVV (rvvmulh)
  - SSE2 (mulhi_epi16 / mulhi_epu16)
  - SSE4.1 (mul_epi32/mul_epu32 + blend for 32-bit; shared 64-bit core)
  - AVX2 (mulhi_epi16/epu16, mul_epi32/mul_epu32 + blend; shared 64-bit core)
  - AVX-512F (shared 64-bit core)
  - AVX-512BW (mulhi_epi16/epu16)

The 64-bit x86 cores share a single implementation in common/xsimd_common_arithmetic.hpp:
mulhi_u64_core and mulhi_i64_core express the ll/lh/hl/hh decomposition with
xsimd batch operators (&, >>, +, -, bitwise_cast) plus an arch-specific
widening-mul functor (_mm*_mul_epu32). This eliminates three copies of the
same 64x64 -> hi software path and unifies the signed-fixup to a single
arithmetic-shift-by-63 pattern (maps to vpsraq on AVX-512, emulated on
SSE4.1/AVX2 via bitwise_rshift).
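In scalar form, the shared decomposition and the signed fixup look roughly like this (an illustrative sketch; the real cores operate on whole batches through the arch-specific widening-mul functor, and the `>> 63` on a negative signed value assumes the usual arithmetic-shift behaviour):

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the shared 64-bit core: split each operand into 32-bit
// halves, form the four partial products ll/lh/hl/hh with 32x32->64
// widening multiplies, and propagate the middle carries into the high word.
inline uint64_t mulhi_u64_ref(uint64_t x, uint64_t y)
{
    const uint64_t mask = 0xFFFFFFFFull;
    uint64_t xl = x & mask, xh = x >> 32;
    uint64_t yl = y & mask, yh = y >> 32;

    uint64_t ll = xl * yl;
    uint64_t lh = xl * yh;
    uint64_t hl = xh * yl;
    uint64_t hh = xh * yh;

    // carry out of the middle column into the high word
    uint64_t mid = (ll >> 32) + (lh & mask) + (hl & mask);
    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

// Signed high half via the unsigned core plus the arithmetic-shift-by-63
// fixup: subtract y when x is negative and x when y is negative.
inline int64_t mulhi_i64_ref(int64_t x, int64_t y)
{
    uint64_t hi = mulhi_u64_ref(static_cast<uint64_t>(x), static_cast<uint64_t>(y));
    hi -= static_cast<uint64_t>(x >> 63) & static_cast<uint64_t>(y);
    hi -= static_cast<uint64_t>(y >> 63) & static_cast<uint64_t>(x);
    return static_cast<int64_t>(hi);
}
```

The `x >> 63` mask is the pattern that maps to vpsraq on AVX-512 and is emulated via bitwise_rshift on SSE4.1/AVX2.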

The generic fallback in common dispatches per-type through mulhi_helper, using a wider native integer for <=32-bit types and a software split-and-multiply (or __int128 when available) for 64-bit.
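For the <=32-bit case the fallback amounts to the usual widen-multiply-shift, roughly like this (a sketch assuming a C++17 toolchain; the template name is illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Scalar model of the generic fallback for <=32-bit integer types: pick a
// native integer of at least double the width, multiply there, and shift
// the full product down by the input width.
template <class T>
T mulhi_fallback(T x, T y)
{
    static_assert(sizeof(T) <= 4, "64-bit types take the split-and-multiply path");
    using Wide = std::conditional_t<std::is_signed_v<T>, int64_t, uint64_t>;
    return static_cast<T>((static_cast<Wide>(x) * static_cast<Wide>(y)) >> (8 * sizeof(T)));
}
```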
@DiamonDinoia
Contributor Author

@AntoinePrv I optimized the cases I use in my RNG, but it could probably be optimized further. What do you think?

Comment on lines +132 to +143
/* Redeclare the conversion operator at the most-derived level. Some
* compilers fail to invoke the conversion inherited from
* types::simd_register when a batch is fed to an intrinsic defined as
* a macro (e.g. certain GCC shift/mullo imm intrinsics), because the
* textual C-style cast inside the macro does not traverse the alias
* inheritance chain. Declaring the operator here makes it visible on
* the batch type directly. */
XSIMD_INLINE operator register_type() const noexcept
{
    return this->data;
}

Contributor


I've been wondering if, instead of a conversion operator, we should rather aim for a method batch::raw() -> register_type (that we roll out progressively).
I have also found that relying on conversions makes it a bit harder to follow what is going on, and it is definitely the sort of thing where we see differences between compilers. What do you think?
CC @serge-sans-paille @JohanMabille
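To make the trade-off concrete, here is a minimal mock (not xsimd code; all names are hypothetical) showing the two spellings side by side:

```cpp
#include <cassert>

// Stand-in for the native SIMD register type.
struct mock_register { int data; };

struct mock_batch
{
    mock_register reg;

    // Implicit conversion: convenient at intrinsic call sites, but invisible
    // in the source and sensitive to how macro-wrapped intrinsics spell
    // their textual casts.
    operator mock_register() const noexcept { return reg; }

    // Explicit accessor: one obvious spelling that any macro expansion or
    // C-style cast can reach, at the cost of touching every call site.
    mock_register raw() const noexcept { return reg; }
};
```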

Comment on lines +1751 to +1753
* Computes the low N bits of the 2N-bit product of integer batches \c x and \c y.
* Equivalent to ``mul(x, y)`` — the low half of the full product is identical for
* both signed and unsigned interpretations.
Contributor


Can we extend this a little bit? It was not immediately clear to me. Perhaps adding something like

The function behaves as if it computes the per-slot product of x and y using double the input width as an intermediate representation (e.g. 64 bits for 32-bit inputs), then returns the lower half of the result (the lower 32 bits in this example).
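The suggested wording can also be checked against a scalar model showing that the low half really is interpretation-independent (illustrative code, not part of the patch):

```cpp
#include <cassert>
#include <cstdint>

// Low 32 bits of the product with both operands read as unsigned.
inline uint32_t low_half_unsigned(uint32_t x, uint32_t y)
{
    return static_cast<uint32_t>(static_cast<uint64_t>(x) * y);
}

// Low 32 bits of the product with both operands reinterpreted as signed:
// modular arithmetic makes the low half come out bit-identical.
inline uint32_t low_half_signed(uint32_t x, uint32_t y)
{
    int64_t p = static_cast<int64_t>(static_cast<int32_t>(x))
              * static_cast<int64_t>(static_cast<int32_t>(y));
    return static_cast<uint32_t>(static_cast<uint64_t>(p));
}
```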
