
Feat/mulhilo#1334

Draft
DiamonDinoia wants to merge 2 commits into xtensor-stack:master from DiamonDinoia:feat/mulhilo

Conversation

@DiamonDinoia
Contributor

Some toolchains (notably certain GCC builds) define shift and mul
immediate intrinsics as macros that apply a textual C-style cast to
their operand. That cast does not traverse the multi-level alias
inheritance of simd_register (e.g. avx512bw -> avx512dq -> avx512cd
-> avx512f), so a batch<T, ISA> fails to convert to its native
register type in those contexts.

Declare the conversion operator on batch itself so the native type is
always one member-lookup away.

Adds three integer-multiplication primitives exposed via the public API:
  - mullo(x, y): low half of the lane-wise product (equivalent to x * y)
  - mulhi(x, y): high half of the lane-wise product
  - mulhilo(x, y): returns {mulhi, mullo} as a pair
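For reference, the intended per-lane semantics can be modelled in scalar code (a minimal sketch, not xsimd code; the function names here are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Scalar model of one 32-bit lane: widen to 64 bits, multiply, then take
// the low or high half of the full 64-bit product.
inline uint32_t mullo32(uint32_t x, uint32_t y)
{
    return static_cast<uint32_t>(static_cast<uint64_t>(x) * y); // low half
}

inline uint32_t mulhi_u32(uint32_t x, uint32_t y)
{
    return static_cast<uint32_t>((static_cast<uint64_t>(x) * y) >> 32); // high half
}

inline int32_t mulhi_i32(int32_t x, int32_t y)
{
    // signed lanes widen with sign extension before the multiply
    return static_cast<int32_t>((static_cast<int64_t>(x) * static_cast<int64_t>(y)) >> 32);
}

inline std::pair<uint32_t, uint32_t> mulhilo_u32(uint32_t x, uint32_t y)
{
    return { mulhi_u32(x, y), mullo32(x, y) }; // {hi, lo}, matching the order above
}
```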

Native kernels are provided for:
  - NEON (vmull_* + vshrn for 8/16/32-bit; software path for 64-bit)
  - SVE (svmulh_x)
  - RVV (rvvmulh)
  - SSE2 (mulhi_epi16 / mulhi_epu16)
  - SSE4.1 (mul_epi32/mul_epu32 + blend for 32-bit; shared 64-bit core)
  - AVX2 (mulhi_epi16/epu16, mul_epi32/mul_epu32 + blend; shared 64-bit core)
  - AVX-512F (shared 64-bit core)
  - AVX-512BW (mulhi_epi16/epu16)

The 64-bit x86 cores share a single implementation in common/xsimd_common_arithmetic.hpp:
mulhi_u64_core and mulhi_i64_core express the ll/lh/hl/hh decomposition with
xsimd batch operators (&, >>, +, -, bitwise_cast) plus an arch-specific
widening-mul functor (_mm*_mul_epu32). This eliminates three copies of the
same 64x64 -> hi software path and unifies the signed-fixup to a single
arithmetic-shift-by-63 pattern (maps to vpsraq on AVX-512, emulated on
SSE4.1/AVX2 via bitwise_rshift).
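In scalar form, the shared decomposition and the signed fixup look roughly like this (an illustrative sketch; the real cores operate on whole batches through the arch-specific widening-mul functor, and the `>> 63` on a negative signed value assumes the usual arithmetic-shift behaviour):

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of the shared 64-bit core: split each operand into 32-bit
// halves, form the four partial products ll/lh/hl/hh with 32x32->64
// widening multiplies, and propagate the middle carries into the high word.
inline uint64_t mulhi_u64_ref(uint64_t x, uint64_t y)
{
    const uint64_t mask = 0xFFFFFFFFull;
    uint64_t xl = x & mask, xh = x >> 32;
    uint64_t yl = y & mask, yh = y >> 32;

    uint64_t ll = xl * yl;
    uint64_t lh = xl * yh;
    uint64_t hl = xh * yl;
    uint64_t hh = xh * yh;

    // carry out of the middle column into the high word
    uint64_t mid = (ll >> 32) + (lh & mask) + (hl & mask);
    return hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}

// Signed high half via the unsigned core plus the arithmetic-shift-by-63
// fixup: subtract y when x is negative and x when y is negative.
inline int64_t mulhi_i64_ref(int64_t x, int64_t y)
{
    uint64_t hi = mulhi_u64_ref(static_cast<uint64_t>(x), static_cast<uint64_t>(y));
    hi -= static_cast<uint64_t>(x >> 63) & static_cast<uint64_t>(y);
    hi -= static_cast<uint64_t>(y >> 63) & static_cast<uint64_t>(x);
    return static_cast<int64_t>(hi);
}
```

The `x >> 63` mask is the pattern that maps to vpsraq on AVX-512 and is emulated via bitwise_rshift on SSE4.1/AVX2.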

The generic fallback in common dispatches per-type through mulhi_helper, using a wider native integer for <=32-bit types and a software split-and-multiply (or __int128 when available) for 64-bit.
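For the <=32-bit case the fallback amounts to the usual widen-multiply-shift, roughly like this (a sketch assuming a C++17 toolchain; the template name is illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Scalar model of the generic fallback for <=32-bit integer types: pick a
// native integer of at least double the width, multiply there, and shift
// the full product down by the input width.
template <class T>
T mulhi_fallback(T x, T y)
{
    static_assert(sizeof(T) <= 4, "64-bit types take the split-and-multiply path");
    using Wide = std::conditional_t<std::is_signed_v<T>, int64_t, uint64_t>;
    return static_cast<T>((static_cast<Wide>(x) * static_cast<Wide>(y)) >> (8 * sizeof(T)));
}
```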
@DiamonDinoia
Contributor Author

@AntoinePrv I optimized the cases I use in my RNG, but it could probably be optimized further. What do you think?

Comment on lines +132 to +143
/* Redeclare the conversion operator at the most-derived level. Some
* compilers fail to invoke the conversion inherited from
* types::simd_register when a batch is fed to an intrinsic defined as
* a macro (e.g. certain GCC shift/mullo imm intrinsics), because the
* textual C-style cast inside the macro does not traverse the alias
* inheritance chain. Declaring the operator here makes it visible on
* the batch type directly. */
XSIMD_INLINE operator register_type() const noexcept
{
    return this->data;
}

Contributor


I've been wondering if, instead of a conversion operator, we should rather aim for a method batch::raw() -> register_type (that we roll out progressively).
I have also found that relying on conversions makes it a bit harder to follow what is going on, and it is definitely the sort of thing where we see differences between compilers. What do you think?
CC @serge-sans-paille @JohanMabille
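To make the trade-off concrete, here is a minimal mock (not xsimd code; all names are hypothetical) showing the two spellings side by side:

```cpp
#include <cassert>

// Stand-in for the native SIMD register type.
struct mock_register { int data; };

struct mock_batch
{
    mock_register reg;

    // Implicit conversion: convenient at intrinsic call sites, but invisible
    // in the source and sensitive to how macro-wrapped intrinsics spell
    // their textual casts.
    operator mock_register() const noexcept { return reg; }

    // Explicit accessor: one obvious spelling that any macro expansion or
    // C-style cast can reach, at the cost of touching every call site.
    mock_register raw() const noexcept { return reg; }
};
```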

Comment on lines +1751 to +1753
* Computes the low N bits of the 2N-bit product of integer batches \c x and \c y.
* Equivalent to ``mul(x, y)`` — the low half of the full product is identical for
* both signed and unsigned interpretations.
Contributor


Can we extend this a little bit? It was not immediately clear to me. Perhaps adding something like

The function behaves as if it computes the per-slot product of x and y using double the input width as an intermediate representation (e.g. 64 bits for 32-bit inputs), then returns the lower half of the result (the lower 32 bits in this example).
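The suggested wording can also be checked against a scalar model showing that the low half really is interpretation-independent (illustrative code, not part of the patch):

```cpp
#include <cassert>
#include <cstdint>

// Low 32 bits of the product with both operands read as unsigned.
inline uint32_t low_half_unsigned(uint32_t x, uint32_t y)
{
    return static_cast<uint32_t>(static_cast<uint64_t>(x) * y);
}

// Low 32 bits of the product with both operands reinterpreted as signed:
// modular arithmetic makes the low half come out bit-identical.
inline uint32_t low_half_signed(uint32_t x, uint32_t y)
{
    int64_t p = static_cast<int64_t>(static_cast<int32_t>(x))
              * static_cast<int64_t>(static_cast<int32_t>(y));
    return static_cast<uint32_t>(static_cast<uint64_t>(p));
}
```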
