Merged
13 changes: 13 additions & 0 deletions include/xsimd/arch/generic/xsimd_generic_memory.hpp
@@ -260,6 +260,19 @@ namespace xsimd
return buffer[i];
}

// load
template <class A, class T>
XSIMD_INLINE batch_bool<T, A> load(bool const* mem, batch_bool<T, A>, requires_arch<generic>) noexcept
{
using batch_type = batch<T, A>;
batch_type ref(0);
constexpr auto size = batch_bool<T, A>::size;
alignas(A::alignment()) T buffer[size];
for (std::size_t i = 0; i < size; ++i)
buffer[i] = mem[i] ? 1 : 0;
return ref != batch_type::load_aligned(&buffer[0]);
Contributor:
I wonder if this is going to be faster than using _load_mask16 and the like?

Contributor Author:
IIUC, you mean loading/storing batch_bool as bits rather than bytes? Apart from the load/store bandwidth, there seems to be little difference between the two.

Contributor Author (junparser, Mar 21, 2025):
Do you mean something like
__mmask64 mask = _mm512_cmpgt_epu8_mask(vecInput, _mm512_setzero_epi32()); for AVX-512? _mm512_cmpgt_epu8_mask is under avx512bw; I'll add it later.

Contributor Author:

OK, implemented in xsimd_avx512bw.hpp.

Contributor:

If you have the chance to check performance, I've written another approach to computing masks that looks quite efficient: https://github.com/serge-sans-paille/fast-bitset-from-bool-array

Contributor Author:

I've tested the cases with sizes 8, 16, 32, and 64 (cache hit); it is 2x~8x faster than the original version, and also better than the avx512bw version for sizes 8, 16, and 32. It may be better to handle size 64 in avx512bw.
Would you like to send another patch, or should I just update this one?

Contributor:

Please just update this one, that's fine.

}

// load_aligned
namespace detail
{
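The generic path above can be sketched in portable scalar code. This is a minimal sketch under stated assumptions: load_bool_mask and the fixed uint64_t mask type are illustrative stand-ins, not xsimd API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of the generic kernel::load above: widen each bool into a lane
// buffer, then compare against zero so each nonzero lane yields a mask bit.
template <std::size_t N>
uint64_t load_bool_mask(const bool* mem)
{
    static_assert(N <= 64, "mask modeled as a uint64_t");
    uint32_t buffer[N]; // stands in for the aligned T buffer
    for (std::size_t i = 0; i < N; ++i)
        buffer[i] = mem[i] ? 1 : 0;
    uint64_t mask = 0;
    for (std::size_t i = 0; i < N; ++i)
        mask |= uint64_t(buffer[i] != 0) << i; // models ref != load_aligned(...)
    return mask;
}
```

The detour through a widened buffer mirrors the kernel: batch_bool has no byte-level load of its own, so the bools are first materialized as lane values and the comparison produces the mask.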
26 changes: 26 additions & 0 deletions include/xsimd/arch/xsimd_avx512bw.hpp
@@ -315,6 +315,22 @@ namespace xsimd
return detail::compare_int_avx512bw<A, T, _MM_CMPINT_LT>(self, other);
}

// load
template <class A, class T>
XSIMD_INLINE batch_bool<T, A> load(bool const* mem, batch_bool<T, A>, requires_arch<avx512bw>) noexcept
{
using register_type = typename batch_bool<T, A>::register_type;
XSIMD_IF_CONSTEXPR(batch_bool<T, A>::size == 64)
{
__m512i bool_val = _mm512_loadu_si512((__m512i const*)mem);
return (register_type)_mm512_cmpgt_epu8_mask(bool_val, _mm512_setzero_si512());
}
else
{
return load(mem, batch_bool<T, A>(), avx512dq {});
}
}
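As a sanity model for the intrinsic used above (not a replacement for it): _mm512_cmpgt_epu8_mask compares unsigned bytes lane-wise and returns one bit per byte, so comparing against zero simply flags the nonzero bytes. A scalar equivalent:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of _mm512_cmpgt_epu8_mask(v, zero): bit i of the result is
// set when byte i is (unsigned-)greater than zero, i.e. nonzero.
uint64_t cmpgt_zero_mask(const unsigned char* bytes, std::size_t n)
{
    uint64_t mask = 0;
    for (std::size_t i = 0; i < n; ++i)
        mask |= uint64_t(bytes[i] > 0) << i;
    return mask;
}
```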

// max
template <class A, class T, class = typename std::enable_if<std::is_integral<T>::value, void>::type>
XSIMD_INLINE batch<T, A> max(batch<T, A> const& self, batch<T, A> const& other, requires_arch<avx512bw>) noexcept
@@ -628,6 +644,16 @@ namespace xsimd
}
}

// store
template <class T, class A>
XSIMD_INLINE void store(batch_bool<T, A> const& self, bool* mem, requires_arch<avx512bw>) noexcept
{
constexpr auto size = batch_bool<T, A>::size;
__m512i bool_val = _mm512_maskz_set1_epi8(self.data, 0x01);
__mmask64 mask = size >= 64 ? ~(__mmask64)0 : (1ULL << size) - 1;
_mm512_mask_storeu_epi8((void*)mem, mask, bool_val);
}
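The store above can be modeled in scalar code (a sketch; store_bool_mask is an illustrative name): _mm512_maskz_set1_epi8 broadcasts 0x01 into the lanes selected by self.data and zeroes the rest, and the masked store then writes only the first size bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of the AVX-512BW store above: expand one mask bit per lane
// into a 0x01/0x00 byte and write only the first `size` lanes.
void store_bool_mask(uint64_t mask, bool* mem, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i)
        mem[i] = ((mask >> i) & 1) != 0;
}
```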

// sub
template <class A, class T, class = typename std::enable_if<std::is_integral<T>::value, void>::type>
XSIMD_INLINE batch<T, A> sub(batch<T, A> const& self, batch<T, A> const& other, requires_arch<avx512bw>) noexcept
33 changes: 33 additions & 0 deletions include/xsimd/arch/xsimd_avx512f.hpp
@@ -1193,6 +1193,39 @@ namespace xsimd
return detail::compare_int_avx512f<A, T, _MM_CMPINT_LE>(self, other);
}

namespace detail
{
// Adapted from https://github.com/serge-sans-paille/fast-bitset-from-bool-array
// Generate a bitset from an array of boolean.
XSIMD_INLINE unsigned char tobitset(unsigned char unpacked[8])
{
uint64_t data;
memcpy(&data, unpacked, sizeof(uint64_t));

const uint64_t magic = (0x80 + 0x4000 + 0x200000 + 0x10000000 + 0x0800000000 + 0x040000000000 + 0x02000000000000 + 0x0100000000000000);

unsigned char res = ((data * magic) >> 56) & 0xFF;
return res;
}
}
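The magic constant above is the sum of 2^(7+7j) for j = 0..7. The 0/1 byte at position i sits at bit 8i of data, so the multiply copies it to bits 8i+7+7j; no two (i, j) pairs collide and the bytes are at most 1, so there are no carries, and exactly the term j = 7-i lands byte i at bit 56+i. The final >> 56 therefore yields result bit i = unpacked[i]. A standalone copy for experimenting (assumes a little-endian host, as the memcpy does; tobitset_copy is an illustrative name):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Standalone copy of the tobitset trick above: pack eight 0/1 bytes into
// one mask byte with a single multiply and shift (little-endian assumed).
unsigned char tobitset_copy(const unsigned char unpacked[8])
{
    uint64_t data;
    std::memcpy(&data, unpacked, sizeof(uint64_t));
    const uint64_t magic = 0x0102040810204080ULL; // sum of 2^(7 + 7j), j = 0..7
    return (unsigned char)((data * magic) >> 56);
}
```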

// load
template <class A, class T>
XSIMD_INLINE batch_bool<T, A> load(bool const* mem, batch_bool<T, A>, requires_arch<avx512f>) noexcept
{
using register_type = typename batch_bool<T, A>::register_type;
constexpr auto size = batch_bool<T, A>::size;
constexpr auto iter = size / 8;
static_assert(size % 8 == 0, "incorrect size of bool batch");
register_type mask = 0;
for (std::size_t i = 0; i < iter; ++i)
{
unsigned char block = detail::tobitset((unsigned char*)mem + i * 8);
mask |= (register_type(block) << (i * 8));
}
return mask;
}

// load_aligned
template <class A, class T, class = typename std::enable_if<std::is_integral<T>::value, void>::type>
XSIMD_INLINE batch<T, A> load_aligned(T const* mem, convert<T>, requires_arch<avx512f>) noexcept
6 changes: 1 addition & 5 deletions include/xsimd/types/xsimd_batch.hpp
@@ -968,11 +968,7 @@ namespace xsimd
template <class T, class A>
XSIMD_INLINE batch_bool<T, A> batch_bool<T, A>::load_aligned(bool const* mem) noexcept
{
batch_type ref(0);
alignas(A::alignment()) T buffer[size];
for (std::size_t i = 0; i < size; ++i)
buffer[i] = mem[i] ? 1 : 0;
return ref != batch_type::load_aligned(&buffer[0]);
return kernel::load<A>(mem, batch_bool<T, A>(), A {});
Contributor:
Hell, I missed one point: given the implementation specialization, we now need to distinguish between the aligned and unaligned cases.
Sorry for the extra back-and-forth :-/

The fact that CI passes probably means we need extra tests for this.

Contributor Author:
I don't get the point: do we need separate aligned and unaligned cases for bool load?

Contributor Author:

And we already have template <class T, class A> XSIMD_INLINE batch_bool<T, A> batch_bool<T, A>::load_unaligned(bool const* mem) noexcept in https://github.com/xtensor-stack/xsimd/pull/1095/files#diff-4225c7e955e5d693f65f483a7d38b0f6af3e229544cd037f4da05d6b5de2e8b2R975

Contributor:

Well, mem may be unaligned, and the current unaligned loader calls the aligned one. If we end up (not the case in your avx512 implementation, but it actually could/should happen) calling an aligned load in that case, we're blessed with a segfault :-)

Contributor Author:

OK, you mean A::alignment() for the access of bool const* mem. I'm not sure how we should deal with it here, since the names kernel::load/store suggest to me that alignment is ignored at this level. Any suggestions?

Contributor Author (junparser, Mar 26, 2025):

Since this is implementation-defined behavior, how about we deliberately make mem unaligned when testing batch_bool_type::load_unaligned, as a guard?

Contributor:

> OK, you mean A::alignment() for the access of bool const* mem. I'm not sure how we should deal with it here, since the names kernel::load/store suggest to me that alignment is ignored at this level. Any suggestions?

We generally provide two versions of the function: load_aligned and load_unaligned.

Contributor Author:

> We generally provide two versions of the function: load_aligned and load_unaligned.

batch_bool was using plain load/store before this patch. If we want to unify batch_bool load/store into load_aligned and load_unaligned, it's better to do that in another PR.
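The guard suggested earlier can be sketched as follows (illustrative, not xsimd test code; is_misaligned and the 64-byte figure are assumptions standing in for A::alignment()): take over-aligned storage and offset the pointer by one element so it cannot satisfy the vector alignment, then feed that pointer to load_unaligned.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when p does not satisfy `align`-byte alignment.
bool is_misaligned(const void* p, std::size_t align)
{
    return reinterpret_cast<std::uintptr_t>(p) % align != 0;
}

// Over-aligned storage: storage itself is 64-byte aligned, so storage + 1
// is guaranteed to miss that alignment; an unaligned-load test would pass
// this pointer and must not reach an aligned vector load.
alignas(64) bool storage[65] = {};
```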

}

template <class T, class A>