add batch_bool kernel::load with requires_arch<avx512f> #1095
Changes from all commits. First hunk, in the AVX512F kernel (`@@ -1193,6 +1193,39 @@`):

```diff
@@ -1193,6 +1193,39 @@ namespace xsimd
             return detail::compare_int_avx512f<A, T, _MM_CMPINT_LE>(self, other);
         }

+        namespace detail
+        {
+            // Adapted from https://github.com/serge-sans-paille/fast-bitset-from-bool-array
+            // Generate a bitset from an array of booleans.
+            XSIMD_INLINE unsigned char tobitset(unsigned char unpacked[8])
+            {
+                uint64_t data;
+                memcpy(&data, unpacked, sizeof(uint64_t));
+                const uint64_t magic = (0x80 + 0x4000 + 0x200000 + 0x10000000 + 0x0800000000 + 0x040000000000 + 0x02000000000000 + 0x0100000000000000);
+                unsigned char res = ((data * magic) >> 56) & 0xFF;
+                return res;
+            }
+        }
+
+        // load
+        template <class A, class T>
+        XSIMD_INLINE batch_bool<T, A> load(bool const* mem, batch_bool<T, A>, requires_arch<avx512f>) noexcept
+        {
+            using register_type = typename batch_bool<T, A>::register_type;
+            constexpr auto size = batch_bool<T, A>::size;
+            constexpr auto iter = size / 8;
+            static_assert(size % 8 == 0, "incorrect size of bool batch");
+            register_type mask = 0;
+            for (std::size_t i = 0; i < iter; ++i)
+            {
+                unsigned char block = detail::tobitset((unsigned char*)mem + i * 8);
+                mask |= (register_type(block) << (i * 8));
+            }
+            return mask;
+        }
+
         // load_aligned
         template <class A, class T, class = typename std::enable_if<std::is_integral<T>::value, void>::type>
         XSIMD_INLINE batch<T, A> load_aligned(T const* mem, convert<T>, requires_arch<avx512f>) noexcept
```

**Contributor:** Can you quote https://github.com/serge-sans-paille/fast-bitset-from-bool-array here?
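The `tobitset` helper packs eight bool bytes into one mask byte with a single 64-bit multiply: the multiplier places the low bit of byte `i` at result bit `56 + i`, so the top byte of the product is the packed mask. A self-contained sketch of the same trick (the name `pack_bools` and the pre-folded constant are mine; it assumes little-endian byte order and bools stored as 0/1 bytes, as the kernel does):

```cpp
#include <cstdint>
#include <cstring>

// Pack eight 0/1 bytes into one byte of bits, with bit i = unpacked[i].
// 0x0102040810204080 is the folded form of the sum used in the patch:
// byte i of the multiplier is 0x80 >> i, which routes byte i's low bit
// to bit 56 + i of the product without any cross-term carries.
inline unsigned char pack_bools(const unsigned char unpacked[8])
{
    std::uint64_t data;
    std::memcpy(&data, unpacked, sizeof(data)); // byte i lands at bits [8i, 8i+7]
    const std::uint64_t magic = 0x0102040810204080ULL;
    return static_cast<unsigned char>((data * magic) >> 56);
}
```

For example, `{1,0,1,1,0,0,0,1}` packs to `0x8D` (bits 0, 2, 3, and 7 set).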
Second hunk, in the `batch_bool` member implementation (`@@ -968,11 +968,7 @@`):

```diff
@@ -968,11 +968,7 @@ namespace xsimd
     template <class T, class A>
     XSIMD_INLINE batch_bool<T, A> batch_bool<T, A>::load_aligned(bool const* mem) noexcept
     {
-        batch_type ref(0);
-        alignas(A::alignment()) T buffer[size];
-        for (std::size_t i = 0; i < size; ++i)
-            buffer[i] = mem[i] ? 1 : 0;
-        return ref != batch_type::load_aligned(&buffer[0]);
+        return kernel::load<A>(mem, batch_bool<T, A>(), A {});
     }

     template <class T, class A>
```

**Contributor:** Hell, I missed one point: given the implementation specialization, we now need to distinguish between the aligned and the unaligned case. The fact that CI passes probably means we need extra tests for this.

**Author:** I don't get the point; do we need separate aligned and unaligned cases for bool loads?

**Author:** and we already have […]

**Contributor:** well if […]

**Author:** OK, you mean […]

**Author:** since this is implementation-defined behavior, how about we manually set […]

**Contributor:** We generally provide two versions of the function: load_aligned and load_unaligned.

**Author:** batch_bool kept using load/store before this patch. If we want to unify batch_bool load/store into load_aligned and load_unaligned, that is better done in another PR.
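The refactor above must preserve semantics: the old path widened each bool into an integer lane and compared against zero, while the new path packs bits directly. A scalar model of both (hypothetical names, 16 lanes, mask-register style as on AVX512) that can be used to sanity-check the equivalence:

```cpp
#include <cstdint>
#include <cstddef>

// Model of the old load_aligned path: widen each bool to an integer
// lane, then compare the lanes against zero to form the mask.
std::uint16_t old_path(const bool* mem)
{
    int buffer[16];
    for (std::size_t i = 0; i < 16; ++i)
        buffer[i] = mem[i] ? 1 : 0;
    std::uint16_t mask = 0;
    for (std::size_t i = 0; i < 16; ++i)
        if (buffer[i] != 0)
            mask |= std::uint16_t(1u << i);
    return mask;
}

// Model of the new kernel::load path: set bit i directly from mem[i].
std::uint16_t new_path(const bool* mem)
{
    std::uint16_t mask = 0;
    for (std::size_t i = 0; i < 16; ++i)
        mask |= std::uint16_t(std::uint16_t(mem[i]) << i);
    return mask;
}
```

Both should produce identical masks for every input, which is what the extra tests mentioned in the thread would pin down.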
**Contributor:** I wonder if this is going to be faster than using `_load_mask16` and the likes?

**Author:** IIUC, you mean load/store batch_bool with bits rather than bytes? Except for the load/store bandwidth, there seems to be little difference between the two.

**Contributor:** Actually https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=movemask&ig_expand=4606 looks like a good match, right?

**Author:** Do you mean something like `__mmask64 mask = _mm512_cmpgt_epu8_mask(vecInput, _mm512_setzero_epi32());` for AVX512? `_mm512_cmpgt_epu8_mask` is under avx512bw; I'll add it later.

**Author:** ok, implemented in xsimd_avx512bw.hpp
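For reference, the avx512bw route discussed here maps each nonzero byte lane to one mask bit; `_mm512_cmpgt_epu8_mask(input, zero)` does that for 64 lanes in a single instruction. A portable scalar model of that semantics (hypothetical name, useful for testing without AVX512BW hardware):

```cpp
#include <cstdint>
#include <cstddef>

// Scalar model of _mm512_cmpgt_epu8_mask(v, zero): bit i of the result
// is set iff unsigned byte lane i of v is greater than zero.
std::uint64_t cmpgt_zero_mask64(const unsigned char* lanes)
{
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < 64; ++i)
        if (lanes[i] > 0)
            mask |= std::uint64_t(1) << i;
    return mask;
}
```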
**Contributor:** If you have the possibility to check in terms of performance, I've written another approach to computing masks that looks quite efficient: https://github.com/serge-sans-paille/fast-bitset-from-bool-array

**Author:** I've tested the cases with sizes 8, 16, 32, and 64 (cache hit); it is 2x~8x faster than the original version, and it's also better than the avx512bw version for sizes 8, 16, and 32. It may be better to handle size 64 in avx512bw. Would you like to send another patch, or should I just update this one?

**Contributor:** Please just update this one, that's fine.
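The speedup figures quoted above come from the author's own runs; they are not reproduced in the PR. A minimal harness along these lines (entirely hypothetical names; a branchy per-bool baseline against the SWAR pack from the patch) could be used to repeat the comparison:

```cpp
#include <chrono>
#include <cstdint>
#include <cstring>

// Baseline: one shift/OR per bool.
std::uint64_t pack_loop(const unsigned char* p)
{
    std::uint64_t m = 0;
    for (int i = 0; i < 64; ++i)
        m |= std::uint64_t(p[i] != 0) << i;
    return m;
}

// SWAR variant: one multiply per 8 bools, as in detail::tobitset
// (0x0102040810204080 is the patch's magic constant, pre-folded).
std::uint64_t pack_swar(const unsigned char* p)
{
    std::uint64_t m = 0;
    for (int b = 0; b < 8; ++b)
    {
        std::uint64_t d;
        std::memcpy(&d, p + 8 * b, 8);
        m |= ((d * 0x0102040810204080ULL) >> 56) << (8 * b);
    }
    return m;
}

// Time `iters` calls of `f` on `p`, returning elapsed seconds.
double bench(std::uint64_t (*f)(const unsigned char*), const unsigned char* p, int iters)
{
    volatile std::uint64_t sink = 0; // keep the calls observable
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < iters; ++r)
        sink ^= f(p);
    (void)sink;
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}
```

Comparing `bench(pack_loop, ...)` with `bench(pack_swar, ...)` at several sizes would mirror the 8/16/32/64 measurements the author describes, though absolute ratios will vary by compiler and CPU.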