Commit 2af87f5
perf: collapse same-type aligned batch_bool_constant load to a select
The cross-type ``load_masked`` overload for ``batch_bool_constant`` builds
the result via a per-lane scalar buffer:
    for (i = 0; i < size; ++i)
        buffer[i] = mask[i] ? T(mem[i]) : T(0);
GCC at ``-O3 -DNDEBUG`` folds this loop for wide types (4-lane f32: 4
instructions, ``movd + pinsrq``) but not for narrow types: for a 16-lane
uint8_t mask on SSE4.2 it emits ~50 asm lines of stack
``mov``/``shl``/``and``/``or`` round-trips through ``-0x18(%rsp)``.
Add a same-type, aligned-mode overload that lowers to ``select``
against the constant mask. Aligned mode guarantees the whole vector
lives in one alignment unit (alignof(A) >= sizeof(batch) on every
common-fallback arch), so an unconditional load cannot fault on
inactive lanes. The new overload is more specialized than the existing
``T_in, T_out, alignment`` template, so it wins overload resolution for
the same-type aligned case while leaving cross-type and unaligned paths
untouched.
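The dispatch described above can be illustrated with a small, portable sketch. It is not the xsimd implementation: ``sketch::mask_constant``, ``sketch::loaded``, ``sketch::path`` and the two ``load_masked`` overloads are hypothetical stand-ins that use ``std::array`` in place of real SIMD registers, and the actual xsimd signatures may differ. It only shows why a same-type, aligned-mode overload is preferred by partial ordering, and that a whole-vector load followed by a select gives the same lanes as the per-lane buffer.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

namespace sketch {

struct aligned_mode {};
struct unaligned_mode {};

// Hypothetical stand-in for batch_bool_constant: a lane mask known at compile time.
template <class T, bool... Values>
struct mask_constant {
    static bool get(std::size_t i) {
        const bool v[] = {Values...};
        return v[i];
    }
};

// Records which overload produced the result, for illustration only.
enum class path { per_lane, select };

template <class T, std::size_t N>
struct loaded {
    std::array<T, N> lanes;
    path via;
};

// Generic cross-type overload: builds the result lane by lane and only
// reads memory for active lanes.
template <class T_in, class T_out, class Mode, bool... M>
loaded<T_out, sizeof...(M)> load_masked(const T_in* mem, mask_constant<T_out, M...>, Mode) {
    loaded<T_out, sizeof...(M)> r{{}, path::per_lane};
    for (std::size_t i = 0; i < sizeof...(M); ++i)
        r.lanes[i] = mask_constant<T_out, M...>::get(i) ? T_out(mem[i]) : T_out(0);
    return r;
}

// Same-type, aligned-mode overload: load the whole vector unconditionally,
// then select against the constant mask. It is more specialized than the
// generic template, so it wins when both are viable.
template <class T, bool... M>
loaded<T, sizeof...(M)> load_masked(const T* mem, mask_constant<T, M...>, aligned_mode) {
    constexpr std::size_t n = sizeof...(M);
    std::array<T, n> full{};
    std::copy(mem, mem + n, full.begin());  // reads inactive lanes too
    loaded<T, n> r{{}, path::select};
    for (std::size_t i = 0; i < n; ++i)     // select(mask, full, zero)
        r.lanes[i] = mask_constant<T, M...>::get(i) ? full[i] : T(0);
    return r;
}

}  // namespace sketch
```

In this sketch, calling ``load_masked`` with a ``std::uint8_t`` pointer, a ``std::uint8_t`` mask and ``aligned_mode`` picks the select overload, while an ``unaligned_mode`` call or a ``float`` mask over the same pointer falls back to the per-lane overload, because the specialized template cannot deduce a single ``T``.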
Codegen probe (``g++ -O3 -DNDEBUG -msse4.2``):
    function                             before    after
    load_aligned_const_u8 (mixed mask)   ~50 inst  2 inst (``pand mem, k``)
    load_aligned_const_f32 (T,F,T,F)     4 inst    2 inst (``pxor + blendps``)
Tests: 6 of 7 multi-arch builds (SSE2, SSE4.1, AVX2, AVX-512 via sde64,
RVV via qemu, emulated256) pass full ``test_xsimd``. AArch64 via qemu
shows 8 pre-existing test failures in ``[basic api] store_as(bool*,
batch_bool)`` reproduced on pristine master, unrelated to this change.
1 parent 5141ff0
1 file changed
Lines changed: 18 additions & 0 deletions