Commit 860bb55
perf: native AVX512BW masked load/store for 8/16-bit integers
8/16-bit int masked load/store on AVX512BW previously fell through to the
branchy common scalar fallback because xsimd_avx512bw.hpp had no
load_masked/store_masked overloads. Add four requires_arch<avx512bw>
overloads (runtime batch_bool + compile-time batch_bool_constant, load +
store) constrained to sizeof(T)==1||2, emitting the native vmovdqu8 /
vmovdqu16 predicated moves (2 instructions, no branch).
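As a rough illustration of what a predicated byte move does, here is a minimal sketch for the 8-bit case. The helper names `masked_load_i8` and `masked_store_i8` are hypothetical and are not the xsimd overloads added by this commit; the mask is passed as a plain 64-bit integer rather than a `batch_bool`. When the translation unit is built with AVX512BW enabled, the sketch uses the `_mm512_maskz_loadu_epi8` / `_mm512_mask_storeu_epi8` intrinsics; otherwise it falls back to an equivalent scalar loop so the semantics can still be checked.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__AVX512BW__)
#include <immintrin.h>
#endif

// Hypothetical helpers for illustration only; not part of xsimd.

// Copy 64 int8 lanes from src to out, zeroing every lane whose mask bit is 0.
// Lanes with a 0 bit are never read from src.
inline void masked_load_i8(const std::int8_t* src, std::uint64_t mask, std::int8_t* out)
{
#if defined(__AVX512BW__)
    // Single predicated byte load (vmovdqu8 with zeroing), no per-lane branch.
    __m512i v = _mm512_maskz_loadu_epi8(static_cast<__mmask64>(mask), src);
    _mm512_storeu_si512(out, v);
#else
    // Scalar reference with the same observable result.
    for (std::size_t i = 0; i < 64; ++i)
        out[i] = ((mask >> i) & 1u) ? src[i] : std::int8_t(0);
#endif
}

// Write the lanes of in whose mask bit is 1 to dst; other dst bytes are untouched.
inline void masked_store_i8(std::int8_t* dst, std::uint64_t mask, const std::int8_t* in)
{
#if defined(__AVX512BW__)
    __m512i v = _mm512_loadu_si512(in);
    // Single predicated byte store (vmovdqu8 with a write mask).
    _mm512_mask_storeu_epi8(dst, static_cast<__mmask64>(mask), v);
#else
    for (std::size_t i = 0; i < 64; ++i)
        if ((mask >> i) & 1u)
            dst[i] = in[i];
#endif
}
```

The 16-bit case would follow the same shape with a 32-bit mask and the `epi16` variants of the intrinsics.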
The size branch lives only in the runtime overloads; the constant
overloads delegate via mask.as_batch_bool(), which also avoids
batch_bool_constant::mask() (return type int) truncating a 64-lane int8
compile-time mask.
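The truncation concern can be seen with plain integers. The sketch below assumes a 32-bit `int` (true on common platforms, but an assumption) and models it with `std::uint32_t` to keep the conversion well defined; `popcount64` and `lanes_after_32bit_roundtrip` are hypothetical helpers, not xsimd functions.

```cpp
#include <cstdint>

// Hypothetical helper: count set bits, i.e. active lanes, in a lane mask.
constexpr int popcount64(std::uint64_t m)
{
    int n = 0;
    while (m != 0)
    {
        m &= m - 1; // clear the lowest set bit
        ++n;
    }
    return n;
}

// Model of a 64-lane int8 mask passing through a 32-bit integer:
// only the low 32 lane bits survive the narrowing.
constexpr int lanes_after_32bit_roundtrip(std::uint64_t mask)
{
    std::uint32_t narrowed = static_cast<std::uint32_t>(mask);
    return popcount64(narrowed);
}

// An all-true 64-lane mask keeps 64 lanes as a 64-bit value,
// but only 32 after the narrowing.
static_assert(popcount64(~std::uint64_t(0)) == 64, "64 lanes fit in 64 bits");
static_assert(lanes_after_32bit_roundtrip(~std::uint64_t(0)) == 32, "upper 32 lanes are lost");
```

Delegating through a mask representation that is as wide as the lane count sidesteps this loss.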
32/64-bit stays on the avx512f path; SSE/AVX2 8/16-bit scalar fallback is
hardware-forced and unchanged.
1 parent 262f5a7 commit 860bb55
2 files changed
Lines changed: 49 additions & 1 deletion
[Diff content not preserved. First file: 47 lines added after original line 380 (new lines 381-427).]
[Diff content not preserved. Second file: original line 359 replaced by new lines 359-360 (2 additions, 1 deletion).]