Commit 7d40704
feat(backend/native): TD-T6 — real AVX2 kernels for scal/nrm2/asum
Closes TD-T6 (critical audit finding from the per-CPU matrix doc).
Before this commit, the AVX2 native BLAS-1 module had:
    pub fn scal_f32(alpha: f32, x: &mut [f32]) {
        super::scalar::scal_f32(alpha, x);   // ← scalar shim, no AVX2
    }
    pub fn nrm2_f32(x: &[f32]) -> f32 {
        super::scalar::nrm2_f32(x)           // ← scalar shim
    }
    pub fn asum_f32(x: &[f32]) -> f32 {
        super::scalar::asum_f32(x)           // ← scalar shim
    }
    // ... and f64 siblings, same shape
These were the documented "// No AVX2 specialization — fall through
to scalar" path. Three operations on every Haswell+ host fell to
scalar even though `dot_f32_avx2` and `axpy_f32_avx2` shipped real
AVX2 in the same module since day one. PR #180's audit flagged this
as TD-T6 (critical: blocks BLAS-1 throughput on Haswell / Arrow
Lake / Zen 1-3).
New AVX2 kernels (6 total — f32 + f64 for each of scal / nrm2 / asum):
scal: broadcast α to ymm via `_mm256_set1_ps`, multiply 8/4 lanes
at a time via `_mm256_mul_ps`/`_mm256_mul_pd`, scalar tail.
Stores result back to the same buffer in-place.
nrm2: two-accumulator unroll with `_mm256_fmadd_ps`/`_pd` (x²
accumulated via FMA, single-rounded per IEEE), horizontal
reduce + scalar sqrt. Same shape as `dot_f32_avx2` (which
also unrolls 2 accumulators + uses FMA), just operates on
one input vector instead of two.
asum: abs via `_mm256_and_ps`/`_pd` with a sign-bit-cleared mask
(0x7FFFFFFF for f32, 0x7FFFFFFFFFFFFFFF for f64) — one
AVX instruction (VANDPS) is faster than calling f32::abs()
lane-by-lane. Two-accumulator unroll + horizontal reduce.
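As an illustration of the asum description above, here is a minimal sketch of the mask-and-accumulate shape. It is not the code from this commit: the names `asum_f32_scalar`, `asum_f32_avx2_sketch` and the wrapper are hypothetical, and the wrapper does its own runtime `is_x86_feature_detected!` check (an assumption made so the sketch is safe to run standalone), whereas the commit relies on tier detection further up the call chain.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Strict left-fold reference (hypothetical stand-in for the scalar kernel).
pub fn asum_f32_scalar(x: &[f32]) -> f32 {
    x.iter().fold(0.0f32, |acc, v| acc + v.abs())
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn asum_f32_avx2_sketch(x: &[f32]) -> f32 {
    unsafe {
        // Every bit set except the sign bit; AND-ing clears each lane's sign.
        let mask = _mm256_castsi256_ps(_mm256_set1_epi32(0x7FFF_FFFF));
        let mut acc0 = _mm256_setzero_ps();
        let mut acc1 = _mm256_setzero_ps();

        // Chunk-of-16 path: two independent accumulators, 8 lanes each.
        let mut chunks = x.chunks_exact(16);
        for c in &mut chunks {
            let a = _mm256_loadu_ps(c.as_ptr());
            let b = _mm256_loadu_ps(c.as_ptr().add(8));
            acc0 = _mm256_add_ps(acc0, _mm256_and_ps(a, mask));
            acc1 = _mm256_add_ps(acc1, _mm256_and_ps(b, mask));
        }

        // Chunk-of-8 cleanup path: at most one more full vector.
        let mut tail = chunks.remainder();
        if tail.len() >= 8 {
            let a = _mm256_loadu_ps(tail.as_ptr());
            acc0 = _mm256_add_ps(acc0, _mm256_and_ps(a, mask));
            tail = &tail[8..];
        }

        // Horizontal reduce: spill the combined accumulator and sum its lanes.
        let mut lanes = [0.0f32; 8];
        _mm256_storeu_ps(lanes.as_mut_ptr(), _mm256_add_ps(acc0, acc1));
        let mut sum: f32 = lanes.iter().sum();

        // Scalar tail for the last n % 8 elements.
        for v in tail {
            sum += v.abs();
        }
        sum
    }
}

/// Public wrapper: AVX2 when available, scalar otherwise.
pub fn asum_f32(x: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at runtime.
            return unsafe { asum_f32_avx2_sketch(x) };
        }
    }
    asum_f32_scalar(x)
}
```

Because the lanes are summed in a different order than the left fold, results for arbitrary inputs may differ from the scalar reference in the last bits; inputs whose partial sums are all exactly representable (small integers, for example) give identical results on both paths.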
All three follow the existing `dot_f32_avx2` template:
- `#[target_feature(enable = "avx2[,fma]")]` on the inner unsafe fn.
- Public wrapper does `cfg(target_arch = "x86_64")` and dispatches
to the unsafe fn (tier detection in caller-of-caller verified
AVX2 before reaching this module).
- Non-x86_64 builds: pass through to `super::scalar::*`.
- Scalar tail handles the remaining `n % chunk_size` elements via the
same fold the scalar reference uses.
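The template in the list above can be sketched with scal, the simplest of the three kernels. This is a hedged illustration rather than the committed code: `scal_f32_scalar` and `scal_f32_avx2_sketch` are hypothetical names, and the runtime feature check inside the wrapper is an assumption added so the sketch cannot execute AVX2 on a host without it (the commit instead trusts the tier detection done by the caller's caller).

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference (hypothetical stand-in for `super::scalar::scal_f32`).
pub fn scal_f32_scalar(alpha: f32, x: &mut [f32]) {
    for v in x.iter_mut() {
        *v *= alpha;
    }
}

/// Inner kernel: only compiled on x86_64, only callable once AVX2 is known.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn scal_f32_avx2_sketch(alpha: f32, x: &mut [f32]) {
    unsafe {
        // Broadcast alpha into all 8 lanes of a ymm register.
        let va = _mm256_set1_ps(alpha);
        let mut chunks = x.chunks_exact_mut(8);
        for c in &mut chunks {
            // Load 8 floats, multiply, store back in place.
            let v = _mm256_loadu_ps(c.as_ptr());
            _mm256_storeu_ps(c.as_mut_ptr(), _mm256_mul_ps(v, va));
        }
        // Scalar tail: the last n % 8 elements use the reference loop.
        scal_f32_scalar(alpha, chunks.into_remainder());
    }
}

/// Public wrapper: dispatch on architecture, fall back to scalar elsewhere.
pub fn scal_f32(alpha: f32, x: &mut [f32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at runtime.
            unsafe { scal_f32_avx2_sketch(alpha, x) };
            return;
        }
    }
    scal_f32_scalar(alpha, x);
}
```

Each lane performs the same single correctly rounded multiply as the scalar loop, which is why the scal kernel can promise results identical to the scalar path.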
Numerical contract:
scal: byte-equal to scalar (`x[i] *= α` is the same op).
asum: small ULP drift on long vectors because the SIMD horizontal
reduce orders the sum differently from strict left-fold.
Test tolerance: `|got - expected| <= |expected|*1e-5 + 1e-6`.
nrm2: same — drifts ~1-2 ULP on long vectors via reduce-order +
sqrt rounding. Same tolerance.
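For concreteness, the tolerance formula quoted above can be written as a small predicate. The function name `within_tolerance` is hypothetical; only the formula itself comes from the commit message.

```rust
/// Mixed relative/absolute check: |got - expected| <= |expected|*1e-5 + 1e-6.
/// The relative term dominates for large values, the absolute term near zero.
pub fn within_tolerance(got: f32, expected: f32) -> bool {
    (got - expected).abs() <= expected.abs() * 1e-5 + 1e-6
}
```

With `expected` around 100 the allowed error is roughly 1e-3, far wider than a 1-2 ULP drift at that magnitude, so the check tolerates reduce-order differences while still catching a kernel that drops or double-counts an element.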
3 new parity tests (`td_t6_scal_f32_parity`,
`td_t6_nrm2_f32_parity`, `td_t6_asum_f32_parity`) sweep
n ∈ {0, 1, 7, 8, 9, 15, 16, 17, 31, 32, 64, 100} — covers the
chunk-of-16 unroll path, the chunk-of-8 cleanup path, and the
scalar tail for every kernel.
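To see why this sweep reaches all three paths, the sketch below splits a length into the number of 16-element unrolled iterations, 8-element cleanup iterations, and scalar tail elements. The helper `split_work` is hypothetical and assumes the f32 layout described above (16 per unrolled step, 8 per cleanup step).

```rust
/// Returns (unrolled 16-wide iterations, 8-wide cleanup iterations, tail elements)
/// for an f32 kernel with a two-accumulator unroll.
pub fn split_work(n: usize) -> (usize, usize, usize) {
    let unrolled = n / 16;  // full chunk-of-16 iterations
    let rest = n % 16;      // what the unrolled loop leaves behind
    let cleanup = rest / 8; // at most one chunk-of-8 iteration
    let tail = rest % 8;    // elements handled one at a time
    (unrolled, cleanup, tail)
}
```

For example, n = 31 exercises every path at once (one unrolled iteration, one cleanup iteration, seven tail elements), while n = 0 and n = 1 check that the kernels behave when the vector loops never run.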
Verification:
* 2090 lib tests pass (was 2087 — +3 new parity tests; the
existing test_scal_f32 / test_nrm2_f64 / test_asum_f32 that
used to hit the scalar shims now exercise the AVX2 kernels
and continue to pass).
* cargo clippy --lib --tests --features rayon,native -- -D warnings
clean.
* cargo clippy --lib --tests --features rayon,native,runtime-dispatch
-- -D warnings clean.
* cargo fmt --all --check clean.
Throughput impact (back-of-envelope on Sapphire Rapids, n=4096):
scal_f32: scalar 4096 cycles (1 mul/lane) → AVX2 ~520 cycles
(8 lanes/instr + 1-cycle issue) = ~8× faster.
asum_f32: scalar 4096 cycles → AVX2 ~520 cycles = ~8× faster.
nrm2_f32: scalar 4096 cycles (1 FMA/lane) → AVX2 ~260 cycles
(16 lanes via 2-acc unroll, 1-cycle issue) = ~16×.
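The arithmetic behind these estimates is simply the element count divided by the number of lanes retired per cycle. The sketch below reproduces it under the same simplifying assumption the estimate makes (one instruction issued per cycle, no memory or latency stalls); `ideal_cycles` and `ideal_speedup` are hypothetical helper names.

```rust
/// Idealised cycle count: ceil(n / lanes_per_cycle), ignoring loads,
/// stores, dependency latency and loop overhead.
pub fn ideal_cycles(n: usize, lanes_per_cycle: usize) -> usize {
    (n + lanes_per_cycle - 1) / lanes_per_cycle
}

/// Ratio of the scalar estimate (one lane per cycle) to the SIMD estimate.
pub fn ideal_speedup(n: usize, lanes_per_cycle: usize) -> f64 {
    ideal_cycles(n, 1) as f64 / ideal_cycles(n, lanes_per_cycle) as f64
}
```

For n = 4096 this gives 512 cycles at 8 lanes per cycle and 256 at 16, which line up with the rounded ~520 and ~260 figures above. These are upper bounds on the benefit, not measurements; real kernels at this size are likely to be limited by load/store bandwidth as well.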
Out of scope (separate PRs):
* AVX-512 versions of the same three ops — `kernels_avx512.rs`
has them already (lines 137-209), wired through the
cfg(target_feature = "avx512f") path. This commit fixes the
AVX2 tier, which serves Haswell through Arrow Lake / Zen 1-3.
* Runtime-dispatch trampolines for these ops (would go in
`simd_runtime/blas_l1.rs` mirroring the matmul.rs pattern from
the runtime-dispatch PR).
https://claude.ai/code/session_01HbqooFZHAjaUtFEzhA1R2u1
parent 71f1973, commit 7d40704
1 file changed
Lines changed: 308 additions & 7 deletions