Commit 8cc6e02
perf: use vmovntdqa NT loads on unmask NT paths
Replace vmovdqu64/vmovdqu with vmovntdqa on the AVX-512 and AVX2
unmask NT-store loops. The prologue guarantees rdi is aligned to
64/32 bytes respectively, making vmovntdqa safe. Combined with
existing prefetchnta, this completes the non-temporal memory
access pattern for streaming workloads (>48MB on 7950X3D).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent ad195f5 commit 8cc6e02
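
The access pattern the commit message describes can be sketched in C intrinsics. This is a minimal illustration and not the project's code: the function names, the in-place XOR with a repeating 4-byte key, and the scalar fallback are all assumptions made for the example. `_mm256_stream_load_si256` compiles to `vmovntdqa` and `_mm256_stream_si256` to `vmovntdq`; both require a 32-byte-aligned address, which is why the commit relies on the prologue's alignment guarantee. Whether the non-temporal hint on the load changes caching behaviour on ordinary write-back memory depends on the microarchitecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <immintrin.h>

/* Hypothetical AVX2 loop: XOR `len` bytes at `buf` in place with a repeating
 * 4-byte key, using a non-temporal load and a non-temporal store per 32 bytes.
 * Preconditions (assumed to be established by a prologue, as in the commit):
 * `buf` is 32-byte aligned and `len` is a multiple of 32. */
__attribute__((target("avx2")))
static void unmask_nt_avx2(uint8_t *buf, size_t len, uint32_t key)
{
    assert(((uintptr_t)buf & 31) == 0 && (len & 31) == 0);
    const __m256i k = _mm256_set1_epi32((int)key);
    for (size_t i = 0; i < len; i += 32) {
        __m256i *p = (__m256i *)(buf + i);
        __m256i v = _mm256_stream_load_si256(p);         /* vmovntdqa */
        _mm256_stream_si256(p, _mm256_xor_si256(v, k));  /* vmovntdq  */
    }
    _mm_sfence(); /* order the NT stores before later stores */
}
#endif

/* Portable reference: byte i is XORed with byte (i mod 4) of the key,
 * taking the key's bytes in little-endian order. */
static void unmask_scalar(uint8_t *buf, size_t len, uint32_t key)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= (uint8_t)(key >> (8 * (i & 3)));
}

/* Hypothetical dispatcher: take the NT path only when AVX2 is available
 * and the alignment and length preconditions hold. */
static void unmask(uint8_t *buf, size_t len, uint32_t key)
{
#if defined(__x86_64__)
    if (__builtin_cpu_supports("avx2") &&
        ((uintptr_t)buf & 31) == 0 && (len & 31) == 0) {
        unmask_nt_avx2(buf, len, key);
        return;
    }
#endif
    unmask_scalar(buf, len, key);
}
```

Calling `unmask` twice with the same key restores the original bytes, which makes a convenient sanity check. A production loop would also handle unaligned head and tail bytes and would only choose the NT path above a size threshold, such as the >48MB figure quoted in the commit message.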
1 file changed
Lines changed: 9 additions & 8 deletions
| Hunk | Original file lines | New file lines | Change |
|---|---|---|---|
| 1 | 16-21 | 16-22 | 1 line added (new line 19) |
| 2 | 750-759 | 751-760 | 4 lines replaced (753-756 become 754-757) |
| 3 | 890-899 | 891-900 | 4 lines replaced (893-896 become 894-897) |