Commit 8cc6e02

mvogttech and claude committed
perf: use vmovntdqa NT loads on unmask NT paths
Replace vmovdqu64/vmovdqu with vmovntdqa in the AVX-512 and AVX2 unmask NT-store loops. The prologue guarantees that rdi is aligned to 64 bytes (AVX-512) or 32 bytes (AVX2), which satisfies the alignment that vmovntdqa requires. Combined with the existing prefetchnta, this completes the non-temporal memory access pattern for streaming workloads (>48MB on 7950X3D).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent ad195f5 commit 8cc6e02

1 file changed

Lines changed: 9 additions & 8 deletions

src/ws_mask_asm.asm

@@ -16,6 +16,7 @@
 ; PCMPISTRI xmm, m, im — string comparison (SSE4.2)
 ; PREFETCHT0 — temporal prefetch into all cache levels (cached path)
 ; PREFETCHNTA — non-temporal prefetch (NT-store path only)
+; VMOVNTDQA — non-temporal load hint (NT unmask path, rdi aligned)
 ; VMOVNTDQ — non-temporal store (cache-bypass)
 ; REP MOVSB — fast memcpy (ERMS/FSRM)
 ;
@@ -750,10 +751,10 @@ ws_unmask:
     align 32
 .u_nt512_loop:
     prefetchnta [rdi + 1024]
-    vmovdqu64 zmm1, [rdi]
-    vmovdqu64 zmm2, [rdi + 64]
-    vmovdqu64 zmm3, [rdi + 128]
-    vmovdqu64 zmm4, [rdi + 192]
+    vmovntdqa zmm1, [rdi] ; NT load (rdi 64-byte aligned by prologue)
+    vmovntdqa zmm2, [rdi + 64]
+    vmovntdqa zmm3, [rdi + 128]
+    vmovntdqa zmm4, [rdi + 192]
     vpxord zmm1, zmm1, zmm0
     vpxord zmm2, zmm2, zmm0
     vpxord zmm3, zmm3, zmm0
@@ -890,10 +891,10 @@ ws_unmask:
     align 32
 .u_nt_avx2_loop:
     prefetchnta [rdi + 512]
-    vmovdqu ymm1, [rdi]
-    vmovdqu ymm2, [rdi + 32]
-    vmovdqu ymm3, [rdi + 64]
-    vmovdqu ymm4, [rdi + 96]
+    vmovntdqa ymm1, [rdi] ; NT load (rdi 32-byte aligned by prologue)
+    vmovntdqa ymm2, [rdi + 32]
+    vmovntdqa ymm3, [rdi + 64]
+    vmovntdqa ymm4, [rdi + 96]
     vpxor ymm1, ymm1, ymm0
     vpxor ymm2, ymm2, ymm0
     vpxor ymm3, ymm3, ymm0
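
Both hunks rely on a prologue that is not part of this diff. It has to leave rdi on a 64-byte (zmm) or 32-byte (ymm) boundary, because vmovntdqa faults on a misaligned memory operand where vmovdqu64/vmovdqu did not. Below is a portable scalar sketch in C of what such a prologue has to compute. It is not the code in src/ws_mask_asm.asm. It assumes ws_unmask XORs the buffer with a repeating 4-byte key, as in WebSocket masking; the names head_len and unmask_model are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes to handle one at a time before p reaches `align`-byte alignment.
   `align` must be a power of two (64 for the zmm loop, 32 for the ymm loop). */
static size_t head_len(const void *p, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (size_t)(-(uintptr_t)p & (align - 1));
}

/* Hypothetical scalar model of an unmask routine with an alignment prologue.
   Assumption: the payload is XORed with a repeating 4-byte key. */
static void unmask_model(uint8_t *buf, size_t len, const uint8_t key[4],
                         size_t align) {
    size_t head = head_len(buf, align);
    if (head > len) head = len;

    /* Prologue: byte-wise XOR up to the first aligned address. */
    for (size_t i = 0; i < head; i++) buf[i] ^= key[i & 3];

    /* The aligned body starts `head` bytes into the key cycle, so the key
       must be rotated before it is broadcast to the vector register. */
    uint8_t rot[4];
    for (size_t j = 0; j < 4; j++) rot[j] = key[(head + j) & 3];

    /* Body: stands in for the vector loop. If any bytes remain,
       buf + head is `align`-aligned here, which is what an aligned-only
       load such as vmovntdqa needs. */
    for (size_t i = head; i < len; i++) buf[i] ^= rot[(i - head) & 3];
}
```

In this model the head is at most align - 1 bytes, and the key rotation depends only on head modulo 4. A real prologue would also have to hand the remaining tail (fewer bytes than one loop iteration) to a path that does not require alignment.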