Commit d4d9950
perf: condense ws_mask_asm.asm with memory-operand VPXORD and redundant-test removal
Three instruction-level optimizations applied across all mask/unmask paths:
1. Memory-operand VPXORD/VPXOR: fuse separate load + XOR into one instruction
(e.g., `vpxord zmm1, zmm0, [rdi]` replaces `vmovdqu64 zmm1, [rdi]` +
`vpxord zmm1, zmm1, zmm0`). ~50 fusions across AVX-512 and AVX2 paths.
NT unmask paths keep their separate `vmovntdqa` load on purpose: folding it
into a memory operand would turn it into a regular load and drop the
cache-bypass hint.
2. Remove redundant `test rax, rax` after `shr`: SHR with a nonzero shift
count already sets ZF from its result, so the test is dead code.
~30 instances removed.
3. RORX for 8-byte GPR mask broadcast in ws_mask_gfni: replaces 3-instruction
mov+shl+or with 2-instruction rorx+or (BMI2 guaranteed by AVX-512 gate).
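The equivalence behind item 3 can be sketched in portable C. This is only a model of the arithmetic, not the assembly from this commit: the helper names `broadcast_shift`, `broadcast_rotate`, and `rotr64` are hypothetical, and it assumes the 4-byte mask key sits zero-extended in a 64-bit register before the broadcast.

```c
#include <stdint.h>

/* Hypothetical helpers: they model the arithmetic only, not the
 * actual register allocation used in ws_mask_gfni. */

/* Old pattern: copy the key, shift the copy left by 32, OR it back
 * (mov + shl + or). */
static inline uint64_t broadcast_shift(uint32_t key) {
    uint64_t lo = key;        /* zero-extended 4-byte mask key */
    uint64_t hi = lo << 32;   /* copy moved into the upper half */
    return lo | hi;
}

/* Rotate right by n bits, valid for 0 < n < 64. RORX computes this
 * into a separate destination register without touching flags. */
static inline uint64_t rotr64(uint64_t x, unsigned n) {
    return (x >> n) | (x << (64 - n));
}

/* New pattern: rotating by 32 swaps the halves, so the key lands in
 * the upper half; one OR then completes the broadcast (rorx + or). */
static inline uint64_t broadcast_rotate(uint32_t key) {
    uint64_t lo = key;
    return lo | rotr64(lo, 32);
}
```

Both helpers return the same 8-byte pattern for every key, because the upper half of the zero-extended key is zero and the rotate by 32 is then identical to the left shift by 32. The saving comes from RORX writing to a different register than its source, which removes the separate mov.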
Net: -120 lines, 89/89 tests passing.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 7ed0e8c commit d4d9950
1 file changed
Lines changed: 84 additions & 204 deletions