
Commit d4d9950

mvogttech and claude committed
perf: condense ws_mask_asm.asm with memory-operand VPXORD and redundant-test removal
Three instruction-level optimizations applied across all mask/unmask paths:

1. Memory-operand VPXORD/VPXOR: fuse a separate load + XOR into one instruction (e.g., `vpxord zmm1, zmm0, [rdi]` replaces `vmovdqu64 zmm1, [rdi]` + `vpxord zmm1, zmm1, zmm0`). ~50 fusions across the AVX-512 and AVX2 paths. NT unmask paths (vmovntdqa) are intentionally preserved for the cache-bypass hint.

2. Remove redundant `test rax, rax` after `shr`: SHR already sets ZF from its result (for a nonzero shift count), so the test is dead code. ~30 instances removed.

3. RORX for the 8-byte GPR mask broadcast in ws_mask_gfni: replaces the 3-instruction mov+shl+or sequence with a 2-instruction rorx+or (BMI2 guaranteed by the AVX-512 gate).

Net: -120 lines, 89/89 tests passing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
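As a rough illustration of the rotate-and-OR broadcast in item 3, here is a portable C sketch, not the code from ws_mask_asm.asm. It assumes the 32-bit mask key is zero-extended in a 64-bit register, so OR-ing the value with its own 32-bit rotation fills both halves. All function names below (`sketch_ror64_32`, `sketch_broadcast_key64`, `sketch_mask_ref`, `sketch_mask_words`) are hypothetical, and the 8-byte XOR loop is a scalar stand-in for the vector paths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: rotate right by 32 bits. This is the value that
 * `rorx dst, src, 32` would produce, written portably here. */
static uint64_t sketch_ror64_32(uint64_t x) {
    return (x >> 32) | (x << 32);
}

/* Hypothetical helper: replicate a zero-extended 32-bit key into both
 * halves of a 64-bit word using rotate + OR (two operations). */
static uint64_t sketch_broadcast_key64(uint32_t key) {
    uint64_t k = key;               /* upper 32 bits are zero */
    return k | sketch_ror64_32(k);  /* low half | same value moved to high half */
}

/* Reference masking: byte i is XORed with key byte i mod 4. */
static void sketch_mask_ref(uint8_t *buf, size_t len, const uint8_t key[4]) {
    for (size_t i = 0; i < len; i++) {
        buf[i] ^= key[i & 3];
    }
}

/* Word-at-a-time masking using the broadcast key, with a byte tail. */
static void sketch_mask_words(uint8_t *buf, size_t len, const uint8_t key[4]) {
    uint32_t k32;
    memcpy(&k32, key, sizeof k32);  /* native byte order keeps key[0] first in memory */
    uint64_t k64 = sketch_broadcast_key64(k32);

    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, sizeof w);  /* unaligned-safe load */
        w ^= k64;
        memcpy(buf + i, &w, sizeof w);  /* unaligned-safe store */
    }
    /* i is a multiple of 8 here, so key[i & 3] stays in phase. */
    for (; i < len; i++) {
        buf[i] ^= key[i & 3];
    }
}
```

Because masking is a plain XOR, calling `sketch_mask_words` twice with the same key restores the original buffer, which is a convenient sanity check alongside a comparison against `sketch_mask_ref`.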
1 parent 7ed0e8c commit d4d9950

1 file changed

Lines changed: 84 additions & 204 deletions
