Add SIMD (AVX2 + NEON) acceleration for BITOP AND/OR/XOR/NOT#3605
Add SIMD (AVX2 + NEON) acceleration for BITOP AND/OR/XOR/NOT#3605ihabwahbi wants to merge 4 commits into
Conversation
Add runtime-dispatched AVX2 and NEON kernels for BITOP AND, OR, XOR, and NOT. The layout mirrors the existing popcountAVX2/popcountNEON pattern in bitops.c so SIMD support remains isolated behind the same compiler feature gates. Dispatch only when the common input length reaches 32 bytes on AVX2 or 16 bytes on NEON, with NOT using the destination length threshold. The existing scalar and word-at-a-time paths remain the fallback, preserving bit-exact results for all lengths and operands. Signed-off-by: Ihab Wahbi <ihab.a.wahbi@gmail.com>
Add GoogleTest coverage for the AVX2 and NEON BITOP helpers in
src/unit/test_bitop_simd.cpp. The suite runs ~720 cases per architecture
across 18 sizes (including 0, 1, 31-33, 4096-4097 boundaries), 9
operand counts (1-33, exercising paths beyond the existing 16-source
word fast path cap), and four length layouts (equal, ascending,
descending, mixed-with-zeros). Each case compares the SIMD output
against a byte-by-byte scalar oracle and verifies guard bytes around
the destination buffer.
Add an opt-in microbenchmark in src/unit/bench_bitop_simd.cpp that
prints GB/s for AND/OR/XOR/NOT across 4 KB to 4 MB sizes. The
benchmark is gated by BITOP_RUN_BENCH=1 so default test runs skip it;
reviewers can opt in via:
BITOP_RUN_BENCH=1 ./src/unit/valkey-unit-gtests \
--gtest_filter='BitopSimdBench.*'
Signed-off-by: Ihab Wahbi <ihab.a.wahbi@gmail.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #3605 +/- ##
==========================================
Coverage 76.65% 76.66%
==========================================
Files 162 164 +2
Lines 80612 81046 +434
==========================================
+ Hits 61795 62133 +338
- Misses 18817 18913 +96
🚀 New features to boost your workflow:
|
sarthakaggarwal97
left a comment
There was a problem hiding this comment.
Thanks for the change! The implementation looks correct to me mostly. I have some minor comments.
| class BitopSimdTest : public ::testing::Test { | ||
| protected: | ||
| #if HAVE_X86_SIMD || HAVE_ARM_NEON | ||
| void runCases(const char *impl, BitopSimdFunc func, int op) { |
There was a problem hiding this comment.
These tests are nice but maybe we should add more command level tcl tests around the simd thresholds - minlen, numkeys etc.
There was a problem hiding this comment.
Added command-level Tcl coverage in tests/unit/bitops.tcl for the SIMD thresholds (15/16/17 and 31/32/33 byte boundaries), mixed minlen/maxlen cases, and source-count boundaries (16/17/33). This is in follow-up commit 75bc35586.
|
|
||
| static int bitopTrySimd(int op, unsigned char *dst, unsigned char **src, unsigned long *len, unsigned long numkeys, unsigned long minlen, unsigned long maxlen) { | ||
| #if HAVE_X86_SIMD | ||
| if (((op == BITOP_NOT && maxlen >= 32) || (op != BITOP_NOT && minlen >= 32)) && |
There was a problem hiding this comment.
I think the old code, for numkeys > 16, used the byte loop to process the AND / OR operation. If the AND byte hit zero fairly early, it used to break and not process remaining key. With this new implementation, I guess we won't get that short circuit anymore. Do we know if AVX2 is still faster in such cases?
Maybe we can test with numkeys 17, 32, 64 against a mostly zero first source?
There was a problem hiding this comment.
Good catch. Rather than relying on a benchmark-sensitive tradeoff here, I changed dispatch so AND/OR with numkeys > 16 stay on the legacy byte loop, preserving the old early-break behavior when accumulated bytes reach 0x00 or 0xff. I also added command-level tests for AND/OR with 17/32/64 sources and a short-circuit-friendly first source. XOR/NOT still use SIMD when eligible because they did not have that short-circuit behavior.
| #endif | ||
|
|
||
| TEST(BitopSimdBench, HelperThroughput) { | ||
| if (std::getenv("BITOP_RUN_BENCH") == nullptr) GTEST_SKIP() << "Set BITOP_RUN_BENCH=1 to run"; |
There was a problem hiding this comment.
are the tests going to be skipped every time?
There was a problem hiding this comment.
Removed src/unit/bench_bitop_simd.cpp from the PR, so the default unit binary no longer contains an always-skipped benchmark test and Codecov should no longer count that benchmark-only file as uncovered patch code. The PR body now keeps only the end-to-end valkey-benchmark results.
Preserve the legacy many-source AND/OR byte fallback so short-circuit-friendly workloads keep the old behavior, and add command-level BITOP coverage around SIMD thresholds and source-count boundaries. Remove the skipped helper benchmark from the default unit binary to avoid coverage noise and always-skipped test output. Signed-off-by: Ihab Wahbi <ihab.a.wahbi@gmail.com>
📝 WalkthroughWalkthroughThis PR adds SIMD acceleration (AVX2 on x86, NEON on ARM) to Valkey's BITOP command, implementing vectorized AND/OR/XOR/NOT operations. A dispatcher selects the appropriate backend at runtime, falling back to the existing scalar path when SIMD is not beneficial or supported. Comprehensive unit tests verify correctness across architectures, and integration tests exercise real command scenarios. ChangesSIMD-accelerated BITOP
🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/bitops.c`:
- Around line 756-1236: Add short function-level rationale comments for the SIMD
helpers and dispatchers: document intents and constraints for bitopScalarRange
(zero-extension/byte-fallback semantics and early-exit behavior),
bitopAndAVX2/bitopOrAVX2/bitopXorAVX2/bitopNotAVX2 (what they optimize, expected
alignment/length assumptions, why they call _mm256_zeroupper() before falling
back to scalar), the NEON helpers
(bitopAndNEON/bitopOrNEON/bitopXorNEON/bitopNotNEON) with analogous notes, and
the dispatchers bitopUseSimd and bitopTrySimd (explain threshold choices and why
AND/OR with many sources keep scalar loop to allow early zero/0xff
short-circuiting and why NOT uses maxlen). Keep each comment brief (1–3 lines),
focused on why the function exists and any important correctness constraints.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: ba421993-2695-4e26-8bd3-69eee8ec7882
📒 Files selected for processing (3)
src/bitops.csrc/unit/test_bitop_simd.cpptests/unit/bitops.tcl
Add concise comments explaining the SIMD helper constraints, zero-extension semantics, AVX2 transition handling, and dispatch choices requested during review. Signed-off-by: Ihab Wahbi <ihab.a.wahbi@gmail.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/bitops.c (1)
797-797: ⚡ Quick winMake the new SIMD kernels file-local.
Lines 797, 852, 906, 960, 1011, 1063, 1114, and 1165 define helpers that are only used inside
src/bitops.c, so keeping external linkage unnecessarily exports symbols from this translation unit. As per coding guidelines, "Use static keyword for file-local functions in C code".Also applies to: 852-852, 906-906, 960-960, 1011-1011, 1063-1063, 1114-1114, 1165-1165
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/bitops.c` at line 797, Several helper functions (e.g., bitopAndAVX2, bitopAndAVX512, bitopAndSSE41, bitopOrAVX2, bitopOrAVX512, bitopOrSSE41, bitopXorAVX2, bitopXorAVX512 — the new SIMD kernels defined at lines noted) are currently emitted with external linkage; mark each of these functions as file-local by adding the static keyword to their definitions so they are only visible within src/bitops.c, following the project guideline to use static for internal helpers. Ensure you update the function definitions (not just declarations) to "static" for all listed kernel functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/bitops.c`:
- Line 797: Several helper functions (e.g., bitopAndAVX2, bitopAndAVX512,
bitopAndSSE41, bitopOrAVX2, bitopOrAVX512, bitopOrSSE41, bitopXorAVX2,
bitopXorAVX512 — the new SIMD kernels defined at lines noted) are currently
emitted with external linkage; mark each of these functions as file-local by
adding the static keyword to their definitions so they are only visible within
src/bitops.c, following the project guideline to use static for internal
helpers. Ensure you update the function definitions (not just declarations) to
"static" for all listed kernel functions.
| if (bitopUseSimd(op, numkeys, minlen, maxlen, 32) && __builtin_cpu_supports("avx2")) { | ||
| switch (op) { | ||
| case BITOP_AND: | ||
| bitopAndAVX2(dst, src, len, numkeys, minlen, maxlen); |
There was a problem hiding this comment.
Nit. bitopXXXAVX2 and bitopXXXNEON have the same signature. We should be able to reduce the code duplication in bitopTrySimd. And then the "fast path" fall back can be moved here, which makes the upstream call site even cleaner
| * operations that are not supported even in ARM >= v6. */ | ||
| j = 0; | ||
| if (!bitopTrySimd(op, res, src, len, numkeys, minlen, maxlen)) { | ||
| /* Fast path: as far as we have data for all the input bitmaps we |
|
|
||
| /* Preserve the legacy byte loop for many-source AND/OR so it can stop early | ||
| * when the accumulated byte reaches 0x00 or 0xff. */ | ||
| if ((op == BITOP_AND || op == BITOP_OR) && numkeys > 16) return 0; |
There was a problem hiding this comment.
16 is an implementation limit for the old code but the SIMD implementation doesn't have the same limit so this feels a bit too conservative to me. Not a blocker to me though: 16 is a reasonably large number already.
TL;DR
BITOP AND,OR,XOR, andNOT.AND/OR(numkeys > 16) so short-circuit-friendly workloads preserve the old behavior.minlen/maxlen, and source-count boundaries.Why
BITCOUNTalready has SIMD dispatch insrc/bitops.c(AVX2 on x86 whencount >= 32, NEON on AArch64 whencount >= 16).BITOPtraverses similarly large contiguous bitmap payloads but previously had no vector path.This PR adds SIMD paths for the common
BITOPoperations while preserving existing fallback behavior for small inputs, unsupported CPUs, and short-circuit-sensitive many-sourceAND/ORcases.What's new
Runtime-dispatched AVX2 and NEON kernels for
AND,OR,XOR, andNOT.SIMD is used when:
AND/ORuse at most 16 sources,XORandNOTmeet their length threshold.The existing scalar and word-at-a-time fallbacks still run when SIMD is not selected.
Review Feedback Addressed
tests/unit/bitops.tclaround 15/16/17 and 31/32/33 byte thresholds, mixedminlen/maxlen, and 16/17/33 source counts.AND/ORwith 17/32/64 sources and a short-circuit-friendly first source.AND/ORwithnumkeys > 16stay on the legacy byte loop, preserving the old early-break behavior when accumulated bytes reach0x00or0xff.Architecture
bitopScalarRange()insrc/bitops.c: shared zero-extending byte-tail helper.#if HAVE_X86_SIMDwithATTRIBUTE_TARGET_AVX2:bitopAndAVX2()bitopOrAVX2()bitopXorAVX2()bitopNotAVX2()#if HAVE_ARM_NEON:bitopAndNEON()bitopOrNEON()bitopXorNEON()bitopNotNEON()bitopUseSimd()centralizes threshold and source-count selection.bitopTrySimd()dispatches to AVX2/NEON or returns0so the existing fallback path runs.Implementation details:
_mm256_loadu_si256/_mm256_storeu_si256,vld1q_u8/vst1q_u8).ANDzeroes the suffix beyondminlen;OR/XORuse the scalar tail helper to preserve zero-extension semantics._mm256_zeroupper()before scalar fallback.Benchmarks
Reference machine:
INTEL(R) XEON(R) PLATINUM 8573C. Server and client both pinned to a single core viataskset -c 0.Numbers below were captured with two binaries built from the same tree: stock
unstablevsunstable + this PR. Bitmap sources were preloaded with deterministic random bytes.Headline: 1 MB
BITOP ANDwith 8 sources improves from 2076 rps to 2522 rps (1.21x), and p99 falls from 0.66 ms to 0.46 ms (~30% reduction).Correctness
GoogleTest coverage lives at
src/unit/test_bitop_simd.cpp.0, 1, 31, 32, 33, 4096, 4097.1, 17, 33at the helper level.Command-level coverage lives at
tests/unit/bitops.tcl.minlen < maxlenbehavior around thresholds.AND/ORwith 17/32/64 sources.Compatibility
__builtin_cpu_supports("avx2").AND/ORwith more than 16 sources intentionally use the legacy byte loop to preserve old short-circuit behavior.Testing Performed
clang-format-18 -i src/bitops.cgit diff --checkmake -j$(nproc)./runtest --single unit/bitopscmake -S . -B /tmp/valkey-build -DCMAKE_BUILD_TYPE=Release -DBUILD_UNIT_GTESTS=ON && cmake --build /tmp/valkey-build --target test-unit -j$(nproc)valkey-benchmarkstock-vs-SIMD comparison for the table aboveNot In This PR
BITOP DIFF/DIFF1/ANDOR/ONEoperations.