Skip to content

Add SIMD (AVX2 + NEON) acceleration for BITOP AND/OR/XOR/NOT#3605

Open
ihabwahbi wants to merge 4 commits into
valkey-io:unstablefrom
ihabwahbi:feat/simd-bitop
Open

Add SIMD (AVX2 + NEON) acceleration for BITOP AND/OR/XOR/NOT#3605
ihabwahbi wants to merge 4 commits into
valkey-io:unstablefrom
ihabwahbi:feat/simd-bitop

Conversation

@ihabwahbi
Copy link
Copy Markdown

@ihabwahbi ihabwahbi commented May 2, 2026

TL;DR

  • Adds runtime-dispatched AVX2 and NEON kernels for BITOP AND, OR, XOR, and NOT.
  • Keeps the legacy byte loop for many-source AND/OR (numkeys > 16) so short-circuit-friendly workloads preserve the old behavior.
  • Adds both low-level GoogleTest coverage and command-level Tcl coverage for SIMD thresholds, mixed minlen/maxlen, and source-count boundaries.

Why

BITCOUNT already has SIMD dispatch in src/bitops.c (AVX2 on x86 when count >= 32, NEON on AArch64 when count >= 16). BITOP traverses similarly large contiguous bitmap payloads but previously had no vector path.

This PR adds SIMD paths for the common BITOP operations while preserving existing fallback behavior for small inputs, unsupported CPUs, and short-circuit-sensitive many-source AND/OR cases.

What's new

Runtime-dispatched AVX2 and NEON kernels for AND, OR, XOR, and NOT.

SIMD is used when:

  • x86-64 builds have AVX2 available at runtime and the relevant length is at least 32 bytes,
  • AArch64/NEON builds have the relevant length at least 16 bytes,
  • AND/OR use at most 16 sources,
  • XOR and NOT meet their length threshold.

The existing scalar and word-at-a-time fallbacks still run when SIMD is not selected.

Review Feedback Addressed

  • Added command-level Tcl tests in tests/unit/bitops.tcl around 15/16/17 and 31/32/33 byte thresholds, mixed minlen/maxlen, and 16/17/33 source counts.
  • Added command-level tests for AND/OR with 17/32/64 sources and a short-circuit-friendly first source.
  • Changed dispatch so AND/OR with numkeys > 16 stay on the legacy byte loop, preserving the old early-break behavior when accumulated bytes reach 0x00 or 0xff.
  • Removed the benchmark-only gtest file so normal unit runs no longer contain an always-skipped benchmark test and Codecov no longer sees that file as uncovered patch code.

Architecture

  • bitopScalarRange() in src/bitops.c: shared zero-extending byte-tail helper.
  • AVX2 helpers under #if HAVE_X86_SIMD with ATTRIBUTE_TARGET_AVX2:
    • bitopAndAVX2()
    • bitopOrAVX2()
    • bitopXorAVX2()
    • bitopNotAVX2()
  • NEON helpers under #if HAVE_ARM_NEON:
    • bitopAndNEON()
    • bitopOrNEON()
    • bitopXorNEON()
    • bitopNotNEON()
  • bitopUseSimd() centralizes threshold and source-count selection.
  • bitopTrySimd() dispatches to AVX2/NEON or returns 0 so the existing fallback path runs.

Implementation details:

  • All vector loads and stores are unaligned (_mm256_loadu_si256 / _mm256_storeu_si256, vld1q_u8 / vst1q_u8).
  • AVX2 processes 8 x 32-byte blocks per unrolled loop, then 32-byte blocks, then scalar tail.
  • NEON processes 8 x 16-byte blocks per unrolled loop, then 16-byte blocks, then scalar tail.
  • AND zeroes the suffix beyond minlen; OR/XOR use the scalar tail helper to preserve zero-extension semantics.
  • AVX2 helpers call _mm256_zeroupper() before scalar fallback.

Benchmarks

Reference machine: INTEL(R) XEON(R) PLATINUM 8573C. Server and client both pinned to a single core via taskset -c 0.

Numbers below were captured with two binaries built from the same tree: stock unstable vs unstable + this PR. Bitmap sources were preloaded with deterministic random bytes.

size op sources stock rps SIMD rps speedup stock p99 ms SIMD p99 ms
1 MB AND 2 5113.78 5925.93 1.16x 0.263 0.215
1 MB OR 2 5045.41 5897.97 1.17x 0.263 0.215
1 MB XOR 2 5050.50 5851.38 1.16x 0.255 0.223
1 MB NOT 1 6693.44 7636.50 1.14x 0.207 0.183
1 MB AND 4 3484.32 3920.80 1.13x 0.383 0.311
1 MB AND 8 2076.63 2522.07 1.21x 0.663 0.463
4 KB AND 2 53619.30 56022.41 1.04x 0.039 0.039
4 KB OR 2 57471.27 57142.86 0.99x 0.031 0.039
4 KB XOR 2 56338.03 57471.27 1.02x 0.039 0.039
4 KB NOT 1 55248.62 58139.53 1.05x 0.039 0.039

Headline: 1 MB BITOP AND with 8 sources improves from 2076 rps to 2522 rps (1.21x), and p99 falls from 0.66 ms to 0.46 ms (~30% reduction).

Correctness

GoogleTest coverage lives at src/unit/test_bitop_simd.cpp.

  • Tests compare every SIMD helper output against a byte-by-byte scalar oracle.
  • Sizes include 0, 1, 31, 32, 33, 4096, 4097.
  • Source counts include 1, 17, 33 at the helper level.
  • Guard bytes around the destination buffer verify no out-of-bounds writes.

Command-level coverage lives at tests/unit/bitops.tcl.

  • Threshold tests cover 15/16/17 and 31/32/33 byte boundaries.
  • Mixed-length tests cover minlen < maxlen behavior around thresholds.
  • Source-count tests cover 16/17/33 sources.
  • Short-circuit-friendly tests cover AND/OR with 17/32/64 sources.

Compatibility

  • Output remains bit-exact identical to the existing implementation.
  • Non-x86 / non-AArch64 builds compile no SIMD code and use the existing fallback path.
  • x86 CPUs without AVX2 fall back automatically through __builtin_cpu_supports("avx2").
  • AND/OR with more than 16 sources intentionally use the legacy byte loop to preserve old short-circuit behavior.

Testing Performed

  • clang-format-18 -i src/bitops.c
  • git diff --check
  • make -j$(nproc)
  • ./runtest --single unit/bitops
  • cmake -S . -B /tmp/valkey-build -DCMAKE_BUILD_TYPE=Release -DBUILD_UNIT_GTESTS=ON && cmake --build /tmp/valkey-build --target test-unit -j$(nproc)
  • End-to-end valkey-benchmark stock-vs-SIMD comparison for the table above

Not In This PR

  • AVX-512 path.
  • SIMD acceleration for future/nonexistent BITOP DIFF / DIFF1 / ANDOR / ONE operations.
  • Benchmark-only gtest target; the earlier skipped benchmark helper was removed after review feedback.

ihabwahbi added 2 commits May 2, 2026 15:57
Add runtime-dispatched AVX2 and NEON kernels for BITOP AND, OR, XOR,
and NOT. The layout mirrors the existing popcountAVX2/popcountNEON
pattern in bitops.c so SIMD support remains isolated behind the same
compiler feature gates.

Dispatch only when the common input length reaches 32 bytes on AVX2 or
16 bytes on NEON, with NOT using the destination length threshold. The
existing scalar and word-at-a-time paths remain the fallback,
preserving bit-exact results for all lengths and operands.

Signed-off-by: Ihab Wahbi <ihab.a.wahbi@gmail.com>
Add GoogleTest coverage for the AVX2 and NEON BITOP helpers in
src/unit/test_bitop_simd.cpp. The suite runs ~720 cases per architecture
across 18 sizes (including 0, 1, 31-33, 4096-4097 boundaries), 9
operand counts (1-33, exercising paths beyond the existing 16-source
word fast path cap), and four length layouts (equal, ascending,
descending, mixed-with-zeros). Each case compares the SIMD output
against a byte-by-byte scalar oracle and verifies guard bytes around
the destination buffer.

Add an opt-in microbenchmark in src/unit/bench_bitop_simd.cpp that
prints GB/s for AND/OR/XOR/NOT across 4 KB to 4 MB sizes. The
benchmark is gated by BITOP_RUN_BENCH=1 so default test runs skip it;
reviewers can opt in via:
  BITOP_RUN_BENCH=1 ./src/unit/valkey-unit-gtests \
    --gtest_filter='BitopSimdBench.*'

Signed-off-by: Ihab Wahbi <ihab.a.wahbi@gmail.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 62.54980% with 188 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.66%. Comparing base (f2f4e5d) to head (87f368a).
⚠️ Report is 1 commits behind head on unstable.

Files with missing lines Patch % Lines
src/unit/bench_bitop_simd.cpp 1.43% 137 Missing ⚠️
src/bitops.c 79.82% 46 Missing ⚠️
src/unit/test_bitop_simd.cpp 96.29% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           unstable    #3605    +/-   ##
==========================================
  Coverage     76.65%   76.66%            
==========================================
  Files           162      164     +2     
  Lines         80612    81046   +434     
==========================================
+ Hits          61795    62133   +338     
- Misses        18817    18913    +96     
Files with missing lines Coverage Δ
src/unit/test_bitop_simd.cpp 96.29% <96.29%> (ø)
src/bitops.c 90.91% <79.82%> (-4.00%) ⬇️
src/unit/bench_bitop_simd.cpp 1.43% <1.43%> (ø)

... and 17 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@sarthakaggarwal97 sarthakaggarwal97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change! The implementation looks correct to me mostly. I have some minor comments.

class BitopSimdTest : public ::testing::Test {
protected:
#if HAVE_X86_SIMD || HAVE_ARM_NEON
void runCases(const char *impl, BitopSimdFunc func, int op) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are nice but maybe we should add more command level tcl tests around the simd thresholds - minlen, numkeys etc.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added command-level Tcl coverage in tests/unit/bitops.tcl for the SIMD thresholds (15/16/17 and 31/32/33 byte boundaries), mixed minlen/maxlen cases, and source-count boundaries (16/17/33). This is in follow-up commit 75bc35586.

Comment thread src/bitops.c Outdated

static int bitopTrySimd(int op, unsigned char *dst, unsigned char **src, unsigned long *len, unsigned long numkeys, unsigned long minlen, unsigned long maxlen) {
#if HAVE_X86_SIMD
if (((op == BITOP_NOT && maxlen >= 32) || (op != BITOP_NOT && minlen >= 32)) &&
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the old code, for numkeys > 16, used the byte loop to process the AND / OR operation. If the AND byte hit zero fairly early, it used to break and not process remaining key. With this new implementation, I guess we won't get that short circuit anymore. Do we know if AVX2 is still faster in such cases?

Maybe we can test with numkeys 17, 32, 64 against a mostly zero first source?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Rather than relying on a benchmark-sensitive tradeoff here, I changed dispatch so AND/OR with numkeys > 16 stay on the legacy byte loop, preserving the old early-break behavior when accumulated bytes reach 0x00 or 0xff. I also added command-level tests for AND/OR with 17/32/64 sources and a short-circuit-friendly first source. XOR/NOT still use SIMD when eligible because they did not have that short-circuit behavior.

Comment thread src/unit/bench_bitop_simd.cpp Outdated
#endif

TEST(BitopSimdBench, HelperThroughput) {
if (std::getenv("BITOP_RUN_BENCH") == nullptr) GTEST_SKIP() << "Set BITOP_RUN_BENCH=1 to run";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the tests going to be skipped every time?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed src/unit/bench_bitop_simd.cpp from the PR, so the default unit binary no longer contains an always-skipped benchmark test and Codecov should no longer count that benchmark-only file as uncovered patch code. The PR body now keeps only the end-to-end valkey-benchmark results.

Preserve the legacy many-source AND/OR byte fallback so short-circuit-friendly workloads keep the old behavior, and add command-level BITOP coverage around SIMD thresholds and source-count boundaries.

Remove the skipped helper benchmark from the default unit binary to avoid coverage noise and always-skipped test output.

Signed-off-by: Ihab Wahbi <ihab.a.wahbi@gmail.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 13, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds SIMD acceleration (AVX2 on x86, NEON on ARM) to Valkey's BITOP command, implementing vectorized AND/OR/XOR/NOT operations. A dispatcher selects the appropriate backend at runtime, falling back to the existing scalar path when SIMD is not beneficial or supported. Comprehensive unit tests verify correctness across architectures, and integration tests exercise real command scenarios.

Changes

SIMD-accelerated BITOP

Layer / File(s) Summary
Scalar byte-range helper
src/bitops.c
New internal utility bitopScalarRange computes bitwise operations across a byte range with per-operation semantics and early-exit optimization for AND/OR terminal values.
AVX2 and NEON SIMD implementations
src/bitops.c
x86 AVX2 and ARM NEON vector implementations for AND/OR/XOR/NOT that process large chunks via intrinsics, delegate tails to the scalar helper, and zero-fill destination padding for AND/OR operations.
SIMD dispatch and eligibility heuristics
src/bitops.c
Heuristic function bitopUseSimd determines SIMD eligibility by operation type, key count, and data length; dispatcher bitopTrySimd selects AVX2 or NEON based on CPU support and runs the chosen path, returning success/failure for caller fallback.
Integration into bitopCommand
src/bitops.c
Updates bitopCommand to call bitopTrySimd() when applicable; adjusts op variable type from unsigned long to int to match dispatcher interface, preserving fallback to existing scalar fast path and final padding logic on SIMD decline.
SIMD unit test infrastructure and cases
src/unit/test_bitop_simd.cpp
GoogleTest harness with CPU feature detection (cpuHasAvx2), scalar oracle reference implementation, deterministic RNG, and test fixture that validates SIMD results against oracle across multiple buffer sizes, key counts, and layouts, plus architecture-conditional test cases for AVX2 and NEON.
Tcl integration tests for BITOP
tests/unit/bitops.tcl
Adds helper bitop_test_pattern for generating test strings; inserts threshold and edge-case BITOP tests covering mixed input lengths, source key count variants (16/17/33 keys), and short-circuit scenarios, validating server output against scalar simulation.

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: adding SIMD acceleration (AVX2 and NEON) for BITOP operations (AND/OR/XOR/NOT).
Description check ✅ Passed The description comprehensively details the changes, rationale, architecture, benchmarks, and testing—all directly related to the SIMD acceleration implementation in the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 92.31% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/bitops.c`:
- Around line 756-1236: Add short function-level rationale comments for the SIMD
helpers and dispatchers: document intents and constraints for bitopScalarRange
(zero-extension/byte-fallback semantics and early-exit behavior),
bitopAndAVX2/bitopOrAVX2/bitopXorAVX2/bitopNotAVX2 (what they optimize, expected
alignment/length assumptions, why they call _mm256_zeroupper() before falling
back to scalar), the NEON helpers
(bitopAndNEON/bitopOrNEON/bitopXorNEON/bitopNotNEON) with analogous notes, and
the dispatchers bitopUseSimd and bitopTrySimd (explain threshold choices and why
AND/OR with many sources keep scalar loop to allow early zero/0xff
short-circuiting and why NOT uses maxlen). Keep each comment brief (1–3 lines),
focused on why the function exists and any important correctness constraints.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ba421993-2695-4e26-8bd3-69eee8ec7882

📥 Commits

Reviewing files that changed from the base of the PR and between f2f4e5d and 75bc355.

📒 Files selected for processing (3)
  • src/bitops.c
  • src/unit/test_bitop_simd.cpp
  • tests/unit/bitops.tcl

Comment thread src/bitops.c
Add concise comments explaining the SIMD helper constraints, zero-extension semantics, AVX2 transition handling, and dispatch choices requested during review.

Signed-off-by: Ihab Wahbi <ihab.a.wahbi@gmail.com>
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/bitops.c (1)

797-797: ⚡ Quick win

Make the new SIMD kernels file-local.

Lines 797, 852, 906, 960, 1011, 1063, 1114, and 1165 define helpers that are only used inside src/bitops.c, so keeping external linkage unnecessarily exports symbols from this translation unit. As per coding guidelines, "Use static keyword for file-local functions in C code".

Also applies to: 852-852, 906-906, 960-960, 1011-1011, 1063-1063, 1114-1114, 1165-1165

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/bitops.c` at line 797, Several helper functions (e.g., bitopAndAVX2,
bitopAndAVX512, bitopAndSSE41, bitopOrAVX2, bitopOrAVX512, bitopOrSSE41,
bitopXorAVX2, bitopXorAVX512 — the new SIMD kernels defined at lines noted) are
currently emitted with external linkage; mark each of these functions as
file-local by adding the static keyword to their definitions so they are only
visible within src/bitops.c, following the project guideline to use static for
internal helpers. Ensure you update the function definitions (not just
declarations) to "static" for all listed kernel functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/bitops.c`:
- Line 797: Several helper functions (e.g., bitopAndAVX2, bitopAndAVX512,
bitopAndSSE41, bitopOrAVX2, bitopOrAVX512, bitopOrSSE41, bitopXorAVX2,
bitopXorAVX512 — the new SIMD kernels defined at lines noted) are currently
emitted with external linkage; mark each of these functions as file-local by
adding the static keyword to their definitions so they are only visible within
src/bitops.c, following the project guideline to use static for internal
helpers. Ensure you update the function definitions (not just declarations) to
"static" for all listed kernel functions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: c2b0164b-f2a7-4a96-8bcc-33b81ec8d3db

📥 Commits

Reviewing files that changed from the base of the PR and between 75bc355 and 56de912.

📒 Files selected for processing (1)
  • src/bitops.c

Copy link
Copy Markdown
Member

@PingXie PingXie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM.

Comment thread src/bitops.c
if (bitopUseSimd(op, numkeys, minlen, maxlen, 32) && __builtin_cpu_supports("avx2")) {
switch (op) {
case BITOP_AND:
bitopAndAVX2(dst, src, len, numkeys, minlen, maxlen);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. bitopXXXAVX2 and bitopXXXNEON have the same signature. We should be able to reduce the code duplication in bitopTrySimd. And then the "fast path" fall back can be moved here, which makes the upstream call site even cleaner

Comment thread src/bitops.c
* operations that are not supported even in ARM >= v6. */
j = 0;
if (!bitopTrySimd(op, res, src, len, numkeys, minlen, maxlen)) {
/* Fast path: as far as we have data for all the input bitmaps we
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit. No longer "fast path" :)

Comment thread src/bitops.c

/* Preserve the legacy byte loop for many-source AND/OR so it can stop early
* when the accumulated byte reaches 0x00 or 0xff. */
if ((op == BITOP_AND || op == BITOP_OR) && numkeys > 16) return 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

16 is an implementation limit for the old code but the SIMD implementation doesn't have the same limit so this feels a bit too conservative to me. Not a blocker to me though: 16 is a reasonably large number already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants