Add SIMD (AVX2 + NEON) acceleration for BITOP AND/OR/XOR/NOT by ihabwahbi · Pull Request #3605 · valkey-io/valkey

ihabwahbi · 2026-05-02T16:02:18Z

TL;DR

Adds runtime-dispatched AVX2 and NEON kernels for BITOP AND, OR, XOR, and NOT.
Keeps the legacy byte loop for many-source AND/OR (numkeys > 16) so short-circuit-friendly workloads preserve the old behavior.
Adds both low-level GoogleTest coverage and command-level Tcl coverage for SIMD thresholds, mixed minlen/maxlen, and source-count boundaries.

Why

BITCOUNT already has SIMD dispatch in src/bitops.c (AVX2 on x86 when count >= 32, NEON on AArch64 when count >= 16). BITOP traverses similarly large contiguous bitmap payloads but previously had no vector path.

This PR adds SIMD paths for the common BITOP operations while preserving existing fallback behavior for small inputs, unsupported CPUs, and short-circuit-sensitive many-source AND/OR cases.

What's new

Runtime-dispatched AVX2 and NEON kernels for AND, OR, XOR, and NOT.

SIMD is used when:

x86-64 builds have AVX2 available at runtime and the relevant length is at least 32 bytes,
AArch64/NEON builds have the relevant length at least 16 bytes,
AND/OR use at most 16 sources,
XOR and NOT meet their length threshold.

The existing scalar and word-at-a-time fallbacks still run when SIMD is not selected.

Review Feedback Addressed

Added command-level Tcl tests in tests/unit/bitops.tcl around 15/16/17 and 31/32/33 byte thresholds, mixed minlen/maxlen, and 16/17/33 source counts.
Added command-level tests for AND/OR with 17/32/64 sources and a short-circuit-friendly first source.
Changed dispatch so AND/OR with numkeys > 16 stay on the legacy byte loop, preserving the old early-break behavior when accumulated bytes reach 0x00 or 0xff.
Removed the benchmark-only gtest file so normal unit runs no longer contain an always-skipped benchmark test and Codecov no longer sees that file as uncovered patch code.

Architecture

bitopScalarRange() in src/bitops.c: shared zero-extending byte-tail helper.
AVX2 helpers under #if HAVE_X86_SIMD with ATTRIBUTE_TARGET_AVX2:
- bitopAndAVX2()
- bitopOrAVX2()
- bitopXorAVX2()
- bitopNotAVX2()
NEON helpers under #if HAVE_ARM_NEON:
- bitopAndNEON()
- bitopOrNEON()
- bitopXorNEON()
- bitopNotNEON()
bitopUseSimd() centralizes threshold and source-count selection.
bitopTrySimd() dispatches to AVX2/NEON or returns 0 so the existing fallback path runs.

Implementation details:

All vector loads and stores are unaligned (_mm256_loadu_si256 / _mm256_storeu_si256, vld1q_u8 / vst1q_u8).
AVX2 processes 8 x 32-byte blocks per unrolled loop, then 32-byte blocks, then scalar tail.
NEON processes 8 x 16-byte blocks per unrolled loop, then 16-byte blocks, then scalar tail.
AND zeroes the suffix beyond minlen; OR/XOR use the scalar tail helper to preserve zero-extension semantics.
AVX2 helpers call _mm256_zeroupper() before scalar fallback.

Benchmarks

Reference machine: INTEL(R) XEON(R) PLATINUM 8573C. Server and client both pinned to a single core via taskset -c 0.

Numbers below were captured with two binaries built from the same tree: stock unstable vs unstable + this PR. Bitmap sources were preloaded with deterministic random bytes.

size	op	sources	stock rps	SIMD rps	speedup	stock p99 ms	SIMD p99 ms
1 MB	AND	2	5113.78	5925.93	1.16x	0.263	0.215
1 MB	OR	2	5045.41	5897.97	1.17x	0.263	0.215
1 MB	XOR	2	5050.50	5851.38	1.16x	0.255	0.223
1 MB	NOT	1	6693.44	7636.50	1.14x	0.207	0.183
1 MB	AND	4	3484.32	3920.80	1.13x	0.383	0.311
1 MB	AND	8	2076.63	2522.07	1.21x	0.663	0.463
4 KB	AND	2	53619.30	56022.41	1.04x	0.039	0.039
4 KB	OR	2	57471.27	57142.86	0.99x	0.031	0.039
4 KB	XOR	2	56338.03	57471.27	1.02x	0.039	0.039
4 KB	NOT	1	55248.62	58139.53	1.05x	0.039	0.039

Headline: 1 MB BITOP AND with 8 sources improves from 2076 rps to 2522 rps (1.21x), and p99 falls from 0.66 ms to 0.46 ms (~30% reduction).

Correctness

GoogleTest coverage lives at src/unit/test_bitop_simd.cpp.

Tests compare every SIMD helper output against a byte-by-byte scalar oracle.
Sizes include 0, 1, 31, 32, 33, 4096, 4097.
Source counts include 1, 17, 33 at the helper level.
Guard bytes around the destination buffer verify no out-of-bounds writes.

Command-level coverage lives at tests/unit/bitops.tcl.

Threshold tests cover 15/16/17 and 31/32/33 byte boundaries.
Mixed-length tests cover minlen < maxlen behavior around thresholds.
Source-count tests cover 16/17/33 sources.
Short-circuit-friendly tests cover AND/OR with 17/32/64 sources.

Compatibility

Output remains bit-exact identical to the existing implementation.
Non-x86 / non-AArch64 builds compile no SIMD code and use the existing fallback path.
x86 CPUs without AVX2 fall back automatically through __builtin_cpu_supports("avx2").
AND/OR with more than 16 sources intentionally use the legacy byte loop to preserve old short-circuit behavior.

Testing Performed

clang-format-18 -i src/bitops.c
git diff --check
make -j$(nproc)
./runtest --single unit/bitops
cmake -S . -B /tmp/valkey-build -DCMAKE_BUILD_TYPE=Release -DBUILD_UNIT_GTESTS=ON && cmake --build /tmp/valkey-build --target test-unit -j$(nproc)
End-to-end valkey-benchmark stock-vs-SIMD comparison for the table above

Not In This PR

AVX-512 path.
SIMD acceleration for future/nonexistent BITOP DIFF / DIFF1 / ANDOR / ONE operations.
Benchmark-only gtest target; the earlier skipped benchmark helper was removed after review feedback.

Add runtime-dispatched AVX2 and NEON kernels for BITOP AND, OR, XOR, and NOT. The layout mirrors the existing popcountAVX2/popcountNEON pattern in bitops.c so SIMD support remains isolated behind the same compiler feature gates. Dispatch only when the common input length reaches 32 bytes on AVX2 or 16 bytes on NEON, with NOT using the destination length threshold. The existing scalar and word-at-a-time paths remain the fallback, preserving bit-exact results for all lengths and operands. Signed-off-by: Ihab Wahbi <ihab.a.wahbi@gmail.com>

Add GoogleTest coverage for the AVX2 and NEON BITOP helpers in src/unit/test_bitop_simd.cpp. The suite runs ~720 cases per architecture across 18 sizes (including 0, 1, 31-33, 4096-4097 boundaries), 9 operand counts (1-33, exercising paths beyond the existing 16-source word fast path cap), and four length layouts (equal, ascending, descending, mixed-with-zeros). Each case compares the SIMD output against a byte-by-byte scalar oracle and verifies guard bytes around the destination buffer. Add an opt-in microbenchmark in src/unit/bench_bitop_simd.cpp that prints GB/s for AND/OR/XOR/NOT across 4 KB to 4 MB sizes. The benchmark is gated by BITOP_RUN_BENCH=1 so default test runs skip it; reviewers can opt in via: BITOP_RUN_BENCH=1 ./src/unit/valkey-unit-gtests \ --gtest_filter='BitopSimdBench.*' Signed-off-by: Ihab Wahbi <ihab.a.wahbi@gmail.com>

codecov · 2026-05-04T09:22:16Z

Codecov Report

❌ Patch coverage is 62.54980% with 188 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.66%. Comparing base (f2f4e5d) to head (87f368a).
⚠️ Report is 1 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/unit/bench_bitop_simd.cpp	1.43%	137 Missing ⚠️
src/bitops.c	79.82%	46 Missing ⚠️
src/unit/test_bitop_simd.cpp	96.29%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           unstable    #3605    +/-   ##
==========================================
  Coverage     76.65%   76.66%            
==========================================
  Files           162      164     +2     
  Lines         80612    81046   +434     
==========================================
+ Hits          61795    62133   +338     
- Misses        18817    18913    +96

Files with missing lines	Coverage Δ
src/unit/test_bitop_simd.cpp	`96.29% <96.29%> (ø)`
src/bitops.c	`90.91% <79.82%> (-4.00%)`	⬇️
src/unit/bench_bitop_simd.cpp	`1.43% <1.43%> (ø)`

... and 17 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

sarthakaggarwal97

Thanks for the change! The implementation looks correct to me mostly. I have some minor comments.

sarthakaggarwal97 · 2026-05-13T05:39:52Z

+class BitopSimdTest : public ::testing::Test {
+  protected:
+#if HAVE_X86_SIMD || HAVE_ARM_NEON
+    void runCases(const char *impl, BitopSimdFunc func, int op) {


These tests are nice but maybe we should add more command level tcl tests around the simd thresholds - minlen, numkeys etc.

Added command-level Tcl coverage in tests/unit/bitops.tcl for the SIMD thresholds (15/16/17 and 31/32/33 byte boundaries), mixed minlen/maxlen cases, and source-count boundaries (16/17/33). This is in follow-up commit 75bc35586.

sarthakaggarwal97 · 2026-05-13T05:48:54Z

+
+static int bitopTrySimd(int op, unsigned char *dst, unsigned char **src, unsigned long *len, unsigned long numkeys, unsigned long minlen, unsigned long maxlen) {
+#if HAVE_X86_SIMD
+    if (((op == BITOP_NOT && maxlen >= 32) || (op != BITOP_NOT && minlen >= 32)) &&


I think the old code, for numkeys > 16, used the byte loop to process the AND / OR operation. If the AND byte hit zero fairly early, it used to break and not process remaining key. With this new implementation, I guess we won't get that short circuit anymore. Do we know if AVX2 is still faster in such cases?

Maybe we can test with numkeys 17, 32, 64 against a mostly zero first source?

Good catch. Rather than relying on a benchmark-sensitive tradeoff here, I changed dispatch so AND/OR with numkeys > 16 stay on the legacy byte loop, preserving the old early-break behavior when accumulated bytes reach 0x00 or 0xff. I also added command-level tests for AND/OR with 17/32/64 sources and a short-circuit-friendly first source. XOR/NOT still use SIMD when eligible because they did not have that short-circuit behavior.

sarthakaggarwal97 · 2026-05-13T05:49:57Z

+#endif
+
+TEST(BitopSimdBench, HelperThroughput) {
+    if (std::getenv("BITOP_RUN_BENCH") == nullptr) GTEST_SKIP() << "Set BITOP_RUN_BENCH=1 to run";


are the tests going to be skipped every time?

Removed src/unit/bench_bitop_simd.cpp from the PR, so the default unit binary no longer contains an always-skipped benchmark test and Codecov should no longer count that benchmark-only file as uncovered patch code. The PR body now keeps only the end-to-end valkey-benchmark results.

Preserve the legacy many-source AND/OR byte fallback so short-circuit-friendly workloads keep the old behavior, and add command-level BITOP coverage around SIMD thresholds and source-count boundaries. Remove the skipped helper benchmark from the default unit binary to avoid coverage noise and always-skipped test output. Signed-off-by: Ihab Wahbi <ihab.a.wahbi@gmail.com>

coderabbitai · 2026-05-13T07:54:38Z

📝 Walkthrough

Walkthrough

This PR adds SIMD acceleration (AVX2 on x86, NEON on ARM) to Valkey's BITOP command, implementing vectorized AND/OR/XOR/NOT operations. A dispatcher selects the appropriate backend at runtime, falling back to the existing scalar path when SIMD is not beneficial or supported. Comprehensive unit tests verify correctness across architectures, and integration tests exercise real command scenarios.

Changes

SIMD-accelerated BITOP

Layer / File(s)	Summary
Scalar byte-range helper `src/bitops.c`	New internal utility `bitopScalarRange` computes bitwise operations across a byte range with per-operation semantics and early-exit optimization for AND/OR terminal values.
AVX2 and NEON SIMD implementations `src/bitops.c`	x86 AVX2 and ARM NEON vector implementations for AND/OR/XOR/NOT that process large chunks via intrinsics, delegate tails to the scalar helper, and zero-fill destination padding for AND/OR operations.
SIMD dispatch and eligibility heuristics `src/bitops.c`	Heuristic function `bitopUseSimd` determines SIMD eligibility by operation type, key count, and data length; dispatcher `bitopTrySimd` selects AVX2 or NEON based on CPU support and runs the chosen path, returning success/failure for caller fallback.
Integration into bitopCommand `src/bitops.c`	Updates bitopCommand to call `bitopTrySimd()` when applicable; adjusts `op` variable type from `unsigned long` to `int` to match dispatcher interface, preserving fallback to existing scalar fast path and final padding logic on SIMD decline.
SIMD unit test infrastructure and cases `src/unit/test_bitop_simd.cpp`	GoogleTest harness with CPU feature detection (`cpuHasAvx2`), scalar oracle reference implementation, deterministic RNG, and test fixture that validates SIMD results against oracle across multiple buffer sizes, key counts, and layouts, plus architecture-conditional test cases for AVX2 and NEON.
Tcl integration tests for BITOP `tests/unit/bitops.tcl`	Adds helper `bitop_test_pattern` for generating test strings; inserts threshold and edge-case BITOP tests covering mixed input lengths, source key count variants (16/17/33 keys), and short-circuit scenarios, validating server output against scalar simulation.

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: adding SIMD acceleration (AVX2 and NEON) for BITOP operations (AND/OR/XOR/NOT).
Description check	✅ Passed	The description comprehensively details the changes, rationale, architecture, benchmarks, and testing—all directly related to the SIMD acceleration implementation in the changeset.
Docstring Coverage	✅ Passed	Docstring coverage is 92.31% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

Generate code and open pull requests
Plan features and break down work
Investigate incidents and troubleshoot customer tickets together
Automate recurring tasks and respond to alerts with triggers
Summarize progress and report instantly

Built for teams:

Shared memory across your entire org—no repeating context
Per-thread sandboxes to safely plan and execute work
Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/bitops.c`:
- Around line 756-1236: Add short function-level rationale comments for the SIMD
helpers and dispatchers: document intents and constraints for bitopScalarRange
(zero-extension/byte-fallback semantics and early-exit behavior),
bitopAndAVX2/bitopOrAVX2/bitopXorAVX2/bitopNotAVX2 (what they optimize, expected
alignment/length assumptions, why they call _mm256_zeroupper() before falling
back to scalar), the NEON helpers
(bitopAndNEON/bitopOrNEON/bitopXorNEON/bitopNotNEON) with analogous notes, and
the dispatchers bitopUseSimd and bitopTrySimd (explain threshold choices and why
AND/OR with many sources keep scalar loop to allow early zero/0xff
short-circuiting and why NOT uses maxlen). Keep each comment brief (1–3 lines),
focused on why the function exists and any important correctness constraints.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: ba421993-2695-4e26-8bd3-69eee8ec7882

📥 Commits

Reviewing files that changed from the base of the PR and between f2f4e5d and 75bc355.

📒 Files selected for processing (3)

src/bitops.c
src/unit/test_bitop_simd.cpp
tests/unit/bitops.tcl

Add concise comments explaining the SIMD helper constraints, zero-extension semantics, AVX2 transition handling, and dispatch choices requested during review. Signed-off-by: Ihab Wahbi <ihab.a.wahbi@gmail.com>

coderabbitai

🧹 Nitpick comments (1)

src/bitops.c (1)
797-797: ⚡ Quick win

Make the new SIMD kernels file-local.

Lines 797, 852, 906, 960, 1011, 1063, 1114, and 1165 define helpers that are only used inside src/bitops.c, so keeping external linkage unnecessarily exports symbols from this translation unit. As per coding guidelines, "Use static keyword for file-local functions in C code".

Also applies to: 852-852, 906-906, 960-960, 1011-1011, 1063-1063, 1114-1114, 1165-1165
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/bitops.c` at line 797, Several helper functions (e.g., bitopAndAVX2,
bitopAndAVX512, bitopAndSSE41, bitopOrAVX2, bitopOrAVX512, bitopOrSSE41,
bitopXorAVX2, bitopXorAVX512 — the new SIMD kernels defined at lines noted) are
currently emitted with external linkage; mark each of these functions as
file-local by adding the static keyword to their definitions so they are only
visible within src/bitops.c, following the project guideline to use static for
internal helpers. Ensure you update the function definitions (not just
declarations) to "static" for all listed kernel functions.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/bitops.c`:
- Line 797: Several helper functions (e.g., bitopAndAVX2, bitopAndAVX512,
bitopAndSSE41, bitopOrAVX2, bitopOrAVX512, bitopOrSSE41, bitopXorAVX2,
bitopXorAVX512 — the new SIMD kernels defined at lines noted) are currently
emitted with external linkage; mark each of these functions as file-local by
adding the static keyword to their definitions so they are only visible within
src/bitops.c, following the project guideline to use static for internal
helpers. Ensure you update the function definitions (not just declarations) to
"static" for all listed kernel functions.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: c2b0164b-f2a7-4a96-8bcc-33b81ec8d3db

📥 Commits

Reviewing files that changed from the base of the PR and between 75bc355 and 56de912.

📒 Files selected for processing (1)

src/bitops.c

PingXie

Overall LGTM.

PingXie · 2026-05-18T05:40:20Z

+    if (bitopUseSimd(op, numkeys, minlen, maxlen, 32) && __builtin_cpu_supports("avx2")) {
+        switch (op) {
+        case BITOP_AND:
+            bitopAndAVX2(dst, src, len, numkeys, minlen, maxlen);


Nit. bitopXXXAVX2 and bitopXXXNEON have the same signature. We should be able to reduce the code duplication in bitopTrySimd. And then the "fast path" fall back can be moved here, which makes the upstream call site even cleaner

PingXie · 2026-05-18T05:42:46Z

-         * operations that are not supported even in ARM >= v6. */
-        j = 0;
+        if (!bitopTrySimd(op, res, src, len, numkeys, minlen, maxlen)) {
+            /* Fast path: as far as we have data for all the input bitmaps we


Nit. No longer "fast path" :)

PingXie · 2026-05-18T05:57:10Z

+
+    /* Preserve the legacy byte loop for many-source AND/OR so it can stop early
+     * when the accumulated byte reaches 0x00 or 0xff. */
+    if ((op == BITOP_AND || op == BITOP_OR) && numkeys > 16) return 0;


16 is an implementation limit for the old code but the SIMD implementation doesn't have the same limit so this feels a bit too conservative to me. Not a blocker to me though: 16 is a reasonably large number already.

ihabwahbi added 2 commits May 2, 2026 15:57

github-actions Bot assigned ihabwahbi May 2, 2026

sarthakaggarwal97 reviewed May 13, 2026

View reviewed changes

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

Comment thread src/bitops.c

bitops: document SIMD BITOP helper rationale

56de912

Add concise comments explaining the SIMD helper constraints, zero-extension semantics, AVX2 transition handling, and dispatch choices requested during review. Signed-off-by: Ihab Wahbi <ihab.a.wahbi@gmail.com>

coderabbitai Bot reviewed May 13, 2026

View reviewed changes

PingXie approved these changes May 18, 2026

View reviewed changes

Conversation

ihabwahbi commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TL;DR

Why

What's new

Review Feedback Addressed

Architecture

Benchmarks

Correctness

Compatibility

Testing Performed

Not In This PR

Uh oh!

codecov Bot commented May 4, 2026

Codecov Report

Uh oh!

sarthakaggarwal97 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

PingXie left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ihabwahbi commented May 2, 2026 •

edited

Loading

coderabbitai Bot commented May 13, 2026 •

edited

Loading