Skip to content

perf: tiered SIMD scanning and memory cache-line hygiene#119

Merged
tkhquang merged 3 commits into
mainfrom
perf/simd-scan-cache-hygiene
Jun 13, 2026
Merged

perf: tiered SIMD scanning and memory cache-line hygiene#119
tkhquang merged 3 commits into
mainfrom
perf/simd-scan-cache-hygiene

Conversation

@tkhquang

@tkhquang tkhquang commented Jun 13, 2026

Copy link
Copy Markdown
Owner

Summary

Performance work across the AOB scanner and the memory region cache, each gated by a microbenchmark. Default behavior and the public API are unchanged.

  • Scanner prefilter: the self-provided dmk_memchr now tiers SSE2/AVX2 at runtime over a scalar tail (previously scalar/SWAR), while keeping its AddressSanitizer-interceptor immunity.
  • AVX-512 verify tier: opt-in via the new DMK_ENABLE_AVX512 build option (off by default), runtime-gated by CPUID + XGETBV, and confined to one function so the rest of the library keeps its AVX2 baseline and still runs on CPUs without AVX-512. A new CI workflow validates its correctness under Intel SDE and runs the throughput gate on an opt-in self-hosted AVX-512 runner.
  • Memory cache: per-thread striped reader counters (summed at shutdown) replace the single global counter, and each shard is cache-line-aligned with its lock word and stampede flag stored inline, so concurrent readers no longer re-serialize on one shared cache line.
  • Scanner cascade: the enclosing-function back-scan now reads page-wise instead of byte-by-byte, and the prologue fallback reuses the match it already counted.
  • Adds the supporting tests, benchmark scenarios, documentation updates, and the AVX-512 CI workflow.

Summary by CodeRabbit

  • New Features

    • Added opt-in AVX-512 verification tier for pattern scanning, runtime-gated on CPU capability and OS state
  • Documentation

    • Expanded guides on AVX-512 configuration, memory caching architecture, and benchmark interpretation
  • Performance Improvements

    • Optimized memory cache concurrency with striped reader tracking, reducing contention
    • Improved prefilter throughput by eliminating libc dependencies
    • Enhanced string cross-reference back-scan efficiency via buffered page reads
  • Tests

    • Added validation for AVX-512 verification paths, concurrent memory operations, and contention scenarios

- scanner: SSE2/AVX2-tiered dmk_memchr prefilter; opt-in AVX-512F+BW verify tier (DMK_ENABLE_AVX512, runtime-gated, off by default)
- memory: per-thread striped reader counters and cache-line-aligned shards with the lock and in-flight flag stored inline
- scanner cascade: page-wise function back-scan and reuse of the counted match in the prologue fallback
- tests, bench scenarios, docs, and an AVX-512 CI workflow
@tkhquang tkhquang self-assigned this Jun 13, 2026
@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@tkhquang, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 14 minutes and 25 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 38bf0af6-f3a7-45ae-ae1e-42bc3906a4be

📥 Commits

Reviewing files that changed from the base of the PR and between 06e2574 and cbbdf9c.

📒 Files selected for processing (3)
  • .github/workflows/avx512.yml
  • docs/analysis/scanner_bench_v3.x/README.md
  • tests/test_memory.cpp
📝 Walkthrough

Walkthrough

This PR introduces AVX-512 optional support for scanner pattern verification with runtime gating, refactors memory cache shards to use fixed-size arrays with striped per-thread reader tracking, rewrites the prefilter to avoid libc memchr via SIMD implementations, and adds CI enforcement alongside comprehensive tests and optimizations.

Changes

Memory Cache Architecture Refactor

Layer / File(s) Summary
CacheShard struct redesign and fixed-size allocation
src/memory.cpp
CacheShard now includes cache-line alignment, embedded SrwSharedMutex, and per-shard in-flight atomic; storage changes from resizable vector to fixed-size unique_ptr heap array allocation.
Striped reader tracking and ActiveReaderGuard
src/memory.cpp
Replaces single global reader counter with per-thread striped reader-count array; ActiveReaderGuard now increments/decrements per-stripe counts to reduce shutdown contention.
Cache initialization with fixed allocation
src/memory.cpp
Updates perform_cache_initialization to allocate fixed CacheShard array once, reserve per-shard capacity in single pass, and handle allocation failure via pointer reset.
Query and cache-update paths with embedded locks
src/memory.cpp
Refactors query_and_update_cache stampede coalescing and leader/follower paths to use embedded shard.mtx and shard.in_flight instead of external arrays.
Permission checking and stats paths
src/memory.cpp
Updates is_readable, is_writable, invalidate_range, and get_cache_stats to use per-shard embedded locks under both blocking and non-blocking acquisition.
Shutdown and cleanup with striped reader-epoch
src/memory.cpp
Refactors shutdown_cache to sum striped reader counts instead of single counter; updates teardown to iterate fixed shard array under embedded locks.
Multithreaded striped reader shutdown test
tests/test_memory.cpp
Tests concurrent reader draining via ShutdownCache_DrainsManyStripedReaders with 16 reader threads exercising striped tracking.

Scanner SIMD Pipeline: Prefilter Rewrite and AVX-512 Verify Tier

Layer / File(s) Summary
CPUID and XCR0 capability detection
src/scanner.cpp
Introduces CPUID/XCR0 feature-bit checking; refactors cpu_has_avx2() to use shared OS-enabled verification; adds cpu_has_avx512() with AVX-512F+AVX-512BW feature and XCR0 gating.
SIMD needle-search prefilter implementations
src/scanner.cpp
Implements SSE2 and AVX2 needle-search bodies decorated for ASan avoidance; reroutes dmk_memchr to accept hoisted use_avx2 parameter for conditional dispatch.
Pattern verification with widest-first SIMD ordering
src/scanner.cpp
Updates find_pattern_raw to hoist AVX2/AVX-512 support once, pass use_avx2 to prefilter, and verify tiers widest-first: AVX-512 (64B) → AVX2 (32B) → SSE2/scalar.
Public SimdLevel enum and active_simd_level reporting
include/DetourModKit/scanner.hpp, src/scanner.cpp
Extends Scanner::SimdLevel with Avx512 value; updates active_simd_level() to prefer Avx512 when available, otherwise falls back to AVX2/SSE2/Scalar.
Scanner prefilter and AVX-512 path tests
tests/test_scanner.cpp
Expands prefilter boundary tests with AVX2-focused and SSE2-focused position sets; adds find_pattern_avx512_path tests for 96/128-byte patterns with seam and mismatch scenarios.
Scanner benchmark extensions: prefilter and verify gates
tests/bench_scanner.cpp
Adds run_prefilter_bench (prefilter throughput vs libc memchr) and run_verify_bench (deep-verify throughput with #GATE output for CI); extends SIMD tier reporting.

Supporting Optimizations and CI/Test Infrastructure

Layer / File(s) Summary
Scanner cascade first-match capture optimization
src/scanner_cascade.cpp
Extends count_pattern_hits_bounded with optional first_match_out parameter; updates scan_candidates_hooked_prologue to use captured match instead of re-scanning.
String xref back-scan buffering with page-aligned reads
src/string_xref.cpp
Refactors enclosing_function_start back-scan to buffer up to 8KiB in page-sized chunks via seh_read_bytes instead of per-byte guarded reads.
String xref page-boundary crossing tests
tests/test_string_xref.cpp
Adds EnclosingFunctionBackScanCrossesPageBoundary and EnclosingFunctionBackScanSkipsUnmappedLowerPage tests validating back-scan across page boundaries and unmapped regions.
AVX-512 CI verification workflow
.github/workflows/avx512.yml
Adds correctness-sde job (runs scanner tests under Intel SDE on every push/PR) and throughput-gate dispatch job (measures AVX-512 vs AVX2 speedup on self-hosted runner, enforces 1.30x minimum).
CMake option and build documentation
CMakeLists.txt, README.md
Adds DMK_ENABLE_AVX512 CMake option; documents build commands and per-function target attribute gating with runtime CPU/OS checks and AVX2 fallback.
Documentation updates: architecture and benchmarks
AGENTS.md, README.md, docs/analysis/*
Updates AGENTS.md with memory cache shard/alignment and striped-reader details; updates benchmark READMEs with prefilter throughput analysis and AVX-512 CI gate behavior.
Memory benchmark warm-hit contention phase
tests/bench_memory.cpp
Adds Phase 9 with run_warm_contention helper, pre-warms cache pool, measures is_readable throughput for 1/2/4/8 threads, emits warm_hit_mops results.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Possibly related PRs

  • tkhquang/DetourModKit#57: Both PRs modify the scanner's SIMD pattern-verification pipeline and Scanner::active_simd_level() runtime detection (main PR extends existing AVX2/SSE2 tiers by adding optional AVX-512 verify tier on top).
  • tkhquang/DetourModKit#10: Both PRs refactor memory cache implementation in src/memory.cpp around cache lookup/hit behavior, shutdown safety, and active-reader tracking via shutdown_cache().
  • tkhquang/DetourModKit#103: Both PRs modify src/scanner.cpp's AOB searching hot paths around byte scanning and tiered verification logic, including refactoring away from libc memchr.
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'perf: tiered SIMD scanning and memory cache-line hygiene' directly and clearly summarizes the main changes: tiered SIMD scanning improvements and memory cache optimizations.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_memory.cpp (1)

1365-1379: ⚡ Quick win

Make shutdown overlap deterministic in this stress test.

Line 1365 increments started before the read loop, so by Line 1379 some/all workers may already be done. That can let the test pass without really exercising the striped-reader drain path.

Suggested adjustment
+    std::atomic<bool> go{false};
     std::atomic<unsigned> started{0};
     std::atomic<unsigned> finished{0};
@@
                 started.fetch_add(1, std::memory_order_acq_rel);
+                while (!go.load(std::memory_order_acquire))
+                {
+                    std::this_thread::yield();
+                }
                 for (int i = 0; i < READS_PER_THREAD; ++i)
                 {
                     Memory::is_readable(mem, 4);
                 }
@@
     while (started.load(std::memory_order_acquire) < READER_THREADS)
     {
         std::this_thread::yield();
     }
+    go.store(true, std::memory_order_release);
 
     Memory::shutdown_cache();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_memory.cpp` around lines 1365 - 1379, The test currently
increments started before the reader loop so shutdown may run after readers
finish; modify the reader lambda so started.fetch_add(1,
std::memory_order_acq_rel) happens only once when the thread actually begins
hammering reads (e.g., move the started.fetch_add into the read loop before the
first Memory::is_readable call or guard it with if (i==0)), ensuring the main
thread's while (started.load(... ) < READER_THREADS) truly waits for active
readers before calling Memory::shutdown_cache(); keep finished.fetch_add(...)
unchanged and retain existing READS_PER_THREAD/READER_THREADS usage.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/avx512.yml:
- Line 52: The workflow references the reusable action "ilammy/msvc-dev-cmd@v1"
which should be pinned to an immutable commit; locate the two uses inside the
"correctness-sde" and "throughput-gate" jobs and replace the tag "`@v1`" with the
exact commit SHA for ilammy/msvc-dev-cmd (obtain the SHA from the action's
GitHub repository) so both occurrences are pinned to that commit SHA instead of
the mutable tag.

---

Nitpick comments:
In `@tests/test_memory.cpp`:
- Around line 1365-1379: The test currently increments started before the reader
loop so shutdown may run after readers finish; modify the reader lambda so
started.fetch_add(1, std::memory_order_acq_rel) happens only once when the
thread actually begins hammering reads (e.g., move the started.fetch_add into
the read loop before the first Memory::is_readable call or guard it with if
(i==0)), ensuring the main thread's while (started.load(... ) < READER_THREADS)
truly waits for active readers before calling Memory::shutdown_cache(); keep
finished.fetch_add(...) unchanged and retain existing
READS_PER_THREAD/READER_THREADS usage.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b9fde956-2409-4e28-8ab3-d431398a99ae

📥 Commits

Reviewing files that changed from the base of the PR and between 20a5935 and 06e2574.

📒 Files selected for processing (16)
  • .github/workflows/avx512.yml
  • AGENTS.md
  • CMakeLists.txt
  • README.md
  • docs/analysis/memory_bench_v3.x/README.md
  • docs/analysis/scanner_bench_v3.x/README.md
  • include/DetourModKit/scanner.hpp
  • src/memory.cpp
  • src/scanner.cpp
  • src/scanner_cascade.cpp
  • src/string_xref.cpp
  • tests/bench_memory.cpp
  • tests/bench_scanner.cpp
  • tests/test_memory.cpp
  • tests/test_scanner.cpp
  • tests/test_string_xref.cpp

Comment thread .github/workflows/avx512.yml Outdated
tkhquang added 2 commits June 13, 2026 10:38
- replace the broken manual Intel SDE download with the SHA-pinned petarpetrovt/setup-sde action, with a skip guard so the advisory job never hangs or hard-fails
- pin ilammy/msvc-dev-cmd to its commit SHA in both jobs
- always build the AVX-512 tier so the job verifies it compiles on MSVC even when SDE is skipped
- tighten the striped-drain test: count a reader as started only once it enters the read loop, so shutdown overlaps active readers
SDE 10.8.0 has an install-folder-dependent startup bug (the Pin injector failed from the runner's extraction path), so the hosted SDE correctness run never started. Pin a known-good earlier release and prefer the sde64.exe launcher.

Ref: https://community.intel.com/t5/Intel-ISA-Extensions/SDE-10-8-fails-to-start-depending-on-the-installation-folder/m-p/1746386
@tkhquang tkhquang merged commit 2375e07 into main Jun 13, 2026
6 checks passed
@tkhquang tkhquang deleted the perf/simd-scan-cache-hygiene branch June 13, 2026 04:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant