perf: tiered SIMD scanning and memory cache-line hygiene#119
Conversation
- scanner: SSE2/AVX2-tiered dmk_memchr prefilter; opt-in AVX-512F+BW verify tier (DMK_ENABLE_AVX512, runtime-gated, off by default) - memory: per-thread striped reader counters and cache-line-aligned shards with the lock and in-flight flag stored inline - scanner cascade: page-wise function back-scan and reuse of the counted match in the prologue fallback - tests, bench scenarios, docs, and an AVX-512 CI workflow
|
Warning Review limit reached
More reviews will be available in 14 minutes and 25 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more credits in the billing tab to continue. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR introduces AVX-512 optional support for scanner pattern verification with runtime gating, refactors memory cache shards to use fixed-size arrays with striped per-thread reader tracking, rewrites the prefilter to avoid libc memchr via SIMD implementations, and adds CI enforcement alongside comprehensive tests and optimizations. ChangesMemory Cache Architecture Refactor
Scanner SIMD Pipeline: Prefilter Rewrite and AVX-512 Verify Tier
Supporting Optimizations and CI/Test Infrastructure
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/test_memory.cpp (1)
1365-1379: ⚡ Quick winMake shutdown overlap deterministic in this stress test.
Line 1365 increments
startedbefore the read loop, so by Line 1379 some/all workers may already be done. That can let the test pass without really exercising the striped-reader drain path.Suggested adjustment
+ std::atomic<bool> go{false}; std::atomic<unsigned> started{0}; std::atomic<unsigned> finished{0}; @@ started.fetch_add(1, std::memory_order_acq_rel); + while (!go.load(std::memory_order_acquire)) + { + std::this_thread::yield(); + } for (int i = 0; i < READS_PER_THREAD; ++i) { Memory::is_readable(mem, 4); } @@ while (started.load(std::memory_order_acquire) < READER_THREADS) { std::this_thread::yield(); } + go.store(true, std::memory_order_release); Memory::shutdown_cache();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_memory.cpp` around lines 1365 - 1379, The test currently increments started before the reader loop so shutdown may run after readers finish; modify the reader lambda so started.fetch_add(1, std::memory_order_acq_rel) happens only once when the thread actually begins hammering reads (e.g., move the started.fetch_add into the read loop before the first Memory::is_readable call or guard it with if (i==0)), ensuring the main thread's while (started.load(... ) < READER_THREADS) truly waits for active readers before calling Memory::shutdown_cache(); keep finished.fetch_add(...) unchanged and retain existing READS_PER_THREAD/READER_THREADS usage.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/avx512.yml:
- Line 52: The workflow references the reusable action "ilammy/msvc-dev-cmd@v1"
which should be pinned to an immutable commit; locate the two uses inside the
"correctness-sde" and "throughput-gate" jobs and replace the tag "`@v1`" with the
exact commit SHA for ilammy/msvc-dev-cmd (obtain the SHA from the action's
GitHub repository) so both occurrences are pinned to that commit SHA instead of
the mutable tag.
---
Nitpick comments:
In `@tests/test_memory.cpp`:
- Around line 1365-1379: The test currently increments started before the reader
loop so shutdown may run after readers finish; modify the reader lambda so
started.fetch_add(1, std::memory_order_acq_rel) happens only once when the
thread actually begins hammering reads (e.g., move the started.fetch_add into
the read loop before the first Memory::is_readable call or guard it with if
(i==0)), ensuring the main thread's while (started.load(... ) < READER_THREADS)
truly waits for active readers before calling Memory::shutdown_cache(); keep
finished.fetch_add(...) unchanged and retain existing
READS_PER_THREAD/READER_THREADS usage.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b9fde956-2409-4e28-8ab3-d431398a99ae
📒 Files selected for processing (16)
.github/workflows/avx512.ymlAGENTS.mdCMakeLists.txtREADME.mddocs/analysis/memory_bench_v3.x/README.mddocs/analysis/scanner_bench_v3.x/README.mdinclude/DetourModKit/scanner.hppsrc/memory.cppsrc/scanner.cppsrc/scanner_cascade.cppsrc/string_xref.cpptests/bench_memory.cpptests/bench_scanner.cpptests/test_memory.cpptests/test_scanner.cpptests/test_string_xref.cpp
- replace the broken manual Intel SDE download with the SHA-pinned petarpetrovt/setup-sde action, with a skip guard so the advisory job never hangs or hard-fails - pin ilammy/msvc-dev-cmd to its commit SHA in both jobs - always build the AVX-512 tier so the job verifies it compiles on MSVC even when SDE is skipped - tighten the striped-drain test: count a reader as started only once it enters the read loop, so shutdown overlaps active readers
SDE 10.8.0 has an install-folder-dependent startup bug (the Pin injector failed from the runner's extraction path), so the hosted SDE correctness run never started. Pin a known-good earlier release and prefer the sde64.exe launcher. Ref: https://community.intel.com/t5/Intel-ISA-Extensions/SDE-10-8-fails-to-start-depending-on-the-installation-folder/m-p/1746386
Summary
Performance work across the AOB scanner and the memory region cache, each gated by a microbenchmark. Default behavior and the public API are unchanged.
dmk_memchrnow tiers SSE2/AVX2 at runtime over a scalar tail (previously scalar/SWAR), while keeping its AddressSanitizer-interceptor immunity.DMK_ENABLE_AVX512build option (off by default), runtime-gated by CPUID + XGETBV, and confined to one function so the rest of the library keeps its AVX2 baseline and still runs on CPUs without AVX-512. A new CI workflow validates its correctness under Intel SDE and runs the throughput gate on an opt-in self-hosted AVX-512 runner.Summary by CodeRabbit
New Features
Documentation
Performance Improvements
Tests