Skip to content

Commit a08f609

Browse files
committed
perf(event_dispatcher): lock-free emit via atomic shared_ptr snapshot
Replace shared_mutex reader lock with std::atomic<std::shared_ptr<const HandlerList>> copy-on-write snapshot for emit()/emit_safe(), plus an atomic handler counter fast path for the zero-subscriber case. Add DetourModKit_bench microbench (DMK_BUILD_BENCHMARKS) and three new tests covering the fast path, snapshot stability, and reclamation. Include loader-lock destructor hardening in HookManager to keep move-only VmtHookEntry values out of container copy fallbacks. CI: add MSVC verify job to coverage-pages.yml, gate pages deploy on both MinGW and MSVC passing.
1 parent 587f8c0 commit a08f609

14 files changed

Lines changed: 925 additions & 164 deletions

File tree

.github/workflows/coverage-pages.yml

Lines changed: 39 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -27,14 +27,10 @@ env:
2727
MINGW_BIN: C:\mingw64\bin
2828

2929
jobs:
30-
build-and-deploy:
31-
name: Build, Test & Deploy Coverage
30+
mingw-coverage:
31+
name: MinGW Build, Test & Coverage Artifact
3232
runs-on: windows-latest
3333

34-
environment:
35-
name: github-pages
36-
url: ${{ steps.deployment.outputs.page_url }}
37-
3834
steps:
3935
- name: Checkout code
4036
uses: actions/checkout@v4
@@ -102,6 +98,43 @@ jobs:
10298
with:
10399
path: coverage-report
104100

101+
msvc-verify:
102+
name: MSVC Build & Test
103+
runs-on: windows-latest
104+
105+
steps:
106+
- name: Checkout code
107+
uses: actions/checkout@v4
108+
with:
109+
submodules: "recursive"
110+
111+
- name: Set up MSVC developer environment
112+
uses: ilammy/msvc-dev-cmd@v1
113+
with:
114+
arch: x64
115+
116+
- name: Configure (MSVC Debug + Tests)
117+
run: cmake --preset msvc-debug
118+
shell: cmd
119+
120+
- name: Build
121+
run: cmake --build --preset msvc-debug --parallel
122+
shell: cmd
123+
124+
- name: Run Tests
125+
run: ctest --preset msvc-debug
126+
shell: cmd
127+
128+
deploy-pages:
129+
name: Deploy Coverage Report
130+
runs-on: ubuntu-latest
131+
needs: [mingw-coverage, msvc-verify]
132+
133+
environment:
134+
name: github-pages
135+
url: ${{ steps.deployment.outputs.page_url }}
136+
137+
steps:
105138
- name: Deploy to GitHub Pages
106139
id: deployment
107140
uses: actions/deploy-pages@v4

AGENTS.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -167,7 +167,7 @@ pending_messages_.fetch_add(1, std::memory_order_acq_rel);
167167

168168
### Resource management and patterns
169169

170-
- **RAII everywhere:** `std::unique_ptr`, `std::shared_ptr`, `std::lock_guard`, `std::scoped_lock`. No naked `new`/`delete` in application code. The only permitted exception is leak-on-purpose singletons to avoid the static destruction order fiasco (must be documented with a comment explaining why).
170+
- **RAII everywhere:** `std::unique_ptr`, `std::shared_ptr`, `std::lock_guard`, `std::scoped_lock`. No naked `new`/`delete` in application code. The only permitted exception is leak-on-purpose state to avoid teardown hazards -- specifically the static destruction order fiasco or deadlock when destruction would run under the Windows loader lock. Any such leak must be documented with a comment explaining why, must use `new (std::nothrow)` so the enclosing `noexcept` path stays honest, and must pin the current module so code pages referenced by the leaked state stay mapped (see `HookManager::~HookManager` and `Logger::shutdown_internal`).
171171
- **Rule of Zero/Five:** Prefer Rule of Zero (let compiler generate special members). When custom resource management is needed, implement all five special members. Delete copy/move when the type is non-copyable/non-movable.
172172
- **Atomic memory orderings:** Use the weakest correct ordering. `memory_order_relaxed` for counters and non-critical flags. `acquire`/`release` pairs for synchronization. Document why in comments only when the ordering is non-obvious.
173173
- **Lock ordering:** When acquiring multiple locks, document the order in the class header and follow it strictly. Example from `logger.hpp`: `1. async_mutex_` then `2. *log_mutex_ptr_`.
@@ -253,14 +253,14 @@ PATH="/c/msys64/mingw64/bin:$PATH" ./build/mingw-debug/tests/DetourModKit_tests.
253253
| Module | Thread safety | Hot-path mechanism |
254254
|--------|--------------|-------------------|
255255
| Scanner | Stateless -- inherently safe | N/A (startup only) |
256-
| HookManager | `shared_mutex` (readers) / `unique_lock` (writers); two-phase shutdown (disable under shared lock, clear under exclusive lock); `m_mutator_gate` (shared_mutex) blocks new mutators (including all VMT operations) during teardown; CAS on `m_shutdown_called` serializes shutdown/remove_all_hooks; double-checked fast-fail on `m_shutdown_called` in all mutators; destructor fallback (when `DMK_Shutdown()` was not called) acquires `m_mutator_gate` exclusively, flips `m_shutdown_called`, drains readers via exclusive `m_hooks_mutex`, then clears the maps -- under loader lock it pins the module and leaks the maps into a static vector instead of draining (mirrors the Logger::shutdown_internal pattern) | `shared_lock` for `with_inline_hook()` |
256+
| HookManager | `shared_mutex` (readers) / `unique_lock` (writers); two-phase shutdown (disable under shared lock, clear under exclusive lock); `m_mutator_gate` (shared_mutex) blocks new mutators (including all VMT operations) during teardown; CAS on `m_shutdown_called` serializes shutdown/remove_all_hooks; double-checked fast-fail on `m_shutdown_called` in all mutators; destructor fallback (when `DMK_Shutdown()` was not called) acquires `m_mutator_gate` exclusively, flips `m_shutdown_called`, drains readers via exclusive `m_hooks_mutex`, then clears the maps -- under loader lock it pins the module and move-constructs each map onto the heap via `new (std::nothrow)` so the storage outlives the destructor without ever draining, mirroring the leak-on-loader-lock discipline used in `Logger::shutdown_internal` | `shared_lock` for `with_inline_hook()` |
257257
| Logger | `atomic<shared_ptr>` for lock-free async reads; `shutdown_internal` is safe across repeated shutdown / enable_async_mode cycles: when the writer thread has to be detached under loader lock, the `shared_ptr<AsyncLogger>` is appended to a static `std::vector` rather than overwriting a single static slot, so prior handles are never dropped while their writer threads may still be running | Single atomic load on log level check |
258258
| AsyncLogger | Lock-free MPMC queue (Vyukov-style); post-join drain on shutdown (at most one message per producer can be lost in the nanosecond race between drain and force-zero -- accepted trade-off to avoid atomic overhead on every enqueue); timestamp caching in write batches | Atomic sequence numbers per slot |
259259
| InputPoller | Atomic `active_states_[]` array | `memory_order_relaxed` load per binding |
260260
| InputManager | `mutex` for lifecycle, `atomic<InputPoller*>` for reads | Lock-free `is_binding_active()` |
261261
| Memory cache | Sharded `SRWLOCK` + epoch-based shutdown | Shared reader locks per shard |
262262
| Config | `mutex` for registration; deferred setter invocation outside lock (no reentrancy guard needed -- setters may call back into Config) | N/A (startup only) |
263-
| EventDispatcher | `shared_mutex` -- shared lock for `emit()`/`emit_safe()`, exclusive lock for subscribe/unsubscribe; thread-local reentrancy guard rejects subscribe/unsubscribe from within handlers; `emit()` propagates handler exceptions, `emit_safe()` catches and skips them | `shared_lock` + contiguous vector iteration in subscription order |
263+
| EventDispatcher | Lock-free `emit()` / `emit_safe()` via `std::atomic<std::shared_ptr<const std::vector<Entry>>>` snapshot (copy-on-write publish, acquire-load on read); zero-subscriber fast path skips the snapshot load via an atomic handler counter; writers serialize on a small `std::mutex` that never touches the emit hot path; thread-local reentrancy guard rejects subscribe/unsubscribe from within handlers so the no-mutation-during-emit invariant holds; `emit()` propagates handler exceptions, `emit_safe()` catches and skips them | Atomic acquire-load of a `shared_ptr` snapshot plus linear iteration over a contiguous vector; no reader lock |
264264
| Profiler | Lock-free ring buffer via atomic `fetch_add` on write position; odd/even sequence counter per sample slot prevents torn reads during concurrent export -- the sequence is opened and closed with unconditional `fetch_add` (never a load-then-store) so concurrent producers racing on the same slot cannot roll the counter backwards; `DMK_PROFILE_SCOPE(name)` requires `name` to be a string literal, enforced at compile time by a `ScopedProfile` constructor that only binds to `const char (&)[N]` | Single atomic increment + sequence-guarded field writes per sample |
265265

266266
### Performance-critical paths

CMakeLists.txt

Lines changed: 14 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -467,8 +467,19 @@ endif()
467467
# --- Unit Tests ---
468468
option(DMK_BUILD_TESTS "Build unit tests" OFF)
469469

470-
if(DMK_BUILD_TESTS)
471-
message(STATUS "Building unit tests...")
472-
enable_testing()
470+
# --- Benchmarks ---
471+
# Opt-in microbenchmark executables (e.g. DetourModKit_bench). Gated so that
472+
# normal consumer builds do not produce extra targets. The bench sources live
473+
# alongside tests/ and are wired up in tests/CMakeLists.txt.
474+
option(DMK_BUILD_BENCHMARKS "Build benchmark executables" OFF)
475+
476+
if(DMK_BUILD_TESTS OR DMK_BUILD_BENCHMARKS)
477+
if(DMK_BUILD_TESTS)
478+
message(STATUS "Building unit tests...")
479+
enable_testing()
480+
endif()
481+
if(DMK_BUILD_BENCHMARKS)
482+
message(STATUS "Building benchmarks...")
483+
endif()
473484
add_subdirectory(tests)
474485
endif()

README.md

Lines changed: 4 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -105,13 +105,15 @@ DetourModKit is a lightweight C++ toolkit designed to simplify common tasks in g
105105

106106
- Typed pub/sub event system with RAII subscription management
107107
- Each `EventDispatcher<Event>` manages a single event type
108-
- `shared_mutex` concurrency: concurrent `emit()` via shared lock, exclusive lock for subscribe/unsubscribe
108+
- Lock-free `emit()` / `emit_safe()`: atomic acquire-load of a `std::shared_ptr<const vector>` snapshot and iterate, no reader lock; `subscribe()` / `unsubscribe()` are copy-on-write under a small writer mutex
109+
- Zero-subscriber fast path: `emit()` / `emit_safe()` short-circuit on a lock-free atomic handler counter, skipping the snapshot load entirely
109110
- Subscriptions auto-unsubscribe on destruction
110111
- Handlers invoked in subscription order (preserved across unsubscribe)
111-
- Thread-local reentrancy guard detects and rejects subscribe/unsubscribe calls from within a handler, preventing deadlock
112+
- Thread-local reentrancy guard detects and rejects subscribe/unsubscribe calls from within a handler, keeping the no-mutation-during-emit invariant intact
112113
- Compose multiple dispatchers for multi-event architectures
113114
- `emit_safe()` for exception-tolerant dispatch (recommended for hook callbacks)
114115
- Safe when the dispatcher is destroyed before its subscriptions (weak_ptr guard)
116+
- Trade-off: `subscribe()` / `unsubscribe()` allocate a new handler list each call (O(n) publish). Suited for 1-10 subscribers per event and write-rarely access patterns, which matches typical mod usage
115117

116118
</details>
117119

Lines changed: 106 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,106 @@
1+
# EventDispatcher Bench, v3.2.0
2+
3+
Before/after numbers for the lock-free COW snapshot `emit()` landed in v3.2.0.
4+
The previous implementation used `std::shared_mutex` for `emit()` / `emit_safe()`
5+
and an exclusive lock for `subscribe()` / `unsubscribe()`. The new implementation
6+
stores handlers in a `std::atomic<std::shared_ptr<const std::vector<Entry>>>`
7+
snapshot published on mutation, with a lock-free atomic handler-count fast
8+
path for the zero-subscriber case.
9+
10+
## Results
11+
12+
| Scenario | Subs | Before (ns/op) | After (ns/op) | Delta |
13+
| --------------------------- | ---: | -------------: | ------------: | ----------------- |
14+
| `emit` | 0 | 105.20 | **6.47** | **-94% (16.3x)** |
15+
| `emit` | 1 | 126.23 | 106.85 | -15% |
16+
| `emit` | 8 | 253.99 | 249.52 | -2% |
17+
| `emit` | 64 | 1121.43 | 1324.66 | +18% (regression) |
18+
| `emit_safe` | 0 | 103.55 | **6.32** | **-94% (16.4x)** |
19+
| `emit_safe` | 1 | 119.27 | 106.76 | -10% |
20+
| `emit_safe` | 8 | 231.13 | 208.92 | -10% |
21+
| `emit_safe` | 64 | 1169.86 | 1077.59 | -8% |
22+
| `subscribe_unsub_roundtrip` | 0 | 487.18 | 1125.23 | +131% (expected) |
23+
| `emit_concurrent_4_threads` | 8 | 551.73 | **268.07** | **-51% (2.06x)** |
24+
| `reentrancy_rejection` | 1 | 239.07 | 202.82 | -15% |
25+
26+
Raw TSVs in [before.tsv](before.tsv) and [after.tsv](after.tsv). Each row is the
27+
median of 11 samples. Iteration counts vary per row (10M for fast cases down to
28+
200K for the slowest) to keep per-scenario wall time comparable.
29+
30+
## Interpretation
31+
32+
**Zero-subscriber fast path.** The atomic handler-count short-circuit in
33+
`emit()` / `emit_safe()` collapses a `shared_mutex` acquire/release plus
34+
iteration setup into a single `memory_order_acquire` load of an 8-byte counter.
35+
The 16x factor is the cost of an uncontended `shared_mutex` acquire/release
36+
on Windows SRWLOCK relative to a naked atomic load, and it is the dominant
37+
result for dispatchers that are wired up at init but rarely subscribed to.
38+
39+
**1 to 8 subscriber uncontended emit.** Small consistent wins (10% to 15%)
40+
from removing the reader lock. The snapshot load is a release-acquire atomic
41+
plus a `shared_ptr` refcount bump, which is cheaper than touching a mutex's
42+
state word unconditionally.
43+
44+
**Concurrent emit (4 threads, 8 subs).** 2.06x throughput. No reader lock
45+
means no cache-line contention on the mutex state, so all four threads make
46+
progress in parallel instead of serializing on the SRWLOCK read side.
47+
48+
**64 subscriber emit, single thread.** 18% slower (+203 ns on a 1121 ns
49+
baseline). Two plausible causes:
50+
51+
1. Timer noise. On an 1100 ns run, 200 ns is 2-3 cycles worth of timer jitter
52+
amplified across the sample; the noise floor on `steady_clock` is
53+
typically in the tens of nanoseconds per sample.
54+
2. `std::atomic<std::shared_ptr>` load cost dominates over the old loop's
55+
single mutex acquire when amortized over only 64 handlers. libstdc++'s
56+
implementation uses DWCAS (cmpxchg16b) on the snapshot atomic; MSVC
57+
uses an internal spinlock.
58+
59+
Typical DetourModKit usage (per the README: 1-10 subscribers per event,
60+
dispatchers wired once at init) stays well inside the range where the
61+
optimization is a pure win. The 64 subscriber row should be treated as a
62+
worst-case indicator, not representative load.
63+
64+
**Subscribe / unsubscribe round-trip.** 2.31x slower (487 ns to 1125 ns).
65+
Each mutation allocates a fresh handler vector, appends or removes the
66+
entry, and publishes via atomic store. This is documented in the header
67+
and is the accepted tradeoff for lock-free reads. Subscribe is not on a
68+
hot path in any realistic mod workload.
69+
70+
**Concurrent emit, reentrancy rejection.** Small wins from the same
71+
fast-path removal of the shared lock.
72+
73+
## Methodology
74+
75+
- Host: Windows 11, MinGW `mingw-release` preset (GCC 13, libstdc++, -O3 LTO).
76+
- CMake: `cmake --preset mingw-release -DDMK_BUILD_BENCHMARKS=ON -DDMK_BUILD_TESTS=OFF`.
77+
- Build: `DetourModKit_bench` target only. No gtest linkage, no other test deps.
78+
- Each sample runs N iterations of the scenario inside a single
79+
`steady_clock::now()` pair. Reported value is the median per-op cost across
80+
11 samples. Iteration counts are chosen so each sample takes roughly the
81+
same wall time.
82+
- Back-to-back runs, same machine, same process start, thermal state
83+
comparable. Numbers are not hermetic; reruns on the same machine drift by
84+
a few percent at this granularity.
85+
86+
## Reproduce
87+
88+
```bash
89+
cmake --preset mingw-release -DDMK_BUILD_BENCHMARKS=ON -DDMK_BUILD_TESTS=OFF
90+
PATH="/c/msys64/mingw64/bin:$PATH" cmake --build build/mingw-release --target DetourModKit_bench --parallel
91+
PATH="/c/msys64/mingw64/bin:$PATH" ./build/mingw-release/tests/DetourModKit_bench.exe > after.tsv
92+
```
93+
94+
For a clean before/after comparison, bench the new implementation first,
95+
copy the header aside, `git checkout HEAD -- include/DetourModKit/event_dispatcher.hpp`
96+
to restore the baseline header, rebuild the `DetourModKit_bench` target, run
97+
again into `before.tsv`, then restore the new header.
98+
99+
## Caveat on committed TSVs
100+
101+
The `before.tsv` and `after.tsv` files in this directory are raw artifacts
102+
from one run on one machine. They are not a stable baseline. Treat them as
103+
evidence for the claims in this document, not as a regression gate. Future
104+
bench runs should regenerate their own numbers and compare against the
105+
structure of the results (16x fast-path win, 2x concurrent win, COW
106+
subscribe cost) rather than the absolute nanosecond values.
Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
scenario subscribers iterations median_ns_per_op total_ms
2+
emit 0 10000000 6.47 718
3+
emit 1 5000000 106.85 5860
4+
emit 8 1000000 249.52 2756
5+
emit 64 200000 1324.66 2940
6+
emit_safe 0 10000000 6.32 696
7+
emit_safe 1 5000000 106.76 5727
8+
emit_safe 8 1000000 208.92 2341
9+
emit_safe 64 200000 1077.59 2401
10+
subscribe_unsub_roundtrip 0 100000 1125.23 1259
11+
emit_concurrent_4_threads 8 4000000 268.07 1072
12+
reentrancy_rejection 1 500000 202.82 1148
13+
# sink=23940842247
Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
1+
scenario subscribers iterations median_ns_per_op total_ms
2+
emit 0 10000000 105.20 11778
3+
emit 1 5000000 126.23 6977
4+
emit 8 1000000 253.99 3539
5+
emit 64 200000 1121.43 2538
6+
emit_safe 0 10000000 103.55 11537
7+
emit_safe 1 5000000 119.27 6594
8+
emit_safe 8 1000000 231.13 2555
9+
emit_safe 64 200000 1169.86 2613
10+
subscribe_unsub_roundtrip 0 100000 487.18 545
11+
emit_concurrent_4_threads 8 4000000 551.73 2206
12+
reentrancy_rejection 1 500000 239.07 1326
13+
# sink=24038801584

0 commit comments

Comments
 (0)