| 1 | +# EventDispatcher Bench, v3.2.0 |
| 2 | + |
| 3 | +Before/after numbers for the lock-free COW snapshot `emit()` landed in v3.2.0. |
| 4 | +The previous implementation took a shared lock on a `std::shared_mutex` in `emit()` / `emit_safe()` |
| 5 | +and an exclusive lock in `subscribe()` / `unsubscribe()`. The new implementation |
| 6 | +stores handlers in a `std::atomic<std::shared_ptr<const std::vector<Entry>>>` |
| 7 | +snapshot published on mutation, with a lock-free atomic handler-count fast |
| 8 | +path for the zero-subscriber case. |
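The snapshot scheme described above can be sketched roughly as follows. This is a minimal illustration, not the DetourModKit header: `SketchDispatcher`, `Handler`, and `Snapshot` are hypothetical names, the handler signature is invented, and `unsubscribe()` is omitted. To build on pre-C++20 toolchains it swaps `std::atomic<std::shared_ptr<...>>` for the older `std::atomic_load_explicit` / `std::atomic_store_explicit` free-function overloads for `shared_ptr`, which are deprecated in C++20.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sketch of a copy-on-write dispatcher; all names are assumptions.
class SketchDispatcher {
public:
    using Handler = std::function<void(int)>;
    using Snapshot = std::vector<Handler>;

    // Mutation: copy the current vector, append, then publish the new snapshot.
    void subscribe(Handler h) {
        std::lock_guard<std::mutex> guard(writer_mutex_);  // serializes writers only
        auto current = std::atomic_load_explicit(&snapshot_, std::memory_order_acquire);
        assert(current != nullptr);
        auto next = std::make_shared<Snapshot>(*current);  // the copy-on-write step
        next->push_back(std::move(h));
        std::shared_ptr<const Snapshot> published = std::move(next);
        std::atomic_store_explicit(&snapshot_, published, std::memory_order_release);
        count_.store(published->size(), std::memory_order_release);
    }

    // Read path: returns the number of handlers invoked.
    std::size_t emit(int event) const {
        // Zero-subscriber fast path: one atomic load, no snapshot access.
        if (count_.load(std::memory_order_acquire) == 0) return 0;
        auto snap = std::atomic_load_explicit(&snapshot_, std::memory_order_acquire);
        for (const auto& h : *snap) h(event);
        return snap->size();
    }

private:
    std::mutex writer_mutex_;
    std::shared_ptr<const Snapshot> snapshot_ = std::make_shared<Snapshot>();
    std::atomic<std::size_t> count_{0};
};
```

Readers hold their own `shared_ptr` to the snapshot while iterating, so a concurrent `subscribe()` never mutates a vector that an in-flight `emit()` is walking.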
| 9 | + |
| 10 | +## Results |
| 11 | + |
| 12 | +| Scenario | Subs | Before (ns/op) | After (ns/op) | Delta | |
| 13 | +| --------------------------- | ---: | -------------: | ------------: | ----------------- | |
| 14 | +| `emit` | 0 | 105.20 | **6.47** | **-94% (16.3x)** | |
| 15 | +| `emit` | 1 | 126.23 | 106.85 | -15% | |
| 16 | +| `emit` | 8 | 253.99 | 249.52 | -2% | |
| 17 | +| `emit` | 64 | 1121.43 | 1324.66 | +18% (regression) | |
| 18 | +| `emit_safe` | 0 | 103.55 | **6.32** | **-94% (16.4x)** | |
| 19 | +| `emit_safe` | 1 | 119.27 | 106.76 | -10% | |
| 20 | +| `emit_safe` | 8 | 231.13 | 208.92 | -10% | |
| 21 | +| `emit_safe` | 64 | 1169.86 | 1077.59 | -8% | |
| 22 | +| `subscribe_unsub_roundtrip` | 0 | 487.18 | 1125.23 | +131% (expected) | |
| 23 | +| `emit_concurrent_4_threads` | 8 | 551.73 | **268.07** | **-51% (2.06x)** | |
| 24 | +| `reentrancy_rejection` | 1 | 239.07 | 202.82 | -15% | |
| 25 | + |
| 26 | +Raw TSVs in [before.tsv](before.tsv) and [after.tsv](after.tsv). Each row is the |
| 27 | +median of 11 samples. Iteration counts vary per row (10M for fast cases down to |
| 28 | +200K for the slowest) to keep per-scenario wall time comparable. |
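For readers regenerating the table, the Delta column appears to follow the usual relative-change arithmetic. The helpers below are a sketch under that assumption; `delta_percent` and `speedup` are hypothetical names, not part of the bench target.

```cpp
#include <cassert>

// Signed percentage change from before to after; negative means the new code is faster.
double delta_percent(double before_ns, double after_ns) {
    assert(before_ns > 0.0);
    return (after_ns - before_ns) / before_ns * 100.0;
}

// Speedup factor; values above 1.0 mean the new code is faster.
double speedup(double before_ns, double after_ns) {
    assert(after_ns > 0.0);
    return before_ns / after_ns;
}
```

For the zero-subscriber `emit` row, `delta_percent(105.20, 6.47)` is about -93.8 (shown as -94% in the table) and `speedup(105.20, 6.47)` is about 16.3.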
| 29 | + |
| 30 | +## Interpretation |
| 31 | + |
| 32 | +**Zero-subscriber fast path.** The atomic handler-count short-circuit in |
| 33 | +`emit()` / `emit_safe()` collapses a `shared_mutex` acquire/release plus |
| 34 | +iteration setup into a single `memory_order_acquire` load of an 8-byte counter. |
| 35 | +The 16x factor is the cost of an uncontended `shared_mutex` acquire/release |
| 36 | +on Windows SRWLOCK relative to a naked atomic load, and it is the dominant |
| 37 | +result for dispatchers that are wired up at init but rarely subscribed to. |
| 38 | + |
| 39 | +**1 to 8 subscriber uncontended emit.** Consistent wins of 10% to 15%, except |
| 40 | +`emit` at 8 subscribers (-2%, within run-to-run drift), from removing the reader |
| 41 | +lock. The snapshot load is an atomic acquire load plus a `shared_ptr` refcount bump, |
| 42 | +which is cheaper than acquiring and releasing the mutex on every call. |
| 43 | + |
| 44 | +**Concurrent emit (4 threads, 8 subs).** 2.06x throughput. No reader lock |
| 45 | +means no cache-line contention on the mutex state, so all four threads make |
| 46 | +progress in parallel instead of serializing on the SRWLOCK read side. |
| 47 | + |
| 48 | +**64 subscriber emit, single thread.** 18% slower (+203 ns on a 1121 ns |
| 49 | +baseline). Two plausible causes: |
| 50 | + |
| 51 | +1. Run-to-run noise. Timer jitter itself is negligible here: the noise floor on |
| 52 | +   `steady_clock` is typically tens of nanoseconds per sample, and each sample spans |
| 53 | +   at least 200K iterations. Scheduling or thermal drift between runs is the likelier source. |
| 54 | +2. `std::atomic<std::shared_ptr>` load cost dominates over the old loop's |
| 55 | +   single mutex acquire when amortized over only 64 handlers. libstdc++ (the |
| 56 | +   build measured here) does not use a DWCAS (`cmpxchg16b`) for the snapshot atomic; it |
| 57 | +   spins on a lock bit in the control-block pointer, and MSVC's STL takes a similar approach. |
| 58 | + |
| 59 | +Typical DetourModKit usage (per the README: 1-10 subscribers per event, |
| 60 | +dispatchers wired once at init) stays well inside the range where the |
| 61 | +optimization is a pure win. The 64 subscriber row should be treated as a |
| 62 | +worst-case indicator, not representative load. |
| 63 | + |
| 64 | +**Subscribe / unsubscribe round-trip.** 2.31x slower (487 ns to 1125 ns). |
| 65 | +Each mutation allocates a fresh handler vector, appends or removes the |
| 66 | +entry, and publishes via atomic store. This is documented in the header |
| 67 | +and is the accepted tradeoff for lock-free reads. Subscribe is not on a |
| 68 | +hot path in any realistic mod workload. |
| 69 | + |
| 70 | +**Reentrancy rejection.** A 15% win from the same removal of the shared |
| 71 | +lock on the emit path. |
| 72 | + |
| 73 | +## Methodology |
| 74 | + |
| 75 | +- Host: Windows 11, MinGW `mingw-release` preset (GCC 13, libstdc++, -O3 LTO). |
| 76 | +- CMake: `cmake --preset mingw-release -DDMK_BUILD_BENCHMARKS=ON -DDMK_BUILD_TESTS=OFF`. |
| 77 | +- Build: `DetourModKit_bench` target only. No gtest linkage, no other test deps. |
| 78 | +- Each sample runs N iterations of the scenario inside a single |
| 79 | + `steady_clock::now()` pair. Reported value is the median per-op cost across |
| 80 | + 11 samples. Iteration counts are chosen so each sample takes roughly the |
| 81 | + same wall time. |
| 82 | +- Back-to-back runs, same machine, same process start, thermal state |
| 83 | + comparable. Numbers are not hermetic; reruns on the same machine drift by |
| 84 | + a few percent at this granularity. |
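The sampling scheme in the list above can be sketched as follows. This is an illustration of the described method, not the actual `DetourModKit_bench` source: `median` and `bench_ns_per_op` are hypothetical names, and a real harness would also need to keep the compiler from optimizing the measured operation away.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Middle element after sorting; exact median for odd counts such as 11,
// upper-middle element for even counts.
double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// Times `iterations` calls of `op` inside a single steady_clock pair,
// repeats that `sample_count` times, and returns the median ns per op.
template <typename Op>
double bench_ns_per_op(Op&& op, std::size_t iterations, std::size_t sample_count = 11) {
    std::vector<double> per_op;
    per_op.reserve(sample_count);
    for (std::size_t s = 0; s < sample_count; ++s) {
        const auto start = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < iterations; ++i) op();
        const auto stop = std::chrono::steady_clock::now();
        const std::chrono::duration<double, std::nano> elapsed = stop - start;
        per_op.push_back(elapsed.count() / static_cast<double>(iterations));
    }
    return median(std::move(per_op));
}
```

Because the clock is read only twice per sample, any fixed timer overhead is divided by the iteration count, which is why the per-row iteration counts can differ without skewing the per-op figures.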
| 85 | + |
| 86 | +## Reproduce |
| 87 | + |
| 88 | +```bash |
| 89 | +cmake --preset mingw-release -DDMK_BUILD_BENCHMARKS=ON -DDMK_BUILD_TESTS=OFF |
| 90 | +PATH="/c/msys64/mingw64/bin:$PATH" cmake --build build/mingw-release --target DetourModKit_bench --parallel |
| 91 | +PATH="/c/msys64/mingw64/bin:$PATH" ./build/mingw-release/tests/DetourModKit_bench.exe > after.tsv |
| 92 | +``` |
| 93 | + |
| 94 | +For a clean before/after comparison, bench the new implementation first, |
| 95 | +copy the header aside, `git checkout HEAD -- include/DetourModKit/event_dispatcher.hpp` |
| 96 | +to restore the baseline header, rebuild the `DetourModKit_bench` target, run |
| 97 | +again into `before.tsv`, then restore the new header. |
| 98 | + |
| 99 | +## Caveat on committed TSVs |
| 100 | + |
| 101 | +The `before.tsv` and `after.tsv` files in this directory are raw artifacts |
| 102 | +from one run on one machine. They are not a stable baseline. Treat them as |
| 103 | +evidence for the claims in this document, not as a regression gate. Future |
| 104 | +bench runs should regenerate their own numbers and compare against the |
| 105 | +structure of the results (16x fast-path win, 2x concurrent win, COW |
| 106 | +subscribe cost) rather than the absolute nanosecond values. |