|
| 1 | +# EventDispatcher Bench, v3.1.0 |
| 2 | + |
| 3 | +Before/after numbers for the lock-free copy-on-write (COW) snapshot `emit()` that landed in v3.1.0. |
| 4 | +The previous implementation took a shared lock on a `std::shared_mutex` in `emit()` / `emit_safe()` |
| 5 | +and an exclusive lock in `subscribe()` / `unsubscribe()`. The new implementation |
| 6 | +stores handlers in a `std::atomic<std::shared_ptr<const std::vector<Entry>>>` |
| 7 | +snapshot that is republished on every mutation, with a lock-free atomic handler-count fast |
| 8 | +path for the zero-subscriber case. |
| 9 | + |
| 10 | +## Results (median of 5 runs per side) |
| 11 | + |
| 12 | +| Scenario | Subs | Before (ns/op) | After (ns/op) | Delta | Verdict | |
| 13 | +| --------------------------- | ---: | -------------: | ------------: | ----------------- | :------ | |
| 14 | +| `emit` | 0 | 103.9 | **6.0** | **-94.2% (17x)** | REAL | |
| 15 | +| `emit` | 1 | 120.1 | 94.4 | **-21.4%** | REAL | |
| 16 | +| `emit` | 8 | 245.6 | 216.3 | **-11.9%** | REAL | |
| 17 | +| `emit` | 64 | 1103.5 | 1092.1 | -1.0% | NOISE | |
| 18 | +| `emit_safe` | 0 | 103.1 | **5.7** | **-94.5% (18x)** | REAL | |
| 19 | +| `emit_safe` | 1 | 118.6 | 96.4 | **-18.7%** | REAL | |
| 20 | +| `emit_safe` | 8 | 233.2 | 219.1 | -6.0% | REAL | |
| 21 | +| `emit_safe` | 64 | 1086.3 | 1099.8 | +1.2% | NOISE | |
| 22 | +| `emit_concurrent_4_threads` | 8 | 517.9 | **248.2** | **-52.1% (2.1x)** | REAL | |
| 23 | +| `subscribe_unsub_roundtrip` | — | 446.0 | 1150.4 | +158.0% | REAL | |
| 24 | +| `reentrancy_rejection` | 1 | 212.5 | 192.7 | -9.4% | marginal | |
| 25 | + |
| 26 | +Verdict key: |
| 27 | + |
| 28 | +- **REAL**: median delta exceeds 1.5x the run-to-run spread (max minus min across the 5 runs) on both sides. |
| 29 | +- **NOISE**: median delta is smaller than the run-to-run spread; it cannot be distinguished from measurement jitter. |
| 30 | +- **marginal**: median delta is larger than the spread but smaller than 1.5x the spread. |
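The verdict rule can be sketched in a few lines of C++. This is an illustration, not the script used to produce the table: the function names are hypothetical, and it assumes the noise floor is the larger of the two sides' spreads, which is one reading of "on both sides".

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Median of a non-empty set of per-run results (sorts a private copy).
double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return (n % 2 == 1) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

// Run-to-run spread: max minus min across the runs of one side.
double spread(const std::vector<double>& v) {
    const auto mm = std::minmax_element(v.begin(), v.end());
    return *mm.second - *mm.first;
}

// Relative change of the "after" value against the "before" value, in percent.
double delta_percent(double before, double after) {
    return (after - before) / before * 100.0;
}

// Assumption: the noise floor is the larger spread of the two sides.
std::string verdict(const std::vector<double>& before,
                    const std::vector<double>& after) {
    const double delta = std::fabs(median(after) - median(before));
    const double noise = std::max(spread(before), spread(after));
    if (delta > 1.5 * noise) return "REAL";
    if (delta > noise) return "marginal";
    return "NOISE";
}
```

Feeding it the five per-run values for one scenario from each side would yield one of the three labels; `delta_percent(103.9, 6.0)` gives roughly -94.2, matching the zero-subscriber `emit` row.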
| 31 | + |
| 32 | +Run-to-run coefficient of variation was 1% to 5% per scenario. Full per-run |
| 33 | +TSVs live in [runs/](runs/) (5 OLD + 5 NEW). A representative single run per |
| 34 | +side is preserved in [before.tsv](before.tsv) and [after.tsv](after.tsv) for |
| 35 | +quick reference. |
| 36 | + |
| 37 | +## Interpretation |
| 38 | + |
| 39 | +**Zero-subscriber fast path.** The atomic handler-count short-circuit in |
| 40 | +`emit()` / `emit_safe()` collapses a `shared_mutex` acquire/release plus |
| 41 | +iteration setup into a single `memory_order_acquire` load of an 8-byte counter. |
| 42 | +The 17x factor is the cost of an uncontended `shared_mutex` acquire/release |
| 43 | +on Windows SRWLOCK relative to a naked atomic load, and it is the dominant |
| 44 | +result for dispatchers that are wired up at init but rarely subscribed to. |
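A minimal sketch of the fast-path idea follows. The class and member names are hypothetical and do not reflect DetourModKit's actual API; the slow path uses a plain `std::mutex` as a stand-in so the example stays short.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical dispatcher showing only the zero-subscriber short-circuit.
class FastPathDispatcher {
public:
    void subscribe(std::function<void(int)> handler) {
        std::lock_guard<std::mutex> lock(mutex_);
        handlers_.push_back(std::move(handler));
        // Publish the new count only after the handler is stored.
        count_.store(handlers_.size(), std::memory_order_release);
    }

    // Returns the number of handlers that were invoked.
    std::size_t emit(int value) {
        // Fast path: a single acquire load, no lock when nobody is subscribed.
        if (count_.load(std::memory_order_acquire) == 0) return 0;
        // Slow path (placeholder): lock and iterate.
        std::lock_guard<std::mutex> lock(mutex_);
        for (auto& h : handlers_) h(value);
        return handlers_.size();
    }

private:
    std::mutex mutex_;
    std::vector<std::function<void(int)>> handlers_;
    std::atomic<std::size_t> count_{0};
};
```

With no subscribers, `emit()` returns after the counter load and never touches the mutex, which is the case the 0-subscriber rows measure.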
| 45 | + |
| 46 | +**1 to 8 subscriber uncontended emit.** Consistent wins (6% to 21%) from |
| 47 | +removing the reader lock. The snapshot load is an acquire load (paired with |
| 48 | +the release store on publish) plus a `shared_ptr` refcount increment, which is |
| 49 | +cheaper than unconditionally touching a mutex's state word. |
| 50 | + |
| 51 | +**Concurrent emit (4 threads, 8 subs).** 2.1x throughput. No reader lock |
| 52 | +means no cache-line contention on the mutex state, so all four threads make |
| 53 | +progress in parallel instead of serializing on the SRWLOCK read side. |
| 54 | + |
| 55 | +**64 subscriber emit.** Within noise on both `emit` (-1.0%) and `emit_safe` |
| 56 | +(+1.2%). An earlier single-run measurement suggested an 18% regression; that |
| 57 | +was a statistical outlier. Across 5 runs per side the two implementations |
| 58 | +are indistinguishable at this subscriber count: the per-handler iteration |
| 59 | +cost dominates, and both paths end up iterating the same contiguous |
| 60 | +`std::vector<Entry>` buffer, with one extra dereference to reach it in either case. |
| 61 | + |
| 62 | +**Subscribe / unsubscribe round-trip.** 2.6x slower (446 ns to 1150 ns). |
| 63 | +Each mutation copies the current handler vector into a fresh allocation, |
| 64 | +appends or removes the entry in the copy, and publishes it via an atomic store. |
| 65 | +This is documented in the header and is the accepted tradeoff for lock-free |
| 66 | +reads. Subscribe is not on a hot path in any realistic mod workload. |
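The copy-on-write read and write paths can be sketched as below. This is a hedged illustration, not the library's code. It uses the free-function `std::atomic_load` / `std::atomic_store` overloads for `std::shared_ptr` instead of `std::atomic<std::shared_ptr<...>>` so that it also builds before C++20 (those overloads are deprecated from C++20 on). Since `Entry` is not shown here, a `std::function<void(int)>` stands in for it, and all names are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical COW dispatcher: readers load a snapshot, writers copy and publish.
class CowDispatcher {
public:
    using Handler = std::function<void(int)>;
    using Snapshot = std::vector<Handler>;

    void subscribe(Handler handler) {
        // Writers serialize among themselves; readers never take this lock.
        std::lock_guard<std::mutex> lock(write_mutex_);
        std::shared_ptr<const Snapshot> current = std::atomic_load(&snapshot_);
        // Copy every existing entry into a fresh vector: the cost that the
        // subscribe/unsubscribe round-trip benchmark measures.
        auto next = std::make_shared<Snapshot>(*current);
        next->push_back(std::move(handler));
        std::atomic_store(&snapshot_,
                          std::shared_ptr<const Snapshot>(std::move(next)));
    }

    // Returns the number of handlers invoked from the loaded snapshot.
    std::size_t emit(int value) const {
        // The loaded snapshot stays alive for the whole iteration even if a
        // writer publishes a newer one in the meantime.
        std::shared_ptr<const Snapshot> snap = std::atomic_load(&snapshot_);
        for (const auto& h : *snap) h(value);
        return snap->size();
    }

private:
    std::mutex write_mutex_;
    std::shared_ptr<const Snapshot> snapshot_{std::make_shared<Snapshot>()};
};
```

The design choice is visible in the asymmetry: `emit()` pays for one atomic load and a refcount increment, while `subscribe()` pays for an allocation plus a full copy of the handler list.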
| 67 | + |
| 68 | +**Reentrancy rejection.** Marginal improvement: the delta exceeds the run-to-run |
| 69 | +spread but not 1.5x the spread. Not a meaningful claim; effectively unchanged. |
| 70 | + |
| 71 | +## Methodology |
| 72 | + |
| 73 | +- Host: Windows 11, MinGW `mingw-release` preset (GCC 13, libstdc++, -O3 LTO). |
| 74 | +- CMake: `cmake --preset mingw-release -DDMK_BUILD_BENCHMARKS=ON -DDMK_BUILD_TESTS=OFF`. |
| 75 | +- Build: `DetourModKit_bench` target only. No gtest linkage, no other test deps. |
| 76 | +- Each sample runs N iterations of the scenario inside a single |
| 77 | + `steady_clock::now()` pair. Reported value is the median per-op cost across |
| 78 | + 11 samples inside one process invocation. Iteration counts are chosen so |
| 79 | + each sample takes roughly the same wall time. |
| 80 | +- 5 process invocations per side (OLD vs NEW), back-to-back, same machine, |
| 81 | + same thermal state. The table above reports the median across those 5 runs |
| 82 | + for each scenario. |
| 83 | +- Verdicts use run-to-run spread (max minus min across the 5 runs) as the |
| 84 | + noise floor. A claim is "REAL" only when the median delta exceeds 1.5x |
| 85 | + that noise floor on both sides. |
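The sampling scheme described in the list above could look roughly like this. It is a sketch under assumptions, not the actual `DetourModKit_bench` harness: the function names are hypothetical, and a real harness would also need to stop the optimizer from eliding the measured operation.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// One sample: time `iterations` calls of `op` inside a single
// steady_clock::now() pair and return the mean cost per call in nanoseconds.
template <typename Op>
double sample_ns_per_op(Op&& op, std::size_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) op();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}

// Median per-op cost across `samples` samples (an odd count, 11 by default).
template <typename Op>
double median_ns_per_op(Op&& op, std::size_t iterations,
                        std::size_t samples = 11) {
    std::vector<double> results;
    results.reserve(samples);
    for (std::size_t s = 0; s < samples; ++s)
        results.push_back(sample_ns_per_op(op, iterations));
    // nth_element places the median at the middle position for odd counts.
    const auto mid = results.begin() + static_cast<std::ptrdiff_t>(samples / 2);
    std::nth_element(results.begin(), mid, results.end());
    return *mid;
}
```

Calling `median_ns_per_op([&] { dispatcher.emit(42); }, 1000000)` with a hypothetical `dispatcher` object would produce one per-scenario number for a single process invocation; the cross-run median and verdict are computed separately over 5 such invocations.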
| 86 | + |
| 87 | +## Reproduce |
| 88 | + |
| 89 | +```bash |
| 90 | +cmake --preset mingw-release -DDMK_BUILD_BENCHMARKS=ON -DDMK_BUILD_TESTS=OFF |
| 91 | +PATH="/c/msys64/mingw64/bin:$PATH" cmake --build build/mingw-release --target DetourModKit_bench --parallel |
| 92 | +PATH="/c/msys64/mingw64/bin:$PATH" ./build/mingw-release/tests/DetourModKit_bench.exe > run.tsv |
| 93 | +``` |
| 94 | + |
| 95 | +For a clean before/after comparison, bench the new implementation first, |
| 96 | +copy the header aside, `git checkout HEAD -- include/DetourModKit/event_dispatcher.hpp` |
| 97 | +to restore the baseline header, rebuild the `DetourModKit_bench` target, run |
| 98 | +again into the baseline TSV, then restore the new header. Repeat N times |
| 99 | +per side and compare medians with an explicit noise-floor check. |
| 100 | + |
| 101 | +## Caveat on committed TSVs |
| 102 | + |
| 103 | +The TSVs in this directory are raw artifacts from a specific host and |
| 104 | +compiler version. They are not a stable baseline. Treat them as evidence |
| 105 | +for the claims in this document, not as a regression gate. Future bench |
| 106 | +runs should regenerate their own numbers and compare against the structure |
| 107 | +of the results (17x fast-path win, 2x concurrent win, COW subscribe cost) |
| 108 | +rather than the absolute nanosecond values. |