tkhquang
diff --git a/.github/workflows/coverage-pages.yml b/.github/workflows/coverage-pages.yml
Lines changed: 39 additions & 6 deletions
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 3 additions & 3 deletions
diff --git a/CMakeLists.txt b/CMakeLists.txt
Lines changed: 14 additions & 3 deletions
diff --git a/README.md b/README.md
Lines changed: 4 additions & 2 deletions
diff --git a/docs/analysis/event_dispatcher_bench_v3.2.0/README.md b/docs/analysis/event_dispatcher_bench_v3.2.0/README.md
Lines changed: 106 additions & 0 deletions
diff --git a/docs/analysis/event_dispatcher_bench_v3.2.0/after.tsv b/docs/analysis/event_dispatcher_bench_v3.2.0/after.tsv
Lines changed: 13 additions & 0 deletions
diff --git a/docs/analysis/event_dispatcher_bench_v3.2.0/before.tsv b/docs/analysis/event_dispatcher_bench_v3.2.0/before.tsv
Lines changed: 13 additions & 0 deletions
@@ -27,14 +27,10 @@ env:
   MINGW_BIN: C:\mingw64\bin
 
 jobs:
-  build-and-deploy:
-    name: Build, Test & Deploy Coverage
+  mingw-coverage:
+    name: MinGW Build, Test & Coverage Artifact
     runs-on: windows-latest
 
-    environment:
-      name: github-pages
-      url: ${{ steps.deployment.outputs.page_url }}
-
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -102,6 +98,43 @@ jobs:
         with:
           path: coverage-report
 
+  msvc-verify:
+    name: MSVC Build & Test
+    runs-on: windows-latest
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+        with:
+          submodules: "recursive"
+
+      - name: Set up MSVC developer environment
+        uses: ilammy/msvc-dev-cmd@v1
+        with:
+          arch: x64
+
+      - name: Configure (MSVC Debug + Tests)
+        run: cmake --preset msvc-debug
+        shell: cmd
+
+      - name: Build
+        run: cmake --build --preset msvc-debug --parallel
+        shell: cmd
+
+      - name: Run Tests
+        run: ctest --preset msvc-debug
+        shell: cmd
+
+  deploy-pages:
+    name: Deploy Coverage Report
+    runs-on: ubuntu-latest
+    needs: [mingw-coverage, msvc-verify]
+
+    environment:
+      name: github-pages
+      url: ${{ steps.deployment.outputs.page_url }}
+
+    steps:
       - name: Deploy to GitHub Pages
         id: deployment
         uses: actions/deploy-pages@v4
@@ -167,7 +167,7 @@ pending_messages_.fetch_add(1, std::memory_order_acq_rel);
 
 ### Resource management and patterns
 
-- **RAII everywhere:** `std::unique_ptr`, `std::shared_ptr`, `std::lock_guard`, `std::scoped_lock`. No naked `new`/`delete` in application code. The only permitted exception is leak-on-purpose singletons to avoid the static destruction order fiasco (must be documented with a comment explaining why).
+- **RAII everywhere:** `std::unique_ptr`, `std::shared_ptr`, `std::lock_guard`, `std::scoped_lock`. No naked `new`/`delete` in application code. The only permitted exception is leak-on-purpose state to avoid teardown hazards -- specifically the static destruction order fiasco or deadlock when destruction would run under the Windows loader lock. Any such leak must be documented with a comment explaining why, must use `new (std::nothrow)` so the enclosing `noexcept` path stays honest, and must pin the current module so code pages referenced by the leaked state stay mapped (see `HookManager::~HookManager` and `Logger::shutdown_internal`).
 - **Rule of Zero/Five:** Prefer Rule of Zero (let compiler generate special members). When custom resource management is needed, implement all five special members. Delete copy/move when the type is non-copyable/non-movable.
 - **Atomic memory orderings:** Use the weakest correct ordering. `memory_order_relaxed` for counters and non-critical flags. `acquire`/`release` pairs for synchronization. Document why in comments only when the ordering is non-obvious.
 - **Lock ordering:** When acquiring multiple locks, document the order in the class header and follow it strictly. Example from `logger.hpp`: `1. async_mutex_` then `2. *log_mutex_ptr_`.
@@ -253,14 +253,14 @@ PATH="/c/msys64/mingw64/bin:$PATH" ./build/mingw-debug/tests/DetourModKit_tests.
 | Module | Thread safety | Hot-path mechanism |
 |--------|--------------|-------------------|
 | Scanner | Stateless -- inherently safe | N/A (startup only) |
-| HookManager | `shared_mutex` (readers) / `unique_lock` (writers); two-phase shutdown (disable under shared lock, clear under exclusive lock); `m_mutator_gate` (shared_mutex) blocks new mutators (including all VMT operations) during teardown; CAS on `m_shutdown_called` serializes shutdown/remove_all_hooks; double-checked fast-fail on `m_shutdown_called` in all mutators; destructor fallback (when `DMK_Shutdown()` was not called) acquires `m_mutator_gate` exclusively, flips `m_shutdown_called`, drains readers via exclusive `m_hooks_mutex`, then clears the maps -- under loader lock it pins the module and leaks the maps into a static vector instead of draining (mirrors the Logger::shutdown_internal pattern) | `shared_lock` for `with_inline_hook()` |
+| HookManager | `shared_mutex` (readers) / `unique_lock` (writers); two-phase shutdown (disable under shared lock, clear under exclusive lock); `m_mutator_gate` (shared_mutex) blocks new mutators (including all VMT operations) during teardown; CAS on `m_shutdown_called` serializes shutdown/remove_all_hooks; double-checked fast-fail on `m_shutdown_called` in all mutators; destructor fallback (when `DMK_Shutdown()` was not called) acquires `m_mutator_gate` exclusively, flips `m_shutdown_called`, drains readers via exclusive `m_hooks_mutex`, then clears the maps -- under loader lock it pins the module and move-constructs each map onto the heap via `new (std::nothrow)` so the storage outlives the destructor without ever draining, mirroring the leak-on-loader-lock discipline used in `Logger::shutdown_internal` | `shared_lock` for `with_inline_hook()` |
 | Logger | `atomic<shared_ptr>` for lock-free async reads; `shutdown_internal` is safe across repeated shutdown / enable_async_mode cycles: when the writer thread has to be detached under loader lock, the `shared_ptr<AsyncLogger>` is appended to a static `std::vector` rather than overwriting a single static slot, so prior handles are never dropped while their writer threads may still be running | Single atomic load on log level check |
 | AsyncLogger | Lock-free MPMC queue (Vyukov-style); post-join drain on shutdown (at most one message per producer can be lost in the nanosecond race between drain and force-zero -- accepted trade-off to avoid atomic overhead on every enqueue); timestamp caching in write batches | Atomic sequence numbers per slot |
 | InputPoller | Atomic `active_states_[]` array | `memory_order_relaxed` load per binding |
 | InputManager | `mutex` for lifecycle, `atomic<InputPoller*>` for reads | Lock-free `is_binding_active()` |
 | Memory cache | Sharded `SRWLOCK` + epoch-based shutdown | Shared reader locks per shard |
 | Config | `mutex` for registration; deferred setter invocation outside lock (no reentrancy guard needed -- setters may call back into Config) | N/A (startup only) |
-| EventDispatcher | `shared_mutex` -- shared lock for `emit()`/`emit_safe()`, exclusive lock for subscribe/unsubscribe; thread-local reentrancy guard rejects subscribe/unsubscribe from within handlers; `emit()` propagates handler exceptions, `emit_safe()` catches and skips them | `shared_lock` + contiguous vector iteration in subscription order |
+| EventDispatcher | Lock-free `emit()` / `emit_safe()` via `std::atomic<std::shared_ptr<const std::vector<Entry>>>` snapshot (copy-on-write publish, acquire-load on read); zero-subscriber fast path skips the snapshot load via an atomic handler counter; writers serialize on a small `std::mutex` that never touches the emit hot path; thread-local reentrancy guard rejects subscribe/unsubscribe from within handlers so the no-mutation-during-emit invariant holds; `emit()` propagates handler exceptions, `emit_safe()` catches and skips them | Atomic acquire-load of a `shared_ptr` snapshot plus linear iteration over a contiguous vector; no reader lock |
 | Profiler | Lock-free ring buffer via atomic `fetch_add` on write position; odd/even sequence counter per sample slot prevents torn reads during concurrent export -- the sequence is opened and closed with unconditional `fetch_add` (never a load-then-store) so concurrent producers racing on the same slot cannot roll the counter backwards; `DMK_PROFILE_SCOPE(name)` requires `name` to be a string literal, enforced at compile time by a `ScopedProfile` constructor that only binds to `const char (&)[N]` | Single atomic increment + sequence-guarded field writes per sample |
 
 ### Performance-critical paths
 
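The Profiler row of the thread-safety table describes an odd/even sequence counter that is opened and closed with unconditional `fetch_add`. A minimal sketch of that idea follows, assuming a hypothetical `SampleSlot` layout and at most one writer inside a slot at a time; the struct, field, and function names are illustrative and are not taken from the Profiler source.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sample slot; the real Profiler layout is not shown in this diff.
struct SampleSlot
{
    std::atomic<std::uint64_t> sequence{0}; // odd while a write is in progress
    std::atomic<const char *> name{nullptr};
    std::atomic<std::uint64_t> duration_ns{0};
};

// Writer: open and close the sequence with fetch_add, never load-then-store,
// so the counter can only move forwards.
inline void write_sample(SampleSlot &slot, const char *name, std::uint64_t duration_ns)
{
    assert(name != nullptr);
    slot.sequence.fetch_add(1, std::memory_order_relaxed); // open: even -> odd
    std::atomic_thread_fence(std::memory_order_release);   // order the open before the field stores
    slot.name.store(name, std::memory_order_relaxed);
    slot.duration_ns.store(duration_ns, std::memory_order_relaxed);
    slot.sequence.fetch_add(1, std::memory_order_release); // close: odd -> even
}

// Reader: accept the fields only if the sequence was even before the reads
// and unchanged after them. Returns false when the sample may be torn.
inline bool read_sample(const SampleSlot &slot, const char *&name, std::uint64_t &duration_ns)
{
    const std::uint64_t before = slot.sequence.load(std::memory_order_acquire);
    if (before & 1u)
        return false; // a writer is inside the slot
    const char *n = slot.name.load(std::memory_order_relaxed);
    const std::uint64_t d = slot.duration_ns.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire); // order the field loads before the re-check
    if (slot.sequence.load(std::memory_order_relaxed) != before)
        return false; // a writer ran during the reads
    name = n;
    duration_ns = d;
    return true;
}
```

The fields are relaxed atomics so that a reader overlapping a writer is not a data race in the C++ memory model; the sequence re-check then discards any torn result.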
@@ -467,8 +467,19 @@ endif()
 # --- Unit Tests ---
 option(DMK_BUILD_TESTS "Build unit tests" OFF)
 
-if(DMK_BUILD_TESTS)
-  message(STATUS "Building unit tests...")
-  enable_testing()
+# --- Benchmarks ---
+# Opt-in microbenchmark executables (e.g. DetourModKit_bench). Gated so that
+# normal consumer builds do not produce extra targets. The bench sources live
+# alongside tests/ and are wired up in tests/CMakeLists.txt.
+option(DMK_BUILD_BENCHMARKS "Build benchmark executables" OFF)
+
+if(DMK_BUILD_TESTS OR DMK_BUILD_BENCHMARKS)
+  if(DMK_BUILD_TESTS)
+    message(STATUS "Building unit tests...")
+    enable_testing()
+  endif()
+  if(DMK_BUILD_BENCHMARKS)
+    message(STATUS "Building benchmarks...")
+  endif()
   add_subdirectory(tests)
 endif()
@@ -105,13 +105,15 @@ DetourModKit is a lightweight C++ toolkit designed to simplify common tasks in g
 
 - Typed pub/sub event system with RAII subscription management
 - Each `EventDispatcher<Event>` manages a single event type
-- `shared_mutex` concurrency: concurrent `emit()` via shared lock, exclusive lock for subscribe/unsubscribe
+- Lock-free `emit()` / `emit_safe()`: atomic acquire-load of a `std::shared_ptr<const vector>` snapshot and iterate, no reader lock; `subscribe()` / `unsubscribe()` are copy-on-write under a small writer mutex
+- Zero-subscriber fast path: `emit()` / `emit_safe()` short-circuit on a lock-free atomic handler counter, skipping the snapshot load entirely
 - Subscriptions auto-unsubscribe on destruction
 - Handlers invoked in subscription order (preserved across unsubscribe)
-- Thread-local reentrancy guard detects and rejects subscribe/unsubscribe calls from within a handler, preventing deadlock
+- Thread-local reentrancy guard detects and rejects subscribe/unsubscribe calls from within a handler, keeping the no-mutation-during-emit invariant intact
 - Compose multiple dispatchers for multi-event architectures
 - `emit_safe()` for exception-tolerant dispatch (recommended for hook callbacks)
 - Safe when the dispatcher is destroyed before its subscriptions (weak_ptr guard)
+- Trade-off: `subscribe()` / `unsubscribe()` allocate a new handler list each call (O(n) publish). Suited for 1-10 subscribers per event and write-rarely access patterns, which matches typical mod usage
 
 </details>
 
 
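The copy-on-write scheme in the feature list above can be sketched roughly as follows. `CowDispatcher` and its members are hypothetical names, not the `EventDispatcher` header; the reentrancy guard and RAII subscription handles are left out, and the pre-C++20 free-function fallback is an assumption added only so the sketch builds on toolchains without `std::atomic<std::shared_ptr>`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for EventDispatcher<Event>.
template <typename Event>
class CowDispatcher
{
    struct Entry
    {
        std::size_t id;
        std::function<void(const Event &)> handler;
    };
    using List = std::vector<Entry>;
    using Snapshot = std::shared_ptr<const List>;

public:
    // Writer path: clone the current list, append, publish the new snapshot.
    std::size_t subscribe(std::function<void(const Event &)> handler)
    {
        assert(handler != nullptr);
        std::lock_guard<std::mutex> lock(writer_mutex_);
        auto next = std::make_shared<List>(*load());
        const std::size_t id = next_id_++;
        next->push_back(Entry{id, std::move(handler)});
        publish(std::move(next));
        return id;
    }

    bool unsubscribe(std::size_t id)
    {
        std::lock_guard<std::mutex> lock(writer_mutex_);
        auto next = std::make_shared<List>(*load());
        for (auto it = next->begin(); it != next->end(); ++it)
        {
            if (it->id == id)
            {
                next->erase(it); // remaining entries keep subscription order
                publish(std::move(next));
                return true;
            }
        }
        return false;
    }

    // Reader path: no mutex; iterates the snapshot current at call time.
    void emit(const Event &event) const
    {
        if (handler_count_.load(std::memory_order_acquire) == 0)
            return; // zero-subscriber fast path, snapshot never loaded
        const Snapshot snapshot = load();
        for (const Entry &entry : *snapshot)
            entry.handler(event);
    }

private:
    Snapshot load() const
    {
#if defined(__cpp_lib_atomic_shared_ptr)
        return snapshot_.load(std::memory_order_acquire);
#else
        return std::atomic_load_explicit(&snapshot_, std::memory_order_acquire);
#endif
    }

    void publish(std::shared_ptr<List> next)
    {
        const std::size_t count = next->size();
        Snapshot frozen = std::move(next);
#if defined(__cpp_lib_atomic_shared_ptr)
        snapshot_.store(std::move(frozen), std::memory_order_release);
#else
        std::atomic_store_explicit(&snapshot_, std::move(frozen), std::memory_order_release);
#endif
        handler_count_.store(count, std::memory_order_release);
    }

    std::mutex writer_mutex_;
#if defined(__cpp_lib_atomic_shared_ptr)
    std::atomic<Snapshot> snapshot_{Snapshot(std::make_shared<List>())};
#else
    Snapshot snapshot_{std::make_shared<List>()};
#endif
    std::atomic<std::size_t> handler_count_{0};
    std::size_t next_id_ = 0;
};
```

Publishing the snapshot before the counter means a reader that observes a non-zero count also observes a list at least that recent; every mutation copies the whole list, which is the O(n) publish cost noted in the trade-off bullet.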
@@ -0,0 +1,106 @@
+# EventDispatcher Bench, v3.2.0
+
+Before/after numbers for the lock-free COW snapshot `emit()` that landed in v3.2.0.
+The previous implementation used `std::shared_mutex` for `emit()` / `emit_safe()`
+and an exclusive lock for `subscribe()` / `unsubscribe()`. The new implementation
+stores handlers in a `std::atomic<std::shared_ptr<const std::vector<Entry>>>`
+snapshot published on mutation, with a lock-free atomic handler-count fast
+path for the zero-subscriber case.
+
+## Results
+
+| Scenario                    | Subs | Before (ns/op) | After (ns/op) | Delta             |
+| --------------------------- | ---: | -------------: | ------------: | ----------------- |
+| `emit`                      |    0 |         105.20 |      **6.47** | **-94% (16.3x)**  |
+| `emit`                      |    1 |         126.23 |        106.85 | -15%              |
+| `emit`                      |    8 |         253.99 |        249.52 | -2%               |
+| `emit`                      |   64 |        1121.43 |       1324.66 | +18% (regression) |
+| `emit_safe`                 |    0 |         103.55 |      **6.32** | **-94% (16.4x)**  |
+| `emit_safe`                 |    1 |         119.27 |        106.76 | -10%              |
+| `emit_safe`                 |    8 |         231.13 |        208.92 | -10%              |
+| `emit_safe`                 |   64 |        1169.86 |       1077.59 | -8%               |
+| `subscribe_unsub_roundtrip` |    0 |         487.18 |       1125.23 | +131% (expected)  |
+| `emit_concurrent_4_threads` |    8 |         551.73 |    **268.07** | **-51% (2.06x)**  |
+| `reentrancy_rejection`      |    1 |         239.07 |        202.82 | -15%              |
+
+Raw TSVs in [before.tsv](before.tsv) and [after.tsv](after.tsv). Each row is the
+median of 11 samples. Iteration counts vary per row (10M for the fastest cases
+down to 100K for `subscribe_unsub_roundtrip`) to keep per-scenario wall time
+comparable.
+
+## Interpretation
+
+**Zero-subscriber fast path.** The atomic handler-count short-circuit in
+`emit()` / `emit_safe()` collapses a `shared_mutex` acquire/release plus
+iteration setup into a single `memory_order_acquire` load of an 8-byte counter.
+The 16x factor is the cost of an uncontended `shared_mutex` acquire/release
+on Windows SRWLOCK relative to a naked atomic load, and it is the dominant
+result for dispatchers that are wired up at init but rarely subscribed to.
+
+**1 to 8 subscriber uncontended emit.** Small consistent wins (2% to 15%)
+from removing the reader lock. The snapshot load is an acquire load plus a
+`shared_ptr` refcount bump, which is cheaper than touching a mutex's
+state word unconditionally.
+
+**Concurrent emit (4 threads, 8 subs).** 2.06x throughput. No reader lock
+means no cache-line contention on the mutex state, so all four threads make
+progress in parallel instead of serializing on the SRWLOCK read side.
+
+**64 subscriber emit, single thread.** 18% slower (+203 ns on a 1121 ns
+baseline). Two plausible causes:
+
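+1. Run-to-run noise. This row uses only 200K iterations per sample, and the
+   `emit_safe` 64-subscriber row moved 8% in the opposite direction under the
+   same conditions, so part of the 203 ns gap may be measurement drift.
+   Clock jitter alone is unlikely to explain it: each sample wraps all of its
+   iterations in one `steady_clock::now()` pair, so tens of nanoseconds of
+   jitter are amortized across the whole sample.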
+2. `std::atomic<std::shared_ptr>` load cost dominates over the old loop's
+   single mutex acquire when amortized over only 64 handlers. libstdc++'s
+   implementation guards the stored pointer with a lock bit in the
+   control-block pointer rather than a lock-free DWCAS (cmpxchg16b); MSVC
+   uses a similar internal lock.
+
+Typical DetourModKit usage (per the README: 1-10 subscribers per event,
+dispatchers wired once at init) stays well inside the range where the
+optimization is a pure win. The 64 subscriber row should be treated as a
+worst-case indicator, not representative load.
+
+**Subscribe / unsubscribe round-trip.** 2.31x slower (487 ns to 1125 ns).
+Each mutation allocates a fresh handler vector, appends or removes the
+entry, and publishes via atomic store. This is documented in the header
+and is the accepted tradeoff for lock-free reads. Subscribe is not on a
+hot path in any realistic mod workload.
+
+**Reentrancy rejection.** A small win (15%) from the same removal of the
+shared lock on the emit path.
+
+## Methodology
+
+- Host: Windows 11, MinGW `mingw-release` preset (GCC 13, libstdc++, -O3 LTO).
+- CMake: `cmake --preset mingw-release -DDMK_BUILD_BENCHMARKS=ON -DDMK_BUILD_TESTS=OFF`.
+- Build: `DetourModKit_bench` target only. No gtest linkage, no other test deps.
+- Each sample runs N iterations of the scenario inside a single
+  `steady_clock::now()` pair. Reported value is the median per-op cost across
+  11 samples. Iteration counts are chosen so each sample takes roughly the
+  same wall time.
+- Back-to-back runs, same machine, same process start, thermal state
+  comparable. Numbers are not hermetic; reruns on the same machine drift by
+  a few percent at this granularity.
+
+## Reproduce
+
+```bash
+cmake --preset mingw-release -DDMK_BUILD_BENCHMARKS=ON -DDMK_BUILD_TESTS=OFF
+PATH="/c/msys64/mingw64/bin:$PATH" cmake --build build/mingw-release --target DetourModKit_bench --parallel
+PATH="/c/msys64/mingw64/bin:$PATH" ./build/mingw-release/tests/DetourModKit_bench.exe > after.tsv
+```
+
+For a clean before/after comparison, bench the new implementation first,
+copy the header aside, `git checkout HEAD -- include/DetourModKit/event_dispatcher.hpp`
+to restore the baseline header, rebuild the `DetourModKit_bench` target, run
+again into `before.tsv`, then restore the new header.
+
+## Caveat on committed TSVs
+
+The `before.tsv` and `after.tsv` files in this directory are raw artifacts
+from one run on one machine. They are not a stable baseline. Treat them as
+evidence for the claims in this document, not as a regression gate. Future
+bench runs should regenerate their own numbers and compare against the
+structure of the results (16x fast-path win, 2x concurrent win, COW
+subscribe cost) rather than the absolute nanosecond values.
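The sampling scheme from the Methodology section (N iterations inside one `steady_clock::now()` pair, median across samples) could look roughly like the sketch below. The helper names `median`, `sample_ns_per_op`, and `median_ns_per_op` are hypothetical and are not the `DetourModKit_bench` sources. The `# sink=` trailer in the TSVs suggests an accumulator that keeps the optimizer from discarding the work; that reading is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Median of a sample set; for an even count, the mean of the two middle values.
inline double median(std::vector<double> values)
{
    assert(!values.empty());
    std::sort(values.begin(), values.end());
    const std::size_t mid = values.size() / 2;
    return values.size() % 2 ? values[mid] : (values[mid - 1] + values[mid]) / 2.0;
}

// One sample: run `op` `iterations` times inside a single steady_clock::now()
// pair and return the per-operation cost in nanoseconds.
template <typename Op>
double sample_ns_per_op(std::size_t iterations, Op &&op)
{
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
        op();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}

// Reported value: median per-op cost across `samples` independent samples.
template <typename Op>
double median_ns_per_op(std::size_t samples, std::size_t iterations, Op &&op)
{
    std::vector<double> results;
    results.reserve(samples);
    for (std::size_t s = 0; s < samples; ++s)
        results.push_back(sample_ns_per_op(iterations, op));
    return median(std::move(results));
}
```

A call such as `median_ns_per_op(11, 1000000, [&] { sink += work(); })` would mirror the "median of 11 samples" rows, with `work` and `sink` supplied by the caller.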
@@ -0,0 +1,13 @@
+scenario	subscribers	iterations	median_ns_per_op	total_ms
+emit	0	10000000	6.47	718
+emit	1	5000000	106.85	5860
+emit	8	1000000	249.52	2756
+emit	64	200000	1324.66	2940
+emit_safe	0	10000000	6.32	696
+emit_safe	1	5000000	106.76	5727
+emit_safe	8	1000000	208.92	2341
+emit_safe	64	200000	1077.59	2401
+subscribe_unsub_roundtrip	0	100000	1125.23	1259
+emit_concurrent_4_threads	8	4000000	268.07	1072
+reentrancy_rejection	1	500000	202.82	1148
+# sink=23940842247
@@ -0,0 +1,13 @@
+scenario	subscribers	iterations	median_ns_per_op	total_ms
+emit	0	10000000	105.20	11778
+emit	1	5000000	126.23	6977
+emit	8	1000000	253.99	3539
+emit	64	200000	1121.43	2538
+emit_safe	0	10000000	103.55	11537
+emit_safe	1	5000000	119.27	6594
+emit_safe	8	1000000	231.13	2555
+emit_safe	64	200000	1169.86	2613
+subscribe_unsub_roundtrip	0	100000	487.18	545
+emit_concurrent_4_threads	8	4000000	551.73	2206
+reentrancy_rejection	1	500000	239.07	1326
+# sink=24038801584
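The Delta column in the results table can be recomputed from the `median_ns_per_op` values in the two TSVs. A small sketch of that arithmetic, using a hypothetical `compare_ns` helper:

```cpp
#include <cassert>
#include <cmath>

struct Delta
{
    double percent; // signed change relative to the baseline; negative means faster
    double ratio;   // before / after; above 1.0 is a speedup factor
};

// Compares two per-op medians in nanoseconds.
inline Delta compare_ns(double before_ns, double after_ns)
{
    assert(before_ns > 0.0 && after_ns > 0.0);
    return Delta{(after_ns - before_ns) / before_ns * 100.0, before_ns / after_ns};
}

// Rounds to one decimal place, matching the precision used for the "x" factors.
inline double round1(double value)
{
    return std::round(value * 10.0) / 10.0;
}
```

For the zero-subscriber `emit` row, `compare_ns(105.20, 6.47)` gives about -93.85% and a ratio of about 16.26, which round to the -94% and 16.3x shown in the table. For rows that got slower, the inverse of `ratio` is the slowdown factor (1125.23 / 487.18 is about 2.31 for the subscribe round-trip).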