docs(bench): use multi-run medians and correct version label

tkhquang · tkhquang · commit 01bb72da77de · 2026-04-24T04:25:14.000+07:00
Archive 5 runs per side under runs/, report medians with run-to-run
spread as the noise floor. The prior single-run 64-subscriber regression
was a statistical outlier (medians differ by -1.0% and +1.2% across 5 runs
per side, both within noise). Rename the directory to match the v3.1.0
release label.
diff --git a/docs/analysis/event_dispatcher_bench_v3.1.0/README.md b/docs/analysis/event_dispatcher_bench_v3.1.0/README.md
@@ -0,0 +1,108 @@
+# EventDispatcher Bench, v3.1.0
+
+Before/after numbers for the lock-free copy-on-write (COW) snapshot `emit()`
+that landed in v3.1.0. The previous implementation took a shared lock on a
+`std::shared_mutex` in `emit()` / `emit_safe()` and an exclusive lock in
+`subscribe()` / `unsubscribe()`. The new implementation stores handlers in a
+`std::atomic<std::shared_ptr<const std::vector<Entry>>>` snapshot that is
+republished on every mutation, plus a lock-free atomic handler count that
+short-circuits the zero-subscriber case.
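+
+The following is a minimal sketch of a copy-on-write handler list with a zero-subscriber fast path, for readers new to the pattern. It is hypothetical and is not DetourModKit's implementation. `CowDispatcher`, the `Entry` fields, and the writer-side `std::mutex` are all assumptions.
+
+The sketch also uses the C++11 `std::atomic_load_explicit` / `std::atomic_store_explicit` overloads for `std::shared_ptr` (deprecated in C++20) instead of `std::atomic<std::shared_ptr<...>>`, because not every standard library ships the C++20 specialization. The standard does not guarantee that either form is lock-free. libstdc++ implements the free functions with an internal mutex pool.
+
+```cpp
+#include <atomic>
+#include <cassert>
+#include <cstddef>
+#include <cstdint>
+#include <functional>
+#include <memory>
+#include <mutex>
+#include <utility>
+#include <vector>
+
+// Hypothetical copy-on-write dispatcher; names and layout are illustrative only.
+template <typename... Args>
+class CowDispatcher {
+public:
+    using Handler = std::function<void(Args...)>;
+
+private:
+    struct Entry {
+        std::uint64_t id;  // hypothetical subscription token
+        Handler fn;
+    };
+    using Snapshot = std::vector<Entry>;
+
+public:
+    // Writer: copy the current list, append, publish the copy.
+    std::uint64_t subscribe(Handler fn) {
+        std::lock_guard<std::mutex> lock(write_mutex_);  // assumption: writers serialize
+        const auto current = std::atomic_load_explicit(&snapshot_, std::memory_order_acquire);
+        auto next = std::make_shared<Snapshot>(*current);  // O(n) copy per mutation
+        const std::uint64_t id = next_id_++;
+        next->push_back(Entry{id, std::move(fn)});
+        publish(std::move(next));
+        return id;
+    }
+
+    // Writer: rebuild the list without `id`; returns false if it was not found.
+    bool unsubscribe(std::uint64_t id) {
+        std::lock_guard<std::mutex> lock(write_mutex_);
+        const auto current = std::atomic_load_explicit(&snapshot_, std::memory_order_acquire);
+        auto next = std::make_shared<Snapshot>();
+        next->reserve(current->size());
+        bool found = false;
+        for (const Entry& e : *current) {
+            if (e.id == id) {
+                found = true;
+            } else {
+                next->push_back(e);
+            }
+        }
+        if (found) {
+            publish(std::move(next));
+        }
+        return found;
+    }
+
+    // Reader: takes no dispatcher mutex. The local shared_ptr keeps this
+    // snapshot alive even if a writer publishes a new one mid-iteration.
+    void emit(Args... args) const {
+        if (count_.load(std::memory_order_acquire) == 0) {
+            return;  // zero-subscriber fast path: one atomic load, no snapshot access
+        }
+        const auto snap = std::atomic_load_explicit(&snapshot_, std::memory_order_acquire);
+        for (const Entry& e : *snap) {
+            e.fn(args...);
+        }
+    }
+
+    std::size_t size() const { return count_.load(std::memory_order_acquire); }
+
+private:
+    void publish(std::shared_ptr<Snapshot> next) {
+        const std::size_t n = next->size();
+        std::shared_ptr<const Snapshot> frozen = std::move(next);
+        std::atomic_store_explicit(&snapshot_, std::move(frozen), std::memory_order_release);
+        count_.store(n, std::memory_order_release);  // count is published after the snapshot
+    }
+
+    std::shared_ptr<const Snapshot> snapshot_ = std::make_shared<Snapshot>();
+    std::atomic<std::size_t> count_{0};
+    std::mutex write_mutex_;
+    std::uint64_t next_id_ = 1;
+};
+```
+
+Because the count is stored after the snapshot, a racing `emit()` can see a stale count. The count only decides whether to look at the snapshot at all. The race therefore either skips a handler that is still being added or iterates an older snapshot that is still valid.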
+
+## Results (median of 5 runs per side)
+
+| Scenario                    | Subs | Before (ns/op) | After (ns/op) | Delta             | Verdict  |
+| --------------------------- | ---: | -------------: | ------------: | ----------------- | :------- |
+| `emit`                      |    0 |          103.9 |       **6.0** | **-94.2% (17x)**  | REAL     |
+| `emit`                      |    1 |          120.1 |          94.4 | **-21.4%**        | REAL     |
+| `emit`                      |    8 |          245.6 |         216.3 | **-11.9%**        | REAL     |
+| `emit`                      |   64 |         1103.5 |        1092.1 | -1.0%             | NOISE    |
+| `emit_safe`                 |    0 |          103.1 |       **5.7** | **-94.5% (18x)**  | REAL     |
+| `emit_safe`                 |    1 |          118.6 |          96.4 | **-18.7%**        | REAL     |
+| `emit_safe`                 |    8 |          233.2 |         219.1 | -6.0%             | REAL     |
+| `emit_safe`                 |   64 |         1086.3 |        1099.8 | +1.2%             | NOISE    |
+| `emit_concurrent_4_threads` |    8 |          517.9 |     **248.2** | **-52.1% (2.1x)** | REAL     |
+| `subscribe_unsub_roundtrip` |    0 |          446.0 |        1150.4 | +158.0%           | REAL     |
+| `reentrancy_rejection`      |    1 |          212.5 |         192.7 | -9.4%             | marginal |
+
+Verdict key:
+
+- **REAL**: median delta exceeds 1.5x the combined run-to-run spread on both sides.
+- **NOISE**: median delta is smaller than the run-to-run spread, so it cannot be distinguished from measurement jitter.
+- **marginal**: median delta exceeds the spread but not 1.5x the spread.
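+
+The verdict key can be written as a small classifier. This is a hypothetical helper and is not part of the bench tool. It takes a single noise floor and leaves it to the caller to decide how the OLD and NEW spreads combine into that floor. Passing the larger of the two spreads (`strict_floor`) gives the strictest reading, in which the delta must clear 1.5x the spread on each side.
+
+```cpp
+#include <algorithm>
+#include <cassert>
+#include <cmath>
+
+enum class Verdict { Noise, Marginal, Real };
+
+// Hypothetical classifier for the verdict key; all inputs in ns/op.
+Verdict classify(double before_ns, double after_ns, double noise_floor_ns) {
+    assert(noise_floor_ns >= 0.0);
+    const double delta = std::fabs(after_ns - before_ns);  // wins and regressions alike
+    if (delta > 1.5 * noise_floor_ns) {
+        return Verdict::Real;
+    }
+    if (delta > noise_floor_ns) {
+        return Verdict::Marginal;
+    }
+    return Verdict::Noise;  // within spread: indistinguishable from jitter
+}
+
+// Strictest combination: the delta must clear 1.5x the spread on each side.
+double strict_floor(double spread_before_ns, double spread_after_ns) {
+    return std::max(spread_before_ns, spread_after_ns);
+}
+```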
+
+Run-to-run coefficient of variation was roughly 1% to 5% per scenario. Full per-run
+TSVs live in [runs/](runs/) (5 OLD + 5 NEW). A representative single run per
+side is preserved in [before.tsv](before.tsv) and [after.tsv](after.tsv) for
+quick reference.
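+
+The aggregation behind the table can be checked by hand with a short sketch. It covers three steps: the median across runs, the max-minus-min spread defined under Methodology, and the percentage delta. The helper names are hypothetical, and this is not the script that produced the table.
+
+```cpp
+#include <algorithm>
+#include <cassert>
+#include <cmath>
+#include <cstddef>
+#include <vector>
+
+// Median of a sample; takes a copy so the caller's data stays unsorted.
+double median(std::vector<double> v) {
+    assert(!v.empty());
+    std::sort(v.begin(), v.end());
+    const std::size_t n = v.size();
+    return n % 2 == 1 ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
+}
+
+// Run-to-run spread: max minus min across runs.
+double spread(const std::vector<double>& v) {
+    assert(!v.empty());
+    const auto mm = std::minmax_element(v.begin(), v.end());
+    return *mm.second - *mm.first;
+}
+
+// Signed percentage change from before to after (negative = faster).
+double percent_delta(double before, double after) {
+    return 100.0 * (after - before) / before;
+}
+```
+
+As a worked example, take the `emit` / 1-subscriber `median_ns_per_op` values:
+
+- `runs/old_*.tsv`: 128.61, 120.09, 120.79, 118.78, 118.81
+- `runs/new_*.tsv`: 104.53, 96.19, 94.44, 93.81, 92.25
+
+These give medians of 120.09 and 94.44 ns, spreads of 9.83 and 12.28 ns, and a delta of about -21.4%. That matches the table row.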
+
+## Interpretation
+
+**Zero-subscriber fast path.** The atomic handler-count short-circuit in
+`emit()` / `emit_safe()` collapses a `shared_mutex` acquire/release plus
+iteration setup into a single `memory_order_acquire` load of an 8-byte counter.
+The 17x factor reflects the cost of an uncontended `std::shared_mutex` shared
+acquire/release on this MinGW/libstdc++ build relative to a plain atomic
+load. It is the dominant result for dispatchers that are wired up at init
+but rarely subscribed to.
+
+**1 to 8 subscriber uncontended emit.** Consistent wins (6% to 21%) from
+removing the reader lock. Each emit now pays an acquire load of the snapshot
+plus a `shared_ptr` refcount increment, which is cheaper than acquiring and
+releasing the reader lock on every call.
+
+**Concurrent emit (4 threads, 8 subs).** 2.1x throughput. Without a reader
+lock, the four threads no longer serialize on the mutex's shared state for
+every acquire and release, so they make progress largely in parallel.
+
+**64 subscriber emit.** Within noise on both `emit` (-1.0%) and `emit_safe`
+(+1.2%). An earlier single-run measurement suggested an 18% regression; that
+was a statistical outlier. Across 5 runs per side the two implementations
+are indistinguishable at this subscriber count because per-handler iteration
+cost dominates. The 22 to 26 ns per-call saving seen at 1 subscriber would be
+about 2% of a ~1100 ns call, smaller than the 28 to 60 ns run-to-run spread
+measured at 64 subscribers.
+
+**Subscribe / unsubscribe round-trip.** 2.6x slower (446 ns to 1150 ns).
+Each mutation copies the current handler vector into a fresh allocation,
+appends or removes the entry, and publishes the copy with an atomic store.
+This cost is documented in the header and is the accepted tradeoff for
+lock-free reads. Subscribe is not on a hot path in typical mod workloads.
+
+**Reentrancy rejection.** The -9.4% delta exceeds run-to-run spread but not
+the 1.5x threshold, so it is marked marginal. Not a meaningful claim; treat
+it as unchanged.
+
+## Methodology
+
+- Host: Windows 11, MinGW `mingw-release` preset (GCC 13, libstdc++, -O3 LTO).
+- CMake: `cmake --preset mingw-release -DDMK_BUILD_BENCHMARKS=ON -DDMK_BUILD_TESTS=OFF`.
+- Build: `DetourModKit_bench` target only. No gtest linkage, no other test deps.
+- Each sample runs N iterations of the scenario inside a single
+  `steady_clock::now()` pair. Reported value is the median per-op cost across
+  11 samples inside one process invocation. Iteration counts are fixed per
+  scenario and identical for OLD and NEW (the `iterations` column), ranging
+  from 100,000 for the subscribe/unsubscribe round-trip to 10,000,000 for
+  zero-subscriber emit.
+- 5 process invocations per side (OLD vs NEW), back-to-back, same machine,
+  same thermal state. Tables above report the median across those 5 runs
+  for each scenario.
+- Verdicts use run-to-run spread (max minus min across the 5 runs) as the
+  noise floor. A claim is "REAL" only when the median delta exceeds 1.5x
+  the combined noise floor of both sides.
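+
+The per-sample timing in the list above can be sketched as follows. `median_ns_per_op` is a hypothetical name, not the bench tool's API, and the sketch covers single-threaded scenarios only.
+
+The `# sink=` trailer in each TSV suggests that the real harness folds results into a sink so the optimizer cannot delete the timed loop. That is an inference. Callers of this sketch should do the same, for example by writing to a `volatile` counter.
+
+```cpp
+#include <algorithm>
+#include <cassert>
+#include <chrono>
+#include <cstddef>
+#include <cstdint>
+#include <vector>
+
+// Hypothetical harness: each sample times `iterations` calls inside one
+// steady_clock::now() pair; returns the median ns/op across `samples`
+// (the upper middle element if `samples` is even).
+template <typename Fn>
+double median_ns_per_op(Fn&& fn, std::uint64_t iterations, std::size_t samples = 11) {
+    assert(iterations > 0 && samples > 0);
+    std::vector<double> per_op;
+    per_op.reserve(samples);
+    for (std::size_t s = 0; s < samples; ++s) {
+        const auto t0 = std::chrono::steady_clock::now();
+        for (std::uint64_t i = 0; i < iterations; ++i) {
+            fn();  // caller must feed a sink so this is not optimized away
+        }
+        const auto t1 = std::chrono::steady_clock::now();
+        const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
+        per_op.push_back(ns / static_cast<double>(iterations));
+    }
+    std::sort(per_op.begin(), per_op.end());
+    return per_op[per_op.size() / 2];
+}
+```
+
+For example, `median_ns_per_op([&sink] { sink = sink + 1; }, 1000)` with a `volatile std::uint64_t sink` times a trivial body over 11 samples. Taking the median of the samples rather than the mean keeps a single preempted sample from skewing the reported value.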
+
+## Reproduce
+
+```bash
+# MSYS2 MinGW64 toolchain on PATH for configure, build, and run
+export PATH="/c/msys64/mingw64/bin:$PATH"
+cmake --preset mingw-release -DDMK_BUILD_BENCHMARKS=ON -DDMK_BUILD_TESTS=OFF
+cmake --build build/mingw-release --target DetourModKit_bench --parallel
+./build/mingw-release/tests/DetourModKit_bench.exe > run.tsv
+```
+
+For a clean before/after comparison, bench the new implementation first,
+copy the header aside, run `git checkout HEAD -- include/DetourModKit/event_dispatcher.hpp`
+to restore the baseline header (this assumes the new header is still an
+uncommitted change; once it is committed, check out the header from a
+pre-v3.1.0 revision instead), rebuild the `DetourModKit_bench` target, run
+again into the baseline TSV, then restore the new header. Repeat several
+times per side (5 in this report) and compare medians with an explicit
+noise-floor check.
+
+## Caveat on committed TSVs
+
+The TSVs in this directory are raw artifacts from a specific host and
+compiler version. They are not a stable baseline. Treat them as evidence
+for the claims in this document, not as a regression gate. Future bench
+runs should regenerate their own numbers and compare against the structure
+of the results (17x fast-path win, 2x concurrent win, COW subscribe cost)
+rather than the absolute nanosecond values.
diff --git a/docs/analysis/event_dispatcher_bench_v3.1.0/after.tsv b/docs/analysis/event_dispatcher_bench_v3.1.0/after.tsv
diff --git a/docs/analysis/event_dispatcher_bench_v3.1.0/before.tsv b/docs/analysis/event_dispatcher_bench_v3.1.0/before.tsv
diff --git a/docs/analysis/event_dispatcher_bench_v3.1.0/runs/new_1.tsv b/docs/analysis/event_dispatcher_bench_v3.1.0/runs/new_1.tsv
@@ -0,0 +1,13 @@
+scenario	subscribers	iterations	median_ns_per_op	total_ms
+emit	0	10000000	6.06	668
+emit	1	5000000	104.53	5715
+emit	8	1000000	220.31	2449
+emit	64	200000	1126.05	2521
+emit_safe	0	10000000	5.71	630
+emit_safe	1	5000000	97.17	5387
+emit_safe	8	1000000	219.06	2428
+emit_safe	64	200000	1120.66	2478
+subscribe_unsub_roundtrip	0	100000	1180.98	1313
+emit_concurrent_4_threads	8	4000000	245.09	980
+reentrancy_rejection	1	500000	203.29	1120
+# sink=23939455106
diff --git a/docs/analysis/event_dispatcher_bench_v3.1.0/runs/new_2.tsv b/docs/analysis/event_dispatcher_bench_v3.1.0/runs/new_2.tsv
@@ -0,0 +1,13 @@
+scenario	subscribers	iterations	median_ns_per_op	total_ms
+emit	0	10000000	6.00	663
+emit	1	5000000	96.19	5303
+emit	8	1000000	220.13	2431
+emit	64	200000	1116.99	2462
+emit_safe	0	10000000	5.69	624
+emit_safe	1	5000000	96.83	5311
+emit_safe	8	1000000	220.24	2438
+emit_safe	64	200000	1111.00	2443
+subscribe_unsub_roundtrip	0	100000	1156.80	1275
+emit_concurrent_4_threads	8	4000000	248.19	992
+reentrancy_rejection	1	500000	190.95	1072
+# sink=23940408562
diff --git a/docs/analysis/event_dispatcher_bench_v3.1.0/runs/new_3.tsv b/docs/analysis/event_dispatcher_bench_v3.1.0/runs/new_3.tsv
@@ -0,0 +1,13 @@
+scenario	subscribers	iterations	median_ns_per_op	total_ms
+emit	0	10000000	6.08	675
+emit	1	5000000	94.44	5200
+emit	8	1000000	215.45	2394
+emit	64	200000	1092.06	2412
+emit_safe	0	10000000	5.79	641
+emit_safe	1	5000000	96.42	5376
+emit_safe	8	1000000	216.73	2395
+emit_safe	64	200000	1099.84	2487
+subscribe_unsub_roundtrip	0	100000	1150.42	1277
+emit_concurrent_4_threads	8	4000000	257.35	1029
+reentrancy_rejection	1	500000	192.65	1081
+# sink=23936874150
diff --git a/docs/analysis/event_dispatcher_bench_v3.1.0/runs/new_4.tsv b/docs/analysis/event_dispatcher_bench_v3.1.0/runs/new_4.tsv
@@ -0,0 +1,13 @@
+scenario	subscribers	iterations	median_ns_per_op	total_ms
+emit	0	10000000	5.72	627
+emit	1	5000000	93.81	5154
+emit	8	1000000	216.31	2375
+emit	64	200000	1091.75	2408
+emit_safe	0	10000000	5.57	614
+emit_safe	1	5000000	95.42	5236
+emit_safe	8	1000000	220.80	2418
+emit_safe	64	200000	1095.09	2407
+subscribe_unsub_roundtrip	0	100000	1123.53	1244
+emit_concurrent_4_threads	8	4000000	235.05	940
+reentrancy_rejection	1	500000	188.63	1060
+# sink=23945296277
diff --git a/docs/analysis/event_dispatcher_bench_v3.1.0/runs/new_5.tsv b/docs/analysis/event_dispatcher_bench_v3.1.0/runs/new_5.tsv
@@ -0,0 +1,13 @@
+scenario	subscribers	iterations	median_ns_per_op	total_ms
+emit	0	10000000	5.80	642
+emit	1	5000000	92.25	5104
+emit	8	1000000	211.34	2341
+emit	64	200000	1085.13	2377
+emit_safe	0	10000000	5.67	618
+emit_safe	1	5000000	93.57	5143
+emit_safe	8	1000000	218.15	2393
+emit_safe	64	200000	1082.98	2385
+subscribe_unsub_roundtrip	0	100000	1127.63	1247
+emit_concurrent_4_threads	8	4000000	255.91	1023
+reentrancy_rejection	1	500000	193.98	1071
+# sink=23933756560
diff --git a/docs/analysis/event_dispatcher_bench_v3.1.0/runs/old_1.tsv b/docs/analysis/event_dispatcher_bench_v3.1.0/runs/old_1.tsv
@@ -0,0 +1,13 @@
+scenario	subscribers	iterations	median_ns_per_op	total_ms
+emit	0	10000000	106.61	11705
+emit	1	5000000	128.61	7036
+emit	8	1000000	249.45	2768
+emit	64	200000	1139.86	2500
+emit_safe	0	10000000	105.04	11582
+emit_safe	1	5000000	120.53	6652
+emit_safe	8	1000000	241.97	2673
+emit_safe	64	200000	1093.30	2430
+subscribe_unsub_roundtrip	0	100000	469.98	514
+emit_concurrent_4_threads	8	4000000	519.06	2076
+reentrancy_rejection	1	500000	203.05	1151
+# sink=24040673944
diff --git a/docs/analysis/event_dispatcher_bench_v3.1.0/runs/old_2.tsv b/docs/analysis/event_dispatcher_bench_v3.1.0/runs/old_2.tsv
@@ -0,0 +1,13 @@
+scenario	subscribers	iterations	median_ns_per_op	total_ms
+emit	0	10000000	106.19	11701
+emit	1	5000000	120.09	6623
+emit	8	1000000	245.57	2706
+emit	64	200000	1109.16	2439
+emit_safe	0	10000000	105.08	11571
+emit_safe	1	5000000	118.56	6567
+emit_safe	8	1000000	233.16	2585
+emit_safe	64	200000	1081.20	2374
+subscribe_unsub_roundtrip	0	100000	444.50	488
+emit_concurrent_4_threads	8	4000000	509.16	2036
+reentrancy_rejection	1	500000	212.53	1178
+# sink=24038786247
diff --git a/docs/analysis/event_dispatcher_bench_v3.1.0/runs/old_3.tsv b/docs/analysis/event_dispatcher_bench_v3.1.0/runs/old_3.tsv
@@ -0,0 +1,13 @@
+scenario	subscribers	iterations	median_ns_per_op	total_ms
+emit	0	10000000	103.94	11593
+emit	1	5000000	120.79	6686
+emit	8	1000000	240.88	2707
+emit	64	200000	1079.81	2389
+emit_safe	0	10000000	103.12	11299
+emit_safe	1	5000000	118.56	6565
+emit_safe	8	1000000	232.03	2566
+emit_safe	64	200000	1109.64	2442
+subscribe_unsub_roundtrip	0	100000	445.98	492
+emit_concurrent_4_threads	8	4000000	510.86	2043
+reentrancy_rejection	1	500000	214.64	1186
+# sink=24041896088
diff --git a/docs/analysis/event_dispatcher_bench_v3.1.0/runs/old_4.tsv b/docs/analysis/event_dispatcher_bench_v3.1.0/runs/old_4.tsv
@@ -0,0 +1,13 @@
+scenario	subscribers	iterations	median_ns_per_op	total_ms
+emit	0	10000000	101.87	11219
+emit	1	5000000	118.78	6579
+emit	8	1000000	247.11	2733
+emit	64	200000	1103.48	2483
+emit_safe	0	10000000	102.44	11260
+emit_safe	1	5000000	118.52	6520
+emit_safe	8	1000000	233.51	2566
+emit_safe	64	200000	1082.23	2385
+subscribe_unsub_roundtrip	0	100000	453.03	499
+emit_concurrent_4_threads	8	4000000	517.91	2071
+reentrancy_rejection	1	500000	215.26	1180
+# sink=24041522113
diff --git a/docs/analysis/event_dispatcher_bench_v3.1.0/runs/old_5.tsv b/docs/analysis/event_dispatcher_bench_v3.1.0/runs/old_5.tsv
@@ -0,0 +1,13 @@
+scenario	subscribers	iterations	median_ns_per_op	total_ms
+emit	0	10000000	103.69	11480
+emit	1	5000000	118.81	6567
+emit	8	1000000	239.97	2648
+emit	64	200000	1084.24	2409
+emit_safe	0	10000000	102.46	11257
+emit_safe	1	5000000	117.21	6484
+emit_safe	8	1000000	232.27	2557
+emit_safe	64	200000	1086.31	2374
+subscribe_unsub_roundtrip	0	100000	435.42	478
+emit_concurrent_4_threads	8	4000000	520.76	2083
+reentrancy_rejection	1	500000	208.94	1168
+# sink=24040143757
diff --git a/docs/analysis/event_dispatcher_bench_v3.2.0/README.md b/docs/analysis/event_dispatcher_bench_v3.2.0/README.md