10 changes: 8 additions & 2 deletions AGENTS.md
@@ -66,6 +66,7 @@ PATH="/c/msys64/mingw64/bin:$PATH" cmake -S . -B build/mingw-release \
# Available bench executables (standalone, no gtest runtime):
# DetourModKit_bench -- EventDispatcher emit / subscribe throughput
# DetourModKit_bench_scanner -- Scanner::find_pattern, rare-byte anchor vs naive
# DetourModKit_bench_memory -- validation predicate vs direct SEH-guarded read / chain primitives (hot-path cost)

PATH="/c/msys64/mingw64/bin:$PATH" cmake --build build/mingw-release \
--target DetourModKit_bench_scanner --parallel
@@ -74,6 +75,8 @@ PATH="/c/msys64/mingw64/bin:$PATH" cmake --build build/mingw-release \

Latest scanner bench numbers and methodology live in
[docs/analysis/scanner_bench_v3.x/README.md](docs/analysis/scanner_bench_v3.x/README.md).
Memory validation-vs-direct-read numbers live in
[docs/analysis/memory_bench_v3.x/README.md](docs/analysis/memory_bench_v3.x/README.md).

### Sanitizers and coverage (MinGW only)

@@ -110,7 +113,7 @@ include/DetourModKit/ # Public headers -- one per module
config_watcher.hpp # Filesystem watcher (ReadDirectoryChangesW) for INI hot-reload
input.hpp # Input polling (keyboard/mouse/XInput)
input_codes.hpp # Unified InputCode type and named key tables
memory.hpp # Memory read/write, sharded region cache, seh_read<T>, PE module range
memory.hpp # Memory read/write, sharded region cache, seh_read<T>, seh_resolve_chain/seh_read_chain<T>, plausible_userspace_ptr, PE module range
rtti.hpp # MSVC RTTI walker (type_name_of, vtable_is_type, find_in_pointer_table)
event_dispatcher.hpp # Typed pub/sub with RAII subscriptions (header-only)
profiler.hpp # Opt-in scoped timing (zero-cost when disabled)
@@ -246,6 +249,7 @@ dispatcher.emit_safe(PlayerStateChanged{.health = player->health});
- **Platform tests:** `tests/test_platform.cpp` tests internal loader-lock detection and module pinning utilities from `src/platform.hpp`.
- **Decoder tests:** `tests/test_x86_decode.cpp` tests the internal header `src/x86_decode.hpp` (RIP-relative E9 / EB / FF25 resolvers consumed by `Scanner`). The test file adds `src/` to its include path and drives each decoder with hand-crafted byte buffers.
- **Worker tests:** `tests/test_worker.cpp` covers the `StoppableWorker` RAII `std::jthread` wrapper, including the empty-body early return, swallowed `std::exception` and unknown-exception paths, and idempotent `request_stop()` / `shutdown()`. The loader-lock detach arm is untestable from user code (only reached under DllMain) and is accepted as such.
- **Pointer-chain tests:** `tests/test_memory_chain.cpp` is a deliberate second suite for the public `memory.hpp` surface, kept separate from `test_memory.cpp` because the single-fault-frame pointer-chain primitives (`plausible_userspace_ptr`, `seh_resolve_chain`, `seh_read_chain`, `seh_read_chain_bytes`) walk in-process pointer chains and need no cache or game-memory state, whereas `test_memory.cpp` drives the sharded cache, read/write, and module-range paths. Both suites bind to the same header; this is the only same-module split and is intentional for state isolation.
- **Test fixture pattern:** Each suite uses a `::testing::Test` subclass with `SetUp()`/`TearDown()` for temp file cleanup. Temp file paths must include the process ID (`_getpid()`) and a counter to avoid collisions when CTest runs tests in parallel as separate processes.
- **VMT hook test lifetime:** GoogleTest destroys test-body locals *before* calling `TearDown()`. VMT tests must explicitly call `remove_all_vmt_hooks()` (or `remove_vmt_hook`) before target objects go out of scope. Do not rely on `TearDown()` for VMT cleanup when the hooked object is a test-body local.
- **Coverage gate:** 80% minimum line coverage enforced in CI. All PRs must pass.
@@ -293,7 +297,7 @@ PATH="/c/msys64/mingw64/bin:$PATH" ./build/mingw-debug/tests/DetourModKit_tests.
These are called at 60+ fps from game hook callbacks. Never add allocations, locks, or blocking I/O to them:

- `InputPoller::is_binding_active(index)` -- single atomic load
- `InputPoller::is_binding_active(name)` -- hash lookup + atomic load per binding (typically 13)
- `InputPoller::is_binding_active(name)` -- hash lookup + atomic load per binding (typically 1-3)
- `HookManager::with_inline_hook()` -- shared_lock read
- `Logger::log()` level check -- single atomic load
- `Logger::log()` async enqueue -- atomic shared_ptr load + lock-free queue push
@@ -302,6 +306,8 @@ These are called at 60+ fps from game hook callbacks. Never add allocations, loc
- `Memory::read_ptr_unsafe()` -- SEH-protected raw dereference (MSVC), cache-accelerated with VirtualQuery fallback (MinGW)
- `Memory::read_ptr_unchecked()` -- inline pointer dereference with source and result low-address guards, no SEH (caller must guarantee structural pointer validity)
- `Memory::seh_read<T>()` / `seh_read_bytes()` -- typed and raw SEH-guarded reads; single `__try` frame on MSVC, VirtualQuery loop across regions on MinGW. Used by `Rtti` for chained RTTI walks
- `Memory::seh_resolve_chain()` / `seh_read_chain<T>()` -- resolves or reads a whole multi-level pointer chain under one fault guard: one out-of-line call instead of N separate `seh_read` calls, with each intermediate link kept in a register and pre-screened by `plausible_userspace_ptr` (a faulting or implausible link aborts the walk and returns nullopt/false). VirtualQuery-guarded per link on MinGW
- `Memory::plausible_userspace_ptr(p)` -- `inline constexpr` user-mode pointer plausibility test; pure arithmetic with no syscall and no memory access (early-rejects stale/sentinel/torn pointers before an SEH-guarded read)
- `Memory::contains(range, p)` -- constexpr point-in-range test for module bounds checks
- `Memory::own_module_range()` / `host_module_range()` -- magic-static cached, single atomic load on the fast path
- `Rtti::vtable_is_type(vt, expected)` -- one batched COL read (24 bytes) plus `expected.size() + 1` bytes of name comparison; no allocation
106 changes: 106 additions & 0 deletions docs/analysis/memory_bench_v3.x/README.md
@@ -0,0 +1,106 @@
# Memory microbenchmark: validation predicate vs direct SEH-guarded read

This directory captures a run of `tests/bench_memory.cpp` against the production `Memory` paths. The benchmark quantifies the per-call cost of each way to read game memory from a hot path, so a caller can choose between a validation predicate (`is_readable` / `is_writable`) and a direct SEH-guarded read (`seh_read`, `seh_read_chain`) with data rather than intuition. The guidance these numbers back is in [../../misc/hot-path-memory.md](../../misc/hot-path-memory.md).

The benchmark measures:

- **Per-call cost** of each primitive (validation warm-hit / cold-miss, raw `VirtualQuery`, `read_ptr_unchecked`, `seh_read`, direct load/store, `write_bytes`).
- **VirtualQuery vs address-space size** (reserve N regions, re-time) to show whether a large VAD tree inflates the miss cost.
- **`is_readable` tail latency** (p50/p99/max) under 1/2/4 threads forcing cache misses on a shared shard set. Tail latency, not average, is what a frame loop feels as a hitch.
- **Probe model**: a hook that resolves an object and reads K dependent fields across a few distinct (cache-missing) objects, GATED (`is_readable` before every read) vs DIRECT (one fault guard, raw reads), reported per probe including the tail, plus a per-frame budget.
- **Pointer chain**: a GATED per-link walk vs `seh_resolve_chain` / `seh_read_chain` (one fault guard for the whole walk).
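
To make the pointer-chain case concrete, here is a minimal portable sketch of the walk shape: each link is pre-screened with pure arithmetic, then dereferenced. Everything in it is hypothetical (`plausible_ptr_sketch`, `resolve_chain_sketch`, the `Node` layout, the address bounds, and the "add offset, then dereference" link semantics); it is not the library's implementation. In particular it omits the fault guard, so unlike `seh_resolve_chain` it is only safe on memory known to be valid.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical node layout used only to build an in-process chain.
struct Node
{
    std::uintptr_t next;
    std::uint64_t value;
};

// Hypothetical plausibility test: pure arithmetic, no syscall, no memory access.
// The bounds are illustrative; the library's real bounds are not shown in this document.
inline bool plausible_ptr_sketch(std::uintptr_t p) noexcept
{
    return p >= 0x10000u && (static_cast<std::uint64_t>(p) >> 48) == 0;
}

// Walks `count` links starting at `base`. At each link the current address is
// screened, then the pointer stored at (current + offset) becomes the new address.
// Returns the final address, or nullopt if any link (or the result) is implausible.
// NOTE: no fault guard here; the real primitives wrap the whole walk in one guard.
inline std::optional<std::uintptr_t> resolve_chain_sketch(std::uintptr_t base,
                                                          const std::ptrdiff_t *offsets,
                                                          std::size_t count) noexcept
{
    std::uintptr_t cur = base;
    for (std::size_t i = 0; i < count; ++i)
    {
        if (!plausible_ptr_sketch(cur))
        {
            return std::nullopt; // reject before touching memory
        }
        std::uintptr_t next = 0;
        std::memcpy(&next,
                    reinterpret_cast<const void *>(cur + static_cast<std::uintptr_t>(offsets[i])),
                    sizeof next);
        cur = next;
    }
    if (!plausible_ptr_sketch(cur))
    {
        return std::nullopt;
    }
    return cur;
}
```

The gated variant would call a validation predicate before every dereference inside the loop; the sketch replaces that per-link call with an arithmetic screen, which is the part of the design that keeps the per-link cost low.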

## Hardware / configuration

- Build: MSVC 2022, Ninja, Release (`/O2`), `-DDMK_BUILD_BENCHMARKS=ON`
- Toolchain: MSVC, where `seh_*` use real `__try` / `__except` (table-driven on x64, free on the no-fault path)
- Iterations: 200,000 per sample (20,000 for `write_bytes`)
- Samples: 15; median reported. Latency studies report p50/p99/max over the raw sample set
- `DEFAULT_CACHE_EXPIRY_MS = 50`
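
As an illustration of the reporting scheme above, the sketch below computes a percentile over a raw sample set with the nearest-rank rule. The function name `percentile_ns` and the choice of nearest-rank are assumptions; this document does not say which percentile method the harness uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Nearest-rank percentile: the smallest sample with at least `percent`% of all
// samples at or below it. `percent` is an integer in [1, 100].
// Assumption: the harness's real percentile method may differ.
inline std::uint64_t percentile_ns(std::vector<std::uint64_t> samples, unsigned percent)
{
    assert(!samples.empty() && percent >= 1 && percent <= 100);
    std::sort(samples.begin(), samples.end());
    // Integer ceil(percent * n / 100), so there is no floating-point rounding.
    const std::size_t rank = (static_cast<std::size_t>(percent) * samples.size() + 99) / 100;
    return samples[rank - 1];
}
```

With 15 samples, `percentile_ns(samples, 50)` selects the 8th smallest value, which is the median; `percentile_ns(samples, 100)` is the maximum.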

## Results

```text
[1] Validation MISS / uncached (cache off -> VirtualQuery branch)
raw VirtualQuery 217.78 ns/call
is_readable MISS 236.65 ns/call
is_writable MISS 227.87 ns/call

[2] Validation WARM HIT (cache on, entry fresh within TTL)
is_readable HIT 54.47 ns/call
is_writable HIT 55.35 ns/call

[3] Direct access primitives
direct volatile load 3.93 ns/call
read_ptr_unchecked 3.92 ns/call
seh_read<u64> 7.42 ns/call
direct volatile store 0.45 ns/call
write_bytes(8) 5650.69 ns/call

[4] raw VirtualQuery vs VAD-tree size (single fixed address)
+ 0 reserved regions 237.48 ns/call
+ 1000 reserved regions 218.41 ns/call
+ 4000 reserved regions 220.93 ns/call
+ 12000 reserved regions 224.38 ns/call

[5] is_readable latency under contention (mostly-miss workload, 4096 pages)
1 thread(s): p50 800 ns p99 1700 ns max 77600 ns
2 thread(s): p50 1100 ns p99 2400 ns max 107000 ns
4 thread(s): p50 1700 ns p99 6100 ns max 81500 ns

[6] Per-probe cost: 8 reads across ~3 distinct (cache-missing) objects
GATED (is_readable+read): p50 5900 p99 10500 max 152100 mean 6120 ns/probe
DIRECT (raw read) : p50 100 p99 400 max 5100 mean 89 ns/probe
gated/direct mean ratio : 68.7x

[7] Per-frame budget (16.67 ms frame): cost = probes/frame x ns/probe
probes/fr gated %frame direct %frame
1 0.04% 0.00%
8 0.29% 0.00%
64 2.35% 0.03%
256 9.40% 0.14%
1024 37.60% 0.55%

[8] Pointer chain (6 links, warm cache)
gated link walk 316.10 ns/call
seh_resolve_chain 9.14 ns/call
seh_read_chain<u64> 10.93 ns/call
gated/seh_read_chain ratio: 28.9x
```

## How to read this

| Comparison | Numbers | Takeaway |
|------------|---------|----------|
| `seh_read<u64>` vs direct load | 7.4 ns vs 3.9 ns | A typed SEH-guarded read is within ~2x of a raw dereference, because the x64 `__try` is table-driven and free on the no-fault path. |
| `is_readable` HIT vs `seh_read` | 54.5 ns vs 7.4 ns | Even a warm cache hit on the predicate costs ~7x a direct guarded read: it takes a shard reader lock and a cache lookup. |
| `is_readable` MISS vs HIT | 236.7 ns vs 54.5 ns | A miss issues a `VirtualQuery` syscall and rebuilds the entry under an exclusive lock. When target addresses keep changing, almost every lookup misses. |
| `seh_read_chain` vs gated per-link walk | 10.9 ns vs 316.1 ns | Resolving a 6-link chain under one fault guard is ~29x faster than calling `is_readable` before every dereference. |
| Probe GATED vs DIRECT (mean) | 6120 ns vs 89 ns | Gating each of 8 dependent reads across cache-missing objects costs ~69x the direct cost per probe. |
| Probe tail (p99 / max) | 10500 / 152100 ns vs 400 / 5100 ns | The gate hurts the tail far more than the mean. A single worst-case gated probe (152 us) is ~0.9% of a 16.67 ms frame on its own. |

A few takeaways:

1. **The predicate is the wrong tool on a hot path.** It is not free even on a cache hit (a lock plus a lookup), and on a cache miss it pays a `VirtualQuery` plus an exclusive-lock rebuild. The probe model, where each object is a fresh page, is miss-dominated, so the gate runs ~69x slower per probe than reading directly under one guard.
2. **The cost scales with probes-per-frame, and the tail is the real hazard.** At a few probes per frame the gate is imperceptible (0.29% of a 16.67 ms frame at 8 probes). At a few hundred per frame (an apply path touching many bound objects) it takes a measurable share of the frame budget (9.40% at 256 probes, 37.60% at 1,024), and the p99/max tail can spike a frame on its own. A single average-per-call number hides this.
3. **A single SEH-guarded read is nearly free on MSVC.** `seh_read` is within ~2x of a raw load, and the chain primitives keep that property across a whole multi-level walk: one guard for the walk instead of N predicate calls.
4. **`VirtualQuery` cost is flat in address-space size.** Growing the VAD tree from 0 to 12,000 reserved regions leaves the per-call cost in the 218-237 ns range with no upward trend (the kernel walks a balanced tree), so the miss cost is intrinsic, not a function of how fragmented the process is.
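
The per-frame budget in table [7] is plain arithmetic: probes per frame times mean ns per probe, divided by the frame time. A small sketch, using the mean per-probe costs from section [6] as inputs (the function name `frame_percent` is made up for this example):

```cpp
#include <cassert>

// Frame time used in the results above: 16.67 ms, expressed in nanoseconds.
constexpr double kFrameNs = 16.67e6;

// Share of one frame (in percent) spent on `probes_per_frame` probes that each
// cost `ns_per_probe` on average.
inline double frame_percent(double probes_per_frame, double ns_per_probe,
                            double frame_ns = kFrameNs)
{
    assert(frame_ns > 0.0);
    return 100.0 * probes_per_frame * ns_per_probe / frame_ns;
}
```

For example, `frame_percent(1024, 6120.0)` is about 37.6 and `frame_percent(1024, 89.0)` is about 0.55, matching the last row of the table.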

## Reproducing locally

```bat
REM MSVC (Developer Command Prompt), from repo root
cmake -S . -B build/msvc-release -G Ninja -DCMAKE_BUILD_TYPE=Release ^
-DDMK_BUILD_BENCHMARKS=ON -DDMK_BUILD_TESTS=OFF
cmake --build build/msvc-release --target DetourModKit_bench_memory --parallel
build\msvc-release\tests\DetourModKit_bench_memory.exe
```

The harness prints the human-readable tables above plus a `#TSV` block on stdout for machine parsing (`probe_gated_over_direct` is the headline gated-vs-direct ratio).

## Caveats

- Numbers are from a single development machine and are illustrative. The miss-path cost is dominated by `VirtualQuery` latency and shard-lock contention, so it varies by CPU, Windows build, and core count. Run the bench for your own target; the qualitative result (predicate expensive on the hot path, direct read cheap, chain cheap) holds across machines.
- These are MSVC numbers, the shipping configuration. On MinGW there is no SEH, so `seh_read` / `seh_read_chain` fall back to a `VirtualQuery`-guarded read and pay a syscall per access. On that toolchain the gated walk can be faster than the chain primitives, which is why mod builds target MSVC and why `read_ptr_unchecked` is the recommended MinGW hot-path read (see [../../misc/hot-path-memory.md](../../misc/hot-path-memory.md)).
- On MinGW the benchmark targets are built with the same Release LTO as the library (`INTERPROCEDURAL_OPTIMIZATION_RELEASE`, set in `tests/CMakeLists.txt` when IPO is supported), so each bench object and the LTO-only library archive form one LTO unit. This sidesteps a GCC linker-plugin bug where a mixed link (a non-LTO bench object against the LTO Release archive) re-emits libstdc++'s C++20-constrained `std::thread` / `std::tuple` linkonce symbol twice and fails with a spurious multiple-definition error. No manual step is needed; do not force a non-LTO Release for the bench, since that mixed link is exactly what triggers the failure. The library and tests are unaffected.
- The probe model uses synthetic page-per-object churn to force the miss path. A real hook whose objects share pages will miss less often and see a smaller gate penalty, but the structural point (the predicate adds a lock and a possible syscall the direct read does not) is independent of the hit rate.
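
To see why the hit rate does not change the structural point, consider a deliberately simple model that mixes the measured warm-hit and miss costs of `is_readable` linearly by hit rate. The model and the function names are assumptions made for illustration; it ignores lock contention and tail effects, so treat it as a lower bound on the gate cost rather than a prediction.

```cpp
#include <cassert>

// Median per-call costs from the results above, in nanoseconds.
constexpr double kHitNs = 54.47;     // is_readable, warm cache hit
constexpr double kMissNs = 236.65;   // is_readable, cache miss
constexpr double kSehReadNs = 7.42;  // seh_read<u64>

// Expected cost of one gate call at a given cache hit rate in [0, 1].
// Assumption: hits and misses mix linearly and contention is ignored.
inline double expected_gate_ns(double hit_rate)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * kHitNs + (1.0 - hit_rate) * kMissNs;
}

// How many times more the gate alone costs than one direct guarded read.
inline double gate_over_direct(double hit_rate)
{
    return expected_gate_ns(hit_rate) / kSehReadNs;
}
```

Even at a 100% hit rate the gate alone costs roughly 7x a direct guarded read under this model, and at a 90% hit rate the expected gate cost is about 72.7 ns.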