Benchmark — v0.4.0 single-thread fast path vs. concurrent path (Windows / MSVC / x64)

Comparative benchmark for Milestone 4.5 (spec §6.3 re-run across the ADR-0020 thread-safety policies). Methodology: ADR-0014, extended in M4.5 with the concurrent scenario (T threads each running the interleaved alloc/free loop against a shared pool, reporting aggregate ns/op = wall-time ÷ total ops). Produced by pool_vs_malloc_bench built three times — once per PBR_MEMORY_POOL_THREAD_SAFETY value — with --scenario all --threads 4.
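The aggregate metric described above can be sketched as follows. This is a minimal illustration, not the repository's bench code: `aggregate_ns_per_op` is a hypothetical helper, `std::malloc`/`std::free` stand in for the allocator under test, and counting one alloc+free pair as one op is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper (not from the project): runs `threads` workers, each
// doing `iterations` interleaved alloc+free pairs, and returns
// aggregate ns/op = wall time / total ops across all threads.
// Assumption: one alloc+free pair counts as one op.
inline double aggregate_ns_per_op(unsigned threads, std::size_t iterations,
                                  std::size_t block_size) {
    assert(threads > 0 && iterations > 0);
    std::atomic<bool> go{false};
    std::vector<std::thread> workers;
    workers.reserve(threads);
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&go, iterations, block_size] {
            // Spin until every worker has been created, so all start together.
            while (!go.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            for (std::size_t i = 0; i < iterations; ++i) {
                void* p = std::malloc(block_size);
                // Touch the block so the pair is less likely to be optimized away.
                if (p != nullptr) *static_cast<volatile char*>(p) = 1;
                std::free(p);
            }
        });
    }
    const auto start = std::chrono::steady_clock::now();
    go.store(true, std::memory_order_release);
    for (auto& w : workers) w.join();
    const auto stop = std::chrono::steady_clock::now();

    const double wall_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    const double total_ops =
        static_cast<double>(threads) * static_cast<double>(iterations);
    return wall_ns / total_ops;
}
```

Because wall time is divided by the ops of all threads combined, a lower aggregate ns/op at higher T means the allocator scaled; a higher one means contention cost more than the extra threads gained.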

Host

| Field | Value |
|---|---|
| CPU | Intel Core i5-6600K (Skylake, 4 cores / 4 threads) @ 3.5 GHz |
| RAM | 32 GB |
| OS | Windows 10 Pro 19045 |
| Compiler | MSVC 19.51 (_MSC_FULL_VER 195136247), Release |
| alignof(std::max_align_t) | 8 bytes |
| hardware_concurrency() | 4 |
| Config | iterations=1000000 repeats=10 block_size=64 (first repeat dropped as warm-up); concurrent threads=4 |

Numbers are run-to-run noisy on a desktop OS (median is the headline statistic). The point is the relative picture across policies, not absolute nanoseconds.
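As a rough sketch of the headline statistic, assuming the warm-up handling is simply "discard the first repeat, take the median of the rest" (`median_after_warmup` is a hypothetical name, not a function from the bench):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: drops the first repeat as warm-up, then returns the
// median of the remaining per-repeat ns/op samples.
inline double median_after_warmup(std::vector<double> samples) {
    assert(samples.size() >= 2);     // need at least one sample after the drop
    samples.erase(samples.begin());  // discard the warm-up repeat
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    const std::size_t mid = n / 2;
    // Odd count: the middle element; even count: mean of the two middle ones.
    return (n % 2 == 1) ? samples[mid]
                        : (samples[mid - 1] + samples[mid]) / 2.0;
}
```

With repeats=10, nine samples remain after the drop, so the median is a single measured value rather than an average of two.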

Single-thread results (median ns/op)

The bulk and interleaved scenarios run on one thread — the spec §6.3 measurement, now showing the uncontended cost of each policy.

| Scenario | region | NONE | MUTEX | LOCKFREE | malloc |
|---|---|---|---|---|---|
| bulk | pool alloc | 11.80 | 34.69 | 18.61 | ~71 |
| bulk | pool free | 10.36 | 56.17 | 22.73 | ~41 |
| interleaved | alloc+free | 9.32 | 47.19 | 31.74 | ~47 |

  • NONE is the fast path, unchanged from v0.2.0/v0.3.0. SingleThreadedPolicy inlines to byte-identical code, so the single-thread numbers match the M2.9 reference (interleaved ≈ 9 ns/op). Spec §2.4's "preserve the single-thread fast path" mandate holds, measurably.
  • Synchronization has a real uncontended cost. Even with zero contention, MUTEX pays the lock/unlock (interleaved 47 ns/op ≈ 5× NONE) and LOCKFREE pays the CAS + acquire/release fences (32 ns/op ≈ 3.4× NONE). LOCKFREE is cheaper than MUTEX uncontended.
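The compile-time policy seam behind these numbers can be illustrated with a minimal sketch. The names `NoLock`, `MutexLock`, and `FixedPool` are hypothetical and the project's real policy types (such as SingleThreadedPolicy) may be shaped differently; the point is only that an empty policy inlines to nothing while a mutex policy pays lock/unlock on every call.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <new>
#include <vector>

// Hypothetical policies for illustration only.
struct NoLock {                // single-thread: empty calls inline away
    void lock() noexcept {}
    void unlock() noexcept {}
};
using MutexLock = std::mutex;  // thread-safe: lock/unlock cost even uncontended

template <class LockPolicy>
class FixedPool {
    struct Node { Node* next; };  // intrusive free-list link stored in free blocks

public:
    FixedPool(std::size_t block_size, std::size_t block_count)
        : block_size_(round_up(block_size)),
          storage_(block_size_ * block_count) {
        // Thread every block onto the free list (no lock needed yet).
        for (std::size_t i = 0; i < block_count; ++i) {
            push(storage_.data() + i * block_size_);
        }
    }
    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // Returns a block, or nullptr when the pool is exhausted.
    void* allocate() {
        std::lock_guard<LockPolicy> guard(lock_);
        Node* n = head_;
        if (n != nullptr) head_ = n->next;
        return n;
    }

    void deallocate(void* p) {
        assert(p != nullptr);
        std::lock_guard<LockPolicy> guard(lock_);
        push(p);
    }

private:
    // Blocks must hold a Node and stay aligned for any fundamental type.
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        const std::size_t least = n < sizeof(Node) ? sizeof(Node) : n;
        return (least + a - 1) / a * a;
    }
    void push(void* p) { head_ = new (p) Node{head_}; }

    std::size_t block_size_;
    std::vector<unsigned char> storage_;
    Node* head_ = nullptr;
    LockPolicy lock_;
};
```

`FixedPool<NoLock>` and `FixedPool<MutexLock>` share one code path; only the policy type changes, which is why the unsynchronized build can stay byte-identical to a pool written without any locking at all.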

Concurrent results — 4 threads (aggregate median ns/op)

| Policy | pool alloc+free | malloc alloc+free | malloc / pool |
|---|---|---|---|
| NONE (T=1, clamped)¹ | 9.52 | 47.77 | 5.02× |
| MUTEX (T=4) | 69.54 | 31.75 | 0.46× |
| LOCKFREE (T=4) | 41.79 | 23.80 | 0.57× |

¹ The NONE build is intentionally racy (spec §2.4), so the bench clamps it to a single thread — its row is the fast-path baseline the thread-safe modes are measured against, not a 4-thread number.
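The malloc / pool column is the malloc median divided by the pool median from the same build, so values above 1 mean the pool is faster and values below 1 mean malloc is faster. A tiny sketch of that arithmetic (`malloc_over_pool` is a hypothetical helper name):

```cpp
#include <cassert>

// Hypothetical helper: ratio of malloc ns/op to pool ns/op.
// > 1: the pool is faster; < 1: malloc is faster.
inline double malloc_over_pool(double malloc_ns, double pool_ns) {
    assert(pool_ns > 0.0);
    return malloc_ns / pool_ns;
}
```

For example, 47.77 / 9.52 ≈ 5.02 for NONE, while 31.75 / 69.54 ≈ 0.46 for MUTEX.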

  • Under contention, LOCKFREE (41.8 ns/op) beats MUTEX (69.5 ns/op). The single std::mutex serializes every operation; the lock-free CAS lets threads make progress without blocking, so it scales better on the same workload.
  • Both lose to malloc under contention (malloc/pool < 1). This is the expected architectural result: the pool has a single shared free-list head — one hot cache line every thread fights over — while modern malloc (and the Windows segment heap) spread contention across per-thread arenas. A single-head pool cannot out-scale a per-arena allocator no matter how clever the head's synchronization is.
  • The scaling answer for the pool is per-thread caches (the magazine / tcmalloc approach), which ADR-0020 §4 deliberately deferred — the Strategy seam keeps it a future, non-breaking addition. This benchmark is the evidence motivating that future work.
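To make the "single shared free-list head" in the bullets above concrete, here is a minimal sketch of a CAS-updated head. It is not the project's LOCKFREE implementation: `LockFreeIndexList` and `kNoSlot` are hypothetical names, and the choice of slot indices plus a 32-bit version tag (to sidestep the ABA problem) is an assumption made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kNoSlot = 0xFFFFFFFFu;  // "empty list" marker

// Hypothetical sketch: free list of slot indices. The head packs
// {tag, index} into one 64-bit word so a single CAS swaps both; the tag
// changes on every update, so a recycled index cannot fool a stale CAS.
class LockFreeIndexList {
public:
    explicit LockFreeIndexList(std::uint32_t count) : next_(count) {
        assert(count < kNoSlot);
        for (std::uint32_t i = 0; i < count; ++i) {
            next_[i].store(i + 1 < count ? i + 1 : kNoSlot,
                           std::memory_order_relaxed);
        }
        head_.store(pack(0, count != 0 ? 0u : kNoSlot),
                    std::memory_order_relaxed);
    }

    // Returns a free slot index, or kNoSlot when the list is empty.
    std::uint32_t pop() {
        std::uint64_t old_head = head_.load(std::memory_order_acquire);
        for (;;) {
            const std::uint32_t idx = index_of(old_head);
            if (idx == kNoSlot) return kNoSlot;
            const std::uint32_t next =
                next_[idx].load(std::memory_order_relaxed);
            const std::uint64_t new_head = pack(tag_of(old_head) + 1, next);
            // On failure old_head is refreshed and the loop retries.
            if (head_.compare_exchange_weak(old_head, new_head,
                                            std::memory_order_acquire,
                                            std::memory_order_acquire)) {
                return idx;
            }
        }
    }

    void push(std::uint32_t idx) {
        assert(idx < next_.size());
        std::uint64_t old_head = head_.load(std::memory_order_relaxed);
        for (;;) {
            next_[idx].store(index_of(old_head), std::memory_order_relaxed);
            const std::uint64_t new_head = pack(tag_of(old_head) + 1, idx);
            if (head_.compare_exchange_weak(old_head, new_head,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
                return;
            }
        }
    }

private:
    static std::uint64_t pack(std::uint32_t tag, std::uint32_t idx) {
        return (static_cast<std::uint64_t>(tag) << 32) | idx;
    }
    static std::uint32_t tag_of(std::uint64_t h) {
        return static_cast<std::uint32_t>(h >> 32);
    }
    static std::uint32_t index_of(std::uint64_t h) {
        return static_cast<std::uint32_t>(h & 0xFFFFFFFFu);
    }

    std::vector<std::atomic<std::uint32_t>> next_;
    std::atomic<std::uint64_t> head_{0};
};
```

Every `pop` and `push` from every thread retries its CAS against the same `head_` word. No thread ever blocks, yet they all still contend for one cache line, which is consistent with LOCKFREE beating MUTEX here while both trail a per-arena malloc.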

Takeaways

  1. The single-thread fast path is preserved at zero cost (NONE ≈ 9 ns/op interleaved, ~5× faster than malloc).
  2. Pay only for what you use: thread safety is opt-in at compile time, and NONE adds nothing.
  3. Among the thread-safe policies, LOCKFREE is the faster choice both uncontended and contended; MUTEX is the simpler, always-portable fallback.
  4. Under multi-thread contention (measured here at 4 threads on 4 cores), neither single-head policy beats an arena allocator; per-thread caches are the documented next step (ADR-0020 §4).

Reproduce

# one build per policy
cmake --preset bench -B build/bench-none                                    # NONE (default)
cmake --preset bench -B build/bench-mutex    -DPBR_MEMORY_POOL_THREAD_SAFETY=MUTEX
cmake --preset bench -B build/bench-lockfree -DPBR_MEMORY_POOL_THREAD_SAFETY=LOCKFREE
cmake --build build/bench-none && cmake --build build/bench-mutex && cmake --build build/bench-lockfree

# run each (paths are <build>/src/bench/cpp/it/d4np/memorypool/pool_vs_malloc_bench)
<bin> --scenario all --threads 4