Benchmark — v0.4.0 single-thread fast path vs. concurrent path (Windows / MSVC / x64)

Comparative benchmark for Milestone 4.5 (spec §6.3 re-run across the ADR-0020 thread-safety policies). Methodology: ADR-0014, extended in M4.5 with the concurrent scenario (T threads each running the interleaved alloc/free loop against a shared pool, reporting aggregate ns/op = wall-time ÷ total ops). Produced by pool_vs_malloc_bench built three times — once per PBR_MEMORY_POOL_THREAD_SAFETY value — with --scenario all --threads 4.
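The aggregate metric described above can be sketched as follows. This is a minimal illustration, not the actual `pool_vs_malloc_bench` source: the names `aggregate_ns_per_op` and `malloc_alloc_free_64` are hypothetical, and the sketch includes thread start-up in the timed window, which a real harness would likely exclude with a start barrier.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper, not the real pool_vs_malloc_bench code.
// Runs `op` `iterations` times on each of `threads` threads and returns
// aggregate ns/op = wall-time / total ops.
template <typename Op>
double aggregate_ns_per_op(unsigned threads, std::size_t iterations, Op op) {
    assert(threads > 0 && iterations > 0);
    using clock = std::chrono::steady_clock;

    std::vector<std::thread> workers;
    workers.reserve(threads);

    const auto start = clock::now();
    for (unsigned t = 0; t < threads; ++t) {
        // `op` is shared by reference, so it must be safe to call concurrently.
        workers.emplace_back([iterations, &op] {
            for (std::size_t i = 0; i < iterations; ++i) op();
        });
    }
    for (auto& w : workers) w.join();
    const auto stop = clock::now();

    const double wall_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return wall_ns / (static_cast<double>(threads) * static_cast<double>(iterations));
}

// One interleaved alloc/free pair against malloc (the baseline column).
inline void malloc_alloc_free_64() {
    void* p = std::malloc(64);
    // Touch the block so the compiler cannot elide the malloc/free pair.
    if (p != nullptr) *static_cast<volatile char*>(p) = 0;
    std::free(p);
}
```

Calling `aggregate_ns_per_op(4, 1000000, malloc_alloc_free_64)` would approximate the shape of the 4-thread `malloc` measurement. Because the divisor is total operations across all threads, perfect scaling shows up as ns/op falling with T, and contention shows up as ns/op rising.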

Host

Field	Value
CPU	Intel Core i5-6600K (Skylake, 4 cores / 4 threads) @ 3.5 GHz
RAM	32 GB
OS	Windows 10 Pro 19045
Compiler	MSVC 19.51 (`_MSC_FULL_VER` 195136247), Release
`alignof(std::max_align_t)`	8 bytes
`hardware_concurrency()`	4
Config	`iterations=1000000 repeats=10 block_size=64` (first repeat dropped as warm-up); concurrent `threads=4`

Numbers vary from run to run on a desktop OS, so the median is the headline statistic. The point is the relative picture across policies, not absolute nanoseconds.
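As a sketch of how the headline statistic can be derived from the `repeats=10` samples, the helper below drops the first repeat and takes the median of the rest. The name `median_after_warmup` is hypothetical, and averaging the two middle values for an even count is an assumption; with 10 repeats the remaining 9 samples have a single middle value, so that case does not arise here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: discard the first repeat as warm-up, then return the
// median of the remaining samples (mean of the two middle values when the
// remaining count is even).
inline double median_after_warmup(std::vector<double> samples) {
    assert(samples.size() >= 2);
    samples.erase(samples.begin());  // drop the warm-up repeat
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    return (n % 2 == 1) ? samples[n / 2]
                        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}
```

The median is less sensitive than the mean to the occasional slow repeat caused by scheduler or background activity.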

Single-thread results (median ns/op)

The bulk and interleaved scenarios run on one thread — the spec §6.3 measurement, now showing the uncontended cost of each policy.

Scenario	region	`NONE`	`MUTEX`	`LOCKFREE`	`malloc`
bulk	pool alloc	11.80	34.69	18.61	~71
bulk	pool free	10.36	56.17	22.73	~41
interleaved	alloc+free	9.32	47.19	31.74	~47

NONE is the fast path, unchanged from v0.2.0/v0.3.0 — SingleThreadedPolicy inlines to byte-identical code, so the single-thread numbers match the M2.9 reference (interleaved ≈ 9 ns/op). Spec §2.4's "preserve the single-thread fast path" mandate holds, measurably.
Synchronization has a real uncontended cost. Even with zero contention, MUTEX pays for a lock/unlock pair on every operation (interleaved 47 ns/op ≈ 5× NONE) and LOCKFREE pays for the CAS with its acquire/release ordering (32 ns/op ≈ 3.4× NONE). LOCKFREE is cheaper than MUTEX in all three uncontended rows.
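To make the two uncontended costs concrete, here is a minimal sketch of a single-head free list in both flavours. It is not the library's implementation: the class names, the use of block indices instead of pointers, and the 32-bit tag packed next to the index (a common guard against the ABA problem) are all assumptions chosen to keep the example short and correct.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

constexpr std::uint32_t kNil = 0xFFFFFFFFu;  // "no block" sentinel

// MUTEX flavour: every pop/push takes the lock, even with no contention.
class MutexFreeList {
public:
    explicit MutexFreeList(std::uint32_t count) {
        for (std::uint32_t i = count; i > 0; --i) free_.push_back(i - 1);
    }
    std::uint32_t pop() {
        std::lock_guard<std::mutex> guard(m_);
        if (free_.empty()) return kNil;
        const std::uint32_t idx = free_.back();
        free_.pop_back();
        return idx;
    }
    void push(std::uint32_t idx) {
        std::lock_guard<std::mutex> guard(m_);
        free_.push_back(idx);
    }
private:
    std::mutex m_;
    std::vector<std::uint32_t> free_;
};

// LOCKFREE flavour: one CAS on the shared head per successful operation.
// The head packs {tag, index}; the tag changes on every update so a recycled
// index cannot be mistaken for the old head (it wraps after 2^32 updates).
class LockFreeFreeList {
public:
    explicit LockFreeFreeList(std::uint32_t count) : next_(count) {
        // Chain 0 -> 1 -> ... -> count-1 -> nil.
        for (std::uint32_t i = 0; i < count; ++i)
            next_[i].store(i + 1 < count ? i + 1 : kNil, std::memory_order_relaxed);
        head_.store(pack(0, count > 0 ? 0u : kNil), std::memory_order_relaxed);
    }
    std::uint32_t pop() {
        std::uint64_t head = head_.load(std::memory_order_acquire);
        for (;;) {
            const std::uint32_t idx = index_of(head);
            if (idx == kNil) return kNil;
            const std::uint32_t next = next_[idx].load(std::memory_order_relaxed);
            if (head_.compare_exchange_weak(head, pack(tag_of(head) + 1, next),
                                            std::memory_order_acquire,
                                            std::memory_order_acquire))
                return idx;  // on failure `head` was reloaded; retry
        }
    }
    void push(std::uint32_t idx) {
        assert(idx < next_.size());
        std::uint64_t head = head_.load(std::memory_order_relaxed);
        for (;;) {
            next_[idx].store(index_of(head), std::memory_order_relaxed);
            if (head_.compare_exchange_weak(head, pack(tag_of(head) + 1, idx),
                                            std::memory_order_release,
                                            std::memory_order_relaxed))
                return;
        }
    }
private:
    static std::uint64_t pack(std::uint32_t tag, std::uint32_t idx) {
        return (static_cast<std::uint64_t>(tag) << 32) | idx;
    }
    static std::uint32_t tag_of(std::uint64_t h) { return static_cast<std::uint32_t>(h >> 32); }
    static std::uint32_t index_of(std::uint64_t h) { return static_cast<std::uint32_t>(h); }

    std::vector<std::atomic<std::uint32_t>> next_;
    std::atomic<std::uint64_t> head_{0};
};
```

In both versions every thread still touches the same head, which is why neither removes the cost entirely: the mutex version executes an atomic acquire and release around the list update, and the lock-free version executes at least one atomic read-modify-write per operation.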

Concurrent results — 4 threads (aggregate median ns/op)

Policy	pool alloc+free	`malloc` alloc+free	malloc / pool
`NONE` (T=1, clamped)¹	9.52	47.77	5.02×
`MUTEX` (T=4)	69.54	31.75	0.46×
`LOCKFREE` (T=4)	41.79	23.80	0.57×

¹ The NONE build is intentionally racy (spec §2.4), so the bench clamps it to a single thread — its row is the fast-path baseline the thread-safe modes are measured against, not a 4-thread number.
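The last column is a plain quotient of the two medians in each row. The small sketch below (helper names are hypothetical) reproduces it from the table values, rounded to the two decimals shown.

```cpp
#include <cassert>
#include <cmath>

// malloc / pool ratio: > 1 means the pool is faster, < 1 means malloc is faster.
inline double malloc_over_pool(double malloc_ns, double pool_ns) {
    assert(pool_ns > 0.0);
    return malloc_ns / pool_ns;
}

// Round to two decimals, matching the precision used in the table.
inline double round2(double x) { return std::round(x * 100.0) / 100.0; }
```

For example, `round2(malloc_over_pool(31.75, 69.54))` gives the 0.46 reported for `MUTEX`, meaning the pool takes a little over twice as long per operation as `malloc` in that configuration.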

Under contention, LOCKFREE (41.8 ns/op) beats MUTEX (69.5 ns/op). The single std::mutex serializes every operation; the lock-free CAS lets threads make progress without blocking, so it scales better on the same workload.
Both lose to malloc under contention (malloc/pool < 1). This is the expected architectural result: the pool has a single shared free-list head (one hot cache line that every thread fights over), while modern malloc implementations (and the Windows segment heap) typically spread contention across per-thread or per-processor arenas and caches. A single-head pool cannot out-scale a per-arena allocator no matter how clever the head's synchronization is.
The scaling answer for the pool is per-thread caches (the magazine / tcmalloc approach), which ADR-0020 §4 deliberately deferred — the Strategy seam keeps it a future, non-breaking addition. This benchmark is the evidence motivating that future work.
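For readers unfamiliar with the per-thread cache idea, the following is a rough sketch of the general technique, not a design taken from ADR-0020. Every name here (`SharedPool`, `ThreadCache`, `kBatch`, `kCapacity`) is hypothetical. Each thread owns one `ThreadCache`, so most allocate/deallocate calls touch only thread-local state and the shared lock is taken once per batch rather than once per operation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical shared pool with a single mutex-protected free list.
class SharedPool {
public:
    SharedPool(std::size_t block_size, std::size_t count)
        : storage_(block_size * count) {
        free_.reserve(count);
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(storage_.data() + i * block_size);
    }
    // Move up to `n` blocks into `out` under one lock acquisition.
    std::size_t take(void** out, std::size_t n) {
        std::lock_guard<std::mutex> guard(m_);
        std::size_t got = 0;
        while (got < n && !free_.empty()) {
            out[got++] = free_.back();
            free_.pop_back();
        }
        return got;
    }
    // Return `n` blocks under one lock acquisition.
    void give(void* const* in, std::size_t n) {
        std::lock_guard<std::mutex> guard(m_);
        for (std::size_t i = 0; i < n; ++i) free_.push_back(in[i]);
    }
private:
    std::mutex m_;
    std::vector<unsigned char> storage_;
    std::vector<void*> free_;
};

// Per-thread front end; one instance per thread, never shared.
class ThreadCache {
public:
    explicit ThreadCache(SharedPool& pool) : pool_(pool) {}
    ~ThreadCache() { pool_.give(slots_.data(), size_); }  // hand back leftovers

    void* allocate() {
        if (size_ == 0) size_ = pool_.take(slots_.data(), kBatch);  // refill
        return size_ > 0 ? slots_[--size_] : nullptr;
    }
    void deallocate(void* p) {
        if (size_ == kCapacity) {  // cache full: flush one batch to the pool
            size_ -= kBatch;
            pool_.give(slots_.data() + size_, kBatch);
        }
        slots_[size_++] = p;
    }
private:
    static constexpr std::size_t kBatch = 16, kCapacity = 32;
    SharedPool& pool_;
    std::array<void*, kCapacity> slots_{};
    std::size_t size_ = 0;
};
```

The trade-off is that blocks parked in one thread's cache are temporarily unavailable to other threads, so batch and capacity sizes bound how much memory can be stranded.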

Takeaways

The single-thread fast path is preserved at zero cost (NONE ≈ 9 ns/op interleaved, ~5× faster than malloc).
Pay only for what you use: thread safety is opt-in at compile time, and NONE adds nothing.
Among the thread-safe policies, LOCKFREE is the faster choice both uncontended and contended; MUTEX is the simpler, always-portable fallback.
Under multi-thread contention (4 threads on this host), neither single-head policy beats an arena-style allocator such as malloc; per-thread caches are the documented next step (ADR-0020 §4).
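The "pay only for what you use" point can be illustrated with a policy-based pool. This is a sketch under stated assumptions: `SingleThreadedPolicy` is the name used in this report, but its `lock()`/`unlock()` interface, the names `MutexPolicy` and `BasicPool`, and the `EXAMPLE_USE_MUTEX` switch are hypothetical. How the real build maps `PBR_MEMORY_POOL_THREAD_SAFETY` to a policy is not shown here, and a lock-free policy would need a different seam than lock/unlock.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Name taken from the report; the lock()/unlock() interface is an assumption.
struct SingleThreadedPolicy {
    void lock() noexcept {}    // empty bodies: the calls inline to nothing
    void unlock() noexcept {}
};

// Hypothetical name for a mutex-backed policy.
struct MutexPolicy {
    void lock() { m_.lock(); }
    void unlock() { m_.unlock(); }
private:
    std::mutex m_;
};

// Hypothetical pool parameterized on the synchronization policy.
template <typename Policy>
class BasicPool {
public:
    BasicPool(std::size_t block_size, std::size_t count)
        : storage_(block_size * count) {
        free_.reserve(count);
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(storage_.data() + i * block_size);
    }
    void* allocate() {
        std::lock_guard<Policy> guard(policy_);  // no-op for SingleThreadedPolicy
        if (free_.empty()) return nullptr;
        void* p = free_.back();
        free_.pop_back();
        return p;
    }
    void deallocate(void* p) {
        std::lock_guard<Policy> guard(policy_);
        free_.push_back(p);
    }
private:
    Policy policy_;
    std::vector<unsigned char> storage_;
    std::vector<void*> free_;
};

// Hypothetical compile-time switch standing in for the real build option.
#if defined(EXAMPLE_USE_MUTEX)
using Pool = BasicPool<MutexPolicy>;
#else
using Pool = BasicPool<SingleThreadedPolicy>;
#endif
```

Because the policy is a template parameter resolved at compile time, the single-threaded instantiation contains no synchronization instructions at all, which is consistent with the NONE column matching the earlier single-thread reference numbers.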

Reproduce

# one build per policy
cmake --preset bench -B build/bench-none                                    # NONE (default)
cmake --preset bench -B build/bench-mutex    -DPBR_MEMORY_POOL_THREAD_SAFETY=MUTEX
cmake --preset bench -B build/bench-lockfree -DPBR_MEMORY_POOL_THREAD_SAFETY=LOCKFREE
cmake --build build/bench-none && cmake --build build/bench-mutex && cmake --build build/bench-lockfree

# run each (paths are <build>/src/bench/cpp/it/d4np/memorypool/pool_vs_malloc_bench)
<bin> --scenario all --threads 4
