Comparative benchmark for Milestone 4.5 (spec §6.3 re-run across the ADR-0020 thread-safety policies). Methodology: ADR-0014, extended in M4.5 with the concurrent scenario (T threads each running the interleaved alloc/free loop against a shared pool, reporting aggregate ns/op = wall-time ÷ total ops). Produced by pool_vs_malloc_bench built three times — once per PBR_MEMORY_POOL_THREAD_SAFETY value — with --scenario all --threads 4.
| Field | Value |
|---|---|
| CPU | Intel Core i5-6600K (Skylake, 4 cores / 4 threads) @ 3.5 GHz |
| RAM | 32 GB |
| OS | Windows 10 Pro 19045 |
| Compiler | MSVC 19.51 (_MSC_FULL_VER 195136247), Release |
| `alignof(std::max_align_t)` | 8 bytes |
| `hardware_concurrency()` | 4 |
| Config | iterations=1000000 repeats=10 block_size=64 (first repeat dropped as warm-up); concurrent threads=4 |
Numbers are run-to-run noisy on a desktop OS (median is the headline statistic). The point is the relative picture across policies, not absolute nanoseconds.
The bulk and interleaved scenarios run on one thread — the spec §6.3 measurement, now showing the uncontended cost of each policy.
| Scenario | region | `NONE` | `MUTEX` | `LOCKFREE` | `malloc` |
|---|---|---|---|---|---|
| bulk | pool alloc | 11.80 | 34.69 | 18.61 | ~71 |
| bulk | pool free | 10.36 | 56.17 | 22.73 | ~41 |
| interleaved | alloc+free | 9.32 | 47.19 | 31.74 | ~47 |

- `NONE` is the fast path, unchanged from v0.2.0/v0.3.0: `SingleThreadedPolicy` inlines to byte-identical code, so the single-thread numbers match the M2.9 reference (interleaved ≈ 9 ns/op). Spec §2.4's "preserve the single-thread fast path" mandate holds, measurably.
- Synchronization has a real uncontended cost. Even with zero contention, `MUTEX` pays the lock/unlock (interleaved 47 ns/op ≈ 5× `NONE`) and `LOCKFREE` pays the CAS + acquire/release fences (32 ns/op ≈ 3.4× `NONE`). `LOCKFREE` is cheaper than `MUTEX` uncontended.

| Policy | pool alloc+free | malloc alloc+free | malloc / pool |
|---|---|---|---|
| `NONE` (T=1, clamped)¹ | 9.52 | 47.77 | 5.02× |
| `MUTEX` (T=4) | 69.54 | 31.75 | 0.46× |
| `LOCKFREE` (T=4) | 41.79 | 23.80 | 0.57× |

¹ The NONE build is intentionally racy (spec §2.4), so the bench clamps it to a single thread — its row is the fast-path baseline the thread-safe modes are measured against, not a 4-thread number.

- Under contention, `LOCKFREE` (41.8 ns/op) beats `MUTEX` (69.5 ns/op). The single `std::mutex` serializes every operation; the lock-free CAS lets threads make progress without blocking, so it scales better on the same workload.
- Both lose to `malloc` under contention (malloc/pool < 1). This is the expected architectural result: the pool has a single shared free-list head (one hot cache line every thread fights over), while modern `malloc` (and the Windows segment heap) spread contention across per-thread arenas. A single-head pool cannot out-scale a per-arena allocator no matter how clever the head's synchronization is.
- The scaling answer for the pool is per-thread caches (the magazine / tcmalloc approach), which ADR-0020 §4 deliberately deferred; the Strategy seam keeps it a future, non-breaking addition. This benchmark is the evidence motivating that future work.

- The single-thread fast path is preserved at zero cost (`NONE` ≈ 9 ns/op interleaved, ~5× faster than `malloc`).
- Pay only for what you use: thread safety is opt-in at compile time, and `NONE` adds nothing.
- Among the thread-safe policies, `LOCKFREE` is the faster choice both uncontended and contended; `MUTEX` is the simpler, always-portable fallback.
- For high core-count contention, neither single-head policy beats an arena allocator; per-thread caches are the documented next step (ADR-0020 §4).

```sh
# one build per policy
cmake --preset bench -B build/bench-none        # NONE (default)
cmake --preset bench -B build/bench-mutex    -DPBR_MEMORY_POOL_THREAD_SAFETY=MUTEX
cmake --preset bench -B build/bench-lockfree -DPBR_MEMORY_POOL_THREAD_SAFETY=LOCKFREE
cmake --build build/bench-none && cmake --build build/bench-mutex && cmake --build build/bench-lockfree

# run each (paths are <build>/src/bench/cpp/it/d4np/memorypool/pool_vs_malloc_bench)
<bin> --scenario all --threads 4
```