# C++23 Lock-Free SPSC Queue
A high-performance, _batch-oriented_, single-producer, single-consumer (SPSC) queue implemented in modern C++23.
This project provides a robust, tested, lock-free queue that is suitable for high-performance applications, such as real-time audio or low-latency trading systems, where data must be exchanged between two threads with minimal overhead.
- **Cache-Friendly:** The queue is optimized for multi-core performance.
  1. It uses `alignas` to place producer and consumer data on separate cache lines, preventing "false sharing."
  2. It implements a performance optimization by **caching indices per core**. Each thread maintains a local, non-atomic cache of the other thread's position, minimizing expensive cross-core atomic operations.[^1]
- **[`JUCE::AbstractFifo`][JuceFifo]-inspired Design:** The API manages two indices for a user-provided buffer, giving the user full control over memory allocation.
- **Tested:** Includes a test suite built with CMake and CTest.
- **Highly-Performant:** Optimized for batch-oriented use cases. It scales well with larger batches, often surpassing the [performance](benchmarks/README.md) of other industry-standard solutions.
[^1]: See the ["MCRingBuffer"](https://www.cse.cuhk.edu.hk/~pclee/www/pubs/ancs09poster.pdf) paper or Rigtorp's [Optimizing a ring buffer for throughput](https://rigtorp.se/ringbuffer/).
This project includes a benchmark suite, using the [Google Benchmark](https://github.com/google/benchmark) library, to measure queue throughput and compare it against [`moodycamel::ReaderWriterQueue`](https://github.com/cameron314/readerwriterqueue). For instructions on how to run the benchmarks, see the [Benchmarks](benchmarks/README.md) section.
---

This directory contains the performance benchmark suite for the `LockFreeSpscQueue` project, built using the [Google Benchmark](https://github.com/google/benchmark) library.

The suite includes two primary executables:

1. **`queue_benchmark`:** A simple performance benchmark that measures the raw throughput of this library's batching API.
2. **`queue_comparison_benchmark`:** A comparative benchmark that stress-tests this library against the industry-standard [`moodycamel::ReaderWriterQueue`](https://github.com/cameron314/readerwriterqueue).
The benchmarks are **disabled by default** to keep configuration and build times fast for users who only want to integrate the library.
### How to Run
It is **critical** to use the `Release` build type, as benchmarking a `Debug` build will produce meaningless results.
1. **Configure CMake:**

   You must explicitly enable the desired benchmark option when running CMake.

   *Note: The first time you run this, CMake will download the required dependency libraries, which may take a moment.*
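   A plausible configure invocation might look like the following; the `SPSC_QUEUE_BUILD_BENCHMARKS` option name comes from the main README, while the `-S`/`-B` paths are assumptions to adapt to your checkout:

   ```sh
   # Assumed paths and option name; adjust to the project's actual CMake options.
   cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DSPSC_QUEUE_BUILD_BENCHMARKS=ON
   ```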
2. **Build the Project:**

   ```sh
   cmake --build build --config Release
   ```

3. **Run the Executables:**

   ```sh
   # On Linux/macOS
   ./build/benchmarks/queue_comparison_benchmark

   # On Windows
   .\build\benchmarks\queue_comparison_benchmark.exe
   ```
---
## Raw Queue Speed Result
When running the [`queue_benchmark`](queue_benchmark.cpp), you will see detailed output measuring the performance for different batch sizes. The most important column is `Items/s`, which shows the throughput in millions of items per second.

```
BM_QueueThroughput/1      0.042 ms    0.042 ms      16717   Items/s=195.043M/s
BM_QueueThroughput/4      0.011 ms    0.011 ms      66833   Items/s=777.743M/s
BM_QueueThroughput/16     0.003 ms    0.003 ms     266026   Items/s=3.10748G/s
BM_QueueThroughput/64     0.001 ms    0.001 ms    1049349   Items/s=12.3991G/s
BM_QueueThroughput/256    0.000 ms    0.000 ms    2211341   Items/s=25.924G/s
```
This output clearly shows how performance dramatically increases when transferring items in batches compared to one by one.
## Performance Comparison: Results & Analysis
The following are example results from running the [`queue_comparison_benchmark`](queue_comparison_benchmark.cpp) on an Apple MacBook Air (M2). The benchmark transfers a sequence of pseudo-random `int64_t` values and simultaneously verifies data integrity, making it both a performance and correctness stress test.

```
BM_ThisQueue_SingleItem             2.32 ms     2.32 ms      320   items_per_second=43.2044M/s
BM_ThisQueue_SingleItem_Write256    0.108 ms    0.108 ms    6466   items_per_second=924.364M/s
BM_Moodycamel_SingleItem            0.245 ms    0.245 ms    2932   items_per_second=408.063M/s
BM_ThisQueue_Batch/4                0.686 ms    0.686 ms     983   items_per_second=145.91M/s
BM_ThisQueue_Batch/8                0.151 ms    0.151 ms    4561   items_per_second=662.117M/s
BM_ThisQueue_Batch/16               0.114 ms    0.114 ms    6597   items_per_second=878.876M/s
BM_ThisQueue_Batch/64               0.093 ms    0.093 ms    7487   items_per_second=1.07577G/s
BM_ThisQueue_Batch/256              0.086 ms    0.086 ms    8186   items_per_second=1.16642G/s
BM_Moodycamel_Batch/4               0.243 ms    0.243 ms    2863   items_per_second=412.17M/s
BM_Moodycamel_Batch/8               0.239 ms    0.239 ms    2933   items_per_second=418.038M/s
BM_Moodycamel_Batch/16              0.240 ms    0.240 ms    2896   items_per_second=416.377M/s
BM_Moodycamel_Batch/64              0.242 ms    0.242 ms    2880   items_per_second=413.098M/s
BM_Moodycamel_Batch/256             0.242 ms    0.242 ms    2895   items_per_second=414.29M/s
```
### Key Takeaways
The benchmark data reveals a clear performance profile based on the architectural trade-offs of each library.
* **1. Single-Item Transfers (`ThisQueue_SingleItem`):** For individual, non-batched item transfers, `moodycamel::ReaderWriterQueue` shows significantly higher throughput (`~408 M/s`) compared to this library's low-level `prepare_write` path (`~43 M/s`). This performance difference is attributed to a fundamental architectural distinction. `moodycamel::ReaderWriterQueue` uses a "queue-of-queues" design (a linked list of smaller ring buffers). This structure provides a higher degree of decoupling, allowing its producer to often operate on a local block with minimal need to synchronize with the consumer's global position. In contrast, this library uses a single, contiguous ring buffer, which results in a tighter coupling between the producer and consumer states, leading to a higher per-transaction cost for single items.
* **2. Transactional Writes (`ThisQueue_SingleItem_Write256`):** For high-frequency workloads where multiple items are produced in a tight loop, this library's `WriteTransaction` API provides the highest performance. By amortizing the cost of atomic synchronization over a large number of fast, non-atomic `try_push` calls, it achieves a throughput of **`~924 M/s`**, more than double that of Moodycamel's single-item API.
* **3. Batch Transfers (`Batch`):** This library's `try_write` API, designed for transferring pre-prepared blocks of data, demonstrates strong scaling performance as the batch size increases. The data shows a clear crossover point: while Moodycamel is faster for very small batches (e.g., size 4), this library's throughput surpasses it significantly for batches of approximately **8-16 items or more**, reaching over **`1.1 G/s`** at a batch size of 256. In contrast, the `moodycamel::ReaderWriterQueue` (v1.0.7) API does not offer a non-blocking bulk enqueue method, and its performance remains flat at its single-item throughput rate.
### Conclusion
The data shows that this library is highly specialized for batch-oriented and high-frequency transactional workloads. While other libraries may offer higher performance for pure single-item transfers due to their architectural design, this queue's specialized APIs provide a significant throughput advantage in its intended use cases.
The recommended usage patterns are:
* For producers generating many items in a tight loop, the **`WriteTransaction`** API offers the best performance.
* For transferring pre-prepared blocks of data, the **`try_write`** lambda API is the most efficient method, especially for batch sizes of 16 or more.