Commit 1d56cfd

Merge pull request #4 from joz-k/transaction_write_api ("Transaction write api")
2 parents: 27ac55c + adc1f35

4 files changed

Lines changed: 238 additions & 112 deletions

README.md

Lines changed: 8 additions & 45 deletions
@@ -4,7 +4,7 @@
 
 # C++23 Lock-Free SPSC Queue
 
-A high-performance, single-producer, single-consumer (SPSC) queue implemented in modern C++23.
+A high-performance, _batch-oriented_, single-producer, single-consumer (SPSC) queue implemented in modern C++23.
 
 This project provides a robust, tested, lock-free queue that is suitable for high-performance applications, such as real-time audio or low-latency trading systems, where data must be exchanged between two threads with minimal overhead.
 

@@ -18,10 +18,13 @@ This project provides a robust, tested, lock-free queue that is suitable for hig
 - **Cache-Friendly:** The queue is optimized for multi-core performance.
   1. It uses `alignas` to place producer and consumer data on separate cache lines, preventing "false sharing."
   2. It implements a performance optimization by **caching indices per core**. Each thread maintains a local, non-atomic cache of the other thread's position, minimizing expensive cross-core atomic operations.[^1]
-- **`JUCE::AbstractFifo`-inspired Design:** The API manages two indices for a user-provided buffer, giving the user full control over memory allocation.
+- **[`JUCE::AbstractFifo`][JuceFifo]-inspired Design:** The API manages two indices for a user-provided buffer, giving the user full control over memory allocation.
 - **Tested:** Includes a test suite built with CMake and CTest.
+- **Highly-Performant:** Optimized for batch-oriented use cases. It scales well with larger batches, often surpassing the [performance](benchmarks/README.md) of other industry-standard solutions.
 
-[^1]: See ["MCRingBuffer"](https://www.cse.cuhk.edu.hk/~pclee/www/pubs/ancs09poster.pdf) paper or Erik Rigtorp's [Optimizing a ring buffer for throughput](https://rigtorp.se/ringbuffer/).
+
+[JuceFifo]: https://docs.juce.com/master/classAbstractFifo.html
+
+[^1]: See ["MCRingBuffer"](https://www.cse.cuhk.edu.hk/~pclee/www/pubs/ancs09poster.pdf) paper or Rigtorp's [Optimizing a ring buffer for throughput](https://rigtorp.se/ringbuffer/).
 
 ## Core Concept: The Circular Buffer
 
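The two techniques named in the "Cache-Friendly" bullet above (cache-line separation via `alignas` plus a local cache of the other thread's index) can be sketched in a few lines. This is an illustrative toy only, not this library's actual code; `kCacheLine = 64` is an assumption about typical hardware:

```cpp
#include <atomic>
#include <cstddef>

constexpr std::size_t kCacheLine = 64; // assumed typical cache-line size

// Producer- and consumer-owned data live on separate cache lines to prevent
// false sharing; each side also keeps a plain, non-atomic cache of the other
// side's index to avoid cross-core atomic traffic on the common path.
struct SpscIndices {
    alignas(kCacheLine) std::atomic<std::size_t> write_pos{0}; // producer-owned
    std::size_t cached_read_pos{0};  // producer's cache of read_pos

    alignas(kCacheLine) std::atomic<std::size_t> read_pos{0};  // consumer-owned
    std::size_t cached_write_pos{0}; // consumer's cache of write_pos
};

// Producer-side free-space query: consult the cached consumer index first, and
// refresh it from the shared atomic only when the cache says the queue is full.
std::size_t producer_free_space(SpscIndices& idx, std::size_t capacity) {
    const std::size_t w = idx.write_pos.load(std::memory_order_relaxed);
    std::size_t used = w - idx.cached_read_pos;
    if (used == capacity) { // looks full: pay for one cross-core load
        idx.cached_read_pos = idx.read_pos.load(std::memory_order_acquire);
        used = w - idx.cached_read_pos;
    }
    return capacity - used;
}
```

With monotonically increasing indices, `write_pos - read_pos` is the fill level, and the cached value is only refreshed when it would otherwise block progress.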

@@ -256,49 +259,9 @@ add_executable(MyAwesomeApp src/main.cpp)
 target_include_directories(MyAwesomeApp PRIVATE external/LockFreeSpscQueue/include)
 ```
 
-## (Advanced) Performance Benchmarks
-
-This project includes a simple performance benchmark suite using the [Google Benchmark](https://github.com/google/benchmark) library to measure queue throughput.
-
-The benchmarks are **disabled by default** to keep configuration and build times fast for users who only want to integrate the library.
-
-### How to Run Benchmarks
-
-1. **Configure CMake with benchmarks enabled:**
-   You must explicitly enable the option `SPSC_QUEUE_BUILD_BENCHMARKS` when running CMake.
+## Performance Benchmarks
 
-   ```sh
-   # From the project root directory
-   cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DSPSC_QUEUE_BUILD_BENCHMARKS=ON
-   ```
-   *Note: The first time you run this, CMake will download the Google Benchmark source code, which may take a moment.*
-
-2. **Build the project:**
-   This will now build the `queue_benchmark` executable in addition to any other enabled targets.
-   ```sh
-   cmake --build build
-   ```
-
-3. **Run the benchmark executable:**
-   ```sh
-   ./build/benchmarks/queue_benchmark
-   ```
-
-### Example Benchmark Output
-
-You will see detailed output measuring the performance for different batch sizes. The most important column is `Items/s`, which shows the throughput in millions of items per second.
-
-```
-------------------------------------------------------------------------
-Benchmark                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------
-BM_QueueThroughput/1    0.042 ms        0.042 ms        16717 Items/s=195.043M/s
-BM_QueueThroughput/4    0.011 ms        0.011 ms        66833 Items/s=777.743M/s
-BM_QueueThroughput/16   0.003 ms        0.003 ms       266026 Items/s=3.10748G/s
-BM_QueueThroughput/64   0.001 ms        0.001 ms      1049349 Items/s=12.3991G/s
-BM_QueueThroughput/256  0.000 ms        0.000 ms      2211341 Items/s=25.924G/s
-```
-This output clearly shows how performance dramatically increases when transferring items in batches compared to one by one.
+This project includes a benchmark suite, using the [Google Benchmark](https://github.com/google/benchmark) library, to measure queue throughput and compare it against [`moodycamel::ReaderWriterQueue`](https://github.com/cameron314/readerwriterqueue). For instructions on how to run the benchmarks, see the [Benchmarks](benchmarks/README.md) section.
 
 
 ## Disclaimers

benchmarks/README.md

Lines changed: 105 additions & 0 deletions
@@ -0,0 +1,105 @@
# Benchmarks

This directory contains the performance benchmark suite for the `LockFreeSpscQueue` project, built using the [Google Benchmark](https://github.com/google/benchmark) library.

The suite includes two primary executables:

1. **`queue_benchmark`:** A simple performance benchmark that measures the raw throughput of this library's batching API.
2. **`queue_comparison_benchmark`:** A comparative benchmark that stress-tests this library against the industry-standard [`moodycamel::ReaderWriterQueue`](https://github.com/cameron314/readerwriterqueue).

The benchmarks are **disabled by default** to keep configuration and build times fast for users who only want to integrate the library.

### How to Run

It is **critical** to use the `Release` build type, as benchmarking a `Debug` build will produce meaningless results.

1. **Configure CMake:**
   You must explicitly enable the desired benchmark option when running CMake.

   ```sh
   # To build ONLY the internal benchmark:
   cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DSPSC_QUEUE_BUILD_BENCHMARKS=ON

   # To build ONLY the comparison benchmark (most common):
   cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DSPSC_QUEUE_BUILD_BENCHMARK_COMPARE=ON
   ```

   *Note: The first time you run this, CMake will download the required dependency libraries, which may take a moment.*

2. **Build the Project:**

   ```sh
   cmake --build build --config Release
   ```

3. **Run the Executables:**

   ```sh
   # On Linux/macOS
   ./build/benchmarks/queue_comparison_benchmark

   # On Windows
   .\build\benchmarks\queue_comparison_benchmark.exe
   ```

---

## Raw Queue Speed Result

When running the [`queue_benchmark`](queue_benchmark.cpp), you will see detailed output measuring the performance for different batch sizes. The most important column is `Items/s`, which shows the throughput in items per second (the `M`/`G` suffixes denote millions/billions).

Example output:

```
------------------------------------------------------------------------
Benchmark                   Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------
BM_QueueThroughput/1    0.042 ms        0.042 ms        16717 Items/s=195.043M/s
BM_QueueThroughput/4    0.011 ms        0.011 ms        66833 Items/s=777.743M/s
BM_QueueThroughput/16   0.003 ms        0.003 ms       266026 Items/s=3.10748G/s
BM_QueueThroughput/64   0.001 ms        0.001 ms      1049349 Items/s=12.3991G/s
BM_QueueThroughput/256  0.000 ms        0.000 ms      2211341 Items/s=25.924G/s
```
This output clearly shows how performance dramatically increases when transferring items in batches compared to one by one.
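The shape of this scaling follows from a simple amortization argument: each transfer pays a roughly fixed synchronization overhead plus a per-item copy cost, so the cost per item falls toward the copy cost as the batch grows. A back-of-the-envelope sketch (the numbers are illustrative assumptions, not measurements):

```cpp
// Illustrative cost model: transferring a batch of b items costs one fixed
// synchronization overhead S plus b copies of cost c each, so the per-item
// cost is c + S / b, which flattens toward c as b grows.
double cost_per_item(double sync_cost, double copy_cost, double batch_size) {
    return copy_cost + sync_cost / batch_size;
}
```

For example, with an assumed overhead of 100 units and a copy cost of 1 unit, the per-item cost drops from 101 at batch size 1 to 7.25 at batch size 16, mirroring the diminishing returns visible in the measurements.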

## Performance Comparison: Results & Analysis

The following are example results from running the [`queue_comparison_benchmark`](queue_comparison_benchmark.cpp) on an Apple MacBook Air (M2). The benchmark transfers a sequence of pseudo-random `int64_t` values and simultaneously verifies data integrity, making it both a performance and correctness stress test.

### Benchmark Data

```
------------------------------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations UserCounters...
------------------------------------------------------------------------------------------
BM_ThisQueue_SingleItem            2.32 ms         2.32 ms          320 items_per_second=43.2044M/s
BM_ThisQueue_SingleItem_Write256  0.108 ms        0.108 ms         6466 items_per_second=924.364M/s
BM_Moodycamel_SingleItem          0.245 ms        0.245 ms         2932 items_per_second=408.063M/s
BM_ThisQueue_Batch/4              0.686 ms        0.686 ms          983 items_per_second=145.91M/s
BM_ThisQueue_Batch/8              0.151 ms        0.151 ms         4561 items_per_second=662.117M/s
BM_ThisQueue_Batch/16             0.114 ms        0.114 ms         6597 items_per_second=878.876M/s
BM_ThisQueue_Batch/64             0.093 ms        0.093 ms         7487 items_per_second=1.07577G/s
BM_ThisQueue_Batch/256            0.086 ms        0.086 ms         8186 items_per_second=1.16642G/s
BM_Moodycamel_Batch/4             0.243 ms        0.243 ms         2863 items_per_second=412.17M/s
BM_Moodycamel_Batch/8             0.239 ms        0.239 ms         2933 items_per_second=418.038M/s
BM_Moodycamel_Batch/16            0.240 ms        0.240 ms         2896 items_per_second=416.377M/s
BM_Moodycamel_Batch/64            0.242 ms        0.242 ms         2880 items_per_second=413.098M/s
BM_Moodycamel_Batch/256           0.242 ms        0.242 ms         2895 items_per_second=414.29M/s
```

### Key Takeaways

The benchmark data reveals a clear performance profile based on the architectural trade-offs of each library.

* **1. Single-Item Transfers (`ThisQueue_SingleItem`):** For individual, non-batched item transfers, `moodycamel::ReaderWriterQueue` shows significantly higher throughput (`~408 M/s`) than this library's low-level `prepare_write` path (`~43 M/s`). The difference stems from a fundamental architectural distinction: `moodycamel::ReaderWriterQueue` uses a "queue-of-queues" design (a linked list of smaller ring buffers), which decouples the threads enough that its producer can often operate on a local block with minimal need to synchronize with the consumer's global position. This library instead uses a single, contiguous ring buffer, so the producer and consumer states are more tightly coupled, and the per-transaction cost for single items is higher.

* **2. Transactional Writes (`ThisQueue_SingleItem_Write256`):** For high-frequency workloads where multiple items are produced in a tight loop, this library's `WriteTransaction` API provides the highest performance. By amortizing the cost of atomic synchronization over a large number of fast, non-atomic `try_push` calls, it achieves a throughput of **`~924 M/s`**, more than double that of Moodycamel's single-item API.

* **3. Batch Transfers (`Batch`):** This library's `try_write` API, designed for transferring pre-prepared blocks of data, scales strongly as the batch size increases. The data shows a clear crossover point: while Moodycamel is faster for very small batches (e.g., size 4), this library's throughput surpasses it significantly for batches of roughly **8-16 items or more**, reaching over **`1.1 G/s`** at a batch size of 256. The `moodycamel::ReaderWriterQueue` (v1.0.7) API, in contrast, offers no non-blocking bulk enqueue method, so its throughput stays flat at the single-item rate.

### Conclusion

The data shows that this library is highly specialized for batch-oriented and high-frequency transactional workloads. While other libraries may offer higher performance for pure single-item transfers due to their architectural design, this queue's specialized APIs provide a significant throughput advantage in its intended use cases.

The recommended usage patterns are:

* For producers generating many items in a tight loop, the **`WriteTransaction`** API offers the best performance.
* For transferring pre-prepared blocks of data, the **`try_write`** lambda API is the most efficient method, especially for batch sizes of 16 or more.
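The `WriteTransaction` pattern recommended above (and exercised in `queue_comparison_benchmark.cpp` via `try_start_write`/`try_push`) amortizes one atomic publish over many plain writes. The following is a single-threaded toy model of that idea; the names mirror the benchmark's usage, but this is not the library's implementation:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Toy transactional writer: claim a span of the ring once, fill it with plain
// (non-atomic) stores via try_push, and publish everything with a single
// release store when the transaction ends.
class ToyQueue {
    std::vector<long> buf_;
    std::atomic<std::size_t> write_pos_{0};
    std::atomic<std::size_t> read_pos_{0};

public:
    explicit ToyQueue(std::size_t capacity) : buf_(capacity) {}

    class Transaction {
    public:
        Transaction(ToyQueue& q, std::size_t begin, std::size_t end)
            : q_(q), pos_(begin), end_(end) {}
        // Publish all pushed items with one atomic store (the "commit").
        ~Transaction() { q_.write_pos_.store(pos_, std::memory_order_release); }
        bool try_push(long v) {
            if (pos_ == end_) return false;      // reserved span exhausted
            q_.buf_[pos_ % q_.buf_.size()] = v;  // plain store, no atomics
            ++pos_;
            return true;
        }
    private:
        ToyQueue& q_;
        std::size_t pos_, end_;
    };

    // Reserve up to `want` slots; the transaction commits on destruction.
    Transaction start_write(std::size_t want) {
        const std::size_t w = write_pos_.load(std::memory_order_relaxed);
        const std::size_t r = read_pos_.load(std::memory_order_acquire);
        const std::size_t free_slots = buf_.size() - (w - r);
        return Transaction(*this, w, w + (want < free_slots ? want : free_slots));
    }

    std::size_t size() const {
        return write_pos_.load(std::memory_order_acquire) -
               read_pos_.load(std::memory_order_relaxed);
    }
};
```

Nothing becomes visible to the consumer until the transaction goes out of scope, which is exactly why the per-item synchronization cost shrinks as more items are pushed per transaction.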

benchmarks/queue_comparison_benchmark.cpp

Lines changed: 25 additions & 67 deletions
@@ -12,9 +12,6 @@ constexpr size_t RandomDataSize = 4001; // The prime to make patterns less
 constexpr size_t ItemsPerIteration = 100'025;
 constexpr size_t DefaultQueueCapacity = 65536; // 2^16
 
-// A capacity guaranteed to be larger than ItemsPerIteration.
-constexpr size_t LargeQueueCapacity = 262144; // 2^18
-
 // Shared Test Data Generation
 const std::vector<DataType>& get_random_data() {
     static const auto data = []{
@@ -27,9 +24,9 @@ const std::vector<DataType>& get_random_data() {
     return data;
 }
 
-// Benchmark Group 1: Single Item, Default "Stalling" Capacity
+// Benchmark Group 1A: Single Item
 
-static void BM_ThisQueue_SingleItem_DefaultBuffer(benchmark::State& state) {
+static void BM_ThisQueue_SingleItem(benchmark::State& state) {
     const auto& random_data = get_random_data();
     std::vector<DataType> shared_buffer(DefaultQueueCapacity);
     LockFreeSpscQueue<DataType> queue(shared_buffer);
@@ -69,51 +66,17 @@ static void BM_ThisQueue_SingleItem_DefaultBuffer(benchmark::State& state) {
     consumer_should_stop.store(true, std::memory_order_relaxed);
     state.SetItemsProcessed(total_written);
 }
-BENCHMARK(BM_ThisQueue_SingleItem_DefaultBuffer)->Unit(benchmark::kMillisecond)->UseRealTime();
-
-static void BM_Moodycamel_SingleItem_DefaultBuffer(benchmark::State& state) {
-    const auto& random_data = get_random_data();
-    moodycamel::ReaderWriterQueue<DataType> queue(DefaultQueueCapacity);
-    std::atomic<bool> verification_failed = false;
-    std::atomic<bool> consumer_should_stop = false;
-    std::jthread consumer([&] {
-        size_t i = 0;
-        DataType item;
-        while (!consumer_should_stop.load(std::memory_order_relaxed)) {
-            if (queue.try_dequeue(item)) {
-                if (item != random_data[i % RandomDataSize]) verification_failed.store(true);
-                i++;
-            } else {
-                std::this_thread::yield();
-            }
-        }
-    });
-
-    size_t total_written = 0;
-    for (auto _ : state) {
-        for (size_t n = 0; n < ItemsPerIteration; ++n) {
-            if (verification_failed.load(std::memory_order_relaxed)) {
-                state.SkipWithError("Verification failed!"); return;
-            }
-            const auto& item_to_write = random_data[total_written % RandomDataSize];
-            while (!queue.try_enqueue(item_to_write)) {}
-            total_written++;
-        }
-    }
-    consumer_should_stop.store(true, std::memory_order_relaxed);
-    state.SetItemsProcessed(total_written);
-}
-BENCHMARK(BM_Moodycamel_SingleItem_DefaultBuffer)->Unit(benchmark::kMillisecond)->UseRealTime();
+BENCHMARK(BM_ThisQueue_SingleItem)->Unit(benchmark::kMillisecond)->UseRealTime();
 
 
-// Benchmark Group 2: Single Item, Large, hopefully "Never-Full" Capacity
+// Benchmark Group 1B: Single Item, "Transaction" Bulk-Write API
 
-static void BM_OurQueue_SingleItem_LargeBuffer(benchmark::State& state) {
+static void BM_ThisQueue_SingleItem_Write256(benchmark::State& state) {
     const auto& random_data = get_random_data();
-    std::vector<DataType> shared_buffer(LargeQueueCapacity);
+    std::vector<DataType> shared_buffer(DefaultQueueCapacity);
     LockFreeSpscQueue<DataType> queue(shared_buffer);
 
-    std::atomic<bool> verification_failed = false;
+    std::atomic<bool> verification_failed = false;
     std::atomic<bool> consumer_should_stop = false;
     std::jthread consumer([&] {
         size_t i = 0;
@@ -130,31 +93,28 @@ static void BM_OurQueue_SingleItem_LargeBuffer(benchmark::State& state) {
 
     size_t total_written = 0;
     for (auto _ : state) {
-        for (size_t n = 0; n < ItemsPerIteration; ++n) {
-            if (verification_failed.load(std::memory_order_relaxed)) {
-                state.SkipWithError("Verification failed!"); return;
-            }
-            const auto& item_to_write = random_data[total_written % RandomDataSize];
-            while (true) {
-                auto scope = queue.prepare_write(1);
-                if (scope.get_items_written() == 1) {
-                    scope.get_block1()[0] = item_to_write;
-                    break;
+        for (size_t n = 0; n < ItemsPerIteration; ) {
+            // Start a transaction for up to 256 items at a time
+            auto transaction = queue.try_start_write(256);
+            if (transaction) {
+                while (n < ItemsPerIteration && transaction->try_push(random_data[total_written % RandomDataSize])) {
+                    total_written++;
+                    n++;
                 }
-                // This spin-wait should theoretically never be hit when the buffer is large.
+                // Transaction commits automatically when it goes out of scope here.
             }
-            total_written++;
         }
     }
     consumer_should_stop.store(true, std::memory_order_relaxed);
     state.SetItemsProcessed(total_written);
 }
-BENCHMARK(BM_OurQueue_SingleItem_LargeBuffer)->Unit(benchmark::kMillisecond)->UseRealTime();
+BENCHMARK(BM_ThisQueue_SingleItem_Write256)->Unit(benchmark::kMillisecond)->UseRealTime();
 
 
-static void BM_Moodycamel_SingleItem_LargeBuffer(benchmark::State& state) {
+
+static void BM_Moodycamel_SingleItem(benchmark::State& state) {
     const auto& random_data = get_random_data();
-    moodycamel::ReaderWriterQueue<DataType> queue(LargeQueueCapacity);
-    std::atomic<bool> verification_failed = false;
+    moodycamel::ReaderWriterQueue<DataType> queue(DefaultQueueCapacity);
+    std::atomic<bool> verification_failed = false;
     std::atomic<bool> consumer_should_stop = false;
     std::jthread consumer([&] {
         size_t i = 0;
@@ -176,19 +136,17 @@ static void BM_Moodycamel_SingleItem_LargeBuffer(benchmark::State& state) {
                state.SkipWithError("Verification failed!"); return;
             }
             const auto& item_to_write = random_data[total_written % RandomDataSize];
-            while (!queue.try_enqueue(item_to_write)) {
-                // This spin-wait should theoretically never be hit.
-            }
+            while (!queue.try_enqueue(item_to_write)) {}
             total_written++;
         }
     }
     consumer_should_stop.store(true, std::memory_order_relaxed);
     state.SetItemsProcessed(total_written);
 }
-BENCHMARK(BM_Moodycamel_SingleItem_LargeBuffer)->Unit(benchmark::kMillisecond)->UseRealTime();
+BENCHMARK(BM_Moodycamel_SingleItem)->Unit(benchmark::kMillisecond)->UseRealTime();
 
 
-// Benchmark Group 3: Batch/Bulk Transfers (with Default Capacity)
+// Benchmark Group 2: Batch/Bulk Transfers
 
 static void BM_ThisQueue_Batch(benchmark::State& state) {
     const size_t batch_size = state.range(0);
@@ -237,7 +195,7 @@ static void BM_ThisQueue_Batch(benchmark::State& state) {
     consumer_should_stop.store(true, std::memory_order_relaxed);
     state.SetItemsProcessed(total_written);
 }
-BENCHMARK(BM_ThisQueue_Batch)->Arg(4)->Arg(16)->Arg(64)->Arg(256)->Unit(benchmark::kMillisecond)->UseRealTime();
+BENCHMARK(BM_ThisQueue_Batch)->Arg(4)->Arg(8)->Arg(16)->Arg(64)->Arg(256)->Unit(benchmark::kMillisecond)->UseRealTime();
 
 
 static void BM_Moodycamel_Batch(benchmark::State& state) {
@@ -280,6 +238,6 @@ static void BM_Moodycamel_Batch(benchmark::State& state) {
     consumer_should_stop.store(true, std::memory_order_relaxed);
     state.SetItemsProcessed(total_written);
 }
-BENCHMARK(BM_Moodycamel_Batch)->Arg(4)->Arg(16)->Arg(64)->Arg(256)->Unit(benchmark::kMillisecond)->UseRealTime();
+BENCHMARK(BM_Moodycamel_Batch)->Arg(4)->Arg(8)->Arg(16)->Arg(64)->Arg(256)->Unit(benchmark::kMillisecond)->UseRealTime();
 
 BENCHMARK_MAIN();
