# C++23 Lock-Free SPSC Queue
A high-performance, _batch-oriented_, single-producer, single-consumer (SPSC) queue implemented in modern C++23.
This project provides a robust, tested, lock-free queue that is suitable for high-performance applications, such as real-time audio or low-latency trading systems, where data must be exchanged between two threads with minimal overhead.
- **Cache-Friendly:** The queue is optimized for multi-core performance.
  1. It uses `alignas` to place producer and consumer data on separate cache lines, preventing "false sharing."
  2. It implements a performance optimization by **caching indices per core**. Each thread maintains a local, non-atomic cache of the other thread's position, minimizing expensive cross-core atomic operations.[^1]
- **[`JUCE::AbstractFifo`][JuceFifo]-inspired Design:** The API manages two indices for a user-provided buffer, giving the user full control over memory allocation.
- **Tested:** Includes a test suite built with CMake and CTest.
- **Highly-Performant:** Optimized for batch-oriented use cases. It scales well with larger batches, often surpassing the [performance](benchmarks/README.md) of other industry-standard solutions.
[^1]: See the ["MCRingBuffer"](https://www.cse.cuhk.edu.hk/~pclee/www/pubs/ancs09poster.pdf) paper or Rigtorp's [Optimizing a ring buffer for throughput](https://rigtorp.se/ringbuffer/).
This project includes a benchmark suite, using the [Google Benchmark](https://github.com/google/benchmark) library, to measure queue throughput and compare it against [`moodycamel::ReaderWriterQueue`](https://github.com/cameron314/readerwriterqueue). For instructions on how to run the benchmarks, see the [Benchmarks](benchmarks/README.md) section.
---

This directory contains the performance benchmark suite for the `LockFreeSpscQueue` project, built using the [Google Benchmark](https://github.com/google/benchmark) library.

The suite includes two primary executables:

1. **`queue_benchmark`:** A simple performance benchmark that measures the raw throughput of this library's batching API.
2. **`queue_comparison_benchmark`:** A comparative benchmark that stress-tests this library against the industry-standard [`moodycamel::ReaderWriterQueue`](https://github.com/cameron314/readerwriterqueue).
The benchmarks are **disabled by default** to keep configuration and build times fast for users who only want to integrate the library.
### How to Run
It is **critical** to use the `Release` build type, as benchmarking a `Debug` build will produce meaningless results.
1. **Configure CMake:**

   You must explicitly enable the desired benchmark option when running CMake.

   *Note: The first time you run this, CMake will download the required dependency libraries, which may take a moment.*
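   A plausible configure invocation might look like the following; the `SPSC_QUEUE_BUILD_BENCHMARKS` option name comes from the main README, while the `-S`/`-B` paths are assumptions to adapt to your checkout:

   ```sh
   # Assumed paths and option name; adjust to the project's actual CMake options.
   cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DSPSC_QUEUE_BUILD_BENCHMARKS=ON
   ```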
2. **Build the Project:**

   ```sh
   cmake --build build --config Release
   ```

3. **Run the Executables:**

   ```sh
   # On Linux/macOS
   ./build/benchmarks/queue_comparison_benchmark

   # On Windows
   .\build\benchmarks\queue_comparison_benchmark.exe
   ```
---
## Raw Queue Speed Result
When running the [`queue_benchmark`](queue_benchmark.cpp), you will see detailed output measuring the performance for different batch sizes. The most important column is `Items/s`, which shows the throughput in millions of items per second.

```
BM_QueueThroughput/1      0.042 ms    0.042 ms      16717   Items/s=195.043M/s
BM_QueueThroughput/4      0.011 ms    0.011 ms      66833   Items/s=777.743M/s
BM_QueueThroughput/16     0.003 ms    0.003 ms     266026   Items/s=3.10748G/s
BM_QueueThroughput/64     0.001 ms    0.001 ms    1049349   Items/s=12.3991G/s
BM_QueueThroughput/256    0.000 ms    0.000 ms    2211341   Items/s=25.924G/s
```
This output clearly shows how performance dramatically increases when transferring items in batches compared to one by one.
## Performance Comparison: Results & Analysis
The following are example results from running the [`queue_comparison_benchmark`](queue_comparison_benchmark.cpp) on an Apple MacBook Air (M2). The benchmark transfers a sequence of pseudo-random `int64_t` values and simultaneously verifies data integrity, making it both a performance and correctness stress test.

```
BM_ThisQueue_SingleItem             2.32 ms     2.32 ms      320   items_per_second=43.2044M/s
BM_ThisQueue_SingleItem_Write256    0.108 ms    0.108 ms    6466   items_per_second=924.364M/s
BM_Moodycamel_SingleItem            0.245 ms    0.245 ms    2932   items_per_second=408.063M/s
BM_ThisQueue_Batch/4                0.686 ms    0.686 ms     983   items_per_second=145.91M/s
BM_ThisQueue_Batch/8                0.151 ms    0.151 ms    4561   items_per_second=662.117M/s
BM_ThisQueue_Batch/16               0.114 ms    0.114 ms    6597   items_per_second=878.876M/s
BM_ThisQueue_Batch/64               0.093 ms    0.093 ms    7487   items_per_second=1.07577G/s
BM_ThisQueue_Batch/256              0.086 ms    0.086 ms    8186   items_per_second=1.16642G/s
BM_Moodycamel_Batch/4               0.243 ms    0.243 ms    2863   items_per_second=412.17M/s
BM_Moodycamel_Batch/8               0.239 ms    0.239 ms    2933   items_per_second=418.038M/s
BM_Moodycamel_Batch/16              0.240 ms    0.240 ms    2896   items_per_second=416.377M/s
BM_Moodycamel_Batch/64              0.242 ms    0.242 ms    2880   items_per_second=413.098M/s
BM_Moodycamel_Batch/256             0.242 ms    0.242 ms    2895   items_per_second=414.29M/s
```
### Key Takeaways
The benchmark data reveals a clear performance profile based on the architectural trade-offs of each library.
* **1. Single-Item Transfers (`ThisQueue_SingleItem`):** For individual, non-batched item transfers, `moodycamel::ReaderWriterQueue` shows significantly higher throughput (`~408 M/s`) compared to this library's low-level `prepare_write` path (`~43 M/s`). This performance difference is attributed to a fundamental architectural distinction. `moodycamel::ReaderWriterQueue` uses a "queue-of-queues" design (a linked list of smaller ring buffers). This structure provides a higher degree of decoupling, allowing its producer to often operate on a local block with minimal need to synchronize with the consumer's global position. In contrast, this library uses a single, contiguous ring buffer, which results in a tighter coupling between the producer and consumer states, leading to a higher per-transaction cost for single items.
* **2. Transactional Writes (`ThisQueue_SingleItem_Write256`):** For high-frequency workloads where multiple items are produced in a tight loop, this library's `WriteTransaction` API provides the highest performance. By amortizing the cost of atomic synchronization over a large number of fast, non-atomic `try_push` calls, it achieves a throughput of **`~924 M/s`**, more than double that of Moodycamel's single-item API.
* **3. Batch Transfers (`Batch`):** This library's `try_write` API, designed for transferring pre-prepared blocks of data, demonstrates strong scaling performance as the batch size increases. The data shows a clear crossover point: while Moodycamel is faster for very small batches (e.g., size 4), this library's throughput surpasses it significantly for batches of approximately **8-16 items or more**, reaching over **`1.1 G/s`** at a batch size of 256. In contrast, the `moodycamel::ReaderWriterQueue` (v1.0.7) API does not offer a non-blocking bulk enqueue method, and its performance remains flat at its single-item throughput rate.
### Conclusion
The data shows that this library is highly specialized for batch-oriented and high-frequency transactional workloads. While other libraries may offer higher performance for pure single-item transfers due to their architectural design, this queue's specialized APIs provide a significant throughput advantage in its intended use cases.
The recommended usage patterns are:
* For producers generating many items in a tight loop, the **`WriteTransaction`** API offers the best performance.
* For transferring pre-prepared blocks of data, the **`try_write`** lambda API is the most efficient method, especially for batch sizes of 16 or more.