🔒 C++ barrier & locking algorithms

A benchmark (counter) is used to increment a global atomic counter (integer) using various barrier and locking algorithms (listed below). Additionally, the same algorithms are used while using a bucket sort algorithm. This sorting algorithm takes an unsorted (specified) .txt file of unique (random) integers and outputs a sorted list of integers to a different (specified) .txt file.

Counter program (benchmark)

This multithreaded program increments an integer counter based on the number of threads input and the number of iterations for each thread to perform. Ideally, the output should be the number of threads mutliplied by the number of iterations (for example 10 thread * 100 iterations = 1000). The global atomic counter is incremented by all threads in this program. Various barrier and locking algorithms are used, as outlined below. Additionally, this program is meant more for benchmarking performance results using a Linux basd performance analyzing tool called 'perf'. Results are provided further in this document.

Bucket sorting program (mysort program)

Using the same algorithms as the benchmark program, this bucket sorting algorithm program reads in a file of randomly sorted integers and stores them into a vector locally. It then sorts based on various barrier and locking algorithms (same as above) while using multiple threads, both of these are specified by a input arguments. Based on the number of threads, the input vector is split up into different buckets and sorted synchronously before rejoining/inserting back into the primary vector. Though the time and memory cost overhead may be significant for high thread numbers and a small list of numbers, it is easily scalable when sorting larger number sets.

Barrier Algorithms:

Sense-reversal barrier - Each sequential barrier uses an opposite value of the previous barrier's value as a pass/stop state. By not using the same value, this barrier algorithm solves for potential deadlocking issues. For example, if barrier 1 uses the value 0 to stop the threads, barrier 2 will use 1 to stop, barrier 3 will use 0, barrier 4 will use 1, and so forth. In the algorithm, it simply flips (0/1) the value each time.
Pthread barrier - Implementation of pthread locks is setup from initializing how many threads to expect at the barrier. Once all threads have arrived, the barrier call will unlock and the threads will continue until either finished, or another barrier is set. Once the threads are complete, they will join to the main thread as a sense of notification that the worker threads are complete (and then dead).

Locking Algorithms:

Test-and-set (TAS) lock - using a single atomic value, TAS functionality tests a lock (true/false; obtain when false), and then sets lock. In the algorithm, a constant spinning loop tests against the lock to see when it's set to false (0). And once 0, it obtains the lock by setting the atomic value to 1. One issue with TAS is it's unfairness with not guaranteeing FIFO ordering for threading competing against the lock. This algorithm is easy to implement, but horrible for scaling.
Test-and-test-and-set (TTAS) lock - more elaborate locking based on a modified version of TAS. In additional to test the lock, it also continuously checks the memory to see if the lock has been obtained. Once free, it attempts to take the lock by using test and set. So the additional 'test' comes from testing the memory of the lock, before trying to test and set it.
Ticket lock - a queue like algorithm that uses two atomic counters, one which designates who is up, and who is next. Though this could lead to starvation of some threads, it allows for the next thread (among all) to know it's got CPU access next. This is also a spinlock, where the next in the queue waits until it's equal to the value now available. Both of these counters will increment as the locks/unlocks occur.
Pthread lock - using a single mutex lock and an atomic shared variable, this lock first initializes the mutex lock, and then locks/unlocks (sets) the lock using the built in pthread library. This lock also destroys the lock once all threads have merged back to the master thread.

Sorting List Structure

Migrated from an array, and switched to using a vector. Some differences (and ultimate reasoning for decision) are outlined below.

	Array	Vector	Notes
Synchronized	No	Yes	1
Implemented	Statically or Dynamically	Dynamic arrays with list interface
Size	Fixed	Dynamic	2
Deallocation	Explicitly	Automatically
Size Determination	undetermined	O(1) time	3
Reallocation	Explicitly	Implicitly
Returned through func?	~No	Yes	4
Copied Directly?	No	Yes	5
Assigned Directly?	No	Yes	5

Only one thread at a time can access the code with vectors, while arrays allow for multiple threads to work on it at the same time. This is turn causes thread waiting for vectors.
Size: The size of an array is fixed, and expensive to reallocate. Vectors, on the other hand, are resizable since they are allocated on heap memory.
Cannot be determined if array is dynamically allocated. Vectors are O(1) because they maintain variables which keeps track of container size at all times.
Arrays can only be returned if they are dynamically allocated from a function
Vectors can be copied directly (example: vec1 = vec2), where arrays need to be iterated when copied. Same applied for assignment. (arr1 = arr2 would produce an error)

Code Organization

📁 Files

Makefile - used to create and remove executable C++11 objects
mysort.cpp - bucket sort using bars and locks
arg_parser.h - parsing and error handling for mysort program input options
counter.cpp - benchmark program to incorporate bars and locks
cnt_arg_parser.h - parsing and error handling for counter program input options
README.pdf - write-up for project (this file)

💾 Additional files

randomNumberGen.cpp- creates a text file with a specified number of random numbers from [ 1 - number size ]. Default number is 10,000 unless specified when ran (as second argument).
pthread_add.h - threading if compiling on macOS (will still run as normal for linux systems)'
perf.sh - shell script file to run perf tests -- Omitted this file b/c was not able to run due to permissions on jupiter notebook. Methods for testing are listed below.

Compiling

Note: If zipped, first unzip file before proceeding.

From root directory, run make from the terminal. This will generate a program called counter and mysort
Next, follow execution syntax below for either counter or mysort.
Make clean will remove all files generated by make.
Additionally, make numGen will generate the executable object to generate random numbers to a .txt file (set to source.txt).

Execution

Program option/input parameters for `counter`

> counter [--name]

counter --name: prints name to console.
OR

> counter [-t NUM THREADS] [-i=NUM_ITERATIONS] [--bar=<sense,pthread>] [--lock=<tas,ttas,ticket,mcs,pthread>] [-o out.txt]

-t NUM_THREADS: specify how many threads to use during execution (including master thread)
-i=NUM_ITERATIONS: number of times to increment counter on each thread
--bar=<sense,pthread>: type of barrier algorithm to use
--lock=<tas,ttas,ticket,mcs,pthread>: type of locking algorithm to use
-o out.txt: sorted txt file of numebrs, in which the program outputs/writes to

Additional outputs: time of execution in nanoseconds
Example of program execution, using 10 threads, 1000 iterations for each thread, TAS locking algorithm, and writing to file out.txt:

> ./counter -t 10 -i=1000 --lock=tas -o out.txt

Program option/input argument parameters for `mysort`

> mysort [--name]

mysort --name: prints name to console.
OR

> mysort [source.txt] [-o out.txt] [-t NUM THREADS] [--alg=<bucket>] [--bar=<sense,pthread>] [--lock=<tas,ttas,ticket,mcs,pthread>]

source.txt: input file for unsorted and random list of numbers, each on a new line
-o out.txt: output file for sorted list of numbers (from input file in step preceeding this one), each on a new line
-t NUM_THREADS: specify how many threads to use during execution (including master thread)
--alg=<bucket>: sorting algorithm to use (always bucket)
--bar=<sense,pthread>: type of barrier algorithm to use
--lock=<tas,ttas,ticket,mcs,pthread>: type of locking algorithm to use

Additional outputs: time of execution in nanoseconds, and if sorting fails
Example of program execution for mysort, reading from an input file of source.txt, writing to an output file of out.txt, using bucket sort, a sense-reversal barrier, and 5 threads:

> ./mysort source.txt -o out.txt -t 5 --alg=bucket --bar=sense

Analyzing program performances, using `perf`

Barriers

For L1 cache hit rate

> perf stat -e L1-dcache-loads -e L1-dcache-load-misses ./counter -t 10 -i=1000 --bar=<sense, pthread> -o out.txt
> perf stat -e L1-dcache-loads -e L1-dcache-load-misses ./mysort source.txt -o out.txt -t 10 --alg=bucket --bar=<sense, pthread>

For branch-prediction hit rate

> perf stat -e branch-loads -e branch-load-misses ./counter -t 10 -i=1000 --bar=<sense, pthread> -o out.txt
> perf stat -e branch-loads -e branch-load-misses ./mysort source.txt -o out.txt -t 10 --alg=bucket --bar=<sense, pthread>

For L1 cache hit rate

> perf stat -e page-faults ./counter -t 10 -i=1000 --bar=<sense, pthread> -o out.txt
> perf stat -e page-faults ./mysort source.txt -o out.txt -t 10 --alg=bucket --bar=<sense, pthread>

Or all stats at once (how I ended up doing it), using 10 repeated runs

> perf stat --repeat 10 -e L1-dcache-loads,L1-dcache-load-misses,branch-loads,branch-load-misses,page-faults ./counter -t 10 -i=1000 --bar=<sense, pthread> -o out.txt  
> perf stat --repeat 10 -e L1-dcache-loads,L1-dcache-load-misses,branch-loads,branch-load-misses,page-faults ./mysort source.txt -o out.txt -t 10 --alg=bucket --bar=<sense, pthread>

Barrier Perf Table (Iteration/Array at 1000, 10 threads, and 10 repeated runs (average))

Program	Barrier	Run Time (s)	L1 cache hit rate	branch-prediction hit rate	page-fault
counter	sense	0.001862	99.70%	99.86%	154
counter	pthread	0.012391	90.04%	99.01%	148
mysort	sense	0.001578	99.32%	99.46%	187
mysort	pthread	0.000531	96.56%	97.58%	186

Locks

For L1 cache hit rate

> perf stat -e L1-dcache-loads -e L1-dcache-load-misses ./counter -t 10 -i=1000 --lock=<tas,ttas,ticket,pthread> -o out.txt
> perf stat -e L1-dcache-loads -e L1-dcache-load-misses ./mysort source.txt -o out.txt -t 10 --alg=bucket --lock=<tas,ttas,ticket,pthread>

For branch-prediction hit rate

> perf stat -e branch-loads -e branch-load-misses ./counter -t 10 -i=1000 --lock=<tas,ttas,ticket,pthread> -o out.txt
> perf stat -e branch-loads -e branch-load-misses ./mysort source.txt -o out.txt -t 10 --alg=bucket --lock=<tas,ttas,ticket,pthread>

For L1 cache hit rate

> perf stat -e page-faults ./counter -t 10 -i=1000 --lock=<tas,ttas,ticket,pthread> -o out.txt
> perf stat -e page-faults ./mysort source.txt -o out.txt -t 10 --alg=bucket --lock=<tas,ttas,ticket,pthread>

Or all stats at once (how I ended up doing it), using 10 repeated runs

> perf stat --repeat 10 -e L1-dcache-loads,L1-dcache-load-misses,branch-loads,branch-load-misses,page-faults ./counter -t 10 -i=1000 --lock=<tas,ttas,ticket,pthread> -o out.txt 
> perf stat --repeat 10 -e L1-dcache-loads,L1-dcache-load-misses,branch-loads,branch-load-misses,page-faults ./mysort source.txt -o out.txt -t 10 --alg=bucket --lock=<tas,ttas,ticket,pthread>

Lock Perf Table (Iteration/Array at 1000, 10 threads, and 10 repeated runs (average))

Program	Lock	Run Time (s)	L1 cache hit rate	branch-prediction hit rate	page-fault
counter	tas	0.000574	92.37%	97.01%	144
counter	ttas	0.000705	93.40%	97.07%	145
counter	ticket	0.001021	96.17%	98.69%	145
counter	pthread	0.001050	97.06%	99.01%	145
mysort	tas	0.000346	96.99%	97.75%	171
mysort	ttas	0.000371	97.02%	97.78%	170
mysort	ticket	0.000358	97.30%	97.92%	166
mysort	pthread	0.000349	96.53%	97.46%	171

Perf Analysis

Overall among benchmarking and mysort performance tests, locking algorithms seem to be faster and have better hit rates. Speculating that with barriers, each process has to wait for one another before proceeding on, while the locking algorithms are able to sort the subarray independently on one another prior to merging back. The pthread appeared the perform the worse across the board, since each thread was having to synchronize after every bit, there's a lot of overhead cost involed. The fastest looked to be TAS lock, imagining it's the simplest of the locks, but it not scalable so could see degraded results with larger numbers.

🐜 Surviving Bugs

Input/output files are loaded/saved as is. Does not check for txt file types or if data is integers.
Creating/Joining of threads is out of order for locks, so sometimes fails bucket sorting.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🔒 C++ barrier & locking algorithms

Counter program (benchmark)

Bucket sorting program (mysort program)

Barrier Algorithms:

Locking Algorithms:

Sorting List Structure

Code Organization

📁 Files

💾 Additional files

Compiling

Execution

Program option/input parameters for `counter`

Program option/input argument parameters for `mysort`

Analyzing program performances, using `perf`

Barriers

For L1 cache hit rate

For branch-prediction hit rate

For L1 cache hit rate

Or all stats at once (how I ended up doing it), using 10 repeated runs

Barrier Perf Table (Iteration/Array at 1000, 10 threads, and 10 repeated runs (average))

Locks

For L1 cache hit rate

For branch-prediction hit rate

For L1 cache hit rate

Or all stats at once (how I ended up doing it), using 10 repeated runs

Lock Perf Table (Iteration/Array at 1000, 10 threads, and 10 repeated runs (average))

Perf Analysis

🐜 Surviving Bugs

Resources:

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

🔒 C++ barrier & locking algorithms

Counter program (benchmark)

Bucket sorting program (mysort program)

Barrier Algorithms:

Locking Algorithms:

Sorting List Structure

Code Organization

📁 Files

💾 Additional files

Compiling

Execution

Program option/input parameters for counter

Program option/input argument parameters for mysort

Analyzing program performances, using perf

Barriers

For L1 cache hit rate

For branch-prediction hit rate

For L1 cache hit rate

Or all stats at once (how I ended up doing it), using 10 repeated runs

Barrier Perf Table (Iteration/Array at 1000, 10 threads, and 10 repeated runs (average))

Locks

For L1 cache hit rate

For branch-prediction hit rate

For L1 cache hit rate

Or all stats at once (how I ended up doing it), using 10 repeated runs

Lock Perf Table (Iteration/Array at 1000, 10 threads, and 10 repeated runs (average))

Perf Analysis

🐜 Surviving Bugs

Resources:

Program option/input parameters for `counter`

Program option/input argument parameters for `mysort`

Analyzing program performances, using `perf`