A benchmark (counter) is used to increment a global atomic counter (integer) using various barrier and locking algorithms (listed below). Additionally, the same algorithms are used while using a bucket sort algorithm. This sorting algorithm takes an unsorted (specified) .txt file of unique (random) integers and outputs a sorted list of integers to a different (specified) .txt file.
This multithreaded program increments an integer counter based on the number of threads input and the number of iterations for each thread to perform. Ideally, the output should be the number of threads mutliplied by the number of iterations (for example 10 thread * 100 iterations = 1000). The global atomic counter is incremented by all threads in this program. Various barrier and locking algorithms are used, as outlined below. Additionally, this program is meant more for benchmarking performance results using a Linux basd performance analyzing tool called 'perf'. Results are provided further in this document.
Using the same algorithms as the benchmark program, this bucket sorting algorithm program reads in a file of randomly sorted integers and stores them into a vector locally. It then sorts based on various barrier and locking algorithms (same as above) while using multiple threads, both of these are specified by a input arguments. Based on the number of threads, the input vector is split up into different buckets and sorted synchronously before rejoining/inserting back into the primary vector. Though the time and memory cost overhead may be significant for high thread numbers and a small list of numbers, it is easily scalable when sorting larger number sets.
- Sense-reversal barrier - Each sequential barrier uses an opposite value of the previous barrier's value as a pass/stop state. By not using the same value, this barrier algorithm solves for potential deadlocking issues. For example, if barrier 1 uses the value 0 to stop the threads, barrier 2 will use 1 to stop, barrier 3 will use 0, barrier 4 will use 1, and so forth. In the algorithm, it simply flips (0/1) the value each time.
- Pthread barrier - Implementation of pthread locks is setup from initializing how many threads to expect at the barrier. Once all threads have arrived, the barrier call will unlock and the threads will continue until either finished, or another barrier is set. Once the threads are complete, they will join to the main thread as a sense of notification that the worker threads are complete (and then dead).
- Test-and-set (TAS) lock - using a single atomic value, TAS functionality tests a lock (true/false; obtain when false), and then sets lock. In the algorithm, a constant spinning loop tests against the lock to see when it's set to false (0). And once 0, it obtains the lock by setting the atomic value to 1. One issue with TAS is it's unfairness with not guaranteeing FIFO ordering for threading competing against the lock. This algorithm is easy to implement, but horrible for scaling.
- Test-and-test-and-set (TTAS) lock - more elaborate locking based on a modified version of TAS. In additional to test the lock, it also continuously checks the memory to see if the lock has been obtained. Once free, it attempts to take the lock by using test and set. So the additional 'test' comes from testing the memory of the lock, before trying to test and set it.
- Ticket lock - a queue like algorithm that uses two atomic counters, one which designates who is up, and who is next. Though this could lead to starvation of some threads, it allows for the next thread (among all) to know it's got CPU access next. This is also a spinlock, where the next in the queue waits until it's equal to the value now available. Both of these counters will increment as the locks/unlocks occur.
- Pthread lock - using a single mutex lock and an atomic shared variable, this lock first initializes the mutex lock, and then locks/unlocks (sets) the lock using the built in pthread library. This lock also destroys the lock once all threads have merged back to the master thread.
- Migrated from an array, and switched to using a vector. Some differences (and ultimate reasoning for decision) are outlined below.
| Array | Vector | Notes | |
|---|---|---|---|
| Synchronized | No | Yes | 1 |
| Implemented | Statically or Dynamically | Dynamic arrays with list interface | |
| Size | Fixed | Dynamic | 2 |
| Deallocation | Explicitly | Automatically | |
| Size Determination | undetermined | O(1) time | 3 |
| Reallocation | Explicitly | Implicitly | |
| Returned through func? | ~No | Yes | 4 |
| Copied Directly? | No | Yes | 5 |
| Assigned Directly? | No | Yes | 5 |
- Only one thread at a time can access the code with vectors, while arrays allow for multiple threads to work on it at the same time. This is turn causes thread waiting for vectors.
- Size: The size of an array is fixed, and expensive to reallocate. Vectors, on the other hand, are resizable since they are allocated on heap memory.
- Cannot be determined if array is dynamically allocated. Vectors are O(1) because they maintain variables which keeps track of container size at all times.
- Arrays can only be returned if they are dynamically allocated from a function
- Vectors can be copied directly (example: vec1 = vec2), where arrays need to be iterated when copied. Same applied for assignment. (arr1 = arr2 would produce an error)
Makefile- used to create and remove executable C++11 objectsmysort.cpp- bucket sort using bars and locksarg_parser.h- parsing and error handling for mysort program input optionscounter.cpp- benchmark program to incorporate bars and lockscnt_arg_parser.h- parsing and error handling for counter program input optionsREADME.pdf- write-up for project (this file)
randomNumberGen.cpp- creates a text file with a specified number of random numbers from [ 1 - number size ]. Default number is 10,000 unless specified when ran (as second argument).pthread_add.h- threading if compiling on macOS (will still run as normal for linux systems)'perf.sh- shell script file to run perf tests -- Omitted this file b/c was not able to run due to permissions on jupiter notebook. Methods for testing are listed below.
Note: If zipped, first unzip file before proceeding.
- From root directory, run
makefrom the terminal. This will generate a program called counter and mysort - Next, follow execution syntax below for either counter or mysort.
Make cleanwill remove all files generated bymake.- Additionally,
make numGenwill generate the executable object to generate random numbers to a .txt file (set to source.txt).
> counter [--name]counter --name: prints name to console.
OR
> counter [-t NUM THREADS] [-i=NUM_ITERATIONS] [--bar=<sense,pthread>] [--lock=<tas,ttas,ticket,mcs,pthread>] [-o out.txt]-t NUM_THREADS: specify how many threads to use during execution (including master thread)-i=NUM_ITERATIONS: number of times to increment counter on each thread--bar=<sense,pthread>: type of barrier algorithm to use--lock=<tas,ttas,ticket,mcs,pthread>: type of locking algorithm to use-o out.txt: sorted txt file of numebrs, in which the program outputs/writes to
- Additional outputs: time of execution in nanoseconds
Example of program execution, using 10 threads, 1000 iterations for each thread, TAS locking algorithm, and writing to file out.txt:
> ./counter -t 10 -i=1000 --lock=tas -o out.txt> mysort [--name]mysort --name: prints name to console.
OR
> mysort [source.txt] [-o out.txt] [-t NUM THREADS] [--alg=<bucket>] [--bar=<sense,pthread>] [--lock=<tas,ttas,ticket,mcs,pthread>]source.txt: input file for unsorted and random list of numbers, each on a new line-o out.txt: output file for sorted list of numbers (from input file in step preceeding this one), each on a new line-t NUM_THREADS: specify how many threads to use during execution (including master thread)--alg=<bucket>: sorting algorithm to use (always bucket)--bar=<sense,pthread>: type of barrier algorithm to use--lock=<tas,ttas,ticket,mcs,pthread>: type of locking algorithm to use
- Additional outputs: time of execution in nanoseconds, and if sorting fails
Example of program execution for mysort, reading from an input file of source.txt, writing to an output file of out.txt, using bucket sort, a sense-reversal barrier, and 5 threads:
> ./mysort source.txt -o out.txt -t 5 --alg=bucket --bar=sense> perf stat -e L1-dcache-loads -e L1-dcache-load-misses ./counter -t 10 -i=1000 --bar=<sense, pthread> -o out.txt
> perf stat -e L1-dcache-loads -e L1-dcache-load-misses ./mysort source.txt -o out.txt -t 10 --alg=bucket --bar=<sense, pthread>> perf stat -e branch-loads -e branch-load-misses ./counter -t 10 -i=1000 --bar=<sense, pthread> -o out.txt
> perf stat -e branch-loads -e branch-load-misses ./mysort source.txt -o out.txt -t 10 --alg=bucket --bar=<sense, pthread>> perf stat -e page-faults ./counter -t 10 -i=1000 --bar=<sense, pthread> -o out.txt
> perf stat -e page-faults ./mysort source.txt -o out.txt -t 10 --alg=bucket --bar=<sense, pthread>> perf stat --repeat 10 -e L1-dcache-loads,L1-dcache-load-misses,branch-loads,branch-load-misses,page-faults ./counter -t 10 -i=1000 --bar=<sense, pthread> -o out.txt
> perf stat --repeat 10 -e L1-dcache-loads,L1-dcache-load-misses,branch-loads,branch-load-misses,page-faults ./mysort source.txt -o out.txt -t 10 --alg=bucket --bar=<sense, pthread>| Program | Barrier | Run Time (s) | L1 cache hit rate | branch-prediction hit rate | page-fault |
|---|---|---|---|---|---|
| counter | sense | 0.001862 | 99.70% | 99.86% | 154 |
| counter | pthread | 0.012391 | 90.04% | 99.01% | 148 |
| mysort | sense | 0.001578 | 99.32% | 99.46% | 187 |
| mysort | pthread | 0.000531 | 96.56% | 97.58% | 186 |
> perf stat -e L1-dcache-loads -e L1-dcache-load-misses ./counter -t 10 -i=1000 --lock=<tas,ttas,ticket,pthread> -o out.txt
> perf stat -e L1-dcache-loads -e L1-dcache-load-misses ./mysort source.txt -o out.txt -t 10 --alg=bucket --lock=<tas,ttas,ticket,pthread>> perf stat -e branch-loads -e branch-load-misses ./counter -t 10 -i=1000 --lock=<tas,ttas,ticket,pthread> -o out.txt
> perf stat -e branch-loads -e branch-load-misses ./mysort source.txt -o out.txt -t 10 --alg=bucket --lock=<tas,ttas,ticket,pthread>> perf stat -e page-faults ./counter -t 10 -i=1000 --lock=<tas,ttas,ticket,pthread> -o out.txt
> perf stat -e page-faults ./mysort source.txt -o out.txt -t 10 --alg=bucket --lock=<tas,ttas,ticket,pthread>> perf stat --repeat 10 -e L1-dcache-loads,L1-dcache-load-misses,branch-loads,branch-load-misses,page-faults ./counter -t 10 -i=1000 --lock=<tas,ttas,ticket,pthread> -o out.txt
> perf stat --repeat 10 -e L1-dcache-loads,L1-dcache-load-misses,branch-loads,branch-load-misses,page-faults ./mysort source.txt -o out.txt -t 10 --alg=bucket --lock=<tas,ttas,ticket,pthread>| Program | Lock | Run Time (s) | L1 cache hit rate | branch-prediction hit rate | page-fault |
|---|---|---|---|---|---|
| counter | tas | 0.000574 | 92.37% | 97.01% | 144 |
| counter | ttas | 0.000705 | 93.40% | 97.07% | 145 |
| counter | ticket | 0.001021 | 96.17% | 98.69% | 145 |
| counter | pthread | 0.001050 | 97.06% | 99.01% | 145 |
| mysort | tas | 0.000346 | 96.99% | 97.75% | 171 |
| mysort | ttas | 0.000371 | 97.02% | 97.78% | 170 |
| mysort | ticket | 0.000358 | 97.30% | 97.92% | 166 |
| mysort | pthread | 0.000349 | 96.53% | 97.46% | 171 |
Overall among benchmarking and mysort performance tests, locking algorithms seem to be faster and have better hit rates. Speculating that with barriers, each process has to wait for one another before proceeding on, while the locking algorithms are able to sort the subarray independently on one another prior to merging back. The pthread appeared the perform the worse across the board, since each thread was having to synchronize after every bit, there's a lot of overhead cost involed. The fastest looked to be TAS lock, imagining it's the simplest of the locks, but it not scalable so could see degraded results with larger numbers.
- Input/output files are loaded/saved as is. Does not check for txt file types or if data is integers.
- Creating/Joining of threads is out of order for locks, so sometimes fails bucket sorting.
- bucket sort (variaion)
- multithreading pthread
- pthread locks
- mutex locks for linux threads
- pthreads
- test and set spinlocks
- TAS spinlock 2
- Test and test-and-set (wiki)
- Double-checked locking
- Ticket lock
- Ticket lock (wiki)
- Sense Reversal (small sample)
- Pthread barrier
- Pthread barrier 2 (in C)
- perf & perf stat
- measuring execution time
- clock and time functions
- chrono high resolution time accuracy
- atomic fetch add - since 'fai' not available
- Advantages of vector over array in C++
- Awesome videos on locks and atomics