Assignment 4 | Tristan Amiotte-Suchet #8
Open
letriton25 wants to merge 7 commits into
Still no improvement with the cached version of matmul.
Parallel Programming
Åbo Akademi University, Information Technology Department
Instructor: Alireza Olama
Student: Tristan Amiotte-Suchet
Student ID: 2501127
Homework Assignment 4: Optimizing Matrix Multiplication in C++
Due Date: 31/05/2026
Points: 100
Challenge of the Assignment
In this assignment, we are tasked with optimizing the performance of a naive matrix multiplication implementation in C++ using two techniques: cache optimization via blocked matrix multiplication and parallelization using OpenMP.
Before starting the optimizations, it is crucial to ensure that the naive matrix multiplication implementation is correct. Because all the benchmarks and tests rely on that correctness, validating the results against a reference implementation is essential. It is therefore important to start by implementing the validation functions, along with the helper functions used to read and write the matrix files.
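As a minimal sketch of what such a validation step could look like, the code below computes a reference product with the plain triple loop and compares two results element by element within a tolerance. The names reference_matmul and matrices_match, the row-major std::vector<double> storage and the default tolerance are assumptions made for this illustration; they are not taken from main_ans.cpp or from the matfile library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference product used as ground truth: C (m x p) = A (m x n) * B (n x p).
// All matrices are stored row-major in flat vectors (an assumption of this sketch).
std::vector<double> reference_matmul(const std::vector<double>& A,
                                     const std::vector<double>& B,
                                     std::size_t m, std::size_t n, std::size_t p) {
    assert(A.size() == m * n && B.size() == n * p);
    std::vector<double> C(m * p, 0.0);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < p; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * p + j];
            }
            C[i * p + j] = sum;
        }
    }
    return C;
}

// Hypothetical validation helper: true when both matrices have the same number
// of elements and every pair of elements agrees within a tolerance that is
// absolute for small values and relative for large ones.
bool matrices_match(const std::vector<double>& expected,
                    const std::vector<double>& actual,
                    double tol = 1e-9) {
    if (expected.size() != actual.size()) {
        return false;
    }
    for (std::size_t i = 0; i < expected.size(); ++i) {
        const double diff = std::fabs(expected[i] - actual[i]);
        const double scale = std::fmax(1.0, std::fabs(expected[i]));
        if (diff > tol * scale) {
            return false;
        }
    }
    return true;
}
```

A tolerance-based comparison is used rather than exact equality because an optimized version may add the products in a different order, which can change the last few bits of a floating-point result.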
Here is a short overview of this pre-optimization phase, followed by the steps taken to implement the different optimizations.
Pre-Optimization Phase
I decided to implement the read/write code in a separate library called matfile to keep the code organized and modular. It also allowed me, later on, to write a side program that generates random matrices with specific dimensions for testing, without duplicating the read/write code. As a result, my implementation in the main_ans.cpp file has the same structure as the one provided in main.cpp.
Cache Optimization (Blocked Matrix Multiplication)
This part caused me the most difficulty. At first, I implemented it using the pseudocode provided in the assignment instructions. However, I quickly ran into a performance issue: with every block size I tried, there was no speedup on the provided matrices. Using my side executable to generate bigger matrices such as 1000x1000, I was finally able to observe a small speedup, generally between 1.4x and 1.6x. However, even with bigger matrices the gain was not significant. The lack of speedup with this first implementation may be due to limitations of my hardware, but unfortunately I do not know the exact cause. The code used at that stage was the following:
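For illustration only, and not necessarily the code from that commit, a plain blocked multiplication along the lines of the usual pseudocode can be sketched as follows. The function name blocked_matmul_sketch, the block_size parameter, the row-major std::vector<double> storage and the i-k-j order of the inner loops are all assumptions made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Blocked (tiled) product: C (m x p) += A (m x n) * B (n x p), row-major.
// The caller is expected to pass C filled with zeros.
void blocked_matmul_sketch(const std::vector<double>& A,
                           const std::vector<double>& B,
                           std::vector<double>& C,
                           std::size_t m, std::size_t n, std::size_t p,
                           std::size_t block_size) {
    assert(block_size > 0);
    assert(A.size() == m * n && B.size() == n * p && C.size() == m * p);

    // Outer loops walk over tiles; std::min clips tiles at the matrix edges,
    // so dimensions do not need to be multiples of block_size.
    for (std::size_t ii = 0; ii < m; ii += block_size) {
        const std::size_t i_end = std::min(ii + block_size, m);
        for (std::size_t kk = 0; kk < n; kk += block_size) {
            const std::size_t k_end = std::min(kk + block_size, n);
            for (std::size_t jj = 0; jj < p; jj += block_size) {
                const std::size_t j_end = std::min(jj + block_size, p);

                // Inner loops multiply one tile of A by one tile of B.
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double a = A[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            C[i * p + j] += a * B[k * p + j];
                        }
                    }
                }
            }
        }
    }
}
```

In a version like this, the innermost loop reads B and writes C along contiguous rows. Tiling mainly pays off once the matrices no longer fit in cache, which would be consistent with a gain only appearing on larger inputs.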
After that, I decided to use a different approach. The concept is still the same, and I continue to divide the matrices into blocks, but I also use a register blocking technique. The idea is to load the values of the output matrix (C) into registers, and then perform the calculations for a small block of columns (e.g., 8 columns at a time) while keeping the intermediate results in registers. This reduces the number of memory accesses and takes advantage of the CPU's ability to perform multiple operations on data held in registers. With this new approach, I achieved a much better speedup, generally around 2.8x on the provided matrices. The code for this approach is available in main_ans.cpp, in the last committed version of the blocked_matmul function.
Parallel Matrix Multiplication using OpenMP
This part was more straightforward to implement. I just had to add the OpenMP pragma to the naive matrix multiplication implementation, as explained in the assignment instructions, by placing the line #pragma omp parallel for before the first loop of the naive matrix multiplication implementation.
I also tried varying the number of threads used by the parallel implementation by setting the OMP_NUM_THREADS environment variable. I noticed that up to 4 threads the speedup is perfectly linear, but beyond that it stays a bit over 4x. This is probably because of the overhead of creating threads and the fact that my CPU has 4 physical cores, so using more than 4 threads does not provide any additional performance benefit.
Results and benchmarks
After implementing both optimizations, I wrote a simple shell script that runs my program several times on each of the provided matrices. The script is available in the benchmark.sh file. The results are shown in the following table:
The table shows that the blocked matrix multiplication implementation provides a significant speedup over the naive implementation, with an average speedup of around 2.8x across all test cases. The parallel implementation also provides a significant speedup, averaging around 3.2x across all test cases. However, the speedup from parallelization is not consistent across test cases, and in some cases the parallel version is even slower than the naive implementation. This could be due to factors such as the overhead of creating threads, the size of the matrices, and the number of available CPU cores. The blocked version, on the other hand, consistently outperforms the naive version, and its speedup is roughly the same across all test cases, which suggests that the cache optimization technique is effective regardless of the matrix size.
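The speedup figures discussed here are ratios of the naive run time to the optimized run time. A minimal sketch of how such a ratio could be measured for the OpenMP variant is given below. It assumes square row-major matrices, and the names naive_matmul_square, parallel_matmul_square, time_seconds and speedup are hypothetical helpers for this illustration, not functions from main_ans.cpp or benchmark.sh.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

using Matrix = std::vector<double>;  // n x n, row-major (assumption of this sketch)

// Serial baseline: C = A * B.
void naive_matmul_square(const Matrix& A, const Matrix& B, Matrix& C, int n) {
    const std::size_t N = static_cast<std::size_t>(n);
    assert(A.size() == N * N && B.size() == N * N && C.size() == N * N);
    for (int i = 0; i < n; ++i) {
        const std::size_t row = static_cast<std::size_t>(i) * N;
        for (std::size_t j = 0; j < N; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < N; ++k) {
                sum += A[row + k] * B[k * N + j];
            }
            C[row + j] = sum;
        }
    }
}

// Same loops with the pragma on the outermost loop: rows of C are split
// among the threads, and each element of C is written by exactly one thread.
void parallel_matmul_square(const Matrix& A, const Matrix& B, Matrix& C, int n) {
    const std::size_t N = static_cast<std::size_t>(n);
    assert(A.size() == N * N && B.size() == N * N && C.size() == N * N);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        const std::size_t row = static_cast<std::size_t>(i) * N;
        for (std::size_t j = 0; j < N; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < N; ++k) {
                sum += A[row + k] * B[k * N + j];
            }
            C[row + j] = sum;
        }
    }
}

// Wall-clock time of one call to f, in seconds.
template <typename F>
double time_seconds(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Speedup = baseline time / optimized time; values below 1 mean a slowdown.
double speedup(double baseline_seconds, double optimized_seconds) {
    return baseline_seconds / optimized_seconds;
}
```

With GCC or Clang the pragma only takes effect when the program is compiled with -fopenmp; without that flag it is ignored and the loop runs serially. For small matrices a single run is very short, so the cost of starting the threads can dominate the measurement, which is one possible reason for a parallel version appearing slower than the naive one.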