AA-parallel-computing · Artorias17 · May 28, 2026 · May 30, 2026 · May 31, 2026 · May 31, 2026
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,30 @@
+# Compiled binaries
+matmul
+# Object files
+*.o
+*.a
+*.so
+logs/
+objs/
+
+# Build artifacts
+*.d
+
+# CMake
+CMakeFiles/
+CMakeCache.txt
+cmake_install.cmake
+Makefile
+
+# Editor/IDE
+*.swp
+*.swo
+*~
+
+# OS
+.DS_Store
+Thumbs.db
+
+# Output files
+data/*/result.raw
+timing_results.md
diff --git a/.python-version b/.python-version
@@ -0,0 +1 @@
+3.11
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -9,15 +9,15 @@ set(CMAKE_CXX_STANDARD_REQUIRED ON)
 find_package(OpenMP REQUIRED)
 
 if(OpenMP_CXX_FOUND)
-    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
+    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS} -Ofast -march=native -funroll-loops")
 endif()
 
 if(APPLE)
     set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_Alignof=alignof")
 endif()
 
 
-add_executable(matmul main_ans.cpp)
+add_executable(matmul main.cpp)
 
 
 if(OpenMP_CXX_FOUND)

diff --git a/README.md b/README.md
@@ -6,23 +6,25 @@
 
 ## Homework Assignment 4: Optimizing Matrix Multiplication in C++
 
-**Due Date**: 31/05/2026
+### Task Distribution
 
-**Points**: 100
+| Student                           | Task |
+|-----------------------------------|------|
+| Ha Do (Student ID: 2402703)        | Naive and blocked matrix multiplication; utility functions |
+| Abhishek Roy (Student ID: 2502895) | OpenMP matrix multiplication; optimizations for the other matrix multiplication functions |
 
 ---
 
 ### Assignment Overview
 
-Welcome to the last homework assignment of the Parallel Programming course! In this assignment, you will optimize the performance of a naive matrix multiplication
+This submission optimizes the performance of a naive matrix multiplication
 implementation using two techniques:
 
-1. **Cache Optimization via Blocked Matrix Multiplication**: Improve data locality to reduce cache misses.
-2. **Parallel Matrix Multiplication using `OpenMP`**: Parallelize the computation across multiple threads.
+1. **Cache Optimization via Blocked Matrix Multiplication**: Improved data locality to reduce cache misses.
+2. **Parallel Matrix Multiplication using `OpenMP`**: Parallelized the computation across multiple threads.
 
-Your task is to implement both optimizations in the provided C++ `main.cpp` file, measure their performance, and compare the
-wall clock time of the naive, cache-optimized, and parallel implementations for each test case. This assignment builds
-on naive matmul implementation, so ensure your naive implementation is correct before starting.
+The task was to implement both optimizations, measure their performance, and compare the
+wall clock time of the naive, cache-optimized, and parallel implementations for each test case.
 
 ---
 
@@ -41,7 +43,7 @@ The naive matrix multiplication (with triple nested loops) accesses memory in a
 temporal locality:
 
 - **Spatial Locality**: Accessing consecutive memory locations (e.g., elements in the same cache line).
-- **Temporal Locality**: Reusing the same data multiple times while it’s still in the cache.
+- **Temporal Locality**: Reusing the same data multiple times while it's still in the cache.
 
 Blocked matrix multiplication divides the matrices into smaller submatrices (blocks) that fit into the cache. By
 performing computations on these blocks, you ensure that data is reused while it resides in the cache, reducing cache
@@ -68,9 +70,6 @@ for (ii = 0; ii < m; ii += block_size)
 - **Outer loops (ii, jj, kk)**: Iterate over blocks.
 - **Inner loops (i, j, k)**: Compute within a block, reusing data in the cache.
 
-**Task**: Implement the `blocked_matmul` function in the provided `main.cpp`. Experiment with different block sizes (e.g.,
-16, 32, 64) and report the best performance.
-
 ---
 
 #### 2. Parallel Matrix Multiplication with OpenMP
@@ -83,8 +82,8 @@ be parallelized, as each element of the output matrix \( C \) can be computed in
 
 **Parallelizing with OpenMP**
 
-Use OpenMP to parallelize the outer loop(s) of the naive matrix multiplication. For example, parallelize the loop over
-rows of \( C \):
+OpenMP was used to parallelize the outer loop(s) of the naive matrix multiplication. For example, the loop over
+rows of \( C \) was parallelized:
 
 ```cpp
 #pragma omp parallel for
@@ -99,139 +98,66 @@ for (i = 0; i < m; i++)
   without locks.
 - Use `omp_get_wtime()` to measure wall clock time for accurate performance comparisons.
 
-**Task**: Implement the `parallel_matmul` function in the provided `main.cpp` using `OpenMP`. Test with different numbers of
-threads (e.g., 2, 4, 8) by setting the environment variable `OMP_NUM_THREADS`.
-
 ---
 
 #### 3. Performance Measurement
 
 For each test case (0 through 9 in the `data` folder):
 
-- Measure the **wall clock time** for:
-    - Naive matrix multiplication (`naive_matmul`).
-    - Cache-optimized matrix multiplication (`blocked_matmul`).
-    - Parallel matrix multiplication (`parallel_matmul`).
-- Use `omp_get_wtime()` for timing, as it provides high-resolution wall clock time.
-- Report the times in a table in your submission README.md, including:
-    - Test case number.
-    - Matrix dimensions (m × n × p).
-    - Wall clock time for each implementation (in seconds).
-    - Speedup of blocked and parallel implementations over the naive implementation.
-
-Example table format:
-
-| Test Case | Dimensions (m × n × p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup |
-|-----------|------------------------|----------------|------------------|-------------------|-----------------|------------------|
-| 0         | 512 × 512 × 512        | 2.345          | 0.987            | 0.543             | 2.38×           | 4.32×            |
-
----
-
-#### Matrix Storage and Memory Management
-
-- Row-major order for all matrices
-- Use C-style arrays with manual memory management (`malloc` or `new`, `free` or `delete`).
-- Do not use smart pointers.
-
----
-
-#### Input/Output and Validation
-
-- Use the same input/output format as Assignment 1:
-    - Input files: `data/<case>/input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)).
-    - Output file: `data/<case>/result.raw` (matrix \( C \)).
-    - Reference file: `data/<case>/output.raw` for validation.
-- The executable accepts a case number (0–9) as a command-line argument.
-- Validate correctness by comparing `result.raw` with `output.raw` for each implementation.
+- Measured the **wall clock time** for:
+  - Naive matrix multiplication (`naive_matmul`).
+  - Cache-optimized matrix multiplication (`blocked_matmul`) with block size 32.
+  - Parallel matrix multiplication (`parallel_matmul`) with `OMP_NUM_THREADS=8`.
+- Used `omp_get_wtime()` for high-resolution wall clock timings.
+
+#### 4. Results
+
+**CPU Specs:**
+
+- CPU: AMD Ryzen 7 8845HS
+- Architecture: x86-64
+- Cores: 8
+- Threads: 16
+
+The results in the table below come from the baseline implementations of the three matrix multiplications, before the loop reordering and additional compiler flags described further down.
+
+|   Test Case | Dimensions (m x n x p)   |   Naive Time (s) |   Blocked Time (s) |   Parallel Time (s) | Blocked Speedup   | Parallel Speedup   |
+|------------:|:-------------------------|-----------------:|-------------------:|--------------------:|:------------------|:-------------------|
+|           0 | 64x64x64                 |       0.00101837 |        0.000795773 |         0.00089702  | 1.27973x          | 1.13528x           |
+|           1 | 128x64x128               |       0.00414562 |        0.00348123  |         0.00195844  | 1.19085x          | 2.11679x           |
+|           2 | 100x128x56               |       0.0022999  |        0.00236687  |         0.00106175  | 0.971706x         | 2.16614x           |
+|           3 | 128x64x128               |       0.00355272 |        0.00337588  |         0.00159078  | 1.05238x          | 2.23332x           |
+|           4 | 32x128x32                |       0.00070751 |        0.000524026 |         0.000757544 | 1.35014x          | 0.933952x          |
+|           5 | 200x100x256              |       0.0176089  |        0.016518    |         0.00467121  | 1.06605x          | 3.76967x           |
+|           6 | 256x256x256              |       0.0536581  |        0.0567246   |         0.0120156   | 0.945941x         | 4.4657x            |
+|           7 | 256x300x256              |       0.0632706  |        0.0663129   |         0.0131018   | 0.954122x         | 4.82916x           |
+|           8 | 64x128x64                |       0.00175151 |        0.00203758  |         0.00326974  | 0.859606x         | 0.535674x          |
+|           9 | 256x256x257              |       0.0567747  |        0.0572363   |         0.0125215   | 0.991936x         | 4.53418x           |
+
+The blocked implementation gave no consistent gain, with speedups between 0.86x and 1.35x. The parallel implementation achieved speedups of roughly 2.1x to 4.8x in seven of the ten cases; the three smallest cases (0, 4, 8) stayed at about 1.1x or below.
+
+The table below shows results after the following optimizations:
+
+- Switched blocked and parallel loops to be `i -> k -> j`
+  - The initial `i -> j -> k` order accesses `B[k * p + j]` with `k` as the innermost variable, stepping through B column-wise with stride `p`. Once B no longer fits in cache, this can cause a cache miss on nearly every iteration. Swapping to `i -> k -> j` makes `j` the innermost variable, so `B[k * p + j]` and `C[i * p + j]` are accessed sequentially (stride 1) while `A[i * n + k]` stays fixed in the inner loop, giving all three matrices cache-friendly access patterns.
+- Added compiler flags:
+  - `-Ofast`: Enables all `-O3` optimizations plus some additional flags, one of which is `-ffast-math`. This relaxes strict IEEE 754 semantics, allowing the compiler to reorder floating point operations, use fused multiply-add (FMA) instructions, and vectorize reduction loops more aggressively, at the cost of results that may differ slightly from the reference output. This is the flag most responsible for the blocked speedup improvement.
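+  - `-march=native`: Generates code using the full SIMD instruction set of the host CPU (e.g. AVX2, AVX-512). Without this, the compiler falls back to the generic x86-64 baseline (SSE2, with 128-bit registers holding 4 floats), missing wider vector registers such as the 256-bit AVX2 registers that process 8 floats at a time. The resulting binary may not run on CPUs that lack those instructions.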
+  - `-funroll-loops`: Unrolls loop bodies to reduce loop control overhead and expose more instruction-level parallelism for the CPU's out-of-order execution units.
+
+|   Test Case | Dimensions (m x n x p)   |   Naive Time (s) |   Blocked Time (s) |   Parallel Time (s) | Blocked Speedup   | Parallel Speedup   |
+|------------:|:-------------------------|-----------------:|-------------------:|--------------------:|:------------------|:-------------------|
+|           0 | 64x64x64                 |      0.000206294 |        0.000124644 |         0.000974188 | 1.65507x          | 0.21176x           |
+|           1 | 128x64x128               |      0.000902154 |        0.000501695 |         0.000846172 | 1.79821x          | 1.06616x           |
+|           2 | 100x128x56               |      0.000479879 |        0.000572318 |         0.000855709 | 0.838483x         | 0.560797x          |
+|           3 | 128x64x128               |      0.00101505  |        0.00050891  |         0.000930236 | 1.99456x          | 1.09117x           |
+|           4 | 32x128x32                |      8.6234e-05  |        5.2299e-05  |         0.000552463 | 1.64887x          | 0.15609x           |
+|           5 | 200x100x256              |      0.00479585  |        0.00224633  |         0.00191679  | 2.13497x          | 2.50202x           |
+|           6 | 256x256x256              |      0.0186269   |        0.00742483  |         0.00312077  | 2.50874x          | 5.9687x            |
+|           7 | 256x300x256              |      0.0223733   |        0.0108002   |         0.00306352  | 2.07156x          | 7.30315x           |
+|           8 | 64x128x64                |      0.000383877 |        0.000223077 |         0.000714149 | 1.72083x          | 0.537531x          |
+|           9 | 256x256x257              |      0.0118203   |        0.00739098  |         0.00332793  | 1.59929x          | 3.55186x           |
+
+With these changes, the blocked speedup improved to roughly 1.6x to 2.5x in nine of the ten cases (case 2 remained below 1x at 0.84x). The parallel speedup dropped for the small cases: cases 0, 2, 4, and 8 fell well below 1x (0.16x to 0.56x), and cases 1 and 3 sat just above 1x. The larger cases (5, 6, 7, 9) still achieved 2.5x to 7.3x. The compiler flags sped up the single-threaded naive baseline significantly, which reduced the relative parallel speedup. For small matrices, thread start-up overhead likely outweighed the useful work, while larger matrices had enough work per thread to benefit.
 
 ---
-
-### Build Instructions
-
-- Use the provided `CMakeLists.txt` to build the project.
-- **Additional Requirements**:
-    - Ensure OpenMP is enabled in your compiler (e.g., `-fopenmp` for GCC).
-    - The provided CMake file includes OpenMP support.
-- **Windows Users**:
-    - Use CLion or Visual Studio with CMake.
-    - Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make`.
-- **Linux/Mac Users**:
-    - Make sure the GCC compiler is installed (`brew install gcc` on Mac).
-    - Configure CMake to use the correct compiler:
-      ```bash
-      cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ .
-      ```
-    - Run `cmake .` to generate a Makefile, then `make`.
-- **Testing OpenMP**:
-    - Set the number of threads using the environment variable `OMP_NUM_THREADS` (e.g., `export OMP_NUM_THREADS=4` on
-      Linux/Mac, or `set OMP_NUM_THREADS=4` on Windows).
-    - Test with different thread counts to find the best performance.
-
----
-
-### Submission Requirements
-
-#### Fork and Clone the Repository
-
-- Fork the Assignment 4 repository (provided separately).
-- Clone your fork:
-  ```bash
-  git clone https://github.com/AA-parallel-computing/Assignment-4-Optional.git
-  cd Assignment-4-Optional
-  ```
-
-#### Create a New Branch
-
-```bash
-git checkout -b student-name
-```
-
-#### Implement Your Solution
-
-- Modify the provided `main.cpp` to implement `blocked_matmul` and `parallel_matmul`.
-- Update `README.md` with your performance results table.
-
-#### Commit and Push
-
-```bash
-git add .
-git commit -m "student-name: Implemented optimized matrix multiplication"
-git push origin student-name
-```
-
-#### Submit a Pull Request (PR)
-
-- Create a pull request from your branch to the base repository’s `main` branch.
-- Include a description of your optimizations and any challenges faced.
-
----
-
-### Grading (100 Points Total)
-
-| Subtask                                     | Points |
-|---------------------------------------------|--------|
-| Correct implementation of `blocked_matmul`  | 30     |
-| Correct implementation of `parallel_matmul` | 30     |
-| Accurate performance measurements           | 20     |
-| Performance results table in README.md      | 10     |
-| Code clarity, commenting, and organization  | 10     |
-| **Total**                                   | 100    |
-
----
-
-### Tips for Success
-
-- **Cache Optimization**:
-    - Experiment with different block sizes. Start with powers of 2 (e.g., 16, 32, 64).
-    - Use a block size that balances cache usage without excessive overhead.
-- **OpenMP**:
-    - Test with different thread counts to find the optimal number for your system.
-    - Be cautious of false sharing (when threads access nearby memory locations, causing cache coherence issues).
-- **Performance Measurement**:
-    - Run multiple iterations for each test case and report the average time to reduce variability.
-    - Ensure no other heavy processes are running during measurements.
-- **Debugging**:
-    - Validate each implementation against `output.raw` to ensure correctness before optimizing.
-    - Use small test cases to debug your blocked and parallel implementations.
-
-Good luck, and enjoy optimizing your matrix multiplication!
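
Note on the loop reordering described in the README changes above: this diff does not include `main.cpp`, so the following is only a minimal sketch of the two loop orders, not the PR's actual code. The function names `matmul_ijk` and `matmul_ikj`, the use of `std::vector<float>`, and the `float` element type are assumptions made for illustration. Matrices are row-major, with A being m x n, B being n x p, and C being m x p.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration only; they are not taken from the PR.

// i -> j -> k order: the innermost loop reads B[k * p + j] with stride p,
// i.e. it walks down a column of the row-major matrix B.
std::vector<float> matmul_ijk(const std::vector<float>& A,
                              const std::vector<float>& B,
                              std::size_t m, std::size_t n, std::size_t p) {
    assert(A.size() == m * n && B.size() == n * p);
    std::vector<float> C(m * p, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < p; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * p + j];
            }
            C[i * p + j] = sum;
        }
    }
    return C;
}

// i -> k -> j order: the innermost loop reads B[k * p + j] and updates
// C[i * p + j] with stride 1, while A[i * n + k] is loop-invariant.
std::vector<float> matmul_ikj(const std::vector<float>& A,
                              const std::vector<float>& B,
                              std::size_t m, std::size_t n, std::size_t p) {
    assert(A.size() == m * n && B.size() == n * p);
    std::vector<float> C(m * p, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t k = 0; k < n; ++k) {
            const float a = A[i * n + k];  // reused across the whole j loop
            for (std::size_t j = 0; j < p; ++j) {
                C[i * p + j] += a * B[k * p + j];
            }
        }
    }
    return C;
}
```

Both functions accumulate each element of C over `k` in the same order, so they should produce the same product; only the memory access pattern differs. In the `i -> k -> j` version the inner loop is a plain element-wise update over contiguous memory, which is the shape compilers tend to vectorize most readily.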