1 | | -# Homework-2 |
2 | | -Parallel Computing Homework Assignment 1 |
| 1 | +# Parallel Programming |
| 2 | + |
| 3 | +**Åbo Akademi University, Information Technology Department** |
| 4 | + |
| 5 | +**Instructor: Alireza Olama** |
| 6 | + |
| 7 | +## Homework Assignment 2: Optimizing Matrix Multiplication in C++ |
| 8 | + |
| 9 | +**Due Date**: 08/05/2025 |
| 10 | + |
| 11 | +**Points**: 100 |
| 12 | + |
| 13 | +--- |
| 14 | + |
| 15 | +### Assignment Overview |
| 16 | + |
| 17 | +Welcome to the second homework assignment of the Parallel Programming course! In Assignment 1, you implemented a naive |
| 18 | +matrix multiplication using a triple nested loop. In this assignment, you will optimize the performance of your naive |
| 19 | +implementation using two techniques: |
| 20 | + |
| 21 | +1. **Cache Optimization via Blocked Matrix Multiplication**: Improve data locality to reduce cache misses. |
| 22 | +2. **Parallel Matrix Multiplication using `OpenMP`**: Parallelize the computation across multiple threads. |
| 23 | + |
| 24 | +Your task is to implement both optimizations in the provided C++ `main.cpp` file, measure their performance, and compare the |
| 25 | +wall clock time of the naive, cache-optimized, and parallel implementations for each test case. This assignment builds |
| 26 | +on your Assignment 1 code, so ensure your naive implementation is correct before starting. |
| 27 | + |
| 28 | +--- |
| 29 | + |
| 30 | +### Technical Requirements |
| 31 | + |
| 32 | +#### 1. Cache Optimization (Blocked Matrix Multiplication) |
| 33 | + |
| 34 | +**Why Cache Optimization?** |
| 35 | + |
| 36 | +Modern CPUs rely on cache memory to reduce the latency of accessing data from main memory. Cache memory is faster but |
| 37 | +smaller, organized in cache lines (typically 64 bytes). When a CPU accesses a memory location, it fetches an entire |
| 38 | +cache line. Matrix multiplication can suffer from poor performance if memory accesses are not cache-friendly, leading to |
| 39 | +frequent cache misses. |
| 40 | + |
| 41 | +The naive matrix multiplication (with triple nested loops) accesses memory in a way that may not exploit spatial and |
| 42 | +temporal locality: |
| 43 | + |
| 44 | +- **Spatial Locality**: Accessing consecutive memory locations (e.g., elements in the same cache line). |
| 45 | +- **Temporal Locality**: Reusing the same data multiple times while it’s still in the cache. |
| 46 | + |
| 47 | +Blocked matrix multiplication divides the matrices into smaller submatrices (blocks) that fit into the cache. By |
| 48 | +performing computations on these blocks, you ensure that data is reused while it resides in the cache, reducing cache |
| 49 | +misses and improving performance. |
| 50 | + |
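To make the locality argument concrete, here is a minimal sketch (not part of the provided `main.cpp`) that sums the same row-major matrix in two traversal orders. The function names `sum_row_order` and `sum_column_order` are hypothetical, and `double` elements are an assumption. Both functions return the same value; only the memory access pattern differs.

```cpp
#include <cassert>
#include <cstddef>

// Walk each row left to right: consecutive accesses touch adjacent addresses,
// so most of them hit a cache line that is already loaded (spatial locality).
double sum_row_order(const double* a, std::size_t rows, std::size_t cols) {
    assert(a != nullptr);
    double total = 0.0;
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            total += a[i * cols + j];
    return total;
}

// Walk each column top to bottom: consecutive accesses are cols elements apart,
// so for wide matrices nearly every access lands on a different cache line.
double sum_column_order(const double* a, std::size_t rows, std::size_t cols) {
    assert(a != nullptr);
    double total = 0.0;
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            total += a[i * cols + j];
    return total;
}
```

In the naive kernel, the innermost loop over `k` reads `B[k * p + j]` with a stride of `p` elements, which is the same access pattern as `sum_column_order`. Timing both functions on a large matrix is one way to observe the effect on your own machine.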
| 51 | +**Blocked Matrix Multiplication Pseudocode** |
| 52 | + |
| 53 | +Assume matrices \( A \) (m × n), \( B \) (n × p), and \( C \) (m × p) are stored in row-major order. The blocked matrix |
| 54 | +multiplication processes submatrices of size `block_size` × `block_size`:
| 55 | + |
| 56 | +```cpp |
| 57 | +// C = A * B (zero-initialize C first; m, n, p, block_size are int; std::min needs <algorithm>)
| 58 | +for (int ii = 0; ii < m; ii += block_size)
| 59 | +    for (int jj = 0; jj < p; jj += block_size)
| 60 | +        for (int kk = 0; kk < n; kk += block_size)
| 61 | +            // Process block: C[ii:ii+block_size, jj:jj+block_size] += A[ii:ii+block_size, kk:kk+block_size] * B[kk:kk+block_size, jj:jj+block_size]
| 62 | +            for (int i = ii; i < std::min(ii + block_size, m); i++)
| 63 | +                for (int j = jj; j < std::min(jj + block_size, p); j++)
| 64 | +                    for (int k = kk; k < std::min(kk + block_size, n); k++)
| 65 | +                        C[i * p + j] += A[i * n + k] * B[k * p + j];
| 66 | +``` |
| 67 | + |
| 68 | +- **block_size**: Chosen to ensure the block fits in the cache (e.g., 32, 64, or 128, depending on the system). |
| 69 | +- **Outer loops (ii, jj, kk)**: Iterate over blocks. |
| 70 | +- **Inner loops (i, j, k)**: Compute within a block, reusing data in the cache. |
| 71 | + |
| 72 | +**Task**: Implement the `blocked_matmul` function in the provided `main.cpp`. Experiment with different block sizes (e.g., |
| 73 | +16, 32, 64) and report the best performance. |
| 74 | + |
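When picking block sizes to try, a rough working-set estimate can narrow the search. The sketch below assumes that one block each of A, B, and C is live at a time. The helper names and the example figures (4-byte `float` elements, a 32 KiB L1 data cache) are illustrative assumptions, not properties of the provided code or of your machine.

```cpp
#include <cassert>
#include <cstddef>

// Bytes touched by one block update: one block_size x block_size tile
// each from A, B, and C (edge tiles are smaller, so this is an upper bound).
std::size_t block_working_set_bytes(std::size_t block_size, std::size_t elem_bytes) {
    assert(block_size > 0 && elem_bytes > 0);
    return 3 * block_size * block_size * elem_bytes;
}

// True if the three tiles fit together in a cache of the given capacity.
bool blocks_fit_in_cache(std::size_t block_size, std::size_t elem_bytes,
                         std::size_t cache_bytes) {
    return block_working_set_bytes(block_size, elem_bytes) <= cache_bytes;
}
```

Under these assumptions, a block size of 32 needs 3 × 32 × 32 × 4 = 12,288 bytes and fits in a 32 KiB cache, while a block size of 64 needs 49,152 bytes and does not. The estimate is only a starting point; measured times should decide the final choice.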
| 75 | +--- |
| 76 | + |
| 77 | +#### 2. Parallel Matrix Multiplication with OpenMP |
| 78 | + |
| 79 | +**Why OpenMP?** |
| 80 | + |
| 81 | +`OpenMP` is a portable API for parallel programming in shared-memory systems. It allows you to parallelize loops with |
| 82 | +minimal code changes, distributing iterations across multiple threads. In matrix multiplication, the outer loop(s) can |
| 83 | +be parallelized, as each element of the output matrix \( C \) can be computed independently. |
| 84 | + |
| 85 | +**Parallelizing with OpenMP** |
| 86 | + |
| 87 | +Use OpenMP to parallelize the outer loop(s) of the naive matrix multiplication. For example, parallelize the loop over |
| 88 | +rows of \( C \): |
| 89 | + |
| 90 | +```cpp |
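| 91 | +#pragma omp parallel for
| 92 | +for (int i = 0; i < m; i++)          // rows of C are divided among the threads
| 93 | +    for (int j = 0; j < p; j++)      // j and k are declared inside the loops, so each thread has its own copy
| 94 | +        for (int k = 0; k < n; k++)
| 95 | +            C[i * p + j] += A[i * n + k] * B[k * p + j];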
| 96 | +``` |
| 97 | + |
| 98 | +- The `#pragma omp parallel for` directive tells `OpenMP` to distribute iterations of the loop across available threads. |
| 99 | +- Ensure thread safety: each iteration of the parallelized outer loop writes to a distinct row of \( C \), so the loop
| 100 | + is safe to parallelize without locks, provided the inner loop counters are private to each thread.
| 101 | +- Use `omp_get_wtime()` to measure wall clock time for accurate performance comparisons. |
| 102 | + |
| 103 | +**Task**: Implement the `parallel_matmul` function in the provided `main.cpp` using `OpenMP`. Test with different numbers of |
| 104 | +threads (e.g., 2, 4, 8) by setting the environment variable `OMP_NUM_THREADS`. |
| 105 | + |
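The following sketch shows the OpenMP calls mentioned above in a form that also compiles when OpenMP is disabled, which can help when debugging build problems. The helper names `wall_time`, `max_threads`, and `scale_rows` are hypothetical, and the `std::chrono` fallback is an assumption added for portability; the assignment itself asks for `omp_get_wtime()`.

```cpp
#include <cassert>
#include <chrono>
#ifdef _OPENMP
#include <omp.h>
#endif

// Wall-clock seconds: omp_get_wtime() when built with OpenMP, std::chrono otherwise.
double wall_time() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
#endif
}

// Upper bound on the threads a parallel region may use; this reflects
// OMP_NUM_THREADS when it is set. Returns 1 when built without OpenMP.
int max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Scale a row-major rows x cols matrix in place. Each outer iteration
// owns one row, so no two threads write to the same element.
void scale_rows(float* c, int rows, int cols, float factor) {
    assert(c != nullptr && rows >= 0 && cols >= 0);
#pragma omp parallel for
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)   // declared in the loop, so private per thread
            c[i * cols + j] *= factor;
}
```

Printing `max_threads()` at startup is a quick check that `OMP_NUM_THREADS` was picked up. If it always prints 1, the build probably did not enable OpenMP.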
| 106 | +--- |
| 107 | + |
| 108 | +#### 3. Performance Measurement |
| 109 | + |
| 110 | +For each test case (0 through 9 in the `data` folder): |
| 111 | + |
| 112 | +- Measure the **wall clock time** for: |
| 113 | + - Naive matrix multiplication (`naive_matmul`). |
| 114 | + - Cache-optimized matrix multiplication (`blocked_matmul`). |
| 115 | + - Parallel matrix multiplication (`parallel_matmul`). |
| 116 | +- Use `omp_get_wtime()` for timing, as it provides high-resolution wall clock time. |
| 117 | +- Report the times in a table in your submission README.md, including: |
| 118 | + - Test case number. |
| 119 | + - Matrix dimensions (m × n × p). |
| 120 | + - Wall clock time for each implementation (in seconds). |
| 121 | + - Speedup of blocked and parallel implementations over the naive implementation. |
| 122 | + |
| 123 | +Example table format: |
| 124 | + |
| 125 | +| Test Case | Dimensions (m × n × p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup | |
| 126 | +|-----------|------------------------|----------------|------------------|-------------------|-----------------|------------------| |
| 127 | +| 0 | 512 × 512 × 512 | 2.345 | 0.987 | 0.543 | 2.38× | 4.32× | |
| 128 | + |
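Speedup in the table is the naive time divided by the optimized time, for example 2.345 / 0.987 ≈ 2.38. A small helper such as the one below could compute it and emit a row in the table layout above; `speedup` and `format_row` are hypothetical names, not functions from the provided `main.cpp`.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Speedup of an optimized run relative to the baseline (values above 1 mean faster).
double speedup(double baseline_seconds, double optimized_seconds) {
    assert(baseline_seconds > 0.0 && optimized_seconds > 0.0);
    return baseline_seconds / optimized_seconds;
}

// Format one Markdown table row: times with 3 decimals, speedups with 2.
std::string format_row(int test_case, int m, int n, int p,
                       double naive, double blocked, double parallel) {
    char buf[256];
    std::snprintf(buf, sizeof buf,
                  "| %d | %d × %d × %d | %.3f | %.3f | %.3f | %.2f× | %.2f× |",
                  test_case, m, n, p, naive, blocked, parallel,
                  speedup(naive, blocked), speedup(naive, parallel));
    return std::string(buf);
}
```

Calling `format_row(0, 512, 512, 512, 2.345, 0.987, 0.543)` reproduces the example row shown above.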
| 129 | +--- |
| 130 | + |
| 131 | +#### Matrix Storage and Memory Management |
| 132 | + |
| 133 | +- Continue using row-major order for all matrices, as in Assignment 1. |
| 134 | +- Use C-style arrays with manual memory management, pairing each allocation with its matching release (`malloc` with `free`, or `new[]` with `delete[]`).
| 135 | +- Do not use STL containers or smart pointers. |
| 136 | + |
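As one possible way to meet these constraints, the sketch below allocates a matrix as a single contiguous row-major array with `new[]` and releases it with `delete[]`. The helper names and the `float` element type are assumptions for illustration; use the element type from your Assignment 1 code.

```cpp
#include <cassert>
#include <cstddef>

// Allocate a rows x cols matrix as one contiguous row-major array.
// The trailing () value-initializes every element to 0.0f, which matters
// because the multiplication kernels accumulate into C with +=.
float* alloc_matrix(std::size_t rows, std::size_t cols) {
    assert(rows > 0 && cols > 0);
    return new float[rows * cols]();
}

// Release a matrix obtained from alloc_matrix. new[] must be paired with
// delete[] (not delete or free); delete[] on nullptr does nothing.
void free_matrix(float* matrix) {
    delete[] matrix;
}

// Row-major index of element (i, j) in a matrix with cols columns.
inline std::size_t idx(std::size_t i, std::size_t j, std::size_t cols) {
    return i * cols + j;
}
```

With these helpers, `C[idx(i, j, p)]` addresses the same element as `C[i * p + j]` in the snippets above.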
| 137 | +--- |
| 138 | + |
| 139 | +#### Input/Output and Validation |
| 140 | + |
| 141 | +- Use the same input/output format as Assignment 1: |
| 142 | + - Input files: `data/<case>/input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)). |
| 143 | + - Output file: `data/<case>/result.raw` (matrix \( C \)). |
| 144 | + - Reference file: `data/<case>/output.raw` for validation. |
| 145 | +- The executable accepts a case number (0–9) as a command-line argument. |
| 146 | +- Validate correctness by comparing `result.raw` with `output.raw` for each implementation. |
| 147 | + |
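The blocked and parallel versions add the products in a different order than the naive loop, so floating-point results may differ in the last digits even when the implementation is correct. The sketch below compares two in-memory arrays with a tolerance. It assumes `float` elements and a tolerance of `1e-3`, neither of which is specified by the assignment, and it deliberately leaves out file parsing because the `.raw` format is defined in Assignment 1.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Return true if every element of result is within tol of reference.
// The tolerance is absolute for small values and relative for large ones.
bool matrices_match(const float* result, const float* reference,
                    std::size_t count, float tol = 1e-3f) {
    assert(result != nullptr && reference != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        float diff = std::fabs(result[i] - reference[i]);
        float scale = std::fmax(1.0f, std::fabs(reference[i]));
        if (!(diff <= tol * scale)) {
            return false;   // written this way so a NaN difference also fails
        }
    }
    return true;
}
```

If the validation from Assignment 1 uses a different comparison rule, that rule takes precedence over this sketch.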
| 148 | +--- |
| 149 | + |
| 150 | +### Build Instructions |
| 151 | + |
| 152 | +- Use the provided `CMakeLists.txt` to build the project. |
| 153 | +- **Additional Requirements**: |
| 154 | + - Ensure OpenMP is enabled in your compiler (e.g., `-fopenmp` for GCC). |
| 155 | + - The provided CMake file includes OpenMP support. |
| 156 | +- **Windows Users**: |
| 157 | + - Use CLion or Visual Studio with CMake. |
| 158 | + - Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make`. |
| 159 | +- **Linux/Mac Users**: |
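| 160 | + - Make sure the GCC compiler is installed (`brew install gcc` on Mac). On macOS the plain `gcc` and `g++` commands usually invoke Apple Clang, so you may need to pass the versioned binaries that Homebrew installs.
| 161 | + - Configure CMake to use the correct compiler: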
| 162 | + ```bash |
| 163 | + cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ . |
| 164 | + ``` |
| 165 | + - Run the `cmake` command above (or plain `cmake .` if GCC is already the default compiler) to generate a Makefile, then run `make`.
| 166 | +- **Testing OpenMP**: |
| 167 | + - Set the number of threads using the environment variable `OMP_NUM_THREADS` (e.g., `export OMP_NUM_THREADS=4` on
| 168 | + Linux/Mac, `set OMP_NUM_THREADS=4` in the Windows Command Prompt, or `$env:OMP_NUM_THREADS=4` in PowerShell).
| 169 | + - Test with different thread counts to find the best performance. |
| 170 | + |
| 171 | +--- |
| 172 | + |
| 173 | +### Submission Requirements |
| 174 | + |
| 175 | +#### Fork and Clone the Repository |
| 176 | + |
| 177 | +- Fork the Assignment 2 repository (provided separately). |
| 178 | +- Clone your fork (replace `parallelcomputingabo` in the URL below with your GitHub username):
| 179 | + ```bash |
| 180 | + git clone https://github.com/parallelcomputingabo/Homework-2.git |
| 181 | + cd Homework-2 |
| 182 | + ``` |
| 183 | + |
| 184 | +#### Create a New Branch |
| 185 | + |
| 186 | +```bash |
| 187 | +git checkout -b student-name |
| 188 | +``` |
| 189 | + |
| 190 | +#### Implement Your Solution |
| 191 | + |
| 192 | +- Modify the provided `main.cpp` to implement `blocked_matmul` and `parallel_matmul`. |
| 193 | +- Update `README.md` with your performance results table. |
| 194 | + |
| 195 | +#### Commit and Push |
| 196 | + |
| 197 | +```bash |
| 198 | +git add . |
| 199 | +git commit -m "student-name: Implemented optimized matrix multiplication" |
| 200 | +git push origin student-name |
| 201 | +``` |
| 202 | + |
| 203 | +#### Submit a Pull Request (PR) |
| 204 | + |
| 205 | +- Create a pull request from your branch to the base repository’s `main` branch. |
| 206 | +- Include a description of your optimizations and any challenges faced. |
| 207 | + |
| 208 | +--- |
| 209 | + |
| 210 | +### Grading (100 Points Total) |
| 211 | + |
| 212 | +| Subtask | Points | |
| 213 | +|---------------------------------------------|--------| |
| 214 | +| Correct implementation of `blocked_matmul` | 30 | |
| 215 | +| Correct implementation of `parallel_matmul` | 30 | |
| 216 | +| Accurate performance measurements | 20 | |
| 217 | +| Performance results table in README.md | 10 | |
| 218 | +| Code clarity, commenting, and organization | 10 | |
| 219 | +| **Total** | 100 | |
| 220 | + |
| 221 | +--- |
| 222 | + |
| 223 | +### Tips for Success |
| 224 | + |
| 225 | +- **Cache Optimization**: |
| 226 | + - Experiment with different block sizes. Start with powers of 2 (e.g., 16, 32, 64). |
| 227 | + - Choose a block size for which the three active blocks (one each from A, B, and C) fit in the cache together; very small blocks add loop overhead without improving reuse.
| 228 | +- **OpenMP**: |
| 229 | + - Test with different thread counts to find the optimal number for your system. |
| 230 | + - Be cautious of false sharing: when threads write to different elements that share a cache line, the line bounces between cores and slows every thread down.
| 231 | +- **Performance Measurement**: |
| 232 | + - Run multiple iterations for each test case and report the average time to reduce variability. |
| 233 | + - Ensure no other heavy processes are running during measurements. |
| 234 | +- **Debugging**: |
| 235 | + - Validate each implementation against `output.raw` to ensure correctness before optimizing. |
| 236 | + - Use small test cases to debug your blocked and parallel implementations. |
| 237 | + |
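The advice to average several runs can be sketched as a small timing helper. `average_seconds` is a hypothetical name, and the sketch uses `std::chrono::steady_clock` so that it compiles without OpenMP; in your submission you would substitute `omp_get_wtime()` as the assignment requires.

```cpp
#include <cassert>
#include <chrono>

// Run work() `repeats` times and return the mean wall-clock time in seconds.
template <typename Work>
double average_seconds(Work work, int repeats) {
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    double total = 0.0;
    for (int r = 0; r < repeats; ++r) {
        auto start = clock::now();
        work();
        total += std::chrono::duration<double>(clock::now() - start).count();
    }
    return total / repeats;
}
```

Because the kernels accumulate into C with `+=`, the callable passed as `work` should reset C to zero before each multiplication. Otherwise only the first run produces the correct product.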
| 238 | +Good luck, and enjoy optimizing your matrix multiplication! |