Commit 19c605d

Assignment 2
1 parent ca7c814 commit 19c605d

33 files changed

Lines changed: 4863 additions & 2 deletions

CMakeLists.txt

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
cmake_minimum_required(VERSION 3.10)

project(MatrixMultiplication)

set(CMAKE_CXX_STANDARD 11)

set(CMAKE_CXX_STANDARD_REQUIRED ON)

find_package(OpenMP REQUIRED)

if(OpenMP_CXX_FOUND)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")
endif()

if(APPLE)
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -D_Alignof=alignof")
endif()

add_executable(matmul main_ans.cpp)

if(OpenMP_CXX_FOUND)
    target_link_libraries(matmul PUBLIC OpenMP::OpenMP_CXX)
endif()

README.md

Lines changed: 238 additions & 2 deletions
@@ -1,2 +1,238 @@
1-
# Homework-2
2-
Parallel Computing Homework Assignment 1
# Parallel Programming

**Åbo Akademi University, Information Technology Department**

**Instructor: Alireza Olama**

## Homework Assignment 2: Optimizing Matrix Multiplication in C++

**Due Date**: 08/05/2025

**Points**: 100

---

### Assignment Overview

Welcome to the second homework assignment of the Parallel Programming course! In Assignment 1, you implemented a naive
matrix multiplication using a triple nested loop. In this assignment, you will optimize the performance of your naive
implementation using two techniques:

1. **Cache Optimization via Blocked Matrix Multiplication**: Improve data locality to reduce cache misses.
2. **Parallel Matrix Multiplication using `OpenMP`**: Parallelize the computation across multiple threads.

Your task is to implement both optimizations in the provided C++ `main.cpp` file, measure their performance, and compare the
wall clock time of the naive, cache-optimized, and parallel implementations for each test case. This assignment builds
on your Assignment 1 code, so ensure your naive implementation is correct before starting.

---

### Technical Requirements

#### 1. Cache Optimization (Blocked Matrix Multiplication)

**Why Cache Optimization?**

Modern CPUs rely on cache memory to reduce the latency of accessing data from main memory. Cache memory is faster than
main memory but much smaller, and it is organized in cache lines (typically 64 bytes). When a CPU accesses a memory
location, it fetches the entire cache line that contains it. Matrix multiplication can suffer from poor performance if
memory accesses are not cache-friendly, leading to frequent cache misses.

The naive matrix multiplication (with triple nested loops) does not fully exploit spatial and temporal locality. With
row-major storage and the usual i-j-k loop order, the innermost loop reads `A[i * n + k]` contiguously but reads
`B[k * p + j]` with a stride of `p` elements, so once a row of `B` is longer than a cache line, each access to `B`
touches a different line:

- **Spatial Locality**: Accessing consecutive memory locations (e.g., elements in the same cache line).
- **Temporal Locality**: Reusing the same data multiple times while it’s still in the cache.

Blocked matrix multiplication divides the matrices into smaller submatrices (blocks) that fit into the cache. By
performing computations on these blocks, you ensure that data is reused while it resides in the cache, reducing cache
misses and improving performance.

**Blocked Matrix Multiplication Pseudocode**

Assume matrices \( A \) (m × n), \( B \) (n × p), and \( C \) (m × p) are stored in row-major order. The blocked matrix
multiplication processes submatrices of size `block_size` × `block_size`:

```cpp
// C = A * B. C must be zero-initialized, because the kernel accumulates into it.
// std::min comes from <algorithm>.
for (int ii = 0; ii < m; ii += block_size)
    for (int jj = 0; jj < p; jj += block_size)
        for (int kk = 0; kk < n; kk += block_size)
            // Process block: C[ii:ii+block_size, jj:jj+block_size] +=
            //     A[ii:ii+block_size, kk:kk+block_size] * B[kk:kk+block_size, jj:jj+block_size]
            for (int i = ii; i < std::min(ii + block_size, m); i++)
                for (int j = jj; j < std::min(jj + block_size, p); j++)
                    for (int k = kk; k < std::min(kk + block_size, n); k++)
                        C[i * p + j] += A[i * n + k] * B[k * p + j];
```

- **block_size**: Chosen so that the three blocks in use at once (one each from \( A \), \( B \), and \( C \)) fit in
  the cache together (e.g., 32, 64, or 128, depending on the system).
- **Outer loops (ii, jj, kk)**: Iterate over blocks. The `min` bounds handle dimensions that are not a multiple of
  `block_size`.
- **Inner loops (i, j, k)**: Compute within a block, reusing data in the cache.

**Task**: Implement the `blocked_matmul` function in the provided `main.cpp`. Experiment with different block sizes (e.g.,
16, 32, 64) and report the best performance.
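
The loop nest above could be turned into a complete function along the following lines. This is only a sketch under stated assumptions: the `float` element type and the signature (including passing `block_size` as a parameter) are hypothetical, since the real signature is whatever the provided `main.cpp` declares. It also swaps the inner `j` and `k` loops relative to the pseudocode, a common variation that makes the innermost loop walk `B` and `C` contiguously.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical signature: A is m x n, B is n x p, C is m x p, all row-major.
void blocked_matmul(const float* A, const float* B, float* C,
                    int m, int n, int p, int block_size) {
    assert(block_size > 0);
    // The kernel accumulates into C, so clear it first.
    std::fill(C, C + static_cast<std::size_t>(m) * p, 0.0f);

    for (int ii = 0; ii < m; ii += block_size) {
        const int i_end = std::min(ii + block_size, m);
        for (int kk = 0; kk < n; kk += block_size) {
            const int k_end = std::min(kk + block_size, n);
            for (int jj = 0; jj < p; jj += block_size) {
                const int j_end = std::min(jj + block_size, p);
                // Inside a tile, use i-k-j order (a variation on the i-j-k
                // pseudocode) so the innermost loop is unit-stride in B and C.
                for (int i = ii; i < i_end; ++i) {
                    float* c_row = C + static_cast<std::size_t>(i) * p;
                    for (int k = kk; k < k_end; ++k) {
                        const float a_ik = A[static_cast<std::size_t>(i) * n + k];
                        const float* b_row = B + static_cast<std::size_t>(k) * p;
                        for (int j = jj; j < j_end; ++j) {
                            c_row[j] += a_ik * b_row[j];
                        }
                    }
                }
            }
        }
    }
}
```

Because the products are summed in a different order than in the naive version, floating-point results may differ in the last digits, so compare against the reference with a small tolerance rather than exact equality.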

---

#### 2. Parallel Matrix Multiplication with OpenMP

**Why OpenMP?**

`OpenMP` is a portable API for parallel programming on shared-memory systems. It allows you to parallelize loops with
minimal code changes, distributing iterations across multiple threads. In matrix multiplication, the outer loop(s) can
be parallelized, as each element of the output matrix \( C \) can be computed independently.

**Parallelizing with OpenMP**

Use OpenMP to parallelize the outer loop(s) of the naive matrix multiplication. For example, parallelize the loop over
rows of \( C \):

```cpp
// C must be zero-initialized, because the loop accumulates into it.
#pragma omp parallel for
for (int i = 0; i < m; i++)
    for (int j = 0; j < p; j++)
        for (int k = 0; k < n; k++)
            C[i * p + j] += A[i * n + k] * B[k * p + j];
```

- The `#pragma omp parallel for` directive tells `OpenMP` to distribute iterations of the loop that immediately follows
  it (here, the `i` loop) across available threads.
- Ensure thread safety: each iteration of the `i` loop writes only to row `i` of \( C \), so this loop is safe to
  parallelize without locks. Declare `j` and `k` inside their loops (or list them in a `private` clause); variables
  declared outside the parallel region are shared by default, which would cause a data race.
- Use `omp_get_wtime()` (declared in `<omp.h>`) to measure wall clock time for accurate performance comparisons.

**Task**: Implement the `parallel_matmul` function in the provided `main.cpp` using `OpenMP`. Test with different numbers of
threads (e.g., 2, 4, 8) by setting the environment variable `OMP_NUM_THREADS`.
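
One possible shape for this function is sketched below. The `float` element type and the signature are assumptions, not taken from the provided `main.cpp`. The sketch accumulates each dot product in a local variable and writes to `C` once per element, which avoids having to zero `C` beforehand and keeps repeated writes out of shared memory.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical signature: A is m x n, B is n x p, C is m x p, all row-major.
void parallel_matmul(const float* A, const float* B, float* C,
                     int m, int n, int p) {
    assert(m >= 0 && n >= 0 && p >= 0);
    // Rows of C are divided among the threads. i, j, k and sum are declared
    // inside the loops, so every thread works on its own private copies.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < p; ++j) {
            float sum = 0.0f;  // thread-private accumulator
            for (int k = 0; k < n; ++k) {
                sum += A[static_cast<std::size_t>(i) * n + k] *
                       B[static_cast<std::size_t>(k) * p + j];
            }
            C[static_cast<std::size_t>(i) * p + j] = sum;  // one write per element
        }
    }
}
```

If the file is compiled without OpenMP enabled, the pragma is ignored and the loop simply runs on one thread, which is a convenient way to check correctness before measuring speedup.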

---

#### 3. Performance Measurement

For each test case (0 through 9 in the `data` folder):

- Measure the **wall clock time** for:
  - Naive matrix multiplication (`naive_matmul`).
  - Cache-optimized matrix multiplication (`blocked_matmul`).
  - Parallel matrix multiplication (`parallel_matmul`).
- Use `omp_get_wtime()` for timing, as it provides high-resolution wall clock time.
- Report the times in a table in your submission `README.md`, including:
  - Test case number.
  - Matrix dimensions (m × n × p).
  - Wall clock time for each implementation (in seconds).
  - Speedup of the blocked and parallel implementations over the naive implementation (naive time divided by the
    optimized time).

Example table format:

| Test Case | Dimensions (m × n × p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup |
|-----------|------------------------|----------------|------------------|-------------------|-----------------|------------------|
| 0         | 512 × 512 × 512        | 2.345          | 0.987            | 0.543             | 2.38×           | 4.32×            |
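
The measurements could be collected with a small harness like the one sketched here. The helper names (`wall_time`, `average_seconds`, `speedup`) are hypothetical and not part of the provided code. The `std::chrono` fallback is an addition so that the sketch also builds without OpenMP; with OpenMP enabled it uses `omp_get_wtime()` as the assignment asks.

```cpp
#include <cassert>
#include <chrono>
#ifdef _OPENMP
#include <omp.h>
#endif

// Wall clock time in seconds: omp_get_wtime() when built with OpenMP,
// std::chrono::steady_clock otherwise.
double wall_time() {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
#endif
}

// Hypothetical helper: runs `run` `repeats` times and returns the mean time.
template <typename Fn>
double average_seconds(Fn&& run, int repeats) {
    assert(repeats > 0);
    double total = 0.0;
    for (int r = 0; r < repeats; ++r) {
        const double start = wall_time();
        run();
        total += wall_time() - start;
    }
    return total / repeats;
}

// Speedup as reported in the table: naive time divided by optimized time.
double speedup(double naive_seconds, double optimized_seconds) {
    assert(optimized_seconds > 0.0);
    return naive_seconds / optimized_seconds;
}
```

With the example row above, `speedup(2.345, 0.987)` is about 2.38 and `speedup(2.345, 0.543)` is about 4.32, matching the two speedup columns.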

---

#### Matrix Storage and Memory Management

- Continue using row-major order for all matrices, as in Assignment 1.
- Use C-style arrays with manual memory management: pair `malloc` with `free`, or `new[]` with `delete[]`, and never mix
  the two families.
- Do not use STL containers or smart pointers.
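
A minimal sketch of what these rules mean in practice is shown below. The helper names and the `float` element type are assumptions chosen for illustration; the Assignment 1 code may organize this differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: allocates a rows x cols row-major matrix, zero-filled.
// Returns nullptr if the allocation fails.
float* alloc_matrix(int rows, int cols) {
    assert(rows > 0 && cols > 0);
    const std::size_t count = static_cast<std::size_t>(rows) * cols;
    // calloc sets all bytes to zero, which is 0.0f for IEEE 754 floats.
    return static_cast<float*>(std::calloc(count, sizeof(float)));
}

// Row-major indexing: element (row, col) lives at offset row * cols + col.
inline float& elem(float* M, int cols, int row, int col) {
    return M[static_cast<std::size_t>(row) * cols + col];
}

// Memory from calloc/malloc must be released with free, not delete.
void free_matrix(float* M) {
    std::free(M);
}
```

Keeping allocation and release in matching helpers makes it harder to pair `malloc` with `delete` by accident.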

---

#### Input/Output and Validation

- Use the same input/output format as Assignment 1:
  - Input files: `data/<case>/input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)).
  - Output file: `data/<case>/result.raw` (matrix \( C \)).
  - Reference file: `data/<case>/output.raw` for validation.
- The executable accepts a case number (0–9) as a command-line argument.
- Validate correctness by comparing `result.raw` with `output.raw` for each implementation.
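
Once both matrices are in memory, the comparison could look like the sketch below. It deliberately leaves out file parsing, because the `.raw` layout is defined by Assignment 1 and not repeated here. The function name, the `float` element type, and the relative tolerance are all assumptions; a tolerance is suggested because blocked and parallel versions may sum products in a different order than the reference.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: true if every element of `result` is within
// `tolerance` of `reference`, scaled by the magnitude of the reference value.
bool matrices_match(const float* result, const float* reference,
                    std::size_t count, float tolerance) {
    assert(tolerance >= 0.0f);
    for (std::size_t idx = 0; idx < count; ++idx) {
        const float diff = std::fabs(result[idx] - reference[idx]);
        const float scale = std::fmax(1.0f, std::fabs(reference[idx]));
        // Written as a negated <= so that a NaN difference also fails.
        if (!(diff <= tolerance * scale)) {
            return false;
        }
    }
    return true;
}
```

A call such as `matrices_match(C, expected, static_cast<std::size_t>(m) * p, 1e-4f)` would then gate each timing run; the `1e-4f` value is only a starting point and may need adjusting for the precision of the reference files.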

---

### Build Instructions

- Use the provided `CMakeLists.txt` to build the project.
- **Additional Requirements**:
  - Ensure OpenMP is enabled in your compiler (e.g., `-fopenmp` for GCC).
  - The provided CMake file includes OpenMP support.
- **Windows Users**:
  - Use CLion or Visual Studio with CMake.
  - Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make` (named `mingw32-make` in some MinGW
    distributions).
- **Linux/Mac Users**:
  - Make sure GCC is installed (`brew install gcc` on Mac).
  - Configure CMake to use the correct compiler. On macOS, plain `gcc` and `g++` normally resolve to Apple Clang, so
    pass the versioned names that Homebrew installs (for example `gcc-14` and `g++-14`; the number depends on the
    installed release):
    ```bash
    cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ .
    ```
  - This generates a Makefile; then run `make`. If your default compiler already supports OpenMP, a plain `cmake .`
    is enough.
- **Testing OpenMP**:
  - Set the number of threads using the environment variable `OMP_NUM_THREADS` (e.g., `export OMP_NUM_THREADS=4` on
    Linux/Mac, `set OMP_NUM_THREADS=4` in the Windows Command Prompt, or `$env:OMP_NUM_THREADS=4` in PowerShell).
  - Test with different thread counts to find the best performance.

---

### Submission Requirements

#### Fork and Clone the Repository

- Fork the Assignment 2 repository (provided separately).
- Clone your fork. The URL below is the base repository; replace `parallelcomputingabo` with your GitHub username so
  that you clone your own fork:
  ```bash
  git clone https://github.com/parallelcomputingabo/Homework-2.git
  cd Homework-2
  ```

#### Create a New Branch

```bash
git checkout -b student-name
```

#### Implement Your Solution

- Modify the provided `main.cpp` to implement `blocked_matmul` and `parallel_matmul`.
- Update `README.md` with your performance results table.

#### Commit and Push

```bash
git add .
git commit -m "student-name: Implemented optimized matrix multiplication"
git push origin student-name
```

#### Submit a Pull Request (PR)

- Create a pull request from your branch to the base repository’s `main` branch.
- Include a description of your optimizations and any challenges faced.

---

### Grading (100 Points Total)

| Subtask                                     | Points |
|---------------------------------------------|--------|
| Correct implementation of `blocked_matmul`  | 30     |
| Correct implementation of `parallel_matmul` | 30     |
| Accurate performance measurements           | 20     |
| Performance results table in README.md      | 10     |
| Code clarity, commenting, and organization  | 10     |
| **Total**                                   | 100    |

---

### Tips for Success

- **Cache Optimization**:
  - Experiment with different block sizes. Start with powers of 2 (e.g., 16, 32, 64).
  - Choose a block size large enough to amortize the extra loop overhead but small enough that the blocks in use stay
    in the cache.
- **OpenMP**:
  - Test with different thread counts to find the optimal number for your system.
  - Be cautious of false sharing (when threads write to different variables that sit on the same cache line, forcing
    that line to move back and forth between cores).
- **Performance Measurement**:
  - Run multiple iterations for each test case and report the average time to reduce variability.
  - Ensure no other heavy processes are running during measurements.
- **Debugging**:
  - Validate each implementation against `output.raw` to ensure correctness before optimizing.
  - Use small test cases to debug your blocked and parallel implementations.

Good luck, and enjoy optimizing your matrix multiplication!
