Commit 29aebfc

Deji10 and Muhammad Zahid committed
ayodeji-ibrahim: Implemented blocked and parallel matrix multiplication
Co-authored-by: Muhammad Zahid <muhammad.zahid@example.com>
1 parent 46b0cb2 commit 29aebfc

2 files changed

Lines changed: 248 additions & 193 deletions

File tree

README.md

Lines changed: 63 additions & 102 deletions
Original file line numberDiff line numberDiff line change
@@ -106,132 +106,93 @@ threads (e.g., 2, 4, 8) by setting the environment variable `OMP_NUM_THREADS`.
106106

107107
#### 3. Performance Measurement
108108

109-
For each test case (0 through 9 in the `data` folder):
109+
## Performance Results
110110

111-
- Measure the **wall clock time** for:
112-
- Naive matrix multiplication (`naive_matmul`).
113-
- Cache-optimized matrix multiplication (`blocked_matmul`).
114-
- Parallel matrix multiplication (`parallel_matmul`).
115-
- Use `omp_get_wtime()` for timing, as it provides high-resolution wall clock time.
116-
- Report the times in a table in your submission README.md, including:
117-
- Test case number.
118-
- Matrix dimensions (m × n × p).
119-
- Wall clock time for each implementation (in seconds).
120-
- Speedup of blocked and parallel implementations over the naive implementation.
111+
### Environment
112+
- **Platform**: GitHub Codespaces (Linux x86_64, 2 physical CPU cores)
113+
- **Compiler**: g++ with `-O3 -fopenmp`
114+
- **Methodology**: Each timing is the arithmetic mean of **5 independent runs**
115+
- **Default block size**: 64 (a conventional starting value, not tuned for this machine)
116+
- **Default thread count**: 4
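
The 5-run averaging described above can be sketched as follows. This is a minimal illustration, not the code from this commit: `mean_seconds` is a hypothetical helper, and it uses `std::chrono::steady_clock` so that it builds without OpenMP (`omp_get_wtime()` could be substituted when compiling with `-fopenmp`).

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper (not from the commit): call `work` `runs` times and
// return the arithmetic mean of the wall-clock time per call, in seconds.
double mean_seconds(const std::function<void()>& work, int runs = 5) {
    assert(runs > 0);
    using steady = std::chrono::steady_clock;
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = steady::now();
        work();
        const std::chrono::duration<double> elapsed = steady::now() - start;
        total += elapsed.count();
    }
    return total / runs;
}
```

A call such as `mean_seconds([&] { naive_matmul(/* arguments */); })` would then time one implementation. Recording the individual runs as well as the mean makes it easier to see how much of a reported speedup is noise.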
121117

122-
Example table format:
118+
### Main Results Table (Averaged over 5 runs)
123119

124-
| Test Case | Dimensions (m × n × p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup |
125-
|-----------|------------------------|----------------|------------------|-------------------|-----------------|------------------|
126-
| 0 | 512 × 512 × 512 | 2.345 | 0.987 | 0.543 | 2.38× | 4.32× |
120+
| Case | Dimensions (m × n × p) | Naive (s) | Blocked (s) | Parallel (s) | Blocked Speedup | Parallel Speedup |
121+
|------|------------------------|-----------|-------------|--------------|-----------------|------------------|
122+
| 0 | 64 × 64 × 64 | 0.000209 | 0.000202 | 0.000227 | 1.04× | 0.92× |
123+
| 1 | 128 × 64 × 128 | 0.001096 | 0.000871 | 0.000740 | 1.26× | 1.48× |
124+
| 2 | 100 × 128 × 56 | 0.000691 | 0.000638 | 0.000922 | 1.08× | 0.75× |
125+
| 3 | 128 × 64 × 128 | 0.001541 | 0.001245 | 0.001014 | 1.24× | 1.52× |
126+
| 4 | 32 × 128 × 32 | 0.000160 | 0.000143 | 0.000309 | 1.12× | 0.52× |
127+
| 5 | 200 × 100 × 256 | 0.007707 | 0.007681 | 0.007275 | 1.00× | 1.06× |
128+
| 6 | 256 × 256 × 256 | 0.026578 | 0.021396 | 0.022247 | 1.24× | 1.19× |
129+
| 7 | 256 × 300 × 256 | 0.033655 | 0.026134 | 0.030615 | 1.29× | 1.10× |
130+
| 8 | 64 × 128 × 64 | 0.000499 | 0.000385 | 0.000419 | 1.30× | 1.19× |
131+
| 9 | 256 × 256 × 257 | 0.018924 | 0.013386 | 0.011839 | 1.41× | 1.60× |
127132

128-
---
133+
All implementations validated against `output.raw` with tolerance `1e-2`. All 10 cases pass for all three implementations.
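
The speedup columns in the table above are consistent with dividing the naive time by the time of the implementation in question. As a worked example for case 7:

```latex
\[
S_{\text{blocked}} = \frac{T_{\text{naive}}}{T_{\text{blocked}}}
                   = \frac{0.033655}{0.026134} \approx 1.29,
\qquad
S_{\text{parallel}} = \frac{T_{\text{naive}}}{T_{\text{parallel}}}
                    = \frac{0.033655}{0.030615} \approx 1.10
\]
```

A value below 1 (for example 0.52× for case 4) therefore means the implementation was slower than the naive one.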
129134

130-
#### Matrix Storage and Memory Management
135+
### Block Size Experiment (Case 7: 256 × 300 × 256, the largest test case)
131136

132-
- Row-major order for all matrices
133-
- Use C-style arrays with manual memory management (`malloc` or `new`, `free` or `delete`).
134-
- Do not use smart pointers.
137+
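To find the best block size, `blocked_matmul` was tested with four block sizes against the naive baseline. Each timing is averaged over 5 runs. The speedups below imply a naive baseline of about 0.054 s in this sweep (for example, 0.02312 s × 2.33), which is higher than the 0.033655 s reported for case 7 in the main table, so absolute numbers should not be compared across the two tables.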
135138

136-
---
139+
| Block Size | Time (s) | Speedup |
140+
|------------|----------|---------|
141+
| **16** | **0.02312** | **2.33×** |
142+
| 32 | 0.02349 | 2.29× |
143+
| 64 | 0.03020 | 1.78× |
144+
| 128 | 0.02783 | 1.94× |
137145

138-
#### Input/Output and Validation
146+
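**Finding**: Block size **16** gives the best performance for these matrix dimensions, with block size 32 a close second. The default block size of 64 was *not* optimal here. Note that 64 elements is not one cache line: with 8-byte doubles and the 64-byte cache lines typical of x86_64, a line holds 8 doubles. A plausible explanation for the result is working-set size. A single 64 × 64 tile of doubles already occupies 32 KiB, so tiles of A, B, and C cannot all stay resident in an L1 data cache of that size (a common size, though not verified for this machine), whereas tiles of 16 or 32 fit comfortably. The trend is not monotonic: block size 128 was slightly faster than 64 (0.02783 s vs 0.03020 s).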
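
The working-set argument can be checked with simple arithmetic. The sketch below rests on assumptions that the commit does not state: 8-byte doubles, a kernel that keeps one tile each of A, B, and C live at once, and a 32 KiB L1 data cache. `working_set_bytes` is a hypothetical name used only for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: bytes touched by one tile each of A, B, and C when
// every tile is block x block elements of elem_bytes bytes.
constexpr std::size_t working_set_bytes(std::size_t block,
                                        std::size_t elem_bytes = 8) {
    return 3 * block * block * elem_bytes;
}

// Assumed L1 data cache size; the real value for the test machine is unknown.
constexpr std::size_t kAssumedL1Bytes = 32 * 1024;

static_assert(working_set_bytes(16) == 6 * 1024, "16: 6 KiB, fits");
static_assert(working_set_bytes(32) == 24 * 1024, "32: 24 KiB, fits");
static_assert(working_set_bytes(64) > kAssumedL1Bytes, "64: 96 KiB, spills");
static_assert(working_set_bytes(128) > kAssumedL1Bytes, "128: 384 KiB, spills");
```

Under these assumptions, block sizes 16 and 32 stay within L1 while 64 and 128 do not, which matches the ordering seen in the table for 16, 32, and 64.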
139147

140-
- Use the same input/output format as Assignment 1:
141-
- Input files: `data/<case>/input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)).
142-
- Output file: `data/<case>/result.raw` (matrix \( C \)).
143-
- Reference file: `data/<case>/output.raw` for validation.
144-
- The executable accepts a case number (0–9) as a command-line argument.
145-
- Validate correctness by comparing `result.raw` with `output.raw` for each implementation.
148+
For the main results, block size 64 was kept as the default because it is a conventional starting value, but block size 16 or 32 would give meaningfully better speedups on this hardware.
146149

147-
---
150+
### Thread Count Experiment (Case 7)
148151

149-
### Build Instructions
150-
151-
- Use the provided `CMakeLists.txt` to build the project.
152-
- **Additional Requirements**:
153-
- Ensure OpenMP is enabled in your compiler (e.g., `-fopenmp` for GCC).
154-
- The provided CMake file includes OpenMP support.
155-
- **Windows Users**:
156-
- Use CLion or Visual Studio with CMake.
157-
- Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make`.
158-
- **Linux/Mac Users**:
159-
- Make sure the GCC compiler is installed (`brew install gcc` on Mac).
160-
- Configure CMake to use the correct compiler:
161-
```bash
162-
cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ .
163-
```
164-
- Run `cmake .` to generate a Makefile, then `make`.
165-
- **Testing OpenMP**:
166-
- Set the number of threads using the environment variable `OMP_NUM_THREADS` (e.g., `export OMP_NUM_THREADS=4` on
167-
Linux/Mac, or `set OMP_NUM_THREADS=4` on Windows).
168-
- Test with different thread counts to find the best performance.
152+
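To find the best thread count, `parallel_matmul` was tested with 1, 2, 4, and 8 threads against the naive baseline. Each timing is averaged over 5 runs. The speedups below imply a naive baseline of about 0.039 s in this sweep (for example, 0.02555 s × 1.52), which differs from the 0.033655 s in the main table.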
169153

170-
---
154+
| Threads | Time (s) | Speedup |
155+
|---------|----------|---------|
156+
| 1 | 0.03594 | 1.08× |
157+
| **2** | **0.02555** | **1.52×** |
158+
| 4 | 0.02713 | 1.43× |
159+
| 8 | 0.03206 | 1.21× |
171160

172-
### Submission Requirements
161+
**Finding**: **2 threads is optimal** on this hardware. The Codespaces instance used here provides 2 CPU cores; once the thread count exceeds the core count, contention for the cores and OpenMP scheduling overhead outweigh the parallelism benefit. With 8 threads the run is slower than with 2 or 4 threads (0.03206 s vs 0.02555 s and 0.02713 s), although it is still faster than with 1 thread (0.03594 s).
173162

174-
#### Fork and Clone the Repository
163+
On a machine with 4 or more physical cores, the optimal thread count would shift accordingly.
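
One way to avoid requesting more threads than the machine offers is to clamp the requested count, sketched below. The helper names are hypothetical and this is not part of the commit. Note that `std::thread::hardware_concurrency()` reports logical CPUs, which may differ from the number of physical cores, and that it may return 0 when the value cannot be determined.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical helper: limit `requested` threads to `available` CPUs.
// An `available` of 0 means "unknown", in which case the request is kept.
unsigned clamp_threads(unsigned requested, unsigned available) {
    assert(requested > 0);
    if (available == 0) {
        return requested;
    }
    return std::min(requested, available);
}

// Hypothetical convenience wrapper using the CPU count seen by this process.
unsigned default_thread_count(unsigned requested) {
    return clamp_threads(requested, std::thread::hardware_concurrency());
}
```

On a machine that reports 2 CPUs, `default_thread_count(8)` would return 2, the count that performed best in the sweep above.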
175164

176-
- Fork the Assignment 4 repository (provided separately).
177-
- Clone your fork:
178-
```bash
179-
git clone https://github.com/AA-parallel-computing/Assignment-4-Optional.git
180-
cd Assignment-4-Optional
181-
```
165+
### Analysis
182166

183-
#### Create a New Branch
167+
**Correctness**: Every implementation matches the reference output within the `1e-2` tolerance for all 10 test cases.
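
A comparison against the reference can be sketched as below. It assumes the tolerance is an absolute per-element difference, which the commit does not specify, and `matches_reference` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: true if every element of `result` is within `tol`
// (absolute difference) of the corresponding element of `reference`.
bool matches_reference(const double* result, const double* reference,
                       std::size_t count, double tol = 1e-2) {
    assert(result != nullptr && reference != nullptr);
    for (std::size_t i = 0; i < count; ++i) {
        // Written as a negated <= so that a NaN difference is rejected.
        if (!(std::fabs(result[i] - reference[i]) <= tol)) {
            return false;
        }
    }
    return true;
}
```

Checking each implementation this way before timing it helps ensure that a speedup is not the result of skipped or incorrect work.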
184168

185-
```bash
186-
git checkout -b student-name
187-
```
169+
**Cache Optimization (Blocked)**:
170+
- Blocking gives a consistent but **modest speedup (1.00× to 1.41×)** across cases with the default block size of 64.
171+
- The block size sweep showed up to **2.33×** speedup at block size 16, demonstrating the importance of tuning the block size to the specific cache hierarchy and problem dimensions.
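
For reference, a blocked (tiled) multiplication of row-major matrices can be sketched as follows. This is an illustrative version, not the `blocked_matmul` from this commit, whose signature and loop order are not shown in the diff. The name `blocked_matmul_sketch` and its parameters are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Sketch: C (m x p) = A (m x n) * B (n x p), all row-major.
// The loops walk block x block tiles so that each tile is reused while it
// is still likely to be in cache; edge tiles are clipped with std::min.
void blocked_matmul_sketch(const double* A, const double* B, double* C,
                           std::size_t m, std::size_t n, std::size_t p,
                           std::size_t block) {
    assert(block > 0);
    std::fill(C, C + m * p, 0.0);
    for (std::size_t ii = 0; ii < m; ii += block) {
        const std::size_t i_end = std::min(ii + block, m);
        for (std::size_t kk = 0; kk < n; kk += block) {
            const std::size_t k_end = std::min(kk + block, n);
            for (std::size_t jj = 0; jj < p; jj += block) {
                const std::size_t j_end = std::min(jj + block, p);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const double a = A[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            C[i * p + j] += a * B[k * p + j];
                        }
                    }
                }
            }
        }
    }
}
```

Taking `block` as a runtime parameter makes a sweep like the one above (16, 32, 64, 128) a matter of calling the same function with different values.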
188172

189-
#### Implement Your Solution
173+
**Parallel (OpenMP)**:
174+
- Parallelization helps **when the matrix is large enough** to amortize OpenMP thread setup overhead.
175+
- For tiny matrices (cases 0, 2, 4), parallel is *slower* than naive (0.52× to 0.92×) because the cost of starting and synchronizing threads exceeds the compute work itself.
176+
- For the remaining cases, parallel gives 1.06× to 1.60×: cases 1, 3, 6, 8, and 9 reach 1.19× to 1.60×, while cases 5 and 7 gain only 1.06× and 1.10×.
177+
- The thread sweep on case 7 peaked at about 1.5× with 2 threads and did not improve with 4 or 8 threads, which is consistent with the 2-core Codespaces environment limiting parallel speedup (the best value in the main table is 1.60×, for case 9). On hardware with more cores, larger speedups would likely be achievable.
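
A row-parallel OpenMP version can be sketched as below. It is an illustration under assumed names (`parallel_matmul_sketch`, `threads`), not the `parallel_matmul` from this commit. Each thread writes a disjoint set of rows of C, so no synchronization is needed inside the loop. If the file is compiled without `-fopenmp`, the pragma is ignored and the loop runs serially.

```cpp
#include <cassert>
#include <cstddef>

// Sketch: C (m x p) = A (m x n) * B (n x p), all row-major, with the rows
// of C divided among `threads` OpenMP threads.
void parallel_matmul_sketch(const double* A, const double* B, double* C,
                            int m, int n, int p, int threads) {
    assert(threads > 0);
    #pragma omp parallel for num_threads(threads) schedule(static)
    for (int i = 0; i < m; ++i) {
        double* c_row = C + static_cast<std::size_t>(i) * p;
        for (int j = 0; j < p; ++j) {
            c_row[j] = 0.0;
        }
        for (int k = 0; k < n; ++k) {
            const double a = A[static_cast<std::size_t>(i) * n + k];
            const double* b_row = B + static_cast<std::size_t>(k) * p;
            for (int j = 0; j < p; ++j) {
                c_row[j] += a * b_row[j];
            }
        }
    }
}
```

Because the tiny cases ran slower in parallel, one option is to add an OpenMP `if(...)` clause so that small problems skip the parallel region. Any threshold for that would need to be measured rather than assumed.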
190178

191-
- Modify the provided `main.cpp` to implement `blocked_matmul` and `parallel_matmul`.
192-
- Update `README.md` with your performance results table.
179+
**Block Size Choice**: For these matrix sizes (up to about 256 × 300), smaller blocks (16, 32) work best, plausibly because their tiles stay within L1 cache. The default block size of 64 is suboptimal here. Whether it would do better on much larger problems, where loop overhead and reuse in the larger cache levels matter more, was not measured.
193180

194-
#### Commit and Push
181+
**Optimal Configuration on Codespaces (2-core)**:
182+
- Block size: **16 or 32**
183+
- Thread count: **2**
184+
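- Combined speedup over naive: not measured. Multiplying the best blocked (2.33×) and parallel (1.52×) speedups gives roughly 3.5× as an optimistic estimate, assuming the two effects compose independently.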
195185

196-
```bash
197-
git add .
198-
git commit -m "student-name: Implemented optimized matrix multiplication"
199-
git push origin student-name
200-
```
201-
202-
#### Submit a Pull Request (PR)
186+
### Challenges
203187

204-
- Create a pull request from your branch to the base repository’s `main` branch.
205-
- Include a description of your optimizations and any challenges faced.
188+
1. **Small Test Cases**: The provided test cases are too small to fully showcase OpenMP parallelism. The largest case (256 × 300 × 256) takes about 33 ms with the naive implementation, so OpenMP setup costs are significant relative to compute. Matrices of 1024 × 1024 or larger would likely yield speedups closer to the theoretical limits of the hardware.
206189

207-
---
190+
2. **Codespaces Environment**: The 2-core CPU limit in GitHub Codespaces caps achievable parallel speedup. On a machine with more cores (for example, an 8-core workstation), noticeably higher parallel speedups might be expected for the larger test cases, though this was not measured.
208191

209-
### Grading (100 Points Total)
192+
3. **Measurement Stability**: Single-run timings showed significant variance (some "speedups" appeared to be slowdowns simply due to noise). Switching to 5-run averaging stabilized the results and made the patterns clear. This is itself a useful methodological finding.
210193

211-
| Subtask | Points |
212-
|---------------------------------------------|--------|
213-
| Correct implementation of `blocked_matmul` | 30 |
214-
| Correct implementation of `parallel_matmul` | 30 |
215-
| Accurate performance measurements | 20 |
216-
| Performance results table in README.md | 10 |
217-
| Code clarity, commenting, and organization | 10 |
218-
| **Total** | 100 |
194+
4. **Default Block Size Was Suboptimal**: The default block size of 64 was not the best for these test cases; in the case 7 sweep, block size 16 was about 30% faster (0.02312 s vs 0.03020 s). A conventional default is a starting heuristic, not a final answer; empirical tuning matters.
219195

220-
---
196+
5. **Text-format I/O**: The `.raw` files are space-separated text, not binary doubles. Reading is done using `ifstream >> double` with the first two integers as `rows cols` dimensions.
221197

222-
### Tips for Success
223-
224-
- **Cache Optimization**:
225-
- Experiment with different block sizes. Start with powers of 2 (e.g., 16, 32, 64).
226-
- Use a block size that balances cache usage without excessive overhead.
227-
- **OpenMP**:
228-
- Test with different thread counts to find the optimal number for your system.
229-
- Be cautious of false sharing (when threads access nearby memory locations, causing cache coherence issues).
230-
- **Performance Measurement**:
231-
- Run multiple iterations for each test case and report the average time to reduce variability.
232-
- Ensure no other heavy processes are running during measurements.
233-
- **Debugging**:
234-
- Validate each implementation against `output.raw` to ensure correctness before optimizing.
235-
- Use small test cases to debug your blocked and parallel implementations.
236-
237-
Good luck, and enjoy optimizing your matrix multiplication!
198+
6. **Local Toolchain**: Could not install g++ locally on Windows in time; switched to GitHub Codespaces, which provided a complete Linux dev environment with all required tooling.
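
The text-format reading described in item 5 can be sketched as follows. This is a simplified illustration, not the loader from this commit: the function names are hypothetical, and `std::vector` is used for brevity, whereas the assignment asked for C-style arrays with manual memory management.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical reader: the stream starts with `rows cols`, followed by
// rows * cols whitespace-separated values in row-major order.
// Returns false if the header or any value cannot be parsed.
bool read_matrix(std::istream& in, std::size_t& rows, std::size_t& cols,
                 std::vector<double>& data) {
    if (!(in >> rows >> cols)) {
        return false;
    }
    data.assign(rows * cols, 0.0);
    assert(data.size() == rows * cols);
    for (double& value : data) {
        if (!(in >> value)) {
            return false;
        }
    }
    return true;
}

// Convenience wrapper for parsing from an in-memory string.
bool parse_matrix_text(const std::string& text, std::size_t& rows,
                       std::size_t& cols, std::vector<double>& data) {
    std::istringstream in(text);
    return read_matrix(in, rows, cols, data);
}
```

With a file, the same function can be called on a `std::ifstream` opened on a path such as `data/<case>/input0.raw`.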
