All implementations validated against `output.raw` with tolerance `1e-2`. All 10 cases pass for all three implementations.
129
134
130
-
#### Matrix Storage and Memory Management
135
+
### Block Size Experiment (Case 7: 256 × 300 × 256, the largest test case)
131
136
132
-
- Row-major order for all matrices
133
-
- Use C-style arrays with manual memory management (`malloc` or `new`, `free` or `delete`).
134
-
- Do not use smart pointers.
137
+
To find the optimal block size, the `blocked_matmul` was tested with four block sizes against the naive baseline. Each timing is averaged over 5 runs.
135
138
136
-
---
139
+
| Block Size | Time (s) | Speedup |
140
+
|------------|----------|---------|
141
+
| **16** | **0.02312** | **2.33×** |
142
+
| 32 | 0.02349 | 2.29× |
143
+
| 64 | 0.03020 | 1.78× |
144
+
| 128 | 0.02783 | 1.94× |
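The 5-run averaging behind these timings could be done with a small helper along the following lines. This is a minimal sketch: `average_seconds` is a hypothetical name, and the use of `std::chrono::steady_clock` is an assumption, since the diff does not show how the timings were actually taken.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not from the assignment code): runs `work` `runs`
// times and returns the mean wall-clock time per run, in seconds.
double average_seconds(const std::function<void()>& work, int runs = 5) {
    using clock = std::chrono::steady_clock;
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        total += std::chrono::duration<double>(stop - start).count();
    }
    return total / runs;
}
```

A call such as `average_seconds([&] { blocked_matmul(/* arguments */); })` would then give one table entry, and the speedup column is the averaged naive time divided by that value.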
137
145
138
-
#### Input/Output and Validation
146
+
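**Finding**: Block size **16** gives the best performance for these matrix dimensions, with block size 32 a close second. The commonly recommended block size of 64 was *not* optimal here. (64 is the typical cache-line size in bytes; a 64-byte line holds only 8 doubles, so a block size of 64 elements is not "one cache line of doubles".) Smaller blocks keep the working set comfortably inside L1 cache, while at block size 64 and above the working set begins to spill out of L1: assuming a typical 32 KiB L1 data cache, three 64 × 64 tiles of doubles already occupy 96 KiB. Block size 128 was nevertheless faster than 64, so L1 capacity alone does not fully explain the ordering.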
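The working-set argument can be checked with simple arithmetic. The sketch below assumes square tiles of 8-byte doubles and one tile each of A, B, and C resident at a time; `tile_working_set_bytes` is a hypothetical helper, and the L1 size to compare against (commonly 32 KiB) is an assumption about the hardware rather than a measured value.

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed to hold one block x block tile of each
// of A, B, and C, assuming every element is a double.
constexpr std::size_t tile_working_set_bytes(std::size_t block) {
    return 3 * block * block * sizeof(double);
}
```

With 8-byte doubles this gives 6 KiB for block size 16, 24 KiB for 32, 96 KiB for 64, and 384 KiB for 128, so only the two smallest block sizes would fit within a 32 KiB L1 data cache.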
139
147
140
-
- Use the same input/output format as Assignment 1:
141
-
- Input files: `data/<case>/input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)).
142
-
- Output file: `data/<case>/result.raw` (matrix \( C \)).
143
-
- Reference file: `data/<case>/output.raw` for validation.
144
-
- The executable accepts a case number (0–9) as a command-line argument.
145
-
- Validate correctness by comparing `result.raw` with `output.raw` for each implementation.
148
+
For the main results, block size 64 was kept as the default because it is the conventionally recommended starting point, but block size 16 or 32 gives noticeably better speedups on this hardware (2.33× and 2.29×, versus 1.78× at block size 64).
146
149
147
-
---
150
+
### Thread Count Experiment (Case 7)
148
151
149
-
### Build Instructions
150
-
151
-
- Use the provided `CMakeLists.txt` to build the project.
152
-
- **Additional Requirements**:
153
-
- Ensure OpenMP is enabled in your compiler (e.g., `-fopenmp` for GCC).
154
-
- The provided CMake file includes OpenMP support.
155
-
- **Windows Users**:
156
-
- Use CLion or Visual Studio with CMake.
157
-
- Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make`.
158
-
- **Linux/Mac Users**:
159
-
- Make sure the GCC compiler is installed (`brew install gcc` on Mac).
- Run `cmake .` to generate a Makefile, then `make`.
165
-
- **Testing OpenMP**:
166
-
- Set the number of threads using the environment variable `OMP_NUM_THREADS` (e.g., `export OMP_NUM_THREADS=4` on
167
-
Linux/Mac, or `set OMP_NUM_THREADS=4` on Windows).
168
-
- Test with different thread counts to find the best performance.
152
+
To find the optimal thread count, `parallel_matmul` was tested with 1, 2, 4, and 8 threads. Each timing is averaged over 5 runs.
169
153
170
-
---
154
+
| Threads | Time (s) | Speedup |
155
+
|---------|----------|---------|
156
+
| 1 | 0.03594 | 1.08× |
157
+
| **2** | **0.02555** | **1.52×** |
158
+
| 4 | 0.02713 | 1.43× |
159
+
| 8 | 0.03206 | 1.21× |
171
160
172
-
### Submission Requirements
161
+
**Finding**: **2 threads is optimal** on this hardware. The GitHub Codespaces environment used here provides 2 cores; once the thread count exceeds the available cores, threads compete for the same cores and OpenMP scheduling overhead eats into the parallelism benefit. 8 threads (1.21×) is slower than 2 or 4 threads, although it still edges out the 1-thread run (1.08×).
173
162
174
-
#### Fork and Clone the Repository
163
+
On a machine with 4 or more physical cores, the optimal thread count would shift accordingly.
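A thread-count sweep like the one above can be driven from a single function that takes the thread count as a parameter. The following is a minimal sketch under assumptions: the name `parallel_matmul_sketch` and its signature are hypothetical, and the actual `parallel_matmul` in `main.cpp` may be structured differently (for example, it may rely on `OMP_NUM_THREADS` instead of a `num_threads` clause).

```cpp
// Hypothetical sketch, not the assignment's parallel_matmul.
// A is m x k, B is k x n, C is m x n; all are row-major C-style arrays.
void parallel_matmul_sketch(const double* A, const double* B, double* C,
                            int m, int k, int n, int threads) {
    // Iterations of the outer loop are split across threads. Each thread
    // writes whole rows of C, so no two threads write the same element.
    #pragma omp parallel for num_threads(threads)
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```

Without `-fopenmp` the pragma is ignored and the loop simply runs serially, which makes the same function usable as a single-threaded baseline.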
175
164
176
-
- Fork the Assignment 4 repository (provided separately).
**Correctness**: Every implementation matches the reference output within the `1e-2` tolerance for all 10 test cases.
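A tolerance check of this kind could be written as below. This is a sketch: `matches_reference` is a hypothetical name, and treating `1e-2` as an absolute per-element tolerance is an assumption, since the diff does not say whether the comparison is absolute or relative.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: returns true when every element of `result` is within
// `tol` (absolute difference) of the matching element of `reference`.
bool matches_reference(const double* result, const double* reference,
                       std::size_t count, double tol = 1e-2) {
    for (std::size_t i = 0; i < count; ++i) {
        const double diff = std::fabs(result[i] - reference[i]);
        // Written as !(diff <= tol) so that a NaN difference counts as a failure.
        if (!(diff <= tol)) {
            return false;
        }
    }
    return true;
}
```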
184
168
185
-
```bash
186
-
git checkout -b student-name
187
-
```
169
+
**Cache Optimization (Blocked)**:
170
+
- Blocking gives consistent **modest speedup (1.0× to 1.41×)** across cases with the default block size of 64.
171
+
- The block size sweep showed up to **2.33×** speedup at block size 16, demonstrating the importance of tuning the block size to the specific cache hierarchy and problem dimensions.
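For reference, loop blocking with a tunable block size can be sketched as follows. The function name `blocked_matmul_sketch`, its signature, and the loop order are assumptions; the actual `blocked_matmul` in `main.cpp` is not shown in this diff and may differ.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical sketch, not the assignment's blocked_matmul.
// A is m x k, B is k x n, C is m x n; all are row-major C-style arrays.
void blocked_matmul_sketch(const double* A, const double* B, double* C,
                           int m, int k, int n, int block) {
    std::fill(C, C + static_cast<std::size_t>(m) * n, 0.0);
    for (int ii = 0; ii < m; ii += block) {
        for (int pp = 0; pp < k; pp += block) {
            for (int jj = 0; jj < n; jj += block) {
                // std::min clamps the tile at the matrix edge, so dimensions
                // that are not multiples of `block` (such as 300) still work.
                const int i_end = std::min(ii + block, m);
                const int p_end = std::min(pp + block, k);
                const int j_end = std::min(jj + block, n);
                for (int i = ii; i < i_end; ++i) {
                    for (int p = pp; p < p_end; ++p) {
                        const double a = A[i * k + p];
                        // Innermost loop walks rows of B and C contiguously.
                        for (int j = jj; j < j_end; ++j) {
                            C[i * n + j] += a * B[p * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

Passing `block` as a parameter rather than a compile-time constant keeps a sweep over 16, 32, 64, and 128 to a single loop in the caller.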
188
172
189
-
#### Implement Your Solution
173
+
**Parallel (OpenMP)**:
174
+
- Parallelization helps **when the matrix is large enough** to amortize OpenMP thread setup overhead.
175
+
- For tiny matrices (cases 0, 2, 4), parallel is *slower* than naive (0.52× to 0.92×) because thread creation cost exceeds the actual compute work.
- The thread sweep revealed that the Codespaces 2-core environment caps the achievable parallel speedup at ~1.5× regardless of how many threads we request. On hardware with more cores, larger speedups would be visible.
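One possible way to avoid the small-matrix penalty, which the measured implementation is not stated to use, is OpenMP's `if` clause: the parallel region is only activated when the problem is large enough. In the sketch below the function name and the `min_work` threshold are hypothetical, and the default threshold is a placeholder that would need tuning, not a measured value.

```cpp
// Hypothetical sketch: falls back to serial execution for small problems.
// A is m x k, B is k x n, C is m x n; all are row-major C-style arrays.
void parallel_matmul_guarded(const double* A, const double* B, double* C,
                             int m, int k, int n,
                             long long min_work = 1000000) {
    // Number of multiply-add operations; used only to decide whether the
    // cost of starting threads is likely to pay off.
    const long long work = static_cast<long long>(m) * k * n;
    #pragma omp parallel for if(work >= min_work)
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p) {
                sum += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}
```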
190
178
191
-
- Modify the provided `main.cpp` to implement `blocked_matmul` and `parallel_matmul`.
192
-
- Update `README.md` with your performance results table.
179
+
**Block Size Choice**: For these specific matrix sizes (up to approx. 256 × 300), L1 cache pressure dominates and smaller blocks (16, 32) work best. The default block size of 64 is suboptimal here but might do better on much larger problems, where the trade-off shifts toward reducing loop overhead.
193
180
194
-
#### Commit and Push
181
+
**Optimal Configuration on Codespaces (2-core)**:
182
+
- Block size: **16 or 32**
183
+
- Thread count: **2**
184
+
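- Expected combined speedup over naive: approximately 3× to 4× if the blocking and parallelization gains compound (the product of the two best measured speedups is 2.33 × 1.52 ≈ 3.5); this is an estimate, not a measurement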
- Create a pull request from your branch to the base repository’s `main` branch.
205
-
- Include a description of your optimizations and any challenges faced.
188
+
1. **Small Test Cases**: The provided test cases are too small to fully showcase OpenMP parallelism. The largest case (256 × 300 × 256) executes in ~33 ms, where OpenMP setup costs are significant relative to compute. Matrices of 1024 × 1024 or larger would yield speedups closer to the theoretical limits of the hardware.
206
189
207
-
---
190
+
2. **Codespaces Environment**: The 2-core CPU limit in GitHub Codespaces caps achievable parallel speedup. On a typical 8-core workstation, parallel speedups of 4× to 6× might be expected for the larger test cases, though this was not measured.
208
191
209
-
### Grading (100 Points Total)
192
+
3. **Measurement Stability**: Single-run timings showed significant variance (some "speedups" appeared to be slowdowns simply due to noise). Switching to 5-run averaging stabilized the results and made the patterns clear. This is itself a useful methodological finding.
| Correct implementation of `blocked_matmul`| 30 |
214
-
| Correct implementation of `parallel_matmul`| 30 |
215
-
| Accurate performance measurements | 20 |
216
-
| Performance results table in README.md | 10 |
217
-
| Code clarity, commenting, and organization | 10 |
218
-
|**Total**| 100 |
194
+
4. **Default Block Size Was Suboptimal**: The conventional block size of 64 was not the best for these test cases; block size 16 was about 1.3× faster (0.02312 s versus 0.03020 s). This reinforces that a conventional default is a starting heuristic, not a final answer; empirical tuning matters.
219
195
220
-
---
196
+
5. **Text-format I/O**: The `.raw` files are space-separated text, not binary doubles. Reading is done using `ifstream >> double` with the first two integers as `rows cols` dimensions.
221
197
222
-
### Tips for Success
223
-
224
-
- **Cache Optimization**:
225
-
- Experiment with different block sizes. Start with powers of 2 (e.g., 16, 32, 64).
226
-
- Use a block size that balances cache usage without excessive overhead.
227
-
- **OpenMP**:
228
-
- Test with different thread counts to find the optimal number for your system.
229
-
- Be cautious of false sharing (when threads access nearby memory locations, causing cache coherence issues).
230
-
- **Performance Measurement**:
231
-
- Run multiple iterations for each testcase and report the average time to reduce variability.
232
-
- Ensure no other heavy processes are running during measurements.
233
-
- **Debugging**:
234
-
- Validate each implementation against `output.raw` to ensure correctness before optimizing.
235
-
- Use small test cases to debug your blocked and parallel implementations.
236
-
237
-
Good luck, and enjoy optimizing your matrix multiplication!
198
+
6. **Local Toolchain**: Could not install g++ locally on Windows in time; switched to GitHub Codespaces, which provided a complete Linux dev environment with all required tooling.
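The text-format reading described in item 5 could look roughly like the sketch below. `read_matrix` is a hypothetical name and signature; it uses a C-style array allocated with `new[]`, in line with the assignment's manual memory management rule, and takes any `std::istream` so that it works with an `std::ifstream` opened on a `.raw` file.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>

// Hypothetical helper: reads "rows cols" followed by rows * cols values.
// Returns a new[]-allocated row-major array that the caller must delete[],
// or nullptr if the header or any value cannot be parsed.
double* read_matrix(std::istream& in, int& rows, int& cols) {
    if (!(in >> rows >> cols) || rows <= 0 || cols <= 0) {
        return nullptr;
    }
    const std::size_t count = static_cast<std::size_t>(rows) * cols;
    double* data = new double[count];
    for (std::size_t i = 0; i < count; ++i) {
        if (!(in >> data[i])) {
            delete[] data;  // avoid leaking on a truncated or malformed file
            return nullptr;
        }
    }
    return data;
}
```

Because `operator>>` skips any whitespace, the same code handles values separated by spaces or by newlines.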