81 changes: 48 additions & 33 deletions README.md
@@ -109,22 +109,37 @@ threads (e.g., 2, 4, 8) by setting the environment variable `OMP_NUM_THREADS`.
For each test case (0 through 9 in the `data` folder):

- Measure the **wall clock time** for:
  - Naive matrix multiplication (`naive_matmul`).
  - Cache-optimized matrix multiplication (`blocked_matmul`).
  - Parallel matrix multiplication (`parallel_matmul`).
- Use `omp_get_wtime()` for timing, as it provides high-resolution wall clock time.
- Report the times in a table in your submission README.md, including:
  - Test case number.
  - Matrix dimensions (m × n × p).
  - Wall clock time for each implementation (in seconds).
  - Speedup of the blocked and parallel implementations over the naive implementation (naive time divided by the implementation's time).
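
The timing step above can be sketched as follows. This is a minimal illustration, not the assignment's required harness: `wall_seconds`, `mean_wall_time`, and `count_call` are hypothetical names, and the non-OpenMP fallback exists only so the snippet still compiles without `-fopenmp`.

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds: omp_get_wtime() when built with OpenMP,
   otherwise a C11 timespec_get() fallback (assumption, for portability). */
double wall_seconds(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
#endif
}

/* Hypothetical helper: call work(ctx) `repeats` times and
   return the mean wall time of one call, in seconds. */
double mean_wall_time(void (*work)(void *), void *ctx, int repeats) {
    if (repeats < 1) {
        repeats = 1;
    }
    double start = wall_seconds();
    for (int r = 0; r < repeats; ++r) {
        work(ctx);
    }
    return (wall_seconds() - start) / repeats;
}

/* Example workload: counts how many times it was invoked. */
void count_call(void *ctx) {
    ++*(int *)ctx;
}
```

In a real measurement, `work` would wrap one call to `naive_matmul`, `blocked_matmul`, or `parallel_matmul`, with `ctx` pointing at a struct that holds the matrices and dimensions.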

Example table format:

| Test Case | Dimensions (m × n × p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup |
| --------- | ---------------------- | -------------- | ---------------- | ----------------- | --------------- | ---------------- |
| 0 | 512 × 512 × 512 | 2.345 | 0.987 | 0.543 | 2.38× | 4.32× |
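
As a rough sketch of how the speedup columns could be computed, the helpers below divide the naive time by each optimized time and format one table row. The names `speedup` and `format_row` are hypothetical, and the row uses an ASCII `x` rather than `×`.

```c
#include <stdio.h>

/* Speedup of an optimized run relative to the naive baseline. */
double speedup(double naive_seconds, double optimized_seconds) {
    return naive_seconds / optimized_seconds;
}

/* Write one Markdown table row into buf; returns snprintf's result. */
int format_row(char *buf, size_t size, int test_case, int m, int n, int p,
               double naive, double blocked, double parallel) {
    return snprintf(buf, size,
                    "| %d | %d x %d x %d | %.3f | %.3f | %.3f | %.2fx | %.2fx |",
                    test_case, m, n, p, naive, blocked, parallel,
                    speedup(naive, blocked), speedup(naive, parallel));
}
```

Fed the times from the example row (2.345, 0.987, 0.543), this produces speedups of `2.38x` and `4.32x`, matching the example table.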

## Results - Group H

| Test Case | Dimensions (m x n x p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup |
| --------- | ---------------------- | -------------- | ---------------- | ----------------- | --------------- | ---------------- |
| 0 | 64 x 64 x 64 | 0.000397436 | 0.000262500 | 0.000198718 | 1.51x | 2.00x |
| 1 | 128 x 64 x 128 | 0.001657894 | 0.000984128 | 0.000712641 | 1.68x | 2.33x |
| 2 | 100 x 128 x 56 | 0.001347827 | 0.000594339 | 0.000440559 | 2.27x | 3.06x |
| 3 | 128 x 64 x 128 | 0.001729733 | 0.001550001 | 0.000840000 | 1.12x | 2.06x |
| 4 | 32 x 128 x 32 | 0.000229629 | 0.000165354 | 0.000174157 | 1.39x | 1.32x |
| 5 | 200 x 100 x 256 | 0.007874995 | 0.004499997 | 0.002863635 | 1.75x | 2.75x |
| 6 | 256 x 256 x 256 | 0.052999973 | 0.013999999 | 0.011199999 | 3.79x | 4.73x |
| 7 | 256 x 300 x 256 | 0.034000039 | 0.019500017 | 0.013499975 | 1.74x | 2.52x |
| 8 | 64 x 128 x 64 | 0.000819675 | 0.000439656 | 0.000398438 | 1.86x | 2.06x |
| 9 | 256 x 256 x 257 | 0.026499987 | 0.019500017 | 0.007750005 | 1.36x | 3.42x |

---

#### Matrix Storage and Memory Management
@@ -138,9 +153,9 @@ Example table format:
#### Input/Output and Validation

- Use the same input/output format as Assignment 1:
  - Input files: `data/<case>/input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)).
  - Output file: `data/<case>/result.raw` (matrix \( C \)).
  - Reference file: `data/<case>/output.raw` for validation.
- The executable accepts a case number (0–9) as a command-line argument.
- Validate correctness by comparing `result.raw` with `output.raw` for each implementation.
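
Because blocked and parallel versions add terms in a different order than the naive loop, their floating-point results can differ slightly from the reference, so an exact byte comparison may fail on correct output. The sketch below compares two already-loaded matrices with a tolerance. It assumes `float` elements and leaves file parsing out, since the `.raw` layout is defined in Assignment 1; `matrices_match` and the tolerance are illustrative choices.

```c
#include <stddef.h>

/* Absolute value without pulling in libm. */
static float abs_f(float x) {
    return x < 0.0f ? -x : x;
}

/* Returns 1 when every element of `result` is within `tol` of `expected`
   (absolute for small values, relative for large ones), else 0. */
int matrices_match(const float *result, const float *expected,
                   size_t count, float tol) {
    for (size_t i = 0; i < count; ++i) {
        float diff = abs_f(result[i] - expected[i]);
        float mag = abs_f(expected[i]);
        float scale = mag > 1.0f ? mag : 1.0f;
        if (!(diff <= tol * scale)) { /* written this way so NaN also fails */
            return 0;
        }
    }
    return 1;
}
```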

@@ -150,22 +165,22 @@ Example table format:

- Use the provided `CMakeLists.txt` to build the project.
- **Additional Requirements**:
  - Ensure OpenMP is enabled in your compiler (e.g., `-fopenmp` for GCC).
  - The provided CMake file includes OpenMP support.
- **Windows Users**:
  - Use CLion or Visual Studio with CMake.
  - Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make`.
- **Linux/Mac Users**:
  - Make sure the GCC compiler is installed (`brew install gcc` on Mac).
  - Configure CMake to use the correct compiler:
    ```bash
    cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ .
    ```
  - Run `cmake .` to generate a Makefile, then `make`.
- **Testing OpenMP**:
  - Set the number of threads using the environment variable `OMP_NUM_THREADS` (e.g., `export OMP_NUM_THREADS=4` on Linux/Mac, or `set OMP_NUM_THREADS=4` in the Windows Command Prompt).
  - Test with different thread counts to find the best performance.
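
One way to confirm that OpenMP is actually enabled and that `OMP_NUM_THREADS` is being picked up is a small check like the one below. This is a sketch: `default_thread_count` and `report_openmp` are hypothetical names, while `_OPENMP` and `omp_get_max_threads()` come from the OpenMP standard.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Threads a parallel region would use by default (1 without OpenMP). */
int default_thread_count(void) {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Print whether OpenMP is active and how many threads are in effect. */
void report_openmp(void) {
#ifdef _OPENMP
    printf("OpenMP enabled (_OPENMP=%d), max threads: %d\n",
           _OPENMP, default_thread_count());
#else
    printf("OpenMP not enabled; running single-threaded\n");
#endif
}
```

If the report says OpenMP is not enabled, the `-fopenmp` flag (or the CMake OpenMP configuration) is probably not reaching the compiler.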

---

@@ -209,7 +224,7 @@ git push origin student-name
### Grading (100 Points Total)

| Subtask | Points |
| ------------------------------------------- | ------ |
| Correct implementation of `blocked_matmul` | 30 |
| Correct implementation of `parallel_matmul` | 30 |
| Accurate performance measurements | 20 |
@@ -222,16 +237,16 @@ git push origin student-name
### Tips for Success

- **Cache Optimization**:
  - Experiment with different block sizes. Start with powers of 2 (e.g., 16, 32, 64).
  - Choose a block size large enough to reuse data while it is in cache, but small enough that the tiles being worked on fit in cache together and the extra loop overhead stays low.
- **OpenMP**:
  - Test with different thread counts to find the optimal number for your system.
  - Be cautious of false sharing: when threads write to different memory locations that sit on the same cache line, the line bounces between cores and slows every thread down.
- **Performance Measurement**:
  - Run multiple iterations for each test case and report the average time to reduce variability.
  - Ensure no other heavy processes are running during measurements.
- **Debugging**:
  - Validate each implementation against `output.raw` to ensure correctness before optimizing.
  - Use small test cases to debug your blocked and parallel implementations.
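
To make the tiling and OpenMP tips concrete, here is one possible shape for a blocked, parallel multiply. It is a sketch under assumptions, not the required solution: `blocked_parallel_sketch` and its signature are hypothetical, matrices are assumed to be row-major `float` arrays with A of size m × n and B of size n × p, and `BLOCK` is only a starting value to tune.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK 32 /* tile edge; an assumed starting point, tune per machine */

static int min_int(int a, int b) {
    return a < b ? a : b;
}

/* C (m x p) = A (m x n) * B (n x p), all row-major floats. */
void blocked_parallel_sketch(const float *A, const float *B, float *C,
                             int m, int n, int p) {
    memset(C, 0, (size_t)m * (size_t)p * sizeof(float));

    /* Each thread owns whole row-blocks of C, so no two threads
       ever write the same element. Without -fopenmp the pragma is
       ignored and the loop runs serially. */
    #pragma omp parallel for schedule(static)
    for (int ii = 0; ii < m; ii += BLOCK) {
        for (int kk = 0; kk < n; kk += BLOCK) {
            for (int jj = 0; jj < p; jj += BLOCK) {
                int i_end = min_int(ii + BLOCK, m);
                int k_end = min_int(kk + BLOCK, n);
                int j_end = min_int(jj + BLOCK, p);
                /* Multiply one tile of A by one tile of B. */
                for (int i = ii; i < i_end; ++i) {
                    for (int k = kk; k < k_end; ++k) {
                        float a = A[(size_t)i * n + k];
                        for (int j = jj; j < j_end; ++j) {
                            C[(size_t)i * p + j] += a * B[(size_t)k * p + j];
                        }
                    }
                }
            }
        }
    }
}
```

Splitting the work by rows of C keeps most writes from different threads on different cache lines, which limits false sharing. The innermost loop walks B and C along rows, so memory access stays contiguous.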

Good luck, and enjoy optimizing your matrix multiplication!