AA-parallel-computing · z-haq · May 31, 2026
diff --git a/README.md b/README.md
@@ -109,22 +109,37 @@ threads (e.g., 2, 4, 8) by setting the environment variable `OMP_NUM_THREADS`.
 For each test case (0 through 9 in the `data` folder):
 
 - Measure the **wall clock time** for:
-    - Naive matrix multiplication (`naive_matmul`).
-    - Cache-optimized matrix multiplication (`blocked_matmul`).
-    - Parallel matrix multiplication (`parallel_matmul`).
+  - Naive matrix multiplication (`naive_matmul`).
+  - Cache-optimized matrix multiplication (`blocked_matmul`).
+  - Parallel matrix multiplication (`parallel_matmul`).
 - Use `omp_get_wtime()` for timing, as it provides high-resolution wall clock time.
 - Report the times in a table in your submission README.md, including:
-    - Test case number.
-    - Matrix dimensions (m × n × p).
-    - Wall clock time for each implementation (in seconds).
-    - Speedup of blocked and parallel implementations over the naive implementation.
+  - Test case number.
+  - Matrix dimensions (m × n × p).
+  - Wall clock time for each implementation (in seconds).
+  - Speedup of blocked and parallel implementations over the naive implementation.
 
 Example table format:
 
 | Test Case | Dimensions (m × n × p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup |
-|-----------|------------------------|----------------|------------------|-------------------|-----------------|------------------|
+| --------- | ---------------------- | -------------- | ---------------- | ----------------- | --------------- | ---------------- |
 | 0         | 512 × 512 × 512        | 2.345          | 0.987            | 0.543             | 2.38×           | 4.32×            |
 
+## Results - Group H
+
+| Test Case | Dimensions (m × n × p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup |
+| --------- | ---------------------- | -------------- | ---------------- | ----------------- | --------------- | ---------------- |
+| 0         | 64 × 64 × 64           | 0.000397436    | 0.000262500      | 0.000198718       | 1.51×           | 2.00×            |
+| 1         | 128 × 64 × 128         | 0.001657894    | 0.000984128      | 0.000712641       | 1.68×           | 2.33×            |
+| 2         | 100 × 128 × 56         | 0.001347827    | 0.000594339      | 0.000440559       | 2.27×           | 3.06×            |
+| 3         | 128 × 64 × 128         | 0.001729733    | 0.001550001      | 0.000840000       | 1.12×           | 2.06×            |
+| 4         | 32 × 128 × 32          | 0.000229629    | 0.000165354      | 0.000174157       | 1.39×           | 1.32×            |
+| 5         | 200 × 100 × 256        | 0.007874995    | 0.004499997      | 0.002863635       | 1.75×           | 2.75×            |
+| 6         | 256 × 256 × 256        | 0.052999973    | 0.013999999      | 0.011199999       | 3.79×           | 4.73×            |
+| 7         | 256 × 300 × 256        | 0.034000039    | 0.019500017      | 0.013499975       | 1.74×           | 2.52×            |
+| 8         | 64 × 128 × 64          | 0.000819675    | 0.000439656      | 0.000398438       | 1.86×           | 2.06×            |
+| 9         | 256 × 256 × 257        | 0.026499987    | 0.019500017      | 0.007750005       | 1.36×           | 3.42×            |
+
 ---
 
 #### Matrix Storage and Memory Management
@@ -138,9 +153,9 @@ Example table format:
 #### Input/Output and Validation
 
 - Use the same input/output format as Assignment 1:
-    - Input files: `data/<case>/input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)).
-    - Output file: `data/<case>/result.raw` (matrix \( C \)).
-    - Reference file: `data/<case>/output.raw` for validation.
+  - Input files: `data/<case>/input0.raw` (matrix \( A \)) and `input1.raw` (matrix \( B \)).
+  - Output file: `data/<case>/result.raw` (matrix \( C \)).
+  - Reference file: `data/<case>/output.raw` for validation.
 - The executable accepts a case number (0–9) as a command-line argument.
 - Validate correctness by comparing `result.raw` with `output.raw` for each implementation.
 
@@ -150,22 +165,22 @@ Example table format:
 
 - Use the provided `CMakeLists.txt` to build the project.
 - **Additional Requirements**:
-    - Ensure OpenMP is enabled in your compiler (e.g., `-fopenmp` for GCC).
-    - The provided CMake file includes OpenMP support.
+  - Ensure OpenMP is enabled in your compiler (e.g., `-fopenmp` for GCC).
+  - The provided CMake file includes OpenMP support.
 - **Windows Users**:
-    - Use CLion or Visual Studio with CMake.
-    - Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make`.
+  - Use CLion or Visual Studio with CMake.
+  - Alternatively, use MinGW with `cmake -G "MinGW Makefiles"` and `make`.
 - **Linux/Mac Users**:
-    - Make sure the GCC compiler is installed (`brew install gcc` on Mac).
-    - Configure CMake to use the correct compiler:
-      ```bash
-      cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ .
-      ```
-    - Run `cmake .` to generate a Makefile, then `make`.
+  - Make sure the GCC compiler is installed (`brew install gcc` on Mac).
+  - Configure CMake to use the correct compiler:
+    ```bash
+    cmake -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++ .
+    ```
+  - Run `cmake .` to generate a Makefile, then `make`.
 - **Testing OpenMP**:
-    - Set the number of threads using the environment variable `OMP_NUM_THREADS` (e.g., `export OMP_NUM_THREADS=4` on
-      Linux/Mac, or `set OMP_NUM_THREADS=4` on Windows).
-    - Test with different thread counts to find the best performance.
+  - Set the number of threads using the environment variable `OMP_NUM_THREADS` (e.g., `export OMP_NUM_THREADS=4` on
+    Linux/Mac, or `set OMP_NUM_THREADS=4` on Windows).
+  - Test with different thread counts to find the best performance.
 
 ---
 
@@ -209,7 +224,7 @@ git push origin student-name
 ### Grading (100 Points Total)
 
 | Subtask                                     | Points |
-|---------------------------------------------|--------|
+| ------------------------------------------- | ------ |
 | Correct implementation of `blocked_matmul`  | 30     |
 | Correct implementation of `parallel_matmul` | 30     |
 | Accurate performance measurements           | 20     |
@@ -222,16 +237,16 @@ git push origin student-name
 ### Tips for Success
 
 - **Cache Optimization**:
-    - Experiment with different block sizes. Start with powers of 2 (e.g., 16, 32, 64).
-    - Use a block size that balances cache usage without excessive overhead.
+  - Experiment with different block sizes. Start with powers of 2 (e.g., 16, 32, 64).
+  - Use a block size that balances cache usage without excessive overhead.
 - **OpenMP**:
-    - Test with different thread counts to find the optimal number for your system.
-    - Be cautious of false sharing (when threads access nearby memory locations, causing cache coherence issues).
+  - Test with different thread counts to find the optimal number for your system.
+  - Be cautious of false sharing (when threads access nearby memory locations, causing cache coherence issues).
 - **Performance Measurement**:
-    - Run multiple iterations for each test case and report the average time to reduce variability.
-    - Ensure no other heavy processes are running during measurements.
+  - Run multiple iterations for each test case and report the average time to reduce variability.
+  - Ensure no other heavy processes are running during measurements.
 - **Debugging**:
-    - Validate each implementation against `output.raw` to ensure correctness before optimizing.
-    - Use small test cases to debug your blocked and parallel implementations.
+  - Validate each implementation against `output.raw` to ensure correctness before optimizing.
+  - Use small test cases to debug your blocked and parallel implementations.
 
 Good luck, and enjoy optimizing your matrix multiplication!
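
Review note: the block-size, validation, and speedup points in the README can be illustrated with a minimal C sketch. It is not the assignment's reference solution. The function names (`naive_matmul_sketch`, `blocked_matmul_sketch`, `max_abs_diff`, `speedup`), the `float` element type, the row-major layout, and the reading of m × n × p as A being m × n and B being n × p are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference kernel: C (m x p) = A (m x n) * B (n x p).
 * Row-major storage is an assumption, not taken from the README. */
void naive_matmul_sketch(const float *A, const float *B, float *C,
                         size_t m, size_t n, size_t p)
{
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < p; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * p + j];
            C[i * p + j] = sum;
        }
    }
}

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Hypothetical tiled kernel: same result as above, but the loops walk
 * bs x bs tiles so that a tile of A, B, and C can stay in cache.
 * min_size() clamps the last tile when a dimension is not a multiple of bs. */
void blocked_matmul_sketch(const float *A, const float *B, float *C,
                           size_t m, size_t n, size_t p, size_t bs)
{
    assert(bs > 0);

    /* C is accumulated into, so it must start at zero. */
    for (size_t i = 0; i < m * p; ++i)
        C[i] = 0.0f;

    for (size_t ii = 0; ii < m; ii += bs) {
        size_t i_end = min_size(ii + bs, m);
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = min_size(kk + bs, n);
            for (size_t jj = 0; jj < p; jj += bs) {
                size_t j_end = min_size(jj + bs, p);
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        float a = A[i * n + k];
                        /* Innermost loop walks a row of B and C contiguously. */
                        for (size_t j = jj; j < j_end; ++j)
                            C[i * p + j] += a * B[k * p + j];
                    }
                }
            }
        }
    }
}

/* Largest element-wise absolute difference between two buffers;
 * compare it against a small tolerance when validating results. */
float max_abs_diff(const float *X, const float *Y, size_t count)
{
    float worst = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        float d = X[i] - Y[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* Speedup as reported in the results table: baseline time / optimized time. */
double speedup(double baseline_s, double optimized_s)
{
    assert(optimized_s > 0.0);
    return baseline_s / optimized_s;
}
```

In a real run, each kernel call would be bracketed by `omp_get_wtime()` and averaged over several iterations, as the README asks. With the test case 6 timings from the table, `speedup(0.052999973, 0.013999999)` comes out at roughly 3.79, which matches the reported blocked speedup. Because the blocked kernel sums in a different order than the naive one, floating-point results can differ slightly, so a small tolerance on `max_abs_diff` is safer than an exact comparison.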