Skip to content

Latest commit

 

History

History
163 lines (118 loc) · 10.2 KB

File metadata and controls

163 lines (118 loc) · 10.2 KB

Parallel Programming

Åbo Akademi University, Information Technology Department

Instructor: Alireza Olama

Homework Assignment 4: Optimizing Matrix Multiplication in C++

Task Distribution

Student Task
Ha Do (Student ID: 2402703) Naive and Blocked Matrix Multiplication and implementing utility functions
Abhishek Roy (Student ID: 2502895) OpenMP Matrix Multiplication and Optimizations for other matrix multiplication functions

Assignment Overview

Optimized the performance of a naive matrix multiplication implementation using two techniques:

  1. Cache Optimization via Blocked Matrix Multiplication: Improved data locality to reduce cache misses.
  2. Parallel Matrix Multiplication using OpenMP: Parallelized the computation across multiple threads.

The task was to implement both optimizations, measure their performance, and compare the wall clock time of the naive, cache-optimized, and parallel implementations for each test case.


Technical Requirements

1. Cache Optimization (Blocked Matrix Multiplication)

Why Cache Optimization?

Modern CPUs rely on cache memory to reduce the latency of accessing data from main memory. Cache memory is faster but smaller, organized in cache lines (typically 64 bytes). When a CPU accesses a memory location, it fetches an entire cache line. Matrix multiplication can suffer from poor performance if memory accesses are not cache-friendly, leading to frequent cache misses.

The naive matrix multiplication (with triple nested loops) accesses memory in a way that may not exploit spatial and temporal locality:

  • Spatial Locality: Accessing consecutive memory locations (e.g., elements in the same cache line).
  • Temporal Locality: Reusing the same data multiple times while it's still in the cache.

Blocked matrix multiplication divides the matrices into smaller submatrices (blocks) that fit into the cache. By performing computations on these blocks, you ensure that data is reused while it resides in the cache, reducing cache misses and improving performance.

Blocked Matrix Multiplication Pseudocode

Assume matrices ( A ) (m × n), ( B ) (n × p), and ( C ) (m × p) are stored in row-major order. The blocked matrix multiplication processes submatrices of size ( block_size × block_size ):

// C = A * B
for (ii = 0; ii < m; ii += block_size)
    for (jj = 0; jj < p; jj += block_size)
        for (kk = 0; kk < n; kk += block_size)
            // Process block: C[ii:ii+block_size, jj:jj+block_size] += A[ii:ii+block_size, kk:kk+block_size] * B[kk:kk+block_size, jj:jj+block_size]
            for (i = ii; i < min(ii + block_size, m); i++)
                for (j = jj; j < min(jj + block_size, p); j++)
                    for (k = kk; k < min(kk + block_size, n); k++)
                        C[i * p + j] += A[i * n + k] * B[k * p + j]
  • block_size: Chosen to ensure the block fits in the cache (e.g., 32, 64, or 128, depending on the system).
  • Outer loops (ii, jj, kk): Iterate over blocks.
  • Inner loops (i, j, k): Compute within a block, reusing data in the cache.

2. Parallel Matrix Multiplication with OpenMP

Why OpenMP?

OpenMP is a portable API for parallel programming in shared-memory systems. It allows you to parallelize loops with minimal code changes, distributing iterations across multiple threads. In matrix multiplication, the outer loop(s) can be parallelized, as each element of the output matrix ( C ) can be computed independently.

Parallelizing with OpenMP

OpenMP was used to parallelize the outer loop(s) of the naive matrix multiplication. For example, the loop over rows of ( C ) was parallelized:

#pragma omp parallel for
for (i = 0; i < m; i++)
    for (j = 0; j < p; j++)
        for (k = 0; k < n; k++)
            C[i * p + j] += A[i * n + k] * B[k * p + j];
  • The #pragma omp parallel for directive tells OpenMP to distribute iterations of the loop across available threads.
  • Ensure thread safety: Since each iteration writes to a distinct element of ( C ), this loop is safe to parallelize without locks.
  • Use omp_get_wtime() to measure wall clock time for accurate performance comparisons.

3. Performance Measurement

For each test case (0 through 9 in the data folder):

  • Measured the wall clock time for:
    • Naive matrix multiplication (naive_matmul).
    • Cache-optimized matrix multiplication (blocked_matmul) with block size 32.
    • Parallel matrix multiplication (parallel_matmul) with OMP_NUM_THREADS = 8.
  • Used omp_get_wtime() for high-resolution wall clock timings.

4. Results

CPU Specs:

  • CPU: AMD Ryzen 7 8845HS
  • Architecture: x86-64
  • Cores: 8
  • Threads: 16

The results in the table below come from the basic implementation of the aforementioned matrix multiplications.

Test Case Dimensions (m x n x p) Naive Time (s) Blocked Time (s) Parallel Time (s) Blocked Speedup Parallel Speedup
0 64x64x64 0.00101837 0.000795773 0.00089702 1.27973x 1.13528x
1 128x64x128 0.00414562 0.00348123 0.00195844 1.19085x 2.11679x
2 100x128x56 0.0022999 0.00236687 0.00106175 0.971706x 2.16614x
3 128x64x128 0.00355272 0.00337588 0.00159078 1.05238x 2.23332x
4 32x128x32 0.00070751 0.000524026 0.000757544 1.35014x 0.933952x
5 200x100x256 0.0176089 0.016518 0.00467121 1.06605x 3.76967x
6 256x256x256 0.0536581 0.0567246 0.0120156 0.945941x 4.4657x
7 256x300x256 0.0632706 0.0663129 0.0131018 0.954122x 4.82916x
8 64x128x64 0.00175151 0.00203758 0.00326974 0.859606x 0.535674x
9 256x256x257 0.0567747 0.0572363 0.0125215 0.991936x 4.53418x

As can be seen, the blocked matrix multiplication speedup was slightly above 1x or less in most cases. However, the parallel implementation achieved speedups of around 2x to 4.5x in most cases.

The table below shows results after the following optimizations:

  • Switched blocked and parallel loops to be i -> k -> j
    • The initial i -> j -> k order accesses B[k * p + j] with k as the innermost variable, stepping through B column-wise with stride p. This causes a cache miss on every iteration. Swapping to i -> k -> j makes j the innermost variable, so B[k * p + j] is accessed sequentially (stride 1), keeping all three matrices in cache-friendly access patterns.
  • Added compiler flags:
    • -Ofast: Enables all -O3 optimizations with some additional flags. One of which is -ffast-math. This allows the compiler to reorder floating point operations, use fused multiply-add (FMA) instructions, and vectorize reduction loops more aggressively. This is the flag most responsible for the blocked speedup improvement.
    • -march=native: Generates code using the full SIMD instruction set of the host CPU (e.g. AVX2, AVX-512). Without this, the compiler falls back to a generic baseline (SSE2), missing wide vector registers that process 8 floats at a time.
    • -funroll-loops: Unrolls loop bodies to reduce loop control overhead and expose more instruction-level parallelism for the CPU's out-of-order execution units.
Test Case Dimensions (m x n x p) Naive Time (s) Blocked Time (s) Parallel Time (s) Blocked Speedup Parallel Speedup
0 64x64x64 0.000206294 0.000124644 0.000974188 1.65507x 0.21176x
1 128x64x128 0.000902154 0.000501695 0.000846172 1.79821x 1.06616x
2 100x128x56 0.000479879 0.000572318 0.000855709 0.838483x 0.560797x
3 128x64x128 0.00101505 0.00050891 0.000930236 1.99456x 1.09117x
4 32x128x32 8.6234e-05 5.2299e-05 0.000552463 1.64887x 0.15609x
5 200x100x256 0.00479585 0.00224633 0.00191679 2.13497x 2.50202x
6 256x256x256 0.0186269 0.00742483 0.00312077 2.50874x 5.9687x
7 256x300x256 0.0223733 0.0108002 0.00306352 2.07156x 7.30315x
8 64x128x64 0.000383877 0.000223077 0.000714149 1.72083x 0.537531x
9 256x256x257 0.0118203 0.00739098 0.00332793 1.59929x 3.55186x

With this, the blocked implementation speedup improved to around 1.5x to 2.5x. However, the parallel speedup dropped as most small cases are at 1x or below. For larger cases (5, 6, 7, 9), it still managed to achieve 2.5x to 7.3x speedup. The compiler flags optimized the single-threaded naive baseline significantly, which reduced the relative parallel speedup. For small matrices, thread spawn contributed to the overhead, while larger matrices had enough work for the threads to contribute to the speedup.