muhammad-zahid: Implemented optimized matrix multiplication#10

Open
Zahid07 wants to merge 2 commits into AA-parallel-computing:main from Zahid07:muhammad-zahid

Zahid07 commented Jun 1, 2026

Matrix Multiplication Performance Analysis

System Configuration

  • Compiler: GCC with -O2 optimization and -fopenmp flags
  • OpenMP Threads: 4 (OMP_NUM_THREADS=4)
  • Block Size: 32 (for blocked matrix multiplication)

Performance Measurements

All test cases passed validation successfully.

| Test Case | Dimensions (m × n × p) | Naive Time (s) | Blocked Time (s) | Parallel Time (s) | Blocked Speedup | Parallel Speedup |
|---|---|---|---|---|---|---|
| 0 | 64×64×64 | 0.000138 | 0.000126 | 0.000238 | 1.10× | 0.58× |
| 1 | 128×64×128 | 0.000546 | 0.000589 | 0.000367 | 0.93× | 1.49× |
| 2 | 100×128×56 | 0.000318 | 0.000340 | 0.000304 | 0.94× | 1.04× |
| 3 | 128×64×128 | 0.000571 | 0.000517 | 0.000546 | 1.10× | 1.04× |
| 4 | 32×128×32 | 0.000074 | 0.000065 | 0.000164 | 1.15× | 0.45× |
| 5 | 200×100×256 | 0.002757 | 0.002537 | 0.000958 | 1.09× | 2.88× |
| 6 | 256×256×256 | 0.010045 | 0.008880 | 0.002511 | 1.13× | 4.00× |
| 7 | 256×300×256 | 0.010491 | 0.009336 | 0.003308 | 1.12× | 3.17× |
| 8 | 64×128×64 | 0.000233 | 0.000245 | 0.000520 | 0.95× | 0.45× |
| 9 | 256×256×257 | 0.007978 | 0.008164 | 0.002268 | 0.98× | 3.52× |
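
The speedup columns appear to be the naive time divided by the variant's time, so values above 1.0 mean the variant was faster. A minimal sketch of that arithmetic follows; the function names `speedup` and `parallel_efficiency` are illustrative and not taken from the PR.

```c
#include <assert.h>

/* Speedup of a variant relative to the naive baseline.
 * Above 1.0: the variant is faster. Below 1.0: it is slower. */
double speedup(double naive_seconds, double variant_seconds)
{
    assert(naive_seconds > 0.0 && variant_seconds > 0.0);
    return naive_seconds / variant_seconds;
}

/* Fraction of ideal linear scaling reached with the given thread count. */
double parallel_efficiency(double measured_speedup, int threads)
{
    assert(threads > 0);
    return measured_speedup / (double)threads;
}
```

For test case 6, `speedup(0.010045, 0.002511)` comes to about 4.00, and dividing by the 4 threads gives an efficiency of about 1.0. A few table entries differ from this calculation in the last digit, presumably because they were computed from unrounded timings.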

Analysis

Blocked Matrix Multiplication

The cache-blocked implementation is modestly faster than naive in 6 of the 10 test cases:

  • Best result on test case 4 (1.15× speedup), although at 65–74 µs per run this case is also the most sensitive to timing noise
  • Steady gains on the larger cases 5, 6, and 7 (1.09–1.13× speedup)
  • Slight slowdowns on cases 1, 2, 8, and 9 (0.93–0.98×). Block boundary overhead may contribute where a dimension is not a multiple of 32 (cases 2 and 9), but cases 1 and 3 share the same 128×64×128 dimensions and measured 0.93× and 1.10×, so run-to-run variation is likely a factor as well

With a block size of 32, the blocked version outperforms naive on cases 5, 6, and 7 but is marginally slower on case 9 (0.98×). The cache-locality benefit at these matrix sizes is therefore present but small.
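
To make the comparison concrete, here is a minimal sketch of a naive i-j-k loop next to a six-loop blocked version with a block size of 32. It is not the PR's code: it assumes row-major `double` matrices with A as m × n, B as n × p, and C as m × p. The names `matmul_naive`, `matmul_blocked`, and `blocked_matches_naive` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK 32 /* block size reported in the PR */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Reference i-j-k triple loop: C (m x p) = A (m x n) * B (n x p). */
void matmul_naive(const double *A, const double *B, double *C,
                  int m, int n, int p)
{
    assert(m >= 0 && n >= 0 && p >= 0);
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < p; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[(size_t)i * n + k] * B[(size_t)k * p + j];
            C[(size_t)i * p + j] = sum;
        }
    }
}

/* Six nested loops: three step over tiles, three iterate inside a tile. */
void matmul_blocked(const double *A, const double *B, double *C,
                    int m, int n, int p)
{
    assert(m >= 0 && n >= 0 && p >= 0);
    for (size_t x = 0; x < (size_t)m * p; x++)
        C[x] = 0.0; /* tiles accumulate into C, so start from zero */

    for (int ii = 0; ii < m; ii += BLOCK) {
        for (int jj = 0; jj < p; jj += BLOCK) {
            for (int kk = 0; kk < n; kk += BLOCK) {
                /* Clamp tile edges when a dimension is not a multiple of BLOCK. */
                int i_end = min_int(ii + BLOCK, m);
                int j_end = min_int(jj + BLOCK, p);
                int k_end = min_int(kk + BLOCK, n);
                for (int i = ii; i < i_end; i++) {
                    for (int j = jj; j < j_end; j++) {
                        double sum = C[(size_t)i * p + j];
                        for (int k = kk; k < k_end; k++)
                            sum += A[(size_t)i * n + k] * B[(size_t)k * p + j];
                        C[(size_t)i * p + j] = sum;
                    }
                }
            }
        }
    }
}

/* Self-check: returns 1 if both versions agree on an m x n x p problem.
 * Inputs are small integers, so every sum is exact in double precision
 * and the results can be compared with == despite the different order. */
int blocked_matches_naive(int m, int n, int p)
{
    size_t a_len = (size_t)m * n, b_len = (size_t)n * p, c_len = (size_t)m * p;
    double *A = malloc(a_len * sizeof *A);
    double *B = malloc(b_len * sizeof *B);
    double *C1 = malloc(c_len * sizeof *C1);
    double *C2 = malloc(c_len * sizeof *C2);
    int ok = (A && B && C1 && C2);
    if (ok) {
        for (size_t x = 0; x < a_len; x++) A[x] = (double)((int)(x % 7) - 3);
        for (size_t x = 0; x < b_len; x++) B[x] = (double)((int)(x % 5) - 2);
        matmul_naive(A, B, C1, m, n, p);
        matmul_blocked(A, B, C2, m, n, p);
        for (size_t x = 0; x < c_len; x++)
            if (C1[x] != C2[x]) ok = 0;
    }
    free(A); free(B); free(C1); free(C2);
    return ok;
}
```

Calling `blocked_matches_naive(100, 128, 56)` exercises the clamped edge tiles, since 100 and 56 are not multiples of 32. The extra `min_int` calls and short edge tiles are the block boundary overhead mentioned in the analysis.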

Parallel Matrix Multiplication

The OpenMP parallelized implementation demonstrates significant speedups for large matrices:

  • Best result on test case 6 (4.00× speedup), the largest square matrix (256×256×256); with 4 threads this is linear scaling
  • Strong results on cases 7 and 9 (3.17× and 3.52× speedup)
  • Moderate gains on cases 1 and 5 (1.49× and 2.88× speedup), while cases 2 and 3 roughly break even (1.04×)
  • Parallel overhead outweighs the benefit on small matrices (cases 0, 4, 8), which run at only 0.45–0.58× of naive speed
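
The pattern above can be sketched as follows, under the same assumptions as before (row-major `double` matrices, A as m × n, B as n × p). `matmul_parallel` puts `#pragma omp parallel for` on the outermost loop, as the PR describes. `matmul_parallel_guarded` is a variant that is not in the PR: it uses the OpenMP `if` clause to run the loop with a single thread when the problem is small. Its `min_work` threshold is a hypothetical parameter that would need tuning.

```c
#include <assert.h>
#include <stddef.h>

/* Compute one output row: C[i][0..p-1] = A[i][0..n-1] * B. */
static void compute_row(const double *A, const double *B, double *C,
                        int i, int n, int p)
{
    for (int j = 0; j < p; j++) {
        double sum = 0.0;
        for (int k = 0; k < n; k++)
            sum += A[(size_t)i * n + k] * B[(size_t)k * p + j];
        C[(size_t)i * p + j] = sum;
    }
}

/* Rows of C are independent, so threads never write the same element
 * and no locking is needed. */
void matmul_parallel(const double *A, const double *B, double *C,
                     int m, int n, int p)
{
    assert(m >= 0 && n >= 0 && p >= 0);
    #pragma omp parallel for
    for (int i = 0; i < m; i++)
        compute_row(A, B, C, i, n, p);
}

/* Variant (not in the PR): only use a thread team when the number of
 * multiply-adds reaches min_work; otherwise the loop runs on one thread. */
void matmul_parallel_guarded(const double *A, const double *B, double *C,
                             int m, int n, int p, long long min_work)
{
    assert(m >= 0 && n >= 0 && p >= 0);
    long long work = (long long)m * n * p;
    #pragma omp parallel for if(work >= min_work)
    for (int i = 0; i < m; i++)
        compute_row(A, B, C, i, n, p);
}
```

In the measurements, case 8 (64×128×64, about 0.5 million multiply-adds) slowed down while case 1 (128×64×128, about 1.0 million) sped up. A `min_work` near 1,000,000 would therefore be a plausible starting point on this machine, though that value is a guess rather than a measured crossover. Without `-fopenmp` the pragmas are ignored and both functions run serially.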

Key Observations

| Matrix Size | Best Approach | Reason |
|---|---|---|
| Small (< 100×100) | Naive or blocked | Thread startup cost exceeds the work; parallel runs at 0.45–0.58× |
| Medium (100–200) | Parallel, marginally | Parallel measured 1.04–1.49× against 0.93–1.10× for blocked; the gains are small |
| Large (> 200×200) | Parallel | 3.17–4.00× speedup with 4 threads |

Implementation Details

  • Naive: Standard triple-nested loop with i-j-k ordering
  • Blocked: 6-level nested loop with block size of 32
  • Parallel: OpenMP #pragma omp parallel for on the outermost loop
  • Validation: All implementations passed with epsilon tolerance of 0.1 for floating-point comparison
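
The validation step can be pictured as an element-wise comparison against the naive result. The sketch below assumes the tolerance is an absolute difference and that the matrices are stored as flat `double` arrays. The PR states neither, and `matrices_match` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every element of actual is within epsilon of expected,
 * 0 otherwise. The comparison is written as !(diff <= epsilon) so that
 * a NaN in either input counts as a mismatch. */
int matrices_match(const double *expected, const double *actual,
                   size_t count, double epsilon)
{
    assert(epsilon >= 0.0);
    for (size_t i = 0; i < count; i++) {
        double diff = expected[i] - actual[i];
        if (diff < 0.0)
            diff = -diff;
        if (!(diff <= epsilon))
            return 0;
    }
    return 1;
}
```

A call such as `matrices_match(c_naive, c_blocked, (size_t)m * p, 0.1)` would mirror the tolerance quoted above. An absolute tolerance of 0.1 is fairly loose, so a smaller epsilon or a relative tolerance may be worth considering if the inputs are small in magnitude.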
