muhammad-zahid: Implemented optimized matrix multiplication by Zahid07 · Pull Request #10 · AA-parallel-computing/Assignment-4-Optional

Zahid07 · 2026-06-01T13:53:16Z

Matrix Multiplication Performance Analysis

System Configuration

Compiler: GCC with -O2 optimization and -fopenmp flags
OpenMP Threads: 4 (OMP_NUM_THREADS=4)
Block Size: 32 (for blocked matrix multiplication)

Performance Measurements

All test cases passed validation successfully.

Test Case	Dimensions (m × n × p)	Naive Time (s)	Blocked Time (s)	Parallel Time (s)	Blocked Speedup	Parallel Speedup
0	64×64×64	0.000138	0.000126	0.000238	1.10×	0.58×
1	128×64×128	0.000546	0.000589	0.000367	0.93×	1.49×
2	100×128×56	0.000318	0.000340	0.000304	0.94×	1.04×
3	128×64×128	0.000571	0.000517	0.000546	1.10×	1.04×
4	32×128×32	0.000074	0.000065	0.000164	1.15×	0.45×
5	200×100×256	0.002757	0.002537	0.000958	1.09×	2.88×
6	256×256×256	0.010045	0.008880	0.002511	1.13×	4.00×
7	256×300×256	0.010491	0.009336	0.003308	1.12×	3.17×
8	64×128×64	0.000233	0.000245	0.000520	0.95×	0.45×
9	256×256×257	0.007978	0.008164	0.002268	0.98×	3.52×
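
The PR does not state how the speedup columns were computed. The figures are consistent with naive time divided by variant time, so the sketch below assumes that definition. The function names `speedup` and `parallel_efficiency` are hypothetical and do not come from the PR.

```c
#include <assert.h>

/* Assumed definition: speedup = T_naive / T_variant.
   Values above 1.0 mean the variant is faster than the naive baseline. */
double speedup(double naive_seconds, double variant_seconds) {
    assert(variant_seconds > 0.0);
    return naive_seconds / variant_seconds;
}

/* Parallel efficiency: speedup divided by thread count (1.0 would be ideal scaling). */
double parallel_efficiency(double speedup_value, int num_threads) {
    assert(num_threads > 0);
    return speedup_value / num_threads;
}
```

For test case 6, `speedup(0.010045, 0.002511)` gives roughly 4.00, which with 4 threads corresponds to an efficiency close to 1.0. A few table entries differ in the last digit from what the rounded times give (case 4 blocked: 0.000074 / 0.000065 is about 1.14, listed as 1.15), presumably because the speedups were computed from unrounded timings.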

Analysis

Blocked Matrix Multiplication

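The cache-optimized blocked implementation shows modest improvements in six of the ten test cases:

Best performance on test case 4 (1.15× speedup)
Consistent improvements on cases 5, 6, and 7 (1.09–1.13× speedup)
Slight slowdowns on cases 1, 2, 8, and 9 (0.93–0.98×). Block boundary overhead may explain cases 2 and 9, whose dimensions are not multiples of 32, but cases 1 and 8 are exact multiples of 32, so it cannot be the whole explanation
Cases 1 and 3 share the same dimensions (128×64×128) yet measure 0.93× and 1.10×, which suggests run-to-run timing variance is comparable to the effect being measured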

The block size of 32 appears to be a reasonable balance between cache efficiency and loop overhead. Among the larger matrices, cases 6 and 7 outperform the naive implementation (1.13× and 1.12×), consistent with improved cache locality, while case 9 (256×256×257) is marginally slower at 0.98×.
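
The PR's source is not shown here, so the following is only a minimal sketch of a 6-level blocked multiplication with a block size of 32. It assumes row-major `double` matrices with A of size m×n, B of size n×p, and C of size m×p; the names `matmul_blocked`, `BLOCK_SIZE`, and `min_size` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 32  /* block size reported in the PR */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* C (m x p) = A (m x n) * B (n x p), all stored row-major. */
void matmul_blocked(const double *A, const double *B, double *C,
                    size_t m, size_t n, size_t p) {
    assert(A && B && C);
    for (size_t i = 0; i < m * p; i++) C[i] = 0.0;

    /* Outer three loops walk over blocks; inner three loops work inside one block. */
    for (size_t ii = 0; ii < m; ii += BLOCK_SIZE) {
        size_t i_end = min_size(ii + BLOCK_SIZE, m);  /* clamp partial edge blocks */
        for (size_t kk = 0; kk < n; kk += BLOCK_SIZE) {
            size_t k_end = min_size(kk + BLOCK_SIZE, n);
            for (size_t jj = 0; jj < p; jj += BLOCK_SIZE) {
                size_t j_end = min_size(jj + BLOCK_SIZE, p);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; j++) {
                            C[i * p + j] += a * B[k * p + j];
                        }
                    }
                }
            }
        }
    }
}
```

The clamping with `min_size` is where irregular dimensions cost extra: a 256×256×257 product needs a ninth column of blocks in C that is only one element wide, so the loop overhead of those blocks is amortized over very little work.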

Parallel Matrix Multiplication

The OpenMP parallelized implementation demonstrates significant speedups for large matrices:

Best performance on test case 6 (4.00× speedup), the largest square matrix (256×256×256)
Strong performance on cases 7 and 9 (3.17× and 3.52× speedup)
Moderate improvements on cases 1 and 5 (1.49× and 2.88× speedup); negligible gains on cases 2 and 3 (1.04× each)
Parallel overhead outweighs benefits on small matrices (cases 0, 4, 8), causing slowdowns to 0.58×, 0.45×, and 0.45×
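
As an illustration of the approach described, here is a minimal sketch that places `#pragma omp parallel for` on the outermost loop. It assumes row-major `double` matrices and is not the PR's actual code; the function name `matmul_parallel` is hypothetical. Without `-fopenmp` the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* C (m x p) = A (m x n) * B (n x p), all stored row-major. */
void matmul_parallel(const double *A, const double *B, double *C,
                     int m, int n, int p) {
    assert(A && B && C);
    /* Rows of C are divided among threads. Each thread writes only its own
       rows, so no synchronization is needed inside the loop. */
    #pragma omp parallel for
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < p; j++) {
            double sum = 0.0;  /* declared inside the loop, so private per thread */
            for (int k = 0; k < n; k++) {
                sum += A[(size_t)i * n + k] * B[(size_t)k * p + j];
            }
            C[(size_t)i * p + j] = sum;
        }
    }
}
```

With this structure the fixed cost of starting and joining the parallel region is paid once per call. For case 4 (32×128×32) the whole serial multiplication takes about 0.000074 s, so that fixed cost is plausibly of the same order as the useful work, which would match the slowdowns measured on the small cases.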

Key Observations

Matrix Size	Best Approach	Reason
Small (< 100×100)	Naive or blocked	Parallel runs at 0.45–0.58× because threading overhead exceeds the available work
Medium (100–200)	Mixed	Blocked ranges from 0.93× to 1.10× and parallel from 1.04× to 1.49× (cases 1–3)
Large (> 200×200)	Parallel	3.17–4.00× speedup with 4 threads (cases 6, 7, 9)

Implementation Details

Naive: Standard triple-nested loop with i-j-k ordering
Blocked: 6-level nested loop with block size of 32
Parallel: OpenMP #pragma omp parallel for on the outermost loop
Validation: All implementations passed with epsilon tolerance of 0.1 for floating-point comparison
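
The PR does not show how the comparison is performed. A minimal sketch of an element-wise check, assuming the 0.1 tolerance is an absolute difference and that results are stored as `double`, could look like this. The names `matrices_match` and `EPSILON` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define EPSILON 0.1  /* tolerance reported in the PR, assumed here to be absolute */

/* Returns 1 if every element of `result` is within EPSILON of `expected`, else 0. */
int matrices_match(const double *expected, const double *result, size_t count) {
    assert(expected && result);
    for (size_t i = 0; i < count; i++) {
        double diff = expected[i] - result[i];
        if (diff < 0.0) diff = -diff;
        /* Written as !(diff <= EPSILON) so that a NaN difference also fails. */
        if (!(diff <= EPSILON)) return 0;
    }
    return 1;
}
```

An absolute tolerance of 0.1 is fairly loose when matrix entries are small, so a check like this mainly catches indexing or data-race bugs rather than subtle rounding differences.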

mzahid07-sudo and others added 2 commits June 1, 2026 18:47

muhammad-zahid: Implemented optimized matrix multiplication

f3de5ad

Delete er.name

c304d09

Zahid07 wants to merge 2 commits into AA-parallel-computing:main from Zahid07:muhammad-zahid
