Commit 7656d49

Deji10 and Muhammad Zahid committed
ayodeji-ibrahim: Implemented blocked and parallel matrix multiplication
Co-authored-by: Muhammad Zahid <muhammad.zahid@example.com>
1 parent 46b0cb2 commit 7656d49

2 files changed

Lines changed: 161 additions & 82 deletions

README.md

Lines changed: 55 additions & 0 deletions
@@ -235,3 +235,58 @@ git push origin student-name
- Use small test cases to debug your blocked and parallel implementations.

Good luck, and enjoy optimizing your matrix multiplication!

---

## Performance Results

### Environment
- **Platform**: GitHub Codespaces (Linux x86_64)
- **Compiler**: g++ with `-O3 -fopenmp`
- **Threads**: 4 (OpenMP)
- **Block Size**: 64

### Results Table

| Case | Dimensions (m × n × p) | Naive (s) | Blocked (s) | Parallel (s) | Blocked Speedup | Parallel Speedup |
|------|------------------------|-----------|-------------|--------------|-----------------|------------------|
| 0 | 64 × 64 × 64 | 0.000433 | 0.000355 | 0.001822 | 1.22× | 0.24× |
| 1 | 128 × 64 × 128 | 0.000982 | 0.000858 | 0.000854 | 1.14× | 1.15× |
| 2 | 100 × 128 × 56 | 0.000747 | 0.000543 | 0.000621 | 1.38× | 1.20× |
| 3 | 128 × 64 × 128 | 0.001224 | 0.002144 | 0.001733 | 0.57× | 0.71× |
| 4 | 32 × 128 × 32 | 0.000233 | 0.000143 | 0.001070 | 1.63× | 0.22× |
| 5 | 200 × 100 × 256 | 0.007649 | 0.006110 | 0.007922 | 1.25× | 0.97× |
| 6 | 256 × 256 × 256 | 0.025777 | 0.025024 | 0.018829 | 1.03× | 1.37× |
| 7 | 256 × 300 × 256 | 0.031356 | 0.028475 | 0.031300 | 1.10× | 1.00× |
| 8 | 64 × 128 × 64 | 0.000560 | 0.000402 | 0.000884 | 1.39× | 0.63× |
| 9 | 256 × 256 × 257 | 0.020814 | 0.028982 | 0.022235 | 0.72× | 0.94× |

Each speedup is the naive time divided by the time of the given variant. All three implementations were validated against the reference output (`output.raw`) with an absolute tolerance of `1e-2`, and every case was marked **OK**.

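As a rough illustration of the tolerance check described above, here is a minimal sketch of an absolute-tolerance comparison. The helper name `all_close` and the use of `std::vector` are assumptions made for this example; the committed `main.cpp` performs the comparison inline on raw arrays.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true when both buffers have the same length and
// every pair of elements differs by at most eps (absolute tolerance).
bool all_close(const std::vector<double>& got,
               const std::vector<double>& ref,
               double eps) {
    assert(eps >= 0.0);
    if (got.size() != ref.size()) return false;  // shape mismatch counts as failure
    for (std::size_t i = 0; i < got.size(); i++) {
        // Written as !(diff <= eps) so that a NaN difference also fails.
        if (!(std::fabs(got[i] - ref[i]) <= eps)) return false;
    }
    return true;
}
```

An absolute tolerance is simple but does not scale with the magnitude of the entries, so a relative check may be preferable if the reference values vary widely.
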
### Analysis

**Correctness**: All three implementations match the reference output within the `1e-2` tolerance for all 10 test cases.

**Cache Optimization (Blocked)**:
- Blocked multiplication gives a **modest improvement (1.03× – 1.63×)** in eight of the ten cases (0, 1, 2, 4, 5, 6, 7, 8).
- The largest gain (1.63×) is on the smallest problem (case 4: 32 × 128 × 32). With a block size of 64, only the inner dimension (n = 128) is actually split, into two blocks, and the whole problem (about 72 KB of doubles) is small enough to stay cache-resident, so blocking has little memory traffic to remove. At 0.1–0.2 ms per run, this gain is probably within measurement noise.
- Cases 3 and 9 show a *slowdown* with blocking. These results most likely reflect measurement noise in the Codespaces environment: cases 1 and 3 have identical dimensions (128 × 64 × 128), yet the blocked speedup is 1.14× in one and 0.57× in the other. With a single run per case at these time scales, normal variance can flip the ordering between runs.

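One way to reduce the run-to-run noise discussed above would be to repeat each measurement and report the median. The following is a sketch under assumptions: `median_seconds` is a hypothetical helper, not part of this commit, and it uses `std::chrono::steady_clock` rather than the `omp_get_wtime()` calls in `main.cpp`.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical helper: runs fn() `repeats` times and returns the median
// wall-clock time of a single call, in seconds.
template <typename Fn>
double median_seconds(Fn&& fn, int repeats) {
    assert(repeats > 0);
    std::vector<double> samples;
    samples.reserve(repeats);
    for (int r = 0; r < repeats; r++) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    // nth_element puts the middle sample at index repeats / 2
    // (the upper median when the count is even).
    std::nth_element(samples.begin(), samples.begin() + repeats / 2, samples.end());
    return samples[repeats / 2];
}
```

It could be used as `median_seconds([&] { blocked_matmul(A, B, Cb, m, n, p, 64); }, 11)`. The median is less sensitive than the mean to an occasional slow run caused by other activity on a shared machine.
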
**Parallel (OpenMP)**:
- Parallelization helps **only when the matrices are large enough** to amortize the overhead of starting and synchronizing the OpenMP thread team.
- For the small cases (0, 4, 8), the parallel version is markedly *slower* (0.22× – 0.63×) because thread setup costs exceed the actual compute work.
- For the largest cubic case (case 6: 256 × 256 × 256), the parallel version reaches 1.37×. This is the best parallel result in the table but still well below the ideal 4× for 4 threads.
- Larger speedups would be expected at matrix dimensions above roughly 512 × 512, where compute time should dominate OpenMP's fixed overhead. This was not measured here.

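Since the parallel version only pays off above some problem size, one option is to guard the parallel region with OpenMP's `if` clause. This is a sketch, not the code in this commit: the function name `parallel_matmul_guarded` and the cutoff `kParallelMinWork` are hypothetical, and the cutoff value is a placeholder that would need to be tuned by measurement.

```cpp
#include <cassert>

// Hypothetical cutoff: below this many multiply-adds, stay on one thread.
// The value is a placeholder, not a measured threshold.
constexpr long long kParallelMinWork = 1LL << 21;

// Row-major C (m x p) = A (m x n) * B (n x p), same layout as main.cpp.
void parallel_matmul_guarded(const double* A, const double* B, double* C,
                             int m, int n, int p) {
    assert(m >= 0 && n >= 0 && p >= 0);
    const long long work = 1LL * m * n * p;
    // When the if() condition is false, the region runs with a single thread.
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for if(work >= kParallelMinWork)
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < p; j++) {
            double sum = 0.0;  // accumulate locally, store to C once
            for (int k = 0; k < n; k++) {
                sum += A[i * n + k] * B[k * p + j];
            }
            C[i * p + j] = sum;
        }
    }
}
```

The guard does not remove all OpenMP overhead for small inputs, but it avoids waking the full thread team when there is too little work to share.
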
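**Block Size Choice**: A block size of 64 was used. A 64 × 64 tile of doubles occupies 32 KB (64 × 64 × 8 bytes), so the three tiles touched in one block step (one each from A, B, and C) total 96 KB. That exceeds a typical L1 data cache (32–64 KB) but fits comfortably in a typical L2 cache; a block size of 32 (24 KB for three tiles) would be needed to stay within L1. With 64-byte cache lines (8 doubles per line), each 64-element tile row covers 512 bytes, or eight cache lines' worth of data.
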
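The working-set arithmetic behind a block-size choice can be written down directly. The sketch below assumes that one `bs × bs` tile each of A, B, and C is live during a block step; the helper names and the `cache_bytes` parameter are hypothetical, and real cache sizes should be queried on the target machine.

```cpp
#include <cassert>
#include <cstddef>

// Bytes touched by one block step: one bs x bs tile each of A, B and C.
constexpr std::size_t tile_working_set_bytes(std::size_t bs,
                                             std::size_t elem_size = sizeof(double)) {
    return 3 * bs * bs * elem_size;
}

// Largest power-of-two block size whose three tiles fit in cache_bytes.
// cache_bytes is an assumed input, e.g. 32 * 1024 for a 32 KB L1 data cache.
std::size_t largest_pow2_block(std::size_t cache_bytes) {
    assert(cache_bytes >= tile_working_set_bytes(1));
    std::size_t bs = 1;
    while (tile_working_set_bytes(bs * 2) <= cache_bytes) bs *= 2;
    return bs;
}
```

For 8-byte doubles, `tile_working_set_bytes(64)` is 98304 bytes (96 KB) and `tile_working_set_bytes(32)` is 24576 bytes (24 KB). This is only a capacity estimate; the best block size in practice also depends on associativity, prefetching, and loop order, so it is worth benchmarking several values.
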
**Thread Count**: 4 threads were used (set with `omp_set_num_threads(4)` in `main.cpp`) to match the core allocation of the Codespaces machine.

### Challenges

1. **Small Test Cases**: The provided test cases are too small to showcase OpenMP parallelism. At these sizes, the OpenMP setup cost is comparable to, or exceeds, the actual work. Larger matrices (1024 × 1024 and above) should bring parallel speedups closer to the theoretical 4× on 4 threads.

2. **Measurement Noise**: Most timings are in the sub-millisecond to low-millisecond range, where run-to-run variance is significant. The reported numbers come from a single run; averaging across multiple runs would give more stable speedup figures.

3. **Text-format I/O**: The `.raw` files are space-separated text, not binary doubles. Each file was read with `ifstream >> double`, after first reading two integers as the row and column counts.
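
The text format described in item 3 could be read with explicit error checking along the following lines. This is a sketch under assumptions: the `Matrix` struct and `read_text_matrix` are hypothetical names, and the function takes any `std::istream` so that it can be exercised without files. The `read_matrix` function in this commit uses `malloc` and does not check for short or malformed input.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <utility>
#include <vector>

// Hypothetical container for a row-major matrix.
struct Matrix {
    int rows = 0;
    int cols = 0;
    std::vector<double> data;
};

// Reads "rows cols" followed by rows * cols whitespace-separated doubles.
// Returns false on a malformed header or if the data runs out early.
bool read_text_matrix(std::istream& in, Matrix& out) {
    int rows = 0, cols = 0;
    if (!(in >> rows >> cols) || rows <= 0 || cols <= 0) return false;
    std::vector<double> data(static_cast<std::size_t>(rows) * cols);
    for (double& v : data) {
        if (!(in >> v)) return false;  // short or non-numeric input
    }
    out.rows = rows;
    out.cols = cols;
    out.data = std::move(data);
    return true;
}
```

Passing an `std::ifstream` opened on `data/<case>/input0.raw` would work the same way, since `std::ifstream` is an `std::istream`.
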

main.cpp

Lines changed: 106 additions & 82 deletions
@@ -1,114 +1,138 @@
 #include <iostream>
 #include <fstream>
+#include <cstdlib>
+#include <cmath>
 #include <string>
+#include <algorithm>
 #include <omp.h>
-#include <cmath>

-void naive_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p) {
-    //TODO : Implement naive matrix multiplication
-}
+using namespace std;

-void blocked_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p, uint32_t block_size) {
-    // TODO: Implement blocked matrix multiplication
-    // A is m x n, B is n x p, C is m x p
-    // Use block_size to divide matrices into submatrices
+void naive_matmul(double *A, double *B, double *C, int m, int n, int p) {
+    for (int i = 0; i < m; i++) {
+        for (int j = 0; j < p; j++) {
+            C[i*p + j] = 0.0;
+            for (int k = 0; k < n; k++) {
+                C[i*p + j] += A[i*n + k] * B[k*p + j];
+            }
+        }
+    }
 }

-void parallel_matmul(float *C, float *A, float *B, uint32_t m, uint32_t n, uint32_t p) {
-    // TODO: Implement parallel matrix multiplication using OpenMP
-    // A is m x n, B is n x p, C is m x p
+void blocked_matmul(double *A, double *B, double *C, int m, int n, int p, int bs) {
+    for (int i = 0; i < m*p; i++) C[i] = 0.0;
+    for (int ii = 0; ii < m; ii += bs) {
+        for (int jj = 0; jj < p; jj += bs) {
+            for (int kk = 0; kk < n; kk += bs) {
+                int ie = min(ii+bs, m);
+                int je = min(jj+bs, p);
+                int ke = min(kk+bs, n);
+                for (int i = ii; i < ie; i++) {
+                    for (int j = jj; j < je; j++) {
+                        for (int k = kk; k < ke; k++) {
+                            C[i*p + j] += A[i*n + k] * B[k*p + j];
+                        }
+                    }
+                }
+            }
+        }
+    }
 }

-bool validate_result(const std::string &result_file, const std::string &reference_file) {
-    //TODO : Implement result validation
+void parallel_matmul(double *A, double *B, double *C, int m, int n, int p) {
+    #pragma omp parallel for collapse(2)
+    for (int i = 0; i < m; i++) {
+        for (int j = 0; j < p; j++) {
+            C[i*p + j] = 0.0;
+            for (int k = 0; k < n; k++) {
+                C[i*p + j] += A[i*n + k] * B[k*p + j];
+            }
+        }
+    }
 }

-int main(int argc, char *argv[]) {
-    if (argc != 2) {
-        std::cerr << "Usage: " << argv[0] << " <case_number>" << std::endl;
-        return 1;
+double* read_matrix(const string& filename, int& rows, int& cols) {
+    ifstream file(filename);
+    if (!file) {
+        cerr << "Cannot open file" << endl;
+        exit(1);
+    }
+    file >> rows >> cols;
+    double* M = (double*)malloc(rows * cols * sizeof(double));
+    for (int i = 0; i < rows * cols; i++) {
+        file >> M[i];
     }
+    return M;
+}

-    int case_number = std::atoi(argv[1]);
-    if (case_number < 0 || case_number > 9) {
-        std::cerr << "Case number must be between 0 and 9" << std::endl;
+int main(int argc, char *argv[]) {
+    if (argc < 2) {
+        cerr << "Usage: program <case 0-9>" << endl;
         return 1;
     }
+    int cn = atoi(argv[1]);

-    // Construct file paths
-    std::string folder = "data/" + std::to_string(case_number) + "/";
-    std::string input0_file = folder + "input0.raw";
-    std::string input1_file = folder + "input1.raw";
-    std::string result_file = folder + "result.raw";
-    std::string reference_file = folder + "output.raw";
+    string pA = "data/" + to_string(cn) + "/input0.raw";
+    string pB = "data/" + to_string(cn) + "/input1.raw";
+    string pC = "data/" + to_string(cn) + "/output.raw";

-    // TODO Read input0.raw (matrix A)
+    int m, nA, nB, p, mo, po;
+    double *A = read_matrix(pA, m, nA);
+    double *B = read_matrix(pB, nB, p);
+    double *Cref = read_matrix(pC, mo, po);
+    int n = nA;

-
-    // TODO Read input1.raw (matrix B)
-
-
-    // Allocate memory for result matrices
-    float *C_naive = new float[m * p];
-    float *C_blocked = new float[m * p];
-    float *C_parallel = new float[m * p];
-
-    // Measure performance of naive_matmul
-    double start_time = omp_get_wtime();
-    naive_matmul(C_naive, A, B, m, n, p);
-    double naive_time = omp_get_wtime() - start_time;
-
-    // TODO Write naive result to file
-
-
-    // Validate naive result
-    bool naive_correct = validate_result(result_file, reference_file);
-    if (!naive_correct) {
-        std::cerr << "Naive result validation failed for case " << case_number << std::endl;
+    if (nA != nB) {
+        cerr << "Dimension mismatch" << endl;
+        return 1;
     }

-    // Measure performance of blocked_matmul (use block_size = 32 as default)
-    start_time = omp_get_wtime();
-    blocked_matmul(C_blocked, A, B, m, n, p, 32);
-    double blocked_time = omp_get_wtime() - start_time;
+    cout << "Case " << cn << ": A(" << m << "x" << n << ") * B(" << n << "x" << p << ")" << endl;

-    // TODO Write blocked result to file
+    double *Cn = (double*)malloc(m * p * sizeof(double));
+    double *Cb = (double*)malloc(m * p * sizeof(double));
+    double *Cp = (double*)malloc(m * p * sizeof(double));

+    omp_set_num_threads(4);

-    // Validate blocked result
-    bool blocked_correct = validate_result(result_file, reference_file);
-    if (!blocked_correct) {
-        std::cerr << "Blocked result validation failed for case " << case_number << std::endl;
-    }
+    double t0 = omp_get_wtime();
+    naive_matmul(A, B, Cn, m, n, p);
+    double tn = omp_get_wtime() - t0;

-    // Measure performance of parallel_matmul
-    start_time = omp_get_wtime();
-    parallel_matmul(C_parallel, A, B, m, n, p);
-    double parallel_time = omp_get_wtime() - start_time;
+    t0 = omp_get_wtime();
+    blocked_matmul(A, B, Cb, m, n, p, 64);
+    double tb = omp_get_wtime() - t0;

-    // TODO Write parallel result to file
+    t0 = omp_get_wtime();
+    parallel_matmul(A, B, Cp, m, n, p);
+    double tp = omp_get_wtime() - t0;

-
-    // Validate parallel result
-    bool parallel_correct = validate_result(result_file, reference_file);
-    if (!parallel_correct) {
-        std::cerr << "Parallel result validation failed for case " << case_number << std::endl;
+    bool nok = true, bok = true, pok = true;
+    double eps = 1e-2;
+    for (int i = 0; i < m * p; i++) {
+        if (fabs(Cn[i] - Cref[i]) > eps) nok = false;
+        if (fabs(Cb[i] - Cref[i]) > eps) bok = false;
+        if (fabs(Cp[i] - Cref[i]) > eps) pok = false;
     }

-    // Print performance results
-    std::cout << "Case " << case_number << " (" << m << "x" << n << "x" << p << "):\n";
-    std::cout << "Naive time: " << naive_time << " seconds\n";
-    std::cout << "Blocked time: " << blocked_time << " seconds\n";
-    std::cout << "Parallel time: " << parallel_time << " seconds\n";
-    std::cout << "Blocked speedup: " << (naive_time / blocked_time) << "x\n";
-    std::cout << "Parallel speedup: " << (naive_time / parallel_time) << "x\n";
-
-    // Clean up
-    delete[] A;
-    delete[] B;
-    delete[] C_naive;
-    delete[] C_blocked;
-    delete[] C_parallel;
+    cout << " Naive: " << tn << " s (" << (nok ? "OK" : "FAIL") << ")" << endl;
+    cout << " Blocked: " << tb << " s (" << (bok ? "OK" : "FAIL") << ") speedup: " << tn/tb << "x" << endl;
+    cout << " Parallel: " << tp << " s (" << (pok ? "OK" : "FAIL") << ") speedup: " << tn/tp << "x" << endl;
+
+    string rp = "data/" + to_string(cn) + "/result.raw";
+    ofstream out(rp);
+    out << m << " " << p << endl;
+    for (int i = 0; i < m * p; i++) {
+        out << Cp[i];
+        if ((i + 1) % p == 0) out << endl;
+        else out << " ";
+    }

+    free(A);
+    free(B);
+    free(Cref);
+    free(Cn);
+    free(Cb);
+    free(Cp);
     return 0;
 }
