All implementations were validated against the reference output (`output.raw`) with a tolerance of `1e-2`. All cases are marked **OK**.
### Analysis
**Correctness**: Every implementation matches the reference output within the `1e-2` tolerance for all 10 test cases.
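
The tolerance check can be sketched roughly as follows. This is a minimal illustration, not the actual validation harness: the function name `matches_reference` is hypothetical, and it assumes the `1e-2` tolerance is an absolute per-element difference (the real check may be relative).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true when both buffers have the same length and every
// pair of elements differs by at most `tol`.
// Assumption: the tolerance is absolute, not relative.
bool matches_reference(const std::vector<double>& got,
                       const std::vector<double>& ref,
                       double tol = 1e-2) {
    if (got.size() != ref.size()) return false;
    for (std::size_t i = 0; i < got.size(); ++i) {
        // Written as a negated "<=" so that a NaN element also fails the check.
        if (!(std::fabs(got[i] - ref[i]) <= tol)) return false;
    }
    return true;
}
```

Under this reading, "OK" means "within tolerance", which is weaker than bit-for-bit equality; different loop orders can legitimately change the last few digits of a floating-point sum.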
**Cache Optimization (Blocked)**:
- Blocked multiplication gives a **modest improvement (1.0× – 1.6×)** in most cases (0, 1, 2, 4, 5, 6, 7, 8).
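- For tiny matrices (case 4: 32×128×32), blocking gives the largest gain (1.63×). The cause is unclear: the three matrices total only about 72 KB of doubles and already fit in cache, so blocking has little memory traffic left to remove, and at this scale the gap may be within run-to-run noise.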
- Cases 3 and 9 show a *slowdown* with blocking. These results most likely reflect measurement noise in the Codespaces environment: when execution time is in the millisecond range, normal variance can flip the ordering between runs.
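
For reference, a blocked (tiled) multiplication of the kind discussed above might look like the sketch below. It is an illustration under assumptions, not the benchmarked code: the name `matmul_blocked`, the row-major layout, and the `ii`/`kk`/`jj` tile order are all choices made here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a tiled multiply. Assumes row-major storage:
// A is m x k, B is k x n, and the returned C is m x n.
std::vector<double> matmul_blocked(const std::vector<double>& A,
                                   const std::vector<double>& B,
                                   std::size_t m, std::size_t k, std::size_t n,
                                   std::size_t bs = 64) {
    assert(bs > 0 && A.size() == m * k && B.size() == k * n);
    std::vector<double> C(m * n, 0.0);
    for (std::size_t ii = 0; ii < m; ii += bs) {
        for (std::size_t kk = 0; kk < k; kk += bs) {
            for (std::size_t jj = 0; jj < n; jj += bs) {
                // Clamp tile bounds so dimensions need not be multiples of bs.
                const std::size_t i_end = std::min(ii + bs, m);
                const std::size_t k_end = std::min(kk + bs, k);
                const std::size_t j_end = std::min(jj + bs, n);
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t p = kk; p < k_end; ++p) {
                        const double a = A[i * k + p];
                        // Innermost loop walks contiguous rows of B and C.
                        for (std::size_t j = jj; j < j_end; ++j) {
                            C[i * n + j] += a * B[p * n + j];
                        }
                    }
                }
            }
        }
    }
    return C;
}
```

When a dimension is no larger than `bs`, the tile loop over that dimension runs only once, so for the smallest test cases this kernel is close to an unblocked multiply with a different loop order.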
**Parallel (OpenMP)**:
- Parallelization helps **only when the matrices are large enough** to amortize OpenMP thread creation overhead.
- For tiny matrices (cases 0, 4, 8), parallel is dramatically *slower* (0.22× – 0.63×) because thread setup costs exceed the actual compute work.
- For larger matrices (case 6: 256×256×256), parallel achieves a modest speedup (~1.37×) on 4 threads, still well below the ideal 4×.
- A larger parallel speedup would likely appear at matrix dimensions above ~512×512, where compute time dominates OpenMP's fixed overhead. This was not measured here.
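
One way to avoid paying thread overhead on tiny inputs is OpenMP's `if` clause, which keeps the loop serial when its condition is false. The sketch below illustrates that idea; it is not the code that was benchmarked. The name `matmul_parallel` and the cutoff `kParallelThreshold` are hypothetical, and the cutoff value is a guess that would need tuning by measurement.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cutoff: below this many multiply-adds the loop stays serial.
// The value is an assumption, not a measured break-even point.
constexpr std::size_t kParallelThreshold = std::size_t{1} << 22;

// Row-major A (m x k) times B (k x n); rows of C are split across threads.
std::vector<double> matmul_parallel(const std::vector<double>& A,
                                    const std::vector<double>& B,
                                    std::size_t m, std::size_t k, std::size_t n) {
    assert(A.size() == m * k && B.size() == k * n);
    std::vector<double> C(m * n, 0.0);
    const bool go_parallel = m * k * n >= kParallelThreshold;
    const long long rows = static_cast<long long>(m);
    // Each iteration writes only its own row of C, so no synchronization is needed.
    #pragma omp parallel for if(go_parallel) num_threads(4) schedule(static)
    for (long long i = 0; i < rows; ++i) {
        const std::size_t r = static_cast<std::size_t>(i);
        for (std::size_t p = 0; p < k; ++p) {
            const double a = A[r * k + p];
            for (std::size_t j = 0; j < n; ++j) {
                C[r * n + j] += a * B[p * n + j];
            }
        }
    }
    return C;
}
```

With this cutoff, a 256×256×256 case (about 16.8 million multiply-adds) would run in parallel while a 32×128×32 case (131,072) would stay serial. Compiled without OpenMP support (for example, without `-fopenmp` on GCC), the pragma is ignored and the function runs serially.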
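**Block Size Choice**: A block size of 64 was chosen as a multiple of the 8 doubles that fit in a 64-byte cache line, so each tile row spans a whole number of cache lines. A 64×64 tile of doubles occupies 32 KB, so three tiles in flight (one each from A, B, and C) total about 96 KB: larger than a typical 32–64 KB L1 data cache, but small enough to stay in L2 without excessive loop overhead.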
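
As a quick check on the working set, the arithmetic can be written out as compile-time assertions. This sketch assumes three square tiles of doubles are live at once (one each from A, B, and C), which may overstate what the actual kernel touches; the helper name `tile_working_set_bytes` is made up for illustration.

```cpp
#include <cstddef>

// Bytes occupied by `tiles` square tiles of doubles with side length `bs`.
// Assumption: three tiles (A, B, C) are resident simultaneously by default.
constexpr std::size_t tile_working_set_bytes(std::size_t bs, std::size_t tiles = 3) {
    return tiles * bs * bs * sizeof(double);
}

static_assert(sizeof(double) == 8, "sketch assumes 8-byte doubles");
static_assert(tile_working_set_bytes(64, 1) == 32 * 1024, "one 64x64 tile is 32 KiB");
static_assert(tile_working_set_bytes(64) == 96 * 1024, "three 64x64 tiles are 96 KiB");
static_assert(tile_working_set_bytes(32) == 24 * 1024, "three 32x32 tiles are 24 KiB");
```

Under these assumptions, a block size of 32 would be the largest power of two whose three tiles fit in a 32 KB L1 data cache.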
**Thread Count**: 4 threads were used to match the Codespaces default core allocation.
### Challenges
1. **Small Test Cases**: The provided test cases are too small to fully showcase OpenMP parallelism. At these sizes, the OpenMP setup cost is comparable to (or exceeds) the actual work. Larger matrices (1024×1024 and above) would be expected to show speedups closer to the ideal 4× on 4 threads.
2. **Measurement Noise**: With timings in the sub-millisecond range, run-to-run variance is significant. Reported numbers are from a single run; averaging across multiple runs would give more stable speedup figures.
3. **Text-format I/O**: The `.raw` files are space-separated text, not binary doubles. Reading was done using `ifstream >> double` with the first two integers as dimensions.
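
For the measurement-noise point above, a repeated-run timer is one possible remedy. The sketch below reports the median rather than the mean, a deliberate substitution since the median is less sensitive to occasional slow outliers. The name `median_ms` and the default of 11 runs are assumptions, not part of the original benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `work` once as a warm-up, then `runs` timed
// repetitions, and returns the median wall-clock time in milliseconds.
template <typename F>
double median_ms(F&& work, int runs = 11) {
    assert(runs > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    work();  // warm-up: populates caches and, with OpenMP, the thread pool
    for (int r = 0; r < runs; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    // Partial sort that places the median element at the middle position.
    const std::size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    return samples[mid];
}
```

A call such as `median_ms([&] { C = multiply(A, B); })`, where `multiply` stands in for whichever kernel is being measured, would replace a single timed invocation.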
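
The text-format reading described in item 3 could look roughly like the following. This is a sketch under assumptions: the `Matrix` struct and function names are hypothetical, the two leading integers are taken to be rows then columns, and the values are assumed to be stored in row-major order.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for a dense row-major matrix.
struct Matrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<double> data;
};

// Reads "<rows> <cols>" followed by rows*cols whitespace-separated doubles.
// Assumption: dimension order and row-major layout, as noted above.
Matrix read_raw_text(std::istream& in) {
    Matrix m;
    if (!(in >> m.rows >> m.cols)) {
        throw std::runtime_error("missing matrix dimensions");
    }
    m.data.resize(m.rows * m.cols);
    for (double& v : m.data) {
        if (!(in >> v)) {
            throw std::runtime_error("truncated matrix data");
        }
    }
    return m;
}

// Convenience wrapper for in-memory text, mainly useful in tests.
Matrix parse_raw_text(const std::string& text) {
    std::istringstream in(text);
    return read_raw_text(in);
}
```

Passing a `std::ifstream` opened on a `.raw` file to `read_raw_text` works the same way, because formatted extraction with `>>` skips any whitespace, including newlines, between values.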