Improving Load Imbalance in Thread-Parallel GEMM #5173
martin-frbg merged 1 commit into OpenMathLib:develop
Thank you for your measurements on Power10. For my results, I used the following values for OPENBLAS_LOOPS: Also, I set OMP_PROC_BIND=close. Additionally, the machines I used were AWS c7a and c7g instances, both of which were 1SMT. I don't know whether these are related to the jitter, but those are the points that concern me.
Hi @nakagawa-fj @martin-frbg, may I know how you ran this benchmark test? I would like to try it on PowerPC.
@pratiklp00 I simply ran
Thanks @martin-frbg
Closes #5172
This pull request improves how ranges of the output matrix C are partitioned among threads in parallel GEMM.
Our investigation revealed that the partitioning in the row direction can be uneven when the output matrix C is divided into 2D partitions in thread-parallel GEMM. This leads to an imbalance in which too much work is assigned to the lower-numbered threads.
We replaced the previous single loop with a two-level nested loop so that the work is distributed uniformly across threads.
This resulted in better thread balance and improved performance.
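To illustrate the kind of skew described above, here is a minimal sketch in C. It is not the OpenBLAS code and not the loop changed in this pull request; `partition_fixed`, `partition_balanced`, and `spread` are hypothetical names used only for illustration. It assumes a one-dimensional split of `m` rows over `nthreads` threads and compares a greedy fixed-chunk split, which front-loads the lower-numbered threads, with an even split.

```c
#include <assert.h>

/* Hypothetical greedy split: every thread is offered ceil(m / nthreads) rows.
 * Lower-numbered threads take full chunks; later threads get whatever is left,
 * which may be nothing at all. */
void partition_fixed(int m, int nthreads, int *width) {
    assert(m >= 0 && nthreads > 0);
    int chunk = (m + nthreads - 1) / nthreads;
    for (int t = 0; t < nthreads; t++) {
        int w = (m < chunk) ? m : chunk;
        width[t] = w;
        m -= w;
    }
}

/* Hypothetical even split: every thread gets floor(m / nthreads) rows and the
 * first (m % nthreads) threads get one extra row, so widths differ by at most 1. */
void partition_balanced(int m, int nthreads, int *width) {
    assert(m >= 0 && nthreads > 0);
    int base = m / nthreads;
    int extra = m % nthreads;
    for (int t = 0; t < nthreads; t++) {
        width[t] = base + (t < extra ? 1 : 0);
    }
}

/* Difference between the largest and smallest share: a simple imbalance measure. */
int spread(const int *width, int nthreads) {
    assert(nthreads > 0);
    int lo = width[0], hi = width[0];
    for (int t = 1; t < nthreads; t++) {
        if (width[t] < lo) lo = width[t];
        if (width[t] > hi) hi = width[t];
    }
    return hi - lo;
}
```

For example, with `m = 9` and `nthreads = 4`, `partition_fixed` yields widths 3, 3, 3, 0 (spread 3), while `partition_balanced` yields 3, 2, 2, 2 (spread 1). The actual change in this pull request operates on the 2D partitioning inside the GEMM driver and may differ in its details.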
The graph below shows improved performance and a smoother curve compared to v0.3.29.
Evaluation on other CPUs by this community would be helpful and appreciated.