
Improving Load Imbalance in Thread-Parallel GEMM#5173

Merged
martin-frbg merged 1 commit into OpenMathLib:develop from nakagawa-fj:gemm_load_imbalance on Mar 12, 2025

Conversation

@nakagawa-fj (Contributor) commented Mar 11, 2025

Closes #5172

This pull request improves how ranges of the output matrix C are partitioned among threads in parallel GEMM.

Our investigation revealed that when the output matrix C is divided into 2D partitions in thread-parallel GEMM calculations, the data partitioning in the row direction can be uneven. This leads to an imbalance where too much work is assigned to the lower-numbered threads.

We changed the previous single loop into 2-level nested loops in order to distribute the calculation uniformly.
This resulted in better thread balance and improved performance.
The graphs below show higher performance and a smoother curve compared to v0.3.29.

Evaluation on other CPUs by the community would be helpful and appreciated.

[Benchmark graphs: sgemm on EPYC (192 threads), EPYC (96 threads), Neoverse V1 (64 threads)]
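To illustrate the kind of imbalance being fixed, here is a small sketch (hypothetical Python, not the actual OpenBLAS code; `naive_ranges` and `balanced_ranges` are invented names): a naive split hands every thread a fixed ceil-sized chunk of rows, which can starve the last threads, while spreading the remainder one row at a time keeps any two threads within one row of each other.

```python
# Hypothetical sketch of distributing M rows of the output matrix C over T
# threads. This is NOT OpenBLAS's actual partitioning code, only an
# illustration of the uneven-vs-uniform distribution discussed above.

def naive_ranges(m, t):
    # Every thread takes a fixed chunk of ceil(m/t) rows;
    # the tail threads can end up with far less work (or none).
    chunk = -(-m // t)  # ceiling division
    ranges = []
    for i in range(t):
        start = min(i * chunk, m)
        end = min(start + chunk, m)
        ranges.append((start, end))
    return ranges

def balanced_ranges(m, t):
    # Give each thread floor(m/t) rows, then spread the remainder
    # one extra row at a time, so chunk sizes differ by at most one.
    base, rem = divmod(m, t)
    ranges, start = [], 0
    for i in range(t):
        end = start + base + (1 if i < rem else 0)
        ranges.append((start, end))
        start = end
    return ranges

# e.g. 10 rows over 4 threads:
# naive_ranges(10, 4)    -> chunk sizes 3, 3, 3, 1 (thread 3 underloaded)
# balanced_ranges(10, 4) -> chunk sizes 3, 3, 2, 2 (difference at most 1)
```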

@martin-frbg (Collaborator)

Thank you very much - I've only had time for a quick test on POWER10 (24 cores, 192 threads - host cfarm120 in the GCC Compile Farm) where this PR appears to result in faster runs overall, but a much more jittery graph:
before: [graph: before_patch]
after: [graph: with_patch]
(Not sure if this is due to some "invisible" load on this shared resource, or to some fundamental difference in CPU architecture - or maybe using OPENBLAS_LOOPS=100 for the benchmark was not enough to average out normal jitter.)

@nakagawa-fj (Contributor, Author)

Thank you for your measurements on Power10.

For my results, I used the following values for OPENBLAS_LOOPS:

For matrix sizes up to 1000: OPENBLAS_LOOPS=1000
For matrix sizes from 1000 upward: OPENBLAS_LOOPS=20

Also, I set OMP_PROC_BIND=close.

Additionally, the machines I used were AWS c7a and c7g instances, both of which are 1-way SMT.
POWER10 is 8-way SMT, so the SMT configurations differ.

I don't know whether these are related to the jitter, but those are the points that concern me.

@pratiklp00 (Contributor) commented Mar 12, 2025

Hi @nakagawa-fj @martin-frbg, may I ask how you ran this benchmark? I would like to try it on PowerPC.

@martin-frbg (Collaborator)

@pratiklp00 I simply ran sgemm.goto in the benchmark directory with parameters 1 5000 (and with the environment variable OPENBLAS_LOOPS set to 1000 in my case, to average each data point over 1000 runs). It turns out this is still a bit jiggly on this shared (but idle, to the best of my knowledge) hardware.
@nakagawa-fj thanks - indeed I had completely forgotten to set up the OMP environment for that run; most of the jitter is gone with OMP_PROC_BIND=CLOSE, with only a few glitches remaining. (And those are not reproducible in individual runs, so they are evidently related to some other system activity.)
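For anyone wanting to reproduce these runs, the steps described in the comments above could look roughly like this (a sketch, assuming an OpenBLAS source checkout with the library already built; the benchmark target name `sgemm.goto` and the `1 5000` size sweep are taken from the comment above):

```shell
# Assumes an OpenBLAS checkout with libopenblas already built (path is illustrative).
cd OpenBLAS/benchmark
make sgemm.goto                # build the single-precision GEMM benchmark

export OPENBLAS_LOOPS=1000     # average each data point over many runs
export OMP_PROC_BIND=close     # pin OpenMP threads; removes most of the jitter

./sgemm.goto 1 5000            # sweep matrix sizes from 1 to 5000
```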

@martin-frbg martin-frbg added this to the 0.3.30 milestone Mar 12, 2025
@martin-frbg (Collaborator)

Just for completeness, here is the mostly cleaned-up graph for POWER10 with this PR applied and OMP_PROC_BIND=CLOSE:
[graph: with_patch]

@martin-frbg martin-frbg merged commit 37b8547 into OpenMathLib:develop Mar 12, 2025
84 of 86 checks passed
@pratiklp00 (Contributor)

thanks @martin-frbg



Development

Successfully merging this pull request may close these issues.

The performance curve of parallel GEMM with many cores shows significant up-down

3 participants