
Improving Load Imbalance in Thread-Parallel GEMM#5173

Merged
martin-frbg merged 1 commit into OpenMathLib:develop from nakagawa-fj:gemm_load_imbalance on Mar 12, 2025

Conversation

@nakagawa-fj (Contributor) commented Mar 11, 2025

Closes #5172

This pull request improves how ranges of the output matrix C are partitioned among threads in parallel GEMM.

Our investigation revealed that when the output matrix C is divided into 2D partitions in thread-parallel GEMM calculations, the data partitioning in the row direction can be uneven. This leads to an imbalance where too much work is assigned to the lower-numbered threads.

We changed the previous single loop into 2-level nested loops in order to distribute the calculation uniformly.
This resulted in better thread balance and improved performance.
The graphs below show higher performance and a smoother curve compared to v0.3.29.

Evaluation on other CPUs by the community would be helpful and appreciated.

[Benchmark graphs: sgemm on EPYC (192 threads), EPYC (96 threads), Neoverse V1 (64 threads)]
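To illustrate the kind of imbalance being fixed, here is a small sketch (hypothetical Python, not the actual OpenBLAS code; `naive_ranges` and `balanced_ranges` are invented names): a naive split hands every thread a fixed ceil-sized chunk of rows, which can starve the last threads, while spreading the remainder one row at a time keeps any two threads within one row of each other.

```python
# Hypothetical sketch of distributing M rows of the output matrix C over T
# threads. This is NOT OpenBLAS's actual partitioning code, only an
# illustration of the uneven-vs-uniform distribution discussed above.

def naive_ranges(m, t):
    # Every thread takes a fixed chunk of ceil(m/t) rows;
    # the tail threads can end up with far less work (or none).
    chunk = -(-m // t)  # ceiling division
    ranges = []
    for i in range(t):
        start = min(i * chunk, m)
        end = min(start + chunk, m)
        ranges.append((start, end))
    return ranges

def balanced_ranges(m, t):
    # Give each thread floor(m/t) rows, then spread the remainder
    # one extra row at a time, so chunk sizes differ by at most one.
    base, rem = divmod(m, t)
    ranges, start = [], 0
    for i in range(t):
        end = start + base + (1 if i < rem else 0)
        ranges.append((start, end))
        start = end
    return ranges

# e.g. 10 rows over 4 threads:
# naive_ranges(10, 4)    -> chunk sizes 3, 3, 3, 1 (thread 3 underloaded)
# balanced_ranges(10, 4) -> chunk sizes 3, 3, 2, 2 (difference at most 1)
```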

@martin-frbg (Collaborator)

Thank you very much - I've only had time for a quick test on POWER10 (24 cores, 192 threads - host cfarm120 in the GCC Compile Farm) where this PR appears to result in faster runs overall, but a much more jittery graph:
before: [graph: before_patch]
after: [graph: with_patch]
(Not sure if this is due to some "invisible" load on this shared resource, or to some fundamental difference in CPU architecture - or maybe using OPENBLAS_LOOPS=100 for the benchmark was not enough to average out normal jitter.)

@nakagawa-fj (Contributor, Author)

Thank you for your measurements on Power10.

For my results, I used the following values for OPENBLAS_LOOPS:

For matrix sizes up to 1000: OPENBLAS_LOOPS=1000
For matrix sizes from 1000 upward: OPENBLAS_LOOPS=20

Also, I set OMP_PROC_BIND=close.

Additionally, the machines I used were AWS c7a and c7g instances, both of which are 1-way SMT.
POWER10 is 8-way SMT, so the SMT configurations differ.

I don't know whether these are related to the jitter, but those are the points that concern me.

@pratiklp00 (Contributor) commented Mar 12, 2025

Hi @nakagawa-fj @martin-frbg, may I ask how you ran this benchmark? I would like to try it on PowerPC.

@martin-frbg (Collaborator)

@pratiklp00 I simply ran sgemm.goto in the benchmark directory with parameters 1 5000 (and with the environment variable OPENBLAS_LOOPS set to 1000 in my case, to average each data point over 1000 runs). It turns out this is still a bit jiggly on this shared (but idle, to the best of my knowledge) hardware.
@nakagawa-fj thanks - indeed I had completely forgotten to set up the OMP environment for that run; most of the jitter is gone with OMP_PROC_BIND=CLOSE, with only a few glitches remaining. (And those are not reproducible in individual runs, so they are evidently related to some other system activity.)
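For anyone wanting to reproduce these runs, the steps described in the comments above could look roughly like this (a sketch, assuming an OpenBLAS source checkout with the library already built; the benchmark target name `sgemm.goto` and the `1 5000` size sweep are taken from the comment above):

```shell
# Assumes an OpenBLAS checkout with libopenblas already built (path is illustrative).
cd OpenBLAS/benchmark
make sgemm.goto                # build the single-precision GEMM benchmark

export OPENBLAS_LOOPS=1000     # average each data point over many runs
export OMP_PROC_BIND=close     # pin OpenMP threads; removes most of the jitter

./sgemm.goto 1 5000            # sweep matrix sizes from 1 to 5000
```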

@martin-frbg martin-frbg added this to the 0.3.30 milestone Mar 12, 2025
@martin-frbg (Collaborator)

Just for completeness, here is the mostly cleaned-up graph for POWER10 with this PR applied and OMP_PROC_BIND=CLOSE:
[graph: with_patch]

@martin-frbg martin-frbg merged commit 37b8547 into OpenMathLib:develop Mar 12, 2025
84 of 86 checks passed
@pratiklp00 (Contributor)

thanks @martin-frbg



Development

Successfully merging this pull request may close these issues.

The performance curve of parallel GEMM with many cores shows significant up-down

3 participants