
Improvement of 2D thread-partitioned GEMM for M << N case #5276

Merged
martin-frbg merged 1 commit into OpenMathLib:develop from nakagawa-fj:gemm_2d_thread_partitioning
May 21, 2025

Conversation

@nakagawa-fj
Contributor

@nakagawa-fj nakagawa-fj commented May 21, 2025

Closes #5270
The 2D thread partitioning in GEMM (PR #4655) requires nthreads_m % 2 == 0. When M << N, this can rule out the best combination of nthreads_m and nthreads_n on architectures such as A64FX (48 cores) or Grace (144 cores), whose core counts have prime factors other than 2.
Specifically, when the matrix dimension N is much larger than M, the number of threads assigned to the N direction should be increased.
However, if nthreads_m has a factor other than 2, such as 3, the condition 'nthreads_m % 2 == 0' prevents nthreads_n from being increased further.
This change removes the nthreads_m % 2 == 0 restriction and instead selects the combination that minimizes the objective function 'n * nthreads_m + m * nthreads_n'.
It improves the performance of multi-threaded GEMM when M << N.
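
For illustration only, the selection rule described above can be sketched as a small search over factor pairs. This is not the code merged in this PR. The function name `choose_partition` is hypothetical, and the sketch assumes that nthreads_m * nthreads_n must equal the total thread count and that no other limits (for example, minimum work per thread) apply.

```python
def choose_partition(m, n, nthreads):
    """Pick (nthreads_m, nthreads_n) minimizing n * nthreads_m + m * nthreads_n.

    Hypothetical helper: assumes nthreads_m * nthreads_n == nthreads and
    ignores any additional constraints a real implementation may impose.
    """
    best = None
    best_cost = None
    for nthreads_m in range(1, nthreads + 1):
        if nthreads % nthreads_m != 0:
            continue  # only exact factorizations of the thread count
        nthreads_n = nthreads // nthreads_m
        cost = n * nthreads_m + m * nthreads_n  # objective from the PR text
        if best_cost is None or cost < best_cost:
            best, best_cost = (nthreads_m, nthreads_n), cost
    return best


if __name__ == "__main__":
    # 48 threads (A64FX-like core count): an odd nthreads_m such as 3 is allowed.
    print(choose_partition(3000, 16000, 48))
    # Strongly skewed case, M << N: all threads go to the N direction.
    print(choose_partition(100, 100000, 48))
```

With 48 threads, m = 3000 and n = 16000, this search settles on nthreads_m = 3 and nthreads_n = 16, a split that a rule restricted to even values of nthreads_m cannot reach.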

@martin-frbg martin-frbg added this to the 0.3.30 milestone May 21, 2025
@martin-frbg
Collaborator

Thank you

@martin-frbg martin-frbg merged commit e2e6a4d into OpenMathLib:develop May 21, 2025
82 of 86 checks passed