Multi-thread Performance Improvement of GEMM on NeoverseV1 with DIVIDE_RATE=1 by nakagawa-fj · Pull Request #5407 · OpenMathLib/OpenBLAS

nakagawa-fj · 2025-07-29T10:09:15Z

This pull request improves multi-threaded GEMM performance on Neoverse V1 and addresses Issue #5347.
Unlike the A64FX fix in pull request #5353, it focuses on matrix size N=2.
Although the change primarily improves performance for N=2, some architectures may see further gains for N up to 6. To support this, a new macro, GEMM_DIVIDE_LIMIT, has been introduced to manage the DIVIDE_RATE threshold.
On AWS Graviton3E (Neoverse V1), this modification improved GEMM performance for N=2, as illustrated in the graph below.

[Graph: Multi-thread GEMM Performance Improvement on NeoverseV1 (DIVIDE_RATE=1)]
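To make the role of the threshold concrete, here is a minimal sketch of how a limit such as GEMM_DIVIDE_LIMIT could gate the divide rate. It is not the code from this pull request. The helper names `choose_divide_rate` and `chunk_width`, the macro `DEFAULT_DIVIDE_RATE`, and both numeric values are assumptions chosen for illustration.

```c
#include <assert.h>

/* Illustrative values only; not taken from the OpenBLAS sources. */
#ifndef DEFAULT_DIVIDE_RATE
#define DEFAULT_DIVIDE_RATE 2   /* assumed rate used for larger N */
#endif
#ifndef GEMM_DIVIDE_LIMIT
#define GEMM_DIVIDE_LIMIT 2     /* assumed limit; the PR text suggests up to 6 may help on some targets */
#endif

/* Hypothetical helper: use a divide rate of 1 when N is at or below the
 * limit, otherwise fall back to the default rate. */
static int choose_divide_rate(long n)
{
    return (n <= GEMM_DIVIDE_LIMIT) ? 1 : DEFAULT_DIVIDE_RATE;
}

/* Hypothetical helper: width of each column chunk when n columns are split
 * into divide_rate pieces (rounded up). */
static long chunk_width(long n, int divide_rate)
{
    assert(divide_rate > 0);
    return (n + divide_rate - 1) / divide_rate;
}
```

Under these assumed values, `choose_divide_rate(2)` returns 1, so an N=2 problem is processed as a single chunk of width 2 rather than two chunks of width 1. Raising the assumed limit to 6 would extend the same treatment to N up to 6.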

7e29f11

martin-frbg added this to the 0.3.31 milestone Jul 30, 2025

martin-frbg merged commit d23680b into OpenMathLib:develop Jul 30, 2025
77 of 88 checks passed

martin-frbg mentioned this pull request Aug 3, 2025

test_extensions/test_sgemmt.c fails with SME on Apple M4 #5414

Closed

martin-frbg merged 1 commit into OpenMathLib:develop from nakagawa-fj:feature/gemm_divide_rate_for_neoversev1
