Setting optimized [SD]GEMM_DEFAULT_[PQR] parameters for A64FX #5554
Merged
martin-frbg merged 1 commit into OpenMathLib:develop on Jan 11, 2026
Hi @hideaki-motoki -san. Overall LGTM. For the performance comparison between v0.3.30 and this updated version, could you clarify the build configuration?
Hi, @abhishek-iitmadras -san.
It was built with |
Resolves #5553.


The parameters `[SD]GEMM_DEFAULT_[PQR]` have been tuned to improve `[SD]GEMM` performance in a multi-process evaluation using all cores of `A64FX`. This change improves the performance of `[SD]GEMM`, as shown in the left and center figures. In this pull request, performance is compared between OpenBLAS v0.3.30 and the modified version (labeled "update"). I also confirmed that performance improves in the single-process evaluation, as shown in the right figure.

While performance improves in most Level 3 BLAS kernels, it degrades in the kernels for triangular matrices (`TRMM` and `TRSM`), for the same reason described in Issue #4742.

The figures above show the performance change in `GEMM`, `TRMM` and `TRSM`.

To understand the extent of the performance degradation in `TRMM` and `TRSM`, I evaluated the performance ratio relative to v0.3.30 for sizes up to 5,000 and summarized the results in the table below.

This indicates that, although part of the `TRMM` and `TRSM` performance decreases, there are benefits to fine-tuning the `[SD]GEMM_DEFAULT_[PQR]` parameters for `A64FX`.
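
For readers unfamiliar with what the P, Q and R parameters control, here is a minimal pure-Python sketch of cache blocking in a matrix multiply. It is not the OpenBLAS kernel or driver code. It assumes the usual reading of these parameters, in which P bounds the block size along M, Q along K and R along N. The function names `blocked_gemm` and `naive_gemm` and the block sizes used are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: this is NOT OpenBLAS code.
# Assumption: p, q, r play the roles of GEMM P (M block), Q (K block), R (N block).

def blocked_gemm(a, b, m, n, k, p, q, r):
    """Return C = A * B for lists of rows, visiting the data in p x q x r blocks."""
    c = [[0.0] * n for _ in range(m)]
    for js in range(0, n, r):                # panel of at most r columns of B and C
        j_end = min(js + r, n)
        for ls in range(0, k, q):            # slice of at most q along the K dimension
            l_end = min(ls + q, k)
            for i_s in range(0, m, p):       # block of at most p rows of A and C
                i_end = min(i_s + p, m)
                # Inner "kernel": update one block of C from one block of A and B.
                for i in range(i_s, i_end):
                    row_a, row_c = a[i], c[i]
                    for l in range(ls, l_end):
                        a_il, row_b = row_a[l], b[l]
                        for j in range(js, j_end):
                            row_c[j] += a_il * row_b[j]
    return c


def naive_gemm(a, b, m, n, k):
    """Unblocked reference product, used only to check the blocked version."""
    return [[sum(a[i][l] * b[l][j] for l in range(k)) for j in range(n)]
            for i in range(m)]
```

The block sizes change only the order in which elements are visited, not the result. In a real library that order determines how much of each operand stays in cache, which is why tuning the three values can help `GEMM` while affecting other kernels that share them differently.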