Implementing SVE in [SD]AXPY Kernels for A64FX and Graviton3E #5426
martin-frbg merged 2 commits into OpenMathLib:develop
DGEMVTKERNEL = gemv_t_sve_v1x3.c

SAXPYKERNEL = axpy_sve.c
DAXPYKERNEL = axpy_sve.c
Since you have used the SVL for the implementation instead of hardcoding the vector width, the kernel should work on NEOVERSEV2 as well. Please check this on Graviton4 and add it to KERNEL.NEOVERSEV2 too.
I evaluated performance on a Grace system equipped with Neoverse V2, as I did not have access to a Graviton4 for testing. AXPY with SVE showed no significant performance improvement over the original version.
This graph shows the single-thread performance of DAXPY on Grace. For this pull request, there is little advantage to implementing SVE on Neoverse V2.
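For readers following the point about not hardcoding the vector width, here is a minimal sketch of a vector-length-agnostic DAXPY loop. It is not the code from axpy_sve.c in this PR. The function names `daxpy_sve_sketch` and `daxpy_chunked` are hypothetical, unit stride is assumed, and the sketch calls the ACLE intrinsic `svcntd()` directly, which is only an assumption about what `SV_COUNT()` resolves to for doubles. The second function is a portable scalar emulation of the same loop shape, where `vl` stands in for the hardware vector length.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

/* Hypothetical sketch: y[i] += da * x[i], unit stride, any SVE vector length. */
void daxpy_sve_sketch(int64_t n, double da, const double *x, double *y)
{
    /* svcntd() is the number of 64-bit lanes per vector on this CPU. */
    for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {
        /* Predicate is true only for lanes whose index i + lane is below n. */
        svbool_t pg = svwhilelt_b64(i, n);
        svfloat64_t vx = svld1_f64(pg, x + i);
        svfloat64_t vy = svld1_f64(pg, y + i);
        vy = svmla_n_f64_x(pg, vy, vx, da);   /* vy + vx * da */
        svst1_f64(pg, y + i, vy);
    }
}
#endif

/* Portable emulation of the same structure; vl plays the role of svcntd(). */
void daxpy_chunked(int64_t n, double da, const double *x, double *y, int64_t vl)
{
    assert(vl > 0);
    for (int64_t i = 0; i < n; i += vl) {
        /* A lane is "active" only while i + lane < n, like the predicate above. */
        for (int64_t lane = 0; lane < vl && i + lane < n; lane++)
            y[i + lane] += da * x[i + lane];
    }
}
```

Because the loop never assumes a particular lane count, the same source gives identical results for any vector length. Whether it is faster than the existing kernel on a given core is a separate question that has to be measured.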
BLASLONG sve_size = SV_COUNT();

if (n < 0) return (0);
if (da == 0.0) return (0);
Why can't these two checks be combined into one?
Thank you for your comments.
The checks could be combined as you suggest, but I followed the existing code in kernel/arm/axpy.c#L45-L46.
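As a small illustration of the two options discussed here, the sketch below compares the separate checks with a combined one. Both function names are hypothetical, and the return value 1 is only a marker for "would continue to the update loop"; it is not what the real kernel returns.

```c
/* Hypothetical stand-ins for the early-exit logic only. */

/* Two separate checks, in the style of the snippet under review. */
int axpy_guard_separate(long n, double da)
{
    if (n < 0) return 0;
    if (da == 0.0) return 0;
    return 1; /* marker: would proceed to the update loop */
}

/* One combined check; || short-circuits, so the order of tests is unchanged. */
int axpy_guard_combined(long n, double da)
{
    if (n < 0 || da == 0.0) return 0;
    return 1; /* marker: would proceed to the update loop */
}
```

The two forms take the same branch for every input, so the choice between them is a matter of style and of consistency with the other kernels.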
Hi @hideaki-motoki, thanks for the PR! I have added a few comments.
Resolves #5417.


This change improves the performance of [SD]AXPY on both A64FX and Graviton3E.

The graphs below show the single-thread performance improvement of [D]AXPY on A64FX and Graviton3E, respectively.

The performance improved by 2.57 times on the A64FX and 1.13 times on the Graviton3E.

I have confirmed that this optimization also yields performance benefits for Level 2 BLAS kernels that utilize [SD]AXPY, such as [SD]SPMV and [SD]GER.
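To show why a faster AXPY carries over to a routine like GER, here is a minimal reference sketch of the rank-1 update A := alpha * x * y^T + A written as one AXPY per column. The names `ref_daxpy` and `ref_dger` are hypothetical, the matrix is assumed to be column-major with unit-stride vectors, and this is a plain illustration rather than the library's actual driver code.

```c
#include <stddef.h>

/* Reference AXPY: y := alpha * x + y, unit stride. */
void ref_daxpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Reference GER: A := alpha * x * y^T + A.
   A is m-by-n, column-major, with leading dimension lda >= m. */
void ref_dger(size_t m, size_t n, double alpha,
              const double *x, const double *y, double *a, size_t lda)
{
    /* Column j of A receives (alpha * y[j]) * x, which is exactly one AXPY. */
    for (size_t j = 0; j < n; j++)
        ref_daxpy(m, alpha * y[j], x, a + j * lda);
}
```

In this formulation all floating-point work happens inside the AXPY call, so any speed-up of that kernel applies to each column update.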