Further performance improvements to non-transposed [SD]GEMV kernels for A64FX and Neoverse V1. by iha-taisei · Pull Request #5220 · OpenMathLib/OpenBLAS

iha-taisei · 2025-04-11T11:18:14Z

This pull request proposes a patch for issue #5210.
I have implemented loop unrolling in the non-transposed [SD]GEMV kernel for A64FX and Neoverse V1.
This pull request improves performance by 1.7x on A64FX and 2x on Neoverse V1 compared to v0.3.29.
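
For readers unfamiliar with the technique, here is a minimal scalar sketch of column unrolling in a non-transposed GEMV (y += alpha * A * x, with A stored column-major). It is not the code in this PR: the actual kernel uses SVE, and the function names `gemv_n_ref` and `gemv_n_unroll3` as well as the unroll factor of 3 are assumptions made only for illustration.

```c
#include <stddef.h>

/* Reference version: y += alpha * A * x.
 * A is m-by-n, column-major, with leading dimension lda. */
static void gemv_n_ref(size_t m, size_t n, double alpha,
                       const double *a, size_t lda,
                       const double *x, double *y)
{
    for (size_t j = 0; j < n; j++) {
        const double t = alpha * x[j];
        for (size_t i = 0; i < m; i++)
            y[i] += t * a[i + j * lda];
    }
}

/* Hypothetical unrolled version: three columns are processed per pass,
 * so each y[i] is loaded and stored once per three columns instead of
 * once per column. */
static void gemv_n_unroll3(size_t m, size_t n, double alpha,
                           const double *a, size_t lda,
                           const double *x, double *y)
{
    size_t j = 0;

    for (; j + 3 <= n; j += 3) {
        const double *a0 = a + j * lda;
        const double *a1 = a0 + lda;
        const double *a2 = a1 + lda;
        const double t0 = alpha * x[j];
        const double t1 = alpha * x[j + 1];
        const double t2 = alpha * x[j + 2];

        for (size_t i = 0; i < m; i++)
            y[i] += t0 * a0[i] + t1 * a1[i] + t2 * a2[i];
    }

    /* Remainder loop for the last n % 3 columns. */
    for (; j < n; j++) {
        const double t = alpha * x[j];
        const double *aj = a + j * lda;
        for (size_t i = 0; i < m; i++)
            y[i] += t * aj[i];
    }
}
```

The unrolled loop sums in a different order than the reference, so results can differ in the last bits for general floating-point inputs.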

annop-w · 2025-04-15T21:46:39Z

@iha-taisei Thank you very much for your contribution. Could we please then remove gemv_n_sve.c as well, since it will no longer be used? I think we should try to keep the number of kernels low.

annop-w · 2025-04-15T22:32:31Z

@iha-taisei I have just benchmarked gemv_n_sve_v1x3.c against the NEON kernel in #5225 on NEOVERSEV2, and the SVE version wins slightly.

So, would you mind adding gemv_n_sve_v1x3.c in kernel/arm64/KERNEL.NEOVERSEN2 as well, please? I will close #5225 in favor of this one. Thank you.

martin-frbg · 2025-04-16T07:39:22Z

> @iha-taisei Thank you very much for your contribution. Could we please then remove gemv_n_sve.c as well, since it will no longer be used? I think we should try to keep the number of kernels low.

I'm not convinced that we need to remove kernel files simply because they are (currently) not in use by any hardware.

abhishek-iitmadras · 2025-04-16T08:24:26Z

Hi @annop-w

Can we use gemv_n_sve_v1x3.c for KERNEL.ARMV8SVE, like we have already for [S/D]GEMVTKERNEL with patch #5215?

cc @iha-taisei

iha-taisei · 2025-04-16T08:50:05Z

Hi @annop-w

> So, would you mind adding gemv_n_sve_v1x3.c in kernel/arm64/KERNEL.NEOVERSEN2 as well, please? I will close #5225 in favor of this one. Thank you.

Yes, you can add the kernel to KERNEL.NEOVERSEN2, although I haven't evaluated it on such CPUs.

annop-w · 2025-04-16T09:40:11Z

@iha-taisei

> Yes, you can add the kernel to KERNEL.NEOVERSEN2, although I haven't evaluated it on such CPUs.

The results in my comment above are for NEOVERSEV2, which currently uses the same settings as NEOVERSEN2 for DYNAMIC_ARCH. I have not benchmarked on N2, but I believe the result will hold there as well and we will see a speedup.

annop-w · 2025-04-16T09:42:38Z

@martin-frbg

> I'm not convinced that we need to remove kernel files simply because they are (currently) not in use by any hardware.

From a quick look, the kernel gemv_n_sve.c in #5157 is very similar to the one in this PR (the same unrolling factor). That's why I suggested removing the old one. In fact, it would be interesting to understand why this one performs significantly better.

annop-w · 2025-04-16T11:38:20Z

@abhishek-iitmadras

> Can we use gemv_n_sve_v1x3.c for KERNEL.ARMV8SVE, like we have already for [S/D]GEMVTKERNEL with patch #5215?

Yes, although I have not benchmarked on those CORTEX-A and CORTEX-X cores. Seeing how this new SVE kernel outperforms the assembly one on V1 and V2, I would expect the same on those cores.

martin-frbg · 2025-04-16T11:52:29Z

> @abhishek-iitmadras
>
> > Can we use gemv_n_sve_v1x3.c for KERNEL.ARMV8SVE, like we have already for [S/D]GEMVTKERNEL with patch #5215?
>
> Yes, although I have not benchmarked on those CORTEX-A and CORTEX-X cores. Seeing how this new SVE kernel outperforms the assembly one on V1 and V2, I would expect the same on those cores.

I can benchmark on a Pixel8, if we can agree that it is an underrated supercomputer (and if I can find the time and energy for non-trivial work again).

Further performance improvements to [SD]GEMV.

f1e628b

martin-frbg added this to the 0.3.30 milestone Apr 11, 2025

annop-w mentioned this pull request Apr 15, 2025

Improve performance for SGEMVN on NEONVERSEN1 #5225

Merged

martin-frbg merged commit 0241d51 into OpenMathLib:develop Apr 16, 2025
85 of 86 checks passed

tetsuzo-usui mentioned this pull request Apr 17, 2025

Segmentation Fault Introduced by PR #5220 #5231

Closed

annop-w mentioned this pull request Apr 22, 2025

Use SVE kernel for S/DGEMVN for SVE machines #5239

Merged

iha-taisei mentioned this pull request Apr 22, 2025

Fix: Potential out-of-bounds read in non-transposed [SD]GEMV kernels for A64FX and Neoverse V1. #5240

Merged

martin-frbg merged 1 commit into OpenMathLib:develop from iha-taisei:sdgemv_n_unroll
