Further performance improvements to non-transposed [SD]GEMV kernels for A64FX and Neoverse V1. #5220
Conversation
@iha-taisei Thank you very much for your contribution. Could we please then remove |
@iha-taisei I have just benchmarked the So, would you mind adding |
I'm not convinced that we need to remove kernel files simply because they are (currently) not in use by any hardware |
Hi @annop-w Can we use gemv_n_sve_v1x3.c for KERNEL.ARMV8SVE, like we have already for [S/D]GEMVTKERNEL with patch #5215? cc @iha-taisei |
I have results for NEOVERSEV2, which currently uses the same settings as NEOVERSEN2 for DYNAMIC_ARCH, in my above comment. I have not benchmarked on N2 but I believe the result will hold as well and we will see speedup. |
From a quick look, the kernel |
Yes, but I have not benchmarked on those CORTEX-As and -Xs. Seeing how this new SVE kernel outperforms the assembly one on V1 and V2, though, I would expect the same on those cores. |
I can benchmark on a Pixel8, if we can agree that it is an underrated supercomputer (and if I can find the time and energy for non-trivial work again) |

close #5210
This pull request proposes a patch for issue #5210.


I have implemented loop unrolling in the non-transposed [SD]GEMV kernels for A64FX and Neoverse V1.
This pull request improves performance by 1.7x on A64FX and 2x on Neoverse V1 compared to v0.3.29.
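
For readers unfamiliar with the technique, here is a minimal sketch of loop unrolling in a non-transposed GEMV (`y += alpha * A * x`, column-major `A`). It is plain scalar C, not the SVE kernel from this patch. The function names `gemv_n_ref` and `gemv_n_unroll3` are hypothetical, and the unroll factor of three columns is an assumption, loosely suggested by the file name `gemv_n_sve_v1x3.c` mentioned in the conversation; the actual kernel may be organized differently.

```c
#include <assert.h>
#include <stddef.h>

/* Reference (hypothetical name): y += alpha * A * x.
 * A is m-by-n, column-major, with leading dimension lda. */
static void gemv_n_ref(size_t m, size_t n, double alpha,
                       const double *a, size_t lda,
                       const double *x, double *y)
{
    assert(lda >= m);
    for (size_t j = 0; j < n; j++) {
        const double t = alpha * x[j];
        for (size_t i = 0; i < m; i++)
            y[i] += t * a[j * lda + i];
    }
}

/* Unrolled variant (hypothetical name, assumed unroll factor of 3):
 * three columns are consumed per pass, so each y[i] is loaded and
 * stored once per three columns instead of once per column. */
static void gemv_n_unroll3(size_t m, size_t n, double alpha,
                           const double *a, size_t lda,
                           const double *x, double *y)
{
    assert(lda >= m);
    size_t j = 0;
    for (; j + 3 <= n; j += 3) {
        const double *a0 = a + (j + 0) * lda;
        const double *a1 = a + (j + 1) * lda;
        const double *a2 = a + (j + 2) * lda;
        const double t0 = alpha * x[j + 0];
        const double t1 = alpha * x[j + 1];
        const double t2 = alpha * x[j + 2];
        for (size_t i = 0; i < m; i++)
            y[i] += t0 * a0[i] + t1 * a1[i] + t2 * a2[i];
    }
    /* Remainder: handle the last n % 3 columns one at a time. */
    for (; j < n; j++) {
        const double t = alpha * x[j];
        for (size_t i = 0; i < m; i++)
            y[i] += t * a[j * lda + i];
    }
}
```

The unrolled form changes the order of floating-point additions, so results can differ from the reference in the last bits for general inputs. Any speedup from a sketch like this depends on the compiler and hardware and says nothing about the figures reported above.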