kernel/riscv64: fixed the performance problem in RISCV64_ZVL256 when OPENBLAS_K is small #5291
This PR is directly related to issue #5286.
Thanks - that's pretty much the same as what I came up with in my initial experimentation. I do wonder if the _rvv kernels (used by the ZVL128B and x280 targets) perform better - one oddity I noticed is that the _vector kernels always request the maximum vector length (VSETVL_MAX), while their _rvv counterparts seem to try to match the vector length to the actual amount of data. (I'm still rather new to RISC-V though, so I may be misreading the code...)
I'm also relatively new to RVV. My understanding is that both kernels try to match the vector length to the actual amount of data. VSETVL_MAX is generally only used outside the computation loop, to initialize some vector variables (or perform other setup operations) that must be long enough to be used in the subsequent vector computations. Once inside the loop, both kernels use VSETVL to request the appropriate length.
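To make the pattern concrete, here is a minimal plain-C sketch of it, not the actual kernel code. `SKETCH_VLMAX`, `sketch_vsetvl` and `sketch_dot` are hypothetical names made up for illustration; the stand-in simply grants `min(requested, VLMAX)` lanes, which simplifies what real hardware does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lane count per vector register group. On real hardware this
   follows from VLEN, SEW and LMUL rather than from a constant. */
#define SKETCH_VLMAX 8

/* Stand-in for a VSETVL request: grants min(requested, SKETCH_VLMAX) lanes. */
static size_t sketch_vsetvl(size_t requested)
{
    return requested < SKETCH_VLMAX ? requested : SKETCH_VLMAX;
}

/* Dot product written in strip-mined form. */
static double sketch_dot(size_t n, const double *x, const double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    /* Outside the loop: the accumulator is set up at full width, which is
       the role VSETVL_MAX plays in the description above. */
    double acc[SKETCH_VLMAX] = {0.0};

    /* Inside the loop: each pass requests only the lanes still needed. */
    for (size_t i = 0; i < n; ) {
        size_t vl = sketch_vsetvl(n - i);
        for (size_t lane = 0; lane < vl; lane++)
            acc[lane] += x[i + lane] * y[i + lane];
        i += vl;
    }

    /* Final reduction over the full-width accumulator. */
    double sum = 0.0;
    for (size_t lane = 0; lane < SKETCH_VLMAX; lane++)
        sum += acc[lane];
    return sum;
}
```

With 11 elements and 8 lanes, the loop runs one full pass and one 3-lane tail pass, while the accumulator stays 8 lanes wide throughout.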
Ah, you're right of course, I missed the later vsetvl() in zdot_vector.c.
I made these two code modifications to address the HBMV issue. When the problem size is very small, the RVV code performs very poorly, so in that case I call the unvectorized code instead.
The numbers 8 and 16 in the code are the break-even points I found: around these values, the RVV version and the unvectorized version perform about the same.
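The fallback can be sketched roughly as follows. This is an illustration of the idea only, not the code in this PR: all names are hypothetical, the "vector" path is a plain-C chunked loop standing in for the RVV kernel, and the cutoff of 8 and the strict `<` comparison are assumptions.

```c
#include <stddef.h>

/* Hypothetical cutoff. The PR uses 8 and 16; which value applies to which
   kernel, and whether the comparison is strict, is assumed here. */
#define HBMV_SKETCH_SMALL_K 8

/* Unvectorized reference path. */
static double hbmv_sketch_dot_scalar(size_t k, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < k; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Stand-in for the RVV path: four-lane chunks plus a scalar tail. */
static double hbmv_sketch_dot_chunked(size_t k, const double *x, const double *y)
{
    double acc[4] = {0.0, 0.0, 0.0, 0.0};
    size_t i = 0;
    for (; i + 4 <= k; i += 4)
        for (size_t lane = 0; lane < 4; lane++)
            acc[lane] += x[i + lane] * y[i + lane];
    double sum = acc[0] + acc[1] + acc[2] + acc[3];
    for (; i < k; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Returns 1 when the problem is small enough to prefer the scalar path. */
static int hbmv_sketch_use_scalar(size_t k)
{
    return k < HBMV_SKETCH_SMALL_K;
}

/* Dispatch: scalar code below the cutoff, vector-style code otherwise. */
static double hbmv_sketch_dot(size_t k, const double *x, const double *y)
{
    if (hbmv_sketch_use_scalar(k))
        return hbmv_sketch_dot_scalar(k, x, y);
    return hbmv_sketch_dot_chunked(k, x, y);
}
```

Both paths return the same result; only the cost differs, so the cutoff can be tuned per target without affecting correctness.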