Skip to content

Optimize gemv_n_sve kernel#5157

Merged
martin-frbg merged 1 commit intoOpenMathLib:developfrom
manaalmj:feature
Mar 12, 2025
Merged

Optimize gemv_n_sve kernel#5157
martin-frbg merged 1 commit intoOpenMathLib:developfrom
manaalmj:feature

Conversation

@manaalmj
Copy link
Copy Markdown
Contributor

@manaalmj manaalmj commented Feb 28, 2025

Loop-unrolling of sgemv_n kernel is implemented with svmla_lane, along with a main loop and a tail loop.
The graph below shows performance improvement for OMP_NUM_THREADS=1 on Graviton-3:

speedup

Comment thread kernel/arm64/gemv_n_sve.c Outdated
@fadara01
Copy link
Copy Markdown
Contributor

fadara01 commented Mar 3, 2025

Can we also update the plot s.t:

  • it's discontinuous - don't connect the points to each other
  • showing clearly what the current state of affairs is vs this PR - maybe a red horizontal line at speedup=1
  • It might also make sense to use log scale for the Y-axis.
  • X-axis label is MxKxN, right?
  • add the number of threads/machine to the title

Comment thread kernel/arm64/gemv_n_sve.c Outdated
@martin-frbg
Copy link
Copy Markdown
Collaborator

yes, an explanation of the benchmark graph would be nice - I'm not sure I understand why your PR would result in two very narrow regions where there is more than tenfold speedup ?

@manaalmj
Copy link
Copy Markdown
Contributor Author

manaalmj commented Mar 7, 2025

yes, an explanation of the benchmark graph would be nice - I'm not sure I understand why your PR would result in two very narrow regions where there is more than tenfold speedup ?

Please refer to the updated graph. For the regions where we see spikes, The baseline performance of these problem sizes are poor and that's why we see spike in speedup.

@manaalmj
Copy link
Copy Markdown
Contributor Author

manaalmj commented Mar 7, 2025

Can we also update the plot s.t:

  • it's discontinuous - don't connect the points to each other
  • showing clearly what the current state of affairs is vs this PR - maybe a red horizontal line at speedup=1
  • It might also make sense to use log scale for the Y-axis.
  • X-axis label is MxKxN, right?
  • add the number of threads/machine to the title

Please see the updated graph.

@fadara01
Copy link
Copy Markdown
Contributor

@martin-frbg could you please re-review?

Comment thread kernel/arm64/KERNEL.ARMV8SVE Outdated
Comment thread kernel/arm64/gemv_n_sve.c
Comment thread kernel/arm64/gemv_n_sve.c
@martin-frbg martin-frbg added this to the 0.3.30 milestone Mar 12, 2025
@martin-frbg
Copy link
Copy Markdown
Collaborator

Looks good to me, thank you very much.

@martin-frbg martin-frbg merged commit a3e7b16 into OpenMathLib:develop Mar 12, 2025
86 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants