[RVV] Optimizing x32 transpose microkernel #9996
Merged
copybara-service[bot] merged 2 commits into google:master, Apr 20, 2026
Conversation
Hi @dsharletg, @fbarchard, @ken-unger, is this something that would be of interest?
dsharlet (Collaborator) approved these changes, Apr 20, 2026, and left a comment:
Very nice! A vlen-agnostic transpose is great.
Contributor: @keaganchern, will you update transpose-config.c to use 8xv2 in a new PR?
XNNPACK: Optimize RVV x32 transposec microkernel
Summary
Optimizes the RVV x32 transpose microkernel using split segmented loads with interleaved store scheduling and full tail handling for arbitrary block_width. Measured on Saturn (https://github.com/ucb-bar/saturn-vectors, RVV vector unit, VLEN=512): +25% geomean speedup across benchmarked shapes, +54% peak at 512×512 over the fastest correct existing baseline (16x8_rvv) at VLEN=512. As a bonus, each new LMUL variant (8xv1_rvv, 8xv2_rvv, 8xv4_rvv) is VLEN-agnostic.
Optimizations
- Per-LMUL segmented loads of 8/LMUL fields each (1× vlsseg8 at m1, 2× vlsseg4 at m2, 4× vlsseg2 at m4): always legal under RVV's EMUL × NFIELDS ≤ 8 constraint, while scaling rows per outer iteration with LMUL at the same total bandwidth.
- Dynamic vsetvl at each outer iteration means one compiled binary is correct on any VLEN.
Performance
Hardware: Saturn (https://github.com/ucb-bar/saturn-vectors, RVV vector unit, VLEN=512). Benchmark: bench/xN-transposec, shapes from the default suite. Times in ns; speedup of 8xv2_rvv (this PR) vs 16x8_rvv (the fastest correct existing kernel at VLEN=512). Geomean: 1.25× (+25%). Peak: 1.54× at 512².
Secondary benefit: VLEN-agnostic kernels
Using dynamic
vsetvlat the top of every outer iteration (instead of baking atile_heightconstant at codegen) means each new LMUL variant is individually VLEN-agnostic:8xv1_rvv,8xv2_rvv, and8xv4_rvveach compile to a single binary that is correct on any VLEN ≥ 128.Attribution
Optimization using Autocomp.
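The loop structure described above can be modeled in scalar C: 8 strided rows are consumed per outer iteration (the real kernel loads them with segmented loads such as vlsseg8e32, and vsetvl picks the column count per pass), and the tail paths cover any remaining rows and any block_width. This is a sketch of the schedule only, not the RVV microkernel itself; all names and parameters are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the tiled x32 transpose. Input is block_height x
 * block_width, row-major, with strides given in elements. The vector
 * kernel would load `rows` strided rows at once via segmented loads and
 * store vl transposed columns per pass; collapsing vl into a plain loop
 * leaves the tail handling for arbitrary block_width unchanged. */
static void transpose_x32_model(const uint32_t *in, uint32_t *out,
                                size_t block_height, size_t block_width,
                                size_t in_stride, size_t out_stride) {
  const size_t tile_h = 8;  /* rows consumed per segmented-load group */
  for (size_t i = 0; i < block_height; i += tile_h) {
    /* Row tail: final group may have fewer than 8 rows. */
    size_t rows = block_height - i < tile_h ? block_height - i : tile_h;
    /* vsetvl analogue: hardware would cap j's trip count at VLMAX and
     * re-issue; the scalar model just walks every column, including the
     * column tail for arbitrary block_width. */
    for (size_t j = 0; j < block_width; j++) {
      for (size_t r = 0; r < rows; r++) {
        out[j * out_stride + (i + r)] = in[(i + r) * in_stride + j];
      }
    }
  }
}
```

The model makes the legality argument concrete: each outer iteration touches a fixed 8-row band regardless of LMUL, which is why the 8/LMUL-field segmented-load variants all move the same data per iteration.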