
[RVV] Optimizing x32 transpose microkernel#9996

Merged
copybara-service[bot] merged 2 commits into google:master from keaganchern:optimized-transpose
Apr 20, 2026

Conversation

@keaganchern
Contributor

@keaganchern keaganchern commented Apr 17, 2026

XNNPACK: Optimize RVV x32 transposec microkernel

Summary

Optimizes the RVV x32 transpose microkernel using split segmented loads with interleaved store scheduling and full tail handling for arbitrary block_width. Measured on Saturn (https://github.com/ucb-bar/saturn-vectors, RVV vector unit, VLEN=512): +25% geomean speedup across benchmarked shapes, +54% peak at 512×512 over the fastest correct existing baseline (16x8_rvv) for VLEN=512. As a bonus, each new LMUL variant (8xv1_rvv, 8xv2_rvv, 8xv4_rvv) is VLEN-agnostic.

Optimizations

  • LMUL-parameterized template with split segmented loads. The kernel runs at LMUL ∈ {1, 2, 4}. For each LMUL, the 8-column tile is split into LMUL segmented loads of 8/LMUL fields each (1× vlsseg8_m1, 2× vlsseg4_m2, 4× vlsseg2_m4) — always legal under RVV's EMUL × NFIELDS ≤ 8 constraint while scaling rows per outer iteration with LMUL at the same total bandwidth.
  • Interleaved load/store scheduling. The template emits the first half of tuple0's stores between load0 and load1, so the second load's address generation overlaps with the first load's consumption. This pays off on any implementation with decoupled vector load/store pipelines.
  • VLEN-agnostic outer loop. Dynamic vsetvl at each outer iteration means one compiled binary is correct.
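The loop structure described above can be modeled in scalar C. This is an illustrative sketch, not the actual XNNPACK kernel source: the function name and signature are hypothetical, the real microkernel replaces the two inner loops with split segmented vector loads (e.g. 2× vlsseg4 at LMUL=2) plus strided stores, and `vl` comes from a dynamic vsetvl rather than the `vlmax` parameter here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the tiled x32 transpose loop structure.
 * TILE_W is the 8-column tile covered by one group of segmented loads. */
enum { TILE_W = 8 };

static void x32_transpose_model(const uint32_t* in, uint32_t* out,
                                size_t block_height, size_t block_width,
                                size_t vlmax /* models (VLEN/32) * LMUL */) {
  for (size_t i = 0; i < block_height;) {
    /* Dynamic vl: a full vlmax rows, or the remaining tail rows. */
    size_t vl = block_height - i < vlmax ? block_height - i : vlmax;
    for (size_t j = 0; j < block_width; j += TILE_W) {
      /* Tail handling for arbitrary block_width. */
      size_t w = block_width - j < TILE_W ? block_width - j : TILE_W;
      for (size_t c = 0; c < w; c++) {   /* one segment "field" per column */
        for (size_t r = 0; r < vl; r++) {
          out[(j + c) * block_height + (i + r)] =
              in[(i + r) * block_width + (j + c)];
        }
      }
    }
    i += vl;
  }
}
```

Because `vl` is recomputed each outer iteration, the same loop is correct for any `vlmax`, which is the scalar analogue of the VLEN-agnostic property claimed above.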

Performance

Hardware: Saturn (https://github.com/ucb-bar/saturn-vectors, vector unit, VLEN=512). Benchmark: bench/xN-transposec, shapes from the default suite. Times in ns; speedup vs 16x8_rvv (the fastest correct existing kernel at VLEN=512):

| Shape | 16x8_rvv (baseline, ns) | 8xv2_rvv (this PR, ns) | Speedup |
|-------|------------------------:|-----------------------:|--------:|
| 32²   | 1520    | 1182    | 1.29× |
| 64²   | 5613    | 4455    | 1.26× |
| 117²  | 19076   | 15912   | 1.20× |
| 128²  | 22169   | 17714   | 1.25× |
| 256²  | 104721  | 91675   | 1.14× |
| 512²  | 897587  | 582929  | 1.54× |
| 1024² | 4092788 | 3572526 | 1.15× |

Geomean: 1.25× (+25%). Peak: 1.54× at 512².

Secondary benefit: VLEN-agnostic kernels

Using dynamic vsetvl at the top of every outer iteration (instead of baking a tile_height constant at codegen) means each new LMUL variant is individually VLEN-agnostic: 8xv1_rvv, 8xv2_rvv, and 8xv4_rvv each compile to a single binary that is correct on any VLEN ≥ 128.
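As a sketch of how the pieces above fit together, here is the 8xv2 outer-loop pattern in pseudocode. The names are illustrative only; the actual intrinsic spellings and store schedule in the merged kernel may differ:

```
i = 0
while i < block_height:
    vl = vsetvl_e32m2(block_height - i)     # rows this iteration; VLEN-agnostic
    # Split segmented loads: 2 x vlsseg4 at LMUL=2.
    # EMUL(2) x NFIELDS(4) = 8 <= 8, so each load group is legal.
    tuple0 = vlsseg4(row i, columns 0..3, stride = row pitch)
    store first half of tuple0's fields      # overlaps with next load's addr-gen
    tuple1 = vlsseg4(row i, columns 4..7, stride = row pitch)
    store remaining fields of tuple0, then tuple1
    i += vl
```

Because `vl` is re-derived from the remaining row count each iteration, the same binary handles any VLEN ≥ 128 and any block_height tail without specialization.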

Attribution

Optimization using Autocomp.

@charleshong3

Hi @dsharletg @fbarchard @ken-unger, is this something that would be of interest?

Collaborator

@dsharlet dsharlet left a comment


Very nice! A vlen-agnostic transpose is great.

@copybara-service copybara-service Bot merged commit 6af4ee5 into google:master Apr 20, 2026
3 checks passed
@ken-unger
Contributor

@keaganchern will you update transpose-config.c to use 8xv2 in a new PR?

