
[RVV] Optimizing x32 transpose microkernel#9996

Merged
copybara-service[bot] merged 2 commits into google:master from keaganchern:optimized-transpose
Apr 20, 2026

Conversation

@keaganchern
Contributor

@keaganchern keaganchern commented Apr 17, 2026

XNNPACK: Optimize RVV x32 transposec microkernel

Summary

Optimizes the RVV x32 transpose microkernel using split segmented loads with interleaved store scheduling and full tail handling for arbitrary block_width. Measured on Saturn (https://github.com/ucb-bar/saturn-vectors, RVV vector unit, VLEN=512): +25% geomean speedup across benchmarked shapes, +54% peak at 512×512 over the fastest correct existing baseline (16x8_rvv) for VLEN=512. As a bonus, each new LMUL variant (8xv1_rvv, 8xv2_rvv, 8xv4_rvv) is VLEN-agnostic.

Optimizations

  • LMUL-parameterized template with split segmented loads. The kernel runs at LMUL ∈ {1, 2, 4}. For each LMUL, the 8-column tile is split into LMUL segmented loads of 8/LMUL fields each (1× vlsseg8_m1, 2× vlsseg4_m2, 4× vlsseg2_m4) — always legal under RVV's EMUL × NFIELDS ≤ 8 constraint while scaling rows per outer iteration with LMUL at the same total bandwidth.
  • Interleaved load/store scheduling. The template emits the first half of tuple0's stores between load0 and load1, so the second load's address generation overlaps with the first load's consumption. This pays off on any implementation with decoupled vector load/store pipelines.
  • VLEN-agnostic outer loop. Dynamic vsetvl at each outer iteration means one compiled binary is correct.
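The loop structure described above can be modeled in scalar C. This is an illustrative sketch, not the actual XNNPACK kernel source: the function name and signature are hypothetical, the real microkernel replaces the two inner loops with split segmented vector loads (e.g. 2× vlsseg4 at LMUL=2) plus strided stores, and `vl` comes from a dynamic vsetvl rather than the `vlmax` parameter here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of the tiled x32 transpose loop structure.
 * TILE_W is the 8-column tile covered by one group of segmented loads. */
enum { TILE_W = 8 };

static void x32_transpose_model(const uint32_t* in, uint32_t* out,
                                size_t block_height, size_t block_width,
                                size_t vlmax /* models (VLEN/32) * LMUL */) {
  for (size_t i = 0; i < block_height;) {
    /* Dynamic vl: a full vlmax rows, or the remaining tail rows. */
    size_t vl = block_height - i < vlmax ? block_height - i : vlmax;
    for (size_t j = 0; j < block_width; j += TILE_W) {
      /* Tail handling for arbitrary block_width. */
      size_t w = block_width - j < TILE_W ? block_width - j : TILE_W;
      for (size_t c = 0; c < w; c++) {   /* one segment "field" per column */
        for (size_t r = 0; r < vl; r++) {
          out[(j + c) * block_height + (i + r)] =
              in[(i + r) * block_width + (j + c)];
        }
      }
    }
    i += vl;
  }
}
```

Because `vl` is recomputed each outer iteration, the same loop is correct for any `vlmax`, which is the scalar analogue of the VLEN-agnostic property claimed above.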

Performance

Hardware: Saturn (https://github.com/ucb-bar/saturn-vectors, vector unit, VLEN=512). Benchmark: bench/xN-transposec, shapes from the default suite. Times in ns; speedup vs 16x8_rvv (the fastest correct existing kernel at VLEN=512):

| Shape | 16x8_rvv (baseline, ns) | 8xv2_rvv (this PR, ns) | Speedup |
|-------|------------------------:|-----------------------:|--------:|
| 32²   | 1520    | 1182    | 1.29× |
| 64²   | 5613    | 4455    | 1.26× |
| 117²  | 19076   | 15912   | 1.20× |
| 128²  | 22169   | 17714   | 1.25× |
| 256²  | 104721  | 91675   | 1.14× |
| 512²  | 897587  | 582929  | 1.54× |
| 1024² | 4092788 | 3572526 | 1.15× |

Geomean: 1.25× (+25%). Peak: 1.54× at 512².

Secondary benefit: VLEN-agnostic kernels

Using dynamic vsetvl at the top of every outer iteration (instead of baking a tile_height constant at codegen) means each new LMUL variant is individually VLEN-agnostic: 8xv1_rvv, 8xv2_rvv, and 8xv4_rvv each compile to a single binary that is correct on any VLEN ≥ 128.
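As a sketch of how the pieces above fit together, here is the 8xv2 outer-loop pattern in pseudocode. The names are illustrative only; the actual intrinsic spellings and store schedule in the merged kernel may differ:

```
i = 0
while i < block_height:
    vl = vsetvl_e32m2(block_height - i)     # rows this iteration; VLEN-agnostic
    # Split segmented loads: 2 x vlsseg4 at LMUL=2.
    # EMUL(2) x NFIELDS(4) = 8 <= 8, so each load group is legal.
    tuple0 = vlsseg4(row i, columns 0..3, stride = row pitch)
    store first half of tuple0's fields      # overlaps with next load's addr-gen
    tuple1 = vlsseg4(row i, columns 4..7, stride = row pitch)
    store remaining fields of tuple0, then tuple1
    i += vl
```

Because `vl` is re-derived from the remaining row count each iteration, the same binary handles any VLEN ≥ 128 and any block_height tail without specialization.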

Attribution

Optimization using Autocomp.

@charleshong3

Hi @dsharletg @fbarchard @ken-unger, is this something that would be of interest?

Collaborator

@dsharlet dsharlet left a comment


Very nice! A vlen-agnostic transpose is great.

@copybara-service copybara-service Bot merged commit 6af4ee5 into google:master Apr 20, 2026
3 checks passed
@ken-unger
Contributor

@keaganchern will you update transpose-config.c to use 8xv2 in a new PR?

