Commit 89b938a
[ET-VK][matmul] Re-implement fp32/fp16 matmul and linear with tiled compute and blocked weight packing
Pull Request resolved: #18171
Replace all existing matmul/linear operator implementations with new ones built
from the ground up using a tiled compute approach. Delete all legacy
implementations (MatMulLegacy.cpp, LinearLegacy.cpp, addmm_optimized.glsl,
addmm_naive_*.glsl).
New matmul (mm/bmm/addmm):
- A single matmul.glsl shader handles mm, bmm, and addmm, reusing the
  FPInputTile, FPWeightTile, and FPOutTile infrastructure from SDPA
- Adaptive tile size selection (TILE_M=4/2/1) based on GPU occupancy
- When mat2 is a constant tensor, the op automatically routes through the
  linear path to take advantage of blocked weight packing
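The tiled compute approach can be sketched on the CPU roughly as follows. This is a minimal illustration only, not the shader itself: the function name, TILE_N, and the row-major layout are assumptions for the sketch, while the real kernel lives in matmul.glsl and selects TILE_M adaptively (4/2/1) per the occupancy heuristic above.

```cpp
#include <cassert>
#include <vector>

constexpr int TILE_M = 4; // output rows produced per invocation (sketch value)
constexpr int TILE_N = 4; // output cols produced per invocation (sketch value)

// C[M,N] = A[M,K] * B[K,N]; M and N assumed multiples of the tile sizes.
// Each (m0, n0) pair plays the role of one shader invocation: it keeps a
// TILE_M x TILE_N accumulator in registers and walks the K dimension once,
// instead of recomputing address math per output element.
void tiled_mm(const std::vector<float>& A, const std::vector<float>& B,
              std::vector<float>& C, int M, int K, int N) {
  for (int m0 = 0; m0 < M; m0 += TILE_M) {
    for (int n0 = 0; n0 < N; n0 += TILE_N) {
      float acc[TILE_M][TILE_N] = {}; // per-invocation output tile
      for (int k = 0; k < K; ++k) {
        for (int i = 0; i < TILE_M; ++i)
          for (int j = 0; j < TILE_N; ++j)
            acc[i][j] += A[(m0 + i) * K + k] * B[k * N + (n0 + j)];
      }
      for (int i = 0; i < TILE_M; ++i)
        for (int j = 0; j < TILE_N; ++j)
          C[(m0 + i) * N + (n0 + j)] = acc[i][j];
    }
  }
}
```

The payoff on a GPU is that each input element loaded into registers is reused TILE_M (or TILE_N) times before being discarded, cutting global-memory traffic relative to a naive one-output-per-thread kernel.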
New linear:
- Custom 4OC×4IC blocked weight prepacking via pack_fp_linear_weight.glsl
  for better cache-line utilization during the tiled matmul
- Supports both transposed [N,K] and non-transposed [K,N] weights with
batch dimension support
- Separate texture2d weight storage with automatic buffer fallback for
large dimensions
Performance on Adreno 750 (fp16, vs legacy):
- Linear [4096,1024]x[256,1024]: 1.33x faster (texture)
- Linear [4096,64]x[128,64]: 2.67x faster (texture)
- BMM [1,4096,256]x[1,256,1024]: 1.63x faster (texture)
ghstack-source-id: 352051371
@exported-using-ghexport
Differential Revision: [D96488384](https://our.internmc.facebook.com/intern/diff/D96488384/)
30 files changed: 2290 additions & 1406 deletions