You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Update base for Update on "[ET-VK][conv1d] Implement height-packed depthwise conv1d operator"
Implement a depthwise conv1d operator using height-packed layout where channels
are the packed dimension (WHCN dim 1). Depthwise conv applies a separate filter
to each channel independently (groups=C), so 4 channels can be processed in
parallel using element-wise vec4 FMA over kernel positions.
Thread mapping: X=C/4, Y=L_out, Z=N. Each thread computes one output texel
(4 channels at one spatial position). Inner loop iterates over kernel positions
K with bounds-checked input access for padding.
Weight [C,1,K] is prepacked as channels-packed so each vec4 load gives 4
channels' weights at one kernel position. Supports both buffer and texture3d
storage, fp32/fp16, optional bias, and arbitrary stride/padding/dilation.
Registered as et_vk.conv1d_dw.default (standalone custom op).
Performance on Adreno 750 (S24):
- [1,128,4096] K=31 buffer f16: 231 GFLOP/s
- [1,128,4096] K=31 buffer f32: 155 GFLOP/s
- [1,512,2048] K=5 buffer f32: 66 GFLOP/s
Differential Revision: [D97344091](https://our.internmc.facebook.com/intern/diff/D97344091/)
[ghstack-poisoned]
0 commit comments