Commit 560445b
metal : tighten input-position loop in kernel_conv_transpose_1d (ggml/1477)
For a given output position j on the time axis, only input positions
i such that i*s0 <= j < i*s0 + K contribute -- i.e.
i in [ceil((j - K + 1)/s0), floor(j/s0)] intersected with [0, IL-1].
That's at most ceil(K/s0) values (typically 2 for stride==K/2
transposed convs).
The current kernel iterates the full IL range and filters with an
`if`, amplifying per-thread work by IL/ceil(K/s0) (~160x for IL=320,
K=10, s0=5 -- a representative codec-decoder shape). On Apple M1
the wasted work trips the macOS GPU watchdog
(kIOGPUCommandBufferCallbackErrorImpactingInteractivity) on long
graphs.
Compute i_min, i_max analytically before the inner loop and iterate
only [i_min, i_max]. Output is bit-identical (same multiplies and
adds in the same order); loop bound shrinks by IL/ceil(K/s0).
Tested on M1 with a downstream consumer running a TTS codec at full
T_codec; end-to-end codec decode ~3-4x faster, zero watchdog hits
across long synthesis runs vs ~30% pre-patch.1 parent 2eb3e6b commit 560445b
1 file changed
Lines changed: 24 additions & 7 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
4881 | 4881 | | |
4882 | 4882 | | |
4883 | 4883 | | |
4884 | | - | |
| 4884 | + | |
| 4885 | + | |
| 4886 | + | |
| 4887 | + | |
| 4888 | + | |
| 4889 | + | |
| 4890 | + | |
| 4891 | + | |
| 4892 | + | |
| 4893 | + | |
| 4894 | + | |
| 4895 | + | |
| 4896 | + | |
| 4897 | + | |
| 4898 | + | |
| 4899 | + | |
| 4900 | + | |
4885 | 4901 | | |
4886 | | - | |
4887 | | - | |
4888 | | - | |
| 4902 | + | |
| 4903 | + | |
| 4904 | + | |
| 4905 | + | |
| 4906 | + | |
4889 | 4907 | | |
4890 | | - | |
4891 | | - | |
4892 | | - | |
| 4908 | + | |
| 4909 | + | |
4893 | 4910 | | |
4894 | 4911 | | |
4895 | 4912 | | |
| |||
0 commit comments