Commit 995f6e3
kernels/custom: grid_sampler_2d fp16 — accumulate in fp32
Match the precision of the portable kernel (after pytorch#19117)
and avoid fp16 catastrophic cancellation on weight computation. The NEON
half variant previously did interpolation weight computation and FMA
accumulation in fp16 via vmul_f16 / vfma_f16; this change loads fp16,
promotes to float32x4 via vcvt_f32_f16, does the four-corner FMA chain in
fp32, and casts back to fp16 on store.
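The effect described above can be reproduced without NEON hardware. The following is a minimal sketch, not the kernel's code: it emulates fp16 and fp32 rounding with Python's `struct` module and runs the same four-corner weighted sum in both precisions. The helper names (`to_fp16`, `to_fp32`, `unnormalize`, `sample`), the `align_corners=False` coordinate mapping, the 256x256 input size, and the corner values are all assumptions chosen for illustration. It also assumes the pixel coordinate itself is derived in fp16 on the old path, which is where `x - floor(x)` loses most of its significant bits.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def unnormalize(g: float, size: int, rnd) -> float:
    """Map a grid coordinate in [-1, 1] to pixel space (align_corners=False),
    rounding every intermediate with rnd."""
    return rnd(rnd(rnd(rnd(g + 1.0) * size) - 1.0) * 0.5)


def sample(corners, gx, gy, width, height, rnd) -> float:
    """Bilinear blend of four corner values with all arithmetic rounded by rnd.

    corners = (nw, ne, sw, se). A real kernel would pick the corners from
    floor(x), floor(y); here they are passed in to keep the sketch short.
    """
    nw, ne, sw, se = corners
    x = unnormalize(gx, width, rnd)
    y = unnormalize(gy, height, rnd)
    # Fractional parts: the subtraction cancels the leading bits of x and y.
    fx = rnd(x - math.floor(x))
    fy = rnd(y - math.floor(y))
    wx0 = rnd(1.0 - fx)
    wy0 = rnd(1.0 - fy)
    w_nw = rnd(wx0 * wy0)
    w_ne = rnd(fx * wy0)
    w_sw = rnd(wx0 * fy)
    w_se = rnd(fx * fy)
    # Four-corner chain; each step models one fused multiply-add.
    acc = rnd(nw * w_nw)
    acc = rnd(ne * w_ne + acc)
    acc = rnd(sw * w_sw + acc)
    acc = rnd(se * w_se + acc)
    return to_fp16(acc)  # the store is fp16 on both paths


corners = (1.0, 2.0, 3.0, 4.0)  # hypothetical fp16 pixel values
gx = to_fp16(0.2)               # grid coordinates as they arrive in fp16
gy = to_fp16(-0.5)

fp16_result = sample(corners, gx, gy, 256, 256, to_fp16)  # old behaviour
fp32_result = sample(corners, gx, gy, 256, 256, to_fp32)  # promoted path

print(fp16_result, fp32_result, abs(fp16_result - fp32_result))
# prints: 2.125 2.09375 0.03125
```

In this case the fp16 path rounds `gx + 1` to a 1/1024 grid before scaling by the width, so the fractional weight comes out as 0.125 instead of 0.09375. The promoted path keeps every intermediate exact and only rounds on the final store.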

Speed impact: two vcvt instructions per 4-channel group (single-cycle on
modern ARM), unmeasurable at op level in a full-model benchmark (3.5 ms for
a typical call shape, unchanged).

Precision impact: max_abs vs an fp32-then-down-cast reference drops from
~0.1 to 0 on the shapes the polycam depth model uses.

1 parent b5a2967 commit 995f6e3
1 file changed
Lines changed: 15 additions & 13 deletions
@@ -146,6 +146,8 @@
@@ -161,10 +163,10 @@
@@ -214,18 +216,18 @@