Commit a7e7305
opencl: GDN sv128 — stage k/q/g through __local (4× fewer global reads)

In the sv128 specialization, the 4 columns in each workgroup share the
same head, so all four read identical k/q/g vectors. Each thread was
fetching 4 strided floats of k and q from global; with 4 columns per WG,
every element was fetched 4× across the workgroup. The Adreno L1 absorbs
some of this, but explicit __local staging removes the redundancy and
frees L1 footprint for the state column reads, which are unique per column.

The 128 threads cooperatively load the 128 floats of k/q (1 element per
thread, fully coalesced), plus exp(g) in the kda path. After one barrier,
each lane's 4 reads hit __local instead of global. v[col] is per-column,
so it stays a direct global read. Results are bit-exact in
test-backend-ops -o GATED_DELTA_NET (8/8 OpenCL cases OK; the
head_size=128 case hits the new path).
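
The read-count argument above can be checked with a small host-side emulation in plain C. This is only a sketch, not the kernel from this commit: the lane layout (32 lanes per column, stride 32), the one-k-plus-one-q load per thread, the placeholder arithmetic in lane_term, and every identifier below are assumptions made for illustration. A local array stands in for __local, and a comment marks where barrier(CLK_LOCAL_MEM_FENCE) would sit in a real kernel.

```c
#include <string.h>

/* Assumed geometry: head size 128, 4 columns per workgroup, 32 lanes per column. */
enum { S = 128, COLS = 4, LANES = S / COLS, WG = COLS * LANES };

static float k_glob[S], q_glob[S], state[COLS * S];
static long n_global_reads; /* counts emulated global reads of k and q only */

static void init_inputs(void) {
    for (int i = 0; i < S; i++) {
        k_glob[i] = 0.01f * (float)i;
        q_glob[i] = 1.0f - 0.005f * (float)i;
    }
    for (int i = 0; i < COLS * S; i++) state[i] = 0.001f * (float)(i % 97);
}

/* Emulated global load: returns the value and bumps the counter. */
static float gload(const float *p, int i) { n_global_reads++; return p[i]; }

/* Placeholder per-element arithmetic (NOT the GDN recurrence), shared by both paths. */
static float lane_term(float s, float kv, float qv) { return s * kv + qv; }

/* Before: each thread fetches its 4 strided k/q values straight from global. */
long run_direct(float *out) {
    init_inputs();
    n_global_reads = 0;
    for (int tid = 0; tid < WG; tid++) {
        int col = tid / LANES, lane = tid % LANES;
        float acc = 0.0f;
        for (int j = 0; j < 4; j++) {
            int i = lane + j * LANES;
            float kv = gload(k_glob, i);
            float qv = gload(q_glob, i);
            acc += lane_term(state[col * S + i], kv, qv);
        }
        out[tid] = acc;
    }
    return n_global_reads;
}

/* After: one cooperative load into a local buffer, then all reads hit that buffer. */
long run_staged(float *out) {
    float k_loc[S], q_loc[S]; /* stands in for __local storage */
    init_inputs();
    n_global_reads = 0;
    for (int tid = 0; tid < WG; tid++) { /* one element of k and q per thread */
        k_loc[tid] = gload(k_glob, tid);
        q_loc[tid] = gload(q_glob, tid);
    }
    /* barrier(CLK_LOCAL_MEM_FENCE) would go here in an OpenCL kernel */
    for (int tid = 0; tid < WG; tid++) {
        int col = tid / LANES, lane = tid % LANES;
        float acc = 0.0f;
        for (int j = 0; j < 4; j++) {
            int i = lane + j * LANES;
            acc += lane_term(state[col * S + i], k_loc[i], q_loc[i]);
        }
        out[tid] = acc;
    }
    return n_global_reads;
}

/* Returns 1 when the two result buffers match bit for bit. */
int same_bits(const float *a, const float *b) {
    return memcmp(a, b, sizeof(float) * WG) == 0;
}
```

Under these assumptions, run_direct issues 128 × 4 × 2 = 1024 emulated global reads of k/q, run_staged issues 128 × 2 = 256, and same_bits on the two output buffers returns 1, since both paths apply the same arithmetic in the same order. The emulation says nothing about cache behavior or actual throughput on Adreno.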

Perf is in the noise within a single benchmarking session: Adreno X2
throttles fast under sustained load, and cross-config absolute numbers
drift by ~30% on tg across back-to-back -r 2 runs. Will re-bench from
cold and update [[gdn_opencl_wip]] when there's a clean number.

1 parent 0d4ac15, commit a7e7305
1 file changed
Lines changed: 19 additions & 3 deletions