Commit 931739f
Speedup linearize_index via flat-grid kernel (#5768)
Summary:
Pull Request resolved: #5768
X-link: https://github.com/facebookresearch/FBGEMM/pull/2696
linearize_index_wo_infos_kernel was launched with grid = ceil(total_B / kMaxThreads), kMaxThreads = 1024. On the prod IFR-MTML mc7 shape total_B is in the low thousands, so the launch consumed only ~5 SMs out of 132 on H100. Each warp also did intra-warp shuffle (`shfl_sync` x kWarpSize) to redistribute work among lanes, adding latency on top of the SM under-utilization. Bench measures this kernel at ~1.7 ms / call.
Replace with a flat-grid kernel launched as grid = total_B (one block per (t, b) sample), threads = 256. Each block recovers (t, b) from blockIdx.x via the existing FixedDivisor argument and stripes the per-sample reads/writes across its threads. No shuffle, full SM utilization. Bench measures the new kernel at ~140 us / call (~12x speedup on this kernel).
No public API change. Output dtype is unchanged (still index_t); downstream kernels (at::_unique, delinearize, length kernel) consume the same buffer in the same format.
Reviewed By: q10
Differential Revision: D105005594
fbshipit-source-id: 33f3d2b41713268765db134f40a37ad7485e1bc41 parent 7121bf0 commit 931739f
1 file changed
Lines changed: 27 additions & 26 deletions
Lines changed: 27 additions & 26 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
27 | 27 | | |
28 | 28 | | |
29 | 29 | | |
30 | | - | |
31 | | - | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
| 34 | + | |
| 35 | + | |
| 36 | + | |
| 37 | + | |
| 38 | + | |
32 | 39 | | |
33 | | - | |
| 40 | + | |
34 | 41 | | |
35 | 42 | | |
36 | 43 | | |
| |||
40 | 47 | | |
41 | 48 | | |
42 | 49 | | |
43 | | - | |
| 50 | + | |
44 | 51 | | |
45 | 52 | | |
46 | | - | |
47 | | - | |
48 | | - | |
49 | | - | |
50 | | - | |
51 | | - | |
52 | | - | |
53 | | - | |
54 | | - | |
55 | | - | |
56 | | - | |
57 | | - | |
58 | | - | |
59 | | - | |
60 | | - | |
61 | | - | |
62 | | - | |
63 | | - | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
| 58 | + | |
| 59 | + | |
| 60 | + | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
64 | 65 | | |
65 | 66 | | |
66 | 67 | | |
| |||
170 | 171 | | |
171 | 172 | | |
172 | 173 | | |
173 | | - | |
| 174 | + | |
174 | 175 | | |
175 | 176 | | |
176 | 177 | | |
| |||
204 | 205 | | |
205 | 206 | | |
206 | 207 | | |
207 | | - | |
208 | | - | |
209 | | - | |
| 208 | + | |
| 209 | + | |
| 210 | + | |
210 | 211 | | |
211 | 212 | | |
212 | 213 | | |
| |||
0 commit comments