Commit 2f07feb
chore[gpu]: don't inline bitunpack lane impls (#7441)
Fully inlining the bitunpack kernels adds register pressure such that
the dynamic dispatch kernel previously operated at the maximum of 32
registers per thread. Not inlining the bitunpack lane impls drops the
register count per thread to 24. This is relevant for future changes
that will require more registers, such as patches support on the GPU.
The performance hit we take for the bitunpack kernels here is 10%. It's
probably worthwhile to investigate whether there is a tradeoff that
gets similar perf with less aggressive inlining in the future. One
thing we could also look at is accepting register spills via launch
bounds in exchange for more occupancy.
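
As a rough illustration of the two knobs mentioned above, the sketch
below marks a lane helper as non-inlined and, when built with nvcc,
wraps it in a kernel that sets launch bounds. All names (`LANE_FN`,
`unpack_lane`, `bitunpack_kernel`), the launch-bound values (256, 4),
and the simple contiguous bit layout are assumptions made for this
example; they are not taken from the actual kernels in this commit.

```cpp
#include <cstddef>
#include <cstdint>

// Under nvcc, keep the lane helper out of line; on a GCC/Clang host
// compiler, fall back to the equivalent attribute so the sketch builds.
#if defined(__CUDACC__)
#define LANE_FN __host__ __device__ __noinline__
#else
#define LANE_FN __attribute__((noinline))
#endif

// Hypothetical lane impl: unpack 32 values of `bit_width` bits (1..32)
// from `bit_width` consecutive 32-bit words, least significant bit first.
LANE_FN void unpack_lane(const uint32_t* packed, uint32_t* out,
                         unsigned bit_width) {
    const uint64_t mask =
        (bit_width == 32) ? 0xFFFFFFFFull : ((1ull << bit_width) - 1);
    for (unsigned i = 0; i < 32; ++i) {
        const unsigned bit = i * bit_width;
        const unsigned word = bit / 32;
        const unsigned shift = bit % 32;
        uint64_t window = packed[word];
        // Pull in the next word only when the value straddles a boundary.
        if (shift + bit_width > 32) {
            window |= static_cast<uint64_t>(packed[word + 1]) << 32;
        }
        out[i] = static_cast<uint32_t>((window >> shift) & mask);
    }
}

#if defined(__CUDACC__)
// Assumed launch bounds: at most 256 threads per block and at least 4
// resident blocks per SM. The compiler may spill registers to meet them.
__global__ void __launch_bounds__(256, 4)
bitunpack_kernel(const uint32_t* packed, uint32_t* out,
                 unsigned bit_width, unsigned num_lanes) {
    const unsigned lane = blockIdx.x * blockDim.x + threadIdx.x;
    if (lane < num_lanes) {
        unpack_lane(packed + static_cast<size_t>(lane) * bit_width,
                    out + static_cast<size_t>(lane) * 32, bit_width);
    }
}
#endif
```

Whether the call overhead or the spills cost more than the occupancy
gained would have to be measured per kernel; the sketch only shows
where each annotation goes.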
Splitting out the lane implementations into headers is needed so that
the dynamic dispatch kernel PTX can be compiled without the standalone
bitunpack kernels, which are not relevant for the dynamic dispatch
kernel. This reduces the amount of assembly for the dynamic dispatch
kernel from 128k lines to 48k lines. Besides nvcc compile times, this
is relevant for the dynamic dispatch kernel in terms of PTX-to-device
compilation, which should be as fast as possible.
---------
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
1 parent 4a5b7d7 commit 2f07feb
12 files changed
Lines changed: 17012 additions & 16952 deletions
File tree
- .github/workflows
- vortex-cuda
- kernels/src
- src