Commit c3f0807
perf: Optimize translate() UDF for scalar inputs (#20305)
## Which issue does this PR close?
- Closes #20302.
## Rationale for this change
`translate()` is commonly invoked with constant values for its second
and third arguments. We can take advantage of that to significantly
optimize its performance by precomputing the translation lookup table,
rather than recomputing it for every row. For ASCII-only inputs, we can
further replace the hashmap lookup table with a fixed-size array that
maps ASCII byte values directly.
For scalar ASCII inputs, this yields roughly a 10x performance
improvement. For scalar UTF8 inputs, the performance improvement is more
like 50%, although less so for long strings.
Along the way, add support for `translate()` on `LargeUtf8` input, along
with an SLT test, and improve the docs.
## What changes are included in this PR?
* Add a benchmark for scalar/constant input to translate
* Add a missing test case
* Improve translate() docs
* Support translate() on LargeUtf8 input
* Optimize translate() for scalar inputs by precomputing lookup hashmap
* Optimize translate() for ASCII inputs by precomputing ASCII byte-wise
lookup table
## Are these changes tested?
Yes. Added an extra test case and did a bunch of benchmarking.
## Are there any user-facing changes?
No.
---------
Co-authored-by: Martin Grigorov <martin-g@users.noreply.github.com>
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>1 parent 4f4e814 commit c3f0807
File tree
4 files changed
+221
-26
lines changed- datafusion
- functions
- benches
- src/unicode
- sqllogictest/test_files
- docs/source/user-guide/sql
4 files changed
+221
-26
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
19 | 19 | | |
20 | 20 | | |
21 | 21 | | |
22 | | - | |
23 | 22 | | |
| 23 | + | |
24 | 24 | | |
25 | 25 | | |
26 | 26 | | |
27 | 27 | | |
28 | 28 | | |
29 | 29 | | |
30 | | - | |
| 30 | + | |
| 31 | + | |
| 32 | + | |
| 33 | + | |
31 | 34 | | |
32 | | - | |
33 | 35 | | |
34 | 36 | | |
35 | 37 | | |
| |||
40 | 42 | | |
41 | 43 | | |
42 | 44 | | |
| 45 | + | |
| 46 | + | |
| 47 | + | |
| 48 | + | |
| 49 | + | |
| 50 | + | |
| 51 | + | |
| 52 | + | |
| 53 | + | |
| 54 | + | |
| 55 | + | |
| 56 | + | |
| 57 | + | |
43 | 58 | | |
44 | 59 | | |
45 | 60 | | |
| |||
67 | 82 | | |
68 | 83 | | |
69 | 84 | | |
70 | | - | |
71 | | - | |
72 | | - | |
73 | | - | |
74 | | - | |
75 | | - | |
76 | | - | |
77 | | - | |
78 | | - | |
79 | | - | |
80 | | - | |
| 85 | + | |
| 86 | + | |
| 87 | + | |
| 88 | + | |
| 89 | + | |
| 90 | + | |
| 91 | + | |
| 92 | + | |
| 93 | + | |
| 94 | + | |
| 95 | + | |
| 96 | + | |
| 97 | + | |
| 98 | + | |
| 99 | + | |
| 100 | + | |
81 | 101 | | |
82 | 102 | | |
83 | 103 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
35 | 35 | | |
36 | 36 | | |
37 | 37 | | |
38 | | - | |
39 | | - | |
| 38 | + | |
| 39 | + | |
40 | 40 | | |
41 | 41 | | |
42 | 42 | | |
| |||
46 | 46 | | |
47 | 47 | | |
48 | 48 | | |
49 | | - | |
| 49 | + | |
50 | 50 | | |
51 | | - | |
52 | | - | |
| 51 | + | |
| 52 | + | |
53 | 53 | | |
54 | 54 | | |
55 | 55 | | |
| |||
71 | 71 | | |
72 | 72 | | |
73 | 73 | | |
| 74 | + | |
74 | 75 | | |
75 | 76 | | |
76 | 77 | | |
| |||
99 | 100 | | |
100 | 101 | | |
101 | 102 | | |
| 103 | + | |
| 104 | + | |
| 105 | + | |
| 106 | + | |
| 107 | + | |
| 108 | + | |
| 109 | + | |
| 110 | + | |
| 111 | + | |
| 112 | + | |
| 113 | + | |
| 114 | + | |
| 115 | + | |
| 116 | + | |
| 117 | + | |
| 118 | + | |
| 119 | + | |
| 120 | + | |
| 121 | + | |
| 122 | + | |
| 123 | + | |
| 124 | + | |
| 125 | + | |
| 126 | + | |
| 127 | + | |
| 128 | + | |
| 129 | + | |
| 130 | + | |
| 131 | + | |
| 132 | + | |
| 133 | + | |
| 134 | + | |
| 135 | + | |
| 136 | + | |
| 137 | + | |
| 138 | + | |
| 139 | + | |
| 140 | + | |
| 141 | + | |
| 142 | + | |
| 143 | + | |
| 144 | + | |
| 145 | + | |
| 146 | + | |
| 147 | + | |
| 148 | + | |
| 149 | + | |
| 150 | + | |
| 151 | + | |
| 152 | + | |
| 153 | + | |
| 154 | + | |
| 155 | + | |
| 156 | + | |
| 157 | + | |
102 | 158 | | |
103 | 159 | | |
104 | 160 | | |
| |||
107 | 163 | | |
108 | 164 | | |
109 | 165 | | |
| 166 | + | |
| 167 | + | |
| 168 | + | |
| 169 | + | |
| 170 | + | |
| 171 | + | |
| 172 | + | |
| 173 | + | |
110 | 174 | | |
111 | 175 | | |
112 | 176 | | |
| |||
123 | 187 | | |
124 | 188 | | |
125 | 189 | | |
126 | | - | |
127 | | - | |
| 190 | + | |
| 191 | + | |
128 | 192 | | |
129 | 193 | | |
130 | 194 | | |
| |||
170 | 234 | | |
171 | 235 | | |
172 | 236 | | |
173 | | - | |
| 237 | + | |
174 | 238 | | |
175 | 239 | | |
176 | 240 | | |
| |||
199 | 263 | | |
200 | 264 | | |
201 | 265 | | |
| 266 | + | |
| 267 | + | |
| 268 | + | |
| 269 | + | |
| 270 | + | |
| 271 | + | |
| 272 | + | |
| 273 | + | |
| 274 | + | |
| 275 | + | |
| 276 | + | |
| 277 | + | |
| 278 | + | |
| 279 | + | |
| 280 | + | |
| 281 | + | |
| 282 | + | |
| 283 | + | |
| 284 | + | |
| 285 | + | |
| 286 | + | |
| 287 | + | |
| 288 | + | |
| 289 | + | |
| 290 | + | |
| 291 | + | |
| 292 | + | |
| 293 | + | |
| 294 | + | |
| 295 | + | |
| 296 | + | |
| 297 | + | |
| 298 | + | |
| 299 | + | |
| 300 | + | |
| 301 | + | |
| 302 | + | |
| 303 | + | |
| 304 | + | |
| 305 | + | |
| 306 | + | |
| 307 | + | |
| 308 | + | |
| 309 | + | |
| 310 | + | |
| 311 | + | |
| 312 | + | |
| 313 | + | |
| 314 | + | |
| 315 | + | |
| 316 | + | |
| 317 | + | |
| 318 | + | |
| 319 | + | |
| 320 | + | |
| 321 | + | |
| 322 | + | |
| 323 | + | |
| 324 | + | |
| 325 | + | |
| 326 | + | |
| 327 | + | |
| 328 | + | |
| 329 | + | |
| 330 | + | |
| 331 | + | |
| 332 | + | |
| 333 | + | |
| 334 | + | |
| 335 | + | |
| 336 | + | |
| 337 | + | |
| 338 | + | |
| 339 | + | |
| 340 | + | |
| 341 | + | |
| 342 | + | |
| 343 | + | |
| 344 | + | |
| 345 | + | |
| 346 | + | |
| 347 | + | |
| 348 | + | |
| 349 | + | |
| 350 | + | |
| 351 | + | |
| 352 | + | |
| 353 | + | |
| 354 | + | |
| 355 | + | |
| 356 | + | |
202 | 357 | | |
203 | 358 | | |
204 | 359 | | |
| |||
284 | 439 | | |
285 | 440 | | |
286 | 441 | | |
| 442 | + | |
| 443 | + | |
| 444 | + | |
| 445 | + | |
| 446 | + | |
| 447 | + | |
| 448 | + | |
| 449 | + | |
| 450 | + | |
| 451 | + | |
| 452 | + | |
| 453 | + | |
| 454 | + | |
| 455 | + | |
| 456 | + | |
287 | 457 | | |
288 | 458 | | |
289 | 459 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
239 | 239 | | |
240 | 240 | | |
241 | 241 | | |
| 242 | + | |
| 243 | + | |
| 244 | + | |
| 245 | + | |
| 246 | + | |
242 | 247 | | |
243 | 248 | | |
244 | 249 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2068 | 2068 | | |
2069 | 2069 | | |
2070 | 2070 | | |
2071 | | - | |
| 2071 | + | |
2072 | 2072 | | |
2073 | 2073 | | |
2074 | | - | |
| 2074 | + | |
2075 | 2075 | | |
2076 | 2076 | | |
2077 | 2077 | | |
2078 | 2078 | | |
2079 | 2079 | | |
2080 | | - | |
2081 | | - | |
| 2080 | + | |
| 2081 | + | |
2082 | 2082 | | |
2083 | 2083 | | |
2084 | 2084 | | |
| |||
0 commit comments