Commit 6c106ba
## Which issue does this PR close?
- Closes #21060.
## Rationale for this change
`lpad`, `rpad`, and `translate` use grapheme segmentation. This is
inconsistent with Postgres, DuckDB, and the SQL standard, all of which
segment strings by Unicode codepoint. Grapheme-based segmentation is
also significantly more expensive than codepoint-based segmentation.
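To make the distinction concrete, here is a minimal sketch using only the Rust standard library. The helper `codepoint_len` and the two constants are hypothetical names chosen for illustration and are not taken from the DataFusion source. Counting graphemes would need an external crate, so that side of the comparison is described only in comments.

```rust
/// Counts Unicode codepoints (Rust `char`s) in `s`.
/// Hypothetical helper for illustration, not a DataFusion API.
fn codepoint_len(s: &str) -> usize {
    s.chars().count()
}

/// "é" written as `e` followed by U+0301 (combining acute accent).
/// A grapheme segmenter treats this as one unit, but it is two codepoints.
const DECOMPOSED_E_ACUTE: &str = "e\u{301}";

/// "é" written as the single precomposed codepoint U+00E9.
const PRECOMPOSED_E_ACUTE: &str = "\u{e9}";
```

Under these definitions, `codepoint_len(DECOMPOSED_E_ACUTE)` is 2 and `codepoint_len(PRECOMPOSED_E_ACUTE)` is 1, even though both render as a single user-perceived character.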
In the case of `lpad` and `rpad`, graphemes and codepoints were used
inconsistently: the input string was measured in codepoints but the
fill string was measured in graphemes.
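As a rough illustration of measuring both the input and the fill in codepoints, the sketch below pads with the standard library only. It is not the code from this PR: the function names are hypothetical, the target length is assumed to be non-negative (`usize`), and the truncation and empty-fill handling follow the Postgres-style semantics that the PR description says it aims to match.

```rust
/// Builds `n` codepoints of padding by cycling through `fill`.
/// Hypothetical helper; `fill` is expected to be non-empty.
fn padding(fill: &str, n: usize) -> String {
    fill.chars().cycle().take(n).collect()
}

/// Left-pads `s` to `target` codepoints, truncating on the right if `s` is longer.
fn lpad_codepoints(s: &str, target: usize, fill: &str) -> String {
    let len = s.chars().count();
    if target <= len {
        // Input is already long enough: keep only the first `target` codepoints.
        return s.chars().take(target).collect();
    }
    if fill.is_empty() {
        // Nothing to pad with, so the input is returned unchanged.
        return s.to_string();
    }
    let mut out = padding(fill, target - len);
    out.push_str(s);
    out
}

/// Right-pads `s` to `target` codepoints, truncating on the right if `s` is longer.
fn rpad_codepoints(s: &str, target: usize, fill: &str) -> String {
    let len = s.chars().count();
    if target <= len {
        return s.chars().take(target).collect();
    }
    if fill.is_empty() {
        return s.to_string();
    }
    let mut out = s.to_string();
    out.push_str(&padding(fill, target - len));
    out
}
```

With this sketch, `lpad_codepoints("hi", 5, "xy")` gives `"xyxhi"` and `rpad_codepoints("hi", 5, "xy")` gives `"hixyx"`. A fill of `"e\u{301}"` counts as two codepoints, so `lpad_codepoints("a", 3, "e\u{301}")` adds exactly one `e` plus one combining accent in front of `a`.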
#3054 switched most string-related functions in DataFusion to
codepoints, but these three functions were not converted at the time.
Benchmarks (M4 Max):
lpad size=1024:
- lpad utf8 [str_len=5, target=20]: 12.4 µs → 12.8 µs, +3.0%
- lpad stringview [str_len=5, target=20]: 11.5 µs → 11.7 µs, +1.4%
- lpad utf8 [str_len=20, target=50]: 11.3 µs → 11.3 µs, +0.1%
- lpad stringview [str_len=20, target=50]: 11.8 µs → 12.0 µs, +1.6%
- lpad utf8 unicode [target=20]: 98.4 µs → 24.4 µs, -75.1%
- lpad stringview unicode [target=20]: 99.8 µs → 26.0 µs, -74.0%
- lpad utf8 scalar [str_len=5, target=20, fill='x']: 8.7 µs → 8.8 µs,
+1.0%
- lpad stringview scalar [str_len=5, target=20, fill='x']: 10.2 µs →
10.1 µs, -0.1%
- lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 44.7 µs →
10.9 µs, -75.7%
- lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 152.5 µs →
11.7 µs, -92.3%
lpad size=4096:
- lpad utf8 [str_len=5, target=20]: 55.9 µs → 55.1 µs, -1.4%
- lpad stringview [str_len=5, target=20]: 49.2 µs → 50.1 µs, +1.8%
- lpad utf8 [str_len=20, target=50]: 46.6 µs → 46.4 µs, -0.5%
- lpad stringview [str_len=20, target=50]: 47.5 µs → 48.5 µs, +2.1%
- lpad utf8 unicode [target=20]: 401.3 µs → 100.1 µs, -75.0%
- lpad stringview unicode [target=20]: 397.7 µs → 104.9 µs, -73.6%
- lpad utf8 scalar [str_len=5, target=20, fill='x']: 34.2 µs → 35.0 µs,
+2.4%
- lpad stringview scalar [str_len=5, target=20, fill='x']: 40.1 µs →
40.4 µs, +0.6%
- lpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 178.3 µs →
42.9 µs, -76.0%
- lpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 601.3 µs →
46.2 µs, -92.3%
rpad size=1024:
- rpad utf8 [str_len=5, target=20]: 15.5 µs → 14.4 µs, -7.1%
- rpad stringview [str_len=5, target=20]: 13.8 µs → 14.0 µs, +1.7%
- rpad utf8 [str_len=20, target=50]: 12.6 µs → 12.7 µs, +1.3%
- rpad stringview [str_len=20, target=50]: 13.0 µs → 13.1 µs, +0.7%
- rpad utf8 unicode [target=20]: 103.5 µs → 26.0 µs, -74.8%
- rpad stringview unicode [target=20]: 101.2 µs → 27.6 µs, -72.7%
- rpad utf8 scalar [str_len=5, target=20, fill='x']: 11.4 µs → 10.9 µs,
-3.9%
- rpad stringview scalar [str_len=5, target=20, fill='x']: 12.2 µs →
12.6 µs, +2.8%
- rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 46.3 µs →
12.4 µs, -73.1%
- rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 155.6 µs →
11.6 µs, -92.4%
rpad size=4096:
- rpad utf8 [str_len=5, target=20]: 70.1 µs → 61.6 µs, -12.2%
- rpad stringview [str_len=5, target=20]: 60.4 µs → 56.8 µs, -6.0%
- rpad utf8 [str_len=20, target=50]: 50.6 µs → 51.2 µs, +1.2%
- rpad stringview [str_len=20, target=50]: 53.7 µs → 53.3 µs, -0.8%
- rpad utf8 unicode [target=20]: 407.1 µs → 104.0 µs, -74.5%
- rpad stringview unicode [target=20]: 404.8 µs → 114.5 µs, -71.7%
- rpad utf8 scalar [str_len=5, target=20, fill='x']: 47.5 µs → 45.6 µs,
-4.0%
- rpad stringview scalar [str_len=5, target=20, fill='x']: 56.4 µs →
58.5 µs, +3.6%
- rpad utf8 scalar unicode [str_len=5, target=20, fill='é']: 184.1 µs →
48.1 µs, -73.9%
- rpad utf8 scalar truncate [str_len=20, target=5, fill='é']: 606.4 µs →
45.6 µs, -92.5%
translate size=1024:
- array_from_to [str_len=8]: 140.0 µs → 37.6 µs, -73.2%
- scalar_from_to [str_len=8]: 9.0 µs → 8.8 µs, -2.7%
- array_from_to [str_len=32]: 371.3 µs → 65.6 µs, -82.3%
- scalar_from_to [str_len=32]: 19.9 µs → 19.2 µs, -3.6%
- array_from_to [str_len=128]: 1249.6 µs → 188.7 µs, -84.9%
- scalar_from_to [str_len=128]: 70.2 µs → 64.7 µs, -7.9%
- array_from_to [str_len=1024]: 9349.4 µs → 1378.1 µs, -85.3%
- scalar_from_to [str_len=1024]: 506.5 µs → 445.8 µs, -12.0%
translate size=4096:
- array_from_to [str_len=8]: 548.0 µs → 147.1 µs, -73.2%
- scalar_from_to [str_len=8]: 33.9 µs → 32.8 µs, -3.1%
- array_from_to [str_len=32]: 1457.2 µs → 266.0 µs, -81.7%
- scalar_from_to [str_len=32]: 78.0 µs → 75.5 µs, -3.2%
- array_from_to [str_len=128]: 4935.0 µs → 791.1 µs, -84.0%
- scalar_from_to [str_len=128]: 278.2 µs → 260.7 µs, -6.3%
- array_from_to [str_len=1024]: 37496 µs → 5591 µs, -85.1%
- scalar_from_to [str_len=1024]: 2058.0 µs → 1770 µs, -14.0%
## What changes are included in this PR?
* Switch from grapheme segmentation to codepoint segmentation for
`lpad`, `rpad`, and `translate`
* Add SLT tests
* Refactor a few helper functions
* Remove dependency on `unicode_segmentation` crate as it is no longer
used
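For `translate`, codepoint segmentation means the `from` and `to` arguments are matched up one `char` at a time. The following is a minimal sketch of that idea and not the implementation in this PR: the function name is hypothetical, and the handling of a longer `from` (extra characters are deleted) and of duplicates in `from` (first occurrence wins) is assumed from Postgres behavior.

```rust
use std::collections::HashMap;

/// Replaces each codepoint of `input` found in `from` with the codepoint at
/// the same position in `to`. Codepoints of `from` with no counterpart in
/// `to` are deleted from the output.
fn translate_codepoints(input: &str, from: &str, to: &str) -> String {
    let to_chars: Vec<char> = to.chars().collect();

    // Map each `from` codepoint to its replacement, or to `None` for deletion.
    // `entry(..).or_insert_with` keeps the first mapping when `from` repeats a codepoint.
    let mut map: HashMap<char, Option<char>> = HashMap::new();
    for (i, c) in from.chars().enumerate() {
        map.entry(c).or_insert_with(|| to_chars.get(i).copied());
    }

    input
        .chars()
        .filter_map(|c| match map.get(&c) {
            Some(replacement) => *replacement, // mapped: replace or delete
            None => Some(c),                   // unmapped: keep as-is
        })
        .collect()
}
```

For example, `translate_codepoints("12345", "143", "ax")` gives `"a2x5"`: `1` becomes `a`, `4` becomes `x`, and `3` is removed because `to` has no third codepoint.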
## Are these changes tested?
Yes. The new SLT tests were also run against DuckDB and Postgres to
confirm the behavior is consistent.
## Are there any user-facing changes?
Yes. This PR changes the behavior of `lpad`, `rpad`, and `translate`,
although the new behavior is more consistent with the SQL standard and
with other SQL implementations.
1 parent 6cf94c7 commit 6c106ba
File tree
8 files changed (+255, -169 lines):
- datafusion
  - functions
    - src/unicode
  - sqllogictest/test_files/string
- docs/source/library-user-guide/upgrading