Commit 9f893a4
## Which issue does this PR close?
- Closes #21117.
- Closes #21118.
## Rationale for this change
`split_part` currently accepts `Utf8View` but always returns `Utf8`.
When given `Utf8View` input, it should instead return `Utf8View` output.
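
The intended type mapping can be sketched as below. This is only an illustration: `StrType` and `split_part_return_type` are hypothetical names standing in for the real Arrow data types and DataFusion's return-type logic, and only the two types named in this description are modeled.

```rust
/// Hypothetical stand-in for the string data types discussed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StrType {
    Utf8,
    Utf8View,
}

/// Sketch of the return-type rule after this change: the output type
/// follows the input type, so `Utf8View` input is no longer reported
/// as `Utf8` output.
fn split_part_return_type(input: StrType) -> StrType {
    match input {
        StrType::Utf8View => StrType::Utf8View,
        StrType::Utf8 => StrType::Utf8,
    }
}
```
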
While we're at it, optimize `split_part` for single-character delimiters
(the common case): `str::split(&str)` is significantly slower than
`str::split(char)` for single-character ASCII delimiters, because the
former uses a general string matching algorithm but the latter uses
`memchr::memchr`.
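
A minimal sketch of the single-character fast path, using only the standard library. `nth_part` is a hypothetical helper, not the code in this PR; it handles positive 1-based positions only and does not model empty-delimiter semantics.

```rust
/// Hypothetical helper: returns the `n`-th (1-based) part of `s` split on
/// `delimiter`, or "" when `n` is 0 or past the last part.
fn nth_part<'a>(s: &'a str, delimiter: &str, n: usize) -> &'a str {
    // Convert the 1-based position to a 0-based index; 0 has no part.
    let Some(idx) = n.checked_sub(1) else {
        return "";
    };
    let mut chars = delimiter.chars();
    match (chars.next(), chars.next()) {
        // Exactly one character: split with a `char` pattern.
        (Some(c), None) => s.split(c).nth(idx).unwrap_or(""),
        // Anything else: split with the general `&str` pattern.
        _ => s.split(delimiter).nth(idx).unwrap_or(""),
    }
}
```

For example, `nth_part("a,b,c", ",", 2)` takes the `char` branch, while `nth_part("a<>b<>c", "<>", 3)` takes the `&str` branch. Both branches return the same parts; only the search strategy differs.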
Benchmark results (M4 Max):
- `utf8_single_char/pos_first`: 142 µs → 104 µs (-26%)
- `utf8_single_char/pos_middle`: 389 µs → 365 µs (-6%)
- `utf8_single_char/pos_negative`: 154 µs → 109 µs (-29%)
- `utf8_multi_char/pos_middle`: 356 µs → 361 µs (~0%, noise)
- `utf8view_single_char/pos_first`: 143 µs → 111 µs (-22%)
- `utf8_long_strings/pos_middle`: 2568 µs → 1984 µs (-23%)
- `utf8view_long_parts/pos_middle`: 998 µs → 470 µs (-53%)
## What changes are included in this PR?
* Revise `split_part` benchmarks to reduce redundancy and improve
`Utf8View` coverage
* Support `Utf8View` -> `Utf8View` in `split_part`
* Refactor `split_part` to clean up some redundant code
* Optimize `split_part` for single-character delimiters
* Add SLT test coverage for `split_part` with `Utf8View` input
## Are these changes tested?
Yes. New tests and benchmarks added.
## Are there any user-facing changes?
No.
1 parent: e913557 (commit 9f893a4)
3 files changed (+297, -384 lines) under `datafusion`:
- `functions/benches`
- `functions/src/string`
- `sqllogictest/test_files/string`