[Enhancement] Support Arrow StringView (Utf8View) type in Arrow-to-StarRocks converter#71306
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 541f133258
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
541f133 to
e1ab33e
Compare
|
To use Codex here, create a Codex account and connect to github. |
c863e0e to
e45cfa0
Compare
|
I fixed the BE coverage problem, but it didnt fire the new coverage report, fyi. |
|
I have no idea to resolve DCO problem on a commit automatically added, not by me. Also, failing BE architecture check, I have no idea what it is :D Can someone help me on that? |
|
@stdpain do you have some improvement suggestions? |
|
@mergify rebase |
… arrow_type_traits.h - Add M_ArrowTypeIdToTypeStruct(ArrowTypeId::STRING_VIEW, arrow::StringViewType) - Add VALUE_GUARD(ArrowTypeId, StringViewATGuard, at_is_string_view, ArrowTypeId::STRING_VIEW) - Add explicit CppType specialization mapping STRING_VIEW to std::string_view - Add #include <string_view> for robustness - STRING_VIEW intentionally excluded from BinaryATGuard Signed-off-by: Metehan Yildirim <metegenez@gmail.com>
…e traits - static_assert ArrowTypeIdToArrayType<STRING_VIEW> == arrow::StringViewArray - static_assert ArrowTypeIdToCppType<STRING_VIEW> == std::string_view - static_assert at_is_string_view<STRING_VIEW> is true - static_assert at_is_binary<STRING_VIEW> is false (guard isolation) - All assertions compile cleanly, proving TYPR-01 and TYPR-02 Signed-off-by: Metehan Yildirim <metegenez@gmail.com>
- Add ArrowConverter<AT, LT, is_nullable, is_strict, StringViewATGuard<AT>, StringOrBinaryGuard<LT>> specialization - Uses per-element GetView(i) loop (not optimize_non_fixed_size_binary bulk path) - Handles nullable slots via pre-filled null_data array - Enforces binary_max_length<LT> with nullable/strict mode branching - Reports length violations via ctx->report_error_message with reported flag Signed-off-by: Metehan Yildirim <metegenez@gmail.com>
- Add ArrowTypeId::STRING_VIEW to TYPE_VARCHAR entry in global_strict_arrow_conv_table (INBD-05) - Add ARROW_CONV_ENTRY(ArrowTypeId::STRING_VIEW, TYPE_CHAR, TYPE_VARCHAR) in global_optimized_arrow_conv_table (INBD-06) - Enables get_strict_type(STRING_VIEW) -> TYPE_VARCHAR and get_arrow_converter(STRING_VIEW, ...) -> non-null ConvertFunc Signed-off-by: Metehan Yildirim <metegenez@gmail.com>
- Add create_string_view_array helper using arrow::StringViewBuilder - TEST-01: inline strings (0, 1, 11, 12 bytes) via StringView inline path - TEST-02: out-of-line strings (13, 20, 50, 100 bytes) via variadic buffer - TEST-03: multi-buffer (600 x 64-byte = 38,400 bytes > 32KB) verifies second variadic buffer correctness Signed-off-by: Metehan Yildirim <metegenez@gmail.com>
Signed-off-by: Metehan Yildirim <metegenez@gmail.com> style: auto-fix clang-format issues
…, and ctx overflow paths Signed-off-by: Metehan Yildirim <metegenez@gmail.com>
|
Deprecation notice: This pull request comes from a fork and was rebased using |
✅ Branch has been successfully rebased |
0dd6a9b to
d26d3b0
Compare
[Java-Extensions Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[FE Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[BE Incremental Coverage Report]✅ pass : 35 / 35 (100.00%) file detail
|
|
@alvin-celerdata It didn't fire again.
|
…arRocks converter (StarRocks#71306) Signed-off-by: Metehan Yildirim <metegenez@gmail.com> Co-authored-by: wanpengfei-git <wanpengfei91@163.com>
Why I'm doing:
Modern Arrow clients like DataFusion, DuckDB, and Polars produce StringView (Utf8View) arrays as their default string representation. When these clients send data to StarRocks over Arrow Flight SQL, StarRocks fails with an unrecognized type error because the Arrow converter has no handler for StringView.
What I'm doing:
This PR adds StringView support to the Arrow-to-StarRocks conversion layer. The core challenge is that StringView uses a fundamentally different memory layout from traditional Arrow String/Binary types. Traditional types store all string data in one contiguous buffer with an offsets array. StringView instead uses a 16-byte view struct per element: strings up to 12 bytes are stored inline in the struct itself, while longer strings reference data in one of potentially many variadic data buffers (a new buffer is allocated roughly every 32KB).
Because of this scattered layout, the existing bulk memcpy fast path (optimize_non_fixed_size_binary) cannot be reused. It assumes a single contiguous data buffer and would silently corrupt out-of-line strings. Instead, the converter iterates per element using Arrow's GetView(i) API, which transparently resolves both inline and out-of-line storage. This is the same approach used by DuckDB, Velox, and Arrow's own internal cast kernels.
An alternative would be to first cast StringView to a regular String array using Arrow's compute library, then let the existing Binary/String converter handle it. We intentionally avoid this because it copies the data twice: once into a temporary contiguous Arrow String array, and again into StarRocks' BinaryColumn. The direct approach copies once.
The changes:
Register STRING_VIEW type traits and a standalone StringViewATGuard in arrow_type_traits.h. The guard is kept separate from BinaryATGuard so template dispatch routes StringView to the safe per-element path instead of the bulk memcpy path.
Add a new ArrowConverter template specialization for StringViewATGuard that handles VARCHAR and CHAR conversion. It uses GetView(i) per element, skips null slots via pre-filled null_data, and enforces max length with the same nullable/strict branching as the existing Binary/String converter.
Register StringView in both global_strict_arrow_conv_table (TYPE_VARCHAR entry) and global_optimized_arrow_conv_table (TYPE_CHAR + TYPE_VARCHAR entry) so the converter is reachable in both execution modes.
Add compile-time static_assert tests verifying that STRING_VIEW type traits resolve correctly and that StringView is excluded from BinaryATGuard.
Add five runtime unit tests covering inline strings (0-12 bytes), out-of-line strings (13+ bytes), multi-buffer batches (>32KB forcing multiple variadic buffers), nullable arrays with interspersed nulls, and CHAR conversion with length enforcement.
Fixes #71280
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: