Skip to content

[Enhancement] Support Arrow StringView (Utf8View) type in Arrow-to-StarRocks converter#71306

Merged
stdpain merged 8 commits into
StarRocks:mainfrom
metegenez:string_view_arrow
Apr 15, 2026
Merged

[Enhancement] Support Arrow StringView (Utf8View) type in Arrow-to-StarRocks converter#71306
stdpain merged 8 commits into
StarRocks:mainfrom
metegenez:string_view_arrow

Conversation

@metegenez
Copy link
Copy Markdown
Contributor

@metegenez metegenez commented Apr 5, 2026

Why I'm doing:

Modern Arrow clients like DataFusion, DuckDB, and Polars produce StringView (Utf8View) arrays as their default string representation. When these clients send data to StarRocks over Arrow Flight SQL, StarRocks fails with an unrecognized type error because the Arrow converter has no handler for StringView.

What I'm doing:

This PR adds StringView support to the Arrow-to-StarRocks conversion layer. The core challenge is that StringView uses a fundamentally different memory layout from traditional Arrow String/Binary types. Traditional types store all string data in one contiguous buffer with an offsets array. StringView instead uses a 16-byte view struct per element: strings up to 12 bytes are stored inline in the struct itself, while longer strings reference data in one of potentially many variadic data buffers (a new buffer is allocated roughly every 32KB).

Because of this scattered layout, the existing bulk memcpy fast path (optimize_non_fixed_size_binary) cannot be reused. It assumes a single contiguous data buffer and would silently corrupt out-of-line strings. Instead, the converter iterates per element using Arrow's GetView(i) API, which transparently resolves both inline and out-of-line storage. This is the same approach used by DuckDB, Velox, and Arrow's own internal cast kernels.

An alternative would be to first cast StringView to a regular String array using Arrow's compute library, then let the existing Binary/String converter handle it. We intentionally avoid this because it copies the data twice: once into a temporary contiguous Arrow String array, and again into StarRocks' BinaryColumn. The direct approach copies once.

The changes:

  1. Register STRING_VIEW type traits and a standalone StringViewATGuard in arrow_type_traits.h. The guard is kept separate from BinaryATGuard so template dispatch routes StringView to the safe per-element path instead of the bulk memcpy path.

  2. Add a new ArrowConverter template specialization for StringViewATGuard that handles VARCHAR and CHAR conversion. It uses GetView(i) per element, skips null slots via pre-filled null_data, and enforces max length with the same nullable/strict branching as the existing Binary/String converter.

  3. Register StringView in both global_strict_arrow_conv_table (TYPE_VARCHAR entry) and global_optimized_arrow_conv_table (TYPE_CHAR + TYPE_VARCHAR entry) so the converter is reachable in both execution modes.

  4. Add compile-time static_assert tests verifying that STRING_VIEW type traits resolve correctly and that StringView is excluded from BinaryATGuard.

  5. Add five runtime unit tests covering inline strings (0-12 bytes), out-of-line strings (13+ bytes), multi-buffer batches (>32KB forcing multiple variadic buffers), nullable arrays with interspersed nulls, and CHAR conversion with length enforcement.

Fixes #71280

What type of PR is this:

  • BugFix
  • [] Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5
    • 3.4

@alvin-phoenix-ai
Copy link
Copy Markdown
Contributor

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 541f133258

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread be/src/exec/arrow_to_starrocks_converter.cpp
@metegenez metegenez force-pushed the string_view_arrow branch from 541f133 to e1ab33e Compare April 5, 2026 20:19
@chatgpt-codex-connector
Copy link
Copy Markdown

To use Codex here, create a Codex account and connect to github.

@stdpain stdpain self-assigned this Apr 6, 2026
@metegenez metegenez force-pushed the string_view_arrow branch from c863e0e to e45cfa0 Compare April 8, 2026 13:16
@metegenez
Copy link
Copy Markdown
Contributor Author

I fixed the BE coverage problem, but it didnt fire the new coverage report, fyi.

@metegenez
Copy link
Copy Markdown
Contributor Author

I have no idea to resolve DCO problem on a commit automatically added, not by me. Also, failing BE architecture check, I have no idea what it is :D Can someone help me on that?

@metegenez
Copy link
Copy Markdown
Contributor Author

@stdpain do you have some improvement suggestions?

@alvin-phoenix-ai
Copy link
Copy Markdown
Contributor

@mergify rebase
@metegenez could you check the BE Architecture failure?

metegenez and others added 8 commits April 11, 2026 16:38
… arrow_type_traits.h

- Add M_ArrowTypeIdToTypeStruct(ArrowTypeId::STRING_VIEW, arrow::StringViewType)
- Add VALUE_GUARD(ArrowTypeId, StringViewATGuard, at_is_string_view, ArrowTypeId::STRING_VIEW)
- Add explicit CppType specialization mapping STRING_VIEW to std::string_view
- Add #include <string_view> for robustness
- STRING_VIEW intentionally excluded from BinaryATGuard

Signed-off-by: Metehan Yildirim <metegenez@gmail.com>
…e traits

- static_assert ArrowTypeIdToArrayType<STRING_VIEW> == arrow::StringViewArray
- static_assert ArrowTypeIdToCppType<STRING_VIEW> == std::string_view
- static_assert at_is_string_view<STRING_VIEW> is true
- static_assert at_is_binary<STRING_VIEW> is false (guard isolation)
- All assertions compile cleanly, proving TYPR-01 and TYPR-02

Signed-off-by: Metehan Yildirim <metegenez@gmail.com>
- Add ArrowConverter<AT, LT, is_nullable, is_strict, StringViewATGuard<AT>, StringOrBinaryGuard<LT>> specialization
- Uses per-element GetView(i) loop (not optimize_non_fixed_size_binary bulk path)
- Handles nullable slots via pre-filled null_data array
- Enforces binary_max_length<LT> with nullable/strict mode branching
- Reports length violations via ctx->report_error_message with reported flag

Signed-off-by: Metehan Yildirim <metegenez@gmail.com>
- Add ArrowTypeId::STRING_VIEW to TYPE_VARCHAR entry in global_strict_arrow_conv_table (INBD-05)
- Add ARROW_CONV_ENTRY(ArrowTypeId::STRING_VIEW, TYPE_CHAR, TYPE_VARCHAR) in global_optimized_arrow_conv_table (INBD-06)
- Enables get_strict_type(STRING_VIEW) -> TYPE_VARCHAR and get_arrow_converter(STRING_VIEW, ...) -> non-null ConvertFunc

Signed-off-by: Metehan Yildirim <metegenez@gmail.com>
- Add create_string_view_array helper using arrow::StringViewBuilder
- TEST-01: inline strings (0, 1, 11, 12 bytes) via StringView inline path
- TEST-02: out-of-line strings (13, 20, 50, 100 bytes) via variadic buffer
- TEST-03: multi-buffer (600 x 64-byte = 38,400 bytes > 32KB) verifies second variadic buffer correctness

Signed-off-by: Metehan Yildirim <metegenez@gmail.com>
Signed-off-by: Metehan Yildirim <metegenez@gmail.com>

style: auto-fix clang-format issues
…, and ctx overflow paths

Signed-off-by: Metehan Yildirim <metegenez@gmail.com>
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 11, 2026

Deprecation notice: This pull request comes from a fork and was rebased using bot_account impersonation. This capability will be removed on July 1, 2026. After this date, the rebase action will no longer be able to rebase fork pull requests with this configuration. Please switch to the update action/command to ensure compatibility going forward.

@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented Apr 11, 2026

rebase

✅ Branch has been successfully rebased

@github-actions
Copy link
Copy Markdown
Contributor

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions
Copy link
Copy Markdown
Contributor

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions
Copy link
Copy Markdown
Contributor

[BE Incremental Coverage Report]

pass : 35 / 35 (100.00%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 be/src/exec/arrow_to_starrocks_converter.cpp 35 35 100.00% []

@metegenez
Copy link
Copy Markdown
Contributor Author

@alvin-celerdata It didn't fire again.

CI PIPELINE / BE Architecture (pull_request)Successful in 46s

@stdpain stdpain merged commit 902b1cc into StarRocks:main Apr 15, 2026
53 of 54 checks passed
robd003 pushed a commit to robd003/starrocks that referenced this pull request May 6, 2026
…arRocks converter (StarRocks#71306)

Signed-off-by: Metehan Yildirim <metegenez@gmail.com>
Co-authored-by: wanpengfei-git <wanpengfei91@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enhancement] Support Arrow StringView (Utf8View) and BinaryView types in Arrow-to-StarRocks converter

5 participants