Skip to content

[refine](column) add ColumnArrayView utility for Array<scalar> column access#62242

Open
Mryange wants to merge 5 commits intoapache:masterfrom
Mryange:util-array-view
Open

[refine](column) add ColumnArrayView utility for Array<scalar> column access#62242
Mryange wants to merge 5 commits intoapache:masterfrom
Mryange:util-array-view

Conversation

@Mryange
Copy link
Copy Markdown
Contributor

@Mryange Mryange commented Apr 8, 2026

What problem does this PR solve?

Problem Summary:

Array function implementations (function_array_element.h, function_array_join.h, function_array_distance.h, etc.) share a common boilerplate pattern: manually unwrapping ColumnConst / ColumnNullable / ColumnArray layers, extracting offsets, nested null maps, and data pointers. This repeated code is error-prone and hard to maintain.

This PR introduces ColumnArrayView<PType> — a lightweight, zero-copy, read-only view that automatically handles Const/Nullable/Plain unwrapping and provides a clean index-based API for accessing array elements.

Changes

New utility: be/src/core/column/column_array_view.h

  • ArrayDataView<PType>: represents a single row's array slice with size(), value_at(idx), is_null_at(idx)
  • ColumnArrayView<PType>: column-level view with create(ColumnPtr) factory, operator[](row) returning ArrayDataView, is_null_at(row), size()
  • Handles ColumnConst, ColumnNullable(ColumnArray), and plain ColumnArray transparently

Applied to: function_array_join.h

  • Replaced extract_column_array_info / ColumnArrayExecutionData with ColumnArrayView<TYPE_STRING>::create()
  • Removed dependency on function_array_utils.h
  • Simplified _execute_string signature from 6 params to 4

Unit tests: column_array_view_test.cpp — 8 test cases covering Plain, Nullable elements, outer Nullable, Const, Const+Nullable, empty arrays, and String arrays.

Benchmarks: Two benchmark files comparing ColumnArrayView vs hand-written access:

  • benchmark_column_array_view.hpp — element-level access (Int64/String/Const/Nullable)
  • benchmark_column_array_view_distance.hpp — simulated array distance computation (128-dim Float32 L2)

Performance Impact

Benchmarked on 192-core x86_64, RELEASE build, 3 repetitions each. ColumnArrayView vs hand-written:

Element access (4096 rows × 8 elements):

Scenario Handwritten ArrayView Diff
Int64 Plain 16,666 ns 16,182 ns -2.9%
String Plain 17,404 ns 17,333 ns -0.4%
Int64 Const 16,094 ns 16,201 ns +0.7%
Int64 Nullable 16,753 ns 16,881 ns +0.8%
Int64 WithNulls 25,662 ns 31,342 ns +22.1%

Array distance (4096 rows × 128-dim L2):

Scenario Handwritten ArrayView Diff
Plain vs Plain 293,386 ns 297,356 ns +1.4%
Const vs Plain 284,318 ns 288,428 ns +1.4%
Nullable vs Plain 291,870 ns 292,372 ns +0.2%

Conclusion: For compute-heavy workloads (distance, string operations), the view overhead is negligible (<1.5%). The WithNulls case shows +22% regression due to offset + idx indirection in is_null_at() vs direct flat index — this only affects inner-loop null checking of individual elements. In real array functions, this is typically dominated by per-element computation cost.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 8, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 80.26% (61/76) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.67% (27383/37171)
Line Coverage 57.31% (295551/515685)
Region Coverage 54.56% (246221/451248)
Branch Coverage 56.21% (106696/189833)

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 9, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 80.00% (40/50) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.00% (20130/37978)
Line Coverage 36.55% (189189/517667)
Region Coverage 32.80% (146832/447662)
Branch Coverage 33.91% (64250/189463)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 95.45% (42/44) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.67% (27399/37192)
Line Coverage 57.27% (295539/516079)
Region Coverage 54.29% (245279/451788)
Branch Coverage 56.10% (106604/190041)

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 9, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 81.71% (67/82) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.00% (20130/37978)
Line Coverage 36.54% (189160/517618)
Region Coverage 32.82% (146900/447652)
Branch Coverage 33.92% (64255/189457)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (76/76) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.64% (27389/37192)
Line Coverage 57.27% (295537/516030)
Region Coverage 54.64% (246847/451778)
Branch Coverage 56.18% (106756/190035)

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 10, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings:

  1. be/src/core/column/column_array_view.h now unconditionally casts array_column.get_data() to ColumnNullable. That regresses plain Array<T> inputs: DataTypeArray::create_column() still builds ColumnArray(nested->create_column(), ...), and the old extract_column_array_info() handled both nullable and non-nullable nested storage. As a result, array_join(Array<String>) and the array-distance functions will now fail on non-nullable arrays.
  2. ArrayDataView::get_data() turns an empty nested column into nullptr + offset. When the whole nested column is empty, ColumnElementView::get_data() returns nullptr, and FunctionArrayDistance now calls get_data() unconditionally for every row. Inputs like [] / const([]) therefore hit undefined behavior (and typically ASAN) even though the previous code path used PaddedPODArray::data() and remained valid for empty arrays.

Critical checkpoints:

  • Goal of the task: introduce a reusable zero-copy array view and migrate selected array functions. The abstraction is mostly in place, but the current implementation does not preserve the old behavior for plain Array<T> or empty-array distance calls, so the task is not completed correctly yet.
  • Small, clear, focused change: yes overall; the refactor is localized to the helper plus two function call sites.
  • Concurrency: not involved.
  • Lifecycle/static initialization: not involved.
  • Configuration changes: none.
  • Compatibility/incompatible changes: no FE/BE protocol or storage-format changes, but there is a runtime compatibility regression for valid existing SQL inputs using non-nullable arrays.
  • Parallel code paths: only array_join and array-distance were migrated. That is acceptable for a staged refactor, but it makes preserving old semantics in the shared helper especially important.
  • Special conditional checks: the new unconditional inner-nullable cast lacks a proven invariant and is not safe given the existing DataTypeArray/extract_column_array_info() behavior.
  • Test coverage: insufficient. The new tests cover nullable-backed arrays and some const/outer-nullable forms, but they miss plain non-nullable Array<String> / Array<Float> and do not exercise FunctionArrayDistance on empty arrays.
  • Observability: no additional observability is needed for this refactor.
  • Transaction/persistence: not involved.
  • Data write/atomicity: not involved.
  • FE/BE variable passing: not involved.
  • Performance review: benchmarks are useful, but correctness regressions must be fixed before the benchmark results matter.
  • Other issues: none beyond the two correctness problems above.

Summary opinion: this should not merge as-is because the shared helper changes runtime behavior for existing valid array inputs.

const auto& array_column = assert_cast<const ColumnArray&>(*array_raw);

// Step 4: unpack inner nullable (data column is always Nullable)
const auto& nested_nullable = assert_cast<const ColumnNullable&>(array_column.get_data());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This hard-casts every array's nested storage to ColumnNullable, but Doris still produces plain ColumnArray(nested->create_column(), ...) for non-nullable Array<T> in DataTypeArray::create_column(), and the old extract_column_array_info() handled both shapes. That means array_join(Array<String>) / l2_distance(Array<Float>, ...) now regress on ordinary non-nullable arrays.

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 10, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 81.71% (67/82) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.00% (20130/37978)
Line Coverage 36.54% (189144/517618)
Region Coverage 32.79% (146798/447649)
Branch Coverage 33.91% (64249/189455)

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 10, 2026

run p0

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 10, 2026

run external

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (76/76) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.69% (27405/37192)
Line Coverage 57.30% (295685/516030)
Region Coverage 54.36% (245580/451775)
Branch Coverage 56.13% (106657/190033)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (76/76) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.70% (27409/37192)
Line Coverage 57.31% (295762/516030)
Region Coverage 54.38% (245671/451775)
Branch Coverage 56.14% (106687/190033)

@Mryange
Copy link
Copy Markdown
Contributor Author

Mryange commented Apr 14, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings:

  1. be/src/core/column/column_array_view.h now hardcodes the array payload as ColumnNullable, but Doris still represents valid Array<T NOT NULL> values with a plain nested column (DataTypeArray::create_column() returns ColumnArray::create(nested->create_column(), ...)). The old callers handled both shapes. With the new helper, ColumnArrayView::create() will assert_cast fail as soon as a non-nullable array reaches array_join or the distance functions. This is a behavioral regression for existing queries such as the *_notnull array_join regression cases.

Critical checkpoints:

  • Goal of the task: introduce a reusable zero-copy array view and migrate selected array functions. The abstraction itself is useful, but the current implementation does not preserve existing Array<T> behavior, so the task is not completed correctly yet.
  • Is the modification small, clear, and focused: yes. The refactor is localized, which makes the regression easy to fix in the shared helper.
  • Concurrency: not involved.
  • Lifecycle/static initialization: not involved.
  • Configuration changes: none.
  • Incompatible changes: no storage/protocol changes, but there is a runtime compatibility regression for existing valid SQL using non-nullable arrays.
  • Parallel code paths: only array_join and array-distance were migrated. That is acceptable for a staged refactor, but the shared helper must preserve the semantics already handled by extract_column_array_info() and the previous array-distance code.
  • Special conditional checks: the unconditional inner ColumnNullable cast lacks a valid invariant; existing Doris array code explicitly checks both nullable and non-nullable nested layouts.
  • Test coverage: insufficient for this refactor. The new unit tests only build nullable-backed array payloads, so they miss plain Array<T NOT NULL> cases, and the migrated functions are not covered by new regression/UT assertions for that shape.
  • Test result files: not applicable.
  • Observability: no additional observability is needed here.
  • Transaction/persistence: not involved.
  • Data writes/atomicity: not involved.
  • FE/BE variable passing: not involved.
  • Performance: benchmark work is fine, but correctness compatibility needs to be restored first.
  • Other issues: none beyond the blocking compatibility regression above.

Summary opinion: requesting changes because the new shared helper breaks valid non-nullable array inputs on the migrated execution paths.

const auto& array_column = assert_cast<const ColumnArray&>(*array_raw);

// Step 4: unpack inner nullable (data column is always Nullable)
const auto& nested_nullable = assert_cast<const ColumnNullable&>(array_column.get_data());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ColumnArray does not guarantee that its nested payload is always ColumnNullable. DataTypeArray::create_column() still builds ColumnArray::create(nested->create_column(), ...), so Array<T NOT NULL> reaches BE with a plain nested column, and the previous implementations handled that shape explicitly. This unconditional assert_cast<const ColumnNullable&> therefore regresses existing valid inputs: for example the array_join(..._notnull) regression cases now crash/fail as soon as they hit ColumnArrayView::create(). The helper needs to preserve both layouts, e.g. by checking whether array_column.get_data() is nullable and only wiring nested_null_map when it actually exists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants