[refine](column) add ColumnArrayView utility for Array<scalar> column access#62242
[refine](column) add ColumnArrayView utility for Array<scalar> column access#62242Mryange wants to merge 5 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Findings:
be/src/core/column/column_array_view.hnow unconditionally castsarray_column.get_data()toColumnNullable. That regresses plainArray<T>inputs:DataTypeArray::create_column()still buildsColumnArray(nested->create_column(), ...), and the oldextract_column_array_info()handled both nullable and non-nullable nested storage. As a result,array_join(Array<String>)and the array-distance functions will now fail on non-nullable arrays.ArrayDataView::get_data()turns an empty nested column intonullptr + offset. When the whole nested column is empty,ColumnElementView::get_data()returnsnullptr, andFunctionArrayDistancenow callsget_data()unconditionally for every row. Inputs like[]/const([])therefore hit undefined behavior (and typically ASAN) even though the previous code path usedPaddedPODArray::data()and remained valid for empty arrays.
Critical checkpoints:
- Goal of the task: introduce a reusable zero-copy array view and migrate selected array functions. The abstraction is mostly in place, but the current implementation does not preserve the old behavior for plain
Array<T>or empty-array distance calls, so the task is not completed correctly yet. - Small, clear, focused change: yes overall; the refactor is localized to the helper plus two function call sites.
- Concurrency: not involved.
- Lifecycle/static initialization: not involved.
- Configuration changes: none.
- Compatibility/incompatible changes: no FE/BE protocol or storage-format changes, but there is a runtime compatibility regression for valid existing SQL inputs using non-nullable arrays.
- Parallel code paths: only
array_joinand array-distance were migrated. That is acceptable for a staged refactor, but it makes preserving old semantics in the shared helper especially important. - Special conditional checks: the new unconditional inner-nullable cast lacks a proven invariant and is not safe given the existing
DataTypeArray/extract_column_array_info()behavior. - Test coverage: insufficient. The new tests cover nullable-backed arrays and some const/outer-nullable forms, but they miss plain non-nullable
Array<String>/Array<Float>and do not exerciseFunctionArrayDistanceon empty arrays. - Observability: no additional observability is needed for this refactor.
- Transaction/persistence: not involved.
- Data write/atomicity: not involved.
- FE/BE variable passing: not involved.
- Performance review: benchmarks are useful, but correctness regressions must be fixed before the benchmark results matter.
- Other issues: none beyond the two correctness problems above.
Summary opinion: this should not merge as-is because the shared helper changes runtime behavior for existing valid array inputs.
| const auto& array_column = assert_cast<const ColumnArray&>(*array_raw); | ||
|
|
||
| // Step 4: unpack inner nullable (data column is always Nullable) | ||
| const auto& nested_nullable = assert_cast<const ColumnNullable&>(array_column.get_data()); |
There was a problem hiding this comment.
This hard-casts every array's nested storage to ColumnNullable, but Doris still produces plain ColumnArray(nested->create_column(), ...) for non-nullable Array<T> in DataTypeArray::create_column(), and the old extract_column_array_info() handled both shapes. That means array_join(Array<String>) / l2_distance(Array<Float>, ...) now regress on ordinary non-nullable arrays.
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run p0 |
|
run external |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
224be17 to
52c116f
Compare
|
/review |
There was a problem hiding this comment.
Findings:
be/src/core/column/column_array_view.hnow hardcodes the array payload asColumnNullable, but Doris still represents validArray<T NOT NULL>values with a plain nested column (DataTypeArray::create_column()returnsColumnArray::create(nested->create_column(), ...)). The old callers handled both shapes. With the new helper,ColumnArrayView::create()willassert_castfail as soon as a non-nullable array reachesarray_joinor the distance functions. This is a behavioral regression for existing queries such as the*_notnullarray_joinregression cases.
Critical checkpoints:
- Goal of the task: introduce a reusable zero-copy array view and migrate selected array functions. The abstraction itself is useful, but the current implementation does not preserve existing
Array<T>behavior, so the task is not completed correctly yet. - Is the modification small, clear, and focused: yes. The refactor is localized, which makes the regression easy to fix in the shared helper.
- Concurrency: not involved.
- Lifecycle/static initialization: not involved.
- Configuration changes: none.
- Incompatible changes: no storage/protocol changes, but there is a runtime compatibility regression for existing valid SQL using non-nullable arrays.
- Parallel code paths: only
array_joinand array-distance were migrated. That is acceptable for a staged refactor, but the shared helper must preserve the semantics already handled byextract_column_array_info()and the previous array-distance code. - Special conditional checks: the unconditional inner
ColumnNullablecast lacks a valid invariant; existing Doris array code explicitly checks both nullable and non-nullable nested layouts. - Test coverage: insufficient for this refactor. The new unit tests only build nullable-backed array payloads, so they miss plain
Array<T NOT NULL>cases, and the migrated functions are not covered by new regression/UT assertions for that shape. - Test result files: not applicable.
- Observability: no additional observability is needed here.
- Transaction/persistence: not involved.
- Data writes/atomicity: not involved.
- FE/BE variable passing: not involved.
- Performance: benchmark work is fine, but correctness compatibility needs to be restored first.
- Other issues: none beyond the blocking compatibility regression above.
Summary opinion: requesting changes because the new shared helper breaks valid non-nullable array inputs on the migrated execution paths.
| const auto& array_column = assert_cast<const ColumnArray&>(*array_raw); | ||
|
|
||
| // Step 4: unpack inner nullable (data column is always Nullable) | ||
| const auto& nested_nullable = assert_cast<const ColumnNullable&>(array_column.get_data()); |
There was a problem hiding this comment.
ColumnArray does not guarantee that its nested payload is always ColumnNullable. DataTypeArray::create_column() still builds ColumnArray::create(nested->create_column(), ...), so Array<T NOT NULL> reaches BE with a plain nested column, and the previous implementations handled that shape explicitly. This unconditional assert_cast<const ColumnNullable&> therefore regresses existing valid inputs: for example the array_join(..._notnull) regression cases now crash/fail as soon as they hit ColumnArrayView::create(). The helper needs to preserve both layouts, e.g. by checking whether array_column.get_data() is nullable and only wiring nested_null_map when it actually exists.
What problem does this PR solve?
Problem Summary:
Array function implementations (
function_array_element.h,function_array_join.h,function_array_distance.h, etc.) share a common boilerplate pattern: manually unwrappingColumnConst/ColumnNullable/ColumnArraylayers, extracting offsets, nested null maps, and data pointers. This repeated code is error-prone and hard to maintain.This PR introduces
ColumnArrayView<PType>— a lightweight, zero-copy, read-only view that automatically handles Const/Nullable/Plain unwrapping and provides a clean index-based API for accessing array elements.Changes
New utility:
be/src/core/column/column_array_view.hArrayDataView<PType>: represents a single row's array slice withsize(),value_at(idx),is_null_at(idx)ColumnArrayView<PType>: column-level view withcreate(ColumnPtr)factory,operator[](row)returningArrayDataView,is_null_at(row),size()ColumnConst,ColumnNullable(ColumnArray), and plainColumnArraytransparentlyApplied to:
function_array_join.hextract_column_array_info/ColumnArrayExecutionDatawithColumnArrayView<TYPE_STRING>::create()function_array_utils.h_execute_stringsignature from 6 params to 4Unit tests:
column_array_view_test.cpp— 8 test cases covering Plain, Nullable elements, outer Nullable, Const, Const+Nullable, empty arrays, and String arrays.Benchmarks: Two benchmark files comparing
ColumnArrayViewvs hand-written access:benchmark_column_array_view.hpp— element-level access (Int64/String/Const/Nullable)benchmark_column_array_view_distance.hpp— simulated array distance computation (128-dim Float32 L2)Performance Impact
Benchmarked on 192-core x86_64, RELEASE build, 3 repetitions each.
ColumnArrayViewvs hand-written:Element access (4096 rows × 8 elements):
Array distance (4096 rows × 128-dim L2):
Conclusion: For compute-heavy workloads (distance, string operations), the view overhead is negligible (<1.5%). The
WithNullscase shows +22% regression due tooffset + idxindirection inis_null_at()vs direct flat index — this only affects inner-loop null checking of individual elements. In real array functions, this is typically dominated by per-element computation cost.Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)