Skip to content

Commit d5fb1e5

Browse files
HappenLeeCopilot
andauthored
[fix](be) Fix NOT_IMPLEMENTED_ERROR for length() on dict-encoded varchar columns (#63498)
Problem Summary: `SELECT length(col_varchar)` on a low-cardinality (dict-encoded) column failed with: NOT_IMPLEMENTED_ERROR: Method insert_offsets_from_lengths is not supported Root cause (two-step): 1. When `enable_low_cardinality_optimize=true`, the predicate column is created as `ColumnDictI32` by `Schema::get_predicate_column_ptr()`. The `only_read_offsets` code path in `BinaryDictPageDecoder` resolves dict codes to lengths and calls `dst->insert_offsets_from_lengths()`, but `ColumnDictI32` does not implement that method. 2. After converting `ColumnDictI32` to `PredicateColumnType<TYPE_STRING>` via `convert_to_predicate_column_if_dictionary()`, the same call failed again because `PredicateColumnType` also lacked the implementation. Fix: - `binary_dict_page.cpp`: call `convert_to_predicate_column_if_dictionary()` on `dst` before invoking `insert_offsets_from_lengths` in both `next_batch()` and `read_by_rowids()` (covers dict-encoded pages and the plain-encoded fallback path). - `predicate_column.h`: implement `insert_offsets_from_lengths` for the `StringRef` specialisation of `PredicateColumnType`. A single backing buffer is allocated from the internal Arena and zero-filled; each element records the correct length so that downstream `filter_by_selector` / `copy_column_data_by_selector` can materialise the right-sized strings into the output `ColumnString`, giving the correct result for `length()`. ### Release note Fix crash/error when calling `length()` on a varchar/char column that uses the low-cardinality (dictionary) optimisation. ### Check List (For Author) - Test: Manual test (query reproduced and confirmed fixed) - Behavior changed: No - Does this need documentation: No ### What problem does this PR solve? Issue Number: close #xxx Related PR: #xxx Problem Summary: ### Release note None ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into --> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 1e75bfc commit d5fb1e5

4 files changed

Lines changed: 204 additions & 0 deletions

File tree

be/src/core/column/predicate_column.h

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -293,6 +293,41 @@ class PredicateColumnType final : public COWHelper<IColumn, PredicateColumnType<
293293
}
294294
}
295295

296+
// Insert `num` entries with only length information (no actual char data).
297+
// The chars buffer is zero-filled so that filter_by_selector can safely
298+
// memcpy without reading meaningful content. Used in OFFSET_ONLY reading
299+
// mode where only string lengths (for length() function) are needed.
300+
void insert_offsets_from_lengths(const uint32_t* lengths, size_t num) override {
301+
if constexpr (std::is_same_v<T, StringRef>) {
302+
if (UNLIKELY(num == 0)) {
303+
return;
304+
}
305+
size_t total_bytes = 0;
306+
for (size_t i = 0; i < num; ++i) {
307+
total_bytes += lengths[i];
308+
}
309+
// Allocate and zero-fill a single backing buffer so that each StringRef
310+
// points to valid (though meaningless) memory. filter_by_selector will
311+
// memcpy from these pointers, so they must not be null for non-zero lengths.
312+
char* buf = total_bytes > 0 ? _arena.alloc(total_bytes) : nullptr;
313+
if (total_bytes > 0) {
314+
memset(buf, 0, total_bytes);
315+
}
316+
size_t org_elem_num = data.size();
317+
data.resize(org_elem_num + num);
318+
size_t offset = 0;
319+
for (size_t i = 0; i < num; ++i) {
320+
// For zero-length strings, data pointer is null; insert_many_strings
321+
// and filter_by_selector both guard on size > 0 before dereferencing.
322+
data[org_elem_num + i].data = (lengths[i] > 0) ? (buf + offset) : nullptr;
323+
data[org_elem_num + i].size = lengths[i];
324+
offset += lengths[i];
325+
}
326+
} else {
327+
IColumn::insert_offsets_from_lengths(lengths, num);
328+
}
329+
}
330+
296331
void insert_default() override { data.push_back(T()); }
297332

298333
void clear() override {

be/src/storage/segment/binary_dict_page.cpp

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -290,6 +290,10 @@ Status BinaryDictPageDecoder::next_batch(size_t* n, MutableColumnPtr& dst) {
290290
if (_options.only_read_offsets) {
291291
// OFFSET_ONLY mode: resolve dict codes to get real string lengths
292292
// without copying actual char data. This allows length() to work.
293+
// ColumnDictI32 does not implement insert_offsets_from_lengths, so convert
294+
// it to a predicate column (ColumnString) first. This is a no-op for
295+
// non-dictionary columns and for ColumnNullable it converts the nested column.
296+
dst = dst->convert_to_predicate_column_if_dictionary();
293297
const auto* data_array = reinterpret_cast<const int32_t*>(_bit_shuffle_ptr->get_data(0));
294298
size_t start_index = _bit_shuffle_ptr->_cur_index;
295299
// Reuse _buffer (int32_t vector) to store uint32_t lengths.
@@ -334,6 +338,9 @@ Status BinaryDictPageDecoder::read_by_rowids(const rowid_t* rowids, ordinal_t pa
334338
if (_options.only_read_offsets) {
335339
// OFFSET_ONLY mode: resolve dict codes to get real string lengths
336340
// without copying actual char data. This allows length() to work correctly.
341+
// ColumnDictI32 does not implement insert_offsets_from_lengths, so convert
342+
// it to a predicate column (ColumnString) first.
343+
dst = dst->convert_to_predicate_column_if_dictionary();
337344
const auto* data_array = reinterpret_cast<const int32_t*>(_bit_shuffle_ptr->get_data(0));
338345
size_t read_count = 0;
339346
_buffer.resize(total);
Lines changed: 55 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,55 @@
1+
-- This file is automatically generated. You should know what you did if you want to edit this
2+
-- !length_nullable_not_null --
3+
1
4+
1
5+
1
6+
1
7+
1
8+
2
9+
10+
-- !length_not_null_col --
11+
1
12+
1
13+
1
14+
1
15+
1
16+
1
17+
1
18+
1
19+
20+
-- !length_with_nulls --
21+
\N 50
22+
\N 51
23+
1 24
24+
1 28
25+
1 30
26+
1 41
27+
1 5
28+
2 60
29+
30+
-- !char_length_nullable_not_null --
31+
1
32+
1
33+
1
34+
1
35+
1
36+
2
37+
38+
-- !length_nullable_is_null --
39+
\N
40+
\N
41+
42+
-- !char_length_all_rows --
43+
\N 50
44+
\N 51
45+
1 24
46+
1 28
47+
1 30
48+
1 41
49+
1 5
50+
2 60
51+
52+
-- !char_length_nullable_is_null --
53+
\N
54+
\N
55+
Lines changed: 107 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,107 @@
1+
// Licensed to the Apache Software Foundation (ASF) under one
2+
// or more contributor license agreements. See the NOTICE file
3+
// distributed with this work for additional information
4+
// regarding copyright ownership. The ASF licenses this file
5+
// to you under the Apache License, Version 2.0 (the
6+
// "License"); you may not use this file except in compliance
7+
// with the License. You may obtain a copy of the License at
8+
//
9+
// http://www.apache.org/licenses/LICENSE-2.0
10+
//
11+
// Unless required by applicable law or agreed to in writing,
12+
// software distributed under the License is distributed on an
13+
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
// KIND, either express or implied. See the License for the
15+
// specific language governing permissions and limitations
16+
// under the License.
17+
18+
// Regression test for: length() / char_length() on dict-encoded (low-cardinality)
19+
// varchar columns must not throw NOT_IMPLEMENTED_ERROR when the "only_read_offsets"
20+
// optimisation is active (i.e. when the storage layer resolves dict codes to string
21+
// lengths without materialising actual character data).
22+
suite("test_length_dict_encoded") {
23+
sql "DROP TABLE IF EXISTS test_length_dict_varchar"
24+
sql """
25+
CREATE TABLE test_length_dict_varchar (
26+
col_int_undef_signed int NULL,
27+
col_int_undef_signed_not_null int NOT NULL,
28+
col_date_undef_signed date NULL,
29+
col_date_undef_signed_not_null date NOT NULL,
30+
col_varchar_5__undef_signed varchar(5) NULL,
31+
col_varchar_5__undef_signed_not_null varchar(5) NOT NULL,
32+
pk int NULL
33+
) ENGINE=OLAP
34+
DUPLICATE KEY(col_int_undef_signed)
35+
PARTITION BY RANGE(col_int_undef_signed) (
36+
PARTITION p0 VALUES [('-2147483648'), ('4')),
37+
PARTITION p1 VALUES [('4'), ('6')),
38+
PARTITION p2 VALUES [('6'), ('7')),
39+
PARTITION p3 VALUES [('7'), ('8')),
40+
PARTITION p4 VALUES [('8'), ('10')),
41+
PARTITION p5 VALUES [('10'), ('83647')),
42+
PARTITION p100 VALUES [('83647'), ('2147483647'))
43+
)
44+
DISTRIBUTED BY HASH(pk) BUCKETS 10
45+
PROPERTIES ('replication_allocation' = 'tag.location.default: 1')
46+
"""
47+
48+
sql """
49+
INSERT INTO test_length_dict_varchar VALUES
50+
(6, 5, '2023-12-13', '2023-12-11', 'o', 'i', 30),
51+
(6, 6, NULL, '2023-12-18', 'w', 'l', 24),
52+
(8, -8278102, '2023-12-13', '2023-12-11', 'x', 'c', 28),
53+
(15971, 8, NULL, '2015-06-11', 'h', 'r', 41),
54+
(6, 5, '2023-12-11', '2023-12-17', 'd', 'q', 5),
55+
(7, 100, '2023-12-14', '2023-12-15', NULL, 'a', 50),
56+
(7, 101, '2023-12-15', '2023-12-16', NULL, 'b', 51),
57+
(9, 200, NULL, '2023-12-17', 'ab', 'd', 60)
58+
"""
59+
60+
// length() on nullable dict-encoded varchar filtered by IS NOT NULL predicate
61+
order_qt_length_nullable_not_null """
62+
SELECT length(col_varchar_5__undef_signed)
63+
FROM test_length_dict_varchar
64+
WHERE col_varchar_5__undef_signed IS NOT NULL
65+
"""
66+
67+
// length() on NOT NULL dict-encoded varchar (no predicate needed)
68+
order_qt_length_not_null_col """
69+
SELECT length(col_varchar_5__undef_signed_not_null)
70+
FROM test_length_dict_varchar
71+
"""
72+
73+
// length() returns NULL for NULL varchar values
74+
order_qt_length_with_nulls """
75+
SELECT length(col_varchar_5__undef_signed), pk
76+
FROM test_length_dict_varchar
77+
ORDER BY pk
78+
"""
79+
80+
// char_length() is a synonym and must work as well
81+
order_qt_char_length_nullable_not_null """
82+
SELECT char_length(col_varchar_5__undef_signed)
83+
FROM test_length_dict_varchar
84+
WHERE col_varchar_5__undef_signed IS NOT NULL
85+
"""
86+
87+
// length() returns NULL for NULL values — verify NULL passthrough
88+
order_qt_length_nullable_is_null """
89+
SELECT length(col_varchar_5__undef_signed)
90+
FROM test_length_dict_varchar
91+
WHERE col_varchar_5__undef_signed IS NULL
92+
"""
93+
94+
// char_length() on all rows including NULLs — verify dict code path with mixed null/non-null
95+
order_qt_char_length_all_rows """
96+
SELECT char_length(col_varchar_5__undef_signed), pk
97+
FROM test_length_dict_varchar
98+
ORDER BY pk
99+
"""
100+
101+
// char_length() on NULL-only rows
102+
order_qt_char_length_nullable_is_null """
103+
SELECT char_length(col_varchar_5__undef_signed)
104+
FROM test_length_dict_varchar
105+
WHERE col_varchar_5__undef_signed IS NULL
106+
"""
107+
}

0 commit comments

Comments
 (0)