Skip to content

Array has optim backport#151

Open
freakyzoidberg wants to merge 2 commits into
branch-54from
array_has_optim_backport
Open

Array has optim backport#151
freakyzoidberg wants to merge 2 commits into
branch-54from
array_has_optim_backport

Conversation

@freakyzoidberg

Copy link
Copy Markdown
Member

backport of freakyzoidberg#1
see linked PR for details

freakyzoidberg and others added 2 commits June 29, 2026 13:25
`array_has(array, element)` returns, for each row, whether the array contains
the element. When the `element` (needle) is an array rather than a scalar -- the
needle argument is a column with one value per row, e.g. `array_has(t1.tags,
t2.key)` in a join filter -- execution goes through `array_has_dispatch_for_array`
(the `ColumnarValue::Array` needle branch), which compared each row by invoking
the Arrow `eq` kernel once per row. That kernel allocates a `BooleanArray` and
pays downcast and dispatch overhead on every row. (The scalar-needle branch was
optimized separately in apache#20374.)

Add a fast path for primitive and string element types, preserving the Arrow
`eq` kernel semantics (total-order float equality; null elements never match).
With all-valid elements each row is a single branchless OR-reduction over the
native values. When primitive elements contain nulls -- whose backing values
are arbitrary -- the per-element equality bitmap is ANDed with the validity
bitmap (one word-parallel op, no per-element branch) before the per-row
reduction, so null slots never match regardless of their value. Past a moderate
average list length (`NULL_FAST_PATH_MAX_LEN`) this bitmap's extra passes lose to
the per-row kernel, so the element-null branch bails to it there (an O(1)
check); the all-valid fold has no such crossover. Nested and other element types
keep using the per-row `eq` kernel.

What this removes is the fixed per-row kernel overhead, not the element
comparison itself, so the gain is largest for short lists and shrinks as lists
grow. Criterion microbenchmark, array needle on a 100k-row batch vs the per-row
`eq` kernel (NOT end-to-end query time):

    i64, all-valid elements:  ~13x (64-elem lists), ~1.9x at 1024-elem
    i64, ~30% null elements:  ~10x (8-elem) ... ~1x (512-elem); falls back beyond
    string elements:          ~1.45x

Every shape improves with no regression.

End-to-end impact depends on how much of a query `array_has` accounts for. For a
query dominated by an array-needle `array_has` join filter (a `NestedLoopJoinExec`
with `filter=array_has(tags, key)` over 3000x3000 rows of 8-element lists) total
time drops from 0.95s to 0.059s (~16x, identical results). For a workload where
`array_has` is a smaller fraction -- e.g. the ~6% of profile that motivated this
(see apache#18070 / apache#18161, which fixed the join's deep-copy but left the per-row
`array_has` cost) -- the overall speedup is single-digit percent.

Part of apache#18727.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
BooleanBuffer::has_true() was added in arrow 59; branch-54 is on arrow 58.3, so the element-null per-row reduction uses count_set_bits() > 0 instead. No behavior change (17 array_has unit tests pass, clippy clean).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant