Array has optim backport by freakyzoidberg · Pull Request #151 · DataDog/datafusion

freakyzoidberg · 2026-06-29T15:26:10Z

backport of freakyzoidberg#1
see linked PR for details

`array_has(array, element)` returns, for each row, whether the array contains the element. When the `element` (needle) is an array rather than a scalar -- the needle argument is a column with one value per row, e.g. `array_has(t1.tags, t2.key)` in a join filter -- execution goes through `array_has_dispatch_for_array` (the `ColumnarValue::Array` needle branch), which compared each row by invoking the Arrow `eq` kernel once per row. That kernel allocates a `BooleanArray` and pays downcast and dispatch overhead on every row. (The scalar-needle branch was optimized separately in apache#20374.) Add a fast path for primitive and string element types, preserving the Arrow `eq` kernel semantics (total-order float equality; null elements never match). With all-valid elements each row is a single branchless OR-reduction over the native values. When primitive elements contain nulls -- whose backing values are arbitrary -- the per-element equality bitmap is ANDed with the validity bitmap (one word-parallel op, no per-element branch) before the per-row reduction, so null slots never match regardless of their value. Past a moderate average list length (`NULL_FAST_PATH_MAX_LEN`) this bitmap's extra passes lose to the per-row kernel, so the element-null branch bails to it there (an O(1) check); the all-valid fold has no such crossover. Nested and other element types keep using the per-row `eq` kernel. What this removes is the fixed per-row kernel overhead, not the element comparison itself, so the gain is largest for short lists and shrinks as lists grow. Criterion microbenchmark, array needle on a 100k-row batch vs the per-row `eq` kernel (NOT end-to-end query time): i64, all-valid elements: ~13x (64-elem lists), ~1.9x at 1024-elem i64, ~30% null elements: ~10x (8-elem) ... ~1x (512-elem); falls back beyond string elements: ~1.45x Every shape improves with no regression. End-to-end impact depends on how much of a query `array_has` accounts for. For a query dominated by an array-needle `array_has` join filter (a `NestedLoopJoinExec` with `filter=array_has(tags, key)` over 3000x3000 rows of 8-element lists) total time drops from 0.95s to 0.059s (~16x, identical results). For a workload where `array_has` is a smaller fraction -- e.g. the ~6% of profile that motivated this (see apache#18070 / apache#18161, which fixed the join's deep-copy but left the per-row `array_has` cost) -- the overall speedup is single-digit percent. Part of apache#18727. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

BooleanBuffer::has_true() was added in arrow 59; branch-54 is on arrow 58.3, so the element-null per-row reduction uses count_set_bits() > 0 instead. No behavior change (17 array_has unit tests pass, clippy clean). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

freakyzoidberg and others added 2 commits June 29, 2026 13:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Array has optim backport#151

Array has optim backport#151
freakyzoidberg wants to merge 2 commits into
branch-54from
array_has_optim_backport

freakyzoidberg commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

freakyzoidberg commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant