Commit b9b9da9

[Analytics Backend / DataFusion] Onboard array_length scalar function (Part 3)
Wires Calcite's `SqlLibraryOperators.ARRAY_LENGTH` to DataFusion's native `array_length`, completing the end-to-end story for PPL `rex` extract-mode multi-match: queries can now size the list returned by `rex_extract_multi` (`eval count = array_length(g)`).

* `ScalarFunction.ARRAY_LENGTH` enum value (resolves via the `valueOf()` fallback on the Calcite operator name).
* Registered in `STANDARD_PROJECT_OPS`. Returns `bigint`, so the existing `SUPPORTED_FIELD_TYPES` (numeric ∪ keyword ∪ date ∪ {BOOLEAN, TEXT}) covers the capability lookup; no special case is needed.
* `FunctionMappings.s(SqlLibraryOperators.ARRAY_LENGTH, "array_length")` in `DataFusionFragmentConvertor.ADDITIONAL_SCALAR_SIGS`. Library operators don't auto-resolve through the substrait default catalog; this is the same explicit pinning pattern used for `ILIKE`, `DATE_PART`, and the `REGEXP_REPLACE_*` family.
* `array_length` extension declaration in `opensearch_scalar_functions.yaml` with `list<varchar<L1>>` → `i64` and `list<string>` → `i64` impls. Without a custom YAML extension that matches the actual list type, isthmus emits "Unable to convert call ARRAY_LENGTH(list<varchar<...>>)" for the `rex_extract_multi` output.

Lifts CalciteRexCommandIT (SQL plugin's standard rex IT class) through the analytics-engine route from 14/18 → 17/18. The remaining failure (testRexMaxMatchConfigurableLimit) is a unified-query architectural gap (`UnifiedQueryContext` ignores cluster-setting overrides and uses the static default) and is unrelated to rex or array_length.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
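The "`valueOf()` fallback on the Calcite operator name" mentioned in the first bullet can be pictured with the minimal sketch below. It is not the plugin's code: the class `ScalarFunctionLookupSketch`, the `ALIASES` map, and the `resolve` method are hypothetical names, and the enum is a cut-down stand-in for the real `ScalarFunction`. The sketch only assumes that a lookup first tries explicit aliases and then falls back to matching the operator name against the enum constant.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these names come from the plugin itself.
final class ScalarFunctionLookupSketch {

    /** Cut-down stand-in for the plugin's ScalarFunction enum. */
    public enum ScalarFunction { REX_EXTRACT_MULTI, ARRAY_LENGTH, PLUS }

    // Assumed: operators whose names are not valid enum constants need an explicit alias.
    private static final Map<String, ScalarFunction> ALIASES = Map.of("+", ScalarFunction.PLUS);

    /** Resolves an operator name, falling back to Enum.valueOf on the upper-cased name. */
    public static Optional<ScalarFunction> resolve(String operatorName) {
        ScalarFunction alias = ALIASES.get(operatorName);
        if (alias != null) {
            return Optional.of(alias);
        }
        try {
            return Optional.of(ScalarFunction.valueOf(operatorName.toUpperCase(Locale.ROOT)));
        } catch (IllegalArgumentException e) {
            // valueOf throws when no constant carries that name.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("ARRAY_LENGTH"));   // prints Optional[ARRAY_LENGTH]
        System.out.println(resolve("NOT_A_FUNCTION")); // prints Optional.empty
    }
}
```

Under this assumption, adding the `ARRAY_LENGTH` constant is enough for the operator to resolve, because the Calcite operator name and the enum constant name coincide.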
1 parent 9c29b41 commit b9b9da9

3 files changed

Lines changed: 24 additions & 0 deletions

sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DataFusionAnalyticsBackendPlugin.java

Lines changed: 6 additions & 0 deletions
@@ -140,6 +140,12 @@ public class DataFusionAnalyticsBackendPlugin implements AnalyticsSearchBackendP
  ScalarFunction.REX_EXTRACT,
  ScalarFunction.REX_EXTRACT_MULTI,
  ScalarFunction.REX_OFFSET,
+ // ARRAY_LENGTH — counts elements in array<*>; needed end-to-end so PPL queries can size
+ // the list returned by `rex field=f "(?<g>...)"` extract-mode (CalciteRexCommandIT's
+ // testRexMaxMatch{Zero,Within,At}DefaultLimit and testRexMaxMatchConfigurableLimit all
+ // do `eval count = array_length(g)`). DataFusion has it natively; isthmus default
+ // catalog binds it.
+ ScalarFunction.ARRAY_LENGTH,
  ScalarFunction.PLUS,
  ScalarFunction.TIMES,
  ScalarFunction.DIVIDE,

sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DataFusionFragmentConvertor.java

Lines changed: 1 addition & 0 deletions
@@ -180,6 +180,7 @@ public class DataFusionFragmentConvertor implements FragmentConvertor {
  FunctionMappings.s(RexExtractAdapter.LOCAL_REX_EXTRACT_OP, "rex_extract"),
  FunctionMappings.s(RexExtractMultiAdapter.LOCAL_REX_EXTRACT_MULTI_OP, "rex_extract_multi"),
  FunctionMappings.s(RexOffsetAdapter.LOCAL_REX_OFFSET_OP, "rex_offset"),
+ FunctionMappings.s(SqlLibraryOperators.ARRAY_LENGTH, "array_length"),
  FunctionMappings.s(SqlStdOperatorTable.TRUNCATE, "trunc"),
  FunctionMappings.s(SqlStdOperatorTable.CBRT, "cbrt"),
  FunctionMappings.s(SqlStdOperatorTable.COT, "cot"),

sandbox/plugins/analytics-backend-datafusion/src/main/resources/opensearch_scalar_functions.yaml

Lines changed: 17 additions & 0 deletions
@@ -433,6 +433,23 @@ scalar_functions:
           - value: "string"
             name: "pattern"
         return: string
+  - name: array_length
+    description: >-
+      Number of elements in a list. Lowering target for PPL's `array_length()`
+      function (Calcite `SqlLibraryOperators.ARRAY_LENGTH`). Needed end-to-end
+      so PPL queries can size the list returned by `rex` extract-mode
+      (`eval count = array_length(g)` after `rex field=f "(?<g>...)"`).
+      datafusion-substrait resolves the extension name "array_length" to
+      DataFusion's native `array_length` UDF.
+    impls:
+      - args:
+          - value: "list<varchar<L1>>"
+            name: "input"
+        return: i64
+      - args:
+          - value: "list<string>"
+            name: "input"
+        return: i64
 
   # ascii(str) — Unicode code point of the first character.
   - name: "ascii"
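
The two `impls` entries above exist because the conversion has to find a declaration whose argument type matches the actual list type. The sketch below is a deliberately simplified model of that idea, not isthmus's matching algorithm: the class `ExtensionSignatureSketch`, the `IMPLS` table, and `returnTypeOf` are hypothetical, and types are compared as plain strings. The error text mirrors the message quoted in the commit description.

```java
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of looking up a declared extension signature.
final class ExtensionSignatureSketch {

    // Function name -> (argument type list -> return type), mirroring the YAML impls above.
    private static final Map<String, Map<List<String>, String>> IMPLS = Map.of(
        "array_length", Map.of(
            List.of("list<varchar<L1>>"), "i64",
            List.of("list<string>"), "i64"));

    /** Returns the declared return type, or fails when no impl matches the argument types. */
    public static String returnTypeOf(String name, List<String> argTypes) {
        String returnType = IMPLS.getOrDefault(name, Map.of()).get(argTypes);
        if (returnType == null) {
            throw new IllegalArgumentException(
                "Unable to convert call " + name + "(" + String.join(", ", argTypes) + ")");
        }
        return returnType;
    }

    public static void main(String[] args) {
        System.out.println(returnTypeOf("array_length", List.of("list<string>"))); // prints i64
        try {
            returnTypeOf("array_length", List.of("list<i32>"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints Unable to convert call array_length(list<i32>)
        }
    }
}
```

In this model, dropping either entry from `IMPLS` makes the corresponding call fail, which is the motivation the commit gives for declaring both list variants.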
