Commit d2f73b7
[Analytics Backend / DataFusion] Onboard PPL mvfind via custom Rust UDF
PPL `mvfind(arr, regex)` finds the 0-based index of the first array element
matching a regex pattern (Java `Matcher.find` substring-match semantics), or
NULL if no match. DataFusion has no stdlib equivalent, and rewriting in terms
of array_position requires per-element regex evaluation that's only
expressible with Substrait lambda support, which is out of scope here. This
commit onboards a custom Rust ScalarUDF on the analytics-backend-datafusion
plugin's session context, mirroring the mvzip/convert_tz pattern.
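For reference, the semantics described above can be sketched in plain Java with `java.util.regex`. This is an illustrative model only, not the PPL or Rust implementation: the class name `MvfindSketch` and the method shape are hypothetical, and returning null for a null array is an assumption (the commit lists a null-array test but does not state its expected result).

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical reference model of mvfind(arr, regex); not the shipped code.
class MvfindSketch {
    // Returns the 0-based index of the first non-null element whose string
    // form contains a match for the regex, or null when nothing matches.
    static Integer mvfind(List<?> arr, String regex) {
        if (arr == null) {
            return null; // assumption: a null array yields NULL
        }
        Pattern pattern = Pattern.compile(regex);
        for (int i = 0; i < arr.size(); i++) {
            Object element = arr.get(i);
            if (element == null) {
                continue; // null elements are skipped, never matched
            }
            // Matcher.find is unanchored: a substring match is enough.
            if (pattern.matcher(String.valueOf(element)).find()) {
                return i;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(mvfind(Arrays.asList("apple", "banana", "cherry"), "ban.*")); // 1
        System.out.println(mvfind(Arrays.asList("apple", "banana"), "pp"));              // 0
        System.out.println(mvfind(Arrays.asList("apple", "banana"), "^pp"));             // null
        System.out.println(mvfind(Arrays.asList(10, 20, 30), "^2"));                     // 1
    }
}
```

The `^pp` case shows why the unanchored behavior matters: the pattern only fails because the anchor is written explicitly, not because the match is implicitly anchored.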
Templated shape:
Rust side:
* udf::mvfind::MvfindUdf: Signature::user_defined; coerce_types pins arg 0
  to a list type and arg 1 to Utf8; invoke_with_args walks each row and
  finds the first non-null element whose stringified form matches the
  regex via Rust's `regex` crate (`Regex::is_match` is unanchored, same
  as Java's `Matcher.find`). Scalar pattern operands compile once up
  front and surface invalid-regex errors at plan time (mirrors the SQL
  plugin's plan-time `tryCompileLiteralPattern`); column-valued patterns
  compile per row and yield NULL for invalid patterns. Supports list
  element types Utf8 / Int{8,16,32,64} / UInt{8,16,32,64} / Float{32,64}
  / Boolean / Null. 7 unit tests cover the basic-match / no-match /
  null-array / empty-array / null-element / numeric-array / unanchored
  shapes.
* Registered on each session context via udf::register_all alongside
  convert_tz and mvzip.
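The two pattern-handling modes described for MvfindUdf (literal pattern compiled once and failing fast, column-valued pattern compiled per row and yielding NULL when invalid) can be modeled in Java as below. This is a sketch of the described behavior, not a port of the Rust UDF: the class and method names are hypothetical, and the thrown `PatternSyntaxException` only stands in for the plan-time error.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical model of the scalar vs column-valued pattern paths.
class MvfindPatternModesSketch {
    // Index of the first non-null element containing a match, else null.
    static Integer firstMatch(List<?> arr, Pattern pattern) {
        if (arr == null) {
            return null; // assumption: a null array yields NULL
        }
        for (int i = 0; i < arr.size(); i++) {
            Object element = arr.get(i);
            if (element != null && pattern.matcher(String.valueOf(element)).find()) {
                return i;
            }
        }
        return null;
    }

    // Literal pattern: compile once up front. An invalid regex throws here,
    // before any row is evaluated (stand-in for the plan-time error).
    static List<Integer> withLiteralPattern(List<? extends List<?>> rows, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<Integer> out = new ArrayList<>();
        for (List<?> row : rows) {
            out.add(firstMatch(row, pattern));
        }
        return out;
    }

    // Column-valued pattern: compile per row; an invalid regex yields null
    // for that row only and evaluation continues.
    static List<Integer> withColumnPattern(List<? extends List<?>> rows, List<String> regexes) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            Pattern pattern;
            try {
                pattern = Pattern.compile(regexes.get(i));
            } catch (PatternSyntaxException e) {
                out.add(null);
                continue;
            }
            out.add(firstMatch(rows.get(i), pattern));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
            Arrays.asList("apple", "banana"),
            Arrays.asList("cherry"));
        System.out.println(withLiteralPattern(rows, "an"));                     // [1, null]
        System.out.println(withColumnPattern(rows, Arrays.asList("(", "err"))); // [null, 0]
    }
}
```

Failing fast on a literal pattern reports a typo once, while the per-row path keeps one bad value in a pattern column from failing the whole query.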
Java side:
* ScalarFunction.MVFIND enum entry (SqlKind.OTHER_FUNCTION; resolves
  through identifier-name valueOf("MVFIND") since PPL's
  MVFindFunctionImpl registers under the function name "mvfind").
* MvfindAdapter: locally-declared SqlFunction("mvfind") +
  ADDITIONAL_SCALAR_SIGS bridge so isthmus emits a Substrait scalar
  function call with the exact name the Rust UDF is registered under.
* DataFusionAnalyticsBackendPlugin: STANDARD_PROJECT_OPS membership
  (returns INTEGER, registered against the existing scalar
  SUPPORTED_FIELD_TYPES); adapter registration in
  scalarFunctionAdapters().
* opensearch_array_functions.yaml: arity-2 impl returning `i32?`.
* Before: 34/60.
* After: 42/60.
Newly passing — 8 of 9 testMvfind* variants:
testMvfindWithMatch, testMvfindWithFirstMatch, testMvfindWithMultipleMatches,
testMvfindWithNoMatch, testMvfindWithEmptyArray, testMvfindWithNumericArray,
testMvfindWithCaseInsensitive, testMvfindWithComplexRegex.
Remaining mvfind failure:
testMvfindWithDynamicRegex fails with "Unable to convert call
CONCAT(string, string)" because the test computes the pattern via
`concat('ban', '.*')` and Substrait can't bind the CONCAT call. This is a
separate analytics-engine CONCAT type-conversion issue, not mvfind-specific.
Signed-off-by: Kai Huang <ahkcs@amazon.com>
1 parent c059c38 commit d2f73b7
8 files changed
Lines changed: 470 additions & 4 deletions
File tree
- sandbox
  - libs/analytics-framework/src/main/java/org/opensearch/analytics/spi
  - plugins/analytics-backend-datafusion
    - rust
      - src/udf
    - src/main
      - java/org/opensearch/be/datafusion
      - resources
Lines changed: 9 additions & 1 deletion
Lines changed: 3 additions & 0 deletions
Lines changed: 3 additions & 1 deletion