Skip to content

Commit d2f73b7

Browse files
committed
[Analytics Backend / DataFusion] Onboard PPL mvfind via custom Rust UDF
PPL `mvfind(arr, regex)` finds the 0-based index of the first array element matching a regex pattern (Java `Matcher.find` substring-match semantics), or NULL if no match. DataFusion has no stdlib equivalent, and rewriting in terms of array_position requires per-element regex evaluation that's only expressible with substrait lambda support — out of scope here. Onboards a custom Rust ScalarUDF on the analytics-backend-datafusion plugin's session context, mirroring the mvzip/convert_tz pattern. Templated shape: Rust side: udf::mvfind::MvfindUdf — Signature::user_defined; coerce_types pins arg 0 to a list type and arg 1 to Utf8; invoke_with_args walks each row and finds the first non-null element whose stringified form matches the regex via Rust's `regex` crate (`Regex::is_match` is unanchored, same as Java's `Matcher.find`). Scalar pattern operands compile once up front and surface invalid-regex errors at plan time (mirrors the SQL plugin's plan-time `tryCompileLiteralPattern`); column-valued patterns compile per row and yield NULL for invalid patterns. Supports list element types Utf8 / Int{8,16,32,64} / UInt{8,16,32,64} / Float{32,64} / Boolean / Null. 7 unit tests cover the basic-match / no-match / null-array / empty-array / null-element / numeric-array / unanchored shapes. Registered on each session context via udf::register_all alongside convert_tz and mvzip. Java side: ScalarFunction.MVFIND enum entry (SqlKind.OTHER_FUNCTION; resolves through identifier-name valueOf("MVFIND") since PPL's MVFindFunctionImpl registers under the function name "mvfind"). MvfindAdapter — locally-declared SqlFunction("mvfind") + ADDITIONAL_SCALAR_SIGS bridge so isthmus emits a Substrait scalar function call with the exact name the Rust UDF is registered under. DataFusionAnalyticsBackendPlugin: STANDARD_PROJECT_OPS membership (returns INTEGER, registered against the existing scalar SUPPORTED_FIELD_TYPES); adapter registration in scalarFunctionAdapters(). opensearch_array_functions.yaml: arity-2 impl returning `i32?`. * Before: 34/60. * After: 42/60. Newly passing — 8 of 9 testMvfind* variants: testMvfindWithMatch, testMvfindWithFirstMatch, testMvfindWithMultipleMatches, testMvfindWithNoMatch, testMvfindWithEmptyArray, testMvfindWithNumericArray, testMvfindWithCaseInsensitive, testMvfindWithComplexRegex. Remaining mvfind failure: testMvfindWithDynamicRegex — fails with "Unable to convert call CONCAT(string, string)" because the test computes the pattern via `concat('ban', '.*')` and substrait can't bind the CONCAT call. This is a separate analytics-engine CONCAT type-conversion issue, not mvfind-specific. Signed-off-by: Kai Huang <ahkcs@amazon.com>
1 parent c059c38 commit d2f73b7

8 files changed

Lines changed: 470 additions & 4 deletions

File tree

sandbox/libs/analytics-framework/src/main/java/org/opensearch/analytics/spi/ScalarFunction.java

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -166,7 +166,15 @@ public enum ScalarFunction {
166166
* No DataFusion stdlib equivalent — the analytics-backend-datafusion plugin ships
167167
* a custom Rust UDF (`udf::mvzip`) registered on its session context.
168168
*/
169-
MVZIP(Category.SCALAR, SqlKind.OTHER_FUNCTION);
169+
MVZIP(Category.SCALAR, SqlKind.OTHER_FUNCTION),
170+
/**
171+
* PPL {@code mvfind(arr, regex)} — find the 0-based index of the first array
172+
* element matching a regex, or NULL if no match. Resolves through the SQL
173+
* plugin's {@code MVFindFunctionImpl} UDF named {@code "mvfind"}. No
174+
* DataFusion stdlib equivalent — the analytics-backend-datafusion plugin
175+
* ships a custom Rust UDF (`udf::mvfind`) registered on its session context.
176+
*/
177+
MVFIND(Category.SCALAR, SqlKind.OTHER_FUNCTION);
170178

171179
/**
172180
* Category of scalar function.

sandbox/plugins/analytics-backend-datafusion/rust/Cargo.toml

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -49,6 +49,9 @@ chrono-tz = "0.10"
4949

5050
tokio-metrics = { workspace = true }
5151

52+
# mvfind UDF — regex matching against stringified array elements
53+
regex = "1.10"
54+
5255
[dev-dependencies]
5356
criterion = { workspace = true }
5457
tempfile = { workspace = true }

sandbox/plugins/analytics-backend-datafusion/rust/src/udf/mod.rs

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -113,12 +113,14 @@ pub(crate) fn coerce_args(
113113
}
114114

115115
pub mod convert_tz;
116+
pub mod mvfind;
116117
pub mod mvzip;
117118

118119
pub fn register_all(ctx: &SessionContext) {
119120
convert_tz::register_all(ctx);
120121
mvzip::register_all(ctx);
121-
log::info!("OpenSearch UDF register_all: convert_tz, mvzip registered");
122+
mvfind::register_all(ctx);
123+
log::info!("OpenSearch UDF register_all: convert_tz, mvzip, mvfind registered");
122124
}
123125

124126
#[cfg(test)]

0 commit comments

Comments
 (0)