Commit 7366517

[Analytics Backend / DataFusion] Wire PPL rex sed-mode (Part 1) — bridge-only
Onboards the PPL `rex` command's `mode=sed` surface: the part that lowers to standard Calcite library operators and bridges through Substrait to DataFusion's native UDFs.

Three sed sub-variants are covered:

* `rex field=f mode=sed "s/old/new/"` (no flags) → SqlLibraryOperators.REGEXP_REPLACE_3. Already mapped by the PPL `replace` onboarding in opensearch-project#21527, so nothing changes here.
* `rex field=f mode=sed "s/old/new/g"` / `/i` / `/gi` → SqlLibraryOperators.REGEXP_REPLACE_PG_4 (4-arg with flags string). New bridge in this PR. DataFusion's `regexp_replace` natively accepts the 4-arg form `(str, pat, repl, flags)` per its substrait UDF binding.
* `rex field=f mode=sed "y/from/to/"` (transliteration) → SqlLibraryOperators.TRANSLATE3. New bridge in this PR. Resolves to DataFusion's `translate` UDF (datafusion-functions/src/unicode/translate.rs).

The 4-arg `REGEXP_REPLACE_PG_4` carries the same Java-regex syntax baggage as the 3-arg form: `\Q…\E` quoted-literal blocks (Rust regex rejects them) and bare `$N` backreferences in the replacement (Rust's identifier-greedy parser mis-resolves them). RegexpReplaceAdapter, introduced for the 3-arg form, now accepts both arities: the pattern is at position 1 and the replacement at position 2 in both signatures, so the rewrite logic doesn't change. Operands beyond position 2 (the flags string in the 4-arg form) pass through verbatim. Two new RegexpReplaceAdapterTests cover the 4-arg path.

`TRANSLATE3` doesn't need an adapter: its arguments are literal character sets, not regex syntax.

Out of scope (deferred to Part 2):

* Rex extract mode (`rex field=f "(?<g>...)"`) uses the SQL plugin's custom Java UDFs `REX_EXTRACT`, `REX_EXTRACT_MULTI`, `REX_OFFSET`, which have no native DataFusion equivalent. Slated for a follow-up PR that adds Rust-side UDF implementations, similar to the convert_tz precedent (opensearch-project#21476).
* Sed with occurrence flag (`s/.../.../<N>`) emits 5-arg `REGEXP_REPLACE_5`, which DataFusion's native `regexp_replace` does not support (max 4 args).

Test results:

* `RegexpReplaceAdapterTests`: 21/21 (19 from opensearch-project#21527 + 2 new for the 4-arg path).
* `RexCommandIT` (new self-contained QA IT, calcs dataset): 9/9. Covers all sed sub-variants: literal (no flags), `/g` global, `/i` case-insensitive, `/gi` combined, backreferences via `$N`, transliteration `y/from/to/`, and no-match passthrough.
* `./gradlew check -p sandbox -Dsandbox.enabled=true`: green.

The unified-path NPE caused by a missing PPL_REX_MAX_MATCH_LIMIT default is fixed in opensearch-project/sql#5418, which is required for any rex query (sed or extract) to reach the planner via /_analytics/ppl. The test results above assume opensearch-project/sql#5418 is applied. Pre-fix: every query NPEs in `AstBuilder.visitRexCommand`. Post-fix: 9/9 RexCommandIT pass.

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
1 parent 139b9f9 commit 7366517
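
The sed-expression-to-operator mapping described in the commit message can be sketched roughly as follows. This is an illustration only: `SedLowering`, `Target`, and `classify` are hypothetical names, and the parsing is deliberately naive (it assumes `/` as the delimiter and no escaped delimiters). It is not the SQL plugin's actual parser; only the operator names come from the commit message.

```java
/** Hypothetical illustration; not part of this commit or of the SQL plugin. */
final class SedLowering {

    /** Calcite operator names as listed in the commit message. */
    enum Target {
        REGEXP_REPLACE_3(true),
        REGEXP_REPLACE_PG_4(true),
        TRANSLATE3(true),
        REGEXP_REPLACE_5(false); // occurrence flag: deferred to Part 2

        final boolean bridgedInPart1;

        Target(boolean bridgedInPart1) {
            this.bridgedInPart1 = bridgedInPart1;
        }
    }

    /** Classifies expressions shaped like s/a/b/flags or y/from/to/. */
    static Target classify(String sed) {
        // Simplification: split on '/' and ignore escaped delimiters.
        String[] parts = sed.split("/", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected cmd/a/b/flags: " + sed);
        }
        String command = parts[0];
        String flags = parts[3];
        if (command.equals("y") && flags.isEmpty()) {
            return Target.TRANSLATE3;
        }
        if (!command.equals("s")) {
            throw new IllegalArgumentException("unsupported sed command: " + sed);
        }
        if (flags.isEmpty()) {
            return Target.REGEXP_REPLACE_3;
        }
        if (flags.chars().allMatch(c -> c == 'g' || c == 'i')) {
            return Target.REGEXP_REPLACE_PG_4;
        }
        if (flags.chars().allMatch(c -> c >= '0' && c <= '9')) {
            return Target.REGEXP_REPLACE_5;
        }
        throw new IllegalArgumentException("unsupported flags: " + flags);
    }

    public static void main(String[] args) {
        for (String sed : new String[] { "s/old/new/", "s/old/new/gi", "y/abc/xyz/", "s/a/b/2" }) {
            Target t = classify(sed);
            System.out.println(sed + " -> " + t + " (bridged in Part 1: " + t.bridgedInPart1 + ")");
        }
    }
}
```

Carrying a `bridgedInPart1` flag on the enum is just one way to make the Part 1 / Part 2 split explicit; the real planner may signal unsupported variants differently.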

7 files changed

Lines changed: 355 additions & 3 deletions

sandbox/libs/analytics-framework/src/main/java/org/opensearch/analytics/spi/ScalarFunction.java

Lines changed: 1 addition & 0 deletions
@@ -69,6 +69,7 @@ public enum ScalarFunction {
     CHAR_LENGTH(Category.STRING, SqlKind.OTHER_FUNCTION),
     REPLACE(Category.STRING, SqlKind.OTHER_FUNCTION),
     REGEXP_REPLACE(Category.STRING, SqlKind.OTHER_FUNCTION),
+    TRANSLATE(Category.STRING, SqlKind.OTHER_FUNCTION),
 
     // ── Math ─────────────────────────────────────────────────────────
     PLUS(Category.MATH, SqlKind.PLUS),

sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DataFusionAnalyticsBackendPlugin.java

Lines changed: 1 addition & 0 deletions
@@ -137,6 +137,7 @@ public class DataFusionAnalyticsBackendPlugin implements AnalyticsSearchBackendP
         ScalarFunction.REGEXP_CONTAINS,
         ScalarFunction.REPLACE,
         ScalarFunction.REGEXP_REPLACE,
+        ScalarFunction.TRANSLATE,
         ScalarFunction.PLUS,
         ScalarFunction.TIMES,
         ScalarFunction.DIVIDE,

sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/DataFusionFragmentConvertor.java

Lines changed: 11 additions & 0 deletions
@@ -95,6 +95,15 @@ public class DataFusionFragmentConvertor implements FragmentConvertor {
      * <li>{@link SqlLibraryOperators#REGEXP_REPLACE_3} → {@code regexp_replace} (regex string
      * replacement; lowering target for PPL `replace` command on wildcard patterns and for
      * PPL `replace()` / `regexp_replace()` functions in `eval`).</li>
+     * <li>{@link SqlLibraryOperators#REGEXP_REPLACE_PG_4} → {@code regexp_replace} (4-arg
+     * PostgreSQL-style with flags string; lowering target for PPL `rex mode=sed` with
+     * {@code g}/{@code i} flags. Reuses the same DataFusion {@code regexp_replace} UDF as
+     * the 3-arg form.</li>
+     * <li>{@link SqlLibraryOperators#TRANSLATE3} → {@code translate} (3-arg character
+     * transliteration; lowering target for PPL `rex mode=sed` with {@code y/from/to/}
+     * transliteration syntax). DataFusion's substrait consumer resolves the extension name
+     * "translate" to its native {@code translate} UDF
+     * (datafusion-functions/src/unicode/translate.rs).</li>
      * </ul>
      */
     private static final List<FunctionMappings.Sig> ADDITIONAL_SCALAR_SIGS = List.of(
@@ -106,6 +115,8 @@ public class DataFusionFragmentConvertor implements FragmentConvertor {
         FunctionMappings.s(SqlLibraryOperators.REGEXP_CONTAINS, "regex_match"),
         FunctionMappings.s(SqlStdOperatorTable.REPLACE, "replace"),
         FunctionMappings.s(SqlLibraryOperators.REGEXP_REPLACE_3, "regexp_replace"),
+        FunctionMappings.s(SqlLibraryOperators.REGEXP_REPLACE_PG_4, "regexp_replace"),
+        FunctionMappings.s(SqlLibraryOperators.TRANSLATE3, "translate"),
         FunctionMappings.s(SqlLibraryOperators.REGEXP_CONTAINS, "regex_match"),
         FunctionMappings.s(UnixTimestampAdapter.LOCAL_TO_UNIXTIME_OP, "to_unixtime"),
         FunctionMappings.s(SqlStdOperatorTable.TRUNCATE, "trunc"),

sandbox/plugins/analytics-backend-datafusion/src/main/java/org/opensearch/be/datafusion/RegexpReplaceAdapter.java

Lines changed: 18 additions & 3 deletions
@@ -43,6 +43,13 @@
  * changes. Calls without {@code \Q} in the pattern AND without bare {@code $N} in the
  * replacement pass through unchanged.
  *
+ * <p>Handles both 3-arg ({@code SqlLibraryOperators.REGEXP_REPLACE_3}) and 4-arg
+ * ({@code SqlLibraryOperators.REGEXP_REPLACE_PG_4}, with a trailing {@code flags} string)
+ * signatures. The pattern is at operand position 1 and replacement at position 2 in both —
+ * the rewrite logic is identical. The {@code flags} operand (when present) passes through
+ * verbatim. The 4-arg form is the lowering target for PPL {@code rex mode=sed} with
+ * {@code g}/{@code i} flags.
+ *
  * <p>Pattern faithful to {@link java.util.regex.Pattern} semantics: an unterminated
  * {@code \Q} (no closing {@code \E}) quotes through end-of-string. Replacement preserves
  * existing {@code ${…}} braces and the {@code $$} literal-dollar escape.
@@ -56,8 +63,12 @@ class RegexpReplaceAdapter implements ScalarFunctionAdapter {
 
     @Override
     public RexNode adapt(RexCall original, List<FieldStorageInfo> fieldStorage, RelOptCluster cluster) {
-        // REGEXP_REPLACE_3 has signature (input, pattern, replacement) — exactly 3 operands.
-        if (original.getOperands().size() != 3) {
+        // REGEXP_REPLACE has both 3-arg (input, pattern, replacement) and 4-arg
+        // (input, pattern, replacement, flags) signatures. Pattern is at position 1 and
+        // replacement at position 2 in both — the rewrite logic is identical regardless
+        // of whether a flags argument is appended. Operands beyond position 2 (e.g. the
+        // flags string) pass through unchanged.
+        if (original.getOperands().size() < 3 || original.getOperands().size() > 4) {
             return original;
         }
         RexNode patternOperand = original.getOperands().get(1);
@@ -93,10 +104,14 @@ public RexNode adapt(RexCall original, List<FieldStorageInfo> fieldStorage, RelO
         // makeLiteral(String) infers a CHAR type sized to the rewritten string. Reusing the
         // original literal's type would right-pad to the OLD length (e.g. CHAR(23) → 8 trailing
         // spaces after a 15-char rewrite), corrupting the value at runtime.
-        List<RexNode> newOperands = new ArrayList<>(3);
+        List<RexNode> newOperands = new ArrayList<>(original.getOperands().size());
         newOperands.add(original.getOperands().get(0));
         newOperands.add(rewrittenPattern != null ? rexBuilder.makeLiteral(rewrittenPattern) : patternOperand);
         newOperands.add(rewrittenReplacement != null ? rexBuilder.makeLiteral(rewrittenReplacement) : replacementOperand);
+        // Append any trailing operand (the flags string in the 4-arg form) verbatim.
+        for (int i = 3; i < original.getOperands().size(); i++) {
+            newOperands.add(original.getOperands().get(i));
+        }
         return rexBuilder.makeCall(original.getType(), original.getOperator(), newOperands);
     }
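
To make the adapter's contract concrete, here is a minimal standalone sketch of the two rewrites and the operand pass-through, using plain strings in place of Calcite `RexNode` literals. The class and method names (`RegexpRewriteSketch`, `unquote`, `braceBackrefs`, `adapt`) are hypothetical, and the escaping rules are assumptions made for illustration. The elided body of `RegexpReplaceAdapter` may differ, for example in how it treats an escaped backslash before `Q` or a backslash-escaped `$` in the replacement.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical string-level stand-in for the adapter; not the committed code. */
final class RegexpRewriteSketch {

    /** Expands \Q...\E into escaped literals; an unterminated \Q quotes to end of string. */
    static String unquote(String pattern) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < pattern.length()) {
            int start = pattern.indexOf("\\Q", i); // naive: ignores an escaped backslash before Q
            if (start < 0) {
                out.append(pattern, i, pattern.length());
                break;
            }
            out.append(pattern, i, start);
            int end = pattern.indexOf("\\E", start + 2);
            String literal = end < 0 ? pattern.substring(start + 2) : pattern.substring(start + 2, end);
            for (char c : literal.toCharArray()) {
                if ("\\.[]{}()*+?^$|".indexOf(c) >= 0) {
                    out.append('\\'); // assumed metacharacter set
                }
                out.append(c);
            }
            i = end < 0 ? pattern.length() : end + 2;
        }
        return out.toString();
    }

    /** Rewrites bare $N to ${N}; keeps existing ${...} groups and the $$ escape. */
    static String braceBackrefs(String replacement) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < replacement.length()) {
            char c = replacement.charAt(i);
            char next = i + 1 < replacement.length() ? replacement.charAt(i + 1) : '\0';
            if (c == '$' && next == '$') {
                out.append("$$");
                i += 2;
            } else if (c == '$' && next == '{') {
                int close = replacement.indexOf('}', i);
                int stop = close < 0 ? replacement.length() : close + 1;
                out.append(replacement, i, stop);
                i = stop;
            } else if (c == '$' && next >= '0' && next <= '9') {
                int j = i + 1;
                while (j < replacement.length() && replacement.charAt(j) >= '0' && replacement.charAt(j) <= '9') {
                    j++;
                }
                out.append("${").append(replacement, i + 1, j).append('}');
                i = j;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    /** Operands are (input, pattern, replacement[, flags]); returns the same list if nothing changes. */
    static List<String> adapt(List<String> operands) {
        if (operands.size() < 3 || operands.size() > 4) {
            return operands;
        }
        String pattern = unquote(operands.get(1));
        String replacement = braceBackrefs(operands.get(2));
        if (pattern.equals(operands.get(1)) && replacement.equals(operands.get(2))) {
            return operands; // identity, as the pass-through test expects
        }
        List<String> rewritten = new ArrayList<>(operands.size());
        rewritten.add(operands.get(0));
        rewritten.add(pattern);
        rewritten.add(replacement);
        for (int i = 3; i < operands.size(); i++) {
            rewritten.add(operands.get(i)); // flags pass through verbatim
        }
        return rewritten;
    }

    public static void main(String[] args) {
        System.out.println(adapt(List.of("f", "^\\QFOO\\E", "$1", "gi")));
    }
}
```

The two inputs used by the new 4-arg unit tests (`^\QFOO\E` with `$1` and flags `gi`, and `^foo$` with `bar` and flag `g`) behave the same way under this sketch: the first is rewritten to `^FOO` and `${1}` with the flags untouched, and the second returns the original list unchanged.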

sandbox/plugins/analytics-backend-datafusion/src/main/resources/opensearch_scalar_functions.yaml

Lines changed: 45 additions & 0 deletions
@@ -204,3 +204,48 @@ scalar_functions:
           - value: "string"
             name: "replacement"
         return: string
+      - args:
+          - value: "varchar<L1>"
+            name: "input"
+          - value: "varchar<L2>"
+            name: "pattern"
+          - value: "varchar<L3>"
+            name: "replacement"
+          - value: "varchar<L4>"
+            name: "flags"
+        return: "varchar<L1>"
+      - args:
+          - value: "string"
+            name: "input"
+          - value: "string"
+            name: "pattern"
+          - value: "string"
+            name: "replacement"
+          - value: "string"
+            name: "flags"
+        return: string
+  - name: translate
+    description: >-
+      Character transliteration — replace each character in `input` that appears
+      in `from` with the corresponding character in `to`. Lowering target for
+      PPL's `rex mode=sed` with {@code y/from/to/} transliteration syntax
+      (Calcite `SqlLibraryOperators.TRANSLATE3`). datafusion-substrait resolves
+      the extension name "translate" to DataFusion's native `translate` UDF
+      (datafusion-functions/src/unicode/translate.rs).
+    impls:
+      - args:
+          - value: "varchar<L1>"
+            name: "input"
+          - value: "varchar<L2>"
+            name: "from"
+          - value: "varchar<L3>"
+            name: "to"
+        return: "varchar<L1>"
+      - args:
+          - value: "string"
+            name: "input"
+          - value: "string"
+            name: "from"
+          - value: "string"
+            name: "to"
+        return: string
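
The `translate` semantics described in the YAML entry above (each character of `input` found in `from` is replaced by the character at the same position in `to`) can be sketched in plain Java as follows. `TranslateSketch` is a hypothetical name. Dropping characters whose position in `from` has no counterpart in `to` is an assumption based on PostgreSQL-style `translate`, not something this commit states; sed's `y/from/to/` normally uses equal-length sets, where the question does not arise.

```java
/** Hypothetical reference sketch of translate(input, from, to); not DataFusion's implementation. */
final class TranslateSketch {

    static String translate(String input, String from, String to) {
        int[] fromCps = from.codePoints().toArray();
        int[] toCps = to.codePoints().toArray();
        StringBuilder out = new StringBuilder();
        input.codePoints().forEach(cp -> {
            int idx = firstIndexOf(fromCps, cp);
            if (idx < 0) {
                out.appendCodePoint(cp);         // not in `from`: keep as-is
            } else if (idx < toCps.length) {
                out.appendCodePoint(toCps[idx]); // mapped to the character at the same position
            }
            // else: no counterpart in `to`; dropped (assumed PostgreSQL-style behavior)
        });
        return out.toString();
    }

    /** Returns the first position of cp in cps, or -1 if absent. */
    private static int firstIndexOf(int[] cps, int cp) {
        for (int i = 0; i < cps.length; i++) {
            if (cps[i] == cp) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(translate("hello", "el", "ip")); // prints "hippo"
    }
}
```

Working on code points rather than `char` values keeps supplementary characters intact, which seems appropriate given that the DataFusion UDF lives under `unicode/`.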

sandbox/plugins/analytics-backend-datafusion/src/test/java/org/opensearch/be/datafusion/RegexpReplaceAdapterTests.java

Lines changed: 37 additions & 0 deletions
@@ -222,4 +222,41 @@ public void testAdaptRewritesBothPatternAndReplacement() {
         assertEquals("pattern unquoted", "^(.*?) (.*?)$", ((RexLiteral) result.getOperands().get(1)).getValueAs(String.class));
         assertEquals("replacement braced", "${1}_${2}", ((RexLiteral) result.getOperands().get(2)).getValueAs(String.class));
     }
+
+    public void testAdapt4ArgRewritesPatternAndPassesFlagsThrough() {
+        // 4-arg REGEXP_REPLACE_PG_4 — emitted by PPL `rex mode=sed` with /g or /i flags.
+        // Pattern + replacement get rewritten as in the 3-arg case; the trailing flags
+        // operand passes through unchanged.
+        RexNode field = rexBuilder.makeInputRef(varcharType, 0);
+        RexNode pattern = rexBuilder.makeLiteral("^\\QFOO\\E");
+        RexNode replacement = rexBuilder.makeLiteral("$1");
+        RexNode flags = rexBuilder.makeLiteral("gi");
+        RexCall original = (RexCall) rexBuilder.makeCall(
+            SqlLibraryOperators.REGEXP_REPLACE_PG_4,
+            List.of(field, pattern, replacement, flags)
+        );
+
+        RexCall result = (RexCall) adapter.adapt(original, List.of(), cluster);
+
+        assertEquals("4-arg call preserved", 4, result.getOperands().size());
+        assertEquals("pattern unquoted", "^FOO", ((RexLiteral) result.getOperands().get(1)).getValueAs(String.class));
+        assertEquals("replacement braced", "${1}", ((RexLiteral) result.getOperands().get(2)).getValueAs(String.class));
+        assertSame("flags operand passes through verbatim", flags, result.getOperands().get(3));
+    }
+
+    public void testAdapt4ArgPassesThroughWhenNoRewriteNeeded() {
+        // 4-arg call with Rust-compatible pattern and no $N — adapter must return identity.
+        RexNode field = rexBuilder.makeInputRef(varcharType, 0);
+        RexNode pattern = rexBuilder.makeLiteral("^foo$");
+        RexNode replacement = rexBuilder.makeLiteral("bar");
+        RexNode flags = rexBuilder.makeLiteral("g");
+        RexCall original = (RexCall) rexBuilder.makeCall(
+            SqlLibraryOperators.REGEXP_REPLACE_PG_4,
+            List.of(field, pattern, replacement, flags)
+        );
+
+        RexNode adapted = adapter.adapt(original, List.of(), cluster);
+
+        assertSame("identity — 4-arg call with no \\Q and no $N passes through", original, adapted);
+    }
 }
