Commit 9b88f23

[Feature] Implementation of regex Command In PPL (opensearch-project#4083)
* implement regex cmd with calcite support by using java library
* code hygiene fix
* comment clean up
* implement explain it
* disable regex when calcite is disabled and add a test in analyzer
* fix spotless check
* [refactor] refactor some regex fn into a util class for re-usage
* [refactor] revert filter query builder cuz we do not need it anymore
* add rst docs for regex cmd
* add IT for regex cmd
* add IT for calcite no pushdown
* fix regex exp behavior for non string val
* style - remove some verbose comments
* remove string conversion
* use existing operator of REGEXP_CONTAINS
* fix integ test of regex with pushdown after operator commit
* remove some verbose comments and fix some style
* fix explain it in no pushdown
* comment - remove unused fn for string converting
* remove duplicated regex match operator alias
* unit test - initial commit
* anonymizer with test
* fix spotlessApply
* add cross cluster IT
* fix spotless apply
* tomo - fix operator constant
* tomo - fix regex java doc
* tomo - field and pattern handling fix
* tomo - fix LRUCache
* tomo - remove unnecessary delegation layer
* rst doc fix
* tomo - fix comments
* DEFAULT FIELD related change
* DEFAULT FIELD - fix anonymizer tests
* tomo - add unit test for regex util class
* chen - remove code for legacy engine
* chen - remove stalled logic for specified field
* chen - merge into 1 grammar in parser
* properly handle non-string field
* remove verbose comments
* remove verbose comments
* address comments
* fix doc test for regex
* fix doc

---------

Signed-off-by: Jialiang Liang <jiallian@amazon.com>
1 parent 39f37a5 commit 9b88f23

26 files changed

Lines changed: 945 additions & 25 deletions

core/src/main/java/org/opensearch/sql/analysis/Analyzer.java

Lines changed: 7 additions & 0 deletions
@@ -79,6 +79,7 @@
 import org.opensearch.sql.ast.tree.Patterns;
 import org.opensearch.sql.ast.tree.Project;
 import org.opensearch.sql.ast.tree.RareTopN;
+import org.opensearch.sql.ast.tree.Regex;
 import org.opensearch.sql.ast.tree.Relation;
 import org.opensearch.sql.ast.tree.RelationSubquery;
 import org.opensearch.sql.ast.tree.Rename;
@@ -743,6 +744,12 @@ public LogicalPlan visitReverse(Reverse node, AnalysisContext context) {
         "REVERSE is supported only when " + CALCITE_ENGINE_ENABLED.getKeyValue() + "=true");
   }
 
+  @Override
+  public LogicalPlan visitRegex(Regex node, AnalysisContext context) {
+    throw new UnsupportedOperationException(
+        "REGEX is supported only when " + CALCITE_ENGINE_ENABLED.getKeyValue() + "=true");
+  }
+
   @Override
   public LogicalPlan visitPaginate(Paginate paginate, AnalysisContext context) {
     LogicalPlan child = paginate.getChild().get(0).accept(this, context);

core/src/main/java/org/opensearch/sql/ast/AbstractNodeVisitor.java

Lines changed: 5 additions & 0 deletions
@@ -67,6 +67,7 @@
 import org.opensearch.sql.ast.tree.Patterns;
 import org.opensearch.sql.ast.tree.Project;
 import org.opensearch.sql.ast.tree.RareTopN;
+import org.opensearch.sql.ast.tree.Regex;
 import org.opensearch.sql.ast.tree.Relation;
 import org.opensearch.sql.ast.tree.RelationSubquery;
 import org.opensearch.sql.ast.tree.Rename;
@@ -259,6 +260,10 @@ public T visitReverse(Reverse node, C context) {
     return visitChildren(node, context);
   }
 
+  public T visitRegex(Regex node, C context) {
+    return visitChildren(node, context);
+  }
+
   public T visitLambdaFunction(LambdaFunction node, C context) {
     return visitChildren(node, context);
   }

core/src/main/java/org/opensearch/sql/ast/tree/Regex.java

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+/*
+ * Copyright OpenSearch Contributors
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package org.opensearch.sql.ast.tree;
+
+import com.google.common.collect.ImmutableList;
+import java.util.List;
+import lombok.EqualsAndHashCode;
+import lombok.Getter;
+import lombok.Setter;
+import lombok.ToString;
+import org.opensearch.sql.ast.AbstractNodeVisitor;
+import org.opensearch.sql.ast.expression.Literal;
+import org.opensearch.sql.ast.expression.UnresolvedExpression;
+
+@Getter
+@ToString
+@EqualsAndHashCode(callSuper = false)
+public class Regex extends UnresolvedPlan {
+  public static final String EQUALS_OPERATOR = "=";
+
+  public static final String NOT_EQUALS_OPERATOR = "!=";
+
+  private final UnresolvedExpression field;
+
+  private final boolean negated;
+
+  private final Literal pattern;
+
+  @Setter private UnresolvedPlan child;
+
+  public Regex(UnresolvedExpression field, boolean negated, Literal pattern) {
+    this.field = field;
+    this.negated = negated;
+    this.pattern = pattern;
+  }
+
+  @Override
+  public Regex attach(UnresolvedPlan child) {
+    this.child = child;
+    return this;
+  }
+
+  @Override
+  public List<UnresolvedPlan> getChild() {
+    return this.child == null ? ImmutableList.of() : ImmutableList.of(this.child);
+  }
+
+  @Override
+  public <T, C> T accept(AbstractNodeVisitor<T, C> nodeVisitor, C context) {
+    return nodeVisitor.visitRegex(this, context);
+  }
+}

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java

Lines changed: 28 additions & 0 deletions
@@ -48,6 +48,7 @@
 import org.apache.calcite.rex.RexNode;
 import org.apache.calcite.rex.RexWindowBounds;
 import org.apache.calcite.sql.fun.SqlStdOperatorTable;
+import org.apache.calcite.sql.type.SqlTypeFamily;
 import org.apache.calcite.sql.type.SqlTypeName;
 import org.apache.calcite.tools.RelBuilder;
 import org.apache.calcite.tools.RelBuilder.AggCall;
@@ -99,6 +100,7 @@
 import org.opensearch.sql.ast.tree.Patterns;
 import org.opensearch.sql.ast.tree.Project;
 import org.opensearch.sql.ast.tree.RareTopN;
+import org.opensearch.sql.ast.tree.Regex;
 import org.opensearch.sql.ast.tree.Relation;
 import org.opensearch.sql.ast.tree.Rename;
 import org.opensearch.sql.ast.tree.SPath;
@@ -174,6 +176,32 @@ public RelNode visitFilter(Filter node, CalcitePlanContext context) {
     return context.relBuilder.peek();
   }
 
+  @Override
+  public RelNode visitRegex(Regex node, CalcitePlanContext context) {
+    visitChildren(node, context);
+
+    RexNode fieldRex = rexVisitor.analyze(node.getField(), context);
+    RexNode patternRex = rexVisitor.analyze(node.getPattern(), context);
+
+    if (!SqlTypeFamily.CHARACTER.contains(fieldRex.getType())) {
+      throw new IllegalArgumentException(
+          String.format(
+              "Regex command requires field of string type, but got %s for field '%s'",
+              fieldRex.getType().getSqlTypeName(), node.getField().toString()));
+    }
+
+    RexNode regexCondition =
+        context.rexBuilder.makeCall(
+            org.apache.calcite.sql.fun.SqlLibraryOperators.REGEXP_CONTAINS, fieldRex, patternRex);
+
+    if (node.isNegated()) {
+      regexCondition = context.rexBuilder.makeCall(SqlStdOperatorTable.NOT, regexCondition);
+    }
+
+    context.relBuilder.filter(regexCondition);
+    return context.relBuilder.peek();
+  }
+
   private boolean containsSubqueryExpression(Node expr) {
     if (expr == null) {
       return false;
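
The `visitRegex` change above turns the command into a filter: a `REGEXP_CONTAINS` call on the field, wrapped in `NOT` when the node is negated. A minimal plain-Java sketch of that row-level behaviour follows. It is not the Calcite code path; the class and method names (`RegexFilterSketch`, `regexFilter`, `apply`) are hypothetical, and dropping null values in both the plain and negated cases is an assumption made here to mirror how a SQL filter treats a NULL condition.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names exist in the commit.
final class RegexFilterSketch {

  /** Builds a predicate that keeps values containing a match, or lacking one when negated. */
  static Predicate<String> regexFilter(String regex, boolean negated) {
    Pattern compiled = Pattern.compile(regex);
    // Assumption: a null value never passes the filter, negated or not.
    // find() gives "contains" semantics: the pattern may match anywhere in the value.
    return value -> value != null && (compiled.matcher(value).find() != negated);
  }

  /** Applies the filter to a list of field values and returns the surviving ones. */
  static List<String> apply(List<String> values, String regex, boolean negated) {
    return values.stream().filter(regexFilter(regex, negated)).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    List<String> names = List.of("Smith", "smith", "O'Neil");
    System.out.println(apply(names, "^[A-Z][a-z]+$", false));
    System.out.println(apply(names, "^[A-Z][a-z]+$", true));
  }
}
```

Anchoring the pattern with `^` and `$`, as the test pattern in this commit does, is how a caller would ask for a whole-value match under contains-style semantics.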

core/src/main/java/org/opensearch/sql/expression/parse/RegexCommonUtils.java

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
+/*
+ * Copyright OpenSearch Contributors
+ * SPDX-License-Identifier: Apache-2.0
+ */
+
+package org.opensearch.sql.expression.parse;
+
+import com.google.common.collect.ImmutableList;
+import java.util.Collections;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
+
+/**
+ * Common utilities for regex operations. Provides pattern caching and consistent matching behavior.
+ */
+public class RegexCommonUtils {
+
+  private static final Pattern NAMED_GROUP_PATTERN =
+      Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");
+
+  private static final int MAX_CACHE_SIZE = 1000;
+
+  private static final Map<String, Pattern> patternCache =
+      Collections.synchronizedMap(
+          new LinkedHashMap<>(MAX_CACHE_SIZE + 1, 0.75f, true) {
+            @Override
+            protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
+              return size() > MAX_CACHE_SIZE;
+            }
+          });
+
+  /**
+   * Get compiled pattern from cache or compile and cache it.
+   *
+   * @param regex The regex pattern string
+   * @return Compiled Pattern object
+   * @throws PatternSyntaxException if the regex is invalid
+   */
+  public static Pattern getCompiledPattern(String regex) {
+    Pattern pattern = patternCache.get(regex);
+    if (pattern == null) {
+      pattern = Pattern.compile(regex);
+      patternCache.put(regex, pattern);
+    }
+    return pattern;
+  }
+
+  /**
+   * Extract list of named group candidates from a regex pattern.
+   *
+   * @param pattern The regex pattern string
+   * @return List of named group names found in the pattern
+   */
+  public static List<String> getNamedGroupCandidates(String pattern) {
+    ImmutableList.Builder<String> namedGroups = ImmutableList.builder();
+    Matcher m = NAMED_GROUP_PATTERN.matcher(pattern);
+    while (m.find()) {
+      namedGroups.add(m.group(1));
+    }
+    return namedGroups.build();
+  }
+
+  /**
+   * Match using find() for partial match semantics with string pattern.
+   *
+   * @param text The text to match against
+   * @param patternStr The pattern string
+   * @return true if pattern is found anywhere in the text
+   * @throws PatternSyntaxException if the regex is invalid
+   */
+  public static boolean matchesPartial(String text, String patternStr) {
+    if (text == null || patternStr == null) {
+      return false;
+    }
+    Pattern pattern = getCompiledPattern(patternStr);
+    return pattern.matcher(text).find();
+  }
+
+  /**
+   * Extract a specific named group from text using the pattern. Used by parse command regex method.
+   *
+   * @param text The text to extract from
+   * @param pattern The compiled pattern with named groups
+   * @param groupName The name of the group to extract
+   * @return The extracted value or null if not found
+   */
+  public static String extractNamedGroup(String text, Pattern pattern, String groupName) {
+    if (text == null || pattern == null || groupName == null) {
+      return null;
+    }
+
+    Matcher matcher = pattern.matcher(text);
+
+    if (matcher.matches()) {
+      try {
+        return matcher.group(groupName);
+      } catch (IllegalArgumentException e) {
+        return null;
+      }
+    }
+
+    return null;
+  }
+}
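
The pattern cache in `RegexCommonUtils` relies on two `LinkedHashMap` features: access ordering (the third constructor argument set to `true`) and the `removeEldestEntry` hook. The sketch below shows how those two pieces produce least-recently-used eviction, using a capacity of 2 so the effect is easy to see. The class `PatternLruCacheSketch` and its methods are hypothetical. Unlike the commit, which does a separate `get` and `put`, this sketch uses `computeIfAbsent`; that is a variation chosen here for brevity, not the commit's approach.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the class added by the commit.
final class PatternLruCacheSketch {
  private final Map<String, Pattern> cache;

  PatternLruCacheSketch(int maxSize) {
    // accessOrder=true moves an entry to the tail whenever it is read,
    // so the head of the map is always the least recently used entry.
    this.cache =
        Collections.synchronizedMap(
            new LinkedHashMap<String, Pattern>(maxSize + 1, 0.75f, true) {
              @Override
              protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
                // Called after each insertion; returning true evicts the head entry.
                return size() > maxSize;
              }
            });
  }

  /** Returns the cached compiled pattern, compiling and caching it on a miss. */
  Pattern get(String regex) {
    return cache.computeIfAbsent(regex, Pattern::compile);
  }

  /** containsKey does not count as an access, so it leaves the eviction order unchanged. */
  boolean isCached(String regex) {
    return cache.containsKey(regex);
  }

  int size() {
    return cache.size();
  }

  public static void main(String[] args) {
    PatternLruCacheSketch cache = new PatternLruCacheSketch(2);
    cache.get("a+");
    cache.get("b+");
    cache.get("a+"); // touch "a+" so "b+" becomes the eldest entry
    cache.get("c+"); // inserting a third pattern evicts "b+"
    System.out.println(cache.isCached("a+") + " " + cache.isCached("b+") + " " + cache.isCached("c+"));
  }
}
```

Wrapping the map with `Collections.synchronizedMap` matters more than usual here, because with access ordering even a read reorders the internal linked list.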

core/src/main/java/org/opensearch/sql/expression/parse/RegexExpression.java

Lines changed: 6 additions & 23 deletions
@@ -5,9 +5,6 @@
 
 package org.opensearch.sql.expression.parse;
 
-import com.google.common.collect.ImmutableList;
-import java.util.List;
-import java.util.regex.Matcher;
 import java.util.regex.Pattern;
 import lombok.EqualsAndHashCode;
 import lombok.Getter;
@@ -24,7 +21,6 @@
 @ToString
 public class RegexExpression extends ParseExpression {
   private static final Logger log = LogManager.getLogger(RegexExpression.class);
-  private static final Pattern GROUP_PATTERN = Pattern.compile("\\(\\?<([a-zA-Z][a-zA-Z0-9]*)>");
   @Getter @EqualsAndHashCode.Exclude private final Pattern regexPattern;
 
   /**
@@ -36,32 +32,19 @@ public class RegexExpression extends ParseExpression {
    */
   public RegexExpression(Expression sourceField, Expression pattern, Expression identifier) {
     super("regex", sourceField, pattern, identifier);
-    this.regexPattern = Pattern.compile(pattern.valueOf().stringValue());
+    this.regexPattern = RegexCommonUtils.getCompiledPattern(pattern.valueOf().stringValue());
   }
 
   @Override
   ExprValue parseValue(ExprValue value) throws ExpressionEvaluationException {
     String rawString = value.stringValue();
-    Matcher matcher = regexPattern.matcher(rawString);
-    if (matcher.matches()) {
-      return new ExprStringValue(matcher.group(identifierStr));
+
+    String extracted = RegexCommonUtils.extractNamedGroup(rawString, regexPattern, identifierStr);
+
+    if (extracted != null) {
+      return new ExprStringValue(extracted);
     }
     log.debug("failed to extract pattern {} from input ***", regexPattern.pattern());
     return new ExprStringValue("");
   }
-
-  /**
-   * Get list of derived fields based on parse pattern.
-   *
-   * @param pattern pattern used for parsing
-   * @return list of names of the derived fields
-   */
-  public static List<String> getNamedGroupCandidates(String pattern) {
-    ImmutableList.Builder<String> namedGroups = ImmutableList.builder();
-    Matcher m = GROUP_PATTERN.matcher(pattern);
-    while (m.find()) {
-      namedGroups.add(m.group(1));
-    }
-    return namedGroups.build();
-  }
 }
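
After this refactor the parse path still extracts named groups only on a full match (`matches()` inside `extractNamedGroup`), while `matchesPartial` in `RegexCommonUtils` uses `find()`. The sketch below contrasts the two on the same input. The class `MatchSemanticsSketch`, its method names, and the sample text are all hypothetical, and it skips the pattern cache and the `IllegalArgumentException` handling that the commit's helper includes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; the sample input and names are not from the commit.
final class MatchSemanticsSketch {

  /** Partial match: true when the pattern occurs anywhere in the text. */
  static boolean partial(String text, String regex) {
    return Pattern.compile(regex).matcher(text).find();
  }

  /** Full match: returns the named group only when the pattern spans the whole text. */
  static String extract(String text, String regex, String groupName) {
    Matcher matcher = Pattern.compile(regex).matcher(text);
    // group(String) throws IllegalArgumentException for an unknown group name;
    // callers of this sketch are assumed to pass a name that exists in the pattern.
    return matcher.matches() ? matcher.group(groupName) : null;
  }

  public static void main(String[] args) {
    String text = "user=alice id=42";
    System.out.println(partial(text, "id=(?<id>\\d+)"));
    System.out.println(extract(text, "id=(?<id>\\d+)", "id"));
    System.out.println(extract(text, ".*id=(?<id>\\d+)", "id"));
  }
}
```

The middle call returns `null` because the pattern does not cover the leading `user=alice ` text; in `parseValue` that case falls through to the empty-string result.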

core/src/main/java/org/opensearch/sql/utils/ParseUtils.java

Lines changed: 2 additions & 1 deletion
@@ -16,6 +16,7 @@
 import org.opensearch.sql.expression.parse.GrokExpression;
 import org.opensearch.sql.expression.parse.ParseExpression;
 import org.opensearch.sql.expression.parse.PatternsExpression;
+import org.opensearch.sql.expression.parse.RegexCommonUtils;
 import org.opensearch.sql.expression.parse.RegexExpression;
 
 /** Utils for {@link ParseExpression}. */
@@ -57,7 +58,7 @@ public static List<String> getNamedGroupCandidates(
       ParseMethod parseMethod, String pattern, Map<String, Literal> arguments) {
     switch (parseMethod) {
       case REGEX:
-        return RegexExpression.getNamedGroupCandidates(pattern);
+        return RegexCommonUtils.getNamedGroupCandidates(pattern);
       case GROK:
         return GrokExpression.getNamedGroupCandidates(pattern);
       default:

core/src/test/java/org/opensearch/sql/analysis/AnalyzerTest.java

Lines changed: 14 additions & 0 deletions
@@ -1929,4 +1929,18 @@ public void brain_patterns_command() {
 
     assertAnalyzeEqual(expectedPlan, patterns);
   }
+
+  @Test
+  public void regex_command_throws_unsupported_exception_with_legacy_engine() {
+    UnsupportedOperationException exception =
+        assertThrows(
+            UnsupportedOperationException.class,
+            () ->
+                analyze(
+                    new org.opensearch.sql.ast.tree.Regex(
+                            field("lastname"), false, stringLiteral("^[A-Z][a-z]+$"))
+                        .attach(relation("schema"))));
+    assertEquals(
+        "REGEX is supported only when plugins.calcite.enabled=true", exception.getMessage());
+  }
 }
