Skip to content

Commit 727c48d

Browse files
committed
Narrow native lowering in visitSearch to numeric / boolean RHS values
The visitor's native RexCall lowering routes structured search predicates through CalciteScriptEngine on the v2/Calcite-with-Lucene path. The script engine compiles a typed comparison directly, which is fine for numeric and boolean RHS values — but it can't reproduce Lucene's analyzer-aware string semantics that PPL `search` has always relied on: - Text fields are tokenised; native `=` would compare against analyzed terms and never match the raw literal (doctest `severityText="INFO"`). - `!=` on text fields filters only documents that match the analyzed token, not ones containing it — doctest `instrumentationScope.name!="@opentelemetry/instrumentation-http"` expects Lucene's "matches any token" semantics. - `IN` lists need the same analyzer semantics for each value. - `@timestamp >= "..."` literals reach the script engine as java.lang.String vs long (date field) — Janino can't compile that comparison and the query fails at shard time. Restrict native lowering to INTEGER / LONG / SHORT / FLOAT / DOUBLE / DECIMAL / BOOLEAN values via `isNativeLowerableValue`. Everything else (STRING, DATE/TIME/TIMESTAMP, IP, NULL) falls back to the original `query_string` path so Lucene handles it natively — on the analytics-engine route via the composite-parquet Lucene secondary, on the v2/Calcite path via the Lucene primary. The same gate applies to `SearchIn` — string IN lists fall back too. Drops the plan-time date-math resolver introduced in the prior iteration — it emitted TIMESTAMP literals that tripped the same script-engine compile failure on Lucene-backed shards, and the broader value-type gate subsumes its narrower string-shape detection. Update `explain_search_numeric_comparison.json` (both pushdown and no-pushdown variants) to reflect the new native shape for the `severityNumber > 15` example — pure numeric comparisons still lower to a native RexCall, which is the intended PR behavior. Update `CalcitePPLSearchTest.testSearchWithFilter` for the same reason — `DEPTNO=20` now lowers natively since the RHS is an integer literal. Signed-off-by: Kai Huang <ahkcs@amazon.com>
1 parent 53c33ab commit 727c48d

5 files changed

Lines changed: 53 additions & 65 deletions

File tree

core/src/main/java/org/opensearch/sql/calcite/CalciteRelNodeVisitor.java

Lines changed: 32 additions & 44 deletions
Original file line numberDiff line numberDiff line change
@@ -94,6 +94,7 @@
9494
import org.opensearch.sql.ast.expression.AllFieldsExcludeMeta;
9595
import org.opensearch.sql.ast.expression.Argument;
9696
import org.opensearch.sql.ast.expression.Argument.ArgumentMap;
97+
import org.opensearch.sql.ast.expression.DataType;
9798
import org.opensearch.sql.ast.expression.Field;
9899
import org.opensearch.sql.ast.expression.Function;
99100
import org.opensearch.sql.ast.expression.Let;
@@ -384,19 +385,26 @@ private Optional<UnresolvedExpression> lowerSearchExpression(
384385
if (!isLowerableField(comp.getField(), knownFields)) {
385386
return Optional.empty();
386387
}
387-
if (isOpenSearchDateMath(comp.getValue().getLiteral())) {
388+
UnresolvedExpression value = comp.getValue().getLiteral();
389+
if (!isNativeLowerableValue(value)) {
390+
// Only numeric / boolean RHS values lower cleanly to native RexCall.
391+
// - Strings hit Lucene's analyzer-aware semantics on text fields; native `=` would
392+
// compare against the analyzed-term storage and never match the raw literal
393+
// (`severityText="INFO"` returns 0 rows when the text field tokenises to
394+
// `[info]`). PPL `search` has always relied on Lucene's analyzer.
395+
// - Date-math values (`now`, `now-1h`, `2024-01-15||+1d`, 12-digit epoch-millis)
396+
// reach the script engine as `java.lang.String` vs `long` (date field) — Janino
397+
// can't compile that comparison and the query fails at shard time.
398+
// Keep these in the original `query_string` fallback so Lucene handles them — both
399+
// on the v2/Calcite path (against the Lucene primary) and on the analytics-engine
400+
// path (via the composite-parquet Lucene secondary).
388401
return Optional.empty();
389402
}
390-
if (containsLuceneWildcard(comp.getValue().getLiteral())) {
391-
// `severityText=ERR*` / `field=foo?` / `name=*-service` — Lucene-style wildcards in
392-
// the right-hand value. A native `=` lowering would compare literally and drop every
393-
// matching document. Keep the query in query_string form so Lucene evaluates the
394-
// wildcard.
403+
if (containsLuceneWildcard(value)) {
404+
// Defensive: numeric values can't contain `*`/`?`, but guard anyway.
395405
return Optional.empty();
396406
}
397-
return Optional.of(
398-
AstDSL.compare(
399-
comp.getOperator().getSymbol(), comp.getField(), comp.getValue().getLiteral()));
407+
return Optional.of(AstDSL.compare(comp.getOperator().getSymbol(), comp.getField(), value));
400408
}
401409
if (e instanceof SearchIn) {
402410
SearchIn in = (SearchIn) e;
@@ -405,6 +413,10 @@ private Optional<UnresolvedExpression> lowerSearchExpression(
405413
}
406414
List<UnresolvedExpression> values =
407415
in.getValues().stream().map(SearchLiteral::getLiteral).collect(Collectors.toList());
416+
if (!values.stream().allMatch(CalciteRelNodeVisitor::isNativeLowerableValue)) {
417+
// Same rationale as SearchComparison — string IN lists need Lucene analyzer semantics.
418+
return Optional.empty();
419+
}
408420
return Optional.of(AstDSL.in(in.getField(), values));
409421
}
410422
// SearchLiteral (standalone free-text / phrase) — no native equivalent.
@@ -417,42 +429,21 @@ private static boolean isLowerableField(Field field, Set<String> knownFields) {
417429
}
418430

419431
/**
420-
* Recognise OpenSearch date-math values produced by {@code visitTimeModifierExpression} — {@code
421-
* "now"}, {@code "now-1h"}, {@code "now+1y/M-1M"}, anchored expressions like {@code
422-
* "2024-01-15||+1d"}, and epoch-millis strings such as {@code "1754020060000"}. These strings
423-
* have to be evaluated as Lucene-style date math; Calcite would compare them lexically. Detecting
424-
* them lets the visitor keep the comparison in {@code query_string} form so the existing
425-
* semantics survive.
432+
* Numeric and boolean RHS values lower cleanly to native {@link org.apache.calcite.rex.RexCall}
433+
* on every backend. Strings (and all date / interval shapes the parser models as strings) need
434+
* Lucene's analyzer-aware {@code query_string} evaluation — both for text-field tokenisation and
435+
* for date-math expressions like {@code now-1h} / anchored literals — so we keep them in the
436+
* fallback path.
426437
*/
427-
private static boolean isOpenSearchDateMath(UnresolvedExpression value) {
438+
private static boolean isNativeLowerableValue(UnresolvedExpression value) {
428439
if (!(value instanceof Literal)) {
429440
return false;
430441
}
431-
Object raw = ((Literal) value).getValue();
432-
if (!(raw instanceof String)) {
433-
return false;
434-
}
435-
String s = ((String) raw).trim();
436-
if (s.isEmpty()) {
437-
return false;
438-
}
439-
if (s.regionMatches(true, 0, "now", 0, 3)) {
440-
return true;
441-
}
442-
if (s.contains("||")) {
443-
return true;
444-
}
445-
// Epoch-millis from `earliest=1754020060` / `latest=1754020060.123` — 12+ digits only.
446-
if (s.length() >= 12) {
447-
for (int i = 0; i < s.length(); i++) {
448-
char c = s.charAt(i);
449-
if (c < '0' || c > '9') {
450-
return false;
451-
}
452-
}
453-
return true;
454-
}
455-
return false;
442+
DataType type = ((Literal) value).getType();
443+
return switch (type) {
444+
case INTEGER, LONG, SHORT, FLOAT, DOUBLE, DECIMAL, BOOLEAN -> true;
445+
default -> false;
446+
};
456447
}
457448

458449
/**
@@ -739,9 +730,6 @@ private List<RexNode> expandProjectFields(
739730
.filter(addedFields::add)
740731
.forEach(field -> expandedFields.add(context.relBuilder.field(field)));
741732
}
742-
case Alias alias -> {
743-
expandedFields.add(rexVisitor.analyze(alias, context));
744-
}
745733
default ->
746734
throw new IllegalStateException(
747735
"Unexpected expression type in project list: " + expr.getClass().getSimpleName());

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java

Lines changed: 11 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -170,21 +170,21 @@ private ExprType convertType(RelDataType type) {
170170

171171
/**
172172
* Recovers the pre-cast datetime types for the schema. {@code DatetimeOutputCastRule} (api
173-
* post-analysis pass) wraps every datetime-typed root column in {@code CAST(... AS VARCHAR)}
174-
* so the analytics-engine backend can emit PPL's documented space-separator format via {@code
175-
* to_char} ({@code DatetimeOutputCastRewriter} on the OpenSearch sandbox side). The downside
176-
* is that {@code plan.getRowType()} then advertises those columns as VARCHAR — the response
177-
* schema would report {@code "string"} for what tests (and callers) still expect as {@code
178-
* "timestamp"}. The legacy / Calcite-reference path doesn't have this divergence: the wire
179-
* value is a formatted string while the schema column type is the original datetime type.
173+
* post-analysis pass) wraps every datetime-typed root column in {@code CAST(... AS VARCHAR)} so
174+
* the analytics-engine backend can emit PPL's documented space-separator format via {@code
175+
* to_char} ({@code DatetimeOutputCastRewriter} on the OpenSearch sandbox side). The downside is
176+
* that {@code plan.getRowType()} then advertises those columns as VARCHAR — the response schema
177+
* would report {@code "string"} for what tests (and callers) still expect as {@code "timestamp"}.
178+
* The legacy / Calcite-reference path doesn't have this divergence: the wire value is a formatted
179+
* string while the schema column type is the original datetime type.
180180
*
181-
* <p>This walk inspects the OUTERMOST {@link Project} (descending through a single {@link
182-
* Sort}, the {@code LogicalSystemLimit} wrapper) and, for any slot of the exact shape {@code
181+
* <p>This walk inspects the OUTERMOST {@link Project} (descending through a single {@link Sort},
182+
* the {@code LogicalSystemLimit} wrapper) and, for any slot of the exact shape {@code
183183
* CAST(<datetime> AS VARCHAR)}, swaps in the inner {@link RelDataType} for the schema. Values
184184
* still come back as the strings DataFusion emitted; only the schema column type is restored.
185185
*
186-
* <p>Falls back to the slot's reported type for non-cast slots, non-datetime sources, or when
187-
* the root isn't a Project (raw scan / aggregate fragment).
186+
* <p>Falls back to the slot's reported type for non-cast slots, non-datetime sources, or when the
187+
* root isn't a Project (raw scan / aggregate fragment).
188188
*/
189189
private static List<RelDataType> recoverOriginalDatetimeTypes(RelNode plan) {
190190
List<RelDataTypeField> fields = plan.getRowType().getFieldList();
Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
{
22
"calcite": {
3-
"logical": "LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])\n LogicalProject(spanId=[$0], traceId=[$1], @timestamp=[$2], instrumentationScope=[$3], severityText=[$7], resource=[$8], flags=[$23], attributes=[$24], droppedAttributesCount=[$162], severityNumber=[$163], time=[$164], body=[$165])\n LogicalFilter(condition=[query_string(MAP('query', 'severityNumber:>15':VARCHAR))])\n CalciteLogicalIndexScan(table=[[OpenSearch, opensearch-sql_test_index_otel_logs]])\n",
4-
"physical": "CalciteEnumerableIndexScan(table=[[OpenSearch, opensearch-sql_test_index_otel_logs]], PushDownContext=[[PROJECT->[spanId, traceId, @timestamp, instrumentationScope, severityText, resource, flags, attributes, droppedAttributesCount, severityNumber, time, body], FILTER->query_string(MAP('query', 'severityNumber:>15':VARCHAR)), LIMIT->10000], OpenSearchRequestBuilder(sourceBuilder={\"from\":0,\"size\":10000,\"timeout\":\"1m\",\"query\":{\"query_string\":{\"query\":\"severityNumber:>15\",\"fields\":[],\"type\":\"best_fields\",\"default_operator\":\"or\",\"max_determinized_states\":10000,\"enable_position_increments\":true,\"fuzziness\":\"AUTO\",\"fuzzy_prefix_length\":0,\"fuzzy_max_expansions\":50,\"phrase_slop\":0,\"escape\":false,\"auto_generate_synonyms_phrase_query\":true,\"fuzzy_transpositions\":true,\"boost\":1.0}},\"_source\":{\"includes\":[\"spanId\",\"traceId\",\"@timestamp\",\"instrumentationScope\",\"severityText\",\"resource\",\"flags\",\"attributes\",\"droppedAttributesCount\",\"severityNumber\",\"time\",\"body\"]}}, requestedTotalSize=10000, pageSize=null, startFrom=0)])\n"
3+
"logical": "LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])\n LogicalProject(spanId=[$0], traceId=[$1], @timestamp=[$2], instrumentationScope=[$3], severityText=[$7], resource=[$8], flags=[$23], attributes=[$24], droppedAttributesCount=[$162], severityNumber=[$163], time=[$164], body=[$165])\n LogicalFilter(condition=[>($163, 15)])\n CalciteLogicalIndexScan(table=[[OpenSearch, opensearch-sql_test_index_otel_logs]])\n",
4+
"physical": "CalciteEnumerableIndexScan(table=[[OpenSearch, opensearch-sql_test_index_otel_logs]], PushDownContext=[[PROJECT->[spanId, traceId, @timestamp, instrumentationScope, severityText, resource, flags, attributes, droppedAttributesCount, severityNumber, time, body], FILTER->>($9, 15), LIMIT->10000], OpenSearchRequestBuilder(sourceBuilder={\"from\":0,\"size\":10000,\"timeout\":\"1m\",\"query\":{\"range\":{\"severityNumber\":{\"from\":15,\"to\":null,\"include_lower\":false,\"include_upper\":true,\"boost\":1.0}}},\"_source\":{\"includes\":[\"spanId\",\"traceId\",\"@timestamp\",\"instrumentationScope\",\"severityText\",\"resource\",\"flags\",\"attributes\",\"droppedAttributesCount\",\"severityNumber\",\"time\",\"body\"]}}, requestedTotalSize=10000, pageSize=null, startFrom=0)])\n"
55
}
6-
}
6+
}
Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
{
22
"calcite": {
3-
"logical": "LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])\n LogicalProject(spanId=[$0], traceId=[$1], @timestamp=[$2], instrumentationScope=[$3], severityText=[$7], resource=[$8], flags=[$23], attributes=[$24], droppedAttributesCount=[$162], severityNumber=[$163], time=[$164], body=[$165])\n LogicalFilter(condition=[query_string(MAP('query', 'severityNumber:>15':VARCHAR))])\n CalciteLogicalIndexScan(table=[[OpenSearch, opensearch-sql_test_index_otel_logs]])\n",
4-
"physical": "EnumerableLimit(fetch=[10000])\n EnumerableCalc(expr#0..171=[{inputs}], proj#0..3=[{exprs}], severityText=[$t7], resource=[$t8], flags=[$t23], attributes=[$t24], droppedAttributesCount=[$t162], severityNumber=[$t163], time=[$t164], body=[$t165])\n CalciteEnumerableIndexScan(table=[[OpenSearch, opensearch-sql_test_index_otel_logs]], PushDownContext=[[FILTER->query_string(MAP('query', 'severityNumber:>15':VARCHAR))], OpenSearchRequestBuilder(sourceBuilder={\"from\":0,\"timeout\":\"1m\",\"query\":{\"query_string\":{\"query\":\"severityNumber:>15\",\"fields\":[],\"type\":\"best_fields\",\"default_operator\":\"or\",\"max_determinized_states\":10000,\"enable_position_increments\":true,\"fuzziness\":\"AUTO\",\"fuzzy_prefix_length\":0,\"fuzzy_max_expansions\":50,\"phrase_slop\":0,\"escape\":false,\"auto_generate_synonyms_phrase_query\":true,\"fuzzy_transpositions\":true,\"boost\":1.0}}}, requestedTotalSize=2147483647, pageSize=null, startFrom=0)])\n"
3+
"logical": "LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])\n LogicalProject(spanId=[$0], traceId=[$1], @timestamp=[$2], instrumentationScope=[$3], severityText=[$7], resource=[$8], flags=[$23], attributes=[$24], droppedAttributesCount=[$162], severityNumber=[$163], time=[$164], body=[$165])\n LogicalFilter(condition=[>($163, 15)])\n CalciteLogicalIndexScan(table=[[OpenSearch, opensearch-sql_test_index_otel_logs]])\n",
4+
"physical": "EnumerableLimit(fetch=[10000])\n EnumerableCalc(expr#0..171=[{inputs}], expr#172=[15], expr#173=[>($t163, $t172)], proj#0..3=[{exprs}], severityText=[$t7], resource=[$t8], flags=[$t23], attributes=[$t24], droppedAttributesCount=[$t162], severityNumber=[$t163], time=[$t164], body=[$t165], $condition=[$t173])\n CalciteEnumerableIndexScan(table=[[OpenSearch, opensearch-sql_test_index_otel_logs]])\n"
55
}
6-
}
6+
}

ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLSearchTest.java

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -44,13 +44,13 @@ protected Frameworks.ConfigBuilder config(CalciteAssert.SchemaSpec... schemaSpec
4444
public void testSearchWithFilter() {
4545
String ppl = "search source=EMP DEPTNO=20";
4646
RelNode root = getRelNode(ppl);
47+
// Numeric RHS values lower to a native RexCall — DataFusion and CalciteScriptEngine both
48+
// compile the comparison directly, so we bypass query_string for these.
4749
String expectedLogical =
48-
"LogicalFilter(condition=[query_string(MAP('query', 'DEPTNO:20':VARCHAR))])\n"
49-
+ " LogicalTableScan(table=[[scott, EMP]])\n";
50+
"LogicalFilter(condition=[=($7, 20)])\n LogicalTableScan(table=[[scott, EMP]])\n";
5051
verifyLogical(root, expectedLogical);
5152

52-
String expectedSparkSql =
53-
"SELECT *\nFROM `scott`.`EMP`\nWHERE QUERY_STRING(MAP ('query', 'DEPTNO:20'))";
53+
String expectedSparkSql = "SELECT *\nFROM `scott`.`EMP`\nWHERE `DEPTNO` = 20";
5454
verifyPPLToSparkSQL(root, expectedSparkSql);
5555
}
5656

0 commit comments

Comments
 (0)