Skip to content

[Analytics-Engine] [BugFix] Decorrelate EXISTS / IN / SOME / ANY subqueries before execution#5460

Closed
RyanL1997 wants to merge 2 commits into
opensearch-project:mainfrom
RyanL1997:fix-analytics-exists-subquery
Closed

[Analytics-Engine] [BugFix] Decorrelate EXISTS / IN / SOME / ANY subqueries before execution#5460
RyanL1997 wants to merge 2 commits into
opensearch-project:mainfrom
RyanL1997:fix-analytics-exists-subquery

Conversation

@RyanL1997

Copy link
Copy Markdown
Collaborator

Description

PPL queries that contain EXISTS / NOT EXISTS / IN / SOME / ANY subqueries — and the subsearch shapes that lower to them — currently crash on the analytics-engine route with:

HTTP/1.1 500 Internal Server Error
{
  "error": {
    "reason": "There was internal problem at backend",
    "details": "Unrecognized filter operator [EXISTS / EXISTS]",
    "type": "IllegalStateException"
  }
}

The whole class CalcitePPLExistsSubqueryIT (19 tests) fails this way against an analytics-engine-routed cluster. This PR fixes the root cause of that signature.

Root cause

The full failure path:

PPL `... | where [exists ...]`
  → CalciteRelNodeVisitor.analyze(...)
  → LogicalFilter(condition=[EXISTS({...subquery...})])         ← RexSubQuery of SqlKind.EXISTS
  → UnifiedQueryPlanner.plan() returns as-is
  → AnalyticsExecutionEngine.execute() → analytics-engine PlannerImpl
  → OpenSearchFilterRule.resolveViableBackends(predicate)
  → ScalarFunction.fromSqlOperatorWithFallback(EXISTS) returns null
  → throw IllegalStateException("Unrecognized filter operator [EXISTS / EXISTS]")

Where:

  • Error site: sandbox/plugins/analytics-engine/.../OpenSearchFilterRule.java:155-160 — analytics-engine resolves every leaf predicate through ScalarFunction.fromSqlOperatorWithFallback, which doesn't (and shouldn't) cover RexSubQuery operators. RexSubQuery extends RexCall, so the upstream instanceof RexCall check waves it through.

  • Why the legacy engine works: the legacy path reaches CalciteToolsHelper.RelRunner.run(...) → runner.prepareStatement(rel) which internally runs SubQueryRemoveRule + decorrelation inside Calcite's Volcano-driven physical conversion. The analytics path goes AnalyticsExecutionEngine → QueryPlanExecutor.execute, bypassing that pipeline entirely.

  • Search the SQL plugin tree for SubQueryRemoveRule, CoreRules.FILTER_SUB_QUERY_TO_CORRELATE, RelDecorrelator.decorrelateQuery — zero call sites. Nothing in the unified pipeline removes subqueries for the analytics route.

Fix

Run Calcite's three SubQueryToCorrelate rules followed by RelDecorrelator.decorrelateQuery on the RelNode just before handing it to the analytics executor (and the same in explain()):

private static final HepProgram SUBQUERY_REMOVE_PROGRAM =
    new HepProgramBuilder()
        .addRuleInstance(CoreRules.FILTER_SUB_QUERY_TO_CORRELATE)
        .addRuleInstance(CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE)
        .addRuleInstance(CoreRules.JOIN_SUB_QUERY_TO_CORRELATE)
        .build();

private static RelNode removeSubQueries(RelNode plan, CalcitePlanContext context) {
    HepPlanner planner = new HepPlanner(SUBQUERY_REMOVE_PROGRAM);
    planner.setRoot(plan);
    RelNode withCorrelates = planner.findBestExp();
    return RelDecorrelator.decorrelateQuery(withCorrelates, context.relBuilder);
}

Placement matters. I considered three locations:

Placement Affects legacy path? Verdict
UnifiedQueryPlanner.plan() Yes — every query runs the new pass Higher blast radius
AnalyticsExecutionEngine.execute() (this PR) No — only the analytics route Surgical
Per-rule guard in OpenSearchFilterRule Doesn't fix anything Not a fix

This PR uses the middle one: the rewrite fires only on the analytics route, so the legacy path's RelNode pipeline is byte-for-byte unchanged.

After-fix verification

Same CalcitePPLExistsSubqueryIT class, same cluster, same routing (-Dtests.analytics.parquet_indices=true):

Run Pass Fail Failure signature
Before 0 / 19 19 / 19 All IllegalStateException: Unrecognized filter operator [EXISTS / EXISTS]
After (this PR) 6 / 19 13 / 19 All AssertionError (result-row mismatches) — zero remaining IllegalStateExceptions

The "Unrecognized filter operator [EXISTS / EXISTS]" signature is gone from every test — the RCA is fully addressed. Per-test breakdown after the fix:

  • PASS (6): testEmptyExistsSubquery, testExistsSubqueryWithConjunction, testNotExistsSubquery, testNotExistsSubqueryInFilter, testSubsearchMaxOut2, testSubsearchMaxOut4
  • FAIL — result mismatch only (13): testSimpleExistsSubquery, testSimpleExistsSubqueryInFilter, testUncorrelatedExistsSubquery, testUncorrelatedExistsSubqueryCheckTheReturnContentOfInnerTableIsEmptyOrNot, testNestedExistsSubquery, testExistsSubqueryAndAggregation, testIssue3566, testSubsearchMaxOut1, testSubsearchMaxOut3, testSubsearchMaxOutNegativeMeansUnlimited, testSubsearchMaxOutUncorrelated, testCorrelatedSubsearchMaxOutZeroMeansUnlimited, testUncorrelatedSubsearchMaxOutZeroMeansUnlimited

The decorrelated plan for the failing tests looks correct on inspection — e.g. testSimpleExistsSubquery produces:

LogicalJoin(condition=[=(worker.id, $f0)], joinType=[inner])
  LogicalTableScan(worker)
  LogicalProject(uid=[$0], $f1=[true])
    LogicalAggregate(group=[{0}])           ← dedups inner side by uid
      LogicalProject(uid=[$3])
        LogicalFilter(condition=[IS NOT NULL($3)])
          LogicalTableScan(work_information)

— textbook EXISTS semantics (INNER join with the inner side group-by-deduplicated). The 2× row multiplication that's actually observed suggests the bare-group-by LogicalAggregate(group=[{0}]) isn't preserving its dedup semantic through the analytics-engine's join + aggregate execution path. That's a separate downstream issue from the subquery-removal RCA and will be tracked under its own PR.

Out of scope

Anything beyond the EXISTS / IN / SOME / ANY subquery-removal RCA:

  • Bare-group-by Aggregate dedup semantics inside a join — needs analytics-engine planner / executor changes. Tracked separately; this PR doesn't change those code paths.
  • RelDecorrelator will still raise for correlated subqueries Calcite can't decorrelate (e.g. correlated nondeterministic functions). Those queries used to crash with "Unrecognized filter operator"; with this PR they'll crash with a Calcite decorrelation message — net same outcome (failure → failure), but a different error string.

Tests

  • :core:compileJava — green
  • :opensearch-sql-plugin:publishToMavenLocal — green
  • CalcitePPLExistsSubqueryIT via :integ-test:integTestRemote against the analytics-engine-routed cluster — 6/19 pass, 13/19 fail with downstream result-mismatch (see verification table above)

PPL queries that contain EXISTS / IN / SOME / ANY subqueries (and the
`subsearch` shapes that lower to them) currently crash on the
analytics-engine route with:

    IllegalStateException: Unrecognized filter operator [EXISTS / EXISTS]

The analytics-engine's OpenSearchFilterRule resolves every leaf
predicate through ScalarFunction.fromSqlOperatorWithFallback, which
doesn't (and shouldn't) cover RexSubQuery operators. The legacy engine
path doesn't hit this because Calcite's RelRunner.prepareStatement
internally runs SubQueryRemoveRule + decorrelation before physical
conversion — the analytics path skips that pipeline entirely.

Fix: run the three SubQueryToCorrelate rules followed by
RelDecorrelator.decorrelateQuery on the RelNode just before handing it
to QueryPlanExecutor.execute (and the same in explain()). Placed on
AnalyticsExecutionEngine specifically so the legacy plan flow stays
byte-for-byte unchanged.

After: 19 -> 13 failures on CalcitePPLExistsSubqueryIT. The
"Unrecognized filter operator" is gone from every test (0/19 of that
signature remain); the remaining 13 failures are AssertionError result
mismatches that point at downstream analytics-engine semantics for
LEFT_ANTI / bare-group-by Aggregate inside the decorrelated join, which
are unrelated to subquery removal and need separate work.

Signed-off-by: Jialiang Liang <jialianl@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

(Review updated until commit 7de6215)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

The containsSubQuery method uses a mutable array boolean[] found = {false} to track state across shuttle visits. If the shuttle's visit method is called concurrently (e.g., if Calcite's traversal becomes multi-threaded in a future version), this shared mutable state could cause race conditions where one thread's write to found[0] is not visible to another, leading to incorrect detection results.

static boolean containsSubQuery(RelNode plan) {
  boolean[] found = {false};
  RexShuttle rexFinder =
      new RexShuttle() {
        @Override
        public org.apache.calcite.rex.RexNode visitSubQuery(RexSubQuery sub) {
          found[0] = true;
          return sub;
        }
      };
  plan.accept(
      new RelHomogeneousShuttle() {
        @Override
        public RelNode visit(RelNode node) {
          if (found[0]) return node;
          node.accept(rexFinder);
          return found[0] ? node : super.visit(node);
        }
      });
  return found[0];
}
Incomplete Traversal

In containsSubQuery, the RelHomogeneousShuttle.visit returns early with return node when found[0] is true, but this bypasses super.visit(node), which means child nodes are not traversed. If a subquery exists only in a deeply nested child that hasn't been visited yet when the first found[0] check passes due to a false positive or unrelated state, the traversal stops prematurely. However, the current logic sets found[0] = true only in visitSubQuery, so this early return should be safe. The real issue is that after calling node.accept(rexFinder), if found[0] is still false, super.visit(node) is called to continue traversal, but if found[0] becomes true during node.accept(rexFinder), the method returns node without calling super.visit(node), which is correct for stopping traversal. This logic appears sound on closer inspection.

new RelHomogeneousShuttle() {
  @Override
  public RelNode visit(RelNode node) {
    if (found[0]) return node;
    node.accept(rexFinder);
    return found[0] ? node : super.visit(node);
  }

@github-actions

github-actions Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Latest suggestions up to 7de6215
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Replace mutable array with AtomicBoolean

The containsSubQuery method uses a mutable array to track state across nested
anonymous classes, which is error-prone. Consider using AtomicBoolean instead for
clearer intent and thread-safety, or refactor to throw an exception when a subquery
is found to avoid the mutable state pattern entirely.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [196-216]

 static boolean containsSubQuery(RelNode plan) {
-  boolean[] found = {false};
+  java.util.concurrent.atomic.AtomicBoolean found = new java.util.concurrent.atomic.AtomicBoolean(false);
   RexShuttle rexFinder =
       new RexShuttle() {
         @Override
         public org.apache.calcite.rex.RexNode visitSubQuery(RexSubQuery sub) {
-          found[0] = true;
+          found.set(true);
           return sub;
         }
       };
   plan.accept(
       new RelHomogeneousShuttle() {
         @Override
         public RelNode visit(RelNode node) {
-          if (found[0]) return node;
+          if (found.get()) return node;
           node.accept(rexFinder);
-          return found[0] ? node : super.visit(node);
+          return found.get() ? node : super.visit(node);
         }
       });
-  return found[0];
+  return found.get();
 }
Suggestion importance[1-10]: 4

__

Why: While using AtomicBoolean instead of a mutable array is a valid style improvement, the current implementation is a common Java pattern for capturing state in anonymous classes and works correctly. The suggestion doesn't address a bug or significant issue, and thread-safety is not a concern here since the traversal is single-threaded. The improvement is marginal.

Low

Previous suggestions

Suggestions up to commit 328ee1b
CategorySuggestion                                                                                                                                    Impact
General
Add exception handling for decorrelation

Wrap the decorrelation logic in a try-catch block to handle potential exceptions
from RelDecorrelator.decorrelateQuery(). This prevents unhandled exceptions from
propagating and provides better error context for debugging complex query plans.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [179-184]

 private static RelNode removeSubQueries(RelNode plan, CalcitePlanContext context) {
-  HepPlanner planner = new HepPlanner(SUBQUERY_REMOVE_PROGRAM);
-  planner.setRoot(plan);
-  RelNode withCorrelates = planner.findBestExp();
-  return RelDecorrelator.decorrelateQuery(withCorrelates, context.relBuilder);
+  try {
+    HepPlanner planner = new HepPlanner(SUBQUERY_REMOVE_PROGRAM);
+    planner.setRoot(plan);
+    RelNode withCorrelates = planner.findBestExp();
+    return RelDecorrelator.decorrelateQuery(withCorrelates, context.relBuilder);
+  } catch (Exception e) {
+    throw new RuntimeException("Failed to decorrelate subqueries in query plan", e);
+  }
 }
Suggestion importance[1-10]: 6

__

Why: Adding exception handling around RelDecorrelator.decorrelateQuery() provides better error context and prevents unhandled exceptions from propagating. This is useful for debugging complex query plans. However, wrapping with a generic RuntimeException may not add significant value if the calling methods already have error handling mechanisms in place (as seen in the execute and explain methods with their ActionListener callbacks).

Low
Possible issue
Add null parameter validation

Add null checks for plan and context parameters to prevent potential
NullPointerException. The method is called from multiple execution paths and should
validate inputs before processing.

core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [179-184]

 private static RelNode removeSubQueries(RelNode plan, CalcitePlanContext context) {
+  if (plan == null || context == null) {
+    throw new IllegalArgumentException("Plan and context must not be null");
+  }
   HepPlanner planner = new HepPlanner(SUBQUERY_REMOVE_PROGRAM);
   planner.setRoot(plan);
   RelNode withCorrelates = planner.findBestExp();
   return RelDecorrelator.decorrelateQuery(withCorrelates, context.relBuilder);
 }
Suggestion importance[1-10]: 5

__

Why: Adding null checks is a reasonable defensive programming practice. However, the method is private and called from controlled locations within the same class (execute and explain methods), where plan and context are already validated or guaranteed to be non-null by the calling context. The impact is moderate as it adds safety but may be unnecessary given the controlled usage.

Low

@RyanL1997 RyanL1997 changed the title [analytics-engine] Decorrelate EXISTS / IN / SOME / ANY subqueries before execution [Analytics-Engine] [BugFix] Decorrelate EXISTS / IN / SOME / ANY subqueries before execution May 21, 2026

@dai-chen dai-chen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do this in Analytics Engine instead? We may add a UnifiedQueryOptimizer in future but currently in AE, they do all optimizations there. Ref: Component Responsibilities section in #5246

Without the fast-path, the rewrite path's HepPlanner.setRoot(plan) calls
.getCluster() on the input — which is null on the mocked RelNodes the
existing AnalyticsExecutionEngineTest passes through execute(). Six of
the suite's nine tests started failing with 'NullPointerException:
cluster' after this PR's first commit.

Add a containsSubQuery(RelNode) pre-check that walks the plan with a
RelHomogeneousShuttle + RexShuttle, returns true the moment it sees a
RexSubQuery, and false otherwise. The rewrite chain only runs when the
discriminator says there's something to remove — restores compatibility
with the existing mock-based tests and skips HEP entirely for the
typical query that has no subqueries.

Adds two focused unit tests on the discriminator (package-private so the
test can call it directly):

* containsSubQuery_falseForPlanWithoutSubQuery — vanilla mocked plan
  reports false, preventing the HEP/decorrelator path from ever firing
  on cluster-less mocks.
* containsSubQuery_trueWhenRelNodeExposesRexSubQueryDuringTraversal —
  stub the RelNode.accept(RexShuttle) to surface a RexSubQuery and
  verify it's detected.

The HEP+decorrelator chain itself is Calcite-tested; the end-to-end
behavior is covered by CalcitePPLExistsSubqueryIT (0/19 -> 6/19 pass,
zero remaining 'Unrecognized filter operator' failures).

Signed-off-by: Jialiang Liang <jialianl@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
@github-actions

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 7de6215

@RyanL1997

Copy link
Copy Markdown
Collaborator Author

Can we do this in Analytics Engine instead? We may add a UnifiedQueryOptimizer in future but currently in AE, they do all optimizations there. Ref: Component Responsibilities section in #5246

Thanks @dai-chen for the pointer to #5246 — re-reading the Component Responsibilities table, I think it actually supports keeping this in the SQL plugin, but in a slightly different place than my current PR.

The table assigns "logical optimization" (HepPlanner rewrites like PPL dedup/stats simplification) to the SQL/PPL Plugin column, and "physical optimization" to AE. Subquery removal is a logical-plan rewrite — Filter(EXISTS(subq)) → semi-join via SubQueryRemove + Decorrelator — so by that split it falls on the SQL side.

The reason it feels like it should go AE-side today is that the unified path doesn't have a logical-optimizer stage yet — HEP_PROGRAM lives in CalciteToolsHelper but only runs on the legacy path. UnifiedQueryPlanner.plan() just returns the RelNode after the postAnalysisRules() shuttles. So the "logical optimization" slot for the unified path is empty.

My current placement (AnalyticsExecutionEngine.execute()) works, but it's really the execution bridge. I think UnifiedQueryPlanner.plan() is the better home — it's the central plan-build point, matches the table directly, and naturally becomes the seed for the UnifiedQueryOptimizer you mentioned later (same class, just an earlier checkpoint). Same diff size, just one method up the stack.

Happy to move it AE-side if you'd rather, but wanted to flag this option before we pick. WDYT?

@RyanL1997

Copy link
Copy Markdown
Collaborator Author

Transferring some of the internal convo with @dai-chen here: we decided to make the implementation of fix in core AE side as for now - same reason as I mentioned in the above that the unified path doesn't have a logical-optimizer stage yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants