[Analytics-Engine] [BugFix] Decorrelate EXISTS / IN / SOME / ANY subqueries before execution by RyanL1997 · Pull Request #5460 · opensearch-project/sql

RyanL1997 · 2026-05-21T21:34:11Z

Description

PPL queries that contain EXISTS / NOT EXISTS / IN / SOME / ANY subqueries — and the subsearch shapes that lower to them — currently crash on the analytics-engine route with:

HTTP/1.1 500 Internal Server Error
{
  "error": {
    "reason": "There was internal problem at backend",
    "details": "Unrecognized filter operator [EXISTS / EXISTS]",
    "type": "IllegalStateException"
  }
}

The whole class CalcitePPLExistsSubqueryIT (19 tests) fails this way against an analytics-engine-routed cluster. This PR fixes the root cause of that signature.

Root cause

The full failure path:

PPL `... | where [exists ...]`
  → CalciteRelNodeVisitor.analyze(...)
  → LogicalFilter(condition=[EXISTS({...subquery...})])         ← RexSubQuery of SqlKind.EXISTS
  → UnifiedQueryPlanner.plan() returns as-is
  → AnalyticsExecutionEngine.execute() → analytics-engine PlannerImpl
  → OpenSearchFilterRule.resolveViableBackends(predicate)
  → ScalarFunction.fromSqlOperatorWithFallback(EXISTS) returns null
  → throw IllegalStateException("Unrecognized filter operator [EXISTS / EXISTS]")

Where:

Error site: sandbox/plugins/analytics-engine/.../OpenSearchFilterRule.java:155-160 — analytics-engine resolves every leaf predicate through ScalarFunction.fromSqlOperatorWithFallback, which doesn't (and shouldn't) cover RexSubQuery operators. RexSubQuery extends RexCall, so the upstream instanceof RexCall check waves it through.
Why the legacy engine works: the legacy path reaches CalciteToolsHelper.RelRunner.run(...) → runner.prepareStatement(rel) which internally runs SubQueryRemoveRule + decorrelation inside Calcite's Volcano-driven physical conversion. The analytics path goes AnalyticsExecutionEngine → QueryPlanExecutor.execute, bypassing that pipeline entirely.
Search the SQL plugin tree for SubQueryRemoveRule, CoreRules.FILTER_SUB_QUERY_TO_CORRELATE, RelDecorrelator.decorrelateQuery — zero call sites. Nothing in the unified pipeline removes subqueries for the analytics route.

Fix

Run Calcite's three SubQueryToCorrelate rules followed by RelDecorrelator.decorrelateQuery on the RelNode just before handing it to the analytics executor (and the same in explain()):

private static final HepProgram SUBQUERY_REMOVE_PROGRAM =
    new HepProgramBuilder()
        .addRuleInstance(CoreRules.FILTER_SUB_QUERY_TO_CORRELATE)
        .addRuleInstance(CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE)
        .addRuleInstance(CoreRules.JOIN_SUB_QUERY_TO_CORRELATE)
        .build();

private static RelNode removeSubQueries(RelNode plan, CalcitePlanContext context) {
    HepPlanner planner = new HepPlanner(SUBQUERY_REMOVE_PROGRAM);
    planner.setRoot(plan);
    RelNode withCorrelates = planner.findBestExp();
    return RelDecorrelator.decorrelateQuery(withCorrelates, context.relBuilder);
}

Placement matters. I considered three locations:

Placement	Affects legacy path?	Verdict
`UnifiedQueryPlanner.plan()`	Yes — every query runs the new pass	Higher blast radius
`AnalyticsExecutionEngine.execute()` (this PR)	No — only the analytics route	Surgical
Per-rule guard in `OpenSearchFilterRule`	Doesn't fix anything	Not a fix

This PR uses the middle one: the rewrite fires only on the analytics route, so the legacy path's RelNode pipeline is byte-for-byte unchanged.

After-fix verification

Same CalcitePPLExistsSubqueryIT class, same cluster, same routing (-Dtests.analytics.parquet_indices=true):

Run	Pass	Fail	Failure signature
Before	0 / 19	19 / 19	All `IllegalStateException: Unrecognized filter operator [EXISTS / EXISTS]`
After (this PR)	6 / 19	13 / 19	All `AssertionError` (result-row mismatches) — zero remaining `IllegalStateException`s

The "Unrecognized filter operator [EXISTS / EXISTS]" signature is gone from every test — the RCA is fully addressed. Per-test breakdown after the fix:

PASS (6): testEmptyExistsSubquery, testExistsSubqueryWithConjunction, testNotExistsSubquery, testNotExistsSubqueryInFilter, testSubsearchMaxOut2, testSubsearchMaxOut4
FAIL — result mismatch only (13): testSimpleExistsSubquery, testSimpleExistsSubqueryInFilter, testUncorrelatedExistsSubquery, testUncorrelatedExistsSubqueryCheckTheReturnContentOfInnerTableIsEmptyOrNot, testNestedExistsSubquery, testExistsSubqueryAndAggregation, testIssue3566, testSubsearchMaxOut1, testSubsearchMaxOut3, testSubsearchMaxOutNegativeMeansUnlimited, testSubsearchMaxOutUncorrelated, testCorrelatedSubsearchMaxOutZeroMeansUnlimited, testUncorrelatedSubsearchMaxOutZeroMeansUnlimited

The decorrelated plan for the failing tests looks correct on inspection — e.g. testSimpleExistsSubquery produces:

LogicalJoin(condition=[=(worker.id, $f0)], joinType=[inner])
  LogicalTableScan(worker)
  LogicalProject(uid=[$0], $f1=[true])
    LogicalAggregate(group=[{0}])           ← dedups inner side by uid
      LogicalProject(uid=[$3])
        LogicalFilter(condition=[IS NOT NULL($3)])
          LogicalTableScan(work_information)

— textbook EXISTS semantics (INNER join with the inner side group-by-deduplicated). The 2× row multiplication that's actually observed suggests the bare-group-by LogicalAggregate(group=[{0}]) isn't preserving its dedup semantic through the analytics-engine's join + aggregate execution path. That's a separate downstream issue from the subquery-removal RCA and will be tracked under its own PR.

Out of scope

Anything beyond the EXISTS / IN / SOME / ANY subquery-removal RCA:

Bare-group-by Aggregate dedup semantics inside a join — needs analytics-engine planner / executor changes. Tracked separately; this PR doesn't change those code paths.
RelDecorrelator will still raise for correlated subqueries Calcite can't decorrelate (e.g. correlated nondeterministic functions). Those queries used to crash with "Unrecognized filter operator"; with this PR they'll crash with a Calcite decorrelation message — net same outcome (failure → failure), but a different error string.

Tests

:core:compileJava — green
:opensearch-sql-plugin:publishToMavenLocal — green
CalcitePPLExistsSubqueryIT via :integ-test:integTestRemote against the analytics-engine-routed cluster — 6/19 pass, 13/19 fail with downstream result-mismatch (see verification table above)

PPL queries that contain EXISTS / IN / SOME / ANY subqueries (and the `subsearch` shapes that lower to them) currently crash on the analytics-engine route with: IllegalStateException: Unrecognized filter operator [EXISTS / EXISTS] The analytics-engine's OpenSearchFilterRule resolves every leaf predicate through ScalarFunction.fromSqlOperatorWithFallback, which doesn't (and shouldn't) cover RexSubQuery operators. The legacy engine path doesn't hit this because Calcite's RelRunner.prepareStatement internally runs SubQueryRemoveRule + decorrelation before physical conversion — the analytics path skips that pipeline entirely. Fix: run the three SubQueryToCorrelate rules followed by RelDecorrelator.decorrelateQuery on the RelNode just before handing it to QueryPlanExecutor.execute (and the same in explain()). Placed on AnalyticsExecutionEngine specifically so the legacy plan flow stays byte-for-byte unchanged. After: 19 -> 13 failures on CalcitePPLExistsSubqueryIT. The "Unrecognized filter operator" is gone from every test (0/19 of that signature remain); the remaining 13 failures are AssertionError result mismatches that point at downstream analytics-engine semantics for LEFT_ANTI / bare-group-by Aggregate inside the decorrelated join, which are unrelated to subquery removal and need separate work. Signed-off-by: Jialiang Liang <jialianl@amazon.com> Signed-off-by: Jialiang Liang <jiallian@amazon.com>

github-actions · 2026-05-21T21:35:05Z

PR Reviewer Guide 🔍

(Review updated until commit `7de6215`)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review Possible Issue The `containsSubQuery` method uses a mutable array `boolean[] found = {false}` to track state across shuttle visits. If the shuttle's `visit` method is called concurrently (e.g., if Calcite's traversal becomes multi-threaded in a future version), this shared mutable state could cause race conditions where one thread's write to `found[0]` is not visible to another, leading to incorrect detection results. static boolean containsSubQuery(RelNode plan) { boolean[] found = {false}; RexShuttle rexFinder = new RexShuttle() { @Override public org.apache.calcite.rex.RexNode visitSubQuery(RexSubQuery sub) { found[0] = true; return sub; } }; plan.accept( new RelHomogeneousShuttle() { @Override public RelNode visit(RelNode node) { if (found[0]) return node; node.accept(rexFinder); return found[0] ? node : super.visit(node); } }); return found[0]; } Incomplete Traversal In `containsSubQuery`, the `RelHomogeneousShuttle.visit` returns early with `return node` when `found[0]` is true, but this bypasses `super.visit(node)`, which means child nodes are not traversed. If a subquery exists only in a deeply nested child that hasn't been visited yet when the first `found[0]` check passes due to a false positive or unrelated state, the traversal stops prematurely. However, the current logic sets `found[0] = true` only in `visitSubQuery`, so this early return should be safe. The real issue is that after calling `node.accept(rexFinder)`, if `found[0]` is still false, `super.visit(node)` is called to continue traversal, but if `found[0]` becomes true during `node.accept(rexFinder)`, the method returns `node` without calling `super.visit(node)`, which is correct for stopping traversal. This logic appears sound on closer inspection. new RelHomogeneousShuttle() { @Override public RelNode visit(RelNode node) { if (found[0]) return node; node.accept(rexFinder); return found[0] ? node : super.visit(node); }

github-actions · 2026-05-21T21:35:24Z

PR Code Suggestions ✨

Latest suggestions up to 7de6215
Explore these optional code suggestions:

Category	Suggestion	Impact
General	Replace mutable array with AtomicBoolean The `containsSubQuery` method uses a mutable array to track state across nested anonymous classes, which is error-prone. Consider using `AtomicBoolean` instead for clearer intent and thread-safety, or refactor to throw an exception when a subquery is found to avoid the mutable state pattern entirely. core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [196-216] static boolean containsSubQuery(RelNode plan) { - boolean[] found = {false}; + java.util.concurrent.atomic.AtomicBoolean found = new java.util.concurrent.atomic.AtomicBoolean(false); RexShuttle rexFinder = new RexShuttle() { @Override public org.apache.calcite.rex.RexNode visitSubQuery(RexSubQuery sub) { - found[0] = true; + found.set(true); return sub; } }; plan.accept( new RelHomogeneousShuttle() { @Override public RelNode visit(RelNode node) { - if (found[0]) return node; + if (found.get()) return node; node.accept(rexFinder); - return found[0] ? node : super.visit(node); + return found.get() ? node : super.visit(node); } }); - return found[0]; + return found.get(); } Suggestion importance[1-10]: 4 __ Why: While using `AtomicBoolean` instead of a mutable array is a valid style improvement, the current implementation is a common Java pattern for capturing state in anonymous classes and works correctly. The suggestion doesn't address a bug or significant issue, and thread-safety is not a concern here since the traversal is single-threaded. The improvement is marginal.	Low

Previous suggestions

Suggestions up to commit 328ee1b

Category	Suggestion	Impact
General	Add exception handling for decorrelation Wrap the decorrelation logic in a try-catch block to handle potential exceptions from `RelDecorrelator.decorrelateQuery()`. This prevents unhandled exceptions from propagating and provides better error context for debugging complex query plans. core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [179-184] private static RelNode removeSubQueries(RelNode plan, CalcitePlanContext context) { - HepPlanner planner = new HepPlanner(SUBQUERY_REMOVE_PROGRAM); - planner.setRoot(plan); - RelNode withCorrelates = planner.findBestExp(); - return RelDecorrelator.decorrelateQuery(withCorrelates, context.relBuilder); + try { + HepPlanner planner = new HepPlanner(SUBQUERY_REMOVE_PROGRAM); + planner.setRoot(plan); + RelNode withCorrelates = planner.findBestExp(); + return RelDecorrelator.decorrelateQuery(withCorrelates, context.relBuilder); + } catch (Exception e) { + throw new RuntimeException("Failed to decorrelate subqueries in query plan", e); + } } Suggestion importance[1-10]: 6 __ Why: Adding exception handling around `RelDecorrelator.decorrelateQuery()` provides better error context and prevents unhandled exceptions from propagating. This is useful for debugging complex query plans. However, wrapping with a generic `RuntimeException` may not add significant value if the calling methods already have error handling mechanisms in place (as seen in the `execute` and `explain` methods with their `ActionListener` callbacks).	Low
Possible issue	Add null parameter validation Add null checks for `plan` and `context` parameters to prevent potential `NullPointerException`. The method is called from multiple execution paths and should validate inputs before processing. core/src/main/java/org/opensearch/sql/executor/analytics/AnalyticsExecutionEngine.java [179-184] private static RelNode removeSubQueries(RelNode plan, CalcitePlanContext context) { + if (plan == null \|\| context == null) { + throw new IllegalArgumentException("Plan and context must not be null"); + } HepPlanner planner = new HepPlanner(SUBQUERY_REMOVE_PROGRAM); planner.setRoot(plan); RelNode withCorrelates = planner.findBestExp(); return RelDecorrelator.decorrelateQuery(withCorrelates, context.relBuilder); } Suggestion importance[1-10]: 5 __ Why: Adding null checks is a reasonable defensive programming practice. However, the method is private and called from controlled locations within the same class (`execute` and `explain` methods), where `plan` and `context` are already validated or guaranteed to be non-null by the calling context. The impact is moderate as it adds safety but may be unnecessary given the controlled usage.	Low

dai-chen

Can we do this in Analytics Engine instead? We may add a UnifiedQueryOptimizer in future but currently in AE, they do all optimizations there. Ref: Component Responsibilities section in #5246

Without the fast-path, the rewrite path's HepPlanner.setRoot(plan) calls .getCluster() on the input — which is null on the mocked RelNodes the existing AnalyticsExecutionEngineTest passes through execute(). Six of the suite's nine tests started failing with 'NullPointerException: cluster' after this PR's first commit. Add a containsSubQuery(RelNode) pre-check that walks the plan with a RelHomogeneousShuttle + RexShuttle, returns true the moment it sees a RexSubQuery, and false otherwise. The rewrite chain only runs when the discriminator says there's something to remove — restores compatibility with the existing mock-based tests and skips HEP entirely for the typical query that has no subqueries. Adds two focused unit tests on the discriminator (package-private so the test can call it directly): * containsSubQuery_falseForPlanWithoutSubQuery — vanilla mocked plan reports false, preventing the HEP/decorrelator path from ever firing on cluster-less mocks. * containsSubQuery_trueWhenRelNodeExposesRexSubQueryDuringTraversal — stub the RelNode.accept(RexShuttle) to surface a RexSubQuery and verify it's detected. The HEP+decorrelator chain itself is Calcite-tested; the end-to-end behavior is covered by CalcitePPLExistsSubqueryIT (0/19 -> 6/19 pass, zero remaining 'Unrecognized filter operator' failures). Signed-off-by: Jialiang Liang <jialianl@amazon.com> Signed-off-by: Jialiang Liang <jiallian@amazon.com>

github-actions · 2026-05-21T21:49:35Z

Persistent review updated to latest commit 7de6215

RyanL1997 · 2026-05-21T22:00:14Z

Can we do this in Analytics Engine instead? We may add a UnifiedQueryOptimizer in future but currently in AE, they do all optimizations there. Ref: Component Responsibilities section in #5246

Thanks @dai-chen for the pointer to #5246 — re-reading the Component Responsibilities table, I think it actually supports keeping this in the SQL plugin, but in a slightly different place than my current PR.

The table assigns "logical optimization" (HepPlanner rewrites like PPL dedup/stats simplification) to the SQL/PPL Plugin column, and "physical optimization" to AE. Subquery removal is a logical-plan rewrite — Filter(EXISTS(subq)) → semi-join via SubQueryRemove + Decorrelator — so by that split it falls on the SQL side.

The reason it feels like it should go AE-side today is that the unified path doesn't have a logical-optimizer stage yet — HEP_PROGRAM lives in CalciteToolsHelper but only runs on the legacy path. UnifiedQueryPlanner.plan() just returns the RelNode after the postAnalysisRules() shuttles. So the "logical optimization" slot for the unified path is empty.

My current placement (AnalyticsExecutionEngine.execute()) works, but it's really the execution bridge. I think UnifiedQueryPlanner.plan() is the better home — it's the central plan-build point, matches the table directly, and naturally becomes the seed for the UnifiedQueryOptimizer you mentioned later (same class, just an earlier checkpoint). Same diff size, just one method up the stack.

Happy to move it AE-side if you'd rather, but wanted to flag this option before we pick. WDYT?

RyanL1997 · 2026-05-21T22:29:46Z

Transferring some of the internal convo with @dai-chen here: we decided to make the implementation of fix in core AE side as for now - same reason as I mentioned in the above that the unified path doesn't have a logical-optimizer stage yet.

RyanL1997 changed the title ~~[analytics-engine] Decorrelate EXISTS / IN / SOME / ANY subqueries before execution~~ [Analytics-Engine] [BugFix] Decorrelate EXISTS / IN / SOME / ANY subqueries before execution May 21, 2026

RyanL1997 added the bugFix label May 21, 2026

dai-chen reviewed May 21, 2026

View reviewed changes

RyanL1997 mentioned this pull request May 21, 2026

[Analytics-Engine] Decorrelate RexSubQuery before marking in PlannerImpl opensearch-project/OpenSearch#21795

Merged

RyanL1997 closed this May 21, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Analytics-Engine] [BugFix] Decorrelate EXISTS / IN / SOME / ANY subqueries before execution#5460

[Analytics-Engine] [BugFix] Decorrelate EXISTS / IN / SOME / ANY subqueries before execution#5460
RyanL1997 wants to merge 2 commits into
opensearch-project:mainfrom
RyanL1997:fix-analytics-exists-subquery

RyanL1997 commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 21, 2026 •

edited

Loading

Uh oh!

dai-chen left a comment •

edited

Loading

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

RyanL1997 commented May 21, 2026

Uh oh!

RyanL1997 commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

RyanL1997 commented May 21, 2026

Description

Root cause

Fix

After-fix verification

Out of scope

Tests

Uh oh!

github-actions Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Reviewer Guide 🔍

(Review updated until commit 7de6215)

Uh oh!

github-actions Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Code Suggestions ✨

Previous suggestions

Uh oh!

dai-chen left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

RyanL1997 commented May 21, 2026

Uh oh!

RyanL1997 commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented May 21, 2026 •

edited

Loading

(Review updated until commit `7de6215`)

github-actions Bot commented May 21, 2026 •

edited

Loading

dai-chen left a comment •

edited

Loading