[Analytics-Engine] [BugFix] Decorrelate EXISTS / IN / SOME / ANY subqueries before execution#5460
Conversation
PPL queries that contain EXISTS / IN / SOME / ANY subqueries (and the
`subsearch` shapes that lower to them) currently crash on the
analytics-engine route with:
IllegalStateException: Unrecognized filter operator [EXISTS / EXISTS]
The analytics-engine's OpenSearchFilterRule resolves every leaf
predicate through ScalarFunction.fromSqlOperatorWithFallback, which
doesn't (and shouldn't) cover RexSubQuery operators. The legacy engine
path doesn't hit this because Calcite's RelRunner.prepareStatement
internally runs SubQueryRemoveRule + decorrelation before physical
conversion — the analytics path skips that pipeline entirely.
Fix: run the three SubQueryToCorrelate rules followed by
RelDecorrelator.decorrelateQuery on the RelNode just before handing it
to QueryPlanExecutor.execute (and the same in explain()). Placed on
AnalyticsExecutionEngine specifically so the legacy plan flow stays
byte-for-byte unchanged.
After: 19 -> 13 failures on CalcitePPLExistsSubqueryIT. The
"Unrecognized filter operator" is gone from every test (0/19 of that
signature remain); the remaining 13 failures are AssertionError result
mismatches that point at downstream analytics-engine semantics for
LEFT_ANTI / bare-group-by Aggregate inside the decorrelated join, which
are unrelated to subquery removal and need separate work.
Signed-off-by: Jialiang Liang <jialianl@amazon.com>
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
PR Reviewer Guide 🔍(Review updated until commit 7de6215)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to 7de6215
Previous suggestionsSuggestions up to commit 328ee1b
|
Without the fast-path, the rewrite path's HepPlanner.setRoot(plan) calls .getCluster() on the input — which is null on the mocked RelNodes the existing AnalyticsExecutionEngineTest passes through execute(). Six of the suite's nine tests started failing with 'NullPointerException: cluster' after this PR's first commit. Add a containsSubQuery(RelNode) pre-check that walks the plan with a RelHomogeneousShuttle + RexShuttle, returns true the moment it sees a RexSubQuery, and false otherwise. The rewrite chain only runs when the discriminator says there's something to remove — restores compatibility with the existing mock-based tests and skips HEP entirely for the typical query that has no subqueries. Adds two focused unit tests on the discriminator (package-private so the test can call it directly): * containsSubQuery_falseForPlanWithoutSubQuery — vanilla mocked plan reports false, preventing the HEP/decorrelator path from ever firing on cluster-less mocks. * containsSubQuery_trueWhenRelNodeExposesRexSubQueryDuringTraversal — stub the RelNode.accept(RexShuttle) to surface a RexSubQuery and verify it's detected. The HEP+decorrelator chain itself is Calcite-tested; the end-to-end behavior is covered by CalcitePPLExistsSubqueryIT (0/19 -> 6/19 pass, zero remaining 'Unrecognized filter operator' failures). Signed-off-by: Jialiang Liang <jialianl@amazon.com> Signed-off-by: Jialiang Liang <jiallian@amazon.com>
|
Persistent review updated to latest commit 7de6215 |
Thanks @dai-chen for the pointer to #5246 — re-reading the Component Responsibilities table, I think it actually supports keeping this in the SQL plugin, but in a slightly different place than my current PR. The table assigns "logical optimization" (HepPlanner rewrites like PPL dedup/stats simplification) to the SQL/PPL Plugin column, and "physical optimization" to AE. Subquery removal is a logical-plan rewrite — The reason it feels like it should go AE-side today is that the unified path doesn't have a logical-optimizer stage yet — My current placement ( Happy to move it AE-side if you'd rather, but wanted to flag this option before we pick. WDYT? |
|
Transferring some of the internal convo with @dai-chen here: we decided to make the implementation of fix in core AE side as for now - same reason as I mentioned in the above that the unified path doesn't have a logical-optimizer stage yet. |
Description
PPL queries that contain
EXISTS/NOT EXISTS/IN/SOME/ANYsubqueries — and thesubsearchshapes that lower to them — currently crash on the analytics-engine route with:The whole class
CalcitePPLExistsSubqueryIT(19 tests) fails this way against an analytics-engine-routed cluster. This PR fixes the root cause of that signature.Root cause
The full failure path:
Where:
Error site:
sandbox/plugins/analytics-engine/.../OpenSearchFilterRule.java:155-160— analytics-engine resolves every leaf predicate throughScalarFunction.fromSqlOperatorWithFallback, which doesn't (and shouldn't) coverRexSubQueryoperators.RexSubQueryextendsRexCall, so the upstreaminstanceof RexCallcheck waves it through.Why the legacy engine works: the legacy path reaches
CalciteToolsHelper.RelRunner.run(...) → runner.prepareStatement(rel)which internally runsSubQueryRemoveRule+ decorrelation inside Calcite's Volcano-driven physical conversion. The analytics path goesAnalyticsExecutionEngine → QueryPlanExecutor.execute, bypassing that pipeline entirely.Search the SQL plugin tree for
SubQueryRemoveRule,CoreRules.FILTER_SUB_QUERY_TO_CORRELATE,RelDecorrelator.decorrelateQuery— zero call sites. Nothing in the unified pipeline removes subqueries for the analytics route.Fix
Run Calcite's three
SubQueryToCorrelaterules followed byRelDecorrelator.decorrelateQueryon theRelNodejust before handing it to the analytics executor (and the same inexplain()):Placement matters. I considered three locations:
UnifiedQueryPlanner.plan()AnalyticsExecutionEngine.execute()(this PR)OpenSearchFilterRuleThis PR uses the middle one: the rewrite fires only on the analytics route, so the legacy path's
RelNodepipeline is byte-for-byte unchanged.After-fix verification
Same
CalcitePPLExistsSubqueryITclass, same cluster, same routing (-Dtests.analytics.parquet_indices=true):IllegalStateException: Unrecognized filter operator [EXISTS / EXISTS]AssertionError(result-row mismatches) — zero remainingIllegalStateExceptionsThe "Unrecognized filter operator [EXISTS / EXISTS]" signature is gone from every test — the RCA is fully addressed. Per-test breakdown after the fix:
testEmptyExistsSubquery,testExistsSubqueryWithConjunction,testNotExistsSubquery,testNotExistsSubqueryInFilter,testSubsearchMaxOut2,testSubsearchMaxOut4testSimpleExistsSubquery,testSimpleExistsSubqueryInFilter,testUncorrelatedExistsSubquery,testUncorrelatedExistsSubqueryCheckTheReturnContentOfInnerTableIsEmptyOrNot,testNestedExistsSubquery,testExistsSubqueryAndAggregation,testIssue3566,testSubsearchMaxOut1,testSubsearchMaxOut3,testSubsearchMaxOutNegativeMeansUnlimited,testSubsearchMaxOutUncorrelated,testCorrelatedSubsearchMaxOutZeroMeansUnlimited,testUncorrelatedSubsearchMaxOutZeroMeansUnlimitedThe decorrelated plan for the failing tests looks correct on inspection — e.g.
testSimpleExistsSubqueryproduces:— textbook EXISTS semantics (INNER join with the inner side group-by-deduplicated). The 2× row multiplication that's actually observed suggests the bare-group-by
LogicalAggregate(group=[{0}])isn't preserving its dedup semantic through the analytics-engine's join + aggregate execution path. That's a separate downstream issue from the subquery-removal RCA and will be tracked under its own PR.Out of scope
Anything beyond the EXISTS / IN / SOME / ANY subquery-removal RCA:
Aggregatededup semantics inside a join — needs analytics-engine planner / executor changes. Tracked separately; this PR doesn't change those code paths.RelDecorrelatorwill still raise for correlated subqueries Calcite can't decorrelate (e.g. correlated nondeterministic functions). Those queries used to crash with "Unrecognized filter operator"; with this PR they'll crash with a Calcite decorrelation message — net same outcome (failure → failure), but a different error string.Tests
:core:compileJava— green:opensearch-sql-plugin:publishToMavenLocal— greenCalcitePPLExistsSubqueryITvia:integ-test:integTestRemoteagainst the analytics-engine-routed cluster — 6/19 pass, 13/19 fail with downstream result-mismatch (see verification table above)