Bring CalcitePPLJoinIT to parity on the analytics-engine route#5554
Merged
songkant-aws merged 1 commit intoJun 16, 2026
Conversation
5960ca0 to
41c5052
Compare
CalcitePPLJoinIT failed on the analytics-engine route (parquet/composite store + DataFusion backend, -Dtests.analytics.parquet_indices=true) for three distinct reasons, none of which are real query defects. This brings the class to parity without weakening the assertions. 1. Non-idempotent seed inflated/diverged the shared state_country index. init() runs before every test method (@before). After loadIndex() it unconditionally PUT _doc/5..8 to grow state_country from 4 to 8 rows. On the analytics-engine route the parquet/composite store is append-only and does not overwrite by _id, so re-running those PUTs every method accumulated duplicate rows and inflated downstream join counts (e.g. expected 6, got 60). Make the seed conditional, gated on the SAME `isIndexExist` check loadIndex uses: seed exactly when loadIndex (re)creates state_country, so seed and load stay in lockstep. (An earlier attempt used a static "seed once" latch, but that desyncs from the index — when the framework recreates state_country with only the 4 fixture rows, a latch already flipped true skips re-seeding and leaves the index at 4 rows, which broke the in-memory integTest on macOS/Windows CI. Gating on isIndexExist is correct regardless of the cluster's per-method index lifecycle.) 2. Column ordering. The analytics-engine route builds its scan schema from the serialized index mapping (getSourceAsMap), which OpenSearch returns in alphabetical field order, whereas the v2/Calcite path preserves declared order. Field-list/implicit-projection joins therefore returned the right rows with columns in a different order. Add explicit `| fields ...` to pin the projection order for the affected tests (testComplexSemiJoin, testComplexAntiJoin, testComplexSortPushDownForSMJWithMaxOptionAndFieldList). 3. Row ordering. The analytics-engine coordinator-reduce (RowProducingSink) appends Arrow batches in arrival order from the SEARCH-threadpool response handlers, so a query without ORDER BY has no guaranteed row order — unlike Calcite's deterministic enumerable execution. Cases: - testComplexRightJoin sorts by a column that is null for the right-only rows, leaving their relative order unspecified; switch verifyDataRowsInOrder -> verifyDataRows. - testInnerJoinWithRelationSubquery ends in `stats ... by` with no ORDER BY, so the two output groups come back in a route-dependent order (flaky); switch verifyDataRowsInOrder -> verifyDataRows. - The testCheckAccessTheReference* tests compare two alias-syntax variants to each other via assertJsonEquals on serialized datarows, which is order-sensitive. They only mean to assert the two variants return the same set of rows. Add MatcherUtils.assertJsonRowsEqualIgnoreOrder (multiset compare) and use it for those comparisons. Verified: standard route (integTest, in-memory) 43/43 pass; analytics-engine route drops from 33 failures to only the two remaining exact-equality-on-bare- text cases (testJoinComparing, testJoinSubsearchMaxOut), a separate route limitation (DYNAMIC_STRING_NO_KEYWORD) tracked elsewhere. Signed-off-by: Songkan Tang <songkant@amazon.com>
41c5052 to
00c96c9
Compare
LantaoJin
approved these changes
Jun 16, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
CalcitePPLJoinITfailed on the analytics-engine route (parquet/composite store + DataFusion backend,-Dtests.analytics.parquet_indices=true) for three distinct reasons, none of which are real query defects. This brings the class to parity without weakening the assertions.1. Non-idempotent seed inflated / diverged the shared
state_countryindex.init()runs before every test method (@Before). AfterloadIndex()it unconditionallyPUT _doc/5..8to growstate_countryfrom 4 to 8 rows. On the analytics-engine route the parquet/composite store is append-only and does not overwrite by_id, so re-running those PUTs every method accumulated duplicate rows and inflated downstream join counts (e.g. expected 6, got 60; self-joins far worse). Make the seed conditional, gated on the sameisIndexExistcheckloadIndexuses — seed exactly whenloadIndex(re)createsstate_country, so seed and load stay in lockstep regardless of the cluster's per-method index lifecycle.2. Column ordering.
The analytics-engine route builds its scan schema from the serialized index mapping (
getSourceAsMap), which OpenSearch returns in alphabetical field order, whereas the v2/Calcite path preserves declared order. Field-list / implicit-projection joins therefore returned the right rows with columns in a different order. Added explicit| fields ...to pin the projection order fortestComplexSemiJoin,testComplexAntiJoin, andtestComplexSortPushDownForSMJWithMaxOptionAndFieldList.3. Row ordering.
The analytics-engine coordinator-reduce (
RowProducingSink) appends Arrow batches in arrival order from the SEARCH-threadpool response handlers, so a query withoutORDER BYhas no guaranteed row order — unlike Calcite's deterministic enumerable execution. Cases:testComplexRightJoinsorts by a column that is null for the right-only rows, leaving their relative order unspecified; switchedverifyDataRowsInOrder->verifyDataRows.testInnerJoinWithRelationSubqueryends instats ... bywith noORDER BY, so the two output groups come back in a route-dependent order (this test was flaky); switchedverifyDataRowsInOrder->verifyDataRows.testCheckAccessTheReference*tests compare two alias-syntax variants to each other viaassertJsonEqualson serialized datarows, which is order-sensitive. They only mean to assert the two variants return the same set of rows. AddedMatcherUtils.assertJsonRowsEqualIgnoreOrder(multiset compare) and used it for those comparisons.Verification
integTest, in-memory): 43/43 pass.testJoinComparingandtestJoinSubsearchMaxOut. Both are exact=equality on a baretextfield (country, no.keywordsub-field), which the DataFusion scan cannot match; this is the separateDYNAMIC_STRING_NO_KEYWORDroute limitation and is intentionally left out of this PR.Related Issues
Check List
--signoffor-s.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.