Bring CalcitePPLJoinIT to parity on the analytics-engine route by songkant-aws · Pull Request #5554 · opensearch-project/sql

songkant-aws · 2026-06-16T06:41:56Z

Description

CalcitePPLJoinIT failed on the analytics-engine route (parquet/composite store + DataFusion backend, -Dtests.analytics.parquet_indices=true) for three distinct reasons, none of which are real query defects. This brings the class to parity without weakening the assertions.

1. Non-idempotent seed inflated / diverged the shared state_country index.
init() runs before every test method (@Before). After loadIndex() it unconditionally PUT _doc/5..8 to grow state_country from 4 to 8 rows. On the analytics-engine route the parquet/composite store is append-only and does not overwrite by _id, so re-running those PUTs every method accumulated duplicate rows and inflated downstream join counts (e.g. expected 6, got 60; self-joins far worse). Make the seed conditional, gated on the same isIndexExist check loadIndex uses — seed exactly when loadIndex (re)creates state_country, so seed and load stay in lockstep regardless of the cluster's per-method index lifecycle.

2. Column ordering.
The analytics-engine route builds its scan schema from the serialized index mapping (getSourceAsMap), which OpenSearch returns in alphabetical field order, whereas the v2/Calcite path preserves declared order. Field-list / implicit-projection joins therefore returned the right rows with columns in a different order. Added explicit | fields ... to pin the projection order for testComplexSemiJoin, testComplexAntiJoin, and testComplexSortPushDownForSMJWithMaxOptionAndFieldList.

3. Row ordering.
The analytics-engine coordinator-reduce (RowProducingSink) appends Arrow batches in arrival order from the SEARCH-threadpool response handlers, so a query without ORDER BY has no guaranteed row order — unlike Calcite's deterministic enumerable execution. Cases:

testComplexRightJoin sorts by a column that is null for the right-only rows, leaving their relative order unspecified; switched verifyDataRowsInOrder -> verifyDataRows.
testInnerJoinWithRelationSubquery ends in stats ... by with no ORDER BY, so the two output groups come back in a route-dependent order (this test was flaky); switched verifyDataRowsInOrder -> verifyDataRows.
The testCheckAccessTheReference* tests compare two alias-syntax variants to each other via assertJsonEquals on serialized datarows, which is order-sensitive. They only mean to assert the two variants return the same set of rows. Added MatcherUtils.assertJsonRowsEqualIgnoreOrder (multiset compare) and used it for those comparisons.

Verification

Standard route (integTest, in-memory): 43/43 pass.
Analytics-engine route: dropped from 33 failures to only 2 remaining — testJoinComparing and testJoinSubsearchMaxOut. Both are exact = equality on a bare text field (country, no .keyword sub-field), which the DataFusion scan cannot match; this is the separate DYNAMIC_STRING_NO_KEYWORD route limitation and is intentionally left out of this PR.

Related Issues

Check List

New functionality includes testing.
New functionality has been documented.
New functionality has javadoc added.
New functionality has a user manual doc added.
New PPL command checklist all confirmed.
API changes companion pull request created.
Commits are signed per the DCO using --signoff or -s.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

CalcitePPLJoinIT failed on the analytics-engine route (parquet/composite store + DataFusion backend, -Dtests.analytics.parquet_indices=true) for three distinct reasons, none of which are real query defects. This brings the class to parity without weakening the assertions. 1. Non-idempotent seed inflated/diverged the shared state_country index. init() runs before every test method (@before). After loadIndex() it unconditionally PUT _doc/5..8 to grow state_country from 4 to 8 rows. On the analytics-engine route the parquet/composite store is append-only and does not overwrite by _id, so re-running those PUTs every method accumulated duplicate rows and inflated downstream join counts (e.g. expected 6, got 60). Make the seed conditional, gated on the SAME `isIndexExist` check loadIndex uses: seed exactly when loadIndex (re)creates state_country, so seed and load stay in lockstep. (An earlier attempt used a static "seed once" latch, but that desyncs from the index — when the framework recreates state_country with only the 4 fixture rows, a latch already flipped true skips re-seeding and leaves the index at 4 rows, which broke the in-memory integTest on macOS/Windows CI. Gating on isIndexExist is correct regardless of the cluster's per-method index lifecycle.) 2. Column ordering. The analytics-engine route builds its scan schema from the serialized index mapping (getSourceAsMap), which OpenSearch returns in alphabetical field order, whereas the v2/Calcite path preserves declared order. Field-list/implicit-projection joins therefore returned the right rows with columns in a different order. Add explicit `| fields ...` to pin the projection order for the affected tests (testComplexSemiJoin, testComplexAntiJoin, testComplexSortPushDownForSMJWithMaxOptionAndFieldList). 3. Row ordering. The analytics-engine coordinator-reduce (RowProducingSink) appends Arrow batches in arrival order from the SEARCH-threadpool response handlers, so a query without ORDER BY has no guaranteed row order — unlike Calcite's deterministic enumerable execution. Cases: - testComplexRightJoin sorts by a column that is null for the right-only rows, leaving their relative order unspecified; switch verifyDataRowsInOrder -> verifyDataRows. - testInnerJoinWithRelationSubquery ends in `stats ... by` with no ORDER BY, so the two output groups come back in a route-dependent order (flaky); switch verifyDataRowsInOrder -> verifyDataRows. - The testCheckAccessTheReference* tests compare two alias-syntax variants to each other via assertJsonEquals on serialized datarows, which is order-sensitive. They only mean to assert the two variants return the same set of rows. Add MatcherUtils.assertJsonRowsEqualIgnoreOrder (multiset compare) and use it for those comparisons. Verified: standard route (integTest, in-memory) 43/43 pass; analytics-engine route drops from 33 failures to only the two remaining exact-equality-on-bare- text cases (testJoinComparing, testJoinSubsearchMaxOut), a separate route limitation (DYNAMIC_STRING_NO_KEYWORD) tracked elsewhere. Signed-off-by: Songkan Tang <songkant@amazon.com>

songkant-aws requested review from LantaoJin, RyanL1997, Swiddis, acarbonetto, ahkcs, anirudha, dai-chen, joshuali925, mengweieric, noCharger, penghuo, ps48, qianheng-aws, vamsimanohar, ykmr1224 and yuancu as code owners June 16, 2026 06:41

songkant-aws added bugFix infrastructure Changes to infrastructure, testing, CI/CD, pipelines, etc. and removed bugFix labels Jun 16, 2026

songkant-aws force-pushed the fix/calcite-join-it-analytics-engine-parity branch from 5960ca0 to 41c5052 Compare June 16, 2026 07:19

songkant-aws force-pushed the fix/calcite-join-it-analytics-engine-parity branch from 41c5052 to 00c96c9 Compare June 16, 2026 07:30

LantaoJin approved these changes Jun 16, 2026

View reviewed changes

songkant-aws merged commit 396723b into opensearch-project:main Jun 16, 2026
28 of 33 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bring CalcitePPLJoinIT to parity on the analytics-engine route#5554

Bring CalcitePPLJoinIT to parity on the analytics-engine route#5554
songkant-aws merged 1 commit into
opensearch-project:mainfrom
songkant-aws:fix/calcite-join-it-analytics-engine-parity

songkant-aws commented Jun 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

songkant-aws commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issues

Check List

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

songkant-aws commented Jun 16, 2026 •

edited

Loading