Skip to content

Bring CalcitePPLJoinIT to parity on the analytics-engine route#5554

Merged
songkant-aws merged 1 commit into
opensearch-project:mainfrom
songkant-aws:fix/calcite-join-it-analytics-engine-parity
Jun 16, 2026
Merged

Bring CalcitePPLJoinIT to parity on the analytics-engine route#5554
songkant-aws merged 1 commit into
opensearch-project:mainfrom
songkant-aws:fix/calcite-join-it-analytics-engine-parity

Conversation

@songkant-aws

@songkant-aws songkant-aws commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Description

CalcitePPLJoinIT failed on the analytics-engine route (parquet/composite store + DataFusion backend, -Dtests.analytics.parquet_indices=true) for three distinct reasons, none of which are real query defects. This brings the class to parity without weakening the assertions.

1. Non-idempotent seed inflated / diverged the shared state_country index.
init() runs before every test method (@Before). After loadIndex() it unconditionally PUT _doc/5..8 to grow state_country from 4 to 8 rows. On the analytics-engine route the parquet/composite store is append-only and does not overwrite by _id, so re-running those PUTs every method accumulated duplicate rows and inflated downstream join counts (e.g. expected 6, got 60; self-joins far worse). Make the seed conditional, gated on the same isIndexExist check loadIndex uses — seed exactly when loadIndex (re)creates state_country, so seed and load stay in lockstep regardless of the cluster's per-method index lifecycle.

2. Column ordering.
The analytics-engine route builds its scan schema from the serialized index mapping (getSourceAsMap), which OpenSearch returns in alphabetical field order, whereas the v2/Calcite path preserves declared order. Field-list / implicit-projection joins therefore returned the right rows with columns in a different order. Added explicit | fields ... to pin the projection order for testComplexSemiJoin, testComplexAntiJoin, and testComplexSortPushDownForSMJWithMaxOptionAndFieldList.

3. Row ordering.
The analytics-engine coordinator-reduce (RowProducingSink) appends Arrow batches in arrival order from the SEARCH-threadpool response handlers, so a query without ORDER BY has no guaranteed row order — unlike Calcite's deterministic enumerable execution. Cases:

  • testComplexRightJoin sorts by a column that is null for the right-only rows, leaving their relative order unspecified; switched verifyDataRowsInOrder -> verifyDataRows.
  • testInnerJoinWithRelationSubquery ends in stats ... by with no ORDER BY, so the two output groups come back in a route-dependent order (this test was flaky); switched verifyDataRowsInOrder -> verifyDataRows.
  • The testCheckAccessTheReference* tests compare two alias-syntax variants to each other via assertJsonEquals on serialized datarows, which is order-sensitive. They only mean to assert the two variants return the same set of rows. Added MatcherUtils.assertJsonRowsEqualIgnoreOrder (multiset compare) and used it for those comparisons.

Verification

  • Standard route (integTest, in-memory): 43/43 pass.
  • Analytics-engine route: dropped from 33 failures to only 2 remaining — testJoinComparing and testJoinSubsearchMaxOut. Both are exact = equality on a bare text field (country, no .keyword sub-field), which the DataFusion scan cannot match; this is the separate DYNAMIC_STRING_NO_KEYWORD route limitation and is intentionally left out of this PR.

Related Issues

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@songkant-aws songkant-aws added bugFix infrastructure Changes to infrastructure, testing, CI/CD, pipelines, etc. and removed bugFix labels Jun 16, 2026
@songkant-aws songkant-aws force-pushed the fix/calcite-join-it-analytics-engine-parity branch from 5960ca0 to 41c5052 Compare June 16, 2026 07:19
CalcitePPLJoinIT failed on the analytics-engine route (parquet/composite
store + DataFusion backend, -Dtests.analytics.parquet_indices=true) for three
distinct reasons, none of which are real query defects. This brings the class
to parity without weakening the assertions.

1. Non-idempotent seed inflated/diverged the shared state_country index.
   init() runs before every test method (@before). After loadIndex() it
   unconditionally PUT _doc/5..8 to grow state_country from 4 to 8 rows. On the
   analytics-engine route the parquet/composite store is append-only and does
   not overwrite by _id, so re-running those PUTs every method accumulated
   duplicate rows and inflated downstream join counts (e.g. expected 6, got 60).
   Make the seed conditional, gated on the SAME `isIndexExist` check loadIndex
   uses: seed exactly when loadIndex (re)creates state_country, so seed and load
   stay in lockstep. (An earlier attempt used a static "seed once" latch, but
   that desyncs from the index — when the framework recreates state_country with
   only the 4 fixture rows, a latch already flipped true skips re-seeding and
   leaves the index at 4 rows, which broke the in-memory integTest on
   macOS/Windows CI. Gating on isIndexExist is correct regardless of the
   cluster's per-method index lifecycle.)

2. Column ordering. The analytics-engine route builds its scan schema from the
   serialized index mapping (getSourceAsMap), which OpenSearch returns in
   alphabetical field order, whereas the v2/Calcite path preserves declared
   order. Field-list/implicit-projection joins therefore returned the right
   rows with columns in a different order. Add explicit `| fields ...` to pin
   the projection order for the affected tests (testComplexSemiJoin,
   testComplexAntiJoin, testComplexSortPushDownForSMJWithMaxOptionAndFieldList).

3. Row ordering. The analytics-engine coordinator-reduce (RowProducingSink)
   appends Arrow batches in arrival order from the SEARCH-threadpool response
   handlers, so a query without ORDER BY has no guaranteed row order — unlike
   Calcite's deterministic enumerable execution. Cases:
   - testComplexRightJoin sorts by a column that is null for the right-only
     rows, leaving their relative order unspecified; switch
     verifyDataRowsInOrder -> verifyDataRows.
   - testInnerJoinWithRelationSubquery ends in `stats ... by` with no ORDER BY,
     so the two output groups come back in a route-dependent order (flaky);
     switch verifyDataRowsInOrder -> verifyDataRows.
   - The testCheckAccessTheReference* tests compare two alias-syntax variants to
     each other via assertJsonEquals on serialized datarows, which is
     order-sensitive. They only mean to assert the two variants return the same
     set of rows. Add MatcherUtils.assertJsonRowsEqualIgnoreOrder (multiset
     compare) and use it for those comparisons.

Verified: standard route (integTest, in-memory) 43/43 pass; analytics-engine
route drops from 33 failures to only the two remaining exact-equality-on-bare-
text cases (testJoinComparing, testJoinSubsearchMaxOut), a separate route
limitation (DYNAMIC_STRING_NO_KEYWORD) tracked elsewhere.

Signed-off-by: Songkan Tang <songkant@amazon.com>
@songkant-aws songkant-aws force-pushed the fix/calcite-join-it-analytics-engine-parity branch from 41c5052 to 00c96c9 Compare June 16, 2026 07:30
@songkant-aws songkant-aws merged commit 396723b into opensearch-project:main Jun 16, 2026
28 of 33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

infrastructure Changes to infrastructure, testing, CI/CD, pipelines, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants