Bring CalcitePPLJoinIT to parity on the analytics-engine route (#5554)
CalcitePPLJoinIT failed on the analytics-engine route (parquet/composite
store + DataFusion backend, -Dtests.analytics.parquet_indices=true) for three
distinct reasons, none of which are real query defects. This brings the class
to parity without weakening the assertions.
1. Non-idempotent seed inflated/diverged the shared state_country index.
init() runs before every test method (@Before). After loadIndex() it
unconditionally PUT _doc/5..8 to grow state_country from 4 to 8 rows. On the
analytics-engine route the parquet/composite store is append-only and does
not overwrite by _id, so re-running those PUTs every method accumulated
duplicate rows and inflated downstream join counts (e.g. expected 6, got 60).
Make the seed conditional, gated on the SAME `isIndexExist` check loadIndex
uses: seed exactly when loadIndex (re)creates state_country, so seed and load
stay in lockstep. (An earlier attempt used a static "seed once" latch, but
that desyncs from the index — when the framework recreates state_country with
only the 4 fixture rows, a latch already flipped true skips re-seeding and
leaves the index at 4 rows, which broke the in-memory integTest on
macOS/Windows CI. Gating on isIndexExist is correct regardless of the
cluster's per-method index lifecycle.)
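
A minimal sketch of the gating idea in item 1, assuming an in-memory stand-in for the append-only store. SeedGateSketch, STORE, putDoc and rowCount are hypothetical names, and the isIndexExist here is a simplified analogue, not the framework's helper or its signature:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: an in-memory stand-in for an append-only store, not the real test framework.
final class SeedGateSketch {
    // index name -> rows; a PUT never overwrites by _id, it only appends.
    static final Map<String, List<String>> STORE = new HashMap<>();

    // Hypothetical analogue of the existence check that loadIndex uses.
    static boolean isIndexExist(String index) {
        return STORE.containsKey(index);
    }

    // Append-only PUT: sending the same _id twice yields two rows.
    static void putDoc(String index, String id) {
        STORE.get(index).add(id);
    }

    // Per-method setup: load the 4 fixture rows and seed _doc/5..8
    // only when the index is (re)created.
    static void init(String index) {
        if (isIndexExist(index)) {
            return; // already loaded and seeded; repeating the PUTs would duplicate rows
        }
        STORE.put(index, new ArrayList<>(List.of("1", "2", "3", "4")));
        for (int id = 5; id <= 8; id++) {
            putDoc(index, String.valueOf(id));
        }
    }

    static int rowCount(String index) {
        return STORE.get(index).size();
    }

    public static void main(String[] args) {
        init("state_country");
        init("state_country"); // second test method: no-op
        System.out.println(rowCount("state_country")); // prints 8
        STORE.remove("state_country"); // framework drops the index
        init("state_country"); // recreated, so it is seeded again
        System.out.println(rowCount("state_country")); // prints 8
    }
}
```

A static latch would skip the second seeding after the remove() above and leave 4 rows, which is the desync described in the parenthetical.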
2. Column ordering. The analytics-engine route builds its scan schema from the
serialized index mapping (getSourceAsMap), which OpenSearch returns in
alphabetical field order, whereas the v2/Calcite path preserves declared
order. Field-list/implicit-projection joins therefore returned the right
rows with columns in a different order. Add explicit `| fields ...` to pin
the projection order for the affected tests (testComplexSemiJoin,
testComplexAntiJoin, testComplexSortPushDownForSMJWithMaxOptionAndFieldList).
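
To illustrate item 2, a small sketch of why an explicit field list removes the dependence on mapping order. A LinkedHashMap stands in for declared order and a TreeMap for the alphabetical order of the serialized mapping. The class, the helper names and the field names are assumptions for illustration, not the plugin's code:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: models schema column order, not the real scan-schema builder.
final class ProjectionOrderSketch {
    // Implicit projection: columns come out in whatever order the mapping iterates.
    static List<String> implicitColumns(Map<String, String> mapping) {
        return new ArrayList<>(mapping.keySet());
    }

    // Explicit projection (the effect of `| fields a, b, c`): the requested order wins.
    static List<String> project(Map<String, String> mapping, List<String> fields) {
        List<String> out = new ArrayList<>();
        for (String field : fields) {
            if (mapping.containsKey(field)) {
                out.add(field);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> declared = new LinkedHashMap<>();
        declared.put("name", "keyword");
        declared.put("age", "integer");
        declared.put("state", "keyword");
        Map<String, String> alphabetical = new TreeMap<>(declared);

        System.out.println(implicitColumns(declared));     // prints [name, age, state]
        System.out.println(implicitColumns(alphabetical)); // prints [age, name, state]

        List<String> fields = List.of("name", "state", "age");
        System.out.println(project(declared, fields));     // prints [name, state, age]
        System.out.println(project(alphabetical, fields)); // prints [name, state, age]
    }
}
```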
3. Row ordering. The analytics-engine coordinator-reduce (RowProducingSink)
appends Arrow batches in arrival order from the SEARCH-threadpool response
handlers, so a query without ORDER BY has no guaranteed row order — unlike
Calcite's deterministic enumerable execution. Cases:
- testComplexRightJoin sorts by a column that is null for the right-only
rows, leaving their relative order unspecified; switch
verifyDataRowsInOrder -> verifyDataRows.
- testInnerJoinWithRelationSubquery ends in `stats ... by` with no ORDER BY,
so the two output groups come back in a route-dependent order (flaky);
switch verifyDataRowsInOrder -> verifyDataRows.
- The testCheckAccessTheReference* tests compare two alias-syntax variants to
each other via assertJsonEquals on serialized datarows, which is
order-sensitive. They only mean to assert the two variants return the same
set of rows. Add MatcherUtils.assertJsonRowsEqualIgnoreOrder (multiset
compare) and use it for those comparisons.
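
The multiset comparison in item 3 can be sketched roughly as follows. This is not the actual MatcherUtils.assertJsonRowsEqualIgnoreOrder implementation: it assumes rows have already been parsed into lists of values, and RowMultisetSketch and its methods are hypothetical names:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: order-insensitive row comparison that still counts duplicates.
final class RowMultisetSketch {
    // Count how many times each row occurs; List equality is element-wise.
    static Map<List<?>, Integer> toMultiset(List<? extends List<?>> rows) {
        Map<List<?>, Integer> counts = new HashMap<>();
        for (List<?> row : rows) {
            counts.merge(row, 1, Integer::sum);
        }
        return counts;
    }

    // True when both results hold the same rows with the same multiplicities.
    static boolean rowsEqualIgnoreOrder(List<? extends List<?>> a, List<? extends List<?>> b) {
        return toMultiset(a).equals(toMultiset(b));
    }

    public static void main(String[] args) {
        List<List<Object>> first = List.of(List.of("Jake", 70), List.of("Hello", 30));
        List<List<Object>> second = List.of(List.of("Hello", 30), List.of("Jake", 70));
        List<List<Object>> duplicated =
            List.of(List.of("Jake", 70), List.of("Jake", 70), List.of("Hello", 30));

        System.out.println(rowsEqualIgnoreOrder(first, second));     // prints true
        System.out.println(rowsEqualIgnoreOrder(first, duplicated)); // prints false
    }
}
```

Comparing as a multiset rather than a set keeps the check sensitive to duplicate rows, so an inflated join result would still fail.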
Verified: standard route (integTest, in-memory) 43/43 pass; analytics-engine
route drops from 33 failures to only the two remaining exact-equality-on-bare-
text cases (testJoinComparing, testJoinSubsearchMaxOut), a separate route
limitation (DYNAMIC_STRING_NO_KEYWORD) tracked elsewhere.
Signed-off-by: Songkan Tang <songkant@amazon.com>