Bring CalcitePPLJoinIT to parity on the analytics-engine route
CalcitePPLJoinIT failed on the analytics-engine route (parquet/composite
store + DataFusion backend, -Dtests.analytics.parquet_indices=true) for three
distinct reasons, none of which are real query defects. This brings the class
to parity without weakening the assertions.
1. Shared-index pollution from a non-idempotent seed (the big one).
init() runs before every test method (@Before) and, with
preserveClusterUponCompletion()=true, the state_country index is created
once and reused across all methods. init() unconditionally PUT _doc/5..8
after loadIndex(). On the standard route those PUTs overwrite by _id and are
harmless to repeat; on the analytics-engine route the parquet/composite
store is append-only and does not overwrite by _id, so every method's init()
appended 4 more duplicate rows. The shared index grew unboundedly and joins
over it inflated (e.g. expected 6, got 60; self-joins far worse). Guard the
seed with a static flag so it runs once per class load — the standard route
ends at the same stable 8-row state_country it always did.
2. Column ordering. The analytics-engine route builds its scan schema from the
serialized index mapping (getSourceAsMap), which OpenSearch returns in
alphabetical field order, whereas the v2/Calcite path preserves declared
order. Field-list/implicit-projection joins therefore returned the right
rows with columns in a different order. Add explicit `| fields ...` to pin
the projection order for the affected tests (testComplexSemiJoin,
testComplexAntiJoin, testComplexSortPushDownForSMJWithMaxOptionAndFieldList).
3. Row ordering. The analytics-engine coordinator-reduce (RowProducingSink)
appends Arrow batches in arrival order from the SEARCH-threadpool response
handlers, so a query without ORDER BY has no guaranteed row order — unlike
Calcite's deterministic enumerable execution. Cases:
- testComplexRightJoin sorts by a column that is null for the right-only
rows, leaving their relative order unspecified; switch
verifyDataRowsInOrder -> verifyDataRows.
- testInnerJoinWithRelationSubquery ends in `stats ... by` with no ORDER BY,
so the two output groups come back in a route-dependent order (flaky);
switch verifyDataRowsInOrder -> verifyDataRows.
- The testCheckAccessTheReference* tests compare two alias-syntax variants to
each other via assertJsonEquals on serialized datarows, which is
order-sensitive. They only mean to assert the two variants return the same
set of rows. Add MatcherUtils.assertJsonRowsEqualIgnoreOrder (multiset
compare) and use it for those comparisons.
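A minimal sketch of the once-per-class seed guard from item 1, not the actual
test code. SeedOnceSketch, appendOnlyIndex and the doc-N row values are
hypothetical stand-ins: the list models a store that appends on every write
instead of overwriting by _id, and init() plays the role of the per-method
setup hook.

```java
import java.util.ArrayList;
import java.util.List;

public class SeedOnceSketch {
    // Hypothetical stand-in for an append-only store: every write adds a row,
    // and nothing is overwritten by _id.
    static final List<String> appendOnlyIndex = new ArrayList<>();

    // Static flag: shared by all test-method instances within one class load.
    private static boolean seeded = false;

    // Plays the role of the per-method init() hook.
    static void init() {
        if (seeded) {
            return; // later test methods reuse the rows seeded by the first one
        }
        for (int id = 5; id <= 8; id++) {
            appendOnlyIndex.add("doc-" + id); // models PUT _doc/5..8
        }
        seeded = true;
    }

    public static void main(String[] args) {
        // Simulate ten test methods, each calling init() first.
        for (int method = 0; method < 10; method++) {
            init();
        }
        // The seed rows were appended once, not once per method.
        System.out.println(appendOnlyIndex.size());
    }
}
```

Without the flag, the same loop would leave 40 rows in the list, which is the
kind of growth described in item 1.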
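The order-insensitive comparison from item 3 can be illustrated with a plain
multiset compare. This is a sketch of the idea only, under the assumption that
each datarow is treated as a list of values; it is not the implementation of
MatcherUtils.assertJsonRowsEqualIgnoreOrder, and RowMultisetSketch, row and
sameRowsIgnoringOrder are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RowMultisetSketch {
    // Convenience helper to build one row; null values are allowed.
    static List<Object> row(Object... values) {
        return Arrays.asList(values);
    }

    // Count how many times each distinct row occurs.
    static Map<List<Object>, Integer> toMultiset(List<List<Object>> rows) {
        Map<List<Object>, Integer> counts = new HashMap<>();
        for (List<Object> r : rows) {
            counts.merge(r, 1, Integer::sum);
        }
        return counts;
    }

    // True when both results hold the same rows with the same multiplicities,
    // regardless of the order in which the rows arrived.
    static boolean sameRowsIgnoringOrder(List<List<Object>> a, List<List<Object>> b) {
        return toMultiset(a).equals(toMultiset(b));
    }

    public static void main(String[] args) {
        List<List<Object>> first = Arrays.asList(row("USA", 30), row("Canada", null), row("USA", 30));
        List<List<Object>> reordered = Arrays.asList(row("Canada", null), row("USA", 30), row("USA", 30));
        List<List<Object>> missingDuplicate = Arrays.asList(row("Canada", null), row("USA", 30));

        System.out.println(sameRowsIgnoringOrder(first, reordered));        // true
        System.out.println(sameRowsIgnoringOrder(first, missingDuplicate)); // false
    }
}
```

Comparing counts rather than sets matters here: a set compare would accept a
result that drops or adds duplicate rows, which is exactly the symptom the
append-only seed produced.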
Verified: standard route 43/43 pass (no regression); analytics-engine route
drops from 33 failures to only the two remaining exact-equality-on-bare-text
cases (testJoinComparing, testJoinSubsearchMaxOut), a separate route limitation
(DYNAMIC_STRING_NO_KEYWORD) tracked elsewhere.
Signed-off-by: Songkan Tang <songkant@amazon.com>