Stabilize order-dependent PPL ITs with explicit sort for multi-shard analytics runs#5537
Merged
ahkcs merged 1 commit intoJun 10, 2026
Conversation
…analytics runs
The analytics engine merges per-shard Arrow batches in arrival order (no
shard-ordinal tiebreaker), so any IT whose asserted rows depend on document
encounter order fails when test indices have more than one primary shard.
Add an explicit sort on a unique key to the affected queries so results are
deterministic on every route; where the sorted result differs from insertion
order (accounts.json is not account_number-ascending), update expectations
to the sorted rows.
Also add the tests.analytics.num_shards system property (default 1, wired
through integTestRemote and TestUtils.AnalyticsIndexConfig) so multi-shard
coverage runs can be reproduced:
./gradlew :integ-test:integTestRemote \
-Dtests.rest.cluster=localhost:9200 -Dtests.cluster=localhost:9300 \
-Dtests.clustername=runTask \
-Dtests.analytics.parquet_indices=true -Dtests.analytics.num_shards=3
Verified on three routes: 3-shard analytics, 1-shard analytics, and the
regular Calcite route (self-managed cluster, all green).
Signed-off-by: Kai Huang <ahkcs@amazon.com>
Contributor
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
Contributor
PR Code Suggestions ✨Explore these optional code suggestions:
|
Swiddis
approved these changes
Jun 10, 2026
vinaykpud
approved these changes
Jun 10, 2026
RyanL1997
approved these changes
Jun 10, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
When PPL integration-test indices are created with more than one primary shard on the analytics-engine route, the engine merges per-shard Arrow batches in arrival order (
RowProducingSink.feed()has no shard-ordinal tiebreaker, unlike the Lucene path'sTopDocs.merge()shardIndex ordering). Any test whose asserted rows depend on document encounter order —headwithoutsort, first-row assertions afterpatterns/grok/eval,trendlineover unsorted input, result-set truncation — fails on a 3-shard run even though it passes on 1 shard.This PR stabilizes every such test that can be fixed by adding an explicit
| sort <unique key>to the query:| sortplaced before anyfieldsprojection that drops the sort key (verified deterministic through the projection on the multi-shard route).It also adds a
tests.analytics.num_shardssystem property (default1, so existing runs are unchanged) wired throughintegTestRemoteandTestUtils.AnalyticsIndexConfig, to make multi-shard coverage runs reproducible:Results (suite failures on the affected classes)
"+1" = same-family tests that passed the baseline run only by lucky arrival order and were fixed here too. All remaining failures pre-exist this PR (most also fail on 1 shard) or are multi-shard engine gaps that cannot be fixed test-side:
stats first()/last()/take()stay non-deterministic even with a precedingsort— per-shard partial accumulators merge in arrival order at the coordinator. The existingOrdinalAppendingSinkshard ordinals are not used by the gather stage; an engine-side ordinal merge would make those tests pass unmodified.streamstatscomputes windows per shard fragment even with a globalsortpreceding it.| multisearch [where M] [where F] | stats count()returns 2×M on the analytics route).derive_schema_from_partial_planprotobuf decode failure, Substrait Float64/Int64 schema mismatch foreval unix_timestamp(...) | stats ... by,TRY_CAST List(Struct) → Utf8for BRAIN patterns.query_stringintermittently drops one row across shards (16→15).These are tracked separately for upstream engine fixes.
Related Issues
Part of the analytics-engine PPL compatibility effort.
Check List
--signoffor-s.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.