Skip to content

Add regression tests for ordered array_agg after unnest with window functions (issue #20788)#21472

Draft
kosiew wants to merge 8 commits intoapache:mainfrom
kosiew:memory_unnesting-01-20788
Draft

Add regression tests for ordered array_agg after unnest with window functions (issue #20788)#21472
kosiew wants to merge 8 commits intoapache:mainfrom
kosiew:memory_unnesting-01-20788

Conversation

@kosiew
Copy link
Copy Markdown
Contributor

@kosiew kosiew commented Apr 8, 2026

Which issue does this PR close?


Rationale for this change

This PR introduces targeted regression coverage for a query pattern involving unnest, row_number window functions, and array_agg with ORDER BY. This pattern is known to trigger a different execution path in DataFusion compared to unordered array_agg, potentially leading to higher memory usage.

Existing tests only covered the unordered variant of array_agg, which uses the optimized GroupsAccumulator fast path. However, when ORDER BY is introduced inside array_agg, DataFusion switches to an order-sensitive accumulation strategy that does not use this fast path.

This change ensures that:

  • The execution plan shape for ordered array_agg is explicitly captured and validated.
  • The behavioral difference between ordered and unordered aggregation is documented and tested.
  • A stable baseline is established for future optimizations and regression detection.

What changes are included in this PR?

  • Added a new Rust regression test (ordered_array_agg_after_unnest_regression) that:

    • Validates correctness of ordered array_agg after unnest and window functions.

    • Asserts expected physical plan structure, including:

      • UnnestExec
      • BoundedWindowAggExec
      • SortExec
      • AggregateExec in SinglePartitioned mode
    • Verifies that ordered aggregation avoids partial/final aggregation modes.

  • Added a comparison with the unordered variant to highlight execution differences:

    • Confirms use of Partial and FinalPartitioned aggregation for unordered case.
    • Demonstrates optimizer pruning of unused window expressions.
  • Added an EXPLAIN ANALYZE test (ordered_array_agg_after_unnest_explain_analyze_metrics) that:

    • Captures execution metrics such as output rows and memory usage.
    • Establishes a baseline for performance and memory tracking.
  • Extended sqllogictest coverage (unnest.slt) with:

    • A new EXPLAIN test showcasing the logical and physical plan for ordered array_agg.
    • A correctness test validating deterministic ordering within aggregated arrays.
    • Inline documentation explaining why unordered array_agg is insufficient coverage.
  • Introduced helper utilities in tests:

    • ArrayAggOrder enum to toggle ordered vs unordered queries.
    • SQL builder for consistent query generation.
    • Plan assertion helpers to validate execution structure.

Are these changes tested?

Yes.

The PR includes multiple layers of test coverage:

  1. Correctness Tests

    • Validate that ordered array_agg preserves element ordering after unnest.
    • Use deterministic synthetic datasets.
  2. Plan Shape Assertions

    • Ensure the physical plan includes required operators for ordered aggregation.
    • Ensure forbidden operators (e.g., partial aggregation) are not present.
  3. Explain Plan Tests (SQLLogicTest)

    • Capture both logical and physical plans for regression tracking.
  4. Execution Metrics Validation

    • Use EXPLAIN ANALYZE to assert row counts and memory-related metrics.

These tests collectively ensure both functional correctness and execution stability.


Are there any user-facing changes?

No.

This PR only adds test coverage and documentation. There are no changes to public APIs or user-facing behavior.


LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 8 commits April 8, 2026 22:30
Verify query result for unnest -> row_number ->
group by -> array_agg with ordered output. Ensure
physical plan includes UnnestExec, BoundedWindowAggExec,
and ordered array_agg path.
Introduce an ordered EXPLAIN regression case next to the existing
unordered coverage in unnest.slt. Add a comment clarifying why
unordered array_agg does not suffice for issue 20788. Update the
Rust plan assertion in basic.rs to be less brittle by verifying the
aggregate shape instead of relying on the exact alias-qualified
ORDER BY expression text.
Inline single-use dataframe binding and reduce repetition
in assert_contains! calls using a loop. Maintain all
public interfaces unchanged.
Rebuild arrays based on original element position rather than
value order. Update basic.rs and unnest.slt by replacing
synthetic ROW_NUMBER() with a paired unnest for ordinal
generation. Switch input rows to explicit row_idx literals
to eliminate dependency on ROW_NUMBER(). Add a correctness
query to the SLT case in addition to explain coverage.
Reintroduce explicit row ids, unnest, and generated
original element ordinals. Implement ROW_NUMBER()
over partitioned indices for accurate ordering.
Adjust Rust-side plan assertion for improved stability
by focusing on higher-signal operators and sort shape
instead of exact normalized casts.
Extract shared array_agg_after_unnest_sql helper.
Enhance ordered_array_agg tests to compare plan shapes.
Add ordered_array_agg_after_unnest_explain_analyze_metrics
to verify runtime metrics and memory usage.
Adjust unordered test assertions to reflect optimizer behavior
and demonstrate efficiency of unordered aggregation path.
…der handling

- Updated the `array_agg_after_unnest_sql` function to accept an `ArrayAggOrder` enum instead of a boolean to improve readability.
- Enhanced code clarity by explicitly defining ordered and unordered aggregation cases.
- Updated usage and related SQL execution calls in tests for better consistency.
Move ArrayAggOrder SQL fragment selection to enum method.
Extract common functionality for SessionContext, EXPLAIN ANALYZE
wrapping, physical-plan generation, and grouped substring assertions.
Simplify regression tests by removing repeated plan-building and
assertion boilerplate.
@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant