
Support Dictionary Arrays in MIN/MAX Aggregates and Stabilize PG JSON Output Ordering #21315

Open
kosiew wants to merge 6 commits into apache:main from kosiew:dictionary-coercion-21150


@kosiew kosiew commented Apr 2, 2026

Which issue does this PR close?


Rationale for this change

The current implementation of MIN/MAX aggregate functions in DataFusion does not properly support dictionary-encoded arrays. This limitation prevents efficient execution on dictionary-backed data and may require unnecessary materialization or coercion.

This PR enables native handling of dictionary-encoded values by operating directly on their underlying scalar representations. This avoids flattening dictionaries into dense arrays, preserving performance and memory efficiency.


What changes are included in this PR?

  • Added support for dictionary scalar comparison in the `min_max!` macro

  • Introduced a `dictionary_scalar_parts` helper to unwrap dictionary scalars safely

  • Ensured dictionary key types are preserved when returning results

  • Replaced specialized complex-type handling with a unified `scalar_batch_extreme` implementation

  • Added an `is_row_wise_batch_type` helper to generalize handling of Struct, List, and Dictionary types

  • Simplified `min_batch` and `max_batch` logic by removing `min_max_batch_generic`

  • Implemented consistent behavior for null handling and comparisons

  • Added extensive test coverage for:

    • Dictionary arrays without coercion
    • Null handling in dictionary arrays
    • Ignoring unreferenced dictionary values
    • Multi-batch dictionary aggregation
    • Float dictionary arrays including NaN handling
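
The unified row-wise scan described in the list above can be pictured with a small, self-contained sketch. It is not the PR's code: `row_wise_extreme` is a hypothetical stand-in for `scalar_batch_extreme`, and plain `Option<T>` rows stand in for DataFusion scalars. It only illustrates the described behavior of skipping nulls and seeding the result from the first non-null row.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for a row-wise extreme scan (not DataFusion's API).
/// `keep_if` is the comparison outcome that makes a new row replace the
/// current winner: `Ordering::Less` for MIN, `Ordering::Greater` for MAX.
fn row_wise_extreme<T, I>(rows: I, keep_if: Ordering) -> Option<T>
where
    T: PartialOrd,
    I: IntoIterator<Item = Option<T>>,
{
    let mut best: Option<T> = None;
    // `flatten` drops null rows, so only non-null values are compared.
    for value in rows.into_iter().flatten() {
        best = match best {
            // The first non-null row seeds the result.
            None => Some(value),
            Some(current) => {
                if value.partial_cmp(&current) == Some(keep_if) {
                    Some(value)
                } else {
                    Some(current)
                }
            }
        };
    }
    best
}
```

Under these assumptions, a batch with no non-null rows yields `None`, which corresponds to a null aggregate result.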

Are these changes tested?

Yes. This PR introduces comprehensive unit tests covering dictionary-encoded arrays across multiple scenarios, including null values, multi-batch aggregation, and floating-point edge cases (e.g., NaN and infinity).

These tests ensure correctness, prevent regressions, and document expected behavior.
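
The description does not spell out which ordering the float tests expect. As an illustration only, the sketch below assumes the IEEE 754 total order exposed by Rust's `f64::total_cmp`, under which a positive NaN sorts above `+inf` and `-inf` sorts below every finite value. The helper name `float_min_max` is hypothetical.

```rust
/// Hypothetical helper: MIN and MAX over nullable f64 rows, compared with
/// `f64::total_cmp`. The choice of total ordering is an assumption of this
/// sketch and is not confirmed by the PR text.
fn float_min_max(rows: &[Option<f64>]) -> Option<(f64, f64)> {
    // Skip nulls, then seed both extremes from the first non-null row.
    let mut non_null = rows.iter().copied().flatten();
    let first = non_null.next()?;
    Some(non_null.fold((first, first), |(min, max), v| {
        let min = if v.total_cmp(&min).is_lt() { v } else { min };
        let max = if v.total_cmp(&max).is_gt() { v } else { max };
        (min, max)
    }))
}
```

With a partial order (`partial_cmp`) instead, comparisons against NaN return `None`, so the result would depend on where the NaN appears in the batch. That is one reason a total order is a common choice for aggregates.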


Are there any user-facing changes?

Yes.

  • MIN and MAX now support dictionary-encoded arrays natively
  • Results preserve dictionary semantics where applicable
  • Improved performance by avoiding unnecessary dictionary flattening

No breaking API changes are introduced.


LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 4 commits April 2, 2026 16:32
Return ScalarValue::Dictionary(...) in dictionary batches instead of
unwrapping to inner scalars. Enhance min_max! logic to safely handle
dictionary-vs-dictionary and dictionary-vs-non-dictionary comparisons.
Add regression tests for raw-dictionary covering no-coercion,
null-containing, and multi-batch scenarios.
Centralize dictionary batch handling for min/max operations.
Streamline min_max_batch_generic to initialize from the first
non-null element. Implement shared setup/assert helpers in
dictionary tests to reduce repetition while preserving test
coverage.
Refactor dictionary min/max flow by removing the wrap macro arm,
making re-wrapping explicit through a private helper. This
separates the "choose inner winner" from the "wrap as
dictionary" step for easier auditing.
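
The two-step flow in this commit message (choose the inner winner, then wrap it as a dictionary) can be sketched as follows. Everything here is a simplified assumption: `Scalar` is a toy model of a scalar value with only two variants, the key type is a plain string, and `dictionary_parts`, `min_inner`, and `dictionary_min` are hypothetical names rather than the PR's helpers.

```rust
/// Simplified, hypothetical scalar model (the real type has many variants).
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Utf8(Option<String>),
    /// Key type name plus the wrapped inner value.
    Dictionary(&'static str, Box<Scalar>),
}

/// Split a scalar into (optional dictionary key type, inner value).
fn dictionary_parts(s: &Scalar) -> (Option<&'static str>, &Scalar) {
    match s {
        Scalar::Dictionary(key_type, inner) => (Some(*key_type), &**inner),
        other => (None, other),
    }
}

/// Step 1: choose the inner winner, treating a null as "no value".
fn min_inner(a: &Scalar, b: &Scalar) -> Scalar {
    match (a, b) {
        (Scalar::Utf8(Some(x)), Scalar::Utf8(Some(y))) => {
            Scalar::Utf8(Some(std::cmp::min(x, y).clone()))
        }
        (Scalar::Utf8(None), other) | (other, Scalar::Utf8(None)) => other.clone(),
        // Nested dictionaries are out of scope for this sketch.
        _ => a.clone(),
    }
}

/// Step 2: re-wrap explicitly so the dictionary key type is preserved.
fn dictionary_min(a: &Scalar, b: &Scalar) -> Scalar {
    let (a_key, a_inner) = dictionary_parts(a);
    let (b_key, b_inner) = dictionary_parts(b);
    let winner = min_inner(a_inner, b_inner);
    match a_key.or(b_key) {
        Some(key_type) => Scalar::Dictionary(key_type, Box::new(winner)),
        None => winner,
    }
}
```

Keeping the wrap step separate means the comparison logic never needs to know whether its inputs came from a dictionary, which covers both the dictionary-vs-dictionary and dictionary-vs-non-dictionary cases in one path.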

In `datafusion/functions-aggregate/src/min_max.rs`, update
`string_dictionary_batch` to accept slices instead of owned
Vecs, and introduce a small `evaluate_dictionary_accumulator`
helper to streamline min/max assertions with a shared
accumulator execution path, reducing repeated setup.
Update min_max.rs to ensure dictionary batches iterate
actual array rows, comparing referenced scalar values.
Unreferenced dictionary entries no longer affect MIN/MAX,
and referenced null values are correctly skipped.
Expanded tests to cover these changes and updated
expectations.
Added regression tests for unreferenced and referenced
null dictionary values.
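
To make the row-iteration behavior concrete, here is a minimal sketch using a toy dictionary layout. `DictArray` is a hypothetical model, not Arrow's `DictionaryArray`, and keys are assumed to be valid indexes into `values`. It shows why an unreferenced entry cannot win and why a key that points at a null value is skipped.

```rust
/// Hypothetical, simplified dictionary array: each row holds an optional key
/// that indexes into `values`; a value slot may itself be null.
struct DictArray<'a> {
    keys: Vec<Option<usize>>,
    values: Vec<Option<&'a str>>,
}

impl<'a> DictArray<'a> {
    /// Iterate the logical rows, resolving each key to its value.
    /// A null key and a key pointing at a null value both yield `None`.
    /// Assumes every non-null key is in bounds; otherwise indexing panics.
    fn rows(&self) -> impl Iterator<Item = Option<&'a str>> + '_ {
        self.keys
            .iter()
            .map(move |&key| key.and_then(|k| self.values[k]))
    }

    /// MIN over referenced, non-null values only.
    fn min(&self) -> Option<&'a str> {
        self.rows().flatten().min()
    }

    /// MAX over referenced, non-null values only.
    fn max(&self) -> Option<&'a str> {
        self.rows().flatten().max()
    }
}
```

For example, with values `["zebra", "apple", null, "mango"]` and keys `[0, null, 3, 2, 0]`, the entry `"apple"` is never referenced, so the minimum is `"mango"` rather than `"apple"`.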
@github-actions github-actions bot added the core (Core DataFusion crate), functions (Changes to functions implementation), and logical-expr (Logical plan and expressions) labels Apr 2, 2026
@kosiew kosiew force-pushed the dictionary-coercion-21150 branch from 08a20c4 to 5002677 April 2, 2026 13:46
@github-actions github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) label and removed the logical-expr (Logical plan and expressions) label Apr 2, 2026
@kosiew kosiew changed the title from "Support dictionary-encoded arrays in MIN/MAX aggregates and preserve dictionary types" to "Support Dictionary Arrays in MIN/MAX Aggregates with Direct Scalar Coercion" Apr 2, 2026
@kosiew kosiew force-pushed the dictionary-coercion-21150 branch from f2330b9 to b4938c1 April 2, 2026 14:37
@github-actions github-actions bot added the logical-expr (Logical plan and expressions) label and removed the sqllogictest (SQL Logic Tests (.slt)) label Apr 2, 2026
@kosiew kosiew changed the title from "Support Dictionary Arrays in MIN/MAX Aggregates with Direct Scalar Coercion" to "Support Dictionary Arrays in MIN/MAX Aggregates and Stabilize PG JSON Output Ordering" Apr 2, 2026
@github-actions github-actions bot removed the logical-expr (Logical plan and expressions) label Apr 3, 2026
kosiew added 2 commits April 7, 2026 11:47
Consolidate row-wise min/max scan logic into a single helper
in min_max.rs to ensure consistency between dictionary
and generic complex-type paths. Add regression test for
the float dictionary handling NaN and -inf cases,
validating ordering semantics across batches.
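
How a running result carries across batches can be sketched with a toy accumulator. This is an assumption-laden illustration: `MinAccumulator` is hypothetical, its `update_batch` and `evaluate` methods are simplified and do not mirror DataFusion's real accumulator signatures, and strings are used instead of floats for brevity.

```rust
/// Hypothetical accumulator that keeps a running MIN across batches.
#[derive(Default)]
struct MinAccumulator {
    current: Option<String>,
}

impl MinAccumulator {
    /// Fold one batch of nullable rows into the running state.
    fn update_batch(&mut self, batch: &[Option<&str>]) {
        // Per-batch minimum over non-null rows; `None` if the batch is all null.
        let batch_min = batch.iter().copied().flatten().min();
        if let Some(candidate) = batch_min {
            let replace = match &self.current {
                // No state yet: the first non-null batch result seeds it.
                None => true,
                Some(current) => candidate < current.as_str(),
            };
            if replace {
                self.current = Some(candidate.to_string());
            }
        }
    }

    /// Current result; `None` until a non-null row has been seen.
    fn evaluate(&self) -> Option<&str> {
        self.current.as_deref()
    }
}
```

An all-null batch leaves the state untouched, so the order in which batches arrive does not change the final result.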
Remove the no-op dictionary macro and single-use wrapper.
Collapse dictionary handling into a normalized path and seed
scalar_batch_extreme from the first non-null value.
Unify row-wise batch dispatch behind a shared predicate.
Apply formatting adjustments in min_max.rs as per cargo fmt.
@kosiew kosiew force-pushed the dictionary-coercion-21150 branch from 0e7a98d to dad6e02 April 7, 2026 03:56
@github-actions github-actions bot removed the core (Core DataFusion crate) label Apr 7, 2026
@kosiew kosiew marked this pull request as ready for review April 7, 2026 07:23

Labels

functions Changes to functions implementation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dictionary coercion on min/max

2 participants