Support Dictionary Arrays in MIN/MAX Aggregates and Stabilize PG JSON Output Ordering#21315
Open
kosiew wants to merge 6 commits intoapache:mainfrom
Open
Support Dictionary Arrays in MIN/MAX Aggregates and Stabilize PG JSON Output Ordering#21315kosiew wants to merge 6 commits intoapache:mainfrom
kosiew wants to merge 6 commits intoapache:mainfrom
Conversation
Return ScalarValue::Dictionary(...) in dictionary batches instead of unwrapping to inner scalars. Enhance min_max! logic to safely handle dictionary-vs-dictionary and dictionary-vs-non-dictionary comparisons. Add regression tests for raw-dictionary covering no-coercion, null-containing, and multi-batch scenarios.
Centralize dictionary batch handling for min/max operations. Streamline min_max_batch_generic to initialize from the first non-null element. Implement shared setup/assert helpers in dictionary tests to reduce repetition while preserving test coverage.
Refactor dictionary min/max flow by removing the wrap macro arm, making re-wrapping explicit through a private helper. This separates the "choose inner winner" from the "wrap as dictionary" step for easier auditing. In `datafusion/functions-aggregate/src/min_max.rs`, update `string_dictionary_batch` to accept slices instead of owned Vecs, and introduce a small `evaluate_dictionary_accumulator` helper to streamline min/max assertions with a shared accumulator execution path, reducing repeated setup.
Update min_max.rs to ensure dictionary batches iterate actual array rows, comparing referenced scalar values. Unreferenced dictionary entries no longer affect MIN/MAX, and referenced null values are correctly skipped. Expanded tests to cover these changes and updated expectations Added regression tests for unreferenced and referenced null dictionary values.
08a20c4 to
5002677
Compare
f2330b9 to
b4938c1
Compare
Consolidate row-wise min/max scan logic into a single helper in min_max.rs to ensure consistency between dictionary and generic complex-type paths. Add regression test for the float dictionary handling NaN and -inf cases, validating ordering semantics across batches.
Remove the no-op dictionary macro and single-use wrapper. Collapse dictionary handling into a normalized path and seed scalar_batch_extreme from the first non-null value. Unify row-wise batch dispatch behind a shared predicate. Apply formatting adjustments in min_max.rs as per cargo fmt.
0e7a98d to
dad6e02
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
The current implementation of
MIN/MAXaggregate functions in DataFusion does not properly support dictionary-encoded arrays. This limitation prevents efficient execution on dictionary-backed data and may require unnecessary materialization or coercion.This PR enables native handling of dictionary-encoded values by operating directly on their underlying scalar representations. This avoids flattening dictionaries into dense arrays, preserving performance and memory efficiency.
What changes are included in this PR?
Added support for dictionary scalar comparison in
min_max!macroIntroduced
dictionary_scalar_partshelper to unwrap dictionary scalars safelyEnsured dictionary key types are preserved when returning results
Replaced specialized complex-type handling with a unified
scalar_batch_extremeimplementationAdded
is_row_wise_batch_typehelper to generalize handling of Struct, List, and Dictionary typesSimplified
min_batchandmax_batchlogic by removingmin_max_batch_genericImplemented consistent behavior for null handling and comparisons
Added extensive test coverage for:
Are these changes tested?
Yes. This PR introduces comprehensive unit tests covering dictionary-encoded arrays across multiple scenarios, including null values, multi-batch aggregation, and floating-point edge cases (e.g., NaN and infinity).
These tests ensure correctness, prevent regressions, and document expected behavior.
Are there any user-facing changes?
Yes.
MINandMAXnow support dictionary-encoded arrays nativelyNo breaking API changes are introduced.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.