[SPARK-57858][SQL] Emit BIN BY scaled DISTRIBUTE columns as produced attributes #56930
Open
vranes wants to merge 1 commit into
What changes were proposed in this pull request?

The BIN BY relation operator (SPARK-57133) proportionally rescales its DISTRIBUTE UNIFORM columns. The logical BinBy node carried those columns through child.output with the child's own ExprId, even though execution rewrites their values.

This PR makes the rescaled DISTRIBUTE columns produced attributes with fresh ExprIds (same names, types, nullability, and positions), shadowing the inputs, mirroring Generate.generatorOutput:

- BinBy gains a scaledDistributeColumns field; output swaps each DISTRIBUTE input slot for its scaled counterpart in place, and producedAttributes includes them.
- The input distributeColumns stay on the node as the executor's read inputs but leave output.
- BinBy.scaledDistributeAttributes mints the fresh attributes (qualifier and metadata dropped, matching expr AS value computed-value semantics).
- ResolveBinBy mints them; DeduplicateRelations renews them in both phases so self-joins over a shared BinBy subtree resolve.

Why are the changes needed?
Catalyst relies on the invariant that the same ExprId everywhere implies the same value. No other operator edits a value under a retained child attribute (Generate/Window/Expand/Aggregate all mint fresh ids for changed columns). Carrying the rescaled DISTRIBUTE column under the child's ExprId violated that: any rule reasoning on ExprId (predicate pushdown, constraint propagation, common-subexpression elimination) could read the pre-scale value. It is harmless today only because no such rule lists BinBy, but that safety is incidental, not designed. Minting fresh identities restores the invariant and lets a filter or sort on a DISTRIBUTE column bind to the scaled output.

Does this PR introduce any user-facing change?
No. BIN BY is gated off by default (SPARK-57440) and its physical execution is still stubbed, so the operator is not usable end-to-end yet; this is an internal analyzer / plan-shape change. The output schema (column names, types, positions) is unchanged.

How was this patch tested?
ResolveBinBySuite (20 tests), including new cases verifying that the rescaled DISTRIBUTE columns are produced attributes shadowing the input, that multiple DISTRIBUTE columns are each replaced in place with distinct fresh ids, and that qualifier and metadata are dropped on the produced column; plus the existing self-join deduplication regression test.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic)
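
Illustrative sketch (not taken from the patch): the in-place output swap described under "What changes were proposed" can be modelled without Spark roughly as follows. ExprId, Attr, freshId, scaledAttributes, and output below are simplified, hypothetical stand-ins for Catalyst's ExprId, AttributeReference, and the BinBy members named above; the real signatures in the patch may well differ.

```scala
import java.util.concurrent.atomic.AtomicLong

// Toy model only: these types imitate Catalyst concepts but are NOT Spark classes.
object BinByOutputSketch {
  final case class ExprId(id: Long)

  final case class Attr(
      name: String,
      dataType: String,
      nullable: Boolean,
      exprId: ExprId,
      qualifier: Seq[String] = Nil,
      metadata: Map[String, String] = Map.empty)

  private val nextId = new AtomicLong(0L)

  // Hypothetical id generator; each call returns a previously unused id.
  def freshId(): ExprId = ExprId(nextId.incrementAndGet())

  // Assumed analogue of BinBy.scaledDistributeAttributes: keep name, type and
  // nullability, mint a fresh id, and drop qualifier and metadata.
  def scaledAttributes(distribute: Seq[Attr]): Seq[Attr] =
    distribute.map(a => a.copy(exprId = freshId(), qualifier = Nil, metadata = Map.empty))

  // Assumed analogue of BinBy.output: every child slot holding a DISTRIBUTE
  // input is replaced by its scaled counterpart; all other slots pass through.
  def output(childOutput: Seq[Attr], distribute: Seq[Attr], scaled: Seq[Attr]): Seq[Attr] = {
    require(distribute.length == scaled.length, "one scaled attribute per DISTRIBUTE input")
    val scaledByInputId: Map[ExprId, Attr] = distribute.map(_.exprId).zip(scaled).toMap
    childOutput.map(a => scaledByInputId.getOrElse(a.exprId, a))
  }

  def main(args: Array[String]): Unit = {
    val ts = Attr("ts", "timestamp", false, freshId(), Seq("t"))
    val amount = Attr("amount", "double", true, freshId(), Seq("t"))
    val childOutput = Seq(ts, amount)
    val distribute = Seq(amount)
    val scaled = scaledAttributes(distribute)
    // Same names and positions as the child; "amount" now carries a new id.
    output(childOutput, distribute, scaled).foreach(println)
  }
}
```

In this toy model a filter that refers to the output's amount attribute can only bind to the scaled column, because the pre-scale ExprId no longer appears in output; the input attribute remains reachable only through distribute.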