feat: clarify post_join_filter. add residual_expressions to hash join and merge join#1044
feat: clarify post_join_filter. add residual_expressions to hash join and merge join#1044yongchul wants to merge 4 commits intosubstrait-io:mainfrom
Conversation
| // An optional boolean filter applied to each output record after join | ||
| // matching and null-padding (for outer/mark joins) are complete. | ||
| // Semantically equivalent to placing a FilterRel directly above this join. | ||
| Expression post_join_filter = 6; |
There was a problem hiding this comment.
Is there any known engine that has this feature?
There was a problem hiding this comment.
Great question. I know one that I can fix (it is under my control lol). I will check data fusion, duckdb, and velox.
@benbellick @vbarua @nielspardon any impact on your sides?
There was a problem hiding this comment.
- datafusion: not supported
- duckdb: does not populate post_join_filter
- gluten: does use post_join_filter but not sure whether it is consistent with this PR.
There was a problem hiding this comment.
The engines we are currently dealing with would also handle this as a separate FilterRel instead of merging this into the Join operation. We may have some use cases in the future for representing fused operators in Substrait though I would look at this separately.
For me I'm fine with both this improved documentation or removing it.
There was a problem hiding this comment.
I just checked and we do not use it internally.
Clarifying the documentation is a good idea, though I would prefer removing it if no user of substrait is actually using it.
There was a problem hiding this comment.
@nielspardon @benbellick any final thoughts on the PR? This one I can wait until next week meeting.
benbellick
left a comment
There was a problem hiding this comment.
Generally looks good to me, just a few small comments!
|
|
||
| // Specifies which side of input to build the hash table for this hash join. Default is `BUILD_INPUT_RIGHT`. | ||
| // Specifies which side of input to build the hash table for this hash join. | ||
| // Defaults to `BUILD_INPUT_RIGHT`. |
There was a problem hiding this comment.
Does this mean that BUILD_INPUT_UNSPECIFIED => BUILD_INPUT_RIGHT?
I assume that reason for the presence of UNSPECIFIED is to follow the protobuf best practices, but the protobuf docs are pretty explicit that the unspecified value should not have any meaningful semantics:
It may be tempting to declare this default as a semantically meaningful value but as a general rule, do not, to aid in the evolution of your protocol as new enum values are added over time.
This doesn't have to be for this PR, but I think we should should force producers to specify what they mean explicitly. Maybe we do that in one larger PR next.
There was a problem hiding this comment.
Does this mean that BUILD_INPUT_UNSPECIFIED => BUILD_INPUT_RIGHT?
Yes, for the backward compatibility. Mandating would be a breaking change and we can do it in another PR as we discussed in the last meeting.
| @@ -23,7 +23,8 @@ The hash equijoin join operator will build a hash table out of one input (defaul | |||
| | Build Input | Specifies which input is the `Build`. | Optional, defaults to build `Right`, probe `Left`. | | |||
There was a problem hiding this comment.
FYI these tables still only discuss left / right keys (deprecated) rather than comparison join keys.
We can address in another PR if you'd like. Though I would love to just get rid of the deprecated keys instead.
There was a problem hiding this comment.
Let me see whether I make the change easily in this PR. otherwise, I'll follow up with another one.
There was a problem hiding this comment.
Looked into it and looks like better be handled in a separate PR.
972ccfa to
1cd5275
Compare
1cd5275 to
5ae9c17
Compare
cff0ed1 to
1a3c564
Compare
Background
JoinRel, HashJoinRel, and MergeJoinRel have a field named
post_join_filter, causing confusion. The confusing point is whetherpost_join_filteris part of the join predicate (i.e., two rows are matched when it evaluates totrue) or not (i.e., as the name says, this is post join filter, thusfilterevaluates after the join operation).Substrait FAQ calls out that the
post_join_filteris part of join predicate but @yongchul was deeply discontent about the explanation and the naming.After lengthy discussion in the Slack channel, #807 was created. After more back-and-forth, the original intent of
post_join_filterwas indeed post join filter, not part of join predicate.Also, it appears that Calcite has precisely the same name and same semantic, post join filter. Following code point shows placing the filter on top of the join.
https://github.com/apache/calcite/blob/0a39568b167592ded8db1128b5838982ffe264f3/core/src/main/java/org/apache/calcite/rel/rules/LoptOptimizeJoinRule.java#L553-L559
What this changes do?
TL;DR; Make
post_join_filteraspost_join_filteragain.post_join_filteris not join condition but semantically filter on top of join.residual_expressionto HashJoinRel and MergeJoinRel so that they can express non-equality join predicate -- I guess this was the confusion who tried to shove non-equality join predicate toHashJoinandMergeJoinsomewhere and distorted the meaning without much thought.equifrom hash join and merge join documentation as withresidual_expressionthey are capable of supporting arbitrary join predicate.Breaking change?
This is debatable. According to previous FAQ,
post_join_filterwasresidual_expressionintroduced in this PR. If we take the FAQ as part of the "spec", then this is a breaking change.If we don't consider FAQ as a spec, then this is not a breaking change but clarification and adding a new feature -- HashJoinRel, MergeJoinRel now supports arbitrary join predicate.
AI disclaimer
The PR is assisted by Claud Opus 4.6.
This change is