Skip to content

feat: clarify post_join_filter. add residual_expressions to hash join and merge join#1044

Open
yongchul wants to merge 4 commits intosubstrait-io:mainfrom
yongchul:residual_join_expression
Open

feat: clarify post_join_filter. add residual_expressions to hash join and merge join#1044
yongchul wants to merge 4 commits intosubstrait-io:mainfrom
yongchul:residual_join_expression

Conversation

@yongchul
Copy link
Copy Markdown
Contributor

@yongchul yongchul commented Apr 13, 2026

Background

JoinRel, HashJoinRel, and MergeJoinRel have a field namedpost_join_filter, causing confusion. The confusing point is whether post_join_filter is part of the join predicate (i.e., two rows are matched when it evaluates to true) or not (i.e., as the name says, this is post join filter, thus filter evaluates after the join operation).

Substrait FAQ calls out that the post_join_filter is part of join predicate but @yongchul was deeply discontent about the explanation and the naming.

After lengthy discussion in the Slack channel, #807 was created. After more back-and-forth, the original intent of post_join_filter was indeed post join filter, not part of join predicate.

Also, it appears that Calcite has precisely the same name and same semantic, post join filter. Following code point shows placing the filter on top of the join.
https://github.com/apache/calcite/blob/0a39568b167592ded8db1128b5838982ffe264f3/core/src/main/java/org/apache/calcite/rel/rules/LoptOptimizeJoinRule.java#L553-L559

What this changes do?

TL;DR; Make post_join_filter as post_join_filter again.

  1. Documents post_join_filter is not join condition but semantically filter on top of join.
  2. Introduce residual_expression to HashJoinRel and MergeJoinRel so that they can express non-equality join predicate -- I guess this was the confusion who tried to shove non-equality join predicate to HashJoin and MergeJoin somewhere and distorted the meaning without much thought.
  3. Drop equi from hash join and merge join documentation as with residual_expression they are capable of supporting arbitrary join predicate.
  4. Fix the FAQ.

Breaking change?

This is debatable. According to previous FAQ, post_join_filter was residual_expression introduced in this PR. If we take the FAQ as part of the "spec", then this is a breaking change.

If we don't consider FAQ as a spec, then this is not a breaking change but clarification and adding a new feature -- HashJoinRel, MergeJoinRel now supports arbitrary join predicate.

AI disclaimer

The PR is assisted by Claud Opus 4.6.


This change is Reviewable

Comment thread site/docs/faq.md
Comment thread site/docs/relations/physical_relations.md Outdated
@yongchul yongchul added the PMC Ready PRs ready for review by PMCs label Apr 13, 2026
Comment thread proto/substrait/algebra.proto Outdated
Comment on lines 866 to 869
// An optional boolean filter applied to each output record after join
// matching and null-padding (for outer/mark joins) are complete.
// Semantically equivalent to placing a FilterRel directly above this join.
Expression post_join_filter = 6;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any known engine that has this feature?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question. I know one that I can fix (it is under my control lol). I will check data fusion, duckdb, and velox.

@benbellick @vbarua @nielspardon any impact on your sides?

Copy link
Copy Markdown
Contributor Author

@yongchul yongchul Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • datafusion: not supported
  • duckdb: does not populate post_join_filter
  • gluten: does use post_join_filter but not sure whether it is consistent with this PR.

Copy link
Copy Markdown
Member

@nielspardon nielspardon Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The engines we are currently dealing with would also handle this as a separate FilterRel instead of merging this into the Join operation. We may have some use cases in the future for representing fused operators in Substrait though I would look at this separately.

For me I'm fine with both this improved documentation or removing it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked and we do not use it internally.

Clarifying the documentation is a good idea, though I would prefer removing it if no user of substrait is actually using it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nielspardon @benbellick any final thoughts on the PR? This one I can wait until next week meeting.

Copy link
Copy Markdown
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Copy Markdown
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me, just a few small comments!


// Specifies which side of input to build the hash table for this hash join. Default is `BUILD_INPUT_RIGHT`.
// Specifies which side of input to build the hash table for this hash join.
// Defaults to `BUILD_INPUT_RIGHT`.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that BUILD_INPUT_UNSPECIFIED => BUILD_INPUT_RIGHT?

I assume that reason for the presence of UNSPECIFIED is to follow the protobuf best practices, but the protobuf docs are pretty explicit that the unspecified value should not have any meaningful semantics:

It may be tempting to declare this default as a semantically meaningful value but as a general rule, do not, to aid in the evolution of your protocol as new enum values are added over time.

This doesn't have to be for this PR, but I think we should should force producers to specify what they mean explicitly. Maybe we do that in one larger PR next.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made an issue to track this: #1069

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that BUILD_INPUT_UNSPECIFIED => BUILD_INPUT_RIGHT?

Yes, for the backward compatibility. Mandating would be a breaking change and we can do it in another PR as we discussed in the last meeting.

Comment thread site/docs/relations/logical_relations.md
Comment thread proto/substrait/algebra.proto Outdated
@@ -23,7 +23,8 @@ The hash equijoin join operator will build a hash table out of one input (defaul
| Build Input | Specifies which input is the `Build`. | Optional, defaults to build `Right`, probe `Left`. |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI these tables still only discuss left / right keys (deprecated) rather than comparison join keys.

We can address in another PR if you'd like. Though I would love to just get rid of the deprecated keys instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me see whether I make the change easily in this PR. otherwise, I'll follow up with another one.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked into it and looks like better be handled in a separate PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created #1070

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented May 5, 2026

CLA assistant check
All committers have signed the CLA.

@yongchul yongchul force-pushed the residual_join_expression branch from 1cd5275 to 5ae9c17 Compare May 5, 2026 22:57
@yongchul yongchul force-pushed the residual_join_expression branch 2 times, most recently from cff0ed1 to 1a3c564 Compare May 5, 2026 23:31
Copy link
Copy Markdown
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PMC Ready PRs ready for review by PMCs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants