feat!: add id-based outer reference resolution for DAG plans#1031
Conversation
Add optional RelCommon.id field and OuterReference.id_reference to support unambiguous outer reference resolution in plans with common subexpressions (ReferenceRel). The existing steps_out offset-based mechanism remains for backward compatibility with tree-shaped plans. Closes substrait-io#1024
- Add bool lateral field to CrossRel for lateral cross product semantics - Add RelCommon.id and OuterReference.id_reference from PR substrait-io#1031 - Update JoinRel and CrossRel lateral comments to use id_reference - Update logical_relations.md with lateral docs for both JoinRel and CrossRel Depends-on: substrait-io#1031
|
Am I understanding correctly that this problem only arises when a My interpretation of I feel like we may need a new construct here rather than tacking on the ability to reference outside of the scope of a particular root. This feels like some kind of "parameterized" relation notion, where we would want to have call sites providing values to pass in dynamically. I'm not sure yet what the right way to represent this would be but wanted to get my thoughts down for now. What do you think? |
|
I love github missing thread in the PR comment... 😠
Yes.
Yes, you could put that restriction, you have to have each relation context independent. However, this proposal is actually making it context independent because the specific outer reference is bound to one consistent value without ambiguity [modulo there is still restriction where that particular value can coming from].
This goes to inline table function (or parameterized view), which is a generic solution (or sledge hammer) and I actually feel like it is an overkill. Yes, I did think initially this is the way to go when I confine myself in existing Substrait outer reference structure, but this quickly went out of control. If your binding actually does not change in the plan, why do you have to go through indirect mappings only to point to the same thing? To truly use inline table function, either you have a) a fairly sophisticated common subexpression engine to extract the common relational template or b) support for the parameterized views. I'm not aware of (a) but many systems do support (b). In (b), you won't actually have an outer reference if you inline the view definition or if you model it actual function call. Because you already have the common definition, it is straight forward to implement in producer and they are in fact used multiple times with different binding points. For (a), if you don't have such detection, it's pretty challenging for producer extract the common pattern, and generate the definition. It's nice to have such fancy decomposition for our system to have, but I don't see us doing this anytime soon... So, I'm not arguing we don't need inline table function. We will need that and useful for parameterized views. But I don't feel like we have to use inline table function for outer reference, because we can represent outer reference in more generic way. I'm biased. 😆 Happy to chat more about this but I strongly prefer anything else except github PR-level comment. |
|
Could there be a common mechanism with #726? or is this change totally independent? I don't have context on why an outer reference and a reference rel are both necessary. This seems to be the reason, but I'm not sure what context an outer reference has that can't be included in a reference rel:
(Of course, I would have to reopen #726 or maybe you can combine the change in this one since it's small) |
I believe they are independent. If you squint, you could use the
reference rel is about sharing a relational subtree (e.g., common table expressions, views). outer reference is a concept that you can correlate one relational tree to another (see lateral join #973 for example). They are orthogonal.
I wouldn't combine them because they are different concept. Please let me know if there is some misunderstanding. Again, I hate PR-level comment to discuss something. If you want to discuss, let's create a comment in a file, then starts a thread. |
|
@yongchul , have you looked at all at how to change this in Datafusion/Duckdb? I think maybe one/both have implemented subqueries (maybe?) and am curious the complexity to update to new pattern. |
@jacques-n conceptually, if the implementer already maintains schema stack, the new approach can be implemented by look up by the id. So the content of stack changes from I schemed datafusion. With my inexperienced rusty eyes, it does seem to follow above structure so fairly easy to extend. I couldn't find handling of I admit that I used a little loose definition of I will look into duckdb as well. |
benbellick
left a comment
There was a problem hiding this comment.
I generally think this is good now that I properly understand the problem, but just have some things I'd like clarified.
Thanks!
|
|
||
| #### `id_reference` (id-based) | ||
|
|
||
| When a plan contains **common subexpressions** via `ReferenceRel`, the same relation can be reached through multiple paths with different depths. In that case, offset-based resolution is ambiguous because `steps_out` depends on *which* path is followed. |
There was a problem hiding this comment.
I think that this section may read a bit better if instead it was
#### `steps_out` (offset-based)
<explanation of WHAT steps_out is with an example>
#### `id_reference` (id-based)
<explanation of WHAT id_reference is with an example>
#### When `id_reference` is required
example of case where it is required. I think it would read clearer to people without the context of the problem this is solving to just see the concept in isolation. Then in the subsequent section, they see why you need to use one vs the other. WDYT?
There was a problem hiding this comment.
Revised accordingly. PTAL
|
|
||
| // Plan-wide unique id of the relation whose output this reference is | ||
| // rooted on. Must match an id defined on a RelCommon in the plan. | ||
| // Must be > 0 when set. Recommended when the plan contains shared |
There was a problem hiding this comment.
mentioned elsewhere, but could we just make it required in this case?
There was a problem hiding this comment.
I am trying to avoid presenting this as a breaking change. Perhaps it is fine. Let's see how others think.
Co-authored-by: Ben Bellick <36523439+benbellick@users.noreply.github.com>
e36642e to
4c94434
Compare
| └── ReferenceRel(0) | ||
| ``` | ||
|
|
||
| The shared relation `x` contains an outer reference to `tableA.a`. It is reached through two paths — one via subquery (2), one directly — so `steps_out` would need a different value depending on which path is followed. With `id_reference = 1`, the reference inside `x` unambiguously names `ProjectRel` to resolve the outer reference regardless of path. |
There was a problem hiding this comment.
It would be nice to show the example of the "good" case right after, i.e. the first relation but using an id reference.
There was a problem hiding this comment.
Hmm... Do you mean making the example more explicit using steps out and id? I'll make the change.
Co-authored-by: Ben Bellick <36523439+benbellick@users.noreply.github.com>
benbellick
left a comment
There was a problem hiding this comment.
Just some documentation suggestions I have suggested, feel free to take or leave them, they are mostly nits. Otherwise LGTM!
Co-authored-by: Ben Bellick <36523439+benbellick@users.noreply.github.com>
…it-io#1031) BREAKING CHANGE: NOT A REAL BREAKING CHANGE. OuterReference.steps_out field is now under oneof -- implicit optional to explicit optional. This is not a real breaking change as Substrait always required non-zero value. Add optional RelCommon.id field and OuterReference.id_reference to support unambiguous outer reference resolution in plans with common subexpressions (ReferenceRel). The existing steps_out offset-based mechanism remains for backward compatibility with tree-shaped plans. The `RelCommon.rel_anchor` can be used for other purpose but for now only required when it is used to resolve outer reference. See substrait-io#1024 for detailed motivating example and discussion. # AI disclaimer This PR was assisted by Claude Opus 4.6. # NOT A REAL BREAKING CHANGE `OuterReference.steps_out` field is moved into `oneof`. The field requires non-zero values thus there is no wire format change (the value has been *always* present on the wire for `steps_out` when used). <!-- Reviewable:start --> - - - This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/substrait-io/substrait/1031) <!-- Reviewable:end --> --------- Co-authored-by: Ben Bellick <36523439+benbellick@users.noreply.github.com>
- Add bool lateral field to CrossRel for lateral cross product semantics - Add RelCommon.id and OuterReference.id_reference from PR substrait-io#1031 - Update JoinRel and CrossRel lateral comments to use id_reference - Update logical_relations.md with lateral docs for both JoinRel and CrossRel Depends-on: substrait-io#1031
BREAKING CHANGE: NOT A REAL BREAKING CHANGE. OuterReference.steps_out field is now under oneof -- implicit optional to explicit optional. This is not a real breaking change as Substrait always required non-zero value.
Add optional RelCommon.id field and OuterReference.id_reference to support unambiguous outer reference resolution in plans with common subexpressions (ReferenceRel). The existing steps_out offset-based mechanism remains for backward compatibility with tree-shaped plans.
The
RelCommon.rel_anchorcan be used for other purpose but for now only required when it is used to resolve outer reference.See #1024 for detailed motivating example and discussion.
AI disclaimer
This PR was assisted by Claude Opus 4.6.
NOT A REAL BREAKING CHANGE
OuterReference.steps_outfield is moved intooneof. The field requires non-zero values thus there is no wire format change (the value has been always present on the wire forsteps_outwhen used).This change is