Skip to content

feat!: add id-based outer reference resolution for DAG plans#1031

Merged
yongchul merged 9 commits into
substrait-io:mainfrom
yongchul:correlated_reference
May 1, 2026
Merged

feat!: add id-based outer reference resolution for DAG plans#1031
yongchul merged 9 commits into
substrait-io:mainfrom
yongchul:correlated_reference

Conversation

@yongchul
Copy link
Copy Markdown
Contributor

@yongchul yongchul commented Apr 2, 2026

BREAKING CHANGE: NOT A REAL BREAKING CHANGE. OuterReference.steps_out field is now under oneof -- implicit optional to explicit optional. This is not a real breaking change as Substrait always required non-zero value.

Add optional RelCommon.id field and OuterReference.id_reference to support unambiguous outer reference resolution in plans with common subexpressions (ReferenceRel). The existing steps_out offset-based mechanism remains for backward compatibility with tree-shaped plans.

The RelCommon.rel_anchor can be used for other purpose but for now only required when it is used to resolve outer reference.

See #1024 for detailed motivating example and discussion.

AI disclaimer

This PR was assisted by Claude Opus 4.6.

NOT A REAL BREAKING CHANGE

OuterReference.steps_out field is moved into oneof. The field requires non-zero values thus there is no wire format change (the value has been always present on the wire for steps_out when used).


This change is Reviewable

Add optional RelCommon.id field and OuterReference.id_reference to support
unambiguous outer reference resolution in plans with common subexpressions
(ReferenceRel). The existing steps_out offset-based mechanism remains for
backward compatibility with tree-shaped plans.

Closes substrait-io#1024
yongchul added a commit to yongchul/substrait that referenced this pull request Apr 3, 2026
- Add bool lateral field to CrossRel for lateral cross product semantics
- Add RelCommon.id and OuterReference.id_reference from PR substrait-io#1031
- Update JoinRel and CrossRel lateral comments to use id_reference
- Update logical_relations.md with lateral docs for both JoinRel and CrossRel

Depends-on: substrait-io#1031
@yongchul yongchul added the PMC Ready PRs ready for review by PMCs label Apr 3, 2026
@benbellick
Copy link
Copy Markdown
Member

Am I understanding correctly that this problem only arises when a Plan.relations entry contains outer references that reach outside the bounds of that particular root?

My interpretation of Plan.relations is that each entry is a standalone thing. Having a relation use steps_out to reference something outside its own scope feels like a misuse of Plan.relations to me. Of course, we can always use a ReferenceRel to reference one of the other relations, but their evaluation wouldn't be context dependent.

I feel like we may need a new construct here rather than tacking on the ability to reference outside of the scope of a particular root. This feels like some kind of "parameterized" relation notion, where we would want to have call sites providing values to pass in dynamically. I'm not sure yet what the right way to represent this would be but wanted to get my thoughts down for now.

What do you think?

@yongchul
Copy link
Copy Markdown
Contributor Author

yongchul commented Apr 6, 2026

I love github missing thread in the PR comment... 😠

Am I understanding correctly that this problem only arises when a Plan.relations entry contains outer references that reach outside the bounds of that particular root?

Yes.

My interpretation of Plan.relations is that each entry is a standalone thing. Having a relation use steps_out to reference something outside its own scope feels like a misuse of Plan.relations to me. Of course, we can always use a ReferenceRel to reference one of the other relations, but their evaluation wouldn't be context dependent.

Yes, you could put that restriction, you have to have each relation context independent. However, this proposal is actually making it context independent because the specific outer reference is bound to one consistent value without ambiguity [modulo there is still restriction where that particular value can coming from].

I feel like we may need a new construct here rather than tacking on the ability to reference outside of the scope of a particular root. This feels like some kind of "parameterized" relation notion, where we would want to have call sites providing values to pass in dynamically. I'm not sure yet what the right way to represent this would be but wanted to get my thoughts down for now.

This goes to inline table function (or parameterized view), which is a generic solution (or sledge hammer) and I actually feel like it is an overkill. Yes, I did think initially this is the way to go when I confine myself in existing Substrait outer reference structure, but this quickly went out of control. If your binding actually does not change in the plan, why do you have to go through indirect mappings only to point to the same thing?

To truly use inline table function, either you have a) a fairly sophisticated common subexpression engine to extract the common relational template or b) support for the parameterized views. I'm not aware of (a) but many systems do support (b). In (b), you won't actually have an outer reference if you inline the view definition or if you model it actual function call. Because you already have the common definition, it is straight forward to implement in producer and they are in fact used multiple times with different binding points. For (a), if you don't have such detection, it's pretty challenging for producer extract the common pattern, and generate the definition. It's nice to have such fancy decomposition for our system to have, but I don't see us doing this anytime soon...

So, I'm not arguing we don't need inline table function. We will need that and useful for parameterized views. But I don't feel like we have to use inline table function for outer reference, because we can represent outer reference in more generic way.

I'm biased. 😆 Happy to chat more about this but I strongly prefer anything else except github PR-level comment.

@drin
Copy link
Copy Markdown
Member

drin commented Apr 7, 2026

Could there be a common mechanism with #726? or is this change totally independent?

I don't have context on why an outer reference and a reference rel are both necessary. This seems to be the reason, but I'm not sure what context an outer reference has that can't be included in a reference rel:

Of course, we can always use a ReferenceRel to reference one of the other relations, but their evaluation wouldn't be context dependent.

(Of course, I would have to reopen #726 or maybe you can combine the change in this one since it's small)

@yongchul
Copy link
Copy Markdown
Contributor Author

yongchul commented Apr 8, 2026

@drin

Could there be a common mechanism with #726? or is this change totally independent?

I believe they are independent. If you squint, you could use the rel_id of the root rel of PlanRel but it is indirect while you seemed to look into more direct reference in #726. Or, rel_id value is sufficient for PlanRel anchor value you are looking for but not necessary, and I expect you want more direct reference at PlanRel message level.

I don't have context on why an outer reference and a reference rel are both necessary. This seems to be the reason, but I'm not sure what context an outer reference has that can't be included in a reference rel:

reference rel is about sharing a relational subtree (e.g., common table expressions, views). outer reference is a concept that you can correlate one relational tree to another (see lateral join #973 for example). They are orthogonal.

Of course, we can always use a ReferenceRel to reference one of the other relations, but their evaluation wouldn't be context dependent.
(Of course, I would have to reopen #726 or maybe you can combine the change in this one since it's small)

I wouldn't combine them because they are different concept.

Please let me know if there is some misunderstanding. Again, I hate PR-level comment to discuss something. If you want to discuss, let's create a comment in a file, then starts a thread.

@jacques-n
Copy link
Copy Markdown
Contributor

@yongchul , have you looked at all at how to change this in Datafusion/Duckdb? I think maybe one/both have implemented subqueries (maybe?) and am curious the complexity to update to new pattern.

@yongchul
Copy link
Copy Markdown
Contributor Author

yongchul commented Apr 8, 2026

@yongchul , have you looked at all at how to change this in Datafusion/Duckdb? I think maybe one/both have implemented subqueries (maybe?) and am curious the complexity to update to new pattern.

@jacques-n conceptually, if the implementer already maintains schema stack, the new approach can be implemented by look up by the id. So the content of stack changes from (schema) to (id, schema), then you can simply go over the stack until you find the schema with matching id.

I schemed datafusion. With my inexperienced rusty eyes, it does seem to follow above structure so fairly easy to extend. I couldn't find handling of RelCommon but this could be done easily as well when you push schema to the stack.

I admit that I used a little loose definition of uniqueness of id here though. I don't know how hard to check the uniqueness of non-zero ids across the plan... maybe this is difficult in datafusion?

I will look into duckdb as well.

Copy link
Copy Markdown
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me.

Copy link
Copy Markdown
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally think this is good now that I properly understand the problem, but just have some things I'd like clarified.

Thanks!

Comment thread site/docs/expressions/field_references.md Outdated
Comment thread site/docs/expressions/field_references.md Outdated

#### `id_reference` (id-based)

When a plan contains **common subexpressions** via `ReferenceRel`, the same relation can be reached through multiple paths with different depths. In that case, offset-based resolution is ambiguous because `steps_out` depends on *which* path is followed.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this section may read a bit better if instead it was

#### `steps_out` (offset-based)
<explanation of WHAT steps_out is with an example>
#### `id_reference` (id-based)
<explanation of WHAT id_reference is with an example>

#### When `id_reference` is required
example of case where it is required. 

I think it would read clearer to people without the context of the problem this is solving to just see the concept in isolation. Then in the subsequent section, they see why you need to use one vs the other. WDYT?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised accordingly. PTAL

Comment thread site/docs/expressions/field_references.md Outdated
Comment thread site/docs/expressions/subqueries.md Outdated
Comment thread site/docs/expressions/subqueries.md Outdated
Comment thread site/docs/expressions/field_references.md Outdated
Comment thread site/docs/expressions/field_references.md
Comment thread proto/substrait/algebra.proto Outdated

// Plan-wide unique id of the relation whose output this reference is
// rooted on. Must match an id defined on a RelCommon in the plan.
// Must be > 0 when set. Recommended when the plan contains shared
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mentioned elsewhere, but could we just make it required in this case?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to avoid presenting this as a breaking change. Perhaps it is fine. Let's see how others think.

Comment thread site/docs/expressions/field_references.md Outdated
Co-authored-by: Ben Bellick <36523439+benbellick@users.noreply.github.com>
@yongchul yongchul requested a review from nielspardon as a code owner April 15, 2026 20:50
@yongchul yongchul force-pushed the correlated_reference branch from e36642e to 4c94434 Compare April 16, 2026 07:34
@yongchul yongchul changed the title feat: add id-based outer reference resolution for DAG plans feat!: add id-based outer reference resolution for DAG plans Apr 22, 2026
Comment thread site/docs/expressions/subqueries.md Outdated
Comment thread site/docs/expressions/subqueries.md Outdated
└── ReferenceRel(0)
```

The shared relation `x` contains an outer reference to `tableA.a`. It is reached through two paths — one via subquery (2), one directly — so `steps_out` would need a different value depending on which path is followed. With `id_reference = 1`, the reference inside `x` unambiguously names `ProjectRel` to resolve the outer reference regardless of path.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to show the example of the "good" case right after, i.e. the first relation but using an id reference.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... Do you mean making the example more explicit using steps out and id? I'll make the change.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised. PTAL

yongchul and others added 2 commits April 23, 2026 14:45
Co-authored-by: Ben Bellick <36523439+benbellick@users.noreply.github.com>
Copy link
Copy Markdown
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some documentation suggestions I have suggested, feel free to take or leave them, they are mostly nits. Otherwise LGTM!

Comment thread site/docs/expressions/field_references.md Outdated
Comment thread site/docs/expressions/field_references.md Outdated
Comment thread site/docs/expressions/field_references.md Outdated
Comment thread site/docs/expressions/subqueries.md Outdated
Comment thread proto/substrait/algebra.proto Outdated
Comment thread proto/substrait/algebra.proto Outdated
yongchul and others added 2 commits April 30, 2026 14:43
Co-authored-by: Ben Bellick <36523439+benbellick@users.noreply.github.com>
@yongchul yongchul merged commit ad664dc into substrait-io:main May 1, 2026
15 checks passed
yongchul added a commit to yongchul/substrait that referenced this pull request May 5, 2026
…it-io#1031)

BREAKING CHANGE: NOT A REAL BREAKING CHANGE. OuterReference.steps_out
field is now under oneof -- implicit optional to explicit optional. This
is not a real breaking change as Substrait always required non-zero
value.

Add optional RelCommon.id field and OuterReference.id_reference to
support unambiguous outer reference resolution in plans with common
subexpressions (ReferenceRel). The existing steps_out offset-based
mechanism remains for backward compatibility with tree-shaped plans.

The `RelCommon.rel_anchor` can be used for other purpose but for now
only required when it is used to resolve outer reference.

See substrait-io#1024 for detailed motivating example and discussion.

# AI disclaimer
This PR was assisted by Claude Opus 4.6.

# NOT A REAL BREAKING CHANGE

`OuterReference.steps_out` field is moved into `oneof`. The field
requires non-zero values thus there is no wire format change (the value
has been *always* present on the wire for `steps_out` when used).

<!-- Reviewable:start -->
- - -
This change is [<img src="https://reviewable.io/review_button.svg"
height="34" align="absmiddle"
alt="Reviewable"/>](https://reviewable.io/reviews/substrait-io/substrait/1031)
<!-- Reviewable:end -->

---------

Co-authored-by: Ben Bellick <36523439+benbellick@users.noreply.github.com>
yongchul added a commit to yongchul/substrait that referenced this pull request May 5, 2026
- Add bool lateral field to CrossRel for lateral cross product semantics
- Add RelCommon.id and OuterReference.id_reference from PR substrait-io#1031
- Update JoinRel and CrossRel lateral comments to use id_reference
- Update logical_relations.md with lateral docs for both JoinRel and CrossRel

Depends-on: substrait-io#1031
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PMC Ready PRs ready for review by PMCs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants