24 changes: 19 additions & 5 deletions proto/substrait/algebra.proto
@@ -24,6 +24,12 @@ message RelCommon {
Hint hint = 3;
substrait.extensions.AdvancedExtension advanced_extension = 4;

// Optional plan-wide unique identifier for this relation. Required when
// this relation is the binding point for an OuterReference using
// rel_reference. Must be unique across all rels within a Plan.
// Must be >= 1 when set.
optional uint32 rel_anchor = 5;

// Direct indicates no change on presence and ordering of fields in the output
message Direct {}

@@ -1616,12 +1622,20 @@ message Expression {
// incoming record type
message RootReference {}

// A root reference for the outer relation's subquery
// A root reference for the outer relation's subquery.
message OuterReference {
// number of subquery boundaries to traverse up for this field's reference
//
// This value must be >= 1
uint32 steps_out = 1;
oneof outer_reference_type {
// Number of subquery boundaries to traverse up for this field's
// reference. Must be >= 1.
uint32 steps_out = 1;

// References the plan-wide unique rel_anchor of the relation that this
// field reference is rooted on. Must match a rel_anchor defined on a
// RelCommon in the plan. Must be >= 1. Must be used instead of steps_out
// when this outer reference appears inside a relation shared via
// ReferenceRel and offset-based resolution via steps_out would be ambiguous.
uint32 rel_reference = 2;
}
}

// A reference to a lambda parameter within a lambda body expression.
59 changes: 58 additions & 1 deletion site/docs/expressions/field_references.md
@@ -5,7 +5,7 @@ In Substrait, all fields are dealt with on a positional basis. Field names are o
Field references can originate from different root types:

- **RootReference**: References the incoming record from the relation
- **OuterReference**: References outer query records in correlated subqueries
- **OuterReference**: References outer query records in correlated subqueries, supporting either offset-based (`steps_out`) or id-based (`rel_reference`) resolution (see [Outer References](#outer-references))
- **Expression**: References the result of evaluating an expression
- **LambdaParameterReference**: References lambda parameters within lambda body expressions (see [Lambda Expressions](lambda_expressions.md))

@@ -155,3 +155,60 @@ By default, when only a single field is selected from a struct, that struct is r

* Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.)

### Outer References

Outer references allow expressions inside a subquery to access records from an enclosing relation. The `OuterReference` root type supports two mutually exclusive resolution strategies:

#### `steps_out` (offset-based)

`steps_out` resolves the reference by counting subquery boundaries upward (`steps_out >= 1`). This works correctly whenever the plan is a **tree**: when each relation has exactly one parent, the path to the binding relation is uniquely determined by `steps_out`.
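
Offset-based resolution can be pictured as indexing into a stack of enclosing relations. The following is a minimal sketch, not part of the Substrait specification: `resolve_steps_out` and the list-of-names stack are hypothetical and exist only for illustration.

```python
from typing import List


def resolve_steps_out(enclosing: List[str], steps_out: int) -> str:
    """Return the relation an offset-based outer reference binds to.

    `enclosing` is a hypothetical stack of the relations that own each
    subquery boundary crossed so far, outermost first.
    """
    if steps_out < 1:
        raise ValueError("steps_out must be >= 1")
    if steps_out > len(enclosing):
        raise ValueError("steps_out exceeds the number of enclosing subquery boundaries")
    # steps_out=1 is the nearest enclosing relation, i.e. the top of the stack.
    return enclosing[-steps_out]


# Two nested subqueries: the inner one is hosted by "ProjectRel(inner)",
# which itself sits inside a subquery hosted by "ProjectRel(tableA)".
stack = ["ProjectRel(tableA)", "ProjectRel(inner)"]
nearest = resolve_steps_out(stack, 1)
outermost = resolve_steps_out(stack, 2)
```

Because the stack is built while walking down a single path, the result is only well defined when every relation is reached through exactly one path.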

#### `rel_reference` (id-based)

`rel_reference` resolves the reference by referring to the binding relation via its plan-wide unique `RelCommon.rel_anchor`. The `rel_anchor` on the referenced relation must be set (>= 1) and unique across all relations in the plan.
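
Id-based resolution amounts to a lookup in a plan-wide table of anchors. The sketch below assumes a simplified, hypothetical model in which each relation is a `(name, rel_anchor)` pair; `build_anchor_index` and `resolve_rel_reference` are illustrative names, not Substrait APIs.

```python
from typing import Dict, Iterable, Optional, Tuple


def build_anchor_index(rels: Iterable[Tuple[str, Optional[int]]]) -> Dict[int, str]:
    """Map each rel_anchor to its relation, enforcing the >= 1 and uniqueness rules."""
    index: Dict[int, str] = {}
    for name, anchor in rels:
        if anchor is None:
            continue  # rel_anchor is optional; unanchored relations are skipped
        if anchor < 1:
            raise ValueError(f"rel_anchor on {name} must be >= 1")
        if anchor in index:
            raise ValueError(f"rel_anchor {anchor} is used by both {index[anchor]} and {name}")
        index[anchor] = name
    return index


def resolve_rel_reference(index: Dict[int, str], rel_reference: int) -> str:
    """Return the binding relation for an id-based outer reference."""
    if rel_reference not in index:
        raise ValueError(f"no relation in the plan defines rel_anchor {rel_reference}")
    return index[rel_reference]


# Hypothetical flattened list of every relation in a plan.
index = build_anchor_index([("ProjectRel", 7), ("ReadRel(tableA)", None), ("FilterRel", None)])
binding = resolve_rel_reference(index, 7)
```

The lookup does not depend on how the referencing expression was reached, which is why it stays unambiguous for shared relations.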

#### Coexistence rules

Exactly one of `steps_out` or `rel_reference` must be set on each `OuterReference`. A single plan may contain outer references using different strategies (e.g., some using `steps_out` and others using `rel_reference`), as long as every individual reference is unambiguous. However, if any shared relation (via `ReferenceRel`) contains an unresolved outer reference, that reference **must** use `rel_reference`.
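
These rules can be checked per reference. The sketch below is one possible validator under stated assumptions: `OuterRef` is a hypothetical stand-in for the `OuterReference` message, and the caller is assumed to know whether the reference is unresolved inside a relation shared via `ReferenceRel`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OuterRef:
    """Hypothetical stand-in for OuterReference; None means the field is unset."""
    steps_out: Optional[int] = None
    rel_reference: Optional[int] = None


def outer_ref_errors(ref: OuterRef, unresolved_in_shared_relation: bool) -> List[str]:
    """Return the list of rule violations for one outer reference (empty if valid)."""
    errors: List[str] = []
    set_fields = [v for v in (ref.steps_out, ref.rel_reference) if v is not None]
    if len(set_fields) != 1:
        errors.append("exactly one of steps_out or rel_reference must be set")
    elif set_fields[0] < 1:
        errors.append("the selected field must be >= 1")
    if unresolved_in_shared_relation and ref.rel_reference is None:
        errors.append("an unresolved outer reference in a shared relation must use rel_reference")
    return errors


ok_tree = outer_ref_errors(OuterRef(steps_out=1), unresolved_in_shared_relation=False)
ok_shared = outer_ref_errors(OuterRef(rel_reference=7), unresolved_in_shared_relation=True)
bad_shared = outer_ref_errors(OuterRef(steps_out=1), unresolved_in_shared_relation=True)
bad_empty = outer_ref_errors(OuterRef(), unresolved_in_shared_relation=False)
```

In the protobuf definition the `oneof` already prevents both fields from being set together, so a real validator would mainly need to reject the case where neither is set.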

#### When to use `rel_reference`

`rel_reference` must be used instead of `steps_out` when an outer reference appears inside a relation shared via `ReferenceRel` and the shared relation can be reached through multiple paths with different subquery depths, making `steps_out` ambiguous. In this case, the same outer reference could require different `steps_out` values depending on which path is followed.

For example, consider a plan with two nested scalar subqueries that share a common relation `x`. The outer reference to `tableA.a` lives inside `x`, which is reached via paths of different depth:

```
PlanRel.relations[0].rel:                            # let's call it 'x'
  FilterRel(a > outer_ref(steps_out=1, tableA.a))    # steps_out 1 or 2?
  └── ReadRel(tableB)

PlanRel.relations[1].root:
  ProjectRel                                         # Correct binding for the outer reference to tableA.a in x.
  ├── ReadRel(tableA)
  └── Subquery.Scalar                                # Subquery (1)
      └── SetRel(MINUS_PRIMARY)
          ├── ProjectRel
          │   └── Subquery.Scalar                    # Subquery (2)
          │       └── ReferenceRel(0)                # steps_out=1 binds incorrectly here: tableA.a is two subquery boundaries out.
          └── ReferenceRel(0)                        # steps_out=1 binds correctly here: tableA.a is one subquery boundary out.
```

The same shared relation `x` contains a single stored `steps_out=1` outer reference, but that value is only correct for one of its uses. The other use would need `steps_out=2`, so offset-based resolution is ambiguous.
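
The ambiguity can be made concrete by counting subquery boundaries along each path to `ReferenceRel(0)`. The sketch below encodes the example plan as nested `(label, children)` tuples; this encoding and the `boundary_depths` helper are assumptions made for illustration only.

```python
# Hypothetical, simplified encoding of PlanRel.relations[1].root from the example.
PLAN = ("ProjectRel", [
    ("ReadRel(tableA)", []),
    ("Subquery.Scalar", [
        ("SetRel(MINUS_PRIMARY)", [
            ("ProjectRel", [
                ("Subquery.Scalar", [
                    ("ReferenceRel(0)", []),
                ]),
            ]),
            ("ReferenceRel(0)", []),
        ]),
    ]),
])


def boundary_depths(node, target, depth=0):
    """Collect the number of subquery boundaries above each occurrence of `target`."""
    label, children = node
    if label == "Subquery.Scalar":
        depth += 1  # entering a subquery crosses one boundary
    found = [depth] if label == target else []
    for child in children:
        found.extend(boundary_depths(child, target, depth))
    return found


depths = boundary_depths(PLAN, "ReferenceRel(0)")
# More than one distinct depth means no single steps_out value fits every use.
ambiguous = len(set(depths)) > 1
```

Each entry in `depths` is the `steps_out` value that the shared relation would need along that path to reach the root `ProjectRel`.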

With `rel_reference`, both reference rels can unambiguously refer to the correct binding.

```
PlanRel.relations[0].rel:                            # let's call it 'x'
  FilterRel(a > outer_ref(rel_reference=7, tableA.a))
  └── ReadRel(tableB)

PlanRel.relations[1].root:
  ProjectRel [rel_anchor=7]                          # Correct binding for the outer reference to tableA.a in x.
  ├── ReadRel(tableA)
  └── Subquery.Scalar                                # Subquery (1)
      └── SetRel(MINUS_PRIMARY)
          ├── ProjectRel
          │   └── Subquery.Scalar                    # Subquery (2)
          │       └── ReferenceRel(0)                # Reference 1: rel_reference = 7
          └── ReferenceRel(0)                        # Reference 2: rel_reference = 7
```
173 changes: 98 additions & 75 deletions site/docs/expressions/subqueries.md
@@ -1,75 +1,98 @@
# Subqueries

Subqueries are scalar expressions comprised of another query.

## Forms

### Scalar

Scalar subqueries are subqueries that return one row and one column.

| Property | Description | Required |
| -------- | -------------- | -------- |
| Input | Input relation | Yes |

### `IN` predicate

An `IN` subquery predicate checks that the left expression is contained in the
right subquery.

#### Examples

```sql
SELECT *
FROM t1
WHERE x IN (SELECT * FROM t2)
```

```sql
SELECT *
FROM t1
WHERE (x, y) IN (SELECT a, b FROM t2)
```

| Property | Description | Required |
| -------- | ------------------------------------------- | -------- |
| Needles | Expressions whose existence will be checked | Yes |
| Haystack | Subquery to check | Yes |

### Set predicates

A set predicate is a predicate over a set of rows in the form of a subquery.

`EXISTS` and `UNIQUE` are common SQL spellings of these kinds of predicates.

| Property | Description | Required |
| --------- | ------------------------------------------ | -------- |
| Operation | The operation to perform over the set | Yes |
| Tuples | Set of tuples to check using the operation | Yes |

### Set comparisons

A set comparison subquery is a subquery comparison using `ANY` or `ALL` operations.

#### Examples

```sql
SELECT *
FROM t1
WHERE x < ANY(SELECT y from t2)
```

| Property | Description | Required |
| --------------------- | ---------------------------------------------- | -------- |
| Reduction operation | The kind of reduction to use over the subquery | Yes |
| Comparison operation | The kind of comparison operation to use | Yes |
| Expression | Left-hand side expression to check | Yes |
| Subquery | Subquery to check | Yes |



=== "Protobuf Representation"

```proto
%%% proto.message.Expression.Subquery %%%
```
# Subqueries

Subqueries are scalar expressions comprised of another query.

## Forms

### Scalar

Scalar subqueries are subqueries that return one row and one column.

| Property | Description | Required |
| -------- | -------------- | -------- |
| Input | Input relation | Yes |

### `IN` predicate

An `IN` subquery predicate checks that the left expression is contained in the
right subquery.

#### Examples

```sql
SELECT *
FROM t1
WHERE x IN (SELECT * FROM t2)
```

```sql
SELECT *
FROM t1
WHERE (x, y) IN (SELECT a, b FROM t2)
```

| Property | Description | Required |
| -------- | ------------------------------------------- | -------- |
| Needles | Expressions whose existence will be checked | Yes |
| Haystack | Subquery to check | Yes |

### Set predicates

A set predicate is a predicate over a set of rows in the form of a subquery.

`EXISTS` and `UNIQUE` are common SQL spellings of these kinds of predicates.

| Property | Description | Required |
| --------- | ------------------------------------------ | -------- |
| Operation | The operation to perform over the set | Yes |
| Tuples | Set of tuples to check using the operation | Yes |

### Set comparisons

A set comparison subquery is a subquery comparison using `ANY` or `ALL` operations.

#### Examples

```sql
SELECT *
FROM t1
WHERE x < ANY(SELECT y from t2)
```

| Property | Description | Required |
| --------------------- | ---------------------------------------------- | -------- |
| Reduction operation | The kind of reduction to use over the subquery | Yes |
| Comparison operation | The kind of comparison operation to use | Yes |
| Expression | Left-hand side expression to check | Yes |
| Subquery | Subquery to check | Yes |



## Outer References in Subqueries

Subqueries may contain *outer references*, which are field references that reach
outside the subquery boundary to access records from an enclosing relation.
The `OuterReference` root type provides two resolution fields:

* `steps_out`: Resolves the reference by counting subquery boundaries
  upward. This works correctly when the plan is a tree (each relation has a
  single parent).

* `rel_reference`: Resolves the reference by naming the binding relation
  via its plan-wide unique `RelCommon.rel_anchor`. Must be used instead of
  `steps_out` when an outer reference appears inside a relation shared via
  `ReferenceRel` and that shared relation can be reached through multiple
  paths with different subquery depths, making `steps_out` ambiguous.


Exactly one of these fields must be set. See
[Field References — Outer References](field_references.md#outer-references)
for details.

=== "Protobuf Representation"

```proto
%%% proto.message.Expression.Subquery %%%
```
4 changes: 4 additions & 0 deletions site/docs/relations/common_fields.md
@@ -12,6 +12,10 @@ A relation which has a direct emit kind outputs the relation's output without re
* Many relations (such as Project) by default provide as their output the list of all their input columns plus any generated columns as its output columns. Review each relation to understand its specific output default.


## Relation ID

A relation may carry an optional plan-wide unique identifier (`rel_anchor`). When set, the value must be >= 1 and unique across all relations in the plan. This identifier is required when the relation is the binding point for an `OuterReference` that uses `rel_reference` resolution. See [Field References — Outer References](../expressions/field_references.md#outer-references) for details.

## Hints

Hints provide information that can improve performance but cannot be used to control the behavior. Table statistics, runtime constraints, name hints, and saved computations all fall into this category.
4 changes: 4 additions & 0 deletions site/docs/relations/logical_relations.md
@@ -425,6 +425,10 @@ We could use the `ReferenceRel` to highlight the shared `A JOIN B` between the t
One expressing `A JOIN B` (in position 0 in the plan), one using reference as follows: `ReferenceRel(0) JOIN C` and a third one
doing `ReferenceRel(0) JOIN D`. This allows to avoid the redundancy of `A JOIN B`.

!!! note "Outer references in shared relations"

When a shared relation contains an unresolved outer reference, the reference must use `rel_reference` instead of `steps_out`, because a `ReferenceRel` can be reached through multiple paths of different depths, making offset-based resolution ambiguous. See [Field References — Outer References](../expressions/field_references.md#outer-references) for details.

| Signature | Value |
| -------------------- |---------------------------------------|
| Inputs | 1 |
@@ -0,0 +1,39 @@
# Outer reference using rel_reference (id-based resolution)
#
# Scenario: A shared relation (via ReferenceRel) contains a correlated
# filter that references a column from an enclosing relation. Because the
# shared relation can be reached through multiple paths of different
# depths, offset-based resolution (steps_out) would be ambiguous.
# rel_reference resolves this by naming the binding relation directly.
#
# Plan structure:
#
# PlanRel.relations[0].rel: (shared relation "x")
# FilterRel(col > outer_ref(rel_reference=7, position 0))
# └── ReadRel(tableB)
#
# PlanRel.relations[1].root:
# ProjectRel [rel_anchor=7] <-- binding relation
# ├── ReadRel(tableA)
# └── Subquery.Scalar
# └── SetRel(MINUS_PRIMARY)
# ├── ProjectRel
# │ └── Subquery.Scalar
# │ └── ReferenceRel(0) (depth 2 from binding)
# └── ReferenceRel(0) (depth 1 from binding)
#
# Both ReferenceRel nodes point to the same shared relation, but they
# sit at different depths. rel_reference = 7 unambiguously resolves
# to the ProjectRel whose RelCommon.rel_anchor = 7, regardless of which
# path is taken.
#
# message Expression.FieldReference

outer_reference: {
rel_reference: 7 # Refers to the relation with RelCommon.rel_anchor = 7
}
direct_reference: {
struct_field: {
field: 0 # First column of the binding relation (tableA.a)
}
}
@@ -0,0 +1,31 @@
# Outer reference using steps_out (offset-based resolution)
#
# Scenario: A correlated scalar subquery where the inner filter references
# a column from the outer query.
#
# SQL equivalent:
# SELECT *
# FROM orders -- outer relation
# WHERE amount > (
# SELECT AVG(amount) -- scalar subquery
# FROM orders AS o2
# WHERE o2.customer_id = orders.customer_id -- outer reference
# )
#
# The outer reference `orders.customer_id` is one subquery boundary up,
# so steps_out = 1. The referenced field is at position 0 (customer_id)
# in the outer relation's output.
#
# steps_out works here because the plan is a tree (each relation has
# exactly one parent), so the path to the binding relation is unambiguous.
#
# message Expression.FieldReference

outer_reference: {
steps_out: 1 # One subquery boundary up to the enclosing relation
}
direct_reference: {
struct_field: {
field: 0 # First column of the outer relation (customer_id)
}
}