diff --git a/proto/substrait/algebra.proto b/proto/substrait/algebra.proto index ccc36b4d7..6fa504dc8 100644 --- a/proto/substrait/algebra.proto +++ b/proto/substrait/algebra.proto @@ -24,6 +24,12 @@ message RelCommon { Hint hint = 3; substrait.extensions.AdvancedExtension advanced_extension = 4; + // Optional plan-wide unique identifier for this relation. Required when + // this relation is the binding point for an OuterReference using + // rel_reference. Must be unique across all rels within a Plan. + // Must be >= 1 when set. + optional uint32 rel_anchor = 5; + // Direct indicates no change on presence and ordering of fields in the output message Direct {} @@ -1616,12 +1622,20 @@ message Expression { // incoming record type message RootReference {} - // A root reference for the outer relation's subquery + // A root reference for the outer relation's subquery. message OuterReference { - // number of subquery boundaries to traverse up for this field's reference - // - // This value must be >= 1 - uint32 steps_out = 1; + oneof outer_reference_type { + // Number of subquery boundaries to traverse up for this field's + // reference. Must be >= 1. + uint32 steps_out = 1; + + // References the plan-wide unique rel_anchor of the relation that this + // field reference is rooted on. Must match a rel_anchor defined on a + // RelCommon in the plan. Must be >= 1. Must be used instead of steps_out + // when this outer reference appears inside a relation shared via + // ReferenceRel and offset-based resolution via steps_out would be ambiguous. + uint32 rel_reference = 2; + } } // A reference to a lambda parameter within a lambda body expression. diff --git a/site/docs/expressions/field_references.md b/site/docs/expressions/field_references.md index 0dd052196..0f9123f5f 100644 --- a/site/docs/expressions/field_references.md +++ b/site/docs/expressions/field_references.md @@ -5,7 +5,7 @@ In Substrait, all fields are dealt with on a positional basis. Field names are o Field references can originate from different root types: - **RootReference**: References the incoming record from the relation -- **OuterReference**: References outer query records in correlated subqueries +- **OuterReference**: References outer query records in correlated subqueries, supporting either offset-based (`steps_out`) or id-based (`rel_reference`) resolution (see [Outer References](#outer-references)) - **Expression**: References the result of evaluating an expression - **LambdaParameterReference**: References lambda parameters within lambda body expressions (see [Lambda Expressions](lambda_expressions.md)) @@ -155,3 +155,60 @@ By default, when only a single field is selected from a struct, that struct is r * Should we support column reordering/positioning using a masked complex expression? (Right now, you can only mask things out.) +### Outer References + +Outer references allow expressions inside a subquery to access records from an enclosing relation. The `OuterReference` root type supports two mutually exclusive resolution strategies: + +#### `steps_out` (offset-based) + +`steps_out` resolves the reference by counting subquery boundaries upward (`steps_out >= 1`). This works correctly whenever the plan is a **tree**, i.e., when each relation has exactly one parent, the path to the binding relation can be uniquely determined via `steps_out`. + +#### `rel_reference` (id-based) + +`rel_reference` resolves the reference by referring to the binding relation via its plan-wide unique `RelCommon.rel_anchor`. The `rel_anchor` on the referenced relation must be set (>= 1) and unique across all relations in the plan. + +#### Coexistence rules + +Exactly one of `steps_out` or `rel_reference` must be set on each `OuterReference`. A single plan may contain outer references using different strategies (e.g., some using `steps_out` and others using `rel_reference`), as long as every individual reference is unambiguous. However, if any shared relation (via `ReferenceRel`) contains an unresolved outer reference, that reference **must** use `rel_reference`. + +#### When to use `rel_reference` + +`rel_reference` must be used instead of `steps_out` when an outer reference appears inside a relation shared via `ReferenceRel` and the shared relation can be reached through multiple paths with different subquery depths, making `steps_out` ambiguous. In this case, the same outer reference could require different `steps_out` values depending on which path is followed. + +For example, consider a plan with two nested scalar subqueries that share a common relation `x`. The outer reference to `tableA.a` lives inside `x`, which is reached via paths of different depth: + +``` +PlanRel.relations[0].rel: # let's call it 'x' +FilterRel(a > outer_ref(steps_out=1, tableA.a)) # steps_out 1 or 2? + └── ReadRel(tableB) + +PlanRel.relations[1].root: +ProjectRel # Correct binding for tableA.a for the outer reference tableA.a in x. +├── ReadRel(tableA) +└── Subquery.Scalar # Subquery (1) + └── SetRel(MINUS_PRIMARY) + ├── ProjectRel + | └── Subquery.Scalar # Subquery (2) + │ └── ReferenceRel(0) # Here steps_out=1 binds incorrectly, because tableA.a is actually two subquery boundaries out. + └── ReferenceRel(0) # Here steps_out=1 binds correctly, because tableA.a is one subquery boundary out. +``` + +The same shared relation `x` contains a single stored `steps_out=1` outer reference, but that value is only correct for one of its uses. The other use would need `steps_out=2`, so offset-based resolution is ambiguous. + +With `rel_reference`, both reference rels can unambiguously refer to the correct binding. + +``` +PlanRel.relations[0].rel: # let's call it 'x' +FilterRel(a > outer_ref(rel_reference=7, tableA.a)) + └── ReadRel(tableB) + +PlanRel.relations[1].root: +ProjectRel [rel_anchor=7] # Correct binding for tableA.a for the outer reference tableA.a in x. +├── ReadRel(tableA) +└── Subquery.Scalar # Subquery (1) + └── SetRel(MINUS_PRIMARY) + ├── ProjectRel + | └── Subquery.Scalar # Subquery (2) + │ └── ReferenceRel(0) # Reference 1: rel_reference = 7 + └── ReferenceRel(0) # Reference 2: rel_reference = 7 +``` diff --git a/site/docs/expressions/subqueries.md b/site/docs/expressions/subqueries.md index c71fee0dc..655bb024d 100644 --- a/site/docs/expressions/subqueries.md +++ b/site/docs/expressions/subqueries.md @@ -1,75 +1,98 @@ -# Subqueries - -Subqueries are scalar expressions comprised of another query. - -## Forms - -### Scalar - -Scalar subqueries are subqueries that return one row and one column. - -| Property | Description | Required | -| -------- | -------------- | -------- | -| Input | Input relation | Yes | - -### `IN` predicate - -An `IN` subquery predicate checks that the left expression is contained in the -right subquery. - -#### Examples - -```sql -SELECT * -FROM t1 -WHERE x IN (SELECT * FROM t2) -``` - -```sql -SELECT * -FROM t1 -WHERE (x, y) IN (SELECT a, b FROM t2) -``` - -| Property | Description | Required | -| -------- | ------------------------------------------- | -------- | -| Needles | Expressions whose existence will be checked | Yes | -| Haystack | Subquery to check | Yes | - -### Set predicates - -A set predicate is a predicate over a set of rows in the form of a subquery. - -`EXISTS` and `UNIQUE` are common SQL spellings of these kinds of predicates. - -| Property | Description | Required | -| --------- | ------------------------------------------ | -------- | -| Operation | The operation to perform over the set | Yes | -| Tuples | Set of tuples to check using the operation | Yes | - -### Set comparisons - -A set comparison subquery is a subquery comparison using `ANY` or `ALL` operations. - -#### Examples - -```sql -SELECT * -FROM t1 -WHERE x < ANY(SELECT y from t2) -``` - -| Property | Description | Required | -| --------------------- | ---------------------------------------------- | -------- | -| Reduction operation | The kind of reduction to use over the subquery | Yes | -| Comparison operation | The kind of comparison operation to use | Yes | -| Expression | Left-hand side expression to check | Yes | -| Subquery | Subquery to check | Yes | - - - -=== "Protobuf Representation" - - ```proto -%%% proto.message.Expression.Subquery %%% - ``` +# Subqueries + +Subqueries are scalar expressions comprised of another query. + +## Forms + +### Scalar + +Scalar subqueries are subqueries that return one row and one column. + +| Property | Description | Required | +| -------- | -------------- | -------- | +| Input | Input relation | Yes | + +### `IN` predicate + +An `IN` subquery predicate checks that the left expression is contained in the +right subquery. + +#### Examples + +```sql +SELECT * +FROM t1 +WHERE x IN (SELECT * FROM t2) +``` + +```sql +SELECT * +FROM t1 +WHERE (x, y) IN (SELECT a, b FROM t2) +``` + +| Property | Description | Required | +| -------- | ------------------------------------------- | -------- | +| Needles | Expressions whose existence will be checked | Yes | +| Haystack | Subquery to check | Yes | + +### Set predicates + +A set predicate is a predicate over a set of rows in the form of a subquery. + +`EXISTS` and `UNIQUE` are common SQL spellings of these kinds of predicates. + +| Property | Description | Required | +| --------- | ------------------------------------------ | -------- | +| Operation | The operation to perform over the set | Yes | +| Tuples | Set of tuples to check using the operation | Yes | + +### Set comparisons + +A set comparison subquery is a subquery comparison using `ANY` or `ALL` operations. + +#### Examples + +```sql +SELECT * +FROM t1 +WHERE x < ANY(SELECT y from t2) +``` + +| Property | Description | Required | +| --------------------- | ---------------------------------------------- | -------- | +| Reduction operation | The kind of reduction to use over the subquery | Yes | +| Comparison operation | The kind of comparison operation to use | Yes | +| Expression | Left-hand side expression to check | Yes | +| Subquery | Subquery to check | Yes | + + + +## Outer References in Subqueries + +Subqueries may contain *outer references*, which are field references that reach +outside the subquery boundary to access records from an enclosing relation. +The `OuterReference` root type provides two resolution fields: + +* `steps_out`: Resolves the reference by counting subquery boundaries + upward. This works correctly when the plan is a tree (each relation has a + single parent). + +* `rel_reference`: Resolves the reference by naming the binding relation + via its plan-wide unique `RelCommon.rel_anchor`. Must be used instead of + `steps_out` when an outer reference appears inside a relation shared via + + `ReferenceRel` and that shared relation can be reached through multiple + + paths with different subquery depths, making `steps_out` ambiguous. + + +Exactly one of these fields must be set. See +[Field References — Outer References](field_references.md#outer-references) +for details. + +=== "Protobuf Representation" + + ```proto +%%% proto.message.Expression.Subquery %%% + ``` diff --git a/site/docs/relations/common_fields.md b/site/docs/relations/common_fields.md index 37f0d4cf4..fa3316992 100644 --- a/site/docs/relations/common_fields.md +++ b/site/docs/relations/common_fields.md @@ -12,6 +12,10 @@ A relation which has a direct emit kind outputs the relation's output without re * Many relations (such as Project) by default provide as their output the list of all their input columns plus any generated columns as its output columns. Review each relation to understand its specific output default. +## Relation ID + +A relation may carry an optional plan-wide unique identifier (`rel_anchor`). When set, the value must be >= 1 and unique across all relations in the plan. This identifier is required when the relation is the binding point for an `OuterReference` that uses `rel_reference` resolution. See [Field References — Outer References](../expressions/field_references.md#outer-references) for details. + ## Hints Hints provide information that can improve performance but cannot be used to control the behavior. Table statistics, runtime constraints, name hints, and saved computations all fall into this category. diff --git a/site/docs/relations/logical_relations.md b/site/docs/relations/logical_relations.md index a6b0990aa..9d08e37f3 100644 --- a/site/docs/relations/logical_relations.md +++ b/site/docs/relations/logical_relations.md @@ -425,6 +425,10 @@ We could use the `ReferenceRel` to highlight the shared `A JOIN B` between the t One expressing `A JOIN B` (in position 0 in the plan), one using reference as follows: `ReferenceRel(0) JOIN C` and a third one doing `ReferenceRel(0) JOIN D`. This allows to avoid the redundancy of `A JOIN B`. +!!! note "Outer references in shared relations" + + When a shared relation contains an unresolved outer reference, the reference must use `rel_reference` instead of `steps_out`, because a `ReferenceRel` can be reached through multiple paths of different depths, making offset-based resolution ambiguous. See [Field References — Outer References](../expressions/field_references.md#outer-references) for details. + | Signature | Value | | -------------------- |---------------------------------------| | Inputs | 1 | diff --git a/site/examples/proto-textformat/field_reference/outer_reference_rel_reference.textproto b/site/examples/proto-textformat/field_reference/outer_reference_rel_reference.textproto new file mode 100644 index 000000000..30dbc8eb0 --- /dev/null +++ b/site/examples/proto-textformat/field_reference/outer_reference_rel_reference.textproto @@ -0,0 +1,39 @@ +# Outer reference using rel_reference (id-based resolution) +# +# Scenario: A shared relation (via ReferenceRel) contains a correlated +# filter that references a column from an enclosing relation. Because the +# shared relation can be reached through multiple paths of different +# depths, offset-based resolution (steps_out) would be ambiguous. +# rel_reference resolves this by naming the binding relation directly. +# +# Plan structure: +# +# PlanRel.relations[0].rel: (shared relation "x") +# FilterRel(col > outer_ref(rel_reference=7, position 0)) +# └── ReadRel(tableB) +# +# PlanRel.relations[1].root: +# ProjectRel [rel_anchor=7] <-- binding relation +# ├── ReadRel(tableA) +# └── Subquery.Scalar +# └── SetRel(MINUS_PRIMARY) +# ├── ProjectRel +# │ └── Subquery.Scalar +# │ └── ReferenceRel(0) (depth 2 from binding) +# └── ReferenceRel(0) (depth 1 from binding) +# +# Both ReferenceRel nodes point to the same shared relation, but they +# sit at different depths. rel_reference = 7 unambiguously resolves +# to the ProjectRel whose RelCommon.rel_anchor = 7, regardless of which +# path is taken. +# +# message Expression.FieldReference + +outer_reference: { + rel_reference: 7 # Refers to the relation with RelCommon.rel_anchor = 7 +} +direct_reference: { + struct_field: { + field: 0 # First column of the binding relation (tableA.a) + } +} diff --git a/site/examples/proto-textformat/field_reference/outer_reference_steps_out.textproto b/site/examples/proto-textformat/field_reference/outer_reference_steps_out.textproto new file mode 100644 index 000000000..a04d89f10 --- /dev/null +++ b/site/examples/proto-textformat/field_reference/outer_reference_steps_out.textproto @@ -0,0 +1,31 @@ +# Outer reference using steps_out (offset-based resolution) +# +# Scenario: A correlated scalar subquery where the inner filter references +# a column from the outer query. +# +# SQL equivalent: +# SELECT * +# FROM orders -- outer relation +# WHERE amount > ( +# SELECT AVG(amount) -- scalar subquery +# FROM orders AS o2 +# WHERE o2.customer_id = orders.customer_id -- outer reference +# ) +# +# The outer reference `orders.customer_id` is one subquery boundary up, +# so steps_out = 1. The referenced field is at position 0 (customer_id) +# in the outer relation's output. +# +# steps_out works here because the plan is a tree (each relation has +# exactly one parent), so the path to the binding relation is unambiguous. +# +# message Expression.FieldReference + +outer_reference: { + steps_out: 1 # One subquery boundary up to the enclosing relation +} +direct_reference: { + struct_field: { + field: 0 # First column of the outer relation (customer_id) + } +}