Skip to content

Unified outer reference handling with common subexpression #1024

@yongchul

Description

@yongchul

Problem

The current outer reference in Substrait is offset based. To resolve the outer reference, it needs to walk up the tree to find the binding Rel. There can be multiple layers of nestings, thus the implementer must be aware what does provide such boundary. Today, the boundary is set by Subquery, and it is the only construct introducing such outer reference boundary.

The main problem is this assumes tree -- you have a single parent. If you have a common subexpressions, or a graph, this scheme no longer works any more.

The following motivating example uses a lateral join (#973 ), SQL standard feature allowing outer references in a table reference, for the sake of demonstration but the same problem can be demonstrated in scalar subqueries or set predicates.

Example

Consider following SQL.

WITH x as (
  select a,b
  from tableA
  where b = @p1
),
y as (
  select a, count(*) as b
  from x
  group by a
  order by a limit 10
),
z as (
  select a, count(y.b) as b
  from x left outer lateral join y on x.a = y.a
  where y.a > b
  group by a
)
select *
from tableB LATERAL JOIN (
    SELECT a, b
    FROM y
    WHERE a > tableB.a
    UNION
    SELECT a, b
    FROM z
    WHERE a > tableB.a
)
ON true

A rather straight forward translation to Substrait Rel is following.

graph TD;
  A["FilterRel(b=@p1)"] --- B["ReadRel(tableA)"];
  D["FetchRel(10)"] --- E["AggRel(a, count(*))"] --- A
  F["AggRel(a,count(y.b))"] --- G["FilterRel(y.a > b)"] --- H[LateralJoin2] --- A
  H --- D
  LateralJoin1 --- ReadRel(tableB)
  LateralJoin1 --- Union --- X["FilterRel(a > tableB.a)"] --- D
  Union --- Y["FilterRel(a > tableB.a)"] --- F
Loading

After simple predicate push down, the plan will look like following. Note that the common correlated predicate a > tableB.a is pushed all the way down to ReadRel(tableA) (see the node with !!!).

graph TD;
  A["!!! FilterRel(b=@p1 AND a > tableB.a) !!!"] --- B["ReadRel(tableA)"];
  D["FetchRel(10)"] --- E["AggRel(a, count(*))"] ---|step=2| A
  F["AggRel(a,count(y.b))"] --- G["FilterRel(y.a > b)"] --- H[LateralJoin2] ---|step=1| A
  H --- D
  LateralJoin1 --- ReadRel(tableB)
  LateralJoin1 --- Union --- D
  Union ---  F
Loading

tableB.a is an outer reference introduced by the top-most lateral join. The problem is, you have multiple paths to reach to the binding point, and the depth or steps are different. Thus, the binding is ambiguous or not well-defined in such a case.

Proposed Solution 1: correlation id (preferred)

  • Each rel introducing correlation have correlation id.
  • The correlation id is a new root in field reference.

In this model, there is no more ambiguity. period.

Generalized rel id

The correlation id needs to be introduced practically all the rels with expressions. Thus, it is easily becoming just a generic rel_id as there are majority of rels have references.

Another reason to consider more general "rel_id" is technically, the id needs to be unique within subject paths of outer reference resolution. This definition is mouthful and probably hard to understand than simple "plan-wide" unique.

Proposed Solution 2: fix the path on resolution

  • Fix the outer reference resolution path. Follow the left most path in DFS order.

This is a workable solution but we need to define or declare more stuff (e.g., what does it mean the "left" in substrait plan with shared expression reference) and potentially more burden to the implementation.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions