Skip to content

Recursive CTE Nullability Handling Should Preserve Logical Schema Without Requiring SQL Rewrites #22034

@kosiew

Description

@kosiew

Summary

#21912 introduced reusable execution-layer schema alignment helpers and updated recursive CTE execution to avoid ad hoc schema rebinding.

However, it also rewrote an existing SLT query in order for the test to pass:

diff --git a/datafusion/sqllogictest/test_files/cte.slt b/datafusion/sqllogictest/test_files/cte.slt
index e9c1c0245..d13e0d4f0 100644
--- a/datafusion/sqllogictest/test_files/cte.slt
+++ b/datafusion/sqllogictest/test_files/cte.slt
@@ -699,7 +699,7 @@ WITH RECURSIVE region_sales AS (
     SELECT
         s.salesperson_id AS salesperson_id,
         SUM(s.sale_amount) AS amount,
-        0 as level
+        SUM(0) as level
     FROM
         sales s
     GROUP BY

The rewrite changes a naturally non-null expression:

0 AS level

into a nullable aggregate expression:

SUM(0) AS level

This masks a recursive CTE nullability issue rather than fixing it.

Recursive CTE handling should not require rewriting valid SQL solely to satisfy schema alignment requirements.

Problem

Recursive self-reference/work-table columns need to be treated conservatively as nullable during recursive-term planning and optimization.

Without this, optimizer rules may incorrectly rely on anchor-only non-nullability assumptions.

For example, this query should execute correctly:

WITH RECURSIVE t(a, b) AS (
    SELECT 0 AS a, 0 AS b
    UNION ALL
    SELECT b AS a, CAST(NULL AS INT) AS b FROM t WHERE a IS NOT NULL
)
SELECT * FROM t;

If recursive self-reference columns are incorrectly planned as non-nullable, optimizer rules may incorrectly eliminate:

a IS NOT NULL

leading to unsound recursive behavior.

Expected Behavior

Recursive CTE handling should:

  • allow valid SQL such as 0 AS level without requiring rewrites like SUM(0);
  • plan recursive self-reference/work-table columns conservatively as nullable;
  • preserve the declared/logical recursive CTE schema;
  • derive final recursive CTE output nullability correctly from both static and recursive branches;
  • avoid unnecessary schema-alignment casts and projections.

Proposed Behavior

Recursive self-reference/work-table schemas should be nullable before recursive-term planning:

let work_table_schema = nullable_schema(static_plan.schema().inner());

After recursive-term planning, recursive CTE output nullability should be derived from both branches, for example:

static_field.is_nullable() || recursive_field.is_nullable()

rather than forcing all recursive CTE output fields to become nullable.

This preserves correctness while avoiding unnecessary logical/physical plan churn and allowing existing valid SQL queries to continue working unchanged.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions