Skip to content

feat: support InSubquery and Exists in Projection expressions#21363

Open
crm26 wants to merge 5 commits intoapache:mainfrom
crm26:feat/in-subquery-projection
Open

feat: support InSubquery and Exists in Projection expressions#21363
crm26 wants to merge 5 commits intoapache:mainfrom
crm26:feat/in-subquery-projection

Conversation

@crm26
Copy link
Copy Markdown

@crm26 crm26 commented Apr 4, 2026

Summary

DecorrelatePredicateSubquery only handled InSubquery and Exists in Filter nodes. When they appeared in Projection expressions (CASE, COALESCE, bare boolean SELECT), the physical planner errored:

This feature is not implemented: Physical plan does not support logical expression InSubquery(...)

This adds a Projection match arm following the same pattern as ScalarSubqueryToJoin's Projection handler.

What's fixed

All of these now work:

-- CASE with IN subquery
SELECT CASE WHEN id IN (SELECT id FROM t2) THEN 'yes' ELSE 'no' END FROM t1

-- EXISTS in CASE
SELECT CASE WHEN EXISTS (SELECT 1 FROM t2 WHERE t2.id = t1.id) THEN 'found' ELSE 'missing' END FROM t1

-- NOT IN / NOT EXISTS
SELECT CASE WHEN id NOT IN (SELECT id FROM t2) THEN 'keep' ELSE 'drop' END FROM t1

-- Bare boolean
SELECT id IN (SELECT id FROM t2) AS is_match FROM t1

-- Multiple subqueries in one expression
SELECT CASE WHEN id IN (SELECT id FROM t2) AND EXISTS (SELECT 1 FROM t3 WHERE t3.id = t1.id) THEN 'both' ELSE 'neither' END FROM t1

-- COALESCE with IN subquery
SELECT COALESCE(CASE WHEN id IN (SELECT id FROM t2) THEN 'found' END, 'missing') FROM t1

Design

  • Follows ScalarSubqueryToJoin's established Projection handler pattern
  • Reuses existing rewrite_inner_subqueries + mark_join — no new decorrelation logic
  • Bails out when decorrelation fails (e.g., correlated subquery with LIMIT) — returns original plan unchanged
  • LeftMark joins already handled by all downstream optimizer rules (verified)

Tests

  • 8 sqllogictest cases (in_subquery_projection.slt)
  • Plan-shape unit test for the bail-out path (undecorrelatable subquery in Projection)
  • 58 optimizer unit tests passing, clippy clean

Previously, `DecorrelatePredicateSubquery` only handled InSubquery
and Exists nodes in Filter (WHERE/HAVING). When they appeared in
Projection expressions (CASE WHEN, COALESCE, bare boolean SELECT),
the physical planner errored with "Physical plan does not support
logical expression InSubquery(...)".

This adds a Projection match arm following the same pattern as
ScalarSubqueryToJoin's Projection handler. InSubquery and Exists
in Projection expressions are decorrelated into LeftMark joins,
with the mark column aliased to preserve the original schema.

When decorrelation fails (e.g., correlated subquery with LIMIT),
the plan is returned unchanged — matching ScalarSubqueryToJoin's
bail-out semantics.

Patterns now supported:
- CASE WHEN id IN (SELECT ...) THEN ... ELSE ... END
- CASE WHEN EXISTS (SELECT ...) THEN ... END
- NOT IN / NOT EXISTS in CASE
- id IN (SELECT ...) as bare boolean in SELECT
- Correlated IN/EXISTS in expressions
- Multiple subqueries in one expression
- COALESCE with IN subquery

Includes sqllogictest coverage (8 test cases) and plan-shape
unit test for the bail-out path.
@github-actions github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Apr 4, 2026
@neilconway
Copy link
Copy Markdown
Contributor

@crm26 Thanks for this, I'll take a look shortly! Can you fix the test and fmt CI failures?

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@neilconway neilconway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks great!

One minor quibble: I notice the name of the file and various identifiers refer to "Predicate", which will no longer be accurate. The common thread is that we are rewriting only IN and EXISTS subqueries, so maybe some name involving that -- like ExistenceSubqueryToJoin or ExistsInSubqueryToJoin. But if we're going to change that, let's do it in a followup PR.

- Per-expression early bail-out: check has_subquery(&rewritten) immediately
  after each rewrite instead of processing all exprs then checking at end.
  Preserves all-or-nothing semantics with earlier termination.
- Add "Optimization:" prefix to alias preservation comment to clarify it is
  not required for correctness.
- Add test projection_mixed_decorrelatable_and_non_bails_out covering one
  decorrelatable + one non-decorrelatable subquery in the same projection
  (exercises the partial-rewrite-then-bail path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@crm26
Copy link
Copy Markdown
Author

crm26 commented Apr 7, 2026

Thanks for the thorough review, @neilconway! Addressed all three points in the latest push (commit 12a12bc):

  1. Optimization: prefix — Added to the alias preservation comment to make it clear it's not required for correctness.

  2. Per-expression early bail-out — Refactored the loop to call has_subquery(&rewritten) immediately after each rewrite. Now bails out on first failure instead of processing all expressions then checking at the end. Semantics are still all-or-nothing (same as ScalarSubqueryToJoin).

  3. Mixed decorrelatable + non-decorrelatable test — Added projection_mixed_decorrelatable_and_non_bails_out. Listed the decorrelatable case_ok first in the projection so it would rewrite successfully before case_bad triggers the bail-out — this specifically exercises the partial-rewrite-then-discard path to ensure case_ok doesn't leak a partial rewrite.

Validation: 59 decorrelate tests pass (was 58), full optimizer suite (665 + 26 + 5) passes, fmt + clippy -D warnings clean, subquery.slt passes.

The test at predicates.slt:845 asserted that `NULL IN (SELECT ...)` in a
projection would fail with "Physical plan does not support logical expression
InSubquery". With decorrelation of InSubquery in projections now supported,
the query correctly returns (false, true) instead of erroring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@crm26
Copy link
Copy Markdown
Author

crm26 commented Apr 7, 2026

Found and fixed the CI failure, pushed as 9288448.

The cargo test (amd64) / (macos-aarch64) failures came from a stale negative test in predicates.slt:845 that asserted NULL IN (SELECT ... FROM empty) in a projection would error with "Physical plan does not support logical expression InSubquery". With decorrelation now supported in projections, the query correctly returns (false, true) — updated the test to expect the correct result.

My earlier validation gap: I only ran cargo test --test sqllogictests -- subquery.slt locally and didn't catch the stale test in predicates.slt. My fault — should have run the full slt suite. I've now run:

  • cargo test -p datafusion-optimizer — 665 + 26 + 5 pass
  • cargo test --test sqllogictests (all 422 files) — all pass
  • cargo test --test sqllogictests -- --include-sqlite (1017 files) — all pass
  • cargo test -p datafusion-cli --features datafusion/extended_tests — 25/25 pass
  • cargo fmt --check + cargo clippy -D warnings — clean

CI on the new commit is currently in action_required state — could you kick off the workflow run when you get a chance?

On the rename suggestion ("Predicate" → "ExistsInSubquery") — happy to do that, agreed a followup PR makes sense to keep this one focused.

@neilconway
Copy link
Copy Markdown
Contributor

@crm26 Thanks for iterating on this!

The comment I was suggesting you add "Optimization:" to was actually a different one :) My suggestion was attached to the "// Skip if no predicate subqueries in any projection expression" check at the top of the "Projection" match arm.

I noticed you didn't address my suggestion to simplify the cloning / early-return code path; let me know if what I'm suggesting isn't clear.

I don't have permissions to run CI myself, but perhaps @alamb or someone else can help.

BTW I personally find it helpful to "Resolve comment" on review comments that you believe have been addressed, so we can more easily see what remains open.

Previous push addressed the per-expression bail-out correctly but got
two items wrong:
- Added "Optimization:" prefix to the alias preservation comment instead
  of the skip check at the top of the Projection arm (the one neilconway
  actually pointed at).
- Did not apply the ScalarSubqueryToJoin-style cloning simplification.

This commit fixes both:
- Moved the "Optimization:" prefix to the correct line and removed the
  stray prefix on the alias comment.
- Replaced Arc::clone + Arc::unwrap_or_clone with
  projection.input.as_ref().clone(), mirroring ScalarSubqueryToJoin.
- Dropped the original_input variable and the
  Projection::try_new_with_schema reconstruction on bail-out; the
  bail-out now returns Ok(Transformed::no(LogicalPlan::Projection(
  projection))) directly since projection is still fully owned.
- Removed the now-unused Projection import.

The loop clones projection.expr upfront so it can consume owned
expressions without holding a borrow on projection, which lets the
bail-out move projection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@crm26
Copy link
Copy Markdown
Author

crm26 commented Apr 8, 2026

@neilconway You were right — I got two of the three wrong in the last push. Apologies for the noise.

Pushed fixes in 93341e033:

  1. "Optimization:" prefix — I put it on the alias preservation comment last time instead of the skip check you actually pointed at. Moved it to the correct line (// Optimization: skip if no predicate subqueries in any projection expression at the top of the Projection match arm) and cleaned up the stray prefix on the alias comment.

  2. Cloning / early-return simplification — I skipped this in the last push entirely. Now mirrors ScalarSubqueryToJoin:

    • let mut cur_input = projection.input.as_ref().clone();
    • Dropped original_input and Projection::try_new_with_schema
    • Bail-out just returns Ok(Transformed::no(LogicalPlan::Projection(projection))) directly
    • Removed the now-unused Projection import
    • The one structural difference from ScalarSubqueryToJoin: I clone projection.expr upfront (let projection_exprs = projection.expr.clone();) so the loop can consume owned expressions without borrowing projection, which lets the bail-out move it. A comment at the clone site explains why. Happy to restructure if you'd prefer a different approach.
  3. Per-expr bail-out and mixed decorrelatable+non-decorrelatable test from the previous push are unchanged.

Validation I ran locally this time (full suite, not just the targeted file):

  • cargo fmt --check + cargo clippy -p datafusion-optimizer -- -D warnings — clean
  • cargo test -p datafusion-optimizer — 665 + 26 + 5 pass (59 decorrelate tests)
  • cargo test --test sqllogictests — all 422 files pass
  • cargo test --profile ci-optimized --test sqllogictests --features backtrace -- --include-sqlite — all 1017 files pass
  • cargo test -p datafusion-cli --features datafusion/extended_tests — 61 + 7 + 25 pass

Will resolve the review threads in the UI once CI confirms. The "Predicate" → "ExistsInSubquery" rename is noted for a followup PR as you suggested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants