SNOW-3485482: Eliminate unnecessary SELECT * from joins by sfc-gh-joshi · Pull Request #4248 · snowflakedb/snowpark-python

sfc-gh-joshi · 2026-06-02T23:01:06Z

Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

Fixes SNOW-3485482
Fill out the following pre-review checklist:
- I am adding a new automated test(s) to verify correctness of my new code
  - If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
- I am adding new logging messages
- I am adding a new telemetry message
- I am adding new credentials
- I am adding a new dependency
- If this is a new feature/behavior, I'm adding the Local Testing parity changes.
- I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
- If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
Please describe how your code solves the related issue.

Removes unnecessary SELECT * layers from repeated join operations. Truncated example SQL from Python code df1.join(df2.join(df3, on="key", how="inner"), on="key", how="inner"), edited for readability:

Current main:

SELECT * FROM (
  SELECT * FROM (
    (
      SELECT * FROM (
        (
          SELECT "KEY", "VAL_BASE"
          FROM (
            SELECT $1 AS "KEY", $2 AS "VAL_BASE" FROM table_name
        ) AS SNOWPARK_LEFT
        INNER JOIN (
          SELECT "KEY", "VAL_1" FROM (
            SELECT $1 AS "KEY", $2 AS "VAL_1" FROM table_name
        ) AS SNOWPARK_RIGHT
          USING (key)
      )
    ) AS SNOWPARK_LEFT
    INNER JOIN (
      SELECT "KEY", "VAL_2" FROM (
        SELECT $1 AS "KEY", $2 AS "VAL_2" FROM table_name
    ) AS SNOWPARK_RIGHT
      USING (key)
  )
)
WHERE "VAL_BASE" > 100
ORDER BY "VAL_BASE" DESC NULLS LAST
LIMIT 5

New:

SELECT * FROM (
  SELECT * FROM (
    (
      SELECT "KEY", "VAL_BASE" FROM table_name
    ) AS SNOWPARK_LEFT
    INNER JOIN (
      SELECT "KEY", "VAL_1" FROM table_name
    ) AS SNOWPARK_RIGHT
      USING (key)
    INNER JOIN (
      SELECT "KEY", "VAL_2"FROM table_name
    ) AS SNOWPARK_TEMP_TABLE_Y7YO05WRW5
      USING (key)
  )
)
WHERE "VAL_BASE" > 100
ORDER BY "VAL_BASE" DESC NULLS LAST
LIMIT 5

The outer-most layers of SELECT * cannot be removed, but this PR prevents the addition of a new layer of SELECT * wrapping on every additional chained join operatio. For a sequence of 10 joins, this PR reduces the query text from 3752 characters to 3528 (-6%), but does not meaningfully affect SQL compilation time.

codecov-commenter · 2026-06-02T23:41:45Z

Codecov Report

❌ Patch coverage is 92.59259% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.49%. Comparing base (8a140ef) to head (0105d8d).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
...lake/snowpark/_internal/analyzer/analyzer_utils.py	88.88%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4248      +/-   ##
==========================================
- Coverage   95.49%   95.49%   -0.01%     
==========================================
  Files         171      171              
  Lines       44184    44219      +35     
  Branches     7534     7540       +6     
==========================================
+ Hits        42195    42228      +33     
  Misses       1226     1226              
- Partials      763      765       +2

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

sfc-gh-aling

The outer-most layers of SELECT * cannot be removed
why we cannot -- where does the nested select come from?
I think the optimal is we just have one select * for this case

SELECT * FROM (
  (
    SELECT "KEY", "VAL_BASE" FROM table_name
  ) AS SNOWPARK_LEFT
  INNER JOIN (
    SELECT "KEY", "VAL_1" FROM table_name
  ) AS SNOWPARK_RIGHT
    USING ("KEY")
  INNER JOIN (
    SELECT "KEY", "VAL_2" FROM table_name
  ) AS SNOWPARK_TEMP_TABLE_Y7YO05WRW5
    USING ("KEY")
)
WHERE "VAL_BASE" > 100
ORDER BY "VAL_BASE" DESC NULLS LAST
LIMIT 5;

sfc-gh-aling · 2026-06-04T20:50:48Z

+    # Though it is technically less efficient than constructing the join sub-queries
+    # without the SELECT in the first place, the structure of our SQL processing code
+    # top-level projections to be wrapped by a select.
+    unwrapped_left = _unwrap_select_star_from(left)


dumb question, why we do not also unwrap right node

A sequence of chained operations (df1.join(df2).join(df3)...) would only nest operations on the left side, while nesting on the right side would occur if you did df1.join(df2.join(df3)...). Flattening only one side of the join is significantly simpler; here's Cursor's analysis:

The 2nd concern I raised was: "Merging two join expressions into one FROM clause is significantly more complex — you'd need to handle alias collisions from both sides and potentially break column resolution semantics."

Here's what that means concretely. Consider:

left_result = a.join(b, "key") # produces: A JOIN B right_result = c.join(d, "key") # produces: C JOIN D final = left_result.join(right_result, "key")

With left-only unwrapping, we produce:

SELECT * FROM ( (A) AS L1 INNER JOIN (B) AS R1 USING (key) INNER JOIN (SELECT * FROM ((C) AS L2 INNER JOIN (D) AS R2 USING (key))) AS R3 USING (key) )

The right operand keeps its SELECT * FROM (...) shell, acting as a self-contained derived table with its own alias namespace.

If we also unwrapped the right side, we'd need to merge both join expressions into a single flat FROM clause:

SELECT * FROM ( (A) AS L1 INNER JOIN (B) AS R1 USING (key) INNER JOIN (C) AS L2 INNER JOIN (D) AS R2 USING (key) -- ← problematic USING (key) -- which USING clause binds to which JOIN? )

This creates two problems:

Join precedence ambiguity — SQL evaluates joins left-to-right. Splicing in C JOIN D without a subquery boundary changes how the USING/ON clauses bind. The condition between left_result and right_result needs to apply to the combined result of C JOIN D, but without the subquery wrapper, it would syntactically bind only to D. You'd need explicit parenthesization of the right-side join group, which Snowflake doesn't support as a FROM item (the [TOK_JOIN] error we hit earlier).

Alias collisions — Both sides independently generate aliases (SNOWPARK_LEFT, SNOWPARK_RIGHT, or random names). If both join expressions are spliced into the same FROM clause, their aliases coexist in the same namespace. We'd need to rewrite aliases within the right-side join expression to avoid conflicts — which means parsing and modifying already-generated SQL strings, a fragile and error-prone operation.

By keeping the SELECT * FROM (...) wrapper on the right operand, it acts as an opaque derived table: its internal aliases are scoped, its join conditions bind correctly, and it integrates cleanly as a single table reference in the outer FROM clause.

sfc-gh-joshi · 2026-06-04T22:42:15Z

The outer-most layers of SELECT * cannot be removed
why we cannot -- where does the nested select come from?
I think the optimal is we just have one select * for this case

Here's the exact code I used for the example query:

base_data = [(1, 100), (2, 200), (3, 300), (4, 400), (5, 500), (6, 600), (7, 700), (8, 800)]

base = session.create_dataframe(base_data, schema=["key", "val_base"])

dims = []
for i in range(1, 10):
    dim = session.create_dataframe(base_data, schema=["key", f"val_{i}"])
    dims.append(dim)

df = base
for dim in dims:
    df = df.join(dim, on="key", how="inner")

df = df.filter(col("val_base") > 100)
df = df.order_by(col("val_base").desc())
df = df.limit(5)

One layer of SELECT * is from the final join, and one layer is added by the filter/order by/limit operations, independent on the depth of the join. I think it would be more complicated to get rid of all of these at once, and this PR focuses on ensuring that repeated joins don't result in repeated SELECT *s.

sfc-gh-yuwang · 2026-06-09T21:26:35Z

+        _SELECT_STAR_FROM_SUFFIX
+    ):
+        inner = sql[len(_SELECT_STAR_FROM_PREFIX) : -len(_SELECT_STAR_FROM_SUFFIX)]
+        if inner.startswith(LEFT_PARENTHESIS):


cursor give me this:
Medium (behavioral/perf): flattening may stop in trace-SQL mode.
In analyzer_utils._unwrap_select_star_from, unwrapping requires the inner text to start with "(". But when UUID trace comments are injected, inner SQL may start with a comment instead, so chained joins might stop flattening after one level. This is likely only when SQL trace-to-dataframe context is enabled.
this sounds like a valid assumption, could this happen in real user workload?

Good catch. I've adjusted the code to strip comments before checking for parentheses.

sfc-gh-aling

LGTM! please ensure the snowaprk connect test pass before merging

sfc-gh-joshi force-pushed the joshi-SNOW-3485482-unnest-join branch from a30ead9 to 12a15d5 Compare June 3, 2026 23:32

sfc-gh-joshi marked this pull request as ready for review June 3, 2026 23:45

sfc-gh-joshi requested review from a team as code owners June 3, 2026 23:45

sfc-gh-joshi requested review from sfc-gh-aling, sfc-gh-jrose and sfc-gh-yuwang June 3, 2026 23:45

sfc-gh-aling reviewed Jun 4, 2026

View reviewed changes

sfc-gh-joshi added 5 commits June 8, 2026 14:42

test for ci

2456ede

cleanup

5ffc818

changelog

3f2c561

propagate through plan

2738e25

fix simplifier test

80cc298

sfc-gh-joshi force-pushed the joshi-SNOW-3485482-unnest-join branch from c9cccc0 to 80cc298 Compare June 8, 2026 21:52

sfc-gh-joshi added 3 commits June 8, 2026 17:13

fix mock

36187df

fix type checking error

e347e78

propagate join flag in copy

fd1e60d

sfc-gh-yuwang reviewed Jun 9, 2026

View reviewed changes

catch UUID comments

0105d8d

sfc-gh-aling approved these changes Jun 10, 2026

View reviewed changes

sfc-gh-yuwang approved these changes Jun 10, 2026

View reviewed changes

sfc-gh-joshi merged commit e22bc94 into main Jun 10, 2026
32 checks passed

sfc-gh-joshi deleted the joshi-SNOW-3485482-unnest-join branch June 10, 2026 23:15

github-actions Bot locked and limited conversation to collaborators Jun 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SNOW-3485482: Eliminate unnecessary SELECT * from joins#4248

SNOW-3485482: Eliminate unnecessary SELECT * from joins#4248
sfc-gh-joshi merged 9 commits into
mainfrom
joshi-SNOW-3485482-unnest-join

sfc-gh-joshi commented Jun 2, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jun 2, 2026 •

edited

Loading

Uh oh!

sfc-gh-aling left a comment

Uh oh!

Uh oh!

Uh oh!

sfc-gh-aling Jun 4, 2026

Uh oh!

sfc-gh-joshi Jun 4, 2026

Uh oh!

sfc-gh-joshi commented Jun 4, 2026

Uh oh!

sfc-gh-yuwang Jun 9, 2026

Uh oh!

sfc-gh-joshi Jun 9, 2026

Uh oh!

sfc-gh-aling left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

sfc-gh-joshi commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

sfc-gh-aling left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

sfc-gh-aling Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

sfc-gh-joshi Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

sfc-gh-joshi commented Jun 4, 2026

Uh oh!

sfc-gh-yuwang Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

sfc-gh-joshi Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

sfc-gh-aling left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

sfc-gh-joshi commented Jun 2, 2026 •

edited

Loading

codecov-commenter commented Jun 2, 2026 •

edited

Loading