Skip to content

SNOW-3485482: Eliminate unnecessary SELECT * from joins#4248

Merged
sfc-gh-joshi merged 9 commits into
mainfrom
joshi-SNOW-3485482-unnest-join
Jun 10, 2026
Merged

SNOW-3485482: Eliminate unnecessary SELECT * from joins#4248
sfc-gh-joshi merged 9 commits into
mainfrom
joshi-SNOW-3485482-unnest-join

Conversation

@sfc-gh-joshi

@sfc-gh-joshi sfc-gh-joshi commented Jun 2, 2026

Copy link
Copy Markdown
Contributor
  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-3485482

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  3. Please describe how your code solves the related issue.

Removes unnecessary SELECT * layers from repeated join operations. Truncated example SQL from Python code df1.join(df2.join(df3, on="key", how="inner"), on="key", how="inner"), edited for readability:

Current main:

SELECT * FROM (
  SELECT * FROM (
    (
      SELECT * FROM (
        (
          SELECT "KEY", "VAL_BASE"
          FROM (
            SELECT $1 AS "KEY", $2 AS "VAL_BASE" FROM table_name
        ) AS SNOWPARK_LEFT
        INNER JOIN (
          SELECT "KEY", "VAL_1" FROM (
            SELECT $1 AS "KEY", $2 AS "VAL_1" FROM table_name
        ) AS SNOWPARK_RIGHT
          USING (key)
      )
    ) AS SNOWPARK_LEFT
    INNER JOIN (
      SELECT "KEY", "VAL_2" FROM (
        SELECT $1 AS "KEY", $2 AS "VAL_2" FROM table_name
    ) AS SNOWPARK_RIGHT
      USING (key)
  )
)
WHERE "VAL_BASE" > 100
ORDER BY "VAL_BASE" DESC NULLS LAST
LIMIT 5

New:

SELECT * FROM (
  SELECT * FROM (
    (
      SELECT "KEY", "VAL_BASE" FROM table_name
    ) AS SNOWPARK_LEFT
    INNER JOIN (
      SELECT "KEY", "VAL_1" FROM table_name
    ) AS SNOWPARK_RIGHT
      USING (key)
    INNER JOIN (
      SELECT "KEY", "VAL_2"FROM table_name
    ) AS SNOWPARK_TEMP_TABLE_Y7YO05WRW5
      USING (key)
  )
)
WHERE "VAL_BASE" > 100
ORDER BY "VAL_BASE" DESC NULLS LAST
LIMIT 5

The outer-most layers of SELECT * cannot be removed, but this PR prevents the addition of a new layer of SELECT * wrapping on every additional chained join operatio. For a sequence of 10 joins, this PR reduces the query text from 3752 characters to 3528 (-6%), but does not meaningfully affect SQL compilation time.

@codecov-commenter

codecov-commenter commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.59259% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.49%. Comparing base (8a140ef) to head (0105d8d).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...lake/snowpark/_internal/analyzer/analyzer_utils.py 88.88% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4248      +/-   ##
==========================================
- Coverage   95.49%   95.49%   -0.01%     
==========================================
  Files         171      171              
  Lines       44184    44219      +35     
  Branches     7534     7540       +6     
==========================================
+ Hits        42195    42228      +33     
  Misses       1226     1226              
- Partials      763      765       +2     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi-SNOW-3485482-unnest-join branch from a30ead9 to 12a15d5 Compare June 3, 2026 23:32
@sfc-gh-joshi sfc-gh-joshi marked this pull request as ready for review June 3, 2026 23:45
@sfc-gh-joshi sfc-gh-joshi requested review from a team as code owners June 3, 2026 23:45

@sfc-gh-aling sfc-gh-aling left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The outer-most layers of SELECT * cannot be removed
why we cannot -- where does the nested select come from?
I think the optimal is we just have one select * for this case

SELECT * FROM (
  (
    SELECT "KEY", "VAL_BASE" FROM table_name
  ) AS SNOWPARK_LEFT
  INNER JOIN (
    SELECT "KEY", "VAL_1" FROM table_name
  ) AS SNOWPARK_RIGHT
    USING ("KEY")
  INNER JOIN (
    SELECT "KEY", "VAL_2" FROM table_name
  ) AS SNOWPARK_TEMP_TABLE_Y7YO05WRW5
    USING ("KEY")
)
WHERE "VAL_BASE" > 100
ORDER BY "VAL_BASE" DESC NULLS LAST
LIMIT 5;

Comment thread src/snowflake/snowpark/_internal/analyzer/analyzer_utils.py
Comment thread src/snowflake/snowpark/_internal/analyzer/analyzer_utils.py
# Though it is technically less efficient than constructing the join sub-queries
# without the SELECT in the first place, the structure of our SQL processing code
# top-level projections to be wrapped by a select.
unwrapped_left = _unwrap_select_star_from(left)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dumb question, why we do not also unwrap right node

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A sequence of chained operations (df1.join(df2).join(df3)...) would only nest operations on the left side, while nesting on the right side would occur if you did df1.join(df2.join(df3)...). Flattening only one side of the join is significantly simpler; here's Cursor's analysis:

The 2nd concern I raised was: "Merging two join expressions into one FROM clause is significantly more complex — you'd need to handle alias collisions from both sides and potentially break column resolution semantics."

Here's what that means concretely. Consider:

left_result = a.join(b, "key")     # produces: A JOIN B
right_result = c.join(d, "key")    # produces: C JOIN D
final = left_result.join(right_result, "key")

With left-only unwrapping, we produce:

SELECT * FROM (
  (A) AS L1 INNER JOIN (B) AS R1 USING (key)
  INNER JOIN (SELECT * FROM ((C) AS L2 INNER JOIN (D) AS R2 USING (key))) AS R3 USING (key)
)

The right operand keeps its SELECT * FROM (...) shell, acting as a self-contained derived table with its own alias namespace.

If we also unwrapped the right side, we'd need to merge both join expressions into a single flat FROM clause:

SELECT * FROM (
  (A) AS L1 INNER JOIN (B) AS R1 USING (key)
  INNER JOIN (C) AS L2 INNER JOIN (D) AS R2 USING (key)   -- ← problematic
  USING (key)  -- which USING clause binds to which JOIN?
)

This creates two problems:

  1. Join precedence ambiguity — SQL evaluates joins left-to-right. Splicing in C JOIN D without a subquery boundary changes how the USING/ON clauses bind. The condition between left_result and right_result needs to apply to the combined result of C JOIN D, but without the subquery wrapper, it would syntactically bind only to D. You'd need explicit parenthesization of the right-side join group, which Snowflake doesn't support as a FROM item (the [TOK_JOIN] error we hit earlier).

  2. Alias collisions — Both sides independently generate aliases (SNOWPARK_LEFT, SNOWPARK_RIGHT, or random names). If both join expressions are spliced into the same FROM clause, their aliases coexist in the same namespace. We'd need to rewrite aliases within the right-side join expression to avoid conflicts — which means parsing and modifying already-generated SQL strings, a fragile and error-prone operation.

By keeping the SELECT * FROM (...) wrapper on the right operand, it acts as an opaque derived table: its internal aliases are scoped, its join conditions bind correctly, and it integrates cleanly as a single table reference in the outer FROM clause.

@sfc-gh-joshi

Copy link
Copy Markdown
Contributor Author

The outer-most layers of SELECT * cannot be removed
why we cannot -- where does the nested select come from?
I think the optimal is we just have one select * for this case

Here's the exact code I used for the example query:

base_data = [(1, 100), (2, 200), (3, 300), (4, 400), (5, 500), (6, 600), (7, 700), (8, 800)]

base = session.create_dataframe(base_data, schema=["key", "val_base"])

dims = []
for i in range(1, 10):
    dim = session.create_dataframe(base_data, schema=["key", f"val_{i}"])
    dims.append(dim)

df = base
for dim in dims:
    df = df.join(dim, on="key", how="inner")

df = df.filter(col("val_base") > 100)
df = df.order_by(col("val_base").desc())
df = df.limit(5)

One layer of SELECT * is from the final join, and one layer is added by the filter/order by/limit operations, independent on the depth of the join. I think it would be more complicated to get rid of all of these at once, and this PR focuses on ensuring that repeated joins don't result in repeated SELECT *s.

@sfc-gh-joshi sfc-gh-joshi force-pushed the joshi-SNOW-3485482-unnest-join branch from c9cccc0 to 80cc298 Compare June 8, 2026 21:52
_SELECT_STAR_FROM_SUFFIX
):
inner = sql[len(_SELECT_STAR_FROM_PREFIX) : -len(_SELECT_STAR_FROM_SUFFIX)]
if inner.startswith(LEFT_PARENTHESIS):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cursor give me this:
Medium (behavioral/perf): flattening may stop in trace-SQL mode.
In analyzer_utils._unwrap_select_star_from, unwrapping requires the inner text to start with "(". But when UUID trace comments are injected, inner SQL may start with a comment instead, so chained joins might stop flattening after one level. This is likely only when SQL trace-to-dataframe context is enabled.
this sounds like a valid assumption, could this happen in real user workload?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I've adjusted the code to strip comments before checking for parentheses.

@sfc-gh-aling sfc-gh-aling left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! please ensure the snowaprk connect test pass before merging

@sfc-gh-joshi sfc-gh-joshi merged commit e22bc94 into main Jun 10, 2026
32 checks passed
@sfc-gh-joshi sfc-gh-joshi deleted the joshi-SNOW-3485482-unnest-join branch June 10, 2026 23:15
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 10, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants