SNOW-3485482: Eliminate unnecessary SELECT * from joins#4248
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4248 +/- ##
==========================================
- Coverage 95.49% 95.49% -0.01%
==========================================
Files 171 171
Lines 44184 44219 +35
Branches 7534 7540 +6
==========================================
+ Hits 42195 42228 +33
Misses 1226 1226
- Partials 763 765 +2 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
a30ead9 to
12a15d5
Compare
sfc-gh-aling
left a comment
There was a problem hiding this comment.
The outer-most layers of SELECT * cannot be removed
why we cannot -- where does the nested select come from?
I think the optimal is we just have oneselect *for this case
SELECT * FROM (
(
SELECT "KEY", "VAL_BASE" FROM table_name
) AS SNOWPARK_LEFT
INNER JOIN (
SELECT "KEY", "VAL_1" FROM table_name
) AS SNOWPARK_RIGHT
USING ("KEY")
INNER JOIN (
SELECT "KEY", "VAL_2" FROM table_name
) AS SNOWPARK_TEMP_TABLE_Y7YO05WRW5
USING ("KEY")
)
WHERE "VAL_BASE" > 100
ORDER BY "VAL_BASE" DESC NULLS LAST
LIMIT 5;| # Though it is technically less efficient than constructing the join sub-queries | ||
| # without the SELECT in the first place, the structure of our SQL processing code | ||
| # top-level projections to be wrapped by a select. | ||
| unwrapped_left = _unwrap_select_star_from(left) |
There was a problem hiding this comment.
dumb question, why we do not also unwrap right node
There was a problem hiding this comment.
A sequence of chained operations (df1.join(df2).join(df3)...) would only nest operations on the left side, while nesting on the right side would occur if you did df1.join(df2.join(df3)...). Flattening only one side of the join is significantly simpler; here's Cursor's analysis:
The 2nd concern I raised was: "Merging two join expressions into one FROM clause is significantly more complex — you'd need to handle alias collisions from both sides and potentially break column resolution semantics."
Here's what that means concretely. Consider:
left_result = a.join(b, "key") # produces: A JOIN B
right_result = c.join(d, "key") # produces: C JOIN D
final = left_result.join(right_result, "key")
With left-only unwrapping, we produce:
SELECT * FROM (
(A) AS L1 INNER JOIN (B) AS R1 USING (key)
INNER JOIN (SELECT * FROM ((C) AS L2 INNER JOIN (D) AS R2 USING (key))) AS R3 USING (key)
)
The right operand keeps its SELECT * FROM (...) shell, acting as a self-contained derived table with its own alias namespace.
If we also unwrapped the right side, we'd need to merge both join expressions into a single flat FROM clause:
SELECT * FROM (
(A) AS L1 INNER JOIN (B) AS R1 USING (key)
INNER JOIN (C) AS L2 INNER JOIN (D) AS R2 USING (key) -- ← problematic
USING (key) -- which USING clause binds to which JOIN?
)
This creates two problems:
-
Join precedence ambiguity — SQL evaluates joins left-to-right. Splicing in C JOIN D without a subquery boundary changes how the USING/ON clauses bind. The condition between left_result and right_result needs to apply to the combined result of C JOIN D, but without the subquery wrapper, it would syntactically bind only to D. You'd need explicit parenthesization of the right-side join group, which Snowflake doesn't support as a FROM item (the [TOK_JOIN] error we hit earlier).
-
Alias collisions — Both sides independently generate aliases (SNOWPARK_LEFT, SNOWPARK_RIGHT, or random names). If both join expressions are spliced into the same FROM clause, their aliases coexist in the same namespace. We'd need to rewrite aliases within the right-side join expression to avoid conflicts — which means parsing and modifying already-generated SQL strings, a fragile and error-prone operation.
By keeping the SELECT * FROM (...) wrapper on the right operand, it acts as an opaque derived table: its internal aliases are scoped, its join conditions bind correctly, and it integrates cleanly as a single table reference in the outer FROM clause.
Here's the exact code I used for the example query: One layer of SELECT * is from the final |
c9cccc0 to
80cc298
Compare
| _SELECT_STAR_FROM_SUFFIX | ||
| ): | ||
| inner = sql[len(_SELECT_STAR_FROM_PREFIX) : -len(_SELECT_STAR_FROM_SUFFIX)] | ||
| if inner.startswith(LEFT_PARENTHESIS): |
There was a problem hiding this comment.
cursor give me this:
Medium (behavioral/perf): flattening may stop in trace-SQL mode.
In analyzer_utils._unwrap_select_star_from, unwrapping requires the inner text to start with "(". But when UUID trace comments are injected, inner SQL may start with a comment instead, so chained joins might stop flattening after one level. This is likely only when SQL trace-to-dataframe context is enabled.
this sounds like a valid assumption, could this happen in real user workload?
There was a problem hiding this comment.
Good catch. I've adjusted the code to strip comments before checking for parentheses.
sfc-gh-aling
left a comment
There was a problem hiding this comment.
LGTM! please ensure the snowaprk connect test pass before merging
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-3485482
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Removes unnecessary SELECT * layers from repeated join operations. Truncated example SQL from Python code
df1.join(df2.join(df3, on="key", how="inner"), on="key", how="inner"), edited for readability:Current main:
New:
The outer-most layers of SELECT * cannot be removed, but this PR prevents the addition of a new layer of SELECT * wrapping on every additional chained join operatio. For a sequence of 10 joins, this PR reduces the query text from 3752 characters to 3528 (-6%), but does not meaningfully affect SQL compilation time.