Skip to content

SNOW-3485471: cte dedup for self-join case#4245

Merged
sfc-gh-aling merged 10 commits into
mainfrom
aling-poc-cte-optimization
Jun 3, 2026
Merged

SNOW-3485471: cte dedup for self-join case#4245
sfc-gh-aling merged 10 commits into
mainfrom
aling-poc-cte-optimization

Conversation

@sfc-gh-aling

@sfc-gh-aling sfc-gh-aling commented May 29, 2026

Copy link
Copy Markdown
Collaborator
  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

In SCOS, spark.table() translates into SnowflakePlan nodes with non-empty expr_to_alias. The TPCDS_Q39 query aliases the same inventory DataFrame as inv1 and inv2 and self-joins them. The join path
deep-copies the plan for each branch and calls add_aliases separately, resulting in both branches carrying the same alias values under different UUID keys. The old key-sort hashing produced
different hashes for the two branches, so the CTE optimizer did not detect them as duplicates and inlined the full aggregation pipeline twice. The value-sort fix correctly identifies them as
the same computation and emits a single CTE.

  1. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  2. Please describe how your code solves the related issue.

  • hash node by expr_to_alias value instead expr_to_alias items
  • conflicting alias detection to avoid over optimization

@codecov-commenter

codecov-commenter commented May 29, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.39%. Comparing base (495acce) to head (c29f910).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4245   +/-   ##
=======================================
  Coverage   95.38%   95.39%           
=======================================
  Files         171      171           
  Lines       44243    44252    +9     
  Branches     7557     7559    +2     
=======================================
+ Hits        42203    42215   +12     
+ Misses       1252     1250    -2     
+ Partials      788      787    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sfc-gh-aling sfc-gh-aling changed the title poc fix SNOW-3485471: self-join not pushed into cte May 29, 2026
@sfc-gh-aling sfc-gh-aling changed the title SNOW-3485471: self-join not pushed into cte SNOW-3485471: optimize cte for self-join case May 29, 2026
@sfc-gh-aling sfc-gh-aling requested a review from sfc-gh-aalam May 29, 2026 22:06
@sfc-gh-aling sfc-gh-aling changed the title SNOW-3485471: optimize cte for self-join case SNOW-3485471: cte dedup for self-join case May 29, 2026
@sfc-gh-aling sfc-gh-aling marked this pull request as ready for review May 29, 2026 22:51
@sfc-gh-aling sfc-gh-aling requested review from a team as code owners May 29, 2026 22:51
Comment on lines +310 to +315
# Hash by alias values only, not the UUID keys, since UUID keys are regenerated on every deep-copy/re-resolve (e.g. the two
# branches of a self-join). This lets nodes representing the same computation hash identically, enabling CTE dedup for self-joins.
# NOTE: since nodes with different UUID keys can now share a CTE, _replace_duplicate_node_with_cte must merge each duplicate's
# UUID→alias entries into the shared CTE so parent re-resolution can resolve any UUID variant (see companion comment there).
# Different alias values (e.g. a "_WITH_AD_GROUP" join suffix from _disambiguate) still hash differently, preserving SNOW-2261400.
string = f"{string}#{sorted(set(node.expr_to_alias.values()))}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what kind of cases will this produce same hash-keys. When it is wrong, what kind of risk are we dealing with

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I assume you are asking the hash-keys for a node -- same hash-keys are produced by nodes with same sql, sql params, df aliases, and expr_to_alias values (before the change it's by expr_to_alias.items().

(hash-keys for expr_to_alias are generated from uuid, it can rarely happen that two uuid collide)

  1. the only risk is parent column resolution as we relax the standard of repeated node identification. I introduced extra logic to detect conflicting expr_to_alias (same expr maps to two different alias) in nodes to prevent CTE in this case. thank you for bringing it up

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any chance of duplicate value entries here that would silently get swallowed by the sorted(set(...)) call?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, duplicate alias can happen but consolidating them is harmless in our case, it's not reflected in the newly-generated CTE optimized node, and the generated sql in the hash is enough for dup node detection in the self-join case.

the expr_to_alias is mostly for internal column name resolution in the downstream compilation stage.

theoretically I think this info shall be excluded from the hash computation of a node, but right now I keep it as a defensive layer to distinguish nodes.

Comment thread tests/integ/test_cte.py
Comment on lines -877 to +889
# When adding a lsuffix, expr alias map will be updated, so df2 and df3 are considered
# different and have different ids. So only df1 and df will be converted to a CTE
# With value-sort hashing, df1/df2/df3 now hash identically (same SQL +
# same alias values, different UUID keys). df2 and df3 are replaced with a
# shared CTE, but df1's left-join position remains inline. That gives 2
# CTEs (base VALUES + filtered df1) and the filter appears twice (once in
# the CTE body, once inline for the left-join position).
assert (
count_number_of_ctes(Utils.normalize_sql(df6.queries["queries"][-1])) == 1
count_number_of_ctes(Utils.normalize_sql(df6.queries["queries"][-1])) == 2
)
assert Utils.normalize_sql(df6.queries["queries"][-1]).count(filter_clause) == 3
assert Utils.normalize_sql(df6.queries["queries"][-1]).count(filter_clause) == 2

@sfc-gh-aling sfc-gh-aling May 29, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for clarity, more CTE is better for this case, see the below comparison between the generated sql:

Before (1 CTE, filter appears 3 times):

  WITH CTE_values AS (
    SELECT $1 AS "A", $2 AS "B" FROM VALUES (1,2),(3,4)                                                                                                                             
  )                                                                                                                                                                                 
  SELECT * FROM (                                                                                                                                                                   
    (                                                                                                                                                                               
      SELECT "A_XXX", "B_XXX", "A" AS "A_YYY", "B" AS "B_YYY"                                                                                                                       
      FROM (                                                                                                                                                                        
        (                                                                                                                                                                           
          SELECT "A" AS "A_XXX", "B" AS "B_XXX"                                                                                                                                     
          FROM CTE_values WHERE ("A" = 1)       -- filter #1 (df1, inline)                                                                                                          
        ) AS SNOWPARK_LEFT                                                                                                                                                          
        INNER JOIN (                                                                                                                                                                
          SELECT "A", "B"                                                                                                                                                           
          FROM CTE_values WHERE ("A" = 1)       -- filter #2 (df2, inline)                                                                                                          
        ) AS SNOWPARK_RIGHT                                                                                                                                                         
      )                                                           
    ) AS SNOWPARK_LEFT
    INNER JOIN (      
      SELECT "A", "B"
      FROM CTE_values WHERE ("A" = 1)           -- filter #3 (df3, inline)
    ) AS SNOWPARK_RIGHT                                                   
  )                                                                                                                                                                                 

After (2 CTEs, filter appears 2 times):

  WITH CTE_values AS (                                                                                                                                                              
    SELECT $1 AS "A", $2 AS "B" FROM VALUES (1,2),(3,4)                                                                                                                             
  ),                                                                                                                                                                                
  CTE_filtered AS (                              -- df2 and df3 deduplicated here                                                                                                   
    SELECT "A", "B"                                                                                                                                                                 
    FROM CTE_values WHERE ("A" = 1)             -- filter #1 (in CTE body)                                                                                                          
  )                                                                                                                                                                                 
  SELECT * FROM (                                                                                                                                                                   
    (                                                             
      SELECT "A_XXX", "B_XXX", "A" AS "A_YYY", "B" AS "B_YYY"
      FROM (                                                 
        (   
          SELECT "A" AS "A_XXX", "B" AS "B_XXX"
          FROM CTE_values WHERE ("A" = 1)       -- filter #2 (df1, still inline)
        ) AS SNOWPARK_LEFT                                                      
        INNER JOIN (SELECT * FROM CTE_filtered) AS SNOWPARK_RIGHT                                                                                                                   
      )                                                                                                                                                                             
    ) AS SNOWPARK_LEFT                                                                                                                                                              
    INNER JOIN (SELECT * FROM CTE_filtered) AS SNOWPARK_RIGHT                                                                                                                       
  )   

@sfc-gh-aling sfc-gh-aling requested a review from sfc-gh-aalam June 2, 2026 17:56
Comment on lines +310 to +315
# Hash by alias values only, not the UUID keys, since UUID keys are regenerated on every deep-copy/re-resolve (e.g. the two
# branches of a self-join). This lets nodes representing the same computation hash identically, enabling CTE dedup for self-joins.
# NOTE: since nodes with different UUID keys can now share a CTE, _replace_duplicate_node_with_cte must merge each duplicate's
# UUID→alias entries into the shared CTE so parent re-resolution can resolve any UUID variant (see companion comment there).
# Different alias values (e.g. a "_WITH_AD_GROUP" join suffix from _disambiguate) still hash differently, preserving SNOW-2261400.
string = f"{string}#{sorted(set(node.expr_to_alias.values()))}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any chance of duplicate value entries here that would silently get swallowed by the sorted(set(...)) call?

@sfc-gh-aling sfc-gh-aling requested a review from sfc-gh-joshi June 3, 2026 18:17
@sfc-gh-aling sfc-gh-aling force-pushed the aling-poc-cte-optimization branch from d854f34 to 5172f7f Compare June 3, 2026 19:11
@sfc-gh-aling sfc-gh-aling merged commit dc4deee into main Jun 3, 2026
28 of 29 checks passed
@sfc-gh-aling sfc-gh-aling deleted the aling-poc-cte-optimization branch June 3, 2026 20:56
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 3, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants