Skip to content

SNOW-3384967: reduce describe query generated by alias#4183

Merged
sfc-gh-yuwang merged 10 commits into
mainfrom
SNOW-3384967
Apr 29, 2026
Merged

SNOW-3384967: reduce describe query generated by alias#4183
sfc-gh-yuwang merged 10 commits into
mainfrom
SNOW-3384967

Conversation

@sfc-gh-yuwang

@sfc-gh-yuwang sfc-gh-yuwang commented Apr 16, 2026

Copy link
Copy Markdown
Collaborator
  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-NNNNNNN

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  3. Please describe how your code solves the related issue.

When describe reduction is on and the inner select already has resolved
attributes, infer new.attributes for this outer select by reusing datatype and
nullable from the subquery: (0) skip if parent column names collide, (1) index
attributes by normalized name, (2) walk new.projection, (3) only handle plain
columns or Alias(column), (4) resolve source via quoted-identifier-aware lookup,
(5) assign only if every output column was inferred (length matches projection).

https://snowpark-python-001.jenkinsdev1.us-west-2.aws-dev.app.snowflake.com/view/SnowparkPython/job/PythonStoredProcBuildSnowfortTest/860/
https://snowpark-python-001.jenkinsdev1.us-west-2.aws-dev.app.snowflake.com/job/SnowparkConnectRegressionRunner/486/
snowpark and scos regression test to verify the change

@github-actions

github-actions Bot commented Apr 16, 2026

Copy link
Copy Markdown

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@codecov-commenter

codecov-commenter commented Apr 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 97.43590% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 95.41%. Comparing base (160c418) to head (f198187).

Files with missing lines Patch % Lines
...ke/snowpark/_internal/analyzer/select_statement.py 97.43% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4183      +/-   ##
==========================================
+ Coverage   95.17%   95.41%   +0.24%     
==========================================
  Files         171      171              
  Lines       43801    43840      +39     
  Branches     7505     7517      +12     
==========================================
+ Hits        41686    41832     +146     
+ Misses       1294     1227      -67     
+ Partials      821      781      -40     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sfc-gh-yuwang sfc-gh-yuwang marked this pull request as ready for review April 21, 2026 02:34
@sfc-gh-yuwang sfc-gh-yuwang requested review from a team as code owners April 21, 2026 02:34
@sfc-gh-yuwang sfc-gh-yuwang added the NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md label Apr 21, 2026
Comment thread src/snowflake/snowpark/_internal/analyzer/select_statement.py Outdated
# column was inferred (length matches projection).
if self._session.reduce_describe_query_enabled and self.attributes is not None:
# subquery lookup by name
attributes_by_name = {attr.name: attr for attr in self.attributes}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could there a name collision?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have conclusion on:

  1. whether there will be name collision?
  2. if there is collision, is the logic still safe?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by collision I mean in the last project, same name column showing up twice: "a": StringType , "a": IntType

Comment thread src/snowflake/snowpark/_internal/analyzer/select_statement.py Outdated
_ = df.col("a")


def test_project_alias_infers_attributes_from_parent_metadata(session):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests for more complicated aliasing scenarios like what Adam described? Can you also test some cases where the describe query cannot be skipped, even with reduce_describe_query_enabled set?

# subquery lookup by name
attributes_by_name = {attr.name: attr for attr in self.attributes}
inferred_attributes: List[Attribute] = []
assert new.projection is not None

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this potentially raises assertion error which will break customers

could new.project be None and what happen if new.project is None before?
should we just fall back/skip optimization in this case?

inferred_attributes = []
break

source_attr = attributes_by_name.get(source_column_name)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to handle double quoted identifiers?

# else aborts without setting partial attributes, (4) map each case to an
# Attribute named for the projected column, (5) assign only if every output
# column was inferred (length matches projection).
if self._session.reduce_describe_query_enabled and self.attributes is not None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can call this out in the changelog. this is a snowpark public facing improvement

@sfc-gh-aling sfc-gh-aling left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use a scos artificial workload to verify the improvement and paste the number here

@sfc-gh-yuwang

Copy link
Copy Markdown
Collaborator Author

tests/workload_tests/artificial_workloads/test_artificial_workload_09_26.py
for above workload, it reduce 6 queries in total

# Skip: no projection to walk (do not assert; leave new.attributes unchanged).
if projection is not None:
# Skip: duplicate output names on the parent — dict/lookup would be ambiguous.
if len(parent_attributes) == len({a.name for a in parent_attributes}):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this if condition redundant?

you are already doing the collision check in the 1612-1618 loop.
this 1609 is another O(N) loop

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

originally included this if to determine whether there are collision before we normalize the attr name, but I think we can remove this check to save a O(N) loop

Comment on lines +2159 to +2163
def _normalized_snowflake_identifier_key(name: str) -> str:
"""Canonical quoted key: delimited identifiers preserve case; unquoted follow Snowflake uppercasing."""
if ALREADY_QUOTED.match(name):
return quote_name_without_upper_casing(unquote_if_quoted(name))
return quote_name(name)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as per discussion, we do not need this util, quote_name is good enough

@sfc-gh-aling

Copy link
Copy Markdown
Collaborator

please update to use quoted_name for attribute key loop up before merging the PR

@sfc-gh-yuwang sfc-gh-yuwang merged commit c3cc950 into main Apr 29, 2026
29 checks passed
@sfc-gh-yuwang sfc-gh-yuwang deleted the SNOW-3384967 branch April 29, 2026 01:21
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 29, 2026
@sfc-gh-yuwang sfc-gh-yuwang restored the SNOW-3384967 branch April 30, 2026 20:13
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants