SNOW-3384967: reduce describe query generated by alias#4183
Conversation
|
All contributors have signed the CLA ✍️ ✅ |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4183 +/- ##
==========================================
+ Coverage 95.17% 95.41% +0.24%
==========================================
Files 171 171
Lines 43801 43840 +39
Branches 7505 7517 +12
==========================================
+ Hits 41686 41832 +146
+ Misses 1294 1227 -67
+ Partials 821 781 -40 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| # column was inferred (length matches projection). | ||
| if self._session.reduce_describe_query_enabled and self.attributes is not None: | ||
| # subquery lookup by name | ||
| attributes_by_name = {attr.name: attr for attr in self.attributes} |
There was a problem hiding this comment.
could there a name collision?
There was a problem hiding this comment.
There was a problem hiding this comment.
do we have conclusion on:
- whether there will be name collision?
- if there is collision, is the logic still safe?
There was a problem hiding this comment.
by collision I mean in the last project, same name column showing up twice: "a": StringType , "a": IntType
| _ = df.col("a") | ||
|
|
||
|
|
||
| def test_project_alias_infers_attributes_from_parent_metadata(session): |
There was a problem hiding this comment.
Can you add tests for more complicated aliasing scenarios like what Adam described? Can you also test some cases where the describe query cannot be skipped, even with reduce_describe_query_enabled set?
| # subquery lookup by name | ||
| attributes_by_name = {attr.name: attr for attr in self.attributes} | ||
| inferred_attributes: List[Attribute] = [] | ||
| assert new.projection is not None |
There was a problem hiding this comment.
this potentially raises assertion error which will break customers
could new.project be None and what happen if new.project is None before?
should we just fall back/skip optimization in this case?
| inferred_attributes = [] | ||
| break | ||
|
|
||
| source_attr = attributes_by_name.get(source_column_name) |
There was a problem hiding this comment.
do we need to handle double quoted identifiers?
| # else aborts without setting partial attributes, (4) map each case to an | ||
| # Attribute named for the projected column, (5) assign only if every output | ||
| # column was inferred (length matches projection). | ||
| if self._session.reduce_describe_query_enabled and self.attributes is not None: |
There was a problem hiding this comment.
we can call this out in the changelog. this is a snowpark public facing improvement
sfc-gh-aling
left a comment
There was a problem hiding this comment.
let's use a scos artificial workload to verify the improvement and paste the number here
|
tests/workload_tests/artificial_workloads/test_artificial_workload_09_26.py |
0a7bc27 to
2d1cbf1
Compare
| # Skip: no projection to walk (do not assert; leave new.attributes unchanged). | ||
| if projection is not None: | ||
| # Skip: duplicate output names on the parent — dict/lookup would be ambiguous. | ||
| if len(parent_attributes) == len({a.name for a in parent_attributes}): |
There was a problem hiding this comment.
is this if condition redundant?
you are already doing the collision check in the 1612-1618 loop.
this 1609 is another O(N) loop
There was a problem hiding this comment.
originally included this if to determine whether there are collision before we normalize the attr name, but I think we can remove this check to save a O(N) loop
| def _normalized_snowflake_identifier_key(name: str) -> str: | ||
| """Canonical quoted key: delimited identifiers preserve case; unquoted follow Snowflake uppercasing.""" | ||
| if ALREADY_QUOTED.match(name): | ||
| return quote_name_without_upper_casing(unquote_if_quoted(name)) | ||
| return quote_name(name) |
There was a problem hiding this comment.
as per discussion, we do not need this util, quote_name is good enough
|
please update to use quoted_name for attribute key loop up before merging the PR |
ad2bd4e to
01c423e
Compare
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-NNNNNNN
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
When describe reduction is on and the inner select already has resolved
attributes, infer new.attributes for this outer select by reusing datatype and
nullable from the subquery: (0) skip if parent column names collide, (1) index
attributes by normalized name, (2) walk new.projection, (3) only handle plain
columns or Alias(column), (4) resolve source via quoted-identifier-aware lookup,
(5) assign only if every output column was inferred (length matches projection).
https://snowpark-python-001.jenkinsdev1.us-west-2.aws-dev.app.snowflake.com/view/SnowparkPython/job/PythonStoredProcBuildSnowfortTest/860/
https://snowpark-python-001.jenkinsdev1.us-west-2.aws-dev.app.snowflake.com/job/SnowparkConnectRegressionRunner/486/
snowpark and scos regression test to verify the change