SNOW-3409016: Support structured type schema string#4155
Conversation
|
I have read the CLA Document and I hereby sign the CLA You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot. |
8230e3c to
d5773e7
Compare
This reverts commit c5ab05de9c185ee004ddde5e6458615b39566686.
d5773e7 to
129d7ef
Compare
| from snowflake.snowpark.table import Table | ||
| from snowflake.snowpark.types import ( | ||
| ArrayType, | ||
| DataType, | ||
| DecimalType, | ||
| DoubleType, | ||
| LongType, | ||
| MapType, | ||
| StringType, | ||
| StructType, |
There was a problem hiding this comment.
Missing required imports for types used in the new code. StructField is used on line 219 but not imported. VariantType is used on lines 245, 257, and 1558 but not imported. This will cause NameError at runtime when these functions are called.
from snowflake.snowpark.types import (
ArrayType,
DataType,
DecimalType,
DoubleType,
LongType,
MapType,
StringType,
StructField, # Add this
StructType,
VariantType, # Add this
)| from snowflake.snowpark.table import Table | |
| from snowflake.snowpark.types import ( | |
| ArrayType, | |
| DataType, | |
| DecimalType, | |
| DoubleType, | |
| LongType, | |
| MapType, | |
| StringType, | |
| StructType, | |
| from snowflake.snowpark.table import Table | |
| from snowflake.snowpark.types import ( | |
| ArrayType, | |
| DataType, | |
| DecimalType, | |
| DoubleType, | |
| LongType, | |
| MapType, | |
| StringType, | |
| StructField, | |
| StructType, | |
| VariantType, | |
Spotted by Graphite
Is this helpful? React 👍 or 👎 to let us know.
| field_def = field_def.strip() | ||
| if not field_def: | ||
| continue |
There was a problem hiding this comment.
curious what is this case, is it something like ", , ," -- is this considered valid?
There was a problem hiding this comment.
Changed to raise an exception for this case.
| ) | ||
|
|
||
| self._use_structured_type_infer_schema: bool = ( | ||
| self._conn._get_client_side_session_parameter( |
There was a problem hiding this comment.
is this feature available in prod? shall we have integ test for it?
There was a problem hiding this comment.
It's currently disabled by now. QQ for the snowpark CI, does the test account have the permission to alter Snowflake internal parameters? If so, we can add integ tests.
There was a problem hiding this comment.
if it's a parameter invisible to client, probably we can not test it.
qq: can the parameter be set at session level sql -- "alter session set =true"?
There was a problem hiding this comment.
There is a global system parameter ENABLE_STRUCTURED_TYPE_INFER_SCHEMA_FOR_PARQUET, which can be set at session level. However, it's internal parameter that's not visible to clients. Does it help here?
|
|
||
| self._use_structured_type_infer_schema: bool = ( | ||
| self._conn._get_client_side_session_parameter( | ||
| _PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA, False |
There was a problem hiding this comment.
do we enable this feature by:
alter session set PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA=True ?
And like Adam said, can we add an integ test to test the end to end behavior, we can use mock if it is not available on prod
There was a problem hiding this comment.
Actually no. The flag PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA is just to guard the code change in snowpark python. There are other account level parameters to control whether the backend to return structured type schema string or not. So as mentioned https://github.com/snowflakedb/snowpark-python/pull/4155/changes/BASE..129d7ef3c2cc04f98811eb63d3ae522b3d412d3c#r3103032414, we can add integ test if test account used in snowpark is able to alter Snowflake non-public parameter.
There was a problem hiding this comment.
Updated to another way to avoid the confusion here.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4155 +/- ##
==========================================
+ Coverage 95.15% 95.43% +0.27%
==========================================
Files 171 171
Lines 43587 43785 +198
Branches 7457 7502 +45
==========================================
+ Hits 41476 41787 +311
+ Misses 1291 1221 -70
+ Partials 820 777 -43 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| raise ValueError(f"Unbalanced parentheses in type string: '{type_str}'") | ||
|
|
||
|
|
||
| def _sf_type_to_type_object(type_str: str) -> DataType: |
There was a problem hiding this comment.
+1 on @sfc-gh-yuwang 's comment, can we move those util functions to type_utils.py module?
| normalized = base_upper.replace(" ", "").lower() | ||
| if normalized in DATA_TYPE_STRING_OBJECT_MAPPINGS: | ||
| return DATA_TYPE_STRING_OBJECT_MAPPINGS[normalized]() | ||
| if normalized in _SF_EXTRA_TYPE_MAPPINGS: | ||
| return _SF_EXTRA_TYPE_MAPPINGS[normalized]() | ||
| raise ValueError(f"'{type_str}' is not a supported type") |
There was a problem hiding this comment.
this part of logic repeats the one at the beginning of the function.
IIUC, this is for handling non-structured types right? do we need the logic twice in the same function?
and if we indeed want, shall we define a local util function to reuse the logic?
SNOW-3409016
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Summary
INFER_SCHEMAcan now return structured type strings (for example,OBJECT(...),MAP(...),ARRAY(...)) for parquet/json files with nested columns. The legacy parser in_infer_schema_for_file_formatwas based on simple string splitting and could fail on nested structured types.This PR:
OBJECT(...),MAP(...),ARRAY(...)with nesting andNOT NULLhandling.OBJECT/MAP/ARRAY) by mapping them toVariantTypeand using$1:{name}(no cast) to preserve column access.Problem
When
INFER_SCHEMAreturns nested structured strings likeOBJECT(bio TEXT, social OBJECT(twitter TEXT, linkedin TEXT)), the old parser path can break because it assumes parenthesized values are numeric precision/scale.Even when parsing succeeds, cast generation for structured strings can be brittle (for example,
NOT NULLannotations inside structured casts).Solution
New parser path (
_parse_structured_type_str->_sf_type_to_type_object):StructType/ArrayType/MapTypewith structured semantics.TEXT->StringType,REAL->DoubleType,FIXED->LongType).convert_sf_to_sp_typefor simple and parameterized scalar types (backward compatible).Structured infer-schema identifier handling:
NOT NULLfrom cast text.VariantTypeand uses$1:{name}(no cast) to avoid failures and preserve column names.Changes
Core implementation:
src/snowflake/snowpark/dataframe_reader.py_extract_paren_content()_sf_type_to_type_object()_parse_structured_type_str()_infer_schema_for_file_formatto use structured parser pathformat.lower->format.lower()VariantType+ no-cast identifier)Supporting wiring/tests:
src/snowflake/snowpark/session.py_PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMAtests/unit/test_dataframe_reader_type_parsing.pyBackward Compatibility
NUMBER,VARCHAR,BOOLEAN, etc.) continue to use existing conversion logic.