Skip to content

SNOW-3409016: Support structured type schema string#4155

Merged
sfc-gh-wshangguan merged 15 commits into
mainfrom
wshangguan-SNOW-3237416-infer-schema-parquet-structured-type-snowpark-python
Apr 20, 2026
Merged

SNOW-3409016: Support structured type schema string#4155
sfc-gh-wshangguan merged 15 commits into
mainfrom
wshangguan-SNOW-3237416-infer-schema-parquet-structured-type-snowpark-python

Conversation

@sfc-gh-wshangguan

@sfc-gh-wshangguan sfc-gh-wshangguan commented Apr 2, 2026

Copy link
Copy Markdown
Collaborator
  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

SNOW-3409016

  1. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
    • If adding any arguments to public Snowpark APIs or creating new public Snowpark APIs, I acknowledge that I have ensured my changes include AST support. Follow the link for more information: AST Support Guidelines
  2. Please describe how your code solves the related issue.

Summary

INFER_SCHEMA can now return structured type strings (for example, OBJECT(...), MAP(...), ARRAY(...)) for parquet/json files with nested columns. The legacy parser in _infer_schema_for_file_format was based on simple string splitting and could fail on nested structured types.
This PR:

  • Adds a recursive structured type parser for OBJECT(...), MAP(...), ARRAY(...) with nesting and NOT NULL handling.
  • Adds safer fallback handling for bare structured keywords (OBJECT/MAP/ARRAY) by mapping them to VariantType and using $1:{name} (no cast) to preserve column access.

Problem

When INFER_SCHEMA returns nested structured strings like OBJECT(bio TEXT, social OBJECT(twitter TEXT, linkedin TEXT)), the old parser path can break because it assumes parenthesized values are numeric precision/scale.
Even when parsing succeeds, cast generation for structured strings can be brittle (for example, NOT NULL annotations inside structured casts).

Solution

New parser path (_parse_structured_type_str -> _sf_type_to_type_object):

  • Recursively parses nested structured type strings into Snowpark StructType / ArrayType / MapType with structured semantics.
  • Handles Snowflake metadata aliases (TEXT -> StringType, REAL -> DoubleType, FIXED -> LongType).
  • Falls back to convert_sf_to_sp_type for simple and parameterized scalar types (backward compatible).
    Structured infer-schema identifier handling:
  • For structured types, preserves raw inferred type text while stripping NOT NULL from cast text.
  • For bare structured keywords (older backend behavior), returns VariantType and uses $1:{name} (no cast) to avoid failures and preserve column names.

Changes

Core implementation:

  • src/snowflake/snowpark/dataframe_reader.py
    • Added _extract_paren_content()
    • Added _sf_type_to_type_object()
    • Added _parse_structured_type_str()
    • Updated _infer_schema_for_file_format to use structured parser path
    • Fixed format.lower -> format.lower()
    • Added bare-structured fallback behavior (VariantType + no-cast identifier)
      Supporting wiring/tests:
  • src/snowflake/snowpark/session.py
    • Added session parameter _PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA
  • tests/unit/test_dataframe_reader_type_parsing.py
    • Added parser and infer-schema coverage (nested structured, nullability, bare structured fallback, parquet/json path behavior)

Backward Compatibility

  • Simple scalar types (NUMBER, VARCHAR, BOOLEAN, etc.) continue to use existing conversion logic.
  • Legacy behavior remains available when structured infer-schema flag is disabled.
  • Non-structured infer-schema results are unaffected.

@github-actions

Copy link
Copy Markdown


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

@sfc-gh-wshangguan sfc-gh-wshangguan changed the title (WIP) first attempt, still need parameter protection? SNOW-3237416: Support structured type schema string Apr 11, 2026
@sfc-gh-wshangguan sfc-gh-wshangguan marked this pull request as ready for review April 11, 2026 00:18
@sfc-gh-wshangguan sfc-gh-wshangguan force-pushed the wshangguan-SNOW-3237416-infer-schema-parquet-structured-type-snowpark-python branch from 8230e3c to d5773e7 Compare April 11, 2026 08:52
@sfc-gh-wshangguan sfc-gh-wshangguan force-pushed the wshangguan-SNOW-3237416-infer-schema-parquet-structured-type-snowpark-python branch from d5773e7 to 129d7ef Compare April 14, 2026 01:08
Comment on lines 98 to 107
from snowflake.snowpark.table import Table
from snowflake.snowpark.types import (
ArrayType,
DataType,
DecimalType,
DoubleType,
LongType,
MapType,
StringType,
StructType,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing required imports for types used in the new code. StructField is used on line 219 but not imported. VariantType is used on lines 245, 257, and 1558 but not imported. This will cause NameError at runtime when these functions are called.

from snowflake.snowpark.types import (
    ArrayType,
    DataType,
    DecimalType,
    DoubleType,
    LongType,
    MapType,
    StringType,
    StructField,  # Add this
    StructType,
    VariantType,  # Add this
)
Suggested change
from snowflake.snowpark.table import Table
from snowflake.snowpark.types import (
ArrayType,
DataType,
DecimalType,
DoubleType,
LongType,
MapType,
StringType,
StructType,
from snowflake.snowpark.table import Table
from snowflake.snowpark.types import (
ArrayType,
DataType,
DecimalType,
DoubleType,
LongType,
MapType,
StringType,
StructField,
StructType,
VariantType,

Spotted by Graphite

Fix in Graphite


Is this helpful? React 👍 or 👎 to let us know.

Comment thread src/snowflake/snowpark/dataframe_reader.py Outdated
Comment thread src/snowflake/snowpark/dataframe_reader.py Outdated
Comment thread src/snowflake/snowpark/dataframe_reader.py Outdated
Comment on lines +210 to +212
field_def = field_def.strip()
if not field_def:
continue

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious what is this case, is it something like ", , ," -- is this considered valid?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to raise an exception for this case.

Comment thread src/snowflake/snowpark/dataframe_reader.py Outdated
Comment thread src/snowflake/snowpark/session.py Outdated
)

self._use_structured_type_infer_schema: bool = (
self._conn._get_client_side_session_parameter(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this feature available in prod? shall we have integ test for it?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's currently disabled by now. QQ for the snowpark CI, does the test account have the permission to alter Snowflake internal parameters? If so, we can add integ tests.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's a parameter invisible to client, probably we can not test it.
qq: can the parameter be set at session level sql -- "alter session set =true"?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a global system parameter ENABLE_STRUCTURED_TYPE_INFER_SCHEMA_FOR_PARQUET, which can be set at session level. However, it's internal parameter that's not visible to clients. Does it help here?

Comment thread src/snowflake/snowpark/session.py Outdated

self._use_structured_type_infer_schema: bool = (
self._conn._get_client_side_session_parameter(
_PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA, False

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we enable this feature by:
alter session set PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA=True ?
And like Adam said, can we add an integ test to test the end to end behavior, we can use mock if it is not available on prod

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually no. The flag PYTHON_SNOWPARK_USE_STRUCTURED_TYPE_INFER_SCHEMA is just to guard the code change in snowpark python. There are other account level parameters to control whether the backend to return structured type schema string or not. So as mentioned https://github.com/snowflakedb/snowpark-python/pull/4155/changes/BASE..129d7ef3c2cc04f98811eb63d3ae522b3d412d3c#r3103032414, we can add integ test if test account used in snowpark is able to alter Snowflake non-public parameter.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to another way to avoid the confusion here.

Comment thread src/snowflake/snowpark/dataframe_reader.py Outdated
@codecov-commenter

codecov-commenter commented Apr 17, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.43%. Comparing base (c35990f) to head (ebf4cc4).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4155      +/-   ##
==========================================
+ Coverage   95.15%   95.43%   +0.27%     
==========================================
  Files         171      171              
  Lines       43587    43785     +198     
  Branches     7457     7502      +45     
==========================================
+ Hits        41476    41787     +311     
+ Misses       1291     1221      -70     
+ Partials      820      777      -43     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

raise ValueError(f"Unbalanced parentheses in type string: '{type_str}'")


def _sf_type_to_type_object(type_str: str) -> DataType:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on @sfc-gh-yuwang 's comment, can we move those util functions to type_utils.py module?

Comment on lines +247 to +252
normalized = base_upper.replace(" ", "").lower()
if normalized in DATA_TYPE_STRING_OBJECT_MAPPINGS:
return DATA_TYPE_STRING_OBJECT_MAPPINGS[normalized]()
if normalized in _SF_EXTRA_TYPE_MAPPINGS:
return _SF_EXTRA_TYPE_MAPPINGS[normalized]()
raise ValueError(f"'{type_str}' is not a supported type")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this part of logic repeats the one at the beginning of the function.
IIUC, this is for handling non-structured types right? do we need the logic twice in the same function?
and if we indeed want, shall we define a local util function to reuse the logic?

@sfc-gh-wshangguan sfc-gh-wshangguan merged commit c63ad27 into main Apr 20, 2026
28 of 34 checks passed
@sfc-gh-wshangguan sfc-gh-wshangguan deleted the wshangguan-SNOW-3237416-infer-schema-parquet-structured-type-snowpark-python branch April 20, 2026 20:26
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 20, 2026
@sfc-gh-wshangguan sfc-gh-wshangguan changed the title SNOW-3237416: Support structured type schema string SNOW-3409016: Support structured type schema string Apr 20, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants