SNOW-3172585: [Local Testing] Bug in the mock_substring() function with local testing in Snowpark Python #4091

Description

@andresccu

Please answer these questions before submitting your issue. Thanks!

1. What version of Python are you using?

Python 3.10.19 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 16:41:31) [MSC v.1929 64 bit (AMD64)]

2. What are the Snowpark Python and pandas versions in the environment?

pandas==2.2.3
snowflake-snowpark-python==1.34.0

3. What did you do?

Applied substring() to a DataFrame that had previously been filtered with filter().
When rows are removed by a filter, the underlying pandas index becomes non-contiguous (e.g. [1, 2] instead of [0, 1, 2]).
The built-in local testing mock for substring returns a ColumnEmulator with a fresh 0-based index instead of preserving the original index from the input column.
When with_column (for example) merges the result back into the DataFrame, pandas performs an outer-join on the mismatched indices, producing extra rows filled with NaN.
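The index mismatch described above can be shown with plain pandas, without Snowpark. This is only a minimal sketch: the DataFrame, the `fresh` and `aligned` Series, and the use of `pd.concat` are illustrative assumptions, not the mock's actual internals. It shows that combining columns whose index labels differ yields the union of the labels, with NaN where a label is missing on one side.

```python
import pandas as pd

# Hypothetical stand-in for the table after filter(): rows 1 and 2 survive.
df = pd.DataFrame({"string": ["hello", "world", "snowflake"]})
filtered = df[df["string"] != "hello"]  # index labels are [1, 2]

# A result column built with a fresh 0-based index ([0, 1]),
# which is what the report says the substring mock produces.
fresh = pd.Series(["orl", "now"], name="substring")

# pandas aligns on index labels, so the combined frame carries
# labels 0, 1 and 2 and fills the gaps with NaN.
misaligned = pd.concat([filtered["string"], fresh], axis=1)

# Reusing the input's index keeps the two columns row-aligned.
aligned = pd.Series(["orl", "now"], index=filtered.index, name="substring")
ok = pd.concat([filtered["string"], aligned], axis=1)

print(misaligned)
print(ok)
```

In the misaligned frame, label 1 pairs 'world' with 'now', label 2 has no substring value, and label 0 has no string value, which matches the three rows reported under question 4.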

You can reproduce this error by running the following test:

from snowflake.snowpark import Session, Row, DataFrame
from snowflake.snowpark.functions import col, lit, substring


def test_substring(session: Session):
    df: DataFrame = session.create_dataframe(
        [
            ["hello"],
            ["world"],
            ["snowflake"],
        ],
        schema=["string"],
    )

    # After filter, the internal pandas index becomes non-contiguous ([1, 2]).
    # mock_substring builds its result ColumnEmulator without index=base_expr.index,
    # so pandas introduces a spurious NaN row when with_column joins the result back.
    result = (
        df.filter(col("string") != lit("hello"))
        .with_column(
            "substring",
            substring(col("string"), lit(2), lit(3)),
        )
        .collect()
    )

    # Expected: [Row('world', 'orl'), Row('snowflake', 'now')]
    # Actual: 3 rows, added extra NaN/None row due to index mismatch in mock_substring
    assert len(result) == 2
    assert result[0]["SUBSTRING"] == "orl"
    assert result[1]["SUBSTRING"] == "now"

4. What did you expect to see?

The query should return exactly 2 rows with the substring correctly aligned to each row:

[
    Row(STRING='world', SUBSTRING='orl'),
    Row(STRING='snowflake', SUBSTRING='now')
]

Instead, the local testing framework returns 3 rows with misaligned data:

[
    Row(STRING='world', SUBSTRING='now'),
    Row(STRING='snowflake', SUBSTRING=nan),
    Row(STRING=nan, SUBSTRING='orl'),
]

When the same code runs against a real Snowflake connection, SUBSTRING works fine because there is no pandas index involved. The bug is exclusively in the local testing mock.

This happens because the mock_substring function in snowflake.snowpark.mock._functions builds its result ColumnEmulator without passing index=base_expr.index. After any .filter() that leaves the index non-contiguous, with_column therefore produces extra None/NaN rows.
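The shape of the proposed fix can be sketched with a plain pandas Series standing in for ColumnEmulator. The function name `substring_keep_index` is hypothetical and this is not the actual mock_substring code. It also assumes scalar, positive, 1-based `start` and `length` arguments, whereas the real mock receives them as columns.

```python
import pandas as pd


def substring_keep_index(base: pd.Series, start: int, length: int) -> pd.Series:
    """Hypothetical sketch: 1-based substring that reuses the input's index.

    Only positive start positions are handled here.
    """
    values = [
        None if s is None else s[start - 1 : start - 1 + length]
        for s in base
    ]
    # Passing index=base.index is the key step: the result keeps the
    # (possibly non-contiguous) labels of the input column.
    return pd.Series(values, index=base.index, dtype=object)


# Input column as it looks after the filter: labels [1, 2].
base = pd.Series(["world", "snowflake"], index=[1, 2])
out = substring_keep_index(base, 2, 3)
print(out)
```

Because the result carries the same labels as its input, joining it back onto the filtered table cannot introduce extra rows.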


I'm willing to implement this fix and submit a pull request. The change would touch:

  • mock_substring() function in src/snowflake/snowpark/mock/_functions.py
  • test_substring() test case in tests/mock/test_functions.py

Metadata

Labels

bug: Something isn't working
local testing: Local Testing issues/PRs
status-triage_done: Initial triage done, will be further handled by the driver team
