SNOW-2372686: Write large_pandas_backend_df.to_snowflake() via parquet. #3820

Merged
sfc-gh-mvashishtha merged 8 commits into main from mvashishtha/SNOW-2372686/improve-pandas-backend-df-to-snowflake-performance
Oct 1, 2025

@sfc-gh-mvashishtha sfc-gh-mvashishtha commented Sep 29, 2025

Contributor

When to_snowflake() is called on a Snowpark pandas dataframe that is on the pandas backend and holds enough data, upload the data via a parquet file instead of via a Snowpark dataframe. The Snowpark dataframe path typically inserts values through parameterized SQL queries.

Benchmarking showed that a good threshold for switching to parquet was roughly 3 MB, so I've set that as the default for the configurable switching threshold. The advantage of the parquet approach seems to grow with dataset size. Exporting an 800 MB dataframe took about 55 seconds via parquet versus about 429 seconds via the old method, a speedup of over 7x.
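
To make the switching rule concrete, here is a minimal sketch of a size-based dispatch. It is not the code in this PR: the names `DEFAULT_PARQUET_THRESHOLD_BYTES`, `choose_upload_method`, and `speedup` are hypothetical, and it assumes that "3 MB" means 3,000,000 bytes and that data exactly at the threshold takes the parquet path.

```python
# Hypothetical sketch of the switching rule described above; not the PR's implementation.
# Assumption: "roughly 3 MB" is treated as 3,000,000 bytes of in-memory data.
DEFAULT_PARQUET_THRESHOLD_BYTES = 3 * 1000 * 1000


def choose_upload_method(
    data_size_bytes: int,
    threshold_bytes: int = DEFAULT_PARQUET_THRESHOLD_BYTES,
) -> str:
    """Return "parquet" for data at or above the threshold, else "snowpark"."""
    if data_size_bytes < 0:
        raise ValueError("data_size_bytes must be non-negative")
    # Assumption: data exactly at the threshold goes through parquet.
    return "parquet" if data_size_bytes >= threshold_bytes else "snowpark"


def speedup(old_seconds: float, new_seconds: float) -> float:
    """Ratio of the old method's time to the new method's time."""
    if new_seconds <= 0:
        raise ValueError("new_seconds must be positive")
    return old_seconds / new_seconds
```

With the numbers quoted above, `choose_upload_method(800_000_000)` picks the parquet path, and `speedup(429, 55)` comes to about 7.8, which matches the "over 7x" figure.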

This work took me much longer than expected, so I am sending out this PR with just the changes for DataFrame.to_snowflake(). We can address Series.to_snowflake() separately.

We can take a similar approach to speed up pandas_backend_df.move_to('snowflake').

benchmark script with a 3XL warehouse
import datetime
from time import perf_counter

import modin.pandas as pd
import numpy as np
import pandas as native_pd
import snowflake.snowpark.modin.plugin  # noqa: F401 (imported for its side effect of enabling Snowpark pandas)
from snowflake.snowpark.session import Session
from tests.utils import Utils


def get_benchmark_result(input_data: native_pd.DataFrame, method: str) -> dict:
    # A limit of 0 is meant to force the parquet path; a huge limit is meant
    # to force the Snowpark dataframe path.
    if method == "parquet":
        pd.session.pandas_to_snowflake_row_limit = 0
    else:
        pd.session.pandas_to_snowflake_row_limit = int(1e99)

    # Pin the dataframe to the pandas backend so it is not moved automatically.
    df = pd.DataFrame(input_data).set_backend("pandas").pin_backend()
    table = Utils.random_table_name()
    # Time only the upload itself.
    start = perf_counter()
    df.to_snowflake(table)
    end = perf_counter()
    Utils.drop_table(pd.session, table)
    return {
        "time": datetime.timedelta(seconds=end - start),
        "memory_usage": input_data.memory_usage().sum(),
        "num_rows": input_data.shape[0],
        "num_columns": input_data.shape[1],
        "method": method,
    }


# Create the session that the benchmark accesses through pd.session.
session = Session.builder.create()

# (num_rows, num_columns) of each benchmark dataframe.
shapes = [
    (int(1e0), 1),
    (int(1e3), 1),
    (int(1e3), 3),
    (int(1e4), 1),
    (int(1e4), 3),
    (int(1e5), 1),
    (int(1e5), 3),
    (int(1e6), 1),
    (int(1e6), 3),
    (int(1e6), 10),
    (int(1e7), 1),
    (int(1e7), 10),
]

rows = []
for shape in shapes:
    # Random integers in [0, 1000).
    input_data = native_pd.DataFrame(np.random.randint(low=0, high=1000, size=shape))
    for method in ("parquet", "snowpark"):
        rows.append(get_benchmark_result(input_data, method=method))

# Save the timings for plotting.
native_pd.DataFrame(rows).to_parquet("to_snowflake_timing_2", index=False)
[Image: to_snowflake_timing_2 (benchmark timing graph)]
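
For readers who want numbers rather than a graph, the per-dataset speedup can be derived from the benchmark rows roughly as follows. This is a sketch under assumptions: `summarize_speedups` is a hypothetical helper that is not part of the PR, and it expects rows shaped like the dictionaries returned by `get_benchmark_result` above (keys `time`, `num_rows`, `num_columns`, `method`, with `time` as a `datetime.timedelta`).

```python
from collections import defaultdict


def summarize_speedups(rows):
    """Pair up parquet and snowpark timings for each dataframe shape.

    Hypothetical helper: `rows` is a list of dicts with the keys produced by
    get_benchmark_result ("time", "num_rows", "num_columns", "method").
    """
    # Group the timings by dataframe shape, then by upload method.
    by_shape = defaultdict(dict)
    for row in rows:
        shape = (row["num_rows"], row["num_columns"])
        by_shape[shape][row["method"]] = row["time"]

    summary = []
    for shape, times in sorted(by_shape.items()):
        # Skip shapes that were not measured with both methods.
        if "parquet" not in times or "snowpark" not in times:
            continue
        parquet_time = times["parquet"]
        snowpark_time = times["snowpark"]
        summary.append(
            {
                "num_rows": shape[0],
                "num_columns": shape[1],
                "parquet_seconds": parquet_time.total_seconds(),
                "snowpark_seconds": snowpark_time.total_seconds(),
                # Dividing two timedeltas yields a plain float ratio.
                "speedup": snowpark_time / parquet_time,
            }
        )
    return summary
```

Calling `summarize_speedups(rows)` on the list built by the script would give one record per shape, sorted by row and column count, with a `speedup` above 1 wherever parquet was faster.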

Signed-off-by: sfc-gh-mvashishtha <mahesh.vashishtha@snowflake.com>
@sfc-gh-mvashishtha sfc-gh-mvashishtha added the NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs label Sep 29, 2025
Signed-off-by: sfc-gh-mvashishtha <mahesh.vashishtha@snowflake.com>
@sfc-gh-mvashishtha sfc-gh-mvashishtha changed the title SNOW-2372686: Write large_pandas_df.to_snowflake() via parquet. SNOW-2372686: Write large_pandas_backend_df.to_snowflake() via parquet. Sep 30, 2025
Signed-off-by: sfc-gh-mvashishtha <mahesh.vashishtha@snowflake.com>
@sfc-gh-mvashishtha sfc-gh-mvashishtha marked this pull request as ready for review September 30, 2025 02:49
@sfc-gh-mvashishtha sfc-gh-mvashishtha requested a review from a team as a code owner September 30, 2025 02:49

@sfc-gh-joshi sfc-gh-joshi left a comment

Contributor

Very cool! That's an impressive perf graph.

Left a few comments.

Comment thread CHANGELOG.md Outdated
Comment thread src/snowflake/snowpark/modin/plugin/extensions/dataframe_extensions.py Outdated
Comment thread src/snowflake/snowpark/modin/plugin/extensions/dataframe_extensions.py Outdated
Comment thread src/snowflake/snowpark/modin/plugin/extensions/pd_extensions.py Outdated
Comment thread src/snowflake/snowpark/modin/plugin/extensions/pd_extensions.py Outdated
Comment thread tests/integ/modin/frame/test_to_snowflake.py Outdated
Comment thread src/snowflake/snowpark/modin/plugin/extensions/dataframe_extensions.py Outdated
Comment thread src/snowflake/snowpark/modin/plugin/extensions/dataframe_extensions.py Outdated
sfc-gh-mvashishtha and others added 3 commits October 1, 2025 04:14
…sions.py

Co-authored-by: Jonathan Shi <149419494+sfc-gh-joshi@users.noreply.github.com>
Signed-off-by: sfc-gh-mvashishtha <mahesh.vashishtha@snowflake.com>
Signed-off-by: sfc-gh-mvashishtha <mahesh.vashishtha@snowflake.com>
Comment thread tests/integ/modin/frame/test_to_snowflake.py
@sfc-gh-mvashishtha sfc-gh-mvashishtha merged commit e4584e2 into main Oct 1, 2025
30 checks passed
@sfc-gh-mvashishtha sfc-gh-mvashishtha deleted the mvashishtha/SNOW-2372686/improve-pandas-backend-df-to-snowflake-performance branch October 1, 2025 19:48
@github-actions github-actions Bot locked and limited conversation to collaborators Oct 1, 2025
Labels

NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs snowpark-pandas

3 participants