Commit 20b4d51

SNOW-2345238: Re-use row count from relaxed query compiler (#3791)
SNOW-2230971 (#3687) introduced a second "relaxed" query compiler with placeholder row positions, which exists in parallel to the existing Snowflake QC. When the Snowflake QC calls `create_initial_ordered_dataframe`, it may issue a metadata query to retrieve a row count if the frame is created from a Snowflake table rather than from a SQL query. After the addition of the relaxed QC, this code path was hit twice, so a second, redundant row count query was issued. This PR re-uses the result of the first row count query whenever it is available.

This regression was not caught by testing because the SQL counter explicitly filters out row count metadata queries. Accordingly, this PR makes no testing changes: the row count reduction is not visible to the SQL counter as currently configured. The regression was also not caught by the benchmarking job (which does NOT filter row count queries), because that job stopped running due to crashes caused by enabling hybrid execution on main. I verified the removal of the extra row count query manually, and its impact will be visible in the benchmark dashboard.

Removing this query has a non-trivial impact on operations with short runtimes, where network latency becomes a meaningful factor. It addresses the regression in single-API benchmarks like `pd.read_snowflake`, which went from ~0.2s to ~0.6s for a 1e6x10 table between commits.
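
To illustrate why a filtered counter cannot see this regression, here is a minimal sketch. It is not the project's actual SQL counter: the `SqlCounter` class, the `SHOW OBJECTS` prefix filter, and the query strings are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SqlCounter:
    """Hypothetical stand-in for a test-side query counter."""

    # Assumption: row count metadata queries are recognized by this prefix
    # and are skipped rather than counted.
    ignored_prefixes: tuple[str, ...] = ("SHOW OBJECTS",)
    counted: list[str] = field(default_factory=list)

    def record(self, query: str) -> None:
        # Filtered queries never reach the counted list.
        if query.upper().startswith(self.ignored_prefixes):
            return
        self.counted.append(query)


def issued_queries(redundant_row_count: bool) -> list[str]:
    """Illustrative query stream, with or without the duplicate metadata query."""
    queries = ["SHOW OBJECTS LIKE 'T'", "SELECT * FROM T"]
    if redundant_row_count:
        queries.insert(1, "SHOW OBJECTS LIKE 'T'")
    return queries


def count_visible(redundant_row_count: bool) -> int:
    """Number of queries the filtered counter actually observes."""
    counter = SqlCounter()
    for query in issued_queries(redundant_row_count):
        counter.record(query)
    return len(counter.counted)
```

In this sketch `count_visible(True)` and `count_visible(False)` return the same value even though one stream contains an extra query, which is the blind spot described above. An unfiltered measurement, such as a benchmark that times the whole call, still pays for the extra round trip.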
1 parent 3da0012 commit 20b4d51

3 files changed

Lines changed: 24 additions & 3 deletions

CHANGELOG.md

Lines changed: 8 additions & 1 deletion
@@ -6,6 +6,14 @@
 
 #### New Features
 
+### Snowpark pandas API Updates
+
+#### New Features
+
+#### Improvements
+- Hybrid execution mode is now enabled by default. Certain operations on smaller data will now automatically execute in native pandas in-memory. Use `from modin.config import AutoSwitchBackend; AutoSwitchBackend.disable()` to turn this off and force all execution to occur in Snowflake.
+- Removed an unnecessary `SHOW OBJECTS` query issued from `read_snowflake` under certain conditions.
+
 ## 1.39.0 (YYYY-MM-DD)
 
 ### Snowpark Python API Updates
@@ -73,7 +81,6 @@
 
 #### Improvements
 
-- Hybrid execution mode is now enabled by default. Certain operations on smaller data will now automatically execute in native pandas in-memory. Use `from modin.config import AutoSwitchBackend; AutoSwitchBackend.disable()` to turn this off and force all execution to occur in Snowflake.
 - Downgraded to level `logging.DEBUG - 1` the log message saying that the
   Snowpark `DataFrame` reference of an internal `DataFrameReference` object
   has changed.

src/snowflake/snowpark/modin/plugin/_internal/utils.py

Lines changed: 11 additions & 2 deletions
@@ -322,7 +322,9 @@ def _create_read_only_table(
 def create_initial_ordered_dataframe(
     table_name_or_query: Union[str, Iterable[str]],
     enforce_ordering: bool,
+    *,
     dummy_row_pos_mode: bool = False,
+    row_count_hint: Optional[int] = None,
 ) -> tuple[OrderedDataFrame, str]:
     """
     create read only temp table on top of the existing table or Snowflake query if required, and create a OrderedDataFrame
@@ -334,6 +336,11 @@ def create_initial_ordered_dataframe(
         enforce_ordering: If True, create a read only temp table on top of the existing table or Snowflake query,
             and create the OrderedDataFrame using the read only temp table created.
             Otherwise, directly using the existing table.
+        dummy_row_pos_mode: If True, uses "dummy" row position columns to avoid a potentially
+            expensive ROW_NUMBER() query.
+        row_count_hint: An optional hint for the exact row count of the frame. This is used in scenarios
+            where we have already performed a query for the size of the underlying data, and can re-use
+            the value.
 
     Returns:
         OrderedDataFrame with row position column.
@@ -502,8 +509,10 @@ def create_initial_ordered_dataframe(
         ordered_dataframe.row_position_snowflake_quoted_identifier
     )
 
-    materialized_row_count = None
-    if not is_query:
+    if row_count_hint is not None:
+        ordered_dataframe.row_count = row_count_hint
+        ordered_dataframe.row_count_upper_bound = row_count_hint
+    elif not is_query:
         materialized_row_count = get_object_metadata_row_count(table_name_or_query)
         ordered_dataframe.row_count = materialized_row_count
         ordered_dataframe.row_count_upper_bound = materialized_row_count
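
The branch above can be exercised in isolation with the following sketch. It mirrors only the `row_count_hint` / `is_query` logic from the diff; `FakeOrderedFrame`, `fake_metadata_row_count`, and `set_row_count` are hypothetical helpers invented for illustration and are not part of Snowpark.

```python
from typing import Optional


class FakeOrderedFrame:
    """Hypothetical stand-in for OrderedDataFrame; only the row count fields."""

    def __init__(self) -> None:
        self.row_count: Optional[int] = None
        self.row_count_upper_bound: Optional[int] = None


# Records every simulated metadata query so the number of calls can be inspected.
metadata_calls: list[str] = []


def fake_metadata_row_count(table_name: str) -> int:
    """Stand-in for the metadata row count query (assumed to cost a round trip)."""
    metadata_calls.append(table_name)
    return 1_000_000


def set_row_count(
    frame: FakeOrderedFrame,
    table_name: str,
    *,
    is_query: bool,
    row_count_hint: Optional[int] = None,
) -> None:
    """Same branching as the diff: prefer the hint, else query table metadata."""
    if row_count_hint is not None:
        frame.row_count = row_count_hint
        frame.row_count_upper_bound = row_count_hint
    elif not is_query:
        count = fake_metadata_row_count(table_name)
        frame.row_count = count
        frame.row_count_upper_bound = count


# The first frame pays for the metadata query; the second re-uses its result.
relaxed_frame = FakeOrderedFrame()
set_row_count(relaxed_frame, "T", is_query=False)

snowflake_frame = FakeOrderedFrame()
set_row_count(snowflake_frame, "T", is_query=False, row_count_hint=relaxed_frame.row_count)
```

After both calls, `metadata_calls` holds a single entry, so only one simulated metadata query was issued. Note that the check is `is not None` rather than a truthiness test, so a hint of `0` (an empty table) is still re-used instead of triggering a second query.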

src/snowflake/snowpark/modin/plugin/compiler/snowflake_query_compiler.py

Lines changed: 5 additions & 0 deletions
@@ -1479,6 +1479,11 @@ def from_snowflake(
             table_name_or_query=name_or_query,
             enforce_ordering=enforce_ordering,
             dummy_row_pos_mode=dummy_row_pos_mode,
+            row_count_hint=(
+                relaxed_query_compiler._modin_frame.ordered_dataframe.row_count
+                if relaxed_query_compiler is not None
+                else None
+            ),
         )
         pandas_labels_to_snowflake_quoted_identifiers_map = {
             # pandas labels of resulting Snowpark pandas dataframe will be snowflake identifier
