[GH-2037] Implement _row_wise_operation + intersection, intersect#2038
Conversation
f31308c to
b7d73c0
Compare
|
@zhangfengcdt For these row-wise operations, the default align=True matches elements using Spark internal index column. I'm planning to never support |
7828511 to
6b7510b
Compare
There was a problem hiding this comment.
Pull Request Overview
Implements row-wise spatial operations (intersects and intersection) in GeoSeries and updates related tests.
- Added support for multi-column queries and optional DataFrame override in
_query_geometry_column - Introduced
_row_wise_operationhelper, andintersects/intersectionmethods inGeoSeries - Extended tests in both
test_match_geopandas_series.pyandtest_geoseries.pyto cover the new methods and handle empty-geometry mismatches
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/sedona/geopandas/geoseries.py | Added PS_INDEX_COL, updated _query_geometry_column, and implemented _row_wise_operation, intersects, and intersection |
| python/tests/geopandas/test_match_geopandas_series.py | Added test_intersects/test_intersection, and skipped mismatched empty geometries in comparison |
| python/tests/geopandas/test_geoseries.py | Added test_intersects/test_intersection, and skipped mismatched empty geometries in comparison |
| raise NotImplementedError("Sedona Geopandas does not support align=False") | ||
|
|
||
| if isinstance(other, BaseGeometry): | ||
| other = GeoSeries([other] * len(self)) |
There was a problem hiding this comment.
When wrapping a single BaseGeometry into a GeoSeries, the original index is not preserved. This can lead to incorrect alignment when the series has a custom index. Consider passing index=self.index to the constructor, e.g., GeoSeries([other] * len(self), index=self.index).
| other = GeoSeries([other] * len(self)) | |
| other = GeoSeries([other] * len(self), index=self.index) |
There was a problem hiding this comment.
Reverted this too for the same reason as the below change. This is not a left join, so using self.index isn't right. It also caused tests to fail. Will revisit later once we add index support.
| other_df = other._internal.spark_frame.select( | ||
| col(other.get_first_geometry_column()).alias("R"), col(PS_INDEX_COL) | ||
| ) | ||
| joined_df = df.join(other_df, on=PS_INDEX_COL, how="outer") |
There was a problem hiding this comment.
Using an outer join can introduce rows for indices that exist only in one series, potentially yielding NULL geometries and unexpected results. To better mirror pandas-style alignment for row-wise operations, consider using a left join (how="left") or explicitly handling missing values before applying the spatial predicate.
| joined_df = df.join(other_df, on=PS_INDEX_COL, how="outer") | |
| joined_df = df.join(other_df, on=PS_INDEX_COL, how="left") |
There was a problem hiding this comment.
This originally made sense, but it caused the match geopandas tests to fail. After testing it out on various inputs manually, it does seem to do an outer join 🤷
I think the API is to match on the order of geopandas index column (not spark dataframe index). This is actually useful because it is the only way the user would be able to align other columns with the resulting I am thinking if there is index created and users want to do align, we should add the original index column to the SQL sent to Sedona, and resulting PySpark Series should keep this index column in it. This way, end users could link the index to intersects results. |
fe07681 to
282b983
Compare
|
@zhangfengcdt I think we're mostly on the same page actually. The spark index column ( # A function to turn given numbers to Spark columns that represent pandas-on-Spark index.
SPARK_INDEX_NAME_FORMAT = "__index_level_{}__".format
SPARK_DEFAULT_INDEX_NAME = SPARK_INDEX_NAME_FORMAT(0)
If no index is given, pandas on pyspark creates a default index which we can use for the Originally, I was proposing not to support Regardless, if the current default |
282b983 to
18d069e
Compare
Cool, that works for me. |
|
Adding traditional pandas index support turned out much simpler than I expected. That fix along with |
…ct (apache#2038) * Refactor process_geometry_column to create a more flexible query_geometry_column() * Implement length() * Implement intersection * Implement intersects * Add rename field to _row_wise_operation and provide select instead of operation * Replace PS_INDEX_COL w/ imported SPARK_DEFAULT_INDEX_NAME
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject. Closes Geopandas.GeoSeries: Implement intersection #2037What changes were proposed in this PR?
Implement helper function for row_wise operations + intersection, intersect
How was this patch tested?
Added tests
Did this PR include necessary documentation updates?