Skip to content

[GH-2037] Implement _row_wise_operation + intersection, intersect#2038

Merged
jiayuasu merged 8 commits into
apache:masterfrom
petern48:intersection_series
Jul 2, 2025
Merged

[GH-2037] Implement _row_wise_operation + intersection, intersect#2038
jiayuasu merged 8 commits into
apache:masterfrom
petern48:intersection_series

Conversation

@petern48
Copy link
Copy Markdown
Member

@petern48 petern48 commented Jul 1, 2025

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Implement helper function for row_wise operations + intersection, intersect

How was this patch tested?

Added tests

Did this PR include necessary documentation updates?

  • Yes, I have updated the documentation.

@petern48 petern48 force-pushed the intersection_series branch from f31308c to b7d73c0 Compare July 1, 2025 17:44
@petern48 petern48 marked this pull request as ready for review July 1, 2025 22:39
@petern48 petern48 requested a review from jiayuasu as a code owner July 1, 2025 22:39
@petern48
Copy link
Copy Markdown
Member Author

petern48 commented Jul 1, 2025

@zhangfengcdt For these row-wise operations, the default align=True matches elements using Spark internal index column. I'm planning to never support align=False in any of these row_wise functions in Sedona because Spark dataframes do not naturally support a deterministic order to my understanding. It's possible to hack something up with an extra column in the cases where the user specifies an index, but I don't think it makes sense to support that to be honest.
What do you think?

@petern48 petern48 force-pushed the intersection_series branch from 7828511 to 6b7510b Compare July 2, 2025 03:58
@zhangfengcdt zhangfengcdt requested a review from Copilot July 2, 2025 17:10
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Implements row-wise spatial operations (intersects and intersection) in GeoSeries and updates related tests.

  • Added support for multi-column queries and optional DataFrame override in _query_geometry_column
  • Introduced _row_wise_operation helper, and intersects/intersection methods in GeoSeries
  • Extended tests in both test_match_geopandas_series.py and test_geoseries.py to cover the new methods and handle empty-geometry mismatches

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
python/sedona/geopandas/geoseries.py Added PS_INDEX_COL, updated _query_geometry_column, and implemented _row_wise_operation, intersects, and intersection
python/tests/geopandas/test_match_geopandas_series.py Added test_intersects/test_intersection, and skipped mismatched empty geometries in comparison
python/tests/geopandas/test_geoseries.py Added test_intersects/test_intersection, and skipped mismatched empty geometries in comparison

raise NotImplementedError("Sedona Geopandas does not support align=False")

if isinstance(other, BaseGeometry):
other = GeoSeries([other] * len(self))
Copy link

Copilot AI Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When wrapping a single BaseGeometry into a GeoSeries, the original index is not preserved. This can lead to incorrect alignment when the series has a custom index. Consider passing index=self.index to the constructor, e.g., GeoSeries([other] * len(self), index=self.index).

Suggested change
other = GeoSeries([other] * len(self))
other = GeoSeries([other] * len(self), index=self.index)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted this too for the same reason as the below change. This is not a left join, so using self.index isn't right. It also caused tests to fail. Will revisit later once we add index support.

Comment thread python/sedona/geopandas/geoseries.py Outdated
other_df = other._internal.spark_frame.select(
col(other.get_first_geometry_column()).alias("R"), col(PS_INDEX_COL)
)
joined_df = df.join(other_df, on=PS_INDEX_COL, how="outer")
Copy link

Copilot AI Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an outer join can introduce rows for indices that exist only in one series, potentially yielding NULL geometries and unexpected results. To better mirror pandas-style alignment for row-wise operations, consider using a left join (how="left") or explicitly handling missing values before applying the spatial predicate.

Suggested change
joined_df = df.join(other_df, on=PS_INDEX_COL, how="outer")
joined_df = df.join(other_df, on=PS_INDEX_COL, how="left")

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This originally made sense, but it caused the match geopandas tests to fail. After testing it out on various inputs manually, it does seem to do an outer join 🤷

@zhangfengcdt
Copy link
Copy Markdown
Member

@zhangfengcdt For these row-wise operations, the default align=True matches elements using Spark internal index column. I'm planning to never support align=False in any of these row_wise functions in Sedona because Spark dataframes do not naturally support a deterministic order to my understanding. It's possible to hack something up with an extra column in the cases where the user specifies an index, but I don't think it makes sense to support that to be honest. What do you think?

I think the API is to match on the order of geopandas index column (not spark dataframe index). This is actually useful because it is the only way the user would be able to align other columns with the resulting intersects results. And this seems to be a pretty common case. However, if no index is used in the GeoSeries creation, then we don't need to support alignment.

I am thinking if there is index created and users want to do align, we should add the original index column to the SQL sent to Sedona, and resulting PySpark Series should keep this index column in it. This way, end users could link the index to intersects results.

@petern48 petern48 force-pushed the intersection_series branch from fe07681 to 282b983 Compare July 2, 2025 18:25
@petern48
Copy link
Copy Markdown
Member Author

petern48 commented Jul 2, 2025

@zhangfengcdt I think we're mostly on the same page actually. The spark index column (__index_level_{}__) I'm using does actually represent the index in geopandas. See the comment in the from the pyspark codebase here below

# A function to turn given numbers to Spark columns that represent pandas-on-Spark index.
SPARK_INDEX_NAME_FORMAT = "__index_level_{}__".format
SPARK_DEFAULT_INDEX_NAME = SPARK_INDEX_NAME_FORMAT(0)

However, if no index is used in the GeoSeries creation, then we don't need to support alignment

If no index is given, pandas on pyspark creates a default index which we can use for the align=True. This is what the current tests use since we don't yet have index support.

Originally, I was proposing not to support align=False, where geopandas uses the "natural ordering" of the series instead of the given index. However, it looks like Pandas on PySpark does already have a hidden natural ordering column, so we can try using that.

Regardless, if the current default align=True logic sounds good to you, I'd rather merge this in now and revisit additional functionality (align=False) later when we add indexes (creating a separate issue of course). Does that make sense?

@petern48 petern48 force-pushed the intersection_series branch from 282b983 to 18d069e Compare July 2, 2025 18:47
@zhangfengcdt
Copy link
Copy Markdown
Member

@zhangfengcdt I think we're mostly on the same page actually. The spark index column (__index_level_{}__) I'm using does actually represent the index in geopandas. See the comment in the from the pyspark codebase here below

# A function to turn given numbers to Spark columns that represent pandas-on-Spark index.
SPARK_INDEX_NAME_FORMAT = "__index_level_{}__".format
SPARK_DEFAULT_INDEX_NAME = SPARK_INDEX_NAME_FORMAT(0)

However, if no index is used in the GeoSeries creation, then we don't need to support alignment

If no index is given, pandas on pyspark creates a default index which we can use for the align=True. This is what the current tests use since we don't yet have index support.

Originally, I was proposing not to support align=False, where geopandas uses the "natural ordering" of the series instead of the given index. However, it looks like Pandas on PySpark does already have a hidden natural ordering column, so we can try using that.

Regardless, if the current default align=True logic sounds good to you, I'd rather merge this in now and revisit additional functionality (align=False) later when we add indexes (creating a separate issue of course). Does that make sense?

Cool, that works for me.

Copy link
Copy Markdown
Member

@zhangfengcdt zhangfengcdt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jiayuasu jiayuasu merged commit f8fe6f3 into apache:master Jul 2, 2025
26 checks passed
@petern48 petern48 deleted the intersection_series branch July 2, 2025 22:03
@petern48
Copy link
Copy Markdown
Member Author

petern48 commented Jul 3, 2025

Adding traditional pandas index support turned out much simpler than I expected. That fix along with align=False support is this new PR: #2057

Kontinuation pushed a commit to Kontinuation/sedona that referenced this pull request Jan 21, 2026
…ct (apache#2038)

* Refactor process_geometry_column to create a more flexible query_geometry_column()

* Implement length()

* Implement intersection

* Implement intersects

* Add rename field to _row_wise_operation and provide select instead of operation

* Replace PS_INDEX_COL w/ imported SPARK_DEFAULT_INDEX_NAME
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Geopandas.GeoSeries: Implement intersection

4 participants