[GH-2037] Implement _row_wise_operation + intersection, intersect by petern48 · Pull Request #2038 · apache/sedona

petern48 · 2025-07-01T17:05:47Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Development Guide

Is this PR related to a ticket?

Yes, and the PR name follows the format [GH-XXX] my subject. Closes Geopandas.GeoSeries: Implement intersection #2037

What changes were proposed in this PR?

Implement a helper function for row-wise operations (_row_wise_operation), plus intersection and intersects.

How was this patch tested?

Added tests

Did this PR include necessary documentation updates?

Yes, I have updated the documentation.

petern48 · 2025-07-01T23:39:25Z

@zhangfengcdt For these row-wise operations, the default align=True matches elements using Spark internal index column. I'm planning to never support align=False in any of these row_wise functions in Sedona because Spark dataframes do not naturally support a deterministic order to my understanding. It's possible to hack something up with an extra column in the cases where the user specifies an index, but I don't think it makes sense to support that to be honest.
What do you think?

Copilot

Pull Request Overview

Implements row-wise spatial operations (intersects and intersection) in GeoSeries and updates related tests.

Added support for multi-column queries and optional DataFrame override in _query_geometry_column
Introduced _row_wise_operation helper, and intersects/intersection methods in GeoSeries
Extended tests in both test_match_geopandas_series.py and test_geoseries.py to cover the new methods and handle empty-geometry mismatches

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
python/sedona/geopandas/geoseries.py	Added `PS_INDEX_COL`, updated `_query_geometry_column`, and implemented `_row_wise_operation`, `intersects`, and `intersection`
python/tests/geopandas/test_match_geopandas_series.py	Added `test_intersects`/`test_intersection`, and skipped mismatched empty geometries in comparison
python/tests/geopandas/test_geoseries.py	Added `test_intersects`/`test_intersection`, and skipped mismatched empty geometries in comparison

Copilot · 2025-07-02T17:12:45Z

+            raise NotImplementedError("Sedona Geopandas does not support align=False")
+
+        if isinstance(other, BaseGeometry):
+            other = GeoSeries([other] * len(self))


When wrapping a single BaseGeometry into a GeoSeries, the original index is not preserved. This can lead to incorrect alignment when the series has a custom index. Consider passing index=self.index to the constructor, e.g., GeoSeries([other] * len(self), index=self.index).

Suggested change

-        other = GeoSeries([other] * len(self))
+        other = GeoSeries([other] * len(self), index=self.index)

petern48 · Jul 2, 2025

Reverted this too, for the same reason as the change below. This is not a left join, so using self.index isn't right. It also caused tests to fail. Will revisit later once we add index support.

Copilot · 2025-07-02T17:12:45Z

+        other_df = other._internal.spark_frame.select(
+            col(other.get_first_geometry_column()).alias("R"), col(PS_INDEX_COL)
+        )
+        joined_df = df.join(other_df, on=PS_INDEX_COL, how="outer")


Using an outer join can introduce rows for indices that exist only in one series, potentially yielding NULL geometries and unexpected results. To better mirror pandas-style alignment for row-wise operations, consider using a left join (how="left") or explicitly handling missing values before applying the spatial predicate.

Suggested change

-        joined_df = df.join(other_df, on=PS_INDEX_COL, how="outer")
+        joined_df = df.join(other_df, on=PS_INDEX_COL, how="left")

petern48 · Jul 2, 2025

This suggestion made sense to me at first, but it caused the match-geopandas tests to fail. After testing various inputs manually, geopandas does seem to do an outer join 🤷
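To make the join discussion concrete, here is a minimal pure-Python sketch of outer versus left alignment on an index. The dicts stand in for index-to-geometry mappings, and `align_on_index` is a hypothetical helper written for illustration; it is not part of Sedona, geopandas, or Spark.

```python
from typing import Dict, Hashable, List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align_on_index(
    left: Dict[Hashable, str],
    right: Dict[Hashable, str],
    how: str = "outer",
) -> Dict[Hashable, Pair]:
    """Pair values from two index -> value mappings (hypothetical helper).

    how="outer" keeps every index label found on either side.
    how="left" keeps only the labels present in `left`.
    A value missing on one side becomes None (NULL in Spark terms).
    """
    if how == "outer":
        # Left labels first, then labels that only exist on the right.
        labels: List[Hashable] = list(left) + [k for k in right if k not in left]
    elif how == "left":
        labels = list(left)
    else:
        raise ValueError(f"unsupported join type: {how!r}")
    return {k: (left.get(k), right.get(k)) for k in labels}


left = {0: "POINT (0 0)", 1: "POINT (1 1)"}
right = {1: "POINT (1 1)", 2: "POINT (2 2)"}

outer = align_on_index(left, right, how="outer")
left_only = align_on_index(left, right, how="left")
```

With these inputs the outer result has a row for label 2 even though it exists only in `right`, while the left result drops it. That extra row is the difference the failing tests were sensitive to.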

zhangfengcdt · 2025-07-02T17:33:28Z

I think the API is to match on the order of geopandas index column (not spark dataframe index). This is actually useful because it is the only way the user would be able to align other columns with the resulting intersects results. And this seems to be a pretty common case. However, if no index is used in the GeoSeries creation, then we don't need to support alignment.

I am thinking if there is index created and users want to do align, we should add the original index column to the SQL sent to Sedona, and resulting PySpark Series should keep this index column in it. This way, end users could link the index to intersects results.
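The point about keeping the index can be illustrated with a small sketch. It assumes closed 1-D intervals as stand-ins for geometries and plain dicts as stand-ins for indexed series; the names `parcels`, `zones`, and `names` are made up for the example, and this is not how Sedona evaluates the predicate.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]  # closed 1-D interval standing in for a geometry


def intersects(a: Interval, b: Interval) -> bool:
    """Return True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Two "series" sharing the same index labels, plus an attribute column.
parcels: Dict[str, Interval] = {"a": (0.0, 2.0), "b": (5.0, 6.0)}
zones: Dict[str, Interval] = {"a": (1.0, 3.0), "b": (7.0, 8.0)}
names: Dict[str, str] = {"a": "Parcel A", "b": "Parcel B"}

# Row-wise predicate: the result stays keyed by the index label.
hits = {label: intersects(parcels[label], zones[label]) for label in parcels}

# Because the index survives, results can be linked back to other columns.
hit_names = [names[label] for label, hit in hits.items() if hit]
```

If the result were returned without its index labels, the last step would have nothing to join on.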

petern48 · 2025-07-02T18:40:23Z

@zhangfengcdt I think we're mostly on the same page, actually. The Spark index column (__index_level_{}__) I'm using does represent the index in geopandas. See the comment from the pyspark codebase below:

# A function to turn given numbers to Spark columns that represent pandas-on-Spark index.
SPARK_INDEX_NAME_FORMAT = "__index_level_{}__".format
SPARK_DEFAULT_INDEX_NAME = SPARK_INDEX_NAME_FORMAT(0)
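As a quick check of what these two lines evaluate to, the sketch below re-declares them locally instead of importing them from pyspark, so it runs without Spark installed. The `level_names` list is only an illustration of how the format function names successive index levels.

```python
# Local copies of the two definitions quoted above (not imported from pyspark).
SPARK_INDEX_NAME_FORMAT = "__index_level_{}__".format
SPARK_DEFAULT_INDEX_NAME = SPARK_INDEX_NAME_FORMAT(0)

# Column names for the first two index levels, for illustration only.
level_names = [SPARK_INDEX_NAME_FORMAT(i) for i in range(2)]

print(SPARK_DEFAULT_INDEX_NAME)
print(level_names)
```

The default index column is therefore named `__index_level_0__`, which is the column the join in this PR matches on.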

"However, if no index is used in the GeoSeries creation, then we don't need to support alignment"

If no index is given, pandas on pyspark creates a default index which we can use for the align=True. This is what the current tests use since we don't yet have index support.

Originally, I was proposing not to support align=False, where geopandas uses the "natural ordering" of the series instead of the given index. However, it looks like Pandas on PySpark does already have a hidden natural ordering column, so we can try using that.
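The difference between the two modes can be sketched in plain Python. Each series is modelled as a list of (index label, value) pairs in row order; `pair_by_position` and `pair_by_label` are hypothetical helpers for illustration, and the label version is simplified to the case where both sides carry the same labels.

```python
from typing import Hashable, List, Optional, Tuple

Rows = List[Tuple[Hashable, str]]  # (index label, value) in row order
Paired = List[Tuple[Hashable, str, Optional[str]]]


def pair_by_position(left: Rows, right: Rows) -> Paired:
    """align=False style: match the i-th row with the i-th row, ignoring labels."""
    if len(left) != len(right):
        raise ValueError("positional pairing needs series of equal length")
    return [(label, lval, rval) for (label, lval), (_, rval) in zip(left, right)]


def pair_by_label(left: Rows, right: Rows) -> Paired:
    """align=True style (simplified): match rows that share an index label."""
    right_by_label = dict(right)
    return [(label, lval, right_by_label.get(label)) for label, lval in left]


left = [(0, "A"), (1, "B")]
right = [(1, "B"), (0, "A")]  # same labels, different row order

by_position = pair_by_position(left, right)
by_label = pair_by_label(left, right)
```

Here positional pairing compares A with B and B with A, while label pairing compares A with A and B with B. Positional pairing is only meaningful when row order is stable, which is why a natural-ordering column matters for align=False.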

Regardless, if the current default align=True logic sounds good to you, I'd rather merge this in now and revisit additional functionality (align=False) later when we add indexes (creating a separate issue of course). Does that make sense?

zhangfengcdt · 2025-07-02T19:18:41Z

Cool, that works for me.

zhangfengcdt

LGTM

petern48 · 2025-07-03T17:09:30Z

Adding traditional pandas index support turned out much simpler than I expected. That fix along with align=False support is this new PR: #2057

…ct (apache#2038) * Refactor process_geometry_column to create a more flexible query_geometry_column() * Implement length() * Implement intersection * Implement intersects * Add rename field to _row_wise_operation and provide select instead of operation * Replace PS_INDEX_COL w/ imported SPARK_DEFAULT_INDEX_NAME

petern48 added 4 commits June 28, 2025 12:00

Refactor process_geometry_column to create a more flexible query_geometry_column()

6d0a061

Implement length()

498ea1b

Implement intersection

bf52b88

Implement intersects

0592282

github-actions Bot added the sedona-python label Jul 1, 2025

petern48 requested a review from zhangfengcdt July 1, 2025 17:12

Merge branch 'master' into intersection_series

b7d73c0

petern48 force-pushed the intersection_series branch from f31308c to b7d73c0 Compare July 1, 2025 17:44

Merge branch 'master' into intersection_series

deb8554

petern48 marked this pull request as ready for review July 1, 2025 22:39

petern48 requested a review from jiayuasu as a code owner July 1, 2025 22:39

Add rename field to _row_wise_operation and provide select instead of operation

6b7510b

petern48 force-pushed the intersection_series branch from 7828511 to 6b7510b Compare July 2, 2025 03:58

zhangfengcdt requested a review from Copilot July 2, 2025 17:10

Copilot AI reviewed Jul 2, 2025

petern48 force-pushed the intersection_series branch from fe07681 to 282b983 Compare July 2, 2025 18:25

Replace PS_INDEX_COL w/ imported SPARK_DEFAULT_INDEX_NAME

18d069e

petern48 force-pushed the intersection_series branch from 282b983 to 18d069e Compare July 2, 2025 18:47

zhangfengcdt approved these changes Jul 2, 2025

This was referenced Jul 2, 2025

Geopandas.GeoSeries: Implement proper index behavior for _row_wise_operations #2049

Closed

Geopandas.GeoSeries: Implement align=False behavior for _row_wise_operation + support index #2050

Closed

jiayuasu approved these changes Jul 2, 2025

jiayuasu added improvement affect public APIs labels Jul 2, 2025

jiayuasu merged commit f8fe6f3 into apache:master Jul 2, 2025
26 checks passed

petern48 deleted the intersection_series branch July 2, 2025 22:03

jiayuasu merged 8 commits into apache:master from petern48:intersection_series
