Skip to content

[GH-2013] Geopandas.Series: Implement crs property and set_crs#2014

Merged
jiayuasu merged 12 commits into
apache:masterfrom
petern48:series_crs
Jul 1, 2025
Merged

[GH-2013] Geopandas.Series: Implement crs property and set_crs#2014
jiayuasu merged 12 commits into
apache:masterfrom
petern48:series_crs

Conversation

@petern48
Copy link
Copy Markdown
Member

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Implement crs property and set_crs()

How was this patch tested?

Added new tests

Did this PR include necessary documentation updates?

  • Yes, I have updated the documentation.

@petern48 petern48 requested a review from jiayuasu as a code owner June 26, 2025 00:22
@petern48 petern48 requested review from zhangfengcdt and removed request for jiayuasu June 26, 2025 00:22
@petern48
Copy link
Copy Markdown
Member Author

petern48 commented Jun 26, 2025

inplace is not yet working. I tried fixing it a few ways (e.g modifying the internal frame or anchor), but it didn't work. I prefer to just move on for now since it's not the most important feature. Implementing inplace for all functions should be the same.

EDIT: nvm i got it

@petern48
Copy link
Copy Markdown
Member Author

Ready for review @zhangfengcdt. I'll also start properly doing draft PRs in the future and open for review when they're ready.

Copy link
Copy Markdown
Member

@zhangfengcdt zhangfengcdt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, just one inline comment

Comment thread python/sedona/geopandas/geoseries.py
@jiayuasu jiayuasu requested a review from Copilot June 27, 2025 20:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR implements a new crs property and a set_crs method for GeoSeries, and adds corresponding tests.

  • Updates tests in both test_match_geopandas_series.py and test_geoseries.py to validate CRS handling.
  • Implements CRS assignment, error handling, and internal processing changes in geoseries.py.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
python/tests/geopandas/test_match_geopandas_series.py Added tests comparing crs behavior against geopandas.GeoSeries
python/tests/geopandas/test_geoseries.py Introduced tests for set_crs functionality including error cases
python/sedona/geopandas/geoseries.py Implemented crs property, its setter, and set_crs with improved SQL alias and error handling
Comments suppressed due to low confidence (1)

python/sedona/geopandas/geoseries.py:378

  • Ensure that BinaryType is imported from pyspark.sql.types to make the code self-contained and clearer.
            if isinstance(data_type, BinaryType):

Comment thread python/sedona/geopandas/geoseries.py Outdated
petern48 and others added 2 commits June 28, 2025 18:49
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@petern48 petern48 changed the title Geopandas.Series: Implement crs property and set_crs [GH-2013] Geopandas.Series: Implement crs property and set_crs Jun 29, 2025
Comment thread python/sedona/geopandas/geoseries.py Outdated
GeoSeries.to_crs : re-project to another CRS
"""
tmp_df = self._process_geometry_column("ST_SRID", rename="crs")
srid = tmp_df.to_pandas().iloc[0]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that to_pandas() function will collect the distributed Sedona DF to the master node. This will likely cause OOM on the master. The best way to do this is via Spark's take() which only takes the first value of the distributed DF without collecting the entire DF.

In addition, this function assumes all records in the DF has the same CRS/SRID which is not true for Sedona in many cases. It is worth mentioning this in the docs.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also do tmp_df._to_spark_pandas_df() that returns the pspd.DataFrame (pyspark).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Yeah I remember realizing that, but I forgot to circle back to figure out how to remove the to_pandas.

Also added the doc change. Slight nit about your wording: all CRS have to be the same in the Series, not the DF. Different series are still allowed to have different CRS in geopandas.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should overwrite the two functions to_pandas() and to_geopandas in GeoSeries to disable end users from performing these operations directly. We can still keep internal versions of them _to_pandas and _to_geopandas if needed by the implementation itself.

In PySpark, it keeps the to_pandas() that warn users that

        log_advice(
            "`to_pandas` loads all data into the driver's memory. "
            "It should only be used if the resulting pandas Series is expected to be small."
        )

And we should at least do this in to_geopandas() as well if we don't like the idea of disabling it.

Copy link
Copy Markdown
Member Author

@petern48 petern48 Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally prefer keeping the public method, so I added the warning message to to_geopandas().

Comment thread python/sedona/geopandas/geoseries.py
Comment thread python/sedona/geopandas/geoseries.py Outdated

import geopandas as gpd
import pandas as pd
from pyproj import CRS
Copy link
Copy Markdown
Member

@jiayuasu jiayuasu Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@petern48 can you import pyproj only when it is needed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, just did. Didn't realize I had a top level import

@petern48 petern48 requested a review from jiayuasu June 30, 2025 21:59
@jiayuasu jiayuasu added this to the sedona-1.8.0 milestone Jul 1, 2025
@jiayuasu jiayuasu merged commit 319e8ee into apache:master Jul 1, 2025
26 checks passed
@petern48 petern48 deleted the series_crs branch July 1, 2025 22:06
Kontinuation pushed a commit to Kontinuation/sedona that referenced this pull request Jan 21, 2026
…pache#2014)

* Implement set_crs and crs property

* Add support to keep old name instead of renaming column for set_crs

* Stop catching (hiding) exceptions in __repr__()

* Add backticks around {rename}

* Implement inplace correctly

* Fix 'the the'

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* fix fmt

* Use .take() instead of to_pandas()

* Add note about the all CRS in the series being the same

* Remove top-level pyproj import

* Add to_geopandas warning msg

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Geopandas.GeoSeries: Implement crs property and set_crs

4 participants