[GH-2013] Geopandas.Series: Implement crs property and set_crs#2014
Conversation
|
EDIT: nvm i got it |
|
Ready for review @zhangfengcdt. I'll also start properly doing draft PRs in the future and open for review when they're ready. |
zhangfengcdt
left a comment
There was a problem hiding this comment.
looks good, just one inline comment
There was a problem hiding this comment.
Pull Request Overview
This PR implements a new crs property and a set_crs method for GeoSeries, and adds corresponding tests.
- Updates tests in both test_match_geopandas_series.py and test_geoseries.py to validate CRS handling.
- Implements CRS assignment, error handling, and internal processing changes in geoseries.py.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| python/tests/geopandas/test_match_geopandas_series.py | Added tests comparing crs behavior against geopandas.GeoSeries |
| python/tests/geopandas/test_geoseries.py | Introduced tests for set_crs functionality including error cases |
| python/sedona/geopandas/geoseries.py | Implemented crs property, its setter, and set_crs with improved SQL alias and error handling |
Comments suppressed due to low confidence (1)
python/sedona/geopandas/geoseries.py:378
- Ensure that BinaryType is imported from pyspark.sql.types to make the code self-contained and clearer.
if isinstance(data_type, BinaryType):
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
| GeoSeries.to_crs : re-project to another CRS | ||
| """ | ||
| tmp_df = self._process_geometry_column("ST_SRID", rename="crs") | ||
| srid = tmp_df.to_pandas().iloc[0] |
There was a problem hiding this comment.
Note that to_pandas() function will collect the distributed Sedona DF to the master node. This will likely cause OOM on the master. The best way to do this is via Spark's take() which only takes the first value of the distributed DF without collecting the entire DF.
In addition, this function assumes all records in the DF has the same CRS/SRID which is not true for Sedona in many cases. It is worth mentioning this in the docs.
There was a problem hiding this comment.
You can also do tmp_df._to_spark_pandas_df() that returns the pspd.DataFrame (pyspark).
There was a problem hiding this comment.
Thanks. Yeah I remember realizing that, but I forgot to circle back to figure out how to remove the to_pandas.
Also added the doc change. Slight nit about your wording: all CRS have to be the same in the Series, not the DF. Different series are still allowed to have different CRS in geopandas.
There was a problem hiding this comment.
I think we should overwrite the two functions to_pandas() and to_geopandas in GeoSeries to disable end users from performing these operations directly. We can still keep internal versions of them _to_pandas and _to_geopandas if needed by the implementation itself.
In PySpark, it keeps the to_pandas() that warn users that
log_advice(
"`to_pandas` loads all data into the driver's memory. "
"It should only be used if the resulting pandas Series is expected to be small."
)
And we should at least do this in to_geopandas() as well if we don't like the idea of disabling it.
There was a problem hiding this comment.
I personally prefer keeping the public method, so I added the warning message to to_geopandas().
|
|
||
| import geopandas as gpd | ||
| import pandas as pd | ||
| from pyproj import CRS |
There was a problem hiding this comment.
@petern48 can you import pyproj only when it is needed?
There was a problem hiding this comment.
Yep, just did. Didn't realize I had a top level import
…pache#2014) * Implement set_crs and crs property * Add support to keep old name instead of renaming column for set_crs * Stop catching (hiding) exceptions in __repr__() * Add backticks around {rename} * Implement inplace correctly * Fix 'the the' Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix fmt * Use .take() instead of to_pandas() * Add note about the all CRS in the series being the same * Remove top-level pyproj import * Add to_geopandas warning msg --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Did you read the Contributor Guide?
Is this PR related to a ticket?
[GH-XXX] my subject.What changes were proposed in this PR?
Implement
crsproperty andset_crs()How was this patch tested?
Added new tests
Did this PR include necessary documentation updates?