[GH-2004] Geopandas.GeoSeries: Implement Test Framework by petern48 · Pull Request #2005 · apache/sedona

petern48 · 2025-06-23T18:36:42Z

Did you read the Contributor Guide?

Yes, I have read the Contributor Rules and Contributor Development Guide

Is this PR related to a ticket?

Yes, and the PR name follows the format [GH-XXX] my subject.
Geopandas.Series: Implement Test Framework #2004

What changes were proposed in this PR?

Implement GeoSeries test skeleton. The goal here is to fully flesh out the common testing code we will use for testing GeoSeries methods. Once we have this merged in, I can rapid-fire GeoSeries method implementations using the same testing structure, occasionally adding function-specific tests when needed.

This PR also fixes a bug in the __repr__() method and changes the return type of .area() to pd.Series to be consistent with the Geopandas behavior.

How was this patch tested?

Add tests for existing functionality

Did this PR include necessary documentation updates?

No, this PR does not affect any public API so no need to change the documentation.

petern48 · 2025-06-23T18:42:35Z

@zhangfengcdt PR definitely isn't fully ready yet, but I want to hear your thoughts. What types of common tests should we include for each function? Originally, I was thinking we'd have type-checking tests (done) and tests comparing our output to the original geopandas function output, but the result of .buffer() is too far off to match the geopandas output successfully. Should we drop that type of test entirely, or keep it for the methods that can pass it? I guess we have to add tests that manually check the output. Any other tests we should add?

As I mentioned in the PR description, I want this PR to flesh out all the common tests we use for every method.

zhangfengcdt · 2025-06-23T19:37:42Z

I think we need both basic tests that manually check the results from the API and tests that match against the geopandas test suites. For the latter, we might get some ideas from the pandas-on-Spark package, especially how assertPandasOnSparkEqual works in the pyspark source.

For the former, Sedona has a test suite for expressions (functions), and we may want to at least cover the cases in there:
https://github.com/apache/sedona/blob/master/spark/common/src/test/scala/org/apache/sedona/sql/functionTestScala.scala

petern48 · 2025-06-24T17:15:43Z

For various reasons, none of the built-in assert methods work out of the box, even when using parameters like checkExact and check_less_precise. I did manage to get it to pass by looping through the results, which we already do elsewhere in our code, and tuning the tolerance a bit. Personally, I think it's good enough for now.
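A minimal sketch of the loop-and-compare idea described above, assuming each geometry has already been reduced to a list of (x, y) tuples. The helper name `assert_coords_almost_equal` and the default tolerance are hypothetical and not taken from the PR; the real tests would compare geometry objects rather than plain tuples.

```python
import math


def assert_coords_almost_equal(actual, expected, abs_tol=1e-6):
    """Compare two sequences of coordinate lists element by element."""
    assert len(actual) == len(expected), "different number of geometries"
    for i, (a_geom, e_geom) in enumerate(zip(actual, expected)):
        assert len(a_geom) == len(e_geom), f"geometry {i}: vertex count differs"
        for (ax, ay), (ex, ey) in zip(a_geom, e_geom):
            # rel_tol=0.0 so that only the absolute tolerance applies.
            assert math.isclose(ax, ex, rel_tol=0.0, abs_tol=abs_tol), f"geometry {i}: x differs"
            assert math.isclose(ay, ey, rel_tol=0.0, abs_tol=abs_tol), f"geometry {i}: y differs"


# Two results that differ only by floating-point noise pass the check.
sedona_like = [[(0.0, 0.0), (1.0, 0.0), (1.0, 1.0000000001)]]
geopandas_like = [[(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]]
assert_coords_almost_equal(sedona_like, geopandas_like)
```

Looping by hand keeps the tolerance in one obvious place, at the cost of less informative failure messages than a library assert helper would give.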

Another reason I'd like to avoid the pyspark testing functions (e.g., assertPandasOnSparkEqual and assertDataFrameEqual) is that they've been removed and added across different versions. They're not available until 3.5.0, and we'd have to start using annoying conditional logic like "if pyspark.__version__ >= 4.0.0, use this function; else if ..., use this one; else skip", etc. It's cleaner and easier to maintain if we just avoid using them altogether.
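To illustrate the kind of version gating being avoided, here is a rough sketch. The function names and returned labels are hypothetical placeholders, and the thresholds simply mirror the ones mentioned in the comment above. Note that the version string has to be parsed first, since comparing version strings directly is unreliable.

```python
def parse_version(version):
    """Turn '3.5.1' into (3, 5, 1); non-numeric suffixes such as 'dev0' are ignored."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    # Pad so that '3.5' compares the same as '3.5.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def pick_assert_helper(version):
    """Hypothetical dispatcher mirroring the if / else-if / else-skip chain."""
    v = parse_version(version)
    if v >= (4, 0, 0):
        return "helper for 4.x"
    if v >= (3, 5, 0):
        return "helper for 3.5.x"
    return "skip"


for candidate in ("3.4.1", "3.5.0", "4.0.0"):
    print(candidate, "->", pick_assert_helper(candidate))
```

Every test that touches such a helper would need to go through a dispatcher like this, which is the maintenance cost the comment argues against.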

petern48 · 2025-06-24T20:59:15Z

I moved the old test code to test_match_geopandas_series.py and created a new test_geoseries.py to mimic the Scala tests you mentioned. I guess for ST_AREA and ST_BUFFER, functionTestScala.scala doesn't test for exact output. Is this still fine @zhangfengcdt?

zhangfengcdt · 2025-06-24T23:59:16Z

Looks great! I think for the first step, we could target covering whatever these scala tests cover in exact results.

petern48 · 2025-06-25T05:50:02Z

That made perfect sense originally, but now I see why it wasn't done in Scala. The test files we're using (self.mixedWktGeometryInputLocation) have 100 entries. Hard-coding 100 polygons for a single test (e.g., for ST_buffer) would make reading and navigating the test file a real pain.

Another thought. Originally, I had a test checking if the sedona sql function's result matched our new sgpd result. However, we agreed to remove it because it seemed trivial that of course the results would match. Following that same logic, if we can assume the new sedona geopandas results match the sedona sql results, then why do we need to replicate the Scala tests again if we already know sedona sql passes it? Or maybe we should just not make that assumption (in that case maybe it is worth bringing back that old sgpd vs sedona test). Personally, I'm fine with making the assumption. If anything, maybe we should hard-code results for the test_match_geopandas_series.py tests since those results are smaller. WDYT @zhangfengcdt?

zhangfengcdt · 2025-06-25T17:17:35Z

I think the reason we still need the hard-coded tests is that they don't simply replicate the Scala expression tests; they should extend the Scala cases. Basically, the goals are:

Verify that the SQL generated to convert geopandas calls into Sedona queries is as expected;
Verify manually that the results match, or document the reason for any difference, between the original geopandas and geopandas on Sedona.

This means we will very possibly cover more cases than the scala expression tests.

petern48 · 2025-06-25T17:30:17Z

Yes, that makes sense, but the hard-coded outputs from the original Scala tests would clutter up the codebase too much due to their sheer size (possibly 1000+ lines for some test cases). IMO, it's not reasonable to hard-code those outputs. At that point, if we're not extending the Scala tests, is it still worth replicating those tests exactly in Python (test_geoseries.py)?

How do you feel about hard-coding the geopandas match test cases instead? Those ones are smaller, so the expected results won't be too bad.

zhangfengcdt · 2025-06-25T17:33:06Z

Oh, I meant we can replicate the types of checks from that Scala test, not the exact data. We can use smaller inputs for sure.
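As a rough illustration of "same kind of check, smaller inputs", the sketch below hard-codes expected values for two tiny polygons. The `shoelace_area` function is only a stand-in for the method under test (it is not Sedona or geopandas code), and the test name is hypothetical.

```python
def shoelace_area(ring):
    """Stand-in for the method under test: area of a closed (x, y) ring."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def test_area_small_inputs():
    # A handful of tiny geometries keeps the expected values readable inline.
    polygons = [
        [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)],  # unit square
        [(0, 0), (4, 0), (0, 3), (0, 0)],          # right triangle
    ]
    expected = [1.0, 6.0]
    result = [shoelace_area(p) for p in polygons]
    assert result == expected


test_area_small_inputs()
```

With inputs this small, the expected results fit on one line each, which avoids the 1000+ line clutter raised earlier in the thread.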

petern48 · 2025-06-25T21:37:40Z

How's this?

zhangfengcdt

LGTM

petern48 added 5 commits June 23, 2025 09:40

Fix small nit in series __repr__()

f990012

Add test_non_geom_fails()

46f34e5

test_constructor on all different geometry types

245f62c

Change Series.area return type to pd.Series to match gpd behavior and…

b30ef0c

… add area tests

Fix GeoSeries.to_pandas() and fix refactor tests

7827ea7

petern48 requested a review from jiayuasu as a code owner June 23, 2025 18:36

github-actions Bot added the sedona-python label Jun 23, 2025

pre-commit

e91bdb8

petern48 requested a review from zhangfengcdt June 23, 2025 18:48

Test if sgpd_res equals sedona result and gpd result

113bda7

jiayuasu reviewed Jun 24, 2025

Comment thread python/sedona/geopandas/geoseries.py Outdated

petern48 added 4 commits June 24, 2025 11:41

Remove run_sedona_sql test

0f4f5f5

Rename test_geoseries.py to test_match_geopandas_series.py

0509877

Make area( return ps.Series instead of pd.Series

521cff3

Add new test_geoseries to mimic the scala tests

3805261

zhangfengcdt reviewed Jun 25, 2025

Comment thread python/sedona/geopandas/geoseries.py

petern48 added 2 commits June 25, 2025 13:22

Use smaller tests for test_geoseries and hard-code expected results

d1eeb1b

Remove check_less_precise for version compatibility

10d2550

zhangfengcdt approved these changes Jun 25, 2025

petern48 requested a review from jiayuasu June 25, 2025 23:36

jiayuasu added this to the sedona-1.8.0 milestone Jun 25, 2025

jiayuasu added the improvement label Jun 25, 2025

jiayuasu approved these changes Jun 25, 2025

jiayuasu merged commit d799f50 into apache:master Jun 25, 2025
26 checks passed

petern48 deleted the series_test_framework branch June 25, 2025 23:56

Kontinuation pushed a commit to Kontinuation/sedona that referenced this pull request Jan 21, 2026

[apacheGH-2004] Geopandas.GeoSeries: Implement Test Framework (apache…

167b903

…#2005)
