Skip to content

[GH-2004] Geopandas.GeoSeries: Implement Test Framework#2005

Merged
jiayuasu merged 13 commits into
apache:masterfrom
petern48:series_test_framework
Jun 25, 2025
Merged

[GH-2004] Geopandas.GeoSeries: Implement Test Framework#2005
jiayuasu merged 13 commits into
apache:masterfrom
petern48:series_test_framework

Conversation

@petern48

@petern48 petern48 commented Jun 23, 2025

Copy link
Copy Markdown
Member

Did you read the Contributor Guide?

Is this PR related to a ticket?

What changes were proposed in this PR?

Implement GeoSeries test skeleton. The goal here is to fully flush out the common testing code we will use for testing GeoSeries methods. Once we have this merged in, I can rapid-fire GeoSeries method implementations using the same testing structure, occasionally adding function specific tests when needed.

This PR also fixes a bug in the __repr__() method and changes the return type of .area() to pd.Series to be consistent with the Geopandas behavior.

How was this patch tested?

Add tests for existing functionality

Did this PR include necessary documentation updates?

  • No, this PR does not affect any public API so no need to change the documentation.

@petern48

petern48 commented Jun 23, 2025

Copy link
Copy Markdown
Member Author

@zhangfengcdt PR definitely isn't fully ready yet, but I want to hear your thoughts. What types of common tests should we include for each function? Originally, I was thinking we have type checking tests (done) and tests for comparing our output to original geopandas function output, but the result of .buffer() is too far off for to match geopandas output successfully. Should we drop that type of test entirely or keep that type of test for the methods that can pass it? I guess we have to add tests for manually checking the output. Any other tests we should add?

As I mentioned in the PR description, I want this PR to flush out all the common tests we use for every method.

@petern48 petern48 requested a review from zhangfengcdt June 23, 2025 18:48
@zhangfengcdt

Copy link
Copy Markdown
Member

@zhangfengcdt PR definitely isn't fully ready yet, but I want to hear your thoughts. What types of common tests should we include for each function? Originally, I was thinking we have type checking tests (done) and tests for comparing our output to original geopandas function output, but the result of .buffer() is too far off for to match geopandas output successfully. Should we drop that type of test entirely or keep that type of test for the methods that can pass it? I guess we have to add tests for manually checking the output. Any other tests we should add?

As I mentioned in the PR description, I want this PR to flush out all the common tests we use for every method.

I think we need both basic tests that manually check the results from the API and the match to geopandas test suits. For the later, we might get some idea from the pandas on spark package, especially how assertPandasOnSparkEqual works in pyspark source.

For the former, Sedona has a test suit for expression (functions) and we may want to at least cover the cases in there:
https://github.com/apache/sedona/blob/master/spark/common/src/test/scala/org/apache/sedona/sql/functionTestScala.scala

@petern48

petern48 commented Jun 24, 2025

Copy link
Copy Markdown
Member Author

For various reasons, none of the built in assert methods work out of the box, even when using parameters like checkExact and check_less_precise. I did manage to get it to pass by looping through, which we already do elsewhere in our code, and tuning the tolerance a bit. Personally, I think it's good enough for now.

Another reason I'd like to avoid using the pyspark testing functions (e.g assertPandasOnSparkEqual and assertDataFrameEqual) because they've been removed and added across different version. They're not available until 3.5.0, and we'd have to start using annoying conditional logic like if pyspark.__version__ >= 4.0.0 use this function, else if use this one, else skip, etc. Cleaner and easier to maintain if we just avoid using them all together.

Comment thread python/sedona/geopandas/geoseries.py Outdated
@petern48

petern48 commented Jun 24, 2025

Copy link
Copy Markdown
Member Author

I moved old test code to test_match_geopandas_series.py and created a new test_geoseries.py for micmic the scala tests you mentioned. I guess for these ST_AREA and ST_BUFFER, functionTestScala.scala doesn't test for exact output. Is this still fine @zhangfengcdt?

@zhangfengcdt

Copy link
Copy Markdown
Member

I moved old test code to test_match_geopandas_series.py and created a new test_geoseries.py for micmic the scala tests you mentioned. I guess for these ST_AREA and ST_BUFFER, functionTestScala.scala doesn't test for exact output. Is this still fine @zhangfengcdt?

Looks great! I think for the first step, we could target covering whatever these scala tests cover in exact results.

Comment thread python/sedona/geopandas/geoseries.py
@petern48

Copy link
Copy Markdown
Member Author

Looks great! I think for the first step, we could target covering whatever these scala tests cover in exact results.

That made perfect sense originally, but now I see why it wasn't done in Scala. These test files we're using (self.mixedWktGeometryInputLocation) have 100 entries. Hard coding 100 polygons for a single test (e.g for ST_buffer) would make reading and navigating the test file a real pain.

Another thought. Originally, I had a test checking if the sedona sql function's result matched our new sgpd result. However, we agreed to remove it because it seemed trivial that of course the results would match. Following that same logic, if we can assume the new sedona geopandas results match the sedona sql results, then why do we need to replicate the Scala tests again if we already know sedona sql passes it? Or maybe we should just not make that assumption (in that case maybe it is worth bringing back that old sgpd vs sedona test). Personally, I'm fine with making the assumption. If anything, maybe we should hard-code results for the test_match_geopandas_series.py tests since those results are smaller. WDYT @zhangfengcdt?

@zhangfengcdt

Copy link
Copy Markdown
Member

Looks great! I think for the first step, we could target covering whatever these scala tests cover in exact results.

That made perfect sense originally, but now I see why it wasn't done in Scala. These test files we're using (self.mixedWktGeometryInputLocation) have 100 entries. Hard coding 100 polygons for a single test (e.g for ST_buffer) would make reading and navigating the test file a real pain.

Another thought. Originally, I had a test checking if the sedona sql function's result matched our new sgpd result. However, we agreed to remove it because it seemed trivial that of course the results would match. Following that same logic, if we can assume the new sedona geopandas results match the sedona sql results, then why do we need to replicate the Scala tests again if we already know sedona sql passes it? Or maybe we should just not make that assumption (in that case maybe it is worth bringing back that old sgpd vs sedona test). Personally, I'm fine with making the assumption. If anything, maybe we should hard-code results for the test_match_geopandas_series.py tests since those results are smaller. WDYT @zhangfengcdt?

I think the reason why we still need the hard-coded tests are that they are not simply replicating the scala expression tests, they should extend the scala cases. Basically, the goals are:

  • Verify the SQL created to convert geopandas -> sedona queries are expected;
  • Verify manually the results for matching or differences (reason) between geopandas origin and geopandas on Sedona.

This means we will very possibly cover more cases than the scala expression tests.

@petern48

Copy link
Copy Markdown
Member Author

Yes that makes sense, but the hard-coded outputs from the original scala tests would clutter up the codebase too much due to their sheer size (possibly like 1000+ lines for some test cases). IMO, it's not reasonable to hard code those outputs. Then at that point, if we're not extending the scala tests, is it worth still replicate those tests exactly in in python (test_geoseries.py)?

How do you feel about hard-coding the geopandas match test cases instead? Those ones are smaller, so the expected results won't be too bad.

@zhangfengcdt

Copy link
Copy Markdown
Member

Yes that makes sense, but the hard-coded outputs from the original scala tests would clutter up the codebase too much due to their sheer size (possibly like 1000+ lines for some test cases). IMO, it's not reasonable to hard code those outputs. Then at that point, if we're not extending the scala tests, is it worth still replicate those tests exactly in in python (test_geoseries.py)?

How do you feel about hard-coding the geopandas match test cases instead? Those ones are smaller, so the expected results won't be too bad.

Oh, I meant we can replicate the types of checks from that scala test, not the exact data, we can use smaller inputs for sure.

@petern48

Copy link
Copy Markdown
Member Author

How's this

@zhangfengcdt zhangfengcdt left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@petern48 petern48 requested a review from jiayuasu June 25, 2025 23:36
@jiayuasu jiayuasu added this to the sedona-1.8.0 milestone Jun 25, 2025
@jiayuasu jiayuasu merged commit d799f50 into apache:master Jun 25, 2025
26 checks passed
@petern48 petern48 deleted the series_test_framework branch June 25, 2025 23:56
Kontinuation pushed a commit to Kontinuation/sedona that referenced this pull request Jan 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants