feat: add support for GeoArrow WKB and Points#713
Conversation
implement WKB and point types for geoarrow extensions
|
All new files need to have the apache license header added to them please. Also you can use pre-commit to run the linter locally |
9913151 to
b77f9ce
Compare
|
sorry about the churn - had a chicken/egg problem on the license check. |
paleolimbot
left a comment
There was a problem hiding this comment.
Very cool!
My main question is: why here? If the geoarrow implementation lives here you have to wait for Matt to review the changes when your team/the geo-go community is more likely to provide useful feedback at the speed you want to iterate (zeroshade to Matt, pun intended).
If there's something specific required for the Parquet reader to work, I would suggest a much more limited scope (i.e., just geoarrow.wkb). I'm actually not sure you need it at all: in Arrow C++ we were able to support GeoParquet and completely avoid implementing any GeoArrow extension types (we just query the global type registry and if somebody else had registered the extension type and deseriailzed some JSON, returning a geoarrow.wkb data type).
Happy to defer to Matt on the scope he's willing to maintain here! (Also equally happy to open up geoarrow/geoarrow-go for you, or link to seerai/geoarrow-go from our website).
| if e.meta.CRSType != other.meta.CRSType { | ||
| return false | ||
| } | ||
| if e.meta.Edges != other.meta.Edges { | ||
| return false | ||
| } | ||
|
|
||
| if len(e.meta.CRS) != len(other.meta.CRS) { | ||
| return false | ||
| } | ||
| for i, r := range e.meta.CRS { | ||
| if r != other.meta.CRS[i] { | ||
| return false | ||
| } | ||
| } |
There was a problem hiding this comment.
I am not sure if you want to handle CRS equality properly here (we handle it here: https://github.com/apache/sedona-db/blob/4fcf9edf81fe5010a3b51475c243f155942e5541/rust/sedona-schema/src/crs.rs#L290-L300 ).
| // As defined in https://github.com/geoarrow/geoarrow | ||
| const ( | ||
| PointID GeometryTypeID = 1 | ||
| LineStringID GeometryTypeID = 2 | ||
| PolygonID GeometryTypeID = 3 | ||
| MultiPointID GeometryTypeID = 4 | ||
| MultiLineStringID GeometryTypeID = 5 | ||
| MultiPolygonID GeometryTypeID = 6 |
There was a problem hiding this comment.
May be worth repeating here that value % 10 and value / 10 have meaning here (can be used to separate geometry type and dimensions)
| if v.dim != XYZ && v.dim != XYZM { | ||
| return 0 | ||
| } | ||
| return v.coords[2] |
There was a problem hiding this comment.
I personally like returning NaN here but I know that the default value of 0 is more common (I just like having this be explicit).
|
Up to Matt, but my general thinking here is for the iceberg-go package since it already depends on it, GeoArrow would be a registered extension here. I'd like to get a geoarrow-go or geoarrow-go-* set of packages for different flavors of go geometry packages. In general - geospatial types really shouldn't be considered "special". Keeping them here lowers friction in general. I'll take a look at your comments - seem straightforward. |
If you're willing to handle some basic CRS things here I'm happy to review them, and, geoarrow.wkb certainly has precedent and is the required zero copy to/from type for Parquet and Iceberg. The other ones are considerably more complex, though, and aren't needed for Parquet or Iceberg. I think we would all benefit from seeing what it is like to have geoarrow.wkb here before embarking on that project. (I personally think those extension types are the perfect example of domain-specific types that are more efficiently maintained elsewhere). |
|
I wasn't planning a full geometry implementation, but I did Point and Polygon to make sure the pattern works well - they are just basic containers for the arrow values and some generic utilities: dwilson1988#1 along with a extremely minimal test for zero-copy between arrow and go-geom here https://github.com/seerai/geoarrow-go I don't think taking it any further than that really has any value, but I think stopping at WKB isn't far enough - want the arrow ecosystem in go complete for more than just iceberg/parquet. |
|
@paleolimbot would it make more sense possibly to create a repo in the
That said, I think I'm fine with the basic CRS things and minimum necessary to zero-copy to/from for Parquet and Iceberg to live here with additional packages in the geoarrow org to add more functionality for users that might want it. I'll give this a review and take a look at the level of complexity here. Depending on the amount of work/complexity it might make sense for the entire extension type to live in a geoarrow org repo instead of here even for this basic piece. |
|
I'm fine with whatever approach you feel is best. I can start the conversation with @kylebarron if that's the direction you want to go. Happy to incubate/maintain them in the @seerai org until ready too. The main thing to me that matters is the geospatial types are accessible to everyone. They've been specialized in software for far too long. |
|
I agree with you that we should make them accessible to everyone. It looks like you've pretty much isolated everything into a single geoarrow package which is perfect and i think makes it ripe to be pulled out as its own package that can live in the geoarrow org (@paleolimbot could you help with creating a repo there for this?) where it can grow and extend to cover all of the geoarrow types and include packages for connecting to the wider geospatial ecosystem in go with geom-go etc. Then iceberg-go can simply import that geoarrow-go package and register the extension type to manipulate them. Given the scope of this I think that's probably the best route, particularly given the TODOs you have here for the remaining types. I wouldn't want to push this out here and then move it later on, it would be better to have that separate module from the beginning so it has room to grow. |
|
No objections from me for a new repo in the geoarrow org! |
Done! https://github.com/geoarrow/geoarrow-go (Daniel, you should have write access when you accept the invite...feel free to push willy nilly although I am happy to review anything if you'd like) |
|
Sweet. Thanks! We sticking with an MIT or should it be an Apache 2.0? I'll migrate this PR there as the baseline and we can iterate. Point of clarification: Do we want to geoarrow-go to be a dependency of arrow-go or make it an opt-in extension? |
It should be an opt-in extension. Once you have the package ready in the geoarrow-go repo we can add that as a dependency to iceberg-go to handle the geospatial stuff |
|
Closing this since we're moving in a different direction. Thanks for the input, all! |
Either! I think Kyle's repos are dual licensed Apache/MIT and I think mine are Apache but I didn't think too hard about it. |
This PR migrates the contents of apache/arrow-go#713 into this repo This does not yet address @paleolimbot's comments such as on CRS equality and will be covered in a future PR. This just puts something here for us to work on, but consider it rough clay.
#504
Rationale for this change
This PR adds GeoArrow types for one native type (Point) and one encoded type (WKB) with a pattern set for adding all of the others by following the implementation pattern. I've tested integration with a 3rd party geometry library and confirmed it can be done with zero-copy (mostly, depends on the external library) for the underlying data.
This is a pretty big PR, but instead of breaking up the native and encoded types, I wanted one of each, with a full set of tests.
This PR supercedes #545. I think I got all/most of the comments applied there rolled into this implementation. The biggest change was isolating the geoarrow types into a subpackage to keep things neat and tidy.
I have a branch that implements for polygons as well and a very rough go-geom integration to demonstate the pattern.
What changes are included in this PR?
geoarrow.Dimensions- the x/y/z/m layout of the geometry datageoarrow.EdgeInterpolation- the intended interpolation algorithm for edges in geometriesgeoarrow.CRSType- the various formats that GeoArrow permits for spatial referencesgeoarrow.Metadata- a type used to store common metadata for all GeoArrow Geometry Types.geoarrow.GeoemetryValue- a basic interface that geometry values must implement.geoarrow.GeometryType- an interface that ties a native/encoded GeoArrow type to a Value formatgeoarrow.valueBuilder[V, G]that implements generic boilerplate for geometry buildersgeoarrow.geometryArraythat implements generic boilerplate for geometry arraysgeoarrow.WKBType- a type for WKB encoded geometry typesgeoarrow.WKBValue- represents WKB encoded geometry valuesgeoarrow.WKBArraya type alias for wkb arraysgeoarrow.WKBBuilder- a type alias for wkb buildersgeoarrow.PointTypea type for native encoded point geometry types.geoarrow.PointValuea type for native encoded point values.geoarrow.PointArraya type alias for point arraysgeoarrow.PointBuilder- a type alias for point buildersAre these changes tested?
Yes - tests implemented for each type following the patterns set by the rest of the extensions.
Are there any user-facing changes?
Yes - the addition of new extension types.
Additional disclaimer: I wrote all of the WKB and most of the point implementation by hand, but did use Claude Code to fill out the tests and apply some light uniformity refactors to mirror patterns in this repo.
@zeroshade @paleolimbot @Mandukhai-Alimaa