feat(rust/sedona-raster-gdal): add RS_Polygonize#955
Conversation
2619bd6 to
93d86e6
Compare
25f6b27 to
7c72117
Compare
paleolimbot
left a comment
There was a problem hiding this comment.
Thank you! A few questions but this looks great.
| // Note: We deliberately use "Memory" instead of "MEM" to be compatible with older GDAL versions. | ||
| let mem_driver = gdal | ||
| .get_driver_by_name("Memory") | ||
| .map_err(|e| exec_datafusion_err!("Failed to get Memory driver: {e}"))?; |
There was a problem hiding this comment.
In earlier work we tried to avoid getting a driver by name in a loop to avoid contention on the global lock guarding the driver registry. Is it possible to avoid that lock here using the lower level mem dataset constructor?
(edit: I see this is for vector data. We're still doing this once per batch...is there anything we can do to avoid resolving the driver that frequently?)
There was a problem hiding this comment.
No. The dataset created by MEMDatasetCreate does not work well with vector layers in old GDAL versions such as 3.8 for some reason. We've hit a CI error because of this and switched to using the "Memory" driver. We use "Memory" instead of "MEM" here for the same reason.
Relevant GitHub workflow job: https://github.com/apache/sedona-db/actions/runs/27903364821/job/82567585551
failures:
---- rs_polygonize::tests::test_polygonize_invoke_array_and_nulls stdout ----
thread 'rs_polygonize::tests::test_polygonize_invoke_array_and_nulls' (46126) panicked at rust/sedona-raster-gdal/src/rs_polygonize.rs:348:14:
called `Result::unwrap()` on an `Err` value: Execution("Failed to create layer: GDAL method 'GDALDatasetCreateLayer' returned a NULL pointer. Error msg: 'CreateLayer() not supported by this dataset.'")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---- rs_polygonize::tests::test_polygonize_raster stdout ----
thread 'rs_polygonize::tests::test_polygonize_raster' (46129) panicked at rust/sedona-raster-gdal/src/rs_polygonize.rs:297:10:
called `Result::unwrap()` on an `Err` value: Execution("Failed to create layer: GDAL method 'GDALDatasetCreateLayer' returned a NULL pointer. Error msg: 'CreateLayer() not supported by this dataset.'")
---- rs_polygonize::tests::test_polygonize_invoke_scalar stdout ----
thread 'rs_polygonize::tests::test_polygonize_invoke_scalar' (46127) panicked at rust/sedona-raster-gdal/src/rs_polygonize.rs:327:14:
called `Result::unwrap()` on an `Err` value: Execution("Failed to create layer: GDAL method 'GDALDatasetCreateLayer' returned a NULL pointer. Error msg: 'CreateLayer() not supported by this dataset.'")
|
|
||
| fn polygon_value_struct_fields() -> Fields { | ||
| Fields::from(vec![ | ||
| Field::new("geom", DataType::Binary, false), |
There was a problem hiding this comment.
I know that it is annoying, but this should be an item_crs and propagate the string Crs verbatim from the raster crs field. You could propagate this array after iterating if you'd like because it should be an identical arc reference to the raster field.
There was a problem hiding this comment.
Switched the geom field to be item_crs. The field name is still "geom".
| } | ||
| }; | ||
|
|
||
| polygon_values.push((wkb, feature.field_as_double(idx))); |
There was a problem hiding this comment.
Can we use a BinaryBuilder and Float64Builder and write these incrementally to amortize the allocations?
There was a problem hiding this comment.
The result array construction code has been refactored. The polygons were directly appended to the output arrow arrays in a closure passed into polygonize_raster.
| .map_err(|e| exec_datafusion_err!("Failed to add field to layer: {e}"))?; | ||
|
|
||
| // Polygonize the raster band into the vector layer, using the "value" field to store pixel values. | ||
| gdal.polygonize(&raster_band, None, &layer, 0, &PolygonizeOptions::default()) |
There was a problem hiding this comment.
I believe our polygonize implementation is backed by GDALPolygonize.
I think we need to make sure to use GDALFPolygonize for cases where the raster is a float type and GDALPolygonize when it is an int. Otherwise we round floats to the nearest int and polygonize incorrectly
There was a problem hiding this comment.
I've added fpolygonize (backed by GDALFPolygonize) to sedona-gdal and dynamically dispatch to it when the raster is of a floating-point type (Float32/Float64), addressing this in a follow-up commit.
| } | ||
| }; | ||
|
|
||
| let band_num: usize = band_opt.unwrap().max(1).try_into().unwrap_or(1); |
There was a problem hiding this comment.
wont this clamp all values [1, inf)? so the validation below will never trigger?
There was a problem hiding this comment.
I reviewed how we normalize and validate band_num in other RS functions and found that we should only do .unwrap().unwrap_or(1). No max(1) should be performed. I'll remove the .max(1) here.
There was a problem hiding this comment.
I've removed the .max(1) call completely and rewritten the band validation logic. It now checks for <= 0 and out-of-range band values properly without silently wrapping/clamping negative values during conversion, addressing this in a follow-up commit.
| .map_err(|e| exec_datafusion_err!("Failed to add field to layer: {e}"))?; | ||
|
|
||
| // Polygonize the raster band into the vector layer, using the "value" field to store pixel values. | ||
| gdal.polygonize(&raster_band, None, &layer, 0, &PolygonizeOptions::default()) |
There was a problem hiding this comment.
I think also our behavior here is divergent from GDAL with regard to nodata values. We will polygonize them by default, but generally the GDAL will ignore them
|
This one is my fault but we need to consider how GDAL-backed RS function should work with the native outdb loader functionality I built to integrate zarr support. The simple solution is to annotate the function with Without a benchmark it is not clear to me whether the VRT outdb path is preferable to the needs_pixels/RS_EnsureLoaded solution. probably we need more data to make some determination on solution here. |
I think we can merge this without a benchmark...I'd prefer to optimize performance after there are a lot of passing unit tests 🙂 |
We still need to decide what we want to do for non-GDAL loaders before merging |
Kontinuation
left a comment
There was a problem hiding this comment.
We have addressed the suggestions regarding CRS propagation, builder usage, array extraction and interleaving testing in a follow-up commit.
paleolimbot
left a comment
There was a problem hiding this comment.
Thank you!
This looks great to me but would benefit from 👀 from @james-willis before merging.
james-willis
left a comment
There was a problem hiding this comment.
looks great! thanks
Summary
RS_Polygonizeraster functionDependency
mainTesting
cargo clippy -p sedona-raster-gdal -p sedona --all-targets -- -D warningscargo bench -p sedona-raster-gdal --bench rs_polygonize --no-runcargo test -p sedona-raster-gdalPATH="/home/kontinuation/workspace/github/apache/sedona-db/.venv/bin:$PATH" quarto render docs/reference/sql/rs_polygonize.qmdsedonadb._libPython import mismatch in this environment