Skip to content

feat(rust/sedona-raster-gdal): add RS_Polygonize#955

Merged
Kontinuation merged 9 commits into
apache:mainfrom
Kontinuation:pr-h-rs-polygonize
Jul 3, 2026
Merged

feat(rust/sedona-raster-gdal): add RS_Polygonize#955
Kontinuation merged 9 commits into
apache:mainfrom
Kontinuation:pr-h-rs-polygonize

Conversation

@Kontinuation

@Kontinuation Kontinuation commented Jun 14, 2026

Copy link
Copy Markdown
Member

Summary

  • add the GDAL-backed RS_Polygonize raster function
  • add SQL reference docs and a dedicated Criterion benchmark
  • keep the change limited to the new function and its incremental registration

Dependency

  • Independent branch based on main

Testing

  • cargo clippy -p sedona-raster-gdal -p sedona --all-targets -- -D warnings
  • cargo bench -p sedona-raster-gdal --bench rs_polygonize --no-run
  • cargo test -p sedona-raster-gdal
    • blocked in this environment by existing missing raster fixture files in other tests
  • PATH="/home/kontinuation/workspace/github/apache/sedona-db/.venv/bin:$PATH" quarto render docs/reference/sql/rs_polygonize.qmd
    • front matter validated; example execution is currently blocked by an existing sedonadb._lib Python import mismatch in this environment

@github-actions github-actions Bot requested a review from prantogg June 14, 2026 08:18
@Kontinuation Kontinuation force-pushed the pr-h-rs-polygonize branch 2 times, most recently from 2619bd6 to 93d86e6 Compare June 17, 2026 13:45
@Kontinuation Kontinuation marked this pull request as ready for review June 22, 2026 02:12

@paleolimbot paleolimbot left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! A few questions but this looks great.

Comment on lines +96 to +99
// Note: We deliberately use "Memory" instead of "MEM" to be compatible with older GDAL versions.
let mem_driver = gdal
.get_driver_by_name("Memory")
.map_err(|e| exec_datafusion_err!("Failed to get Memory driver: {e}"))?;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In earlier work we tried to avoid getting a driver by name in a loop to avoid contention on the global lock guarding the driver registry. Is it possible to avoid that lock here using the lower level mem dataset constructor?

(edit: I see this is for vector data. We're still doing this once per batch...is there anything we can do to avoid resolving the driver that frequently?)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. The dataset created by MEMDatasetCreate does not work well with vector layers in old GDAL versions such as 3.8 for some reason. We've hit a CI error because of this and switched to using the "Memory" driver. We use "Memory" instead of "MEM" here for the same reason.

Relevant GitHub workflow job: https://github.com/apache/sedona-db/actions/runs/27903364821/job/82567585551

failures:

---- rs_polygonize::tests::test_polygonize_invoke_array_and_nulls stdout ----

thread 'rs_polygonize::tests::test_polygonize_invoke_array_and_nulls' (46126) panicked at rust/sedona-raster-gdal/src/rs_polygonize.rs:348:14:
called `Result::unwrap()` on an `Err` value: Execution("Failed to create layer: GDAL method 'GDALDatasetCreateLayer' returned a NULL pointer. Error msg: 'CreateLayer() not supported by this dataset.'")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- rs_polygonize::tests::test_polygonize_raster stdout ----

thread 'rs_polygonize::tests::test_polygonize_raster' (46129) panicked at rust/sedona-raster-gdal/src/rs_polygonize.rs:297:10:
called `Result::unwrap()` on an `Err` value: Execution("Failed to create layer: GDAL method 'GDALDatasetCreateLayer' returned a NULL pointer. Error msg: 'CreateLayer() not supported by this dataset.'")

---- rs_polygonize::tests::test_polygonize_invoke_scalar stdout ----

thread 'rs_polygonize::tests::test_polygonize_invoke_scalar' (46127) panicked at rust/sedona-raster-gdal/src/rs_polygonize.rs:327:14:
called `Result::unwrap()` on an `Err` value: Execution("Failed to create layer: GDAL method 'GDALDatasetCreateLayer' returned a NULL pointer. Error msg: 'CreateLayer() not supported by this dataset.'")


fn polygon_value_struct_fields() -> Fields {
Fields::from(vec![
Field::new("geom", DataType::Binary, false),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that it is annoying, but this should be an item_crs and propagate the string Crs verbatim from the raster crs field. You could propagate this array after iterating if you'd like because it should be an identical arc reference to the raster field.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched the geom field to be item_crs. The field name is still "geom".

}
};

polygon_values.push((wkb, feature.field_as_double(idx)));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use a BinaryBuilder and Float64Builder and write these incrementally to amortize the allocations?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The result array construction code has been refactored. The polygons were directly appended to the output arrow arrays in a closure passed into polygonize_raster.

.map_err(|e| exec_datafusion_err!("Failed to add field to layer: {e}"))?;

// Polygonize the raster band into the vector layer, using the "value" field to store pixel values.
gdal.polygonize(&raster_band, None, &layer, 0, &PolygonizeOptions::default())

@james-willis james-willis Jun 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe our polygonize implementation is backed by GDALPolygonize.

I think we need to make sure to use GDALFPolygonize for cases where the raster is a float type and GDALPolygonize when it is an int. Otherwise we round floats to the nearest int and polygonize incorrectly

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added fpolygonize (backed by GDALFPolygonize) to sedona-gdal and dynamically dispatch to it when the raster is of a floating-point type (Float32/Float64), addressing this in a follow-up commit.

}
};

let band_num: usize = band_opt.unwrap().max(1).try_into().unwrap_or(1);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wont this clamp all values [1, inf)? so the validation below will never trigger?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed how we normalize and validate band_num in other RS functions and found that we should only do .unwrap().unwrap_or(1). No max(1) should be performed. I'll remove the .max(1) here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the .max(1) call completely and rewritten the band validation logic. It now checks for <= 0 and out-of-range band values properly without silently wrapping/clamping negative values during conversion, addressing this in a follow-up commit.

.map_err(|e| exec_datafusion_err!("Failed to add field to layer: {e}"))?;

// Polygonize the raster band into the vector layer, using the "value" field to store pixel values.
gdal.polygonize(&raster_band, None, &layer, 0, &PolygonizeOptions::default())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think also our behavior here is divergent from GDAL with regard to nodata values. We will polygonize them by default, but generally the GDAL will ignore them

@james-willis

james-willis commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

This one is my fault but we need to consider how GDAL-backed RS function should work with the native outdb loader functionality I built to integrate zarr support.

The simple solution is to annotate the function with needs_pixels metadata, but that will eliminate the VRT path for these functions: all rasters will wind up indb when they reach the functions.

Without a benchmark it is not clear to me whether the VRT outdb path is preferable to the needs_pixels/RS_EnsureLoaded solution. probably we need more data to make some determination on solution here.

cc @paleolimbot @zhangfengcdt

@paleolimbot

Copy link
Copy Markdown
Member

probably we need more data to make some determination on solution here.

I think we can merge this without a benchmark...I'd prefer to optimize performance after there are a lot of passing unit tests 🙂

@james-willis

Copy link
Copy Markdown
Contributor

I think we can merge this without a benchmark

We still need to decide what we want to do for non-GDAL loaders before merging

@Kontinuation Kontinuation left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have addressed the suggestions regarding CRS propagation, builder usage, array extraction and interleaving testing in a follow-up commit.

@paleolimbot paleolimbot left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

This looks great to me but would benefit from 👀 from @james-willis before merging.

@james-willis james-willis left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great! thanks

@Kontinuation Kontinuation merged commit 1eca227 into apache:main Jul 3, 2026
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants