
[SEDONA-756] feat: raster Python serde and with_bands() support #2956

Open
prantogg wants to merge 1 commit into apache:master from prantogg:pranav/feature/raster-python-serde


@prantogg
Contributor

@prantogg prantogg commented May 15, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [SEDONA-XXX] my subject.

What changes were proposed in this PR?

Returning raster data from Python UDFs currently requires .tolist() + RS_MakeRaster, which forces Float64 promotion, creates 262K Python float objects per 512×512 tile, and loses all raster metadata (CRS, nodata, affine transform).
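The cost described above can be illustrated with a small sketch. It uses only the standard-library `array` module as a stand-in for a NumPy band (an assumption made to keep the example dependency-free) and a single 512×512 band of 8-bit samples; it does not call any Sedona API.

```python
from array import array

# A 512x512 single-band tile of 8-bit samples, stored contiguously.
width = height = 512
tile = array("B", bytes(width * height))

# Old-style path: every sample becomes a separate Python float object...
as_floats = [float(v) for v in tile]
# ...and is then re-packed as 64-bit doubles, whatever the source dtype was.
promoted = array("d", as_floats)

object_count = len(as_floats)            # Python floats created for one band
native_bytes = len(tile.tobytes())       # payload at the native 1-byte width
promoted_bytes = len(promoted.tobytes()) # payload after promotion to doubles

print(object_count, native_bytes, promoted_bytes)
```

A plain list of floats also has no place to carry CRS, nodata, or the affine transform, which is why that metadata is lost on the old path.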

This PR adds:

  • raster_serde.serialize() — Writes InDbSedonaRaster to Sedona's binary format, byte-compatible with JVM Serde.deserialize(). Uses cache-and-replay for opaque Kryo blobs (categories, properties, colorModel).
  • InDbSedonaRaster.with_bands() — Creates a new raster with replaced pixel data (NumPy array) but preserved spatial metadata. Band count and dtype may differ from the source.
  • RasterType.serialize() — Delegates to raster_serde.serialize() instead of raising NotImplementedError.
  • DeepCopiedRenderedImage.reconcileColorModel() (JVM) — Fixes colorModel/sampleModel mismatches at deserialization time when Python UDFs change band count or dtype.
  • KryoUtil.skipUTF8String() and GridSampleDimensionSerializer.skip() — Utility methods for navigating Kryo streams without full deserialization.
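The cache-and-replay and skip ideas in the list above can be sketched on a toy format. Everything here is hypothetical: the 4-byte length prefix, the helper names `read_blob`, `skip_blob`, and `write_blob`, and the record layout are illustrative only and are not Kryo's or Sedona's actual encoding.

```python
import io
import struct

def read_blob(stream):
    """Read one length-prefixed field and return its raw bytes, prefix included."""
    prefix = stream.read(4)
    (length,) = struct.unpack(">I", prefix)
    return prefix + stream.read(length)

def skip_blob(stream):
    """Advance past one length-prefixed field without reading its payload."""
    (length,) = struct.unpack(">I", stream.read(4))
    stream.seek(length, io.SEEK_CUR)

def write_blob(payload):
    """Encode a payload as a 4-byte big-endian length followed by the bytes."""
    return struct.pack(">I", len(payload)) + payload

# Toy record layout (hypothetical): [opaque blob][pixel bytes]
original = write_blob(b"\x01opaque-jvm-state") + write_blob(bytes([1, 2, 3, 4]))

# Deserialize: cache the opaque field verbatim, decode only the pixels.
src = io.BytesIO(original)
cached = read_blob(src)
pixels = read_blob(src)[4:]

# Serialize: replay the cached bytes unchanged, then write the new pixels.
new_pixels = bytes(p * 2 for p in pixels)
rewritten = cached + write_blob(new_pixels)

# A reader that only needs the pixels can skip the opaque field entirely.
check = io.BytesIO(rewritten)
skip_blob(check)
recovered = read_blob(check)[4:]
```

The point of the design is that the writer never has to understand the opaque field: it only needs to know where the field starts and ends, so the original bytes can be emitted unchanged.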

Benchmarked on Apple M2 Pro, 4-band rasters, median of 50 iterations:

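| Tile size | Old .tolist() (ms) | New serialize() (ms) | Speedup | Old mem (KB) | New mem (KB) | Mem ratio |
|-----------|--------------------|----------------------|---------|--------------|--------------|-----------|
| 64×64     | 0.16               | 0.04                 | 3.6×    | 384          | 66           | 5.8×      |
| 256×256   | 2.56               | 0.11                 | 23×     | 6,144        | 1,026        |           |
| 512×512   | 11.63              | 0.66                 | 18×     | 24,576       | 4,098        |           |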
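As a sanity check on the memory columns, the sketch below reproduces the reported figures from three assumptions that the PR does not state: 24 bytes per Python float object (typical for 64-bit CPython), 4-byte samples on the new path, and a fixed overhead of 2 KB per raster. This is one reading that happens to fit the numbers, not a confirmed description of the benchmark.

```python
BANDS = 4            # from the benchmark description
PY_FLOAT_BYTES = 24  # assumption: one CPython float object on a 64-bit build
SAMPLE_BYTES = 4     # assumption: 4-byte samples; the dtype is not stated
OVERHEAD_KB = 2      # assumption: fixed per-raster overhead that fits the table

# side length -> (old KB, new KB), copied from the benchmark figures
reported = {
    64: (384, 66),
    256: (6144, 1026),
    512: (24576, 4098),
}

def old_kb(side):
    """Estimated size of the .tolist() result: one float object per sample."""
    return BANDS * side * side * PY_FLOAT_BYTES // 1024

def new_kb(side):
    """Estimated size of the serialized raster: packed samples plus overhead."""
    return BANDS * side * side * SAMPLE_BYTES // 1024 + OVERHEAD_KB

for side, (old, new) in reported.items():
    # Print estimates next to the ratio computed from the reported values.
    print(side, old_kb(side), new_kb(side), round(old / new, 1))
```

Under these assumptions the memory ratio approaches 24 / 4 = 6 as tiles grow, because the fixed overhead matters less and less.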

How was this patch tested?

  • 8 with_bands() tests (band count changes, dtype changes, metadata survival)
  • 2 serialize round-trip tests
  • 1 JVM serde test (colorModel mismatch handling)
  • Passes all existing tests

Did this PR include necessary documentation updates?

  • Yes, updated the "Writing Python UDF" section in docs/tutorial/raster.md to show the new raster-to-raster UDF pattern using with_bands().
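The raster-to-raster pattern can be sketched with a stand-in class. `ToyRaster` and `band_difference_udf` are hypothetical and exist only for this illustration: the real method lives on InDbSedonaRaster and takes a NumPy array, and its exact signature is not shown in this PR description. The sketch only mirrors the described behaviour: a new raster is returned, pixel data is replaced, spatial metadata is preserved, and the band count may change.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

@dataclass(frozen=True)
class ToyRaster:
    """Hypothetical stand-in for a raster; not Sedona's InDbSedonaRaster."""
    bands: Tuple[Tuple[int, ...], ...]  # one flat tuple of samples per band
    crs: str
    nodata: Optional[float]
    affine: Tuple[float, float, float, float, float, float]

    def with_bands(self, bands):
        # Return a new raster: pixel data replaced, spatial metadata carried over.
        return replace(self, bands=tuple(tuple(b) for b in bands))

def band_difference_udf(raster):
    """Raster-to-raster function: collapse two bands into one difference band."""
    first, second = raster.bands
    diff = [a - b for a, b in zip(first, second)]
    return raster.with_bands([diff])

src = ToyRaster(
    bands=((5, 7, 9, 11), (1, 2, 3, 4)),
    crs="EPSG:4326",
    nodata=0.0,
    affine=(1.0, 0.0, 0.0, 0.0, -1.0, 0.0),
)
out = band_difference_udf(src)
print(len(src.bands), len(out.bands), out.crs)
```

Returning a new object rather than mutating the source keeps the input raster usable after the UDF runs, which is the behaviour the PR describes for with_bands().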

@prantogg prantogg force-pushed the pranav/feature/raster-python-serde branch from ed31473 to 782a8e1 Compare May 16, 2026 00:11
@prantogg prantogg changed the title feat: raster Python serde and with_bands() support [SEDONA-756] feat: raster Python serde and with_bands() support May 16, 2026
@prantogg prantogg force-pushed the pranav/feature/raster-python-serde branch 3 times, most recently from 782a8e1 to 288bfe7 Compare May 16, 2026 01:34
@prantogg prantogg marked this pull request as ready for review May 16, 2026 02:22
@prantogg prantogg requested a review from jiayuasu as a code owner May 16, 2026 02:22
Add Python-side serialize() for InDbSedonaRaster, enabling Python UDFs
to return raster objects directly instead of the lossy .tolist() +
RS_MakeRaster workaround. Rasters now round-trip as contiguous bytes
preserving native dtypes and all metadata (CRS, nodata, affine, etc.).

Add with_bands() to InDbSedonaRaster for replacing pixel data (NumPy
array) while preserving spatial metadata. Band count and dtype may
differ from the source raster.

Add reconcileColorModel() to DeepCopiedRenderedImage (JVM) to fix
colorModel/sampleModel mismatches at deserialization when Python UDFs
change band count or dtype.

Cherry-picked from wherobots/wherobots-compute@e08bde1da08 with
vectorized UDF wiring excluded.
@jiayuasu jiayuasu force-pushed the pranav/feature/raster-python-serde branch from 288bfe7 to c52c18a Compare May 17, 2026 07:22
@jiayuasu
Member

Hi @prantogg — heads up, I rebased this branch onto current master to resolve the conflict in docs/tutorial/raster.md that surfaced after #2954 merged. New head: c52c18ae2f (your authorship preserved).

Resolution choices for the UDF section:

  • Kept the new ### Raster to scalar / ### Raster to raster split, the with_bands() example, and the in-db/out-db warning admonition from your PR.
  • Dropped the old "returning a raster isn't supported, use RS_MakeRaster workaround" paragraph that the rewrite in #2954 ([GH-2804] Raster tutorial: end-to-end running example with visuals) was still carrying; it is now obsolete with with_bands().
  • Kept the rewrite's ## Performance heading instead of reverting to ## Performance optimization, since the rename was unrelated to this feature.
  • Removed a misleading [SedonaRaster.with_bands()] link that pointed to RS_MakeRaster.md; with_bands() is a Python method, not a SQL function with its own page.
  • Mirrored the same edits in docs/tutorial/raster.zh.md (which #2954 added and which was previously out of sync with the English content).

If you have local commits on top of pranav/feature/raster-python-serde you'll need to git fetch && git reset --hard origin/pranav/feature/raster-python-serde (or rebase your local work onto the new tip) before continuing. Sorry for the force-push.

"""Serialize an InDbSedonaRaster to the Sedona binary format.

The output bytes are compatible with the JVM's Serde.deserialize().
Only InDbSedonaRaster is supported. OutDb and LazyLoad rasters
Member


Please remove references to outdb since they don't exist here
