beta(0.4.0): lightweight-tier hardening — Serverless, geometry consistency, reader defaults#40

Merged
mjohns-databricks merged 29 commits into main from beta/0.4.0 on Jun 19, 2026
@mjohns-databricks

Collaborator

Summary

Lightweight-tier hardening for v0.4.0 since the last beta/0.4.0 → main merge (#33). This batch makes the pure-Python tier genuinely Serverless-ready, makes geometry handling consistent and robust across every function, fixes a footgun in the raster reader default, and ports the xView example to the lightweight API (validated end-to-end on Serverless). No new functions — these refine the existing 0.4.0 surface.

Serverless support for the lightweight tier

  • Installs and runs on Databricks Serverless (environment v5). mapbox-vector-tile pinned to 2.1.x so protobuf stays <6 (Spark-Connect compatibility); idna<3.8 to avoid the core-package-change notice. Verified end-to-end on Serverless (install → register → execute).
  • Install docs use the quoted PEP 508 form everywhere: %pip install "geobrix[light] @ file:///Volumes/.../geobrix-0.4.0-py3-none-any.whl". The path-with-extra form ('...whl[light]') fails on Serverless. Added a warning that the heavyweight tier needs the JAR + GDAL init script, not just the wheel.

Geometry handling — consistent + robust (light↔heavy parity)

  • Every geom-accepting function accepts WKB / EWKB / WKT / EWKT via a single shared decoder (gbx._geom), in both tiers. Previously some lightweight functions accepted only WKB.
  • Geometry×raster ops align to the raster CRS and handle non-overlap gracefully. rst_clip, rst_sample, and rst_viewshed reproject the input geometry from its SRID to the raster's CRS (matching heavyweight GDAL), and a geometry that doesn't overlap the raster returns null/empty instead of raising.
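
To make the four-encoding contract concrete, here is a minimal stdlib-only sketch of how an input could be classified before decoding. It is not the gbx._geom implementation (which the commit log describes as shapely-based); the function name `describe_geom_input` is hypothetical, and the SRID flag follows the PostGIS EWKB convention.

```python
import struct

# PostGIS EWKB convention: this bit in the geometry-type word means an SRID follows.
EWKB_SRID_FLAG = 0x20000000


def describe_geom_input(geom):
    """Return (encoding, srid) for WKB/EWKB bytes or WKT/EWKT text.

    Hypothetical helper for illustration; srid is None when none is carried.
    """
    if isinstance(geom, (bytes, bytearray, memoryview)):
        raw = bytes(geom)
        if len(raw) < 5 or raw[0] not in (0, 1):
            raise ValueError("not a WKB/EWKB header")
        order = "<" if raw[0] == 1 else ">"  # 1 = little-endian, 0 = big-endian
        (type_word,) = struct.unpack_from(order + "I", raw, 1)
        if type_word & EWKB_SRID_FLAG:
            (srid,) = struct.unpack_from(order + "I", raw, 5)
            return "EWKB", srid
        return "WKB", None
    if isinstance(geom, str):
        text = geom.strip()
        if text.upper().startswith("SRID="):
            srid_part, _, _body = text.partition(";")
            return "EWKT", int(srid_part[5:])
        return "WKT", None
    raise TypeError(f"unsupported geometry input: {type(geom).__name__}")


# A little-endian POINT(1 2), with and without an SRID.
wkb_point = struct.pack("<BIdd", 1, 1, 1.0, 2.0)
ewkb_point = struct.pack("<BIIdd", 1, 1 | EWKB_SRID_FLAG, 4326, 1.0, 2.0)
```

Only the EWKB and EWKT forms carry an SRID, which is why the geometry×raster functions can reproject those inputs but must use WKB and WKT inputs as given.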

Raster reader / writer

  • sizeInMB now defaults to -1 (no split) in both tiers — one whole-image tile per file instead of auto-splitting at 16 MB (which silently multi-tiled larger rasters and broke path-keyed joins). Set a positive value to opt into tiling; a single tile exceeding the ~2 GB Spark cell limit fails with an actionable message.
  • Lightweight source column is dbfs:-scheme-qualified to match binaryFile / the heavyweight reader (clean cross-tier joins); native file ops strip the scheme internally.
  • gtiff_gbx writer supports nameCol for deterministic output filenames (parity with the heavy GDAL writer).
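
The sizeInMB rule described above can be sketched as a small decision function. This is an illustration only, under stated assumptions: `plan_read` is a hypothetical name, the size is treated as a plain byte count, and the exact value behind the "~2 GB" limit is assumed to be 2 GiB.

```python
# Assumption: the "~2 GB" Spark cell limit is modelled as exactly 2 GiB.
SPARK_CELL_LIMIT_BYTES = 2 * 1024**3


def plan_read(raster_bytes: int, size_in_mb: int = -1) -> str:
    """Return "whole" or "tiled" for one raster under a given sizeInMB.

    size_in_mb <= 0 means no split (the new default); a positive value
    opts into tiling once the raster is larger than that many MB.
    """
    if size_in_mb > 0:
        return "tiled" if raster_bytes > size_in_mb * 1024**2 else "whole"
    if raster_bytes > SPARK_CELL_LIMIT_BYTES:
        raise ValueError(
            "whole-image tile exceeds the ~2 GB Spark cell limit; "
            "set sizeInMB to a positive value to tile"
        )
    return "whole"


fifty_mb = 50 * 1024**2
default_plan = plan_read(fifty_mb)        # new default: one whole-image tile
old_style_plan = plan_read(fifty_mb, 16)  # explicit opt-in to the old 16 MB split
```

With the default, a 50 MB raster stays a single tile, so a join keyed on the file path pairs each row with exactly one tile.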

Example + docs

  • xView clipping example ported to the lightweight API (gtiff_gbx reader/writer, Serverless-safe) and validated on Serverless: 450 Yacht clips, one whole-image tile per raster.
  • Beta release notes updated (§ v0.4.0); function-reference pages get a title: frontmatter so the browser tab shows the page title instead of the logo <img> markup; reader-doc filterRegex escaping fix; intro/badge/sidebar polish.

Verification

  • JAR-gated cross-tier parity suites + the pyrx/pygx/ds test suites pass in Docker; new tests cover the four-encoding contract, geom×raster CRS-alignment + graceful non-overlap, and the no-split reader default.
  • Binding parity 154/154; doc-coverage and user-facing-voice gates green.
  • xView end-to-end validated on Serverless (two successful runs).

This pull request and its description were written by Isaac.

Michael Johns added 29 commits June 17, 2026 10:24
geobrix[light] on Databricks Serverless (env v5) failed to install: the
mapbox-vector-tile<3 pin resolved to 2.2.0, which forces protobuf>=6.31.1
and upgrades the immutable base protobuf 5.29.4 -> 6.x. That conflicts
with Serverless's Spark-Connect/gRPC stack (grpcio-status /
googleapis-common-protos pin protobuf<6), so pip emits an ERROR conflict
report and protobuf 6 would break Spark Connect. mapbox-vector-tile 2.1.0
keeps protobuf at the base 5.29.4 and still has the 2.x default_options /
native-typed-attr encode API, so MVT stays on the lightweight/Serverless
tier. Pin >=2.1,<2.2 in both the [light] and [test] extras.

Adds notebooks/tests/serverless_light_smoke.py: a reusable Serverless
(env v5) probe that submits a one-off run to validate the [light] install
+ exercise the API (Spark-Connect health, versions, MVT encode, pyrx/pyvx
register); supports --diagnose / --probe-mvt / --func-validate / --env-deps
/ --isolate-register modes.

Co-authored-by: Isaac
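
As a rough illustration of the pin ranges involved, the sketch below checks plain dotted release numbers against a lower-inclusive, upper-exclusive range. The helper names are hypothetical, and it deliberately ignores pre-release and other PEP 440 details; the versions tested are the ones quoted in these commit messages.

```python
from typing import Optional, Tuple


def parse_release(text: str) -> Tuple[int, ...]:
    """Parse a plain dotted release such as "5.29.4" into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def _padded(a: Tuple[int, ...], b: Tuple[int, ...]):
    """Right-pad the shorter tuple with zeros so 2.1 and 2.1.0 compare equal."""
    width = max(len(a), len(b))
    return a + (0,) * (width - len(a)), b + (0,) * (width - len(b))


def in_range(version: str, lower: Optional[str] = None, upper: Optional[str] = None) -> bool:
    """True when lower <= version < upper (either bound may be omitted)."""
    release = parse_release(version)
    if lower is not None:
        a, b = _padded(release, parse_release(lower))
        if a < b:
            return False
    if upper is not None:
        a, b = _padded(release, parse_release(upper))
        if a >= b:
            return False
    return True


checks = {
    "mapbox-vector-tile 2.1.0 in >=2.1,<2.2": in_range("2.1.0", "2.1", "2.2"),
    "mapbox-vector-tile 2.2.0 in >=2.1,<2.2": in_range("2.2.0", "2.1", "2.2"),
    "protobuf 5.29.4 in <6": in_range("5.29.4", upper="6"),
    "protobuf 6.31.1 in <6": in_range("6.31.1", upper="6"),
}
```

Comparing integer tuples rather than strings matters here: as text, "3.18" sorts before "3.8", but as a release it is newer.
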
The install snippets showed the path-with-extra form
('/Volumes/.../...whl[light]') which fails on Serverless — %pip keeps the
surrounding quotes so pip reads [light] as part of the filename. Switch
installation/quick-start/README to the named, quoted PEP 508 form (one
argument), which installs cleanly on Serverless/standard/ARM, and add a
warning admonition explaining the gotcha. Also drop the misleading
'%pip install geobrix' (not on PyPI) from the VectorX install table.

Co-authored-by: Isaac
Drop the hardcoded personal /Users/<email> notebook path (flagged by the
internals-leak check); derive it from w.current_user.me().user_name and
read HOST from DATABRICKS_HOST when set.

Co-authored-by: Isaac
Source host/profile/Volume coordinates from databricks_cluster_config.env
(GBX_BUNDLE_VOLUME_*, DATABRICKS_CONFIG_PROFILE) and derive host from the
profile, so no workspace URL, Volume path, or profile name is hardcoded
in this committed file.

Co-authored-by: Isaac
Authenticate with WorkspaceClient(profile=...) (the configured CLI
profile) instead of minting/injecting a bearer token; keep
DATABRICKS_CONFIG_PROFILE OUT of os.environ (when present the CLI auth
takes a broken refresh path). Replace the catalog.listFunctions() count
(which fails on Serverless/UC with a DataType.fromDDL parse error,
unrelated to GeoBrix) with a real pyrx execution: build a tiny GeoTIFF
and read its width through the Column API; characterize listFunctions
separately.

Co-authored-by: Isaac
The SDK profile path refreshes+rotates the single-use OAuth refresh token
on every client creation, which breaks across repeated runs. Use the
CLI's cached access token (databricks auth token) + host from cfg, env
kept clean of DATABRICKS_CONFIG_PROFILE.

Co-authored-by: Isaac
rst_fromcontent(content, driver) requires the GDAL driver name; the probe
omitted it and would always fail the pyrx-exec check. Pass lit('GTiff').

Co-authored-by: Isaac
idna is transitive (requests/anyio/httpx) with no upper bound, so pip
pulls the latest (3.18) and shadows Serverless v5's base idna 3.7, firing
the 'a core Python package changed: idna' notebook notice. Nothing in the
stack needs >3.7, so cap <3.8 to keep the base in place.

Co-authored-by: Isaac
The raster_gbx 'read with options' snippet wrote the regex as a raw
string r".*\\.tif$" inside a non-raw triple-quote, so the raw-loader
rendered TWO backslashes; a user copying it got r".*\\.tif$" (literal
backslash) which matches nothing (FileNotFoundError: no files matched).
Make the constant a raw triple-quote and use r".*\.tif$" so the
rendered example is the correct single-backslash escaped-dot regex.

Co-authored-by: Isaac
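
The difference between the two patterns can be checked with Python's re module. This is a sketch for illustration: the file paths are made up, and the reader's own regex engine may differ from re in other respects.

```python
import re

path = "/Volumes/demo/rasters/scene.tif"  # hypothetical path for illustration

escaped_dot = r".*\.tif$"         # escaped dot: matches a literal ".tif" suffix
literal_backslash = r".*\\.tif$"  # literal backslash, then any one character, then "tif"

matches_escaped = re.search(escaped_dot, path) is not None
matches_double = re.search(literal_backslash, path) is not None

# The double-backslash pattern only matches names that contain a real backslash.
odd_name = "scene\\_tif"  # the 10 characters: scene\_tif
matches_odd = re.search(literal_backslash, odd_name) is not None
```

Here `matches_escaped` is true and `matches_double` is false, which is the "no files matched" behaviour a user would have seen after copying the double-backslash form.
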
The light raster readers emitted source as a bare /Volumes/... path
(os.path.abspath), but Spark binaryFile and the heavy gdal reader emit
dbfs:/Volumes/... So a light-produced DataFrame failed to join (0 rows)
against a binaryFile/heavy path column. Add to_spark_uri() (mirrors the
Hadoop convention: /Volumes -> dbfs:/Volumes, /dbfs -> dbfs:, other
schemes + local paths unchanged) and apply it to the OUTPUT source
column only; rasterio still reads the bare FUSE path.

Co-authored-by: Isaac
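
The path mapping described in this commit can be sketched as follows. This is an illustrative reimplementation from the commit message, not the project's to_spark_uri(); the name `to_spark_uri_sketch` and the example paths are hypothetical.

```python
def to_spark_uri_sketch(path: str) -> str:
    """Qualify FUSE-style paths with the dbfs: scheme; leave everything else alone.

    /Volumes/...  -> dbfs:/Volumes/...
    /dbfs/...     -> dbfs:/...
    other schemes and plain local paths are returned unchanged.
    """
    if path == "/Volumes" or path.startswith("/Volumes/"):
        return "dbfs:" + path
    if path == "/dbfs" or path.startswith("/dbfs/"):
        return "dbfs:" + (path[len("/dbfs"):] or "/")
    return path


examples = {
    "/Volumes/cat/sch/vol/a.tif": to_spark_uri_sketch("/Volumes/cat/sch/vol/a.tif"),
    "/dbfs/tmp/a.tif": to_spark_uri_sketch("/dbfs/tmp/a.tif"),
    "s3://bucket/a.tif": to_spark_uri_sketch("s3://bucket/a.tif"),
    "/tmp/a.tif": to_spark_uri_sketch("/tmp/a.tif"),
}
```

Applying the mapping only to the output source column keeps join keys aligned with binaryFile while rasterio continues to open the bare path.
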
Columns store dbfs:-qualified paths (to_spark_uri); every light place that
opens/writes a path via rasterio/pyogrio/os/GDAL now strips the scheme back
to the bare FUSE path via to_local_path (rst_fromfile, color-relief table,
raster reader listing + writer, vector reader/writer, pmtiles writer). Keeps
object-store schemes (s3/abfss/gs/http/vsi) untouched. Mirrors the heavy
convention (Hadoop-qualified columns, cleanPath for native opens).

Co-authored-by: Isaac
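
The reverse direction might look like the sketch below. It is an assumption-laden illustration, not the project's to_local_path: the name `to_local_path_sketch` is hypothetical, and mapping a non-Volumes dbfs:/ path to the /dbfs FUSE mount is assumed rather than stated in the commit message.

```python
def to_local_path_sketch(uri: str) -> str:
    """Strip the dbfs: scheme so native file libraries can open the path.

    dbfs:/Volumes/...  -> /Volumes/...
    dbfs:/...          -> /dbfs/...   (assumed FUSE mount for non-Volumes paths)
    object-store and other schemes (s3, abfss, gs, http, /vsi...) are untouched.
    """
    if uri.startswith("dbfs:/Volumes/"):
        return uri[len("dbfs:"):]
    if uri.startswith("dbfs:/"):
        return "/dbfs" + uri[len("dbfs:"):]
    return uri


local_volume = to_local_path_sketch("dbfs:/Volumes/cat/sch/vol/a.tif")
local_dbfs = to_local_path_sketch("dbfs:/tmp/a.tif")
untouched = to_local_path_sketch("s3://bucket/a.tif")
```
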
Use the light tier end-to-end: pip install geobrix[light] from the Volume,
import pyrx + register readers/writers, read rasters via the gtiff_gbx
DataSource (which yields the tile directly, no rst_fromfile), comment out
the Serverless-unsupported spark.conf.set lines, replace binaryFile
thumbnail .display() with .limit(1).show(vertical=True), and add a
FORCE_REBUILD flag so the tableExists guards don't skip steps. Join labels
to rasters on a normalized key so the clip count is non-zero (the source
column is now dbfs:-qualified, matching binaryFile/heavy).

Co-authored-by: Isaac
Replace the manual foreachPartition file-write with the lightweight
gtiff_gbx DataSource writer. Deterministic names via nameCol: select the
exact (source, tile) schema with source = index_right_type-id_feature-id,
so the writer emits <source>.tif (ext defaults to tif).

Co-authored-by: Isaac
Light rst_clip/rst_sample assumed WKB bytes (bytes(geom_wkb)) and threw
TypeError on a WKT/EWKT string, but heavy accepts all four encodings (the
xView example passes EWKT). Add a shapely-only gbx._geom.geom_to_wkb that
decodes WKB/EWKB bytes or WKT/EWKT str to WKB bytes; use it in both UDFs.
Centralize parse_geom in gbx._geom (pyvx re-exports) so pyrx needs no pyvx
import (no MVT-dep leak into a pyrx-only install).

Co-authored-by: Isaac
Route every remaining user-geometry input through the shared gbx._geom
decoder so encodings are consistent tier-wide (no per-function surprises):
viewshed observer_geom, gridfrompoints(+agg), dtmfromgeoms(+agg) and the
TIN point/breakline decoders. pygx._geom now re-exports the shared decoder
(BNG/quadbin/custom inherit). Output/encode paths and already-decoded core
paths unchanged.

Co-authored-by: Isaac
Floated red/white badge top-right of the Installation title.

Co-authored-by: Isaac
Re-run convenience: skip tables that already exist instead of always
rebuilding; set True to force a full rebuild.

Co-authored-by: Isaac
xview_object_clip exists from prior runs, so FORCE_REBUILD=False would
skip it and never exercise the rst_clip EWKT fix. Force just that cell to
always rebuild (overwrite); the upstream raster/object tables still skip.

Co-authored-by: Isaac
Light clip did a bare rasterio.mask with no reprojection, so an EWKT/EWKB
cutline in a different CRS than the raster raised 'Input shapes do not
overlap raster'. Mirror heavy RST_Clip: read the cutline SRID and reproject
to the raster CRS (rasterio.warp.transform_geom) before masking; fall back
to as-is when SRID is 0/unknown or the raster has no CRS. _clip_udf now
passes the SRID-bearing parsed geom (geom_to_wkb dropped the SRID).

Co-authored-by: Isaac
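
The fallback rule in this commit (reproject only when both the cutline SRID and the raster CRS are known) can be sketched without rasterio by passing the reprojection step in as a callable. The name `align_cutline` and the "EPSG:<srid>" string form are assumptions for illustration; in the real code path the callable would be something like rasterio.warp.transform_geom, which takes (src_crs, dst_crs, geom).

```python
from typing import Any, Callable, Optional


def align_cutline(
    geom: Any,
    srid: Optional[int],
    raster_crs: Optional[str],
    reproject: Callable[[str, str, Any], Any],
) -> Any:
    """Reproject geom to the raster CRS, or return it as-is when that is not possible.

    As-is cases: SRID is 0 or unknown, or the raster has no CRS.
    """
    if not srid or raster_crs is None:
        return geom
    return reproject(f"EPSG:{srid}", raster_crs, geom)


def fake_reproject(src_crs: str, dst_crs: str, geom: Any) -> Any:
    """Stand-in that records the call instead of transforming coordinates."""
    return ("reprojected", src_crs, dst_crs, geom)


cutline = {"type": "Point", "coordinates": (-77.0, 38.9)}
as_is = align_cutline(cutline, 0, "EPSG:32618", fake_reproject)
aligned = align_cutline(cutline, 4326, "EPSG:32618", fake_reproject)
```
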
Audit + fix the 'geometry not aligned to raster' class: every function
combining a geom with a raster now (1) reprojects the geom from its SRID to
the raster CRS, and (2) returns null/empty instead of hard-crashing when the
geom does not overlap the raster (matching heavy GDAL). Covers clip (graceful
non-overlap), sample (reproject + graceful), viewshed, and the
rasterize/dtmfromgeoms/gridfrompoints constructors as applicable.

Co-authored-by: Isaac
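
One way to picture the "null instead of crash" behaviour is a guard that checks for overlap before running the raster operation. The sketch below uses bounding boxes only, which is a simplification and an assumption; the names `bounds_overlap` and `clip_or_none` are hypothetical, and the actual functions may detect non-overlap differently.

```python
from typing import Any, Callable, Optional, Tuple

Bounds = Tuple[float, float, float, float]  # (minx, miny, maxx, maxy), same CRS


def bounds_overlap(a: Bounds, b: Bounds) -> bool:
    """True when the two boxes intersect (touching edges count as overlap)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def clip_or_none(geom_bounds: Bounds, raster_bounds: Bounds, clip: Callable[[], Any]) -> Optional[Any]:
    """Run clip() only when the geometry can intersect the raster; else return None."""
    if not bounds_overlap(geom_bounds, raster_bounds):
        return None
    return clip()


raster_box = (0.0, 0.0, 10.0, 10.0)
inside = clip_or_none((2.0, 2.0, 4.0, 4.0), raster_box, lambda: "clipped-tile")
outside = clip_or_none((20.0, 20.0, 30.0, 30.0), raster_box, lambda: "clipped-tile")
```

Returning None for the non-overlapping case lets a DataFrame job keep running and leaves the caller to filter out null rows.
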
gtiff_gbx split larger images into 4 window-tiles at the default 16MB, so
the image_file join paired each label with all 4 windows and rst_clip hit
tiles that don't contain the label. Read one whole-image tile per .tif
(large sizeInMB), matching the heavy rst_fromfile one-tile-per-image flow,
so each label clips against the tile that contains it.

Co-authored-by: Isaac
The 'Docs updated for v0.4.0' badge reads better on the intro landing
page than on installation; move it there.

Co-authored-by: Isaac
The one-line import swap is symmetric but the install is not: clarify
that the heavyweight tier needs the JAR + GDAL init script, not just
the wheel.

Co-authored-by: Isaac
Both tiers default the raster reader to no-split (one whole-image tile per
file) instead of the 16MB auto-split, which silently multi-tiled larger
rasters and broke path-keyed joins. sizeInMB<=0 = whole image; set a
positive MB value to opt into tiling. A single tile that would exceed the
~2GB Spark cell limit fails with an actionable 'set sizeInMB' message.
tileSize option left unchanged.

Co-authored-by: Isaac
Capitalize the intro sidebar entry (sidebar_label: Intro) and include the
execution-tiers tier-overview edits.

Co-authored-by: Isaac
Refresh the example's Last Modified date and add a note linking to the
Execution Tiers page from the Setup section.

Co-authored-by: Isaac
The function-reference H1 is an <img> logo; Docusaurus derived the page
<title> from it, so the browser tab showed the raw '<img src={...}'
markup. Add a title frontmatter (RasterX/GridX/VectorX Function Reference)
which takes precedence for <title> while the logo H1 still renders.

Co-authored-by: Isaac
Default to a full rebuild every run (validated: 450 Yacht clips, one
whole-image tile per raster). Restore cmd 33's guard to the standard
FORCE_REBUILD-or-not-exists form (the temporary 'if True' is no longer
needed). The clip write already uses the gtiff_gbx writer with nameCol.

Co-authored-by: Isaac
Fold the post-merge lightweight-tier hardening into the 0.4.0 notes:
Serverless install support (quoted PEP 508 + protobuf<6 pin), geometry
inputs accept WKB/EWKB/WKT/EWKT everywhere, geom x raster ops reproject to
the raster CRS + handle non-overlap gracefully, and the raster reader now
defaults to no-split (sizeInMB=-1). Plus gtiff_gbx nameCol + dbfs path
column where relevant. Also correct the limitations page so the
Serverless/Classic/ARM compute requirements read as heavyweight-only.

Co-authored-by: Isaac
@mjohns-databricks mjohns-databricks requested a review from a team as a code owner June 19, 2026 00:26
@mjohns-databricks mjohns-databricks merged commit d71e7e6 into main Jun 19, 2026
5 checks passed