Commit d71e7e6
authored
beta(0.4.0): lightweight-tier hardening — Serverless, geometry consistency, reader defaults (#40)
* fix(light): pin mapbox-vector-tile 2.1.x for Serverless (protobuf<6)
geobrix[light] on Databricks Serverless (env v5) failed to install: the
mapbox-vector-tile<3 pin resolved to 2.2.0, which forces protobuf>=6.31.1
and upgrades the immutable base protobuf 5.29.4 -> 6.x. That conflicts
with Serverless's Spark-Connect/gRPC stack (grpcio-status /
googleapis-common-protos pin protobuf<6), so pip emits an ERROR conflict
report and protobuf 6 would break Spark Connect. mapbox-vector-tile 2.1.0
keeps protobuf at the base 5.29.4 and still has the 2.x default_options /
native-typed-attr encode API, so MVT stays on the lightweight/Serverless
tier. Pin >=2.1,<2.2 in both the [light] and [test] extras.
Adds notebooks/tests/serverless_light_smoke.py: a reusable Serverless
(env v5) probe that submits a one-off run to validate the [light] install
+ exercise the API (Spark-Connect health, versions, MVT encode, pyrx/pyvx
register); supports --diagnose / --probe-mvt / --func-validate / --env-deps
/ --isolate-register modes.
Co-authored-by: Isaac
* docs: quote the PEP 508 'geobrix[light] @ file://' install everywhere
The install snippets showed the path-with-extra form
('/Volumes/.../...whl[light]') which fails on Serverless — %pip keeps the
surrounding quotes so pip reads [light] as part of the filename. Switch
installation/quick-start/README to the named, quoted PEP 508 form (one
argument), which installs cleanly on Serverless/standard/ARM, and add a
warning admonition explaining the gotcha. Also drop the misleading
'%pip install geobrix' (not on PyPI) from the VectorX install table.
Co-authored-by: Isaac
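
A minimal sketch of the quoted, named PEP 508 form described above. The helper name and the Volume path are hypothetical placeholders (the real wheel location is not given here); the point is only that the whole requirement, name plus extra plus `file://` URL, travels as one quoted argument.

```python
import shlex


def pep508_wheel_requirement(wheel_path: str, extra: str = "light",
                             name: str = "geobrix") -> str:
    """Build a named PEP 508 direct reference: 'name[extra] @ file://<abs path>'."""
    if not wheel_path.startswith("/"):
        raise ValueError("wheel_path must be an absolute path")
    return f"{name}[{extra}] @ file://{wheel_path}"


# Hypothetical Volume path and wheel filename, for illustration only.
WHEEL = "/Volumes/my_catalog/my_schema/my_volume/geobrix-0.4.0-py3-none-any.whl"

req = pep508_wheel_requirement(WHEEL)
# shlex.quote wraps the requirement in single quotes so it is one argument.
install_line = "%pip install " + shlex.quote(req)
print(install_line)
```

The extra sits on the package name rather than on the filename, so it no longer matters whether the surrounding quotes are kept.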
* fix(bench): derive serverless smoke notebook path from current user
Drop the hardcoded personal /Users/<email> notebook path (flagged by the
internals-leak check); derive it from w.current_user.me().user_name and
read HOST from DATABRICKS_HOST when set.
Co-authored-by: Isaac
* fix(bench): read serverless smoke config from gitignored env file
Source host/profile/Volume coordinates from databricks_cluster_config.env
(GBX_BUNDLE_VOLUME_*, DATABRICKS_CONFIG_PROFILE) and derive host from the
profile, so no workspace URL, Volume path, or profile name is hardcoded
in this committed file.
Co-authored-by: Isaac
* fix(bench): serverless smoke auth via profile + pyrx exec probe
Authenticate with WorkspaceClient(profile=...) (the configured CLI
profile) instead of minting/injecting a bearer token; keep
DATABRICKS_CONFIG_PROFILE OUT of os.environ (when present the CLI auth
takes a broken refresh path). Replace the catalog.listFunctions() count
(which fails on Serverless/UC with a DataType.fromDDL parse error,
unrelated to GeoBrix) with a real pyrx execution: build a tiny GeoTIFF
and read its width through the Column API; characterize listFunctions
separately.
Co-authored-by: Isaac
* fix(bench): serverless smoke mints CLI token once (no SDK refresh churn)
The SDK profile path refreshes+rotates the single-use OAuth refresh token
on every client creation, which breaks across repeated runs. Use the
CLI's cached access token (databricks auth token) + host from cfg, env
kept clean of DATABRICKS_CONFIG_PROFILE.
Co-authored-by: Isaac
* fix(bench): serverless smoke passes driver to rst_fromcontent
rst_fromcontent(content, driver) requires the GDAL driver name; the probe
omitted it and would always fail the pyrx-exec check. Pass lit('GTiff').
Co-authored-by: Isaac
* fix(light): cap idna<3.8 to keep Serverless base unchanged
idna is transitive (requests/anyio/httpx) with no upper bound, so pip
pulls the latest (3.18) and shadows Serverless v5's base idna 3.7, firing
the 'a core Python package changed: idna' notebook notice. Nothing in the
stack needs >3.7, so cap <3.8 to keep the base in place.
Co-authored-by: Isaac
* docs(readers): fix filterRegex example escaping (raw-loader showed \\.)
The raster_gbx 'read with options' snippet wrote the regex as a raw
string r".*\\.tif$" inside a non-raw triple-quote, so the raw-loader
rendered TWO backslashes; a user copying it got r".*\\.tif$" (literal
backslash) which matches nothing (FileNotFoundError: no files matched).
Make the constant a raw triple-quote and use r".*\.tif$" so the
rendered example is the correct single-backslash escaped-dot regex.
Co-authored-by: Isaac
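
The difference between the two patterns can be checked with the standard `re` module alone. The file names below are made up for illustration; how the reader applies `filterRegex` internally is not shown here.

```python
import re

# Made-up file names for illustration.
names = ["scene_01.tif", "scene_01.tiff", "notes.txt"]

good = re.compile(r".*\.tif$")   # escaped dot: a literal "." before "tif"
bad = re.compile(r".*\\.tif$")   # literal backslash, then any one character, then "tif"

good_hits = [n for n in names if good.match(n)]
bad_hits = [n for n in names if bad.match(n)]

print(good_hits)  # ['scene_01.tif']
print(bad_hits)   # []

# The double-backslash pattern only matches names that contain a real backslash.
assert bad.match("scene_01\\.tif") is not None
```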
* fix(ds): light raster reader source column dbfs:-qualified
The light raster readers emitted source as a bare /Volumes/... path
(os.path.abspath), but Spark binaryFile and the heavy gdal reader emit
dbfs:/Volumes/... So a light-produced DataFrame failed to join (0 rows)
against a binaryFile/heavy path column. Add to_spark_uri() (mirrors the
Hadoop convention: /Volumes -> dbfs:/Volumes, /dbfs -> dbfs:, other
schemes + local paths unchanged) and apply it to the OUTPUT source
column only; rasterio still reads the bare FUSE path.
Co-authored-by: Isaac
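
The mapping rules listed above can be sketched as follows. This is an illustration written from the commit description, not the project's actual `to_spark_uri`; the scheme-detection regex in particular is an assumption.

```python
import re

# Assumption: anything that starts with "<scheme>:" is already qualified.
_SCHEME = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def to_spark_uri(path: str) -> str:
    """Qualify a bare FUSE path the way the commit message describes."""
    if _SCHEME.match(path):
        return path  # dbfs:, s3://, abfss://, file:, ... stay unchanged
    if path == "/Volumes" or path.startswith("/Volumes/"):
        return "dbfs:" + path  # /Volumes/... -> dbfs:/Volumes/...
    if path == "/dbfs" or path.startswith("/dbfs/"):
        return "dbfs:" + (path[len("/dbfs"):] or "/")  # /dbfs/x -> dbfs:/x
    return path  # other local paths stay unchanged


print(to_spark_uri("/Volumes/c/s/v/a.tif"))
print(to_spark_uri("/dbfs/tmp/a.tif"))
```

Applying this only to the output `source` column keeps join keys consistent with binaryFile while native reads still use the bare path.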
* fix(light): strip dbfs:/file: scheme before native file ops
Columns store dbfs:-qualified paths (to_spark_uri); every light place that
opens/writes a path via rasterio/pyogrio/os/GDAL now strips the scheme back
to the bare FUSE path via to_local_path (rst_fromfile, color-relief table,
raster reader listing + writer, vector reader/writer, pmtiles writer). Keeps
object-store schemes (s3/abfss/gs/http/vsi) untouched. Mirrors the heavy
convention (Hadoop-qualified columns, cleanPath for native opens).
Co-authored-by: Isaac
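
A rough sketch of the inverse direction, again written from the description rather than taken from the code base. It assumes that non-Volumes `dbfs:` paths are FUSE-mounted under `/dbfs`; that detail is not stated in the commit message.

```python
def to_local_path(uri: str) -> str:
    """Strip a dbfs:/file: scheme so native libraries see a bare path."""
    if uri.startswith("dbfs:"):
        rest = "/" + uri[len("dbfs:"):].lstrip("/")
        if rest == "/Volumes" or rest.startswith("/Volumes/"):
            return rest  # dbfs:/Volumes/... -> /Volumes/...
        return "/dbfs" + rest  # assumption: other DBFS paths live under /dbfs
    if uri.startswith("file:"):
        return "/" + uri[len("file:"):].lstrip("/")  # file:/x and file:///x -> /x
    return uri  # s3/abfss/gs/http/vsi and bare paths are left untouched


print(to_local_path("dbfs:/Volumes/c/s/v/a.tif"))
print(to_local_path("s3://bucket/a.tif"))
```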
* docs(xview): port example to the lightweight API
Use the light tier end-to-end: pip install geobrix[light] from the Volume,
import pyrx + register readers/writers, read rasters via the gtiff_gbx
DataSource (which yields the tile directly, no rst_fromfile), comment out
the Serverless-unsupported spark.conf.set lines, replace binaryFile
thumbnail .display() with .limit(1).show(vertical=True), and add a
FORCE_REBUILD flag so the tableExists guards don't skip steps. Join labels
to rasters on a normalized key so the clip count is non-zero (the source
column is now dbfs:-qualified, matching binaryFile/heavy).
Co-authored-by: Isaac
* docs(xview): write clipped rasters via gtiff_gbx writer + nameCol
Replace the manual foreachPartition file-write with the lightweight
gtiff_gbx DataSource writer. Deterministic names via nameCol: select the
exact (source, tile) schema with source = index_right_type-id_feature-id,
so the writer emits <source>.tif (ext defaults to tif).
Co-authored-by: Isaac
* fix(pyrx): rst_clip/rst_sample accept WKB/EWKB/WKT/EWKT (parity)
Light rst_clip/rst_sample assumed WKB bytes (bytes(geom_wkb)) and threw
TypeError on a WKT/EWKT string, but heavy accepts all four encodings (the
xView example passes EWKT). Add a shapely-only gbx._geom.geom_to_wkb that
decodes WKB/EWKB bytes or WKT/EWKT str to WKB bytes; use it in both UDFs.
Centralize parse_geom in gbx._geom (pyvx re-exports) so pyrx needs no pyvx
import (no MVT-dep leak into a pyrx-only install).
Co-authored-by: Isaac
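
To illustrate how the four encodings differ at the input boundary, here is a stdlib-only sketch that classifies a value. It is not the shapely-based `geom_to_wkb` from the commit; the function name is hypothetical, and it only recognizes SRID-bearing EWKB (the `0x20000000` flag in the geometry-type word).

```python
import struct

SRID_FLAG = 0x20000000  # set in the EWKB type word when an SRID follows


def detect_geom_encoding(geom) -> str:
    """Return 'WKT', 'EWKT', 'WKB' or 'EWKB' for a str/bytes-like geometry."""
    if isinstance(geom, str):
        # EWKT prefixes the WKT body with 'SRID=<n>;'
        return "EWKT" if geom.lstrip().upper().startswith("SRID=") else "WKT"
    data = bytes(geom)  # also accepts bytearray / memoryview
    if len(data) < 5 or data[0] not in (0, 1):
        raise ValueError("not a WKB/EWKB payload")
    byte_order = "<" if data[0] == 1 else ">"
    (type_word,) = struct.unpack(byte_order + "I", data[1:5])
    return "EWKB" if type_word & SRID_FLAG else "WKB"


# Hand-packed little-endian POINT(1 2), without and with SRID 4326.
wkb_point = struct.pack("<BIdd", 1, 1, 1.0, 2.0)
ewkb_point = struct.pack("<BIIdd", 1, 1 | SRID_FLAG, 4326, 1.0, 2.0)

print(detect_geom_encoding("POINT (1 2)"))
print(detect_geom_encoding("SRID=4326;POINT (1 2)"))
print(detect_geom_encoding(wkb_point))
print(detect_geom_encoding(ewkb_point))
```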
* fix(light): all geom-accepting functions handle WKB/EWKB/WKT/EWKT
Route every remaining user-geometry input through the shared gbx._geom
decoder so encodings are consistent tier-wide (no per-function surprises):
viewshed observer_geom, gridfrompoints(+agg), dtmfromgeoms(+agg) and the
TIN point/breakline decoders. pygx._geom now re-exports the shared decoder
(BNG/quadbin/custom inherit). Output/encode paths and already-decoded core
paths unchanged.
Co-authored-by: Isaac
* docs(installation): add 'Docs updated for v0.4.0 (coming soon)' badge
Floated red/white badge top-right of the Installation title.
Co-authored-by: Isaac
* docs(xview): default FORCE_REBUILD=False (skip already-built tables)
Re-run convenience: skip tables that already exist instead of always
rebuilding; set True to force a full rebuild.
Co-authored-by: Isaac
* docs(xview): force the clip step (cmd 33) to rebuild on re-run
xview_object_clip exists from prior runs, so FORCE_REBUILD=False would
skip it and never exercise the rst_clip EWKT fix. Force just that cell to
always rebuild (overwrite); the upstream raster/object tables still skip.
Co-authored-by: Isaac
* fix(pyrx): rst_clip reprojects cutline to raster CRS (heavy parity)
Light clip did a bare rasterio.mask with no reprojection, so an EWKT/EWKB
cutline in a different CRS than the raster raised 'Input shapes do not
overlap raster'. Mirror heavy RST_Clip: read the cutline SRID and reproject
to the raster CRS (rasterio.warp.transform_geom) before masking; fall back
to as-is when SRID is 0/unknown or the raster has no CRS. _clip_udf now
passes the SRID-bearing parsed geom (geom_to_wkb dropped the SRID).
Co-authored-by: Isaac
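
The fallback rule above (reproject only when both sides have a known CRS) can be sketched as a small decision helper. The function name is hypothetical, and treating the SRID as an EPSG code is an assumption; the actual reprojection via `rasterio.warp.transform_geom` is left out so the sketch stays dependency-free.

```python
from typing import Optional


def cutline_source_crs(srid: Optional[int], raster_crs: Optional[str]) -> Optional[str]:
    """Return the CRS to reproject the cutline from, or None to use it as-is."""
    if not srid or srid < 0:
        return None  # SRID 0/unknown: mask with the geometry unchanged
    if not raster_crs:
        return None  # raster has no CRS: nothing to reproject into
    return f"EPSG:{srid}"  # assumption: the SRID is an EPSG code


print(cutline_source_crs(4326, "EPSG:32618"))
print(cutline_source_crs(0, "EPSG:32618"))
```

A caller would reproject the cutline from the returned CRS to the raster CRS before masking, and skip that step when the helper returns `None`.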
* fix(pyrx): geom x raster ops align CRS + handle non-overlap gracefully
Audit + fix the 'geometry not aligned to raster' class: every function
combining a geom with a raster now (1) reprojects the geom from its SRID to
the raster CRS, and (2) returns null/empty instead of hard-crashing when the
geom does not overlap the raster (matching heavy GDAL). Covers clip (graceful
non-overlap), sample (reproject + graceful), viewshed, and the
rasterize/dtmfromgeoms/gridfrompoints constructors as applicable.
Co-authored-by: Isaac
* docs(xview): read whole-image tiles (sizeInMB) so clips align to labels
gtiff_gbx split larger images into 4 window-tiles at the default 16MB, so
the image_file join paired each label with all 4 windows and rst_clip hit
tiles that don't contain the label. Read one whole-image tile per .tif
(large sizeInMB), matching the heavy rst_fromfile one-tile-per-image flow,
so each label clips against the tile that contains it.
Co-authored-by: Isaac
* docs: move v0.4.0 badge from installation to intro page
The 'Docs updated for v0.4.0' badge reads better on the intro landing
page than on installation; move it there.
Co-authored-by: Isaac
* docs(execution-tiers): warn heavyweight needs JAR+init, not the wheel
The one-line import swap is symmetric but the install is not: clarify
that the heavyweight tier needs the JAR + GDAL init script, not just
the wheel.
Co-authored-by: Isaac
* feat(readers): raster sizeInMB default -1 (no split; one tile per file)
Both tiers default the raster reader to no-split (one whole-image tile per
file) instead of the 16MB auto-split, which silently multi-tiled larger
rasters and broke path-keyed joins. sizeInMB<=0 = whole image; set a
positive MB value to opt into tiling. A single tile that would exceed the
~2GB Spark cell limit fails with an actionable 'set sizeInMB' message.
tileSize option left unchanged.
Co-authored-by: Isaac
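
The new default can be summarized as a small decision rule. The sketch below is illustrative only: the function name, the exact 2 GiB constant (the message only says "~2GB"), and the error wording are assumptions, not the reader's actual code.

```python
from typing import Optional

# Assumption: "~2GB" is modeled as exactly 2 GiB for this sketch.
SPARK_CELL_LIMIT_BYTES = 2 * 1024**3


def resolve_tile_budget(size_in_mb: int, whole_image_bytes: int) -> Optional[int]:
    """Return None for one whole-image tile, else the target tile size in bytes."""
    if size_in_mb <= 0:
        # No split requested: the whole image must fit in a single Spark cell.
        if whole_image_bytes > SPARK_CELL_LIMIT_BYTES:
            raise ValueError(
                f"whole-image tile of {whole_image_bytes} bytes exceeds the ~2GB "
                "Spark cell limit; set sizeInMB to a positive value to tile it"
            )
        return None
    return size_in_mb * 1024 * 1024  # opt-in tiling at the requested size


print(resolve_tile_budget(-1, 50 * 1024**2))  # None: one tile per file
print(resolve_tile_budget(16, 50 * 1024**2))  # 16777216
```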
* docs: Intro sidebar label + execution-tiers wording
Capitalize the intro sidebar entry (sidebar_label: Intro) and include the
execution-tiers tier-overview edits.
Co-authored-by: Isaac
* docs(xview): refresh last-modified + link to Execution Tiers
Refresh the example's Last Modified date and add a note linking to the
Execution Tiers page from the Setup section.
Co-authored-by: Isaac
* docs(api): set title frontmatter so browser tab isn't the logo JSX
The function-reference H1 is an <img> logo; Docusaurus derived the page
<title> from it, so the browser tab showed the raw '<img src={...}'
markup. Add a title frontmatter (RasterX/GridX/VectorX Function Reference)
which takes precedence for <title> while the logo H1 still renders.
Co-authored-by: Isaac
* docs(xview): FORCE_REBUILD=True default + restore clip-cell guard
Default to a full rebuild every run (validated: 450 Yacht clips, one
whole-image tile per raster). Restore cmd 33's guard to the standard
FORCE_REBUILD-or-not-exists form (the temporary 'if True' is no longer
needed). The clip write already uses the gtiff_gbx writer with nameCol.
Co-authored-by: Isaac
* docs(release-notes): v0.4.0 lightweight hardening notes
Fold the post-merge lightweight-tier hardening into the 0.4.0 notes:
Serverless install support (quoted PEP 508 + protobuf<6 pin), geometry
inputs accept WKB/EWKB/WKT/EWKT everywhere, geom x raster ops reproject to
the raster CRS + handle non-overlap gracefully, and the raster reader now
defaults to no-split (sizeInMB=-1). Plus gtiff_gbx nameCol + dbfs path
column where relevant. Also correct the limitations page so the
Serverless/Classic/ARM compute requirements read as heavyweight-only.
Co-authored-by: Isaac
---------
Co-authored-by: Michael Johns <user.name>
45 files changed
Lines changed: 2100 additions & 257 deletions
File tree
- docs
- docs
- api
- readers
- tests/python/readers
- notebooks
- examples/xview
- tests
- python/geobrix
- src/databricks/labs/gbx
- ds
- pygx
- pyrx
- core
- pyvx
- test
- ds
- pygx
- pyrx
- pyvx
- src
- main/scala/com/databricks/labs/gbx/rasterx
- ds/gdal
- operations
- test/scala/com/databricks/labs/gbx/rasterx/operations