databrickslabs
diff --git a/.github/actions/pyrx_build/action.yml b/.github/actions/pyrx_build/action.yml
Lines changed: 3 additions & 2 deletions
diff --git a/README.md b/README.md
Lines changed: 6 additions & 4 deletions
diff --git a/docs/docs/api/execution-tiers.mdx b/docs/docs/api/execution-tiers.mdx
Lines changed: 35 additions & 0 deletions
diff --git a/docs/docs/api/raster-functions.mdx b/docs/docs/api/raster-functions.mdx
Lines changed: 26 additions & 3 deletions
diff --git a/docs/docs/api/stac.mdx b/docs/docs/api/stac.mdx
Lines changed: 253 additions & 0 deletions
@@ -63,5 +63,6 @@ runs:
         # The lightweight tier is exercised ONLY here (the heavy phase skips these
         # dirs via test/conftest.py collect_ignore). Every light test dir must be
         # listed: pyrx, ds, pyvx, pygx (light GridX), pmtiles_light (light
-        # pmtiles_agg). See test/conftest.py for the maintained condition.
-        pytest test/pyrx test/ds test/pyvx test/pygx test/pmtiles_light -m "not integration" -v
+        # pmtiles_agg), stac (light STAC client, [stac] extra). See
+        # test/conftest.py for the maintained condition.
+        pytest test/pyrx test/ds test/pyvx test/pygx test/pmtiles_light test/stac -m "not integration" -v
@@ -43,13 +43,15 @@ All SQL functions register with a `gbx_` prefix (e.g. `gbx_rst_clip`, `gbx_bng_c
 
 GeoBrix supports both current Databricks Runtime LTS releases:
 
-| DBR LTS | Ubuntu | Spark | Python | Scala | Java | GeoBrix |
-|---|---|---|---|---|---|---|
-| **17.3 LTS** | 24.04 | 4.0.0 | 3.12.3 | 2.13.16 | 17 | ✅ Supported |
-| **18 LTS** | 24.04 | 4.1.0 | 3.12.3 | 2.13.16 | 21 | ✅ Supported |
+| DBR LTS | Ubuntu | Spark | Python | Scala | Java | Serverless env | GeoBrix |
+|---|---|---|---|---|---|---|---|
+| **17.3 LTS** | 24.04 | 4.0.0 | 3.12.3 | 2.13.16 | 17 | **5+** (Py 3.12) | ✅ Supported |
+| **18 LTS** | 24.04 | 4.1.0 | 3.12.3 | 2.13.16 | 21 | **5+** (Py 3.12) | ✅ Supported |
 
 A **single wheel + single JAR** runs on both: Scala 2.13.16 matches both runtimes, the JAR is compiled to Java-17 bytecode so it loads on both JVMs, and Spark is a `provided` dependency.
 
+The **Serverless env** column is the minimum Serverless environment version for the lightweight tier: **version 5+** provides Python 3.12, which the `[light]` dependencies require (Python ≥ 3.11). Older environment versions (Python 3.10) can't install `geobrix[light]`. Env v5 release notes: [AWS](https://docs.databricks.com/aws/en/release-notes/serverless/environment-version/five) · [Azure](https://learn.microsoft.com/azure/databricks/release-notes/serverless/environment-version/five) · [GCP](https://docs.databricks.com/gcp/en/release-notes/serverless/environment-version/five).
+
 > **DBR 19 LTS is coming soon**, built on **Ubuntu 26.04**. The **lightweight** tier (pure-Python, rasterio's bundled GDAL) will be unaffected; the **heavyweight** tier's native GDAL/OGR libraries are compiled against the cluster OS, so they will need to be rebuilt for the new base image.
 
 ## Quick start (lightweight)
 
@@ -24,6 +24,41 @@ After an explicit `rx.register(spark)`, the SQL names are identical too (`gbx_rs
 The one-line *import* swap is symmetric, but the *install* is not. The **lightweight** tier is just the `[light]` wheel (`%pip`, no JAR, no init script). The **heavyweight** tier additionally requires the **GeoBrix JAR as a cluster library and the GDAL init script** on a **classic x86 cluster** — the wheel alone will not resolve the import or the JVM expressions. See [Installation](../installation) for the heavyweight setup.
 :::
 
+## Registering a subset (`only=`)
+
+`register()` installs every `gbx_*` SQL name for the tier. To register just the functions a session uses, pass `only=` (lightweight tiers — `pyrx`, `pygx`, `pyvx`):
+
+```python
+from databricks.labs.gbx.pyrx import functions as rx
+
+rx.register(spark, only=["rst_slope", "rst_clip"])   # just these two
+rx.register(spark)                                   # all (default)
+```
+
+Names are case-insensitive and accept either the SQL name (`gbx_rst_slope`) or the short form (`rst_slope`). An unrecognized name raises `ValueError` (typo guard). `only=[]` registers nothing.
+
+**Readers and writers** register through a separate entry point and take `only=` too — selected by **format name** (with or without the `_gbx` suffix):
+
+```python
+from databricks.labs.gbx.ds import register as ds_register
+
+ds_register.register(spark, only=["raster_gbx", "gtiff_gbx"])  # just these formats
+ds_register.register(spark, only=["shapefile"])               # 'shapefile' -> 'shapefile_gbx'
+ds_register.register(spark)                                    # all readers/writers (default)
+```
+
+**Mixing tiers per function.** Because both tiers share the `gbx_*` names (last registration wins), you can register the heavyweight set and then override individual functions with the lightweight implementation:
+
+```python
+from databricks.labs.gbx.rasterx import functions as heavy
+from databricks.labs.gbx.pyrx    import functions as light
+
+heavy.register(spark)                       # all heavy gbx_rst_*
+light.register(spark, only=["rst_slope"])   # gbx_rst_slope now lightweight
+```
+
+The reverse — re-registering a few **heavy** functions over a lightweight session — is not yet available; `only=` is currently a lightweight-tier feature (heavy registers its full set). Mixing works because both tiers use the same tile struct and GTiff payload, so a tile produced by one tier flows into a function from the other.
+
 ## Tradeoffs
 
 | Aspect | Heavyweight (rasterx) | Lightweight (pyrx) |
 
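The `only=` matching rules described in the hunk above (case-insensitive names, optional `gbx_` prefix, `ValueError` on an unknown name, empty list registers nothing) can be sketched in plain Python. This is an illustration only, not GeoBrix's implementation: `resolve_only`, `PREFIX`, and `KNOWN` are hypothetical names, and the set of known functions is a made-up sample.

```python
from typing import Iterable, Optional, Set

PREFIX = "gbx_"  # hypothetical constant; the SQL names use this prefix


def resolve_only(only: Optional[Iterable[str]], known: Set[str]) -> Set[str]:
    """Return the short function names that should be registered.

    `known` holds short, lowercase names such as "rst_slope".
    """
    if only is None:
        return set(known)  # default: register everything
    selected = set()
    for raw in only:
        name = raw.strip().lower()        # names are case-insensitive
        if name.startswith(PREFIX):
            name = name[len(PREFIX):]     # accept "gbx_rst_slope" or "rst_slope"
        if name not in known:
            raise ValueError(f"unknown function name: {raw!r}")  # typo guard
        selected.add(name)
    return selected                        # an empty `only` selects nothing


KNOWN = {"rst_slope", "rst_clip", "rst_fromfile"}  # sample data, not the real list
print(sorted(resolve_only(["GBX_RST_SLOPE", "rst_clip"], KNOWN)))  # ['rst_clip', 'rst_slope']
print(resolve_only([], KNOWN))                                     # set()
```

Normalizing to the short form before the lookup keeps a single source of truth for valid names, so both spellings hit the same typo guard.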
@@ -766,10 +766,16 @@ Create or load rasters from path, binary content, or bands (4 total).
 
 ### rst_fromfile
 
-<Tier both/>
+<Tier light/>
 
-:::note Lightweight tier (pyrx)
-Powered by **rasterio**. Opens the raster at `path` and re-encodes it as a GeoTIFF tile; the `driver` arg is a format hint (rasterio auto-detects on open). A missing/unreadable path returns null.
+:::tip Loading rasters at scale? Use the [Raster Reader](../readers/raster).
+`rst_fromfile` is a **convenience** for turning a column of raster paths into a tile column inline. To ingest rasters as a normal Spark job, with partitioned parallel reads, optional tiling (`sizeInMB`), and FUSE-safe Volume staging, use the **[Raster Reader](../readers/raster)** (`raster_gbx` / `gtiff_gbx`, or heavyweight `gdal` / `gtiff_gdal`): `spark.read.format("raster_gbx").load(path)`.
+:::
+
+:::note Lightweight tier only (pyrx) — callable from Python and SQL
+Powered by **rasterio**. Opens the raster at `path` and re-encodes it as a GeoTIFF tile; the `driver` arg is a format hint (rasterio auto-detects on open). A missing/unreadable path returns null. Requires `geobrix[light]`.
+
+`gbx_rst_fromfile` has **no heavyweight (JVM) implementation**. On Databricks the executor JVM cannot read a Unity Catalog Volume (`/Volumes/...`) FUSE path — the UC credential is held only by Spark's managed Python worker — so the function is registered as a Python UDF even when you call it from SQL. With `geobrix[light]` installed it is available in SQL (`SELECT gbx_rst_fromfile(...)`) and in Python (`rx.rst_fromfile(...)`); without `[light]` it is not registered, and the Python binding raises with guidance.
 :::
 
 Load a raster from a file path.
@@ -784,6 +790,23 @@ Load a raster from a file path.
 
 <CodeFromTest language="sql" source="docs/tests/python/api/rasterx_functions_sql.py" testFile="docs/tests/python/api/test_rasterx_functions_sql.py" functionName="rst_fromfile_sql_example" outputConstant="rst_fromfile_sql_example_output" code={rasterxSqlCode} />
 
+:::tip Portable alternative — `binaryFile` + `rst_fromcontent`
+If `geobrix[light]` is not installed, or you want a tier-agnostic path that works on any compute, read the bytes with Spark's built-in `binaryFile` reader and build the tile from content. This reads `/Volumes` reliably (the reader runs in Spark, which holds the credential) and works in both tiers:
+
+```python
+df = (
+    spark.read.format("binaryFile")
+    .load("/Volumes/main/geobrix_samples/geobrix-examples/nyc/*.tif")
+    .selectExpr("path", "gbx_rst_fromcontent(content, 'GTiff') AS tile")
+)
+```
+
+```sql
+SELECT path, gbx_rst_fromcontent(content, 'GTiff') AS tile
+FROM read_files('/Volumes/main/geobrix_samples/geobrix-examples/nyc/', format => 'binaryFile')
+```
+:::
+
 ---
 
 ### rst_fromcontent
 
@@ -0,0 +1,253 @@
+---
+sidebar_position: 10
+title: STAC Client
+---
+
+# STAC Client
+
+`databricks.labs.gbx.stac.StacClient` is a lightweight, Serverless-safe client for **distributed STAC search**, **resilient asset download**, and **repair** of invalid files — against any STAC catalog (default: [Planetary Computer](https://planetarycomputer.microsoft.com/)).
+
+Where a single-node STAC script serializes search requests and downloads, `StacClient` fans both operations out across the Spark cluster — one task per AOI row (search) and one task per asset (download). On Serverless, parallelism is controlled via `partitions=` and `DataFrame.repartition()`, with no `spark.conf.set` calls.
+
+:::info Opt-in extra
+`StacClient` requires `geobrix[light,stac]`. The `[stac]` extra pulls in `pystac-client`, `planetary-computer`, `tenacity`, and `requests`. On Serverless, environment version 5 or later (Python 3.12) is required.
+:::
+
+## Installation
+
+```bash
+pip install "geobrix[light,stac]"
+```
+
+From a Databricks notebook (Serverless or classic):
+
+```python
+%pip install --quiet "geobrix[light,stac] @ file:///Volumes/<catalog>/<schema>/<volume>/geobrix-0.4.0-py3-none-any.whl"
+```
+
+## Import
+
+```python
+from databricks.labs.gbx.stac import StacClient
+```
+
+## Constructor
+
+```python
+StacClient(
+    catalog="https://planetarycomputer.microsoft.com/api/stac/v1",  # default
+    sign="planetary_computer",   # 'planetary_computer' | None | callable(href)->href
+)
+```
+
+| Parameter | Default | Description |
+|---|---|---|
+| `catalog` | Planetary Computer | URL of any STAC API endpoint. |
+| `sign` | `"planetary_computer"` | Signing strategy. `"planetary_computer"` uses `planetary_computer.sign_inplace`; `None` skips signing (public catalogs); or pass any `callable(href: str) -> str`. |
+
+---
+
+## Methods
+
+### `search`
+
+Fan out AOI rows to the STAC catalog and return one row per `(input-row, item, asset)`.
+
+```python
+assets_df = client.search(
+    df,                              # DataFrame with a GeoJSON-geometry column
+    geojson_col="geojson",           # column name holding the GeoJSON geometry string
+    collections=["sentinel-2-l2a"],  # list of STAC collection IDs
+    datetime="2022-06-01/2022-09-01",  # ISO datetime or range (STAC datetime syntax)
+    partitions=512,                  # repartition fan-out; no spark.conf
+)
+```
+
+**Parameters:**
+
+| Parameter | Type | Default | Description |
+|---|---|---|---|
+| `df` | `DataFrame` | — | Input rows. Each row is one AOI. |
+| `geojson_col` | `str` | — | Column containing GeoJSON geometry strings (intersects filter). |
+| `collections` | `List[str]` | — | STAC collection IDs to search. |
+| `datetime` | `str` | — | ISO datetime or range (`"YYYY-MM-DD"` or `"start/end"`). |
+| `partitions` | `int` | `512` | Target partition count for the fan-out repartition. |
+
+**Output columns** (in addition to all input columns, carried through):
+
+| Column | Type | Description |
+|---|---|---|
+| `item_id` | string | STAC item identifier. |
+| `date` | string | Acquisition date from `properties.datetime`. |
+| `item_bbox` | string | Item bounding box (GeoJSON). |
+| `asset_name` | string | Asset key (e.g. `"B02"`, `"B03"`). |
+| `href` | string | Asset download URL at search time (may expire; re-signed per attempt in `download`). |
+| `item_properties` | string | Full item properties JSON. |
+
+One row is emitted per `(input-row, item, asset)`. The same STAC item reached via multiple AOI rows produces multiple rows — `download` deduplicates internally to unique `(item_id, asset_name)`.
+
+---
+
+### `download`
+
+Resilient, validated asset download — one Spark task per asset. Deduplicates to unique `(item_id, asset_name)` so the same item reached via multiple AOIs is fetched exactly once.
+
+```python
+files_df = client.download(
+    assets_df,
+    out_dir,                                    # UC Volume path (FUSE-mounted)
+    asset_names=["B02", "B03", "B04", "B08"],   # None = all assets present in df
+    name="{asset_name}_{item_id}.tif",          # filename template
+    validate=True,                              # rasterio read-validation per file
+    max_tries=5,
+    partitions=None,                            # default: one task per asset
+)
+```
+
+**Parameters:**
+
+| Parameter | Type | Default | Description |
+|---|---|---|---|
+| `df` | `DataFrame` | — | Must contain `item_id` and `asset_name` columns. (`href` from `search` output is accepted but not required — the href is re-signed per attempt from `item_id` + `asset_name`.) |
+| `out_dir` | `str` | — | Destination directory (e.g. a UC Volume FUSE path). |
+| `asset_names` | `List[str]` or `None` | `None` | Filter to these asset keys. `None` downloads all assets present in the DataFrame. |
+| `name` | `str` | `"{asset_name}_{item_id}.tif"` | Filename template. Supports `{asset_name}` and `{item_id}` placeholders. |
+| `validate` | `bool` | `True` | Open and decode a raster window after download to reject throttled error bodies and truncated files that a size check would pass. |
+| `max_tries` | `int` | `5` | Maximum download attempts per asset (exponential backoff between attempts). |
+| `partitions` | `int` or `None` | `None` | Explicit repartition before download. `None` sets one partition per unique asset. |
+
+**Output columns:**
+
+| Column | Type | Description |
+|---|---|---|
+| `item_id` | string | STAC item identifier. |
+| `asset_name` | string | Asset key. |
+| `out_file_path` | string | Absolute path of the written file on the Volume. |
+| `out_file_sz` | long | File size in bytes (`0` if the download failed). |
+| `is_out_file_valid` | boolean | `true` if the file passed read-validation; `false` otherwise. |
+| `last_update` | timestamp | Time of the download attempt. |
+
+**Resilience behavior:**
+
+- The href is **re-signed on every attempt** — signed URLs from `search` may expire before a retry; `download` always re-derives a fresh URL from `item_id` + `asset_name`.
+- HTTP errors (`4xx`/`5xx`, including throttle responses) trigger `tenacity` exponential backoff and retry up to `max_tries`.
+- Each file is staged to **worker-local disk first**; it is copied to the Volume only after passing read-validation. No partial or corrupt file is written to the Volume.
+- Files that already exist and are valid (`is_out_file_valid = true`) are **skipped** — the operation is idempotent.
+- A failed asset (exhausted retries, or read-validation failure) sets `is_out_file_valid = false` and `out_file_sz = 0`; use `repair` to retry those rows.
+
+---
+
+### `repair`
+
+Re-download invalid files and merge the results back to the Delta table.
+
+```python
+repaired = client.repair(
+    "band_b02",                         # Delta table name or DataFrame
+    where="is_out_file_valid = false",  # SQL filter over the table
+)
+```
+
+**Parameters:**
+
+| Parameter | Type | Default | Description |
+|---|---|---|---|
+| `table_or_df` | `str` or `DataFrame` | — | Delta table name or a DataFrame with `item_id`, `asset_name`, `out_file_path`, `is_out_file_valid`. |
+| `where` | `str` | `"is_out_file_valid = false"` | SQL predicate selecting rows to re-download. |
+
+**Behavior:** reads the table, filters to matching rows, re-runs the resilient download on that subset, then merges updated `out_file_path`, `out_file_sz`, `is_out_file_valid`, and `last_update` back into the Delta table. Returns the repaired subset as a DataFrame.
+
+---
+
+## End-to-end example
+
+This illustrates the full search → download → repair flow. The EO-series notebooks are the fully executed, worked example; see [EO Series](../notebooks/eo-series).
+
+```python
+from databricks.labs.gbx.stac import StacClient
+from pyspark.sql import functions as F
+
+client = StacClient()   # default: Planetary Computer, sign=planetary_computer
+
+# 1 — Search: one row per (AOI cell, STAC item, asset)
+#     df_cells has a "geojson" column with one GeoJSON geometry per H3 cell
+assets_df = client.search(
+    df_cells,
+    geojson_col="geojson",
+    collections=["sentinel-2-l2a"],
+    datetime="2022-06-01/2022-06-30",
+    partitions=512,
+)
+# Write to Delta for an auditable handoff
+assets_df.write.mode("overwrite").saveAsTable("cell_assets")
+
+# 2 — Download: resilient, validated; one task per unique (item_id, asset_name)
+assets = spark.read.table("cell_assets")
+files_df = client.download(
+    assets,
+    out_dir="/Volumes/my_catalog/my_schema/data/alaska/B02",
+    asset_names=["B02"],
+    name="{asset_name}_{item_id}.tif",
+    validate=True,
+    max_tries=5,
+)
+
+# Join back per-item metadata (date) from the search output
+item_meta = assets.select("item_id", "date").distinct()
+band_df = (
+    files_df
+    .join(item_meta, on="item_id", how="left")
+    .withColumn("band_name", F.lit("B02"))
+    .select("item_id", "band_name", "date",
+            "out_file_path", "out_file_sz", "is_out_file_valid", "last_update")
+)
+band_df.write.mode("overwrite").saveAsTable("band_b02")
+
+# 3 — Repair: re-download any files that failed read-validation
+repaired = client.repair("band_b02", where="is_out_file_valid = false")
+```
+
+---
+
+## Serverless usage
+
+`StacClient` is designed for Serverless (environment version 5, Python 3.12):
+
+- **No `spark.conf.set`.** Parallelism is controlled entirely via `partitions=` in `search` and `download`, and via `DataFrame.repartition(N, "col")` in your notebook — **hash by a column**, since on Serverless a number-only `repartition(N)` is coalesced by AQE back toward one partition.
+- **No `.cache()` / `.persist()`.** Materialize search results and downloaded-file metadata to Delta tables instead. A Delta table is more durable than an in-memory cache: it survives session restarts, and time travel keeps earlier versions queryable.
+- **One task per asset.** Each download task is independent; Serverless autoscaling routes tasks across available workers without pinning.
+
+```python
+# Serverless: write search results to Delta immediately (no caching)
+client.search(df_cells, geojson_col="geojson",
+              collections=["sentinel-2-l2a"], datetime="2022-06-01").write \
+      .mode("overwrite").saveAsTable("cell_assets")
+
+# Serverless: read back from Delta for the download step
+assets = spark.read.table("cell_assets")
+files_df = client.download(assets, out_dir="/Volumes/...", asset_names=["B02"])
+files_df.write.mode("overwrite").saveAsTable("band_b02")
+```
+
+:::note No doc-test backing for this page
+`StacClient` is a network integration client, so it requires live STAC catalog access and real asset URLs. The illustrative code blocks above show the API surface; the [EO Series notebooks](../notebooks/eo-series) are the fully executed, end-to-end example with real data.
+:::
+
+---
+
+## Non-goals
+
+The following are explicitly out of scope for the initial `StacClient`:
+
+- **No async / concurrent-within-task fetching.** Parallelism comes from Spark tasks, not `asyncio`/threads inside a UDF.
+- **No non-raster asset validation.** `validate=True` opens and decodes a raster window; JSON, thumbnails, and vector sidecars are downloaded but not validated.
+- **No catalog or item publishing.** `StacClient` is a read/consume client.
+- **No credential management.** Auth is expressed through the `sign` parameter; token storage and refresh are the caller's responsibility.
+
+---
+
+## See also
+
+- [EO Series notebooks](../notebooks/eo-series) — the worked end-to-end example (search → download → repair → tessellate → stack).
+- [Execution Tiers](./execution-tiers) — lightweight vs heavyweight comparison.
+- [RasterX Function Reference](./raster-functions) — `rst_h3_tessellate`, `rst_fromcontent`, `rst_merge_agg`, and the rest of the raster processing functions used downstream of STAC downloads.