Skip to content

Commit 97ed3fc

Browse files
GeoBrix 0.4.0 (beta): STAC light API, register(only=), lightweight gbx_rst_fromfile (#41)
* docs(stac): design spec for lightweight STAC API Approved design for databricks.labs.gbx.stac: a catalog-agnostic, Serverless-safe StacClient (search + resilient download + repair) that consolidates the EO-series library.py/config_nb STAC helpers behind one importable surface, folding in the download read-validation/retry and repartition-not-spark.conf patterns proven on Serverless. Co-authored-by: Isaac * docs(stac): implementation plan for lightweight STAC API TDD, bite-sized plan for databricks.labs.gbx.stac (search + resilient download + repair) with the [stac] extra, injectable catalog for unit tests, and a marked PC integration test. EO-series refactor + executed notebooks follow as the next phase. Co-authored-by: Isaac * feat(stac): package skeleton + [stac] extra Add databricks.labs.gbx.stac package with StacClient stub and [stac] optional-dependencies extra (pystac-client, planetary-computer, tenacity). Tenacity is declared explicitly per plan deviation note (may also arrive transitively but is declared for supply-chain clarity). Co-authored-by: Isaac * feat(stac): pluggable signing strategies Add _sign.py with resolve_signer/resolve_modifier. Supports 'planetary_computer', None (identity), or any callable. Lazy import of planetary_computer so the module loads without [stac] deps installed. Co-authored-by: Isaac * feat(stac): search parsers + per-AOI search with retry Add _search.py with parse_item, extract_assets, and search_one (tenacity-backed retry, injectable catalog, empty-list fallback on permanent failure so one bad AOI does not kill the job). Co-authored-by: Isaac * feat(stac): resilient download (HTTP-aware fetch + read-validate + retry) Add _download.py: download_href (raise_for_status so throttle/403 raises instead of writing error body), read_validate (open+decode window, never size-only), fetch_validate_publish (local-stage, validate, publish with backoff; re-signs each attempt via href_fn). Co-authored-by: Isaac * feat(stac): StacClient search/download/repair orchestration Full StacClient with search (repartition-based fan-out + UDF), download (resilient per-asset fetch), and repair (Delta MERGE). Test injection via _catalog_opener runs on the driver to avoid pickling test-local callables into Spark worker processes. Lazy pyspark imports throughout keep the module importable without pyspark installed (required for _sign host tests). Co-authored-by: Isaac * test(stac): serverless-safety guard + remove stac __init__ from tests Add test_serverless_no_spark_config.py (static check that stac module contains no spark.conf.set / cache / persist calls). Remove test/stac/ __init__.py: its presence made cloudpickle serialize _FakeCatalog as 'stac.test_client', causing ModuleNotFoundError in Spark UDF workers. Co-authored-by: Isaac * test(stac): env-v5 (Python 3.12) compatibility check + smoke Add host pin-sanity test (checks [stac] extra declares pystac-client >=0.7 + planetary-computer >=1.0) and a Serverless env-v5 smoke source that installs [stac] and verifies import + StacClient construction without network (live run to be performed separately by the controller). Co-authored-by: Isaac * fix(stac): C1/I1/I2/I3/I5/I7/M1/M3/M4 hardening of search/download/repair C1: UDF fan-out path (repartition+explode+_ITEM_SCHEMA/_ASSET_SCHEMA) now exercised end-to-end; _catalog_opener still keeps driver-shim for simpler test injection; _get_fn= added to download() so HTTP fetcher is injectable in UDF workers without process-level monkeypatching. I1: validate=False publishes bytes without rasterio decode; both paths tested. I2: download() emits last_update via F.current_timestamp(); repair MERGE includes last_update in whenMatchedUpdate. I3: idempotency short-circuit in fetch_validate_publish — skips re-fetch if target exists and passes validity check. I5: _sanitize_filename via os.path.basename prevents path traversal. I7: removed unused Row import and dead inner type-import block. M1: requests>=2.25 declared in [stac] extra in pyproject.toml. M4: search_one emits warnings.warn on swallowed exceptions. Co-authored-by: Isaac * test(stac): full test suite for C1/I1/I2/I3/I5/M3/M4 + UDF fan-out path C1: UDF-path tests ship a fake pystac_client.py via addPyFile so workers use the stub instead of the real package; exercises repartition, explode, _ITEM_SCHEMA/_ASSET_SCHEMA, dropDuplicates, and the download _fetch UDF with _get_fn= injection (no real network calls). I1: validate=True/False tests (reject garbage / publish non-decodable bytes). I2: last_update column presence asserted on download output. I3: idempotency tests — second call with existing valid file skips fetch. I5: path-traversal test with item_id containing ../ stays under out_dir. M3: test/stac/__init__.py restored; cloudpickle resolves via addPyFile stub so package __init__.py is compatible with the UDF pickling approach. M4: search_one warning test asserts RuntimeError is surfaced as a warning. _fake_catalog.py: module-level importable fake catalog for driver-shim tests. Co-authored-by: Isaac * refactor(eo-series): config_nb installs [light,stac] + StacClient; drop redundant STAC/download helpers - %pip install now uses geobrix[light,stac] (pystac-client, planetary-computer, tenacity come via the [stac] extra); removes standalone pystac/pystac_client/ planetary_computer/tenacity from the second pip line; keeps folium/mapclassify/ geopandas/rich (viz deps not bundled in [stac]) - Adds StacClient import to the shared imports cell; adds a new cell that instantiates stac_client = StacClient() (Planetary Computer default, auto-signed) making it available to nb01/nb02 after %run ./config_nb - Removes download_band, update_assets, download_missing_assets from config_nb; keeps file_size, timestamp_filename, get_now_formatted, set_conf_safe, finalize_tiled_band_tbl, gen_tessellate_tiled_band, and all viz helpers - library.py: removes ps_client, get_items, get_assets, get_assets_for_cells, download_asset, download_asset_v2 and their now-unused imports (pystac_client, planetary_computer, tenacity); keeps all viz helpers intact Co-authored-by: Isaac * refactor(eo-series): drop dead get_unique_hrefs (old asset.* schema, no callers) Co-authored-by: Isaac * refactor(eo-series): nb01 search via StacClient.search Replace library.get_assets_for_cells with stac_client.search(df_cell_json, ...) inside the LAST_UPDATED/FORCE_REBUILD guard; use F.current_timestamp() (lazy, Serverless-safe) for provenance. Fix demo search cell to open a local _demo_cat instead of the removed library.ps_client. Update section markdowns to describe stac_client.search and one-row-per-(cell,item,asset) output. Clear stale show() outputs that referenced old schema columns (asset MAP, timestamp, item_collection, stac_version). Co-authored-by: Isaac * refactor(eo-series): nb02 download+repair via StacClient; rebuild band tables as join Replace download_band/download_missing_assets with a notebook-local build_band_table helper (filters cell_assets → stac_client.download → joins date metadata → saveAsTable band_<band>) and stac_client.repair for invalid-row recovery. Band table schema is now the contract columns only: item_id, band_name, date, out_file_path, out_file_sz, is_out_file_valid, last_update. Drops legacy columns (timestamp, h3_set, asset map, item_collection, stac_version, item_bbox, item_properties, out_dir_fuse, out_filename). Co-authored-by: Isaac * docs(stac): add STAC API page + eo-series page/README describe StacClient + [light,stac] - New docs/docs/api/stac.mdx: user-facing StacClient reference (search / download / repair), install instructions, API tables with exact signatures and output columns, end-to-end illustrative example, resilience behavior, Serverless notes, and non-goals. Registered in sidebars.js after PMTiles. - eo-series.mdx + README.md: install note updated to [light,stac]; nb01/nb02 descriptions updated to StacClient.search / .download / .repair; old download_band / update_assets / download_missing_assets references removed; StacClient linked from key-functions list and gotchas. Co-authored-by: Isaac * fix(eo-series): cell_assets selection robust to schema change nb01 under FORCE_REBUILD now purges stale cell_assets_* dirs before re-search (the search output schema changed in the StacClient refactor: asset map -> flat asset_name/href). Both nb01 and nb02 now pick the LATEST cell_assets_* dir instead of the first arbitrary os.listdir match, so nb02 never reads a stale prior-schema table (was: UNRESOLVED_COLUMN asset_name). Co-authored-by: Isaac * docs(notebooks): add narration + doc links to example notebooks Add 18 markdown cells across 5 example notebooks orienting readers on which GeoBrix APIs each code cell uses. Each insertion explains what the function does, what it returns, and links to the relevant docs page. Tier statements added to notebooks 03, 04, and config_nb. Built-in Databricks ST/H3 functions clearly distinguished from GeoBrix APIs. Co-authored-by: Isaac * fix(eo-series): nb02 band table date->DateType + overwriteSchema StacClient.search emits date as a string (dt[:10]); cast to DateType in build_band_table so band_<band> -> band_*_tile -> band_*_h3 -> band_stack stay DateType (matches the original nb03/nb04 contract; no nb04 change needed). overwriteSchema=true so the band table fully replaces a prior-release schema (the refactor shrank its column set) instead of failing on DELTA_FAILED_TO_MERGE. Co-authored-by: Isaac * fix(eo-series): drop stale band table under FORCE_REBUILD before rebuild A prior-release band_<band> managed table has a different (larger) column set; saveAsTable(overwrite, overwriteSchema) can still attempt a field-level merge against it. DROP TABLE IF EXISTS under force_rebuild guarantees a clean recreate (mirrors nb01's cell_assets purge), so the notebook self-heals across schema changes without external cleanup. Co-authored-by: Isaac * docs(release-notes): v0.4.0 notes StacClient + examples default to lightweight tier Co-authored-by: Isaac * fix(rasterx): rst_fromfile reads UC Volume/Workspace paths (issue #34) gbx_rst_fromfile('/Volumes/...','GTiff') silently returned a NULL tile on classic UC compute, then crashed downstream (rst_retile) with an opaque 'Long cannot be cast to InternalRow'. Three coordinated fixes: - HadoopUtils.cleanPath: route the supported FUSE fabric (/Volumes, /Workspace, dbfs:/Volumes, file:/) to the local file: connector. A scheme-less /Volumes path resolved against fs.defaultFS (dbfs:) and never reached the FUSE mount. Legacy pre-Volumes DBFS (/dbfs/, dbfs:/) is unsupported and coerced away from the retired mount (the old /dbfs->dbfs: mapping and /dbfs/Volume/ typo arm are removed); /dbfs/Volumes and dbfs:/Volumes aliases coerce to file:/Volumes. - RST_FromFile: read FUSE/local paths via NIO (Files.readAllBytes) instead of getFileSystem(serialized hConf)+readContent -- the same OS path GDAL JNI reads, independent of the driver-serialized fs config (the second defect in #34). Hadoop FS read retained as a fallback for non-file: schemes. - RST_ErrorHandler.safeEval: initCause the wrapped Error when crashing, and logWarning the swallowed throwable otherwise, so the real cause is no longer lost (issue #34 secondary). Adds cleanPath regression tests. Pending classic-UC-cluster validation against a real /Volumes GeoTIFF. Co-authored-by: Isaac * fix(rasterx): rst_fromfile non-foldable so /Volumes reads run on executors (issue #34) Root cause: RST_FromFile with LITERAL path args was treated as foldable by Catalyst. ConstantFolding evaluated it at plan time on the DRIVER — which cannot open a UC Volume FUSE path in the optimizer context — returning null instead of reading the raster. Fix: PrettyInvoke gains a nonFoldable param that overrides foldable=false, bypassing Catalyst's constant-folding at plan time. RST_FromFile sets override def foldable=false and calls invoke(..., nonFoldable=true). I/O expressions MUST evaluate at runtime on executors, never be folded on the driver. Co-authored-by: Isaac * fix(rasterx): rst_fromfile reads /Volumes via WorkspaceLocalFileSystem not NIO (issue #34) Raw NIO (java.nio.file.Files) reads of a /Volumes FUSE path are denied ('Operation not permitted') in an expression's execution context (UC ephemeral credential scope), even though the same executor reads the path in a plain RDD/UDF context. Route readBytes through the Hadoop FileSystem so file:/Volumes resolves to WorkspaceLocalFileSystem (the UC-credentialed FUSE connector), using a fresh executor-side Configuration (fs.file.impl is on the executor classpath, not necessarily the driver-serialized conf). Co-authored-by: Isaac * fix(rasterx): gdal/ogr readers read /Volumes via WorkspaceLocalFileSystem (issue #34) The HadoopUtils listing/copy helpers resolved file:/Volumes with the caller's config, which lacks fs.file.impl and fell back to RawLocalFileSystem (can't read the FUSE mount) -> FileNotFound in GDAL_Batch/OGR planInputPartitions. Route all FS resolution through fileSystemFor, overlaying the classpath's fs.file.impl=WorkspaceLocalFileSystem, so the gdal/ogr readers and copy/list helpers read /Volumes the same way rst_fromfile now does. Co-authored-by: Isaac * fix(rasterx): rst_fromfile delegates to pyrx Python loader for UC Volume reads (issue #34) The heavyweight (JVM) reader cannot read /Volumes: in the Catalyst execution context file: resolves to RawLocalFileSystem (the UC-credentialed WorkspaceLocalFileSystem FUSE connector is unreachable from the library JAR's classloader). The lightweight pyrx loader reads via a Python pandas_udf (the Python process holds the UC FUSE credential) and yields a schema-compatible tile. rx.rst_fromfile now delegates to pyrx.rst_fromfile when importable (requires geobrix[light]); else falls back to the Scala gbx_rst_fromfile UDF. Co-authored-by: Isaac * fix(rasterx): resolve file:/Volumes FS via live Spark conf, not serialized/bare config (issue #34) withFileSystem builds the Hadoop Configuration fresh on the executor from the live SparkEnv conf (SparkHadoopUtil.newConfiguration), which overlays spark.hadoop.* incl. Databricks fs.file.impl=WorkspaceLocalFileSystem, so file:/Volumes resolves to the UC FUSE connector even in the Catalyst codegen / DataSource-planning context (a bare or driver-serialized config falls back to RawLocalFileSystem there). FileSystem.newInstance bypasses the static FS cache. Co-authored-by: Isaac * fix(rasterx): register gbx_rst_fromfile as the pyrx Python UDF; graceful skip without [light] (issue #34) The JVM cannot read a UC Volume FUSE path in the Spark execution context (the UC credential is held only by Spark's managed Python worker), so the canonical impl of gbx_rst_fromfile is the pyrx Python UDF, which reads /Volumes (and /Workspace, DBFS, local). register() overrides the Scala gbx_rst_fromfile with it when pyrx ([light]) is importable; if absent, it skips gracefully and the Scala registration remains (reads local/DBFS/Workspace, not /Volumes). Co-authored-by: Isaac * refactor(util): revert dead /Volumes FS-forcing in HadoopUtils to baseline getFileSystem (issue #34) The withFileSystem SparkHadoopUtil/newInstance/fs.file.impl-pin machinery never achieved a JVM /Volumes read (the UC credential is Python-worker-only). /Volumes is now served by the pyrx Python UDF (gbx_rst_fromfile) / binaryFile+rst_fromcontent. Revert to plain getFileSystem(hconf) -> restores baseline local/DBFS/Workspace reader behavior with no churn. Keeps cleanPath normalization + non-foldable. Co-authored-by: Isaac * refactor(rasterx): make gbx_rst_fromfile lightweight-only (issue #34) On Databricks the executor JVM cannot read a UC Volume (/Volumes) FUSE path: the UC credential lives in the user-scoped Python/IO context, not the library's JVM code path. So the Scala RST_FromFile expression read the wrong filesystem and returned a null tile, which then crashed downstream (gbx_rst_retile: Long cannot be cast to InternalRow). There is no way to implement this in the JVM tier, so remove the Scala RST_FromFile entirely; gbx_rst_fromfile is now registered solely as the pyrx Python UDF (runs where the credential is available). It is callable from SQL and Python when geobrix[light] is installed; without [light] it is not registered and the rx.rst_fromfile binding raises with guidance. - Remove RST_FromFile.scala + its registration and the Scala column/ scalar helpers in rasterx/functions.scala. - pyrx _fromfile_udf reads bytes sequentially into a rasterio MemoryFile (FUSE-safe: a tiled/COG GTiff over a Volume seeks to block offsets, which Volume FUSE can't serve). - Migrate Scala tests off the removed expression: new udfs.rasterFromPath test fixture (reads local bytes JVM-side -> rst_fromcontent), used across the rasterx eval/structure suites; drop the heavy-tier rst_fromfile bench dispatch. rasterx.* suites green (513 passed). - check-binding-parity exempts gbx_rst_fromfile from the Scala arm (Python-registered); Python binding + function-info still required. - Docs: rst_fromfile marked lightweight-only with the credential reason and the portable binaryFile + gbx_rst_fromcontent alternative. Co-authored-by: Isaac * docs(spec): register(only=[...]) selective SQL registration (light tiers) Design for an optional `only` param on the lightweight register() functions (pyrx/pygx/pyvx) so a session can register a subset of a tier's SQL functions. Accepts both gbx_ and short names, raises on unknown names, preserves today's behavior when only=None via a grouped registrar map that runs per-sub-module availability guards only for selected groups. Heavy only= deferred. Co-authored-by: Isaac * docs(plan): register(only=[...]) implementation plan + readers/writers scope Task-by-task TDD plan for the light-tier only= feature. Extends scope to the DataSource registrar (ds.register) so readers/writers are selectable by format name too, via a shared _register helper (normalize_name + normalize_datasource_name + resolve_only + run_groups). Six tasks: helper, pyrx, pygx, pyvx, ds.register, docs. Co-authored-by: Isaac * docs(plan): extend Task 6 to add only= note in readers/writers overview Co-authored-by: Isaac * feat(register): shared only= normalize/validate/run-groups helper Adds _register.py with normalize_name, normalize_datasource_name, resolve_only, and run_groups — the shared helper that Tasks 2-6 will wire into pyrx/pygx/pyvx/ds register() to support only=[...] selective SQL function registration. 10 unit tests; TDD red→green. * feat(pyrx): register(only=[...]) selective SQL registration Refactor pyrx functions.register() into a grouped registrar map (_registrar_groups) covering all SQL_REGISTRY scalar/agg UDFs, the 17 UDTFs, and gbx_pmtiles_agg. Add the only= param wired to _register.run_groups so callers can register a subset by name (short or full SQL form, case-insensitive). only=None remains behavior-identical. Co-authored-by: Isaac * feat(pygx): register(only=[...]) selective SQL registration Refactors register() into _registrar_groups() with three guarded groups (quadbin/bng/custom) wired through _register.run_groups, mirroring the pyrx pattern from Task 2. Guards close over _env so they are monkeypatchable for test isolation. only=None preserves existing behavior; only=[...] validates names (case-insensitive, short or full SQL form) and runs only the guards whose group has a selected function. 10 quadbin + 23 bng + 7 custom = 40 entries; array-output return types (ArrayType(LongType()), ArrayType(StringType()), ArrayType(QUADBIN_CELL_SCHEMA), ArrayType(BNG_CHIP_SCHEMA)) preserved verbatim from the pre-refactor body. TDD: 5 new tests in test_register_only.py (RED then GREEN); 150/150 existing pygx tests pass (14 JAR-gated deselected as expected). Co-authored-by: Isaac * feat(pyvx): register(only=[...]) selective SQL registration * feat(ds): register(only=[...]) selective reader/writer registration Adds `only=[...]` to `ds.register.register` so callers can register a subset of the 9 light DataSources by format name (e.g. `raster_gbx` or the bare `raster` form). Filters via `_register.resolve_only` with `normalize_datasource_name` and preserves `_SOURCES` iteration order for the `only=None` path. Co-authored-by: Isaac * docs: document register(only=[...]) for functions and readers/writers Adds a "Registering a subset (only=)" section to execution-tiers.mdx covering function-level only= (pyrx/pygx/pyvx), ds_register only= by format name with/without _gbx suffix, ValueError on unknown names, and the heavy+light mixing pattern. Extends the "Register first" admonitions in readers/overview.mdx and writers/overview.mdx with a concise only= example and the ValueError note. * test(ds): parenthesize _format_ok boolean for clarity (no behavior change) Co-authored-by: Isaac * refactor(bench): repartition by column (Serverless-safe, no number-only) Number-only repartition(N) is AQE-coalesced toward 1 partition on Databricks Serverless; hash repartition by column is respected. Switch all three repartition sites in runner.py to repartition by F.col("tile") -- the sole column on df_all -- and update matching test files to use the geometry column (geom_0 / x+y) so partitions stay non-empty. Co-authored-by: Isaac * docs(examples): repartition by column (Serverless-safe, no number-only) Switch all 8 number-only repartition sites in readers/examples.py to hash repartition by an existing column: F.col("source") for GDAL readers, F.col("geom_0") for OGR-based readers (shapefile_ogr, gpkg, file_gdb_ogr, ogr). Add F import alongside the existing expr import. Co-authored-by: Isaac * refactor(bench): repartition by F.rand() not the tile struct Hashing the tile struct (which carries the raster bytes) as the repartition key would hash megabytes per row and skew the very timings the bench measures; F.rand() gives uniform AQE-respected spread. tile is the only column on df_all. Co-authored-by: Isaac * fix(stac): sign asset hrefs once — drop double-sign that 403'd all downloads Search path now carries raw (unsigned) hrefs — the _items UDF opens the catalog without the sign_inplace modifier so hrefs are stored unsigned in the search result DataFrame. Download path signs the carried raw href once per attempt via resolve_signer (planetary_computer.sign). This eliminates the prior pattern of calling get_item (which requires a collection parameter on the PC STAC API and raised on every call, causing silently-swallowed exceptions and out_file_path=None for all assets) followed by a double-sign of the already-modifier-signed href. Also surface swallowed errors in fetch_validate_publish via logging.warning so a future all-false symptom leaves a reason in executor logs. Verified green on Databricks Serverless env-v5 (Ketchikan B02 assets, total=3, valid=3). Co-authored-by: Isaac * fix(stac): re-sign by stripping stale SAS token; nb02 carries href planetary_computer.sign is a NO-OP on an already-tokened URL — it returns an existing (possibly EXPIRED) SAS query unchanged instead of refreshing it. A search-time-signed href stored in a cell_assets table therefore 403s at download time (verified on real Ketchikan data: stored href -> 403 XML; strip-then-sign -> 200 image/tiff). resolve_signer now strips any existing query string before signing so PC always mints a fresh token (raw hrefs are unchanged). Also: eo-series nb02 build_band_table now selects 'href' into the download input — the carry-href download signs the carried href, so dropping it left nothing to sign (all files invalid). Co-authored-by: Isaac * fix(stac): download/repair fail loudly on missing href (no silent all-invalid) A missing href column was silently null-filled, making every download invalid with no explanation (the eo-series nb02 trap: build_band_table dropped href). Now download() and repair() raise an actionable ValueError naming the missing column(s) and pointing to the cell_assets-derived input / build_band_table(force_rebuild=True). Co-authored-by: Isaac * fix(lint): resolve flake8 F821/F841 + cross-version f-string; formatting - bench/runner.py: the F.rand() repartition (Serverless fan-out) used F without importing it in _explain_spark_path -> F821. Add the local functions-as-F import. - stac/client.py: drop the now-unused `sign` local in search() -> F841 (search opens the catalog without the signing modifier; signing happens once at download). - bench/readers.py: compute the basename outside the f-string -- a backslash in an f-string expression is a SyntaxError before Python 3.12 and trips flake8 on older hosts. - isort/black (container) formatting on rasterx/functions.py and the two writer tests; add python/geobrix/test/__init__.py. Co-authored-by: Isaac * docs(examples): executed eo-series + xView notebooks (lightweight tier, Serverless) Re-ran the EO Series (01-04 + config_nb) and xView walkthroughs end-to-end on Databricks Serverless (environment v5, geobrix[light,stac]); committed with their executed outputs. nb02 carries href into StacClient.download (the carry-href contract); nb04 stacks bands via union+pivot then repartition(N, cellid, date) before rst_frombands (Serverless-safe parallelism; the old spark.conf tuning is a no-op there). READMEs add a Serverless execution-strategy section. Includes the eo-series→StacClient refactor plan. Co-authored-by: Isaac * docs: Serverless repartition guidance, rst_fromfile lightweight-only, env-v5 - Serverless parallelism is repartition(N, col) not spark.conf (no-op there): geojsonl writer note + STAC/eo-series shuffle guidance corrected (number-only repartition is AQE-coalesced). - rst_fromfile is lightweight-tier only (Python UDF; JVM can't read UC Volumes), but gbx_rst_fromfile SQL is registered when geobrix[light] is present; raster-functions page leads with a Raster Reader link (it's a single-file convenience) + binaryFile alternative; release-notes bullet added. - README + installation: Serverless environment v5 (Python 3.12) requirement for [light]. Co-authored-by: Isaac * feat(pyrx): robustness for real-AOI/Serverless raster + vector reads - pyrx/core/agg.py: merge_tiles reconciles multi-CRS groups (reproject mismatched tiles to a deterministic reference CRS before rasterio.merge) and frombands aligns bands onto the reference grid before stacking (handles UTM-zone-boundary AOIs / uneven coverage). - ds/raster.py: FUSE-safe whole-image read (sequential byte read + MemoryFile; a tiled/COG GTiff over a UC Volume seeks to block offsets the FUSE mount can't serve). - ds/vector.py: optional bbox + OGR `where` pushdown to pyogrio (read less; keeps a large single-file vector read under the ~1 GB Serverless per-UDF memory cap). - pyrx/core/xyz.py: import rio-tiler lazily so `import pyrx` (and non-XYZ rst_* functions) works on Serverless env-v5, where rio-tiler 9.x's PEP-728 TypedDict trips the immutable typing_extensions pin; only rst_tilexyz / rst_xyzpyramid need it at call time. - pyrx/core/tessellate.py + the UDTF: skip cells that clip to empty/all-nodata (None) so no null-raster tile row is emitted. Co-authored-by: Isaac * test(stac): cover StacClient.repair() filter + href guard Adds direct unit tests for repair() (the QC test-completeness gap): one asserts it filters to the invalid rows and re-downloads them carrying item_id/asset_name/href (download stubbed, no network); one asserts the fail-loud ValueError when href/asset_name are absent (e.g. a band table). 2/2 green in Docker. Co-authored-by: Isaac * test(bench): fix BenchDispatch count 107->106 (rst_fromfile, #34) The #34 refactor made rst_fromfile lightweight-only (the JVM cannot read UC Volumes) and dropped it from the heavy bench registry BenchDispatch.cats, taking BenchDispatch.all from 107 to 106. The "registry covers the ds-in functions" test still asserted == 107 and was never updated in that commit, so the heavy Scala phase failed. Correct the assertion to 106 and the comment tally (bucket-C C1/C2: 6->5). The product SQL count is unchanged at 107 -- gbx_rst_fromfile is still registered via the light tier -- so binding-parity/diagram-coverage are unaffected; only the heavy-benchmarkable set legitimately diverges. Verified: BenchDispatchTest suite 3/3 green in geobrix-dev. Co-authored-by: Isaac * ci(stac): scope STAC tests to the lightweight tier (heavy skips) The heavy Python CI phase collected test/stac and failed at collection with "ModuleNotFoundError: No module named 'rasterio'" (test_download.py imports rasterio; requirements-ci.txt ships none of the light-tier deps). STAC is a lightweight-only API ([stac] extra, JAR-free) and should be tested in the light phase, not the heavy one. The maintained mechanism is test/conftest.py's _LIGHT_TEST_DIRS collect_ignore (heavy env, no rasterio -> dir skipped) paired with the explicit dir list in the light job (.github/actions/pyrx_build). STAC had been added to NEITHER, so it was about to run in no job at all. Register it in both: - conftest.py: add "stac" to _LIGHT_TEST_DIRS -> heavy phase skips it cleanly. - pyrx_build/action.yml: add test/stac to the light pytest dir list -> it runs. - requirements-pyrx-ci.{in,txt}: add requests==2.32.3 (the only missing runtime dep -- stac/_download.py imports it at module top). pystac-client and planetary-computer are NOT added: the suite stubs them (test/stac/ _fake_catalog.py + an addPyFile pystac_client stub + monkeypatch). Lock regenerated with --generate-hashes; only requests + urllib3 + charset-normalizer added (certifi/idna already present), no pin drift. Verified: test/stac 28 passed in geobrix-dev; lint (isort/black/flake8) clean. Co-authored-by: Isaac * ci(stac): add tenacity to the light lock (search_one retry dep) With test/stac now in the lightweight CI selection, the light job failed with "ModuleNotFoundError: No module named 'tenacity'": stac/_search.py search_one wraps its catalog call in a tenacity @Retry, and test_search exercises that path. tenacity is a [stac] extra dep that was absent from requirements-pyrx-ci (the dev container masked this -- it has the full [stac] extra installed; the light CI installs only the lock). Add tenacity==9.1.4 (matches requirements-dev-container.in, within the pyproject >=8,<10 constraint) and regenerate the hash-pinned lock. tenacity is the sole net addition (no new transitives), no existing pin drifted. Verified CI-faithfully: built a clean venv from ONLY requirements-pyrx-ci.txt (--require-hashes) + the package, then ran the full light selection (pytest test/pyrx test/ds test/pyvx test/pygx test/pmtiles_light test/stac -m "not integration") -> 816 passed, 2 skipped, exit 0. No further missing deps. Co-authored-by: Isaac * test(rasterx): load tiles via heavy-native rst_fromcontent, not rst_fromfile Issue #34 made rst_fromfile lightweight-only: rasterx.rst_fromfile (and SQL gbx_rst_fromfile) delegate to the pyrx Python UDF, which imports pandas/rasterio at module top. The heavy CI Python env ships neither, so every test/rasterx test that loaded a tile via rst_fromfile failed at runtime with "ModuleNotFoundError: No module named 'pandas'" (56 failures). The Scala tests were migrated off rst_fromfile during #34; the Python rasterx tests were missed. Add test/rasterx/_helpers.py with heavy-native loaders that decode a local test raster's bytes via the Scala/GDAL gbx_rst_fromcontent (no pyrx/pandas/rasterio): - tile_from_path(rx, f, path, driver) -- drop-in for rst_fromfile(lit(path),...) - read_bytes(path) -- raw bytes for building a content column in multi-row frames Migrate all ~63 sites across 13 files (44 Column-form, 5 SQL-form rewritten to binaryFile + gbx_rst_fromcontent, 7 multi-row bytes). Keeps the heavy/light tier separation intact -- no lightweight deps added to the heavy lock. Verified in geobrix-dev: test/rasterx 93 passed normally AND 93 passed with pandas+rasterio blocked at builtins.__import__ (proves no light-dep pull); lint (black/isort/flake8) clean. Co-authored-by: Isaac --------- Co-authored-by: Michael Johns <user.name>
2 parents d71e7e6 + f9f2823 commit 97ed3fc

107 files changed

Lines changed: 8208 additions & 7203 deletions

File tree

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.github/actions/pyrx_build/action.yml

Lines changed: 3 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -63,5 +63,6 @@ runs:
6363
# The lightweight tier is exercised ONLY here (the heavy phase skips these
6464
# dirs via test/conftest.py collect_ignore). Every light test dir must be
6565
# listed: pyrx, ds, pyvx, pygx (light GridX), pmtiles_light (light
66-
# pmtiles_agg). See test/conftest.py for the maintained condition.
67-
pytest test/pyrx test/ds test/pyvx test/pygx test/pmtiles_light -m "not integration" -v
66+
# pmtiles_agg), stac (light STAC client, [stac] extra). See
67+
# test/conftest.py for the maintained condition.
68+
pytest test/pyrx test/ds test/pyvx test/pygx test/pmtiles_light test/stac -m "not integration" -v

README.md

Lines changed: 6 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -43,13 +43,15 @@ All SQL functions register with a `gbx_` prefix (e.g. `gbx_rst_clip`, `gbx_bng_c
4343

4444
GeoBrix supports both current Databricks Runtime LTS releases:
4545

46-
| DBR LTS | Ubuntu | Spark | Python | Scala | Java | GeoBrix |
47-
|---|---|---|---|---|---|---|
48-
| **17.3 LTS** | 24.04 | 4.0.0 | 3.12.3 | 2.13.16 | 17 | ✅ Supported |
49-
| **18 LTS** | 24.04 | 4.1.0 | 3.12.3 | 2.13.16 | 21 | ✅ Supported |
46+
| DBR LTS | Ubuntu | Spark | Python | Scala | Java | Serverless env | GeoBrix |
47+
|---|---|---|---|---|---|---|---|
48+
| **17.3 LTS** | 24.04 | 4.0.0 | 3.12.3 | 2.13.16 | 17 | **5+** (Py 3.12) | ✅ Supported |
49+
| **18 LTS** | 24.04 | 4.1.0 | 3.12.3 | 2.13.16 | 21 | **5+** (Py 3.12) | ✅ Supported |
5050

5151
A **single wheel + single JAR** runs on both: Scala 2.13.16 matches both runtimes, the JAR is compiled to Java-17 bytecode so it loads on both JVMs, and Spark is a `provided` dependency.
5252

53+
The **Serverless env** column is the minimum Serverless environment version for the lightweight tier: **version 5+** provides Python 3.12, which the `[light]` dependencies require (Python ≥ 3.11). Older environment versions (Python 3.10) can't install `geobrix[light]`. Env v5 release notes: [AWS](https://docs.databricks.com/aws/en/release-notes/serverless/environment-version/five) · [Azure](https://learn.microsoft.com/azure/databricks/release-notes/serverless/environment-version/five) · [GCP](https://docs.databricks.com/gcp/en/release-notes/serverless/environment-version/five).
54+
5355
> **DBR 19 LTS is coming soon**, built on **Ubuntu 26.04**. The **lightweight** tier (pure-Python, rasterio's bundled GDAL) will be unaffected; the **heavyweight** tier's native GDAL/OGR libraries are compiled against the cluster OS, so they will need to be rebuilt for the new base image.
5456
5557
## Quick start (lightweight)

docs/docs/api/execution-tiers.mdx

Lines changed: 35 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -24,6 +24,41 @@ After an explicit `rx.register(spark)`, the SQL names are identical too (`gbx_rs
2424
The one-line *import* swap is symmetric, but the *install* is not. The **lightweight** tier is just the `[light]` wheel (`%pip`, no JAR, no init script). The **heavyweight** tier additionally requires the **GeoBrix JAR as a cluster library and the GDAL init script** on a **classic x86 cluster** — the wheel alone will not resolve the import or the JVM expressions. See [Installation](../installation) for the heavyweight setup.
2525
:::
2626

27+
## Registering a subset (`only=`)
28+
29+
`register()` installs every `gbx_*` SQL name for the tier. To register just the functions a session uses, pass `only=` (lightweight tiers — `pyrx`, `pygx`, `pyvx`):
30+
31+
```python
32+
from databricks.labs.gbx.pyrx import functions as rx
33+
34+
rx.register(spark, only=["rst_slope", "rst_clip"]) # just these two
35+
rx.register(spark) # all (default)
36+
```
37+
38+
Names are case-insensitive and accept either the SQL name (`gbx_rst_slope`) or the short form (`rst_slope`). An unrecognized name raises `ValueError` (typo guard). `only=[]` registers nothing.
39+
40+
**Readers and writers** register through a separate entry point and take `only=` too — selected by **format name** (with or without the `_gbx` suffix):
41+
42+
```python
43+
from databricks.labs.gbx.ds import register as ds_register
44+
45+
ds_register.register(spark, only=["raster_gbx", "gtiff_gbx"]) # just these formats
46+
ds_register.register(spark, only=["shapefile"]) # 'shapefile' -> 'shapefile_gbx'
47+
ds_register.register(spark) # all readers/writers (default)
48+
```
49+
50+
**Mixing tiers per function.** Because both tiers share the `gbx_*` names (last registration wins), you can register the heavyweight set and then override individual functions with the lightweight implementation:
51+
52+
```python
53+
from databricks.labs.gbx.rasterx import functions as heavy
54+
from databricks.labs.gbx.pyrx import functions as light
55+
56+
heavy.register(spark) # all heavy gbx_rst_*
57+
light.register(spark, only=["rst_slope"]) # gbx_rst_slope now lightweight
58+
```
59+
60+
The reverse — re-registering a few **heavy** functions over a lightweight session — is not yet available; `only=` is currently a lightweight-tier feature (heavy registers its full set). Mixing works because both tiers use the same tile struct and GTiff payload, so a tile produced by one tier flows into a function from the other.
61+
2762
## Tradeoffs
2863

2964
| Aspect | Heavyweight (rasterx) | Lightweight (pyrx) |

docs/docs/api/raster-functions.mdx

Lines changed: 26 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -766,10 +766,16 @@ Create or load rasters from path, binary content, or bands (4 total).
766766

767767
### rst_fromfile
768768

769-
<Tier both/>
769+
<Tier light/>
770770

771-
:::note Lightweight tier (pyrx)
772-
Powered by **rasterio**. Opens the raster at `path` and re-encodes it as a GeoTIFF tile; the `driver` arg is a format hint (rasterio auto-detects on open). A missing/unreadable path returns null.
771+
:::tip Loading rasters at scale? Use the [Raster Reader](../readers/raster).
772+
`rst_fromfile` is a **convenience** for pulling columnar raster paths into a tile column inline. To ingest rasters as a normal Spark job — partitioned parallel reads, optional tiling (`sizeInMB`), and FUSE-safe Volume staging — use the **[Raster Reader](../readers/raster)** (`raster_gbx` / `gtiff_gbx`, or heavyweight `gdal` / `gtiff_gdal`): `spark.read.format("raster_gbx").load(path)`.
773+
:::
774+
775+
:::note Lightweight tier only (pyrx) — callable from Python and SQL
776+
Powered by **rasterio**. Opens the raster at `path` and re-encodes it as a GeoTIFF tile; the `driver` arg is a format hint (rasterio auto-detects on open). A missing/unreadable path returns null. Requires `geobrix[light]`.
777+
778+
`gbx_rst_fromfile` has **no heavyweight (JVM) implementation**. On Databricks the executor JVM cannot read a Unity Catalog Volume (`/Volumes/...`) FUSE path — the UC credential is held only by Spark's managed Python worker — so the function is registered as a Python UDF even when you call it from SQL. With `geobrix[light]` installed it is available in SQL (`SELECT gbx_rst_fromfile(...)`) and in Python (`rx.rst_fromfile(...)`); without `[light]` it is not registered, and the Python binding raises with guidance.
773779
:::
774780

775781
Load a raster from a file path.
@@ -784,6 +790,23 @@ Load a raster from a file path.
784790

785791
<CodeFromTest language="sql" source="docs/tests/python/api/rasterx_functions_sql.py" testFile="docs/tests/python/api/test_rasterx_functions_sql.py" functionName="rst_fromfile_sql_example" outputConstant="rst_fromfile_sql_example_output" code={rasterxSqlCode} />
786792

793+
:::tip Portable alternative — `binaryFile` + `rst_fromcontent`
794+
If `geobrix[light]` is not installed, or you want a tier-agnostic path that works on any compute, read the bytes with Spark's built-in `binaryFile` reader and build the tile from content. This reads `/Volumes` reliably (the reader runs in Spark, which holds the credential) and works in both tiers:
795+
796+
```python
797+
df = (
798+
spark.read.format("binaryFile")
799+
.load("/Volumes/main/geobrix_samples/geobrix-examples/nyc/*.tif")
800+
.selectExpr("path", "gbx_rst_fromcontent(content, 'GTiff') AS tile")
801+
)
802+
```
803+
804+
```sql
805+
SELECT path, gbx_rst_fromcontent(content, 'GTiff') AS tile
806+
FROM read_files('/Volumes/main/geobrix_samples/geobrix-examples/nyc/', format => 'binaryFile')
807+
```
808+
:::
809+
787810
---
788811

789812
### rst_fromcontent

docs/docs/api/stac.mdx

Lines changed: 253 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,253 @@
1+
---
2+
sidebar_position: 10
3+
title: STAC Client
4+
---
5+
6+
# STAC Client
7+
8+
`databricks.labs.gbx.stac.StacClient` is a lightweight, Serverless-safe client for **distributed STAC search**, **resilient asset download**, and **repair** of invalid files — against any STAC catalog (default: [Planetary Computer](https://planetarycomputer.microsoft.com/)).
9+
10+
Where a single-node STAC script serializes search requests and downloads, `StacClient` fans both operations out across the Spark cluster — one task per AOI row (search) and one task per asset (download). On Serverless, parallelism is controlled via `partitions=` and `DataFrame.repartition()`, with no `spark.conf.set` calls.
11+
12+
:::info Opt-in extra
13+
`StacClient` requires `geobrix[light,stac]`. The `[stac]` extra pulls in `pystac-client`, `planetary-computer`, `tenacity`, and `requests`. Serverless environment version 5 (Python 3.12) is required.
14+
:::
15+
16+
## Installation
17+
18+
```bash
19+
pip install "geobrix[light,stac]"
20+
```
21+
22+
From a Databricks notebook (Serverless or classic):
23+
24+
```python
25+
%pip install --quiet "geobrix[light,stac] @ file:///Volumes/<catalog>/<schema>/<volume>/geobrix-0.4.0-py3-none-any.whl"
26+
```
27+
28+
## Import
29+
30+
```python
31+
from databricks.labs.gbx.stac import StacClient
32+
```
33+
34+
## Constructor
35+
36+
```python
37+
StacClient(
38+
catalog="https://planetarycomputer.microsoft.com/api/stac/v1", # default
39+
sign="planetary_computer", # 'planetary_computer' | None | callable(href)->href
40+
)
41+
```
42+
43+
| Parameter | Default | Description |
44+
|---|---|---|
45+
| `catalog` | Planetary Computer | URL of any STAC API endpoint. |
46+
| `sign` | `"planetary_computer"` | Signing strategy. `"planetary_computer"` uses `planetary_computer.sign_inplace`; `None` skips signing (public catalogs); or pass any `callable(href: str) -> str`. |
47+
48+
---
49+
50+
## Methods
51+
52+
### `search`
53+
54+
Fan out AOI rows to the STAC catalog and return one row per `(input-row, item, asset)`.
55+
56+
```python
57+
assets_df = client.search(
58+
df, # DataFrame with a GeoJSON-geometry column
59+
geojson_col="geojson", # column name holding the GeoJSON geometry string
60+
collections=["sentinel-2-l2a"], # list of STAC collection IDs
61+
datetime="2022-06-01/2022-09-01",# ISO datetime or range (STAC datetime syntax)
62+
partitions=512, # repartition fan-out; no spark.conf
63+
)
64+
```
65+
66+
**Parameters:**
67+
68+
| Parameter | Type | Default | Description |
69+
|---|---|---|---|
70+
| `df` | `DataFrame` || Input rows. Each row is one AOI. |
71+
| `geojson_col` | `str` || Column containing GeoJSON geometry strings (intersects filter). |
72+
| `collections` | `List[str]` || STAC collection IDs to search. |
73+
| `datetime` | `str` || ISO datetime or range (`"YYYY-MM-DD"` or `"start/end"`). |
74+
| `partitions` | `int` | `512` | Target partition count for the fan-out repartition. |
75+
76+
**Output columns** (in addition to all input columns, carried through):
77+
78+
| Column | Type | Description |
79+
|---|---|---|
80+
| `item_id` | string | STAC item identifier. |
81+
| `date` | string | Acquisition date from `properties.datetime`. |
82+
| `item_bbox` | string | Item bounding box (GeoJSON). |
83+
| `asset_name` | string | Asset key (e.g. `"B02"`, `"B03"`). |
84+
| `href` | string | Asset download URL at search time (may expire; re-signed per attempt in `download`). |
85+
| `item_properties` | string | Full item properties JSON. |
86+
87+
One row is emitted per `(input-row, item, asset)`. The same STAC item reached via multiple AOI rows produces multiple rows — `download` deduplicates internally to unique `(item_id, asset_name)`.
88+
89+
---
90+
91+
### `download`
92+
93+
Resilient, validated asset download — one Spark task per asset. Deduplicates to unique `(item_id, asset_name)` so the same item reached via multiple AOIs is fetched exactly once.
94+
95+
```python
96+
files_df = client.download(
97+
assets_df,
98+
out_dir, # UC Volume path (FUSE-mounted)
99+
asset_names=["B02", "B03", "B04", "B08"], # None = all assets present in df
100+
name="{asset_name}_{item_id}.tif", # filename template
101+
validate=True, # rasterio read-validation per file
102+
max_tries=5,
103+
partitions=None, # default: one task per asset
104+
)
105+
```
106+
107+
**Parameters:**
108+
109+
| Parameter | Type | Default | Description |
110+
|---|---|---|---|
111+
| `df` | `DataFrame` || Must contain `item_id` and `asset_name` columns. (`href` from `search` output is accepted but not required — the href is re-signed per attempt from `item_id` + `asset_name`.) |
112+
| `out_dir` | `str` || Destination directory (e.g. a UC Volume FUSE path). |
113+
| `asset_names` | `List[str]` or `None` | `None` | Filter to these asset keys. `None` downloads all assets present in the DataFrame. |
114+
| `name` | `str` | `"{asset_name}_{item_id}.tif"` | Filename template. Supports `{asset_name}` and `{item_id}` placeholders. |
115+
| `validate` | `bool` | `True` | Open and decode a raster window after download to reject throttled error bodies and truncated files that a size check would pass. |
116+
| `max_tries` | `int` | `5` | Maximum download attempts per asset (exponential backoff between attempts). |
117+
| `partitions` | `int` or `None` | `None` | Explicit repartition before download. `None` sets one partition per unique asset. |
118+
119+
**Output columns:**
120+
121+
| Column | Type | Description |
122+
|---|---|---|
123+
| `item_id` | string | STAC item identifier. |
124+
| `asset_name` | string | Asset key. |
125+
| `out_file_path` | string | Absolute path of the written file on the Volume. |
126+
| `out_file_sz` | long | File size in bytes (`0` if the download failed). |
127+
| `is_out_file_valid` | boolean | `true` if the file passed read-validation; `false` otherwise. |
128+
| `last_update` | timestamp | Time of the download attempt. |
129+
130+
**Resilience behavior:**
131+
132+
- The href is **re-signed on every attempt** — signed URLs from `search` may expire before a retry; `download` always re-derives a fresh URL from `item_id` + `asset_name`.
133+
- HTTP errors (`4xx`/`5xx`, including throttle responses) trigger `tenacity` exponential backoff and retry up to `max_tries`.
134+
- Each file is staged to **worker-local disk first**; it is copied to the Volume only after passing read-validation. No partial or corrupt file is written to the Volume.
135+
- Files that already exist and are valid (`is_out_file_valid = true`) are **skipped** — the operation is idempotent.
136+
- A failed asset (exhausted retries, or read-validation failure) sets `is_out_file_valid = false` and `out_file_sz = 0`; use `repair` to retry those rows.
137+
138+
---
139+
140+
### `repair`
141+
142+
Re-download invalid files and merge the results back to the Delta table.
143+
144+
```python
145+
repaired = client.repair(
146+
"band_b02", # Delta table name or DataFrame
147+
where="is_out_file_valid = false", # SQL filter over the table
148+
)
149+
```
150+
151+
**Parameters:**
152+
153+
| Parameter | Type | Default | Description |
154+
|---|---|---|---|
155+
| `table_or_df` | `str` or `DataFrame` || Delta table name or a DataFrame with `item_id`, `asset_name`, `out_file_path`, `is_out_file_valid`. |
156+
| `where` | `str` | `"is_out_file_valid = false"` | SQL predicate selecting rows to re-download. |
157+
158+
**Behavior:** reads the table, filters to matching rows, re-runs the resilient download on that subset, then merges updated `out_file_path`, `out_file_sz`, `is_out_file_valid`, and `last_update` back into the Delta table. Returns the repaired subset as a DataFrame.
159+
160+
---
161+
162+
## End-to-end example
163+
164+
This illustrates the full search → download → repair flow. The EO-series notebooks are the fully-executed, worked example — see [EO Series](../notebooks/eo-series).
165+
166+
```python
167+
from databricks.labs.gbx.stac import StacClient
168+
from pyspark.sql import functions as F
169+
170+
client = StacClient() # default: Planetary Computer, sign=planetary_computer
171+
172+
# 1 — Search: one row per (AOI cell, STAC item, asset)
173+
# df_cells has a "geojson" column with one GeoJSON geometry per H3 cell
174+
assets_df = client.search(
175+
df_cells,
176+
geojson_col="geojson",
177+
collections=["sentinel-2-l2a"],
178+
datetime="2022-06-01/2022-06-30",
179+
partitions=512,
180+
)
181+
# Write to Delta for an auditable handoff
182+
assets_df.write.mode("overwrite").saveAsTable("cell_assets")
183+
184+
# 2 — Download: resilient, validated; one task per unique (item_id, asset_name)
185+
assets = spark.read.table("cell_assets")
186+
files_df = client.download(
187+
assets,
188+
out_dir="/Volumes/my_catalog/my_schema/data/alaska/B02",
189+
asset_names=["B02"],
190+
name="{asset_name}_{item_id}.tif",
191+
validate=True,
192+
max_tries=5,
193+
)
194+
195+
# Join back per-item metadata (date) from the search output
196+
item_meta = assets.select("item_id", "date").distinct()
197+
band_df = (
198+
files_df
199+
.join(item_meta, on="item_id", how="left")
200+
.withColumn("band_name", F.lit("B02"))
201+
.select("item_id", "band_name", "date",
202+
"out_file_path", "out_file_sz", "is_out_file_valid", "last_update")
203+
)
204+
band_df.write.mode("overwrite").saveAsTable("band_b02")
205+
206+
# 3 — Repair: re-download any files that failed read-validation
207+
repaired = client.repair("band_b02", where="is_out_file_valid = false")
208+
```
209+
210+
---
211+
212+
## Serverless usage
213+
214+
`StacClient` is designed for Serverless (environment version 5, Python 3.12):
215+
216+
- **No `spark.conf.set`.** Parallelism is controlled entirely via `partitions=` in `search` and `download`, and via `DataFrame.repartition(N, "col")` in your notebook — **hash by a column**, since on Serverless a number-only `repartition(N)` is coalesced by AQE back toward one partition.
217+
- **No `.cache()` / `.persist()`.** Materialize search results and downloaded-file metadata to Delta tables — Delta time travel is a more durable alternative to in-memory caching and survives session restarts.
218+
- **One task per asset.** Each download task is independent; Serverless autoscaling routes tasks across available workers without pinning.
219+
220+
```python
221+
# Serverless: write search results to Delta immediately (no caching)
222+
client.search(df_cells, geojson_col="geojson",
223+
collections=["sentinel-2-l2a"], datetime="2022-06-01").write \
224+
.mode("overwrite").saveAsTable("cell_assets")
225+
226+
# Serverless: read back from Delta for the download step
227+
assets = spark.read.table("cell_assets")
228+
files_df = client.download(assets, out_dir="/Volumes/...", asset_names=["B02"])
229+
files_df.write.mode("overwrite").saveAsTable("band_b02")
230+
```
231+
232+
:::note No doc-test backing for this page
233+
`StacClient` is a network integration client — it requires live STAC catalog access and real asset URLs. The illustrative code blocks above show the API surface; the [EO Series notebooks](../notebooks/eo-series) are the fully-executed, end-to-end example with real data.
234+
:::
235+
236+
---
237+
238+
## Non-goals
239+
240+
The following are explicitly out of scope for the initial `StacClient`:
241+
242+
- **No async / concurrent-within-task fetching.** Parallelism comes from Spark tasks, not `asyncio`/threads inside a UDF.
243+
- **No non-raster asset validation.** `validate=True` open-and-decodes a raster window; JSON, thumbnails, and vector sidecars are downloaded but not validated.
244+
- **No catalog or item publishing.** `StacClient` is a read/consume client.
245+
- **No credential management.** Auth is expressed through the `sign` parameter; token storage and refresh are the caller's responsibility.
246+
247+
---
248+
249+
## See also
250+
251+
- [EO Series notebooks](../notebooks/eo-series) — the worked end-to-end example (search → download → repair → tessellate → stack).
252+
- [Execution Tiers](./execution-tiers) — lightweight vs heavyweight comparison.
253+
- [RasterX Function Reference](./raster-functions)`rst_h3_tessellate`, `rst_fromcontent`, `rst_merge_agg`, and the rest of the raster processing functions used downstream of STAC downloads.

0 commit comments

Comments
 (0)