Commit 97ed3fc
GeoBrix 0.4.0 (beta): STAC light API, register(only=), lightweight gbx_rst_fromfile (#41)
* docs(stac): design spec for lightweight STAC API
Approved design for databricks.labs.gbx.stac: a catalog-agnostic,
Serverless-safe StacClient (search + resilient download + repair) that
consolidates the EO-series library.py/config_nb STAC helpers behind one
importable surface, folding in the download read-validation/retry and
repartition-not-spark.conf patterns proven on Serverless.
Co-authored-by: Isaac
* docs(stac): implementation plan for lightweight STAC API
TDD, bite-sized plan for databricks.labs.gbx.stac (search + resilient
download + repair) with the [stac] extra, injectable catalog for unit
tests, and a marked PC integration test. EO-series refactor + executed
notebooks follow as the next phase.
Co-authored-by: Isaac
* feat(stac): package skeleton + [stac] extra
Add databricks.labs.gbx.stac package with StacClient stub and
[stac] optional-dependencies extra (pystac-client, planetary-computer,
tenacity). Tenacity is declared explicitly per plan deviation note
(may also arrive transitively but is declared for supply-chain clarity).
Co-authored-by: Isaac
* feat(stac): pluggable signing strategies
Add _sign.py with resolve_signer/resolve_modifier. Supports
'planetary_computer', None (identity), or any callable. Lazy
import of planetary_computer so the module loads without [stac]
deps installed.
Co-authored-by: Isaac
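
A minimal sketch of the signer resolution described in the entry above. It assumes resolve_signer returns a plain href -> href callable; the body is illustrative and not the actual _sign.py. planetary_computer.sign is the real library call, imported only when a URL is signed.

```python
from typing import Callable, Union

Signer = Callable[[str], str]


def resolve_signer(sign: Union[str, None, Signer]) -> Signer:
    """Map a signing spec to a callable href -> signed href (illustrative)."""
    if sign is None:
        # Identity: hrefs pass through unsigned.
        return lambda href: href
    if callable(sign):
        # Any user-supplied callable is used as-is.
        return sign
    if sign == "planetary_computer":
        def _pc_sign(href: str) -> str:
            # Lazy import: only needed when a URL is actually signed, so the
            # module still imports without the [stac] extra installed.
            import planetary_computer
            return planetary_computer.sign(href)
        return _pc_sign
    raise ValueError(f"unknown signing strategy: {sign!r}")
```

Resolving the 'planetary_computer' strategy does not import the package; the import happens on the first call of the returned signer.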
* feat(stac): search parsers + per-AOI search with retry
Add _search.py with parse_item, extract_assets, and search_one
(tenacity-backed retry, injectable catalog, empty-list fallback on
permanent failure so one bad AOI does not kill the job).
Co-authored-by: Isaac
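
A rough sketch of the per-AOI search pattern from the entry above: an injectable catalog, retries, and an empty-list fallback. The entry says the real search_one uses tenacity; this sketch swaps in a plain stdlib retry loop so it runs without extra dependencies. The open_catalog parameter name and the warning on final failure are assumptions (the warning mirrors the later M4 hardening entry).

```python
import time
import warnings
from typing import Any, Callable, List


def search_one(open_catalog: Callable[[], Any], attempts: int = 3,
               base_delay: float = 0.0, **search_kwargs: Any) -> List[Any]:
    """Search one AOI; return [] instead of raising after the final attempt."""
    last_exc = None
    for attempt in range(attempts):
        try:
            catalog = open_catalog()  # injectable, so tests can pass a fake
            return list(catalog.search(**search_kwargs).items())
        except Exception as exc:  # broad on purpose: one bad AOI must not kill the job
            last_exc = exc
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    warnings.warn(f"search failed after {attempts} attempts: {last_exc!r}")
    return []
```

Returning [] keeps a fan-out over many AOIs alive; the warning leaves a trace of what was swallowed.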
* feat(stac): resilient download (HTTP-aware fetch + read-validate + retry)
Add _download.py: download_href (raise_for_status so throttle/403
raises instead of writing error body), read_validate (open+decode
window, never size-only), fetch_validate_publish (local-stage,
validate, publish with backoff; re-signs each attempt via href_fn).
Co-authored-by: Isaac
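
The stage, validate, publish loop above can be sketched roughly as follows. This is a stdlib-only illustration, not the actual _download.py: the fetch and validate callables stand in for download_href and read_validate, and the parameter names are assumptions.

```python
import logging
import os
import shutil
import tempfile
import time
from typing import Callable


def fetch_validate_publish(href_fn: Callable[[], str],
                           fetch: Callable[[str], bytes],
                           validate: Callable[[str], bool],
                           target: str, attempts: int = 3,
                           base_delay: float = 0.0) -> bool:
    """Stage locally, validate, then publish; True on success (illustrative)."""
    for attempt in range(attempts):
        staged = None
        try:
            href = href_fn()          # re-sign on every attempt
            data = fetch(href)        # expected to raise on HTTP errors
            fd, staged = tempfile.mkstemp(suffix=".part")
            with os.fdopen(fd, "wb") as fh:
                fh.write(data)
            if validate(staged):      # e.g. open + decode a window, not size-only
                os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
                shutil.copyfile(staged, target)
                return True
        except Exception as exc:
            # Surface the reason instead of swallowing it silently.
            logging.warning("attempt %d failed: %r", attempt + 1, exc)
        finally:
            if staged and os.path.exists(staged):
                os.remove(staged)     # never leave a partial file behind
        time.sleep(base_delay * (2 ** attempt))
    return False
```

Because the target is only written after validation passes, a throttled or 403 response never ends up published as a raster file.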
* feat(stac): StacClient search/download/repair orchestration
Full StacClient with search (repartition-based fan-out + UDF),
download (resilient per-asset fetch), and repair (Delta MERGE).
Test injection via _catalog_opener runs on the driver to avoid
pickling test-local callables into Spark worker processes.
Lazy pyspark imports throughout keep the module importable
without pyspark installed (required for _sign host tests).
Co-authored-by: Isaac
* test(stac): serverless-safety guard + remove stac __init__ from tests
Add test_serverless_no_spark_config.py (static check that stac module
contains no spark.conf.set / cache / persist calls). Remove test/stac/
__init__.py: its presence made cloudpickle serialize _FakeCatalog as
'stac.test_client', causing ModuleNotFoundError in Spark UDF workers.
Co-authored-by: Isaac
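
One way such a static guard could look, as a sketch only: it scans source text line by line for the three call shapes named above. The regexes and the find_violations helper are assumptions, not the contents of test_serverless_no_spark_config.py.

```python
import re
from typing import List

# Patterns the guard treats as Serverless-unsafe (assumed regexes).
_FORBIDDEN = [
    re.compile(r"\bspark\.conf\.set\s*\("),
    re.compile(r"\.cache\s*\(\s*\)"),
    re.compile(r"\.persist\s*\("),
]


def find_violations(source: str) -> List[str]:
    """Return the stripped source lines that match a forbidden pattern."""
    hits = []
    for line in source.splitlines():
        code = line.split("#", 1)[0]  # crude: ignore everything after a '#'
        if any(p.search(code) for p in _FORBIDDEN):
            hits.append(line.strip())
    return hits
```

A test would read each module file under the stac package and assert that find_violations returns an empty list.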
* test(stac): env-v5 (Python 3.12) compatibility check + smoke
Add host pin-sanity test (checks [stac] extra declares pystac-client
>=0.7 + planetary-computer >=1.0) and a Serverless env-v5 smoke source
that installs [stac] and verifies import + StacClient construction
without network (live run to be performed separately by the controller).
Co-authored-by: Isaac
* fix(stac): C1/I1/I2/I3/I5/I7/M1/M3/M4 hardening of search/download/repair
C1: UDF fan-out path (repartition+explode+_ITEM_SCHEMA/_ASSET_SCHEMA) now
exercised end-to-end; _catalog_opener still keeps driver-shim for simpler
test injection; _get_fn= added to download() so HTTP fetcher is injectable
in UDF workers without process-level monkeypatching.
I1: validate=False publishes bytes without rasterio decode; both paths tested.
I2: download() emits last_update via F.current_timestamp(); repair MERGE
includes last_update in whenMatchedUpdate.
I3: idempotency short-circuit in fetch_validate_publish — skips re-fetch if
target exists and passes validity check.
I5: _sanitize_filename via os.path.basename prevents path traversal.
I7: removed unused Row import and dead inner type-import block.
M1: requests>=2.25 declared in [stac] extra in pyproject.toml.
M4: search_one emits warnings.warn on swallowed exceptions.
Co-authored-by: Isaac
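
A small sketch of the I5 idea (basename-based sanitizing) under stated assumptions: the helper name, the backslash handling, and the fallback value are illustrative and may differ from the real _sanitize_filename.

```python
import os


def sanitize_filename(name: str, fallback: str = "asset") -> str:
    """Reduce an untrusted name to a single path component (illustrative)."""
    # Treat backslashes as separators too, then keep only the last component.
    base = os.path.basename(name.replace("\\", "/"))
    # basename can still yield "", "." or ".." for inputs like "dir/" or "a/..".
    if base in ("", ".", ".."):
        return fallback
    return base
```

Joining the result onto out_dir then cannot escape it, because the sanitized name contains no separators and is never a parent reference.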
* test(stac): full test suite for C1/I1/I2/I3/I5/M3/M4 + UDF fan-out path
C1: UDF-path tests ship a fake pystac_client.py via addPyFile so workers
use the stub instead of the real package; exercises repartition, explode,
_ITEM_SCHEMA/_ASSET_SCHEMA, dropDuplicates, and the download _fetch UDF
with _get_fn= injection (no real network calls).
I1: validate=True/False tests (reject garbage / publish non-decodable bytes).
I2: last_update column presence asserted on download output.
I3: idempotency tests — second call with existing valid file skips fetch.
I5: path-traversal test with item_id containing ../ stays under out_dir.
M3: test/stac/__init__.py restored; cloudpickle resolves via addPyFile stub
so package __init__.py is compatible with the UDF pickling approach.
M4: search_one warning test asserts RuntimeError is surfaced as a warning.
_fake_catalog.py: module-level importable fake catalog for driver-shim tests.
Co-authored-by: Isaac
* refactor(eo-series): config_nb installs [light,stac] + StacClient; drop redundant STAC/download helpers
- %pip install now uses geobrix[light,stac] (pystac-client, planetary-computer,
tenacity come via the [stac] extra); removes standalone pystac/pystac_client/
planetary_computer/tenacity from the second pip line; keeps folium/mapclassify/
geopandas/rich (viz deps not bundled in [stac])
- Adds StacClient import to the shared imports cell; adds a new cell that
instantiates stac_client = StacClient() (Planetary Computer default, auto-signed)
making it available to nb01/nb02 after %run ./config_nb
- Removes download_band, update_assets, download_missing_assets from config_nb;
keeps file_size, timestamp_filename, get_now_formatted, set_conf_safe,
finalize_tiled_band_tbl, gen_tessellate_tiled_band, and all viz helpers
- library.py: removes ps_client, get_items, get_assets, get_assets_for_cells,
download_asset, download_asset_v2 and their now-unused imports
(pystac_client, planetary_computer, tenacity); keeps all viz helpers intact
Co-authored-by: Isaac
* refactor(eo-series): drop dead get_unique_hrefs (old asset.* schema, no callers)
Co-authored-by: Isaac
* refactor(eo-series): nb01 search via StacClient.search
Replace library.get_assets_for_cells with stac_client.search(df_cell_json, ...)
inside the LAST_UPDATED/FORCE_REBUILD guard; use F.current_timestamp() (lazy,
Serverless-safe) for provenance. Fix demo search cell to open a local _demo_cat
instead of the removed library.ps_client. Update section markdowns to describe
stac_client.search and one-row-per-(cell,item,asset) output. Clear stale show()
outputs that referenced old schema columns (asset MAP, timestamp, item_collection,
stac_version).
Co-authored-by: Isaac
* refactor(eo-series): nb02 download+repair via StacClient; rebuild band tables as join
Replace download_band/download_missing_assets with a notebook-local
build_band_table helper (filters cell_assets → stac_client.download →
joins date metadata → saveAsTable band_<band>) and stac_client.repair
for invalid-row recovery. Band table schema is now the contract columns
only: item_id, band_name, date, out_file_path, out_file_sz,
is_out_file_valid, last_update. Drops legacy columns (timestamp, h3_set,
asset map, item_collection, stac_version, item_bbox, item_properties,
out_dir_fuse, out_filename).
Co-authored-by: Isaac
* docs(stac): add STAC API page + eo-series page/README describe StacClient + [light,stac]
- New docs/docs/api/stac.mdx: user-facing StacClient reference (search /
download / repair), install instructions, API tables with exact signatures
and output columns, end-to-end illustrative example, resilience behavior,
Serverless notes, and non-goals. Registered in sidebars.js after PMTiles.
- eo-series.mdx + README.md: install note updated to [light,stac]; nb01/nb02
descriptions updated to StacClient.search / .download / .repair; old
download_band / update_assets / download_missing_assets references removed;
StacClient linked from key-functions list and gotchas.
Co-authored-by: Isaac
* fix(eo-series): cell_assets selection robust to schema change
nb01 under FORCE_REBUILD now purges stale cell_assets_* dirs before re-search
(the search output schema changed in the StacClient refactor: asset map ->
flat asset_name/href). Both nb01 and nb02 now pick the LATEST cell_assets_*
dir instead of the first arbitrary os.listdir match, so nb02 never reads a
stale prior-schema table (was: UNRESOLVED_COLUMN asset_name).
Co-authored-by: Isaac
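
A sketch of picking the latest cell_assets_* directory rather than the first os.listdir match. It assumes the directory suffix is a sortable timestamp, which the entry above does not state; the helper name is hypothetical.

```python
import os
from typing import Optional


def latest_cell_assets_dir(parent: str, prefix: str = "cell_assets_") -> Optional[str]:
    """Return the path of the newest matching directory, or None if absent."""
    names = [
        n for n in os.listdir(parent)
        if n.startswith(prefix) and os.path.isdir(os.path.join(parent, n))
    ]
    if not names:
        return None
    # Assumption: the suffix sorts chronologically, so max() is the latest.
    return os.path.join(parent, max(names))
```

os.listdir returns entries in arbitrary order, so an explicit max() (or a sort on modification time) is what makes the choice deterministic.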
* docs(notebooks): add narration + doc links to example notebooks
Add 18 markdown cells across 5 example notebooks orienting readers on
which GeoBrix APIs each code cell uses. Each insertion explains what
the function does, what it returns, and links to the relevant docs page.
Tier statements added to notebooks 03, 04, and config_nb. Built-in
Databricks ST/H3 functions clearly distinguished from GeoBrix APIs.
Co-authored-by: Isaac
* fix(eo-series): nb02 band table date->DateType + overwriteSchema
StacClient.search emits date as a string (dt[:10]); cast to DateType in
build_band_table so band_<band> -> band_*_tile -> band_*_h3 -> band_stack stay
DateType (matches the original nb03/nb04 contract; no nb04 change needed).
overwriteSchema=true so the band table fully replaces a prior-release schema
(the refactor shrank its column set) instead of failing on DELTA_FAILED_TO_MERGE.
Co-authored-by: Isaac
* fix(eo-series): drop stale band table under FORCE_REBUILD before rebuild
A prior-release band_<band> managed table has a different (larger) column set;
saveAsTable(overwrite, overwriteSchema) can still attempt a field-level merge
against it. DROP TABLE IF EXISTS under force_rebuild guarantees a clean recreate
(mirrors nb01's cell_assets purge), so the notebook self-heals across schema
changes without external cleanup.
Co-authored-by: Isaac
* docs(release-notes): v0.4.0 notes StacClient + examples default to lightweight tier
Co-authored-by: Isaac
* fix(rasterx): rst_fromfile reads UC Volume/Workspace paths (issue #34)
gbx_rst_fromfile('/Volumes/...','GTiff') silently returned a NULL tile on
classic UC compute, then crashed downstream (rst_retile) with an opaque
'Long cannot be cast to InternalRow'. Three coordinated fixes:
- HadoopUtils.cleanPath: route the supported FUSE fabric (/Volumes, /Workspace,
dbfs:/Volumes, file:/) to the local file: connector. A scheme-less /Volumes
path resolved against fs.defaultFS (dbfs:) and never reached the FUSE mount.
Legacy pre-Volumes DBFS (/dbfs/, dbfs:/) is unsupported and coerced away from
the retired mount (the old /dbfs->dbfs: mapping and /dbfs/Volume/ typo arm are
removed); /dbfs/Volumes and dbfs:/Volumes aliases coerce to file:/Volumes.
- RST_FromFile: read FUSE/local paths via NIO (Files.readAllBytes) instead of
getFileSystem(serialized hConf)+readContent -- the same OS path GDAL JNI reads,
independent of the driver-serialized fs config (the second defect in #34).
Hadoop FS read retained as a fallback for non-file: schemes.
- RST_ErrorHandler.safeEval: initCause the wrapped Error when crashing, and
logWarning the swallowed throwable otherwise, so the real cause is no longer
lost (issue #34 secondary).
Adds cleanPath regression tests. Pending classic-UC-cluster validation against
a real /Volumes GeoTIFF.
Co-authored-by: Isaac
* fix(rasterx): rst_fromfile non-foldable so /Volumes reads run on executors (issue #34)
Root cause: RST_FromFile with LITERAL path args was treated as foldable by Catalyst.
ConstantFolding evaluated it at plan time on the DRIVER — which cannot open a UC Volume
FUSE path in the optimizer context — returning null instead of reading the raster.
Fix: PrettyInvoke gains a nonFoldable param that forces foldable=false, bypassing
Catalyst's constant-folding at plan time. RST_FromFile sets override def foldable=false
and calls invoke(..., nonFoldable=true). I/O expressions MUST evaluate at runtime on
executors, never be folded on the driver.
Co-authored-by: Isaac
* fix(rasterx): rst_fromfile reads /Volumes via WorkspaceLocalFileSystem not NIO (issue #34)
Raw NIO (java.nio.file.Files) reads of a /Volumes FUSE path are denied
('Operation not permitted') in an expression's execution context (UC ephemeral
credential scope), even though the same executor reads the path in a plain
RDD/UDF context. Route readBytes through the Hadoop FileSystem so file:/Volumes
resolves to WorkspaceLocalFileSystem (the UC-credentialed FUSE connector), using
a fresh executor-side Configuration (fs.file.impl is on the executor classpath,
not necessarily the driver-serialized conf).
Co-authored-by: Isaac
* fix(rasterx): gdal/ogr readers read /Volumes via WorkspaceLocalFileSystem (issue #34)
The HadoopUtils listing/copy helpers resolved file:/Volumes with the caller's
config, which lacks fs.file.impl and fell back to RawLocalFileSystem (can't read
the FUSE mount) -> FileNotFound in GDAL_Batch/OGR planInputPartitions. Route all
FS resolution through fileSystemFor, overlaying the classpath's
fs.file.impl=WorkspaceLocalFileSystem, so the gdal/ogr readers and copy/list
helpers read /Volumes the same way rst_fromfile now does.
Co-authored-by: Isaac
* fix(rasterx): rst_fromfile delegates to pyrx Python loader for UC Volume reads (issue #34)
The heavyweight (JVM) reader cannot read /Volumes: in the Catalyst execution
context file: resolves to RawLocalFileSystem (the UC-credentialed
WorkspaceLocalFileSystem FUSE connector is unreachable from the library JAR's
classloader). The lightweight pyrx loader reads via a Python pandas_udf (the
Python process holds the UC FUSE credential) and yields a schema-compatible
tile. rx.rst_fromfile now delegates to pyrx.rst_fromfile when importable
(requires geobrix[light]); else falls back to the Scala gbx_rst_fromfile UDF.
Co-authored-by: Isaac
* fix(rasterx): resolve file:/Volumes FS via live Spark conf, not serialized/bare config (issue #34)
withFileSystem builds the Hadoop Configuration fresh on the executor from the
live SparkEnv conf (SparkHadoopUtil.newConfiguration), which overlays
spark.hadoop.* incl. Databricks fs.file.impl=WorkspaceLocalFileSystem, so
file:/Volumes resolves to the UC FUSE connector even in the Catalyst codegen /
DataSource-planning context (a bare or driver-serialized config falls back to
RawLocalFileSystem there). FileSystem.newInstance bypasses the static FS cache.
Co-authored-by: Isaac
* fix(rasterx): register gbx_rst_fromfile as the pyrx Python UDF; graceful skip without [light] (issue #34)
The JVM cannot read a UC Volume FUSE path in the Spark execution context (the UC
credential is held only by Spark's managed Python worker), so the canonical impl
of gbx_rst_fromfile is the pyrx Python UDF, which reads /Volumes (and /Workspace,
DBFS, local). register() overrides the Scala gbx_rst_fromfile with it when pyrx
([light]) is importable; if absent, it skips gracefully and the Scala
registration remains (reads local/DBFS/Workspace, not /Volumes).
Co-authored-by: Isaac
* refactor(util): revert dead /Volumes FS-forcing in HadoopUtils to baseline getFileSystem (issue #34)
The withFileSystem SparkHadoopUtil/newInstance/fs.file.impl-pin machinery never
achieved a JVM /Volumes read (the UC credential is Python-worker-only). /Volumes
is now served by the pyrx Python UDF (gbx_rst_fromfile) / binaryFile+rst_fromcontent.
Revert to plain getFileSystem(hconf) -> restores baseline local/DBFS/Workspace
reader behavior with no churn. Keeps cleanPath normalization + non-foldable.
Co-authored-by: Isaac
* refactor(rasterx): make gbx_rst_fromfile lightweight-only (issue #34)
On Databricks the executor JVM cannot read a UC Volume (/Volumes) FUSE
path: the UC credential lives in the user-scoped Python/IO context, not
the library's JVM code path. So the Scala RST_FromFile expression read
the wrong filesystem and returned a null tile, which then crashed
downstream (gbx_rst_retile: Long cannot be cast to InternalRow).
There is no way to implement this in the JVM tier, so remove the Scala
RST_FromFile entirely; gbx_rst_fromfile is now registered solely as the
pyrx Python UDF (runs where the credential is available). It is callable
from SQL and Python when geobrix[light] is installed; without [light] it
is not registered and the rx.rst_fromfile binding raises with guidance.
- Remove RST_FromFile.scala + its registration and the Scala column/
scalar helpers in rasterx/functions.scala.
- pyrx _fromfile_udf reads bytes sequentially into a rasterio MemoryFile
(FUSE-safe: a tiled/COG GTiff over a Volume seeks to block offsets,
which Volume FUSE can't serve).
- Migrate Scala tests off the removed expression: new udfs.rasterFromPath
test fixture (reads local bytes JVM-side -> rst_fromcontent), used
across the rasterx eval/structure suites; drop the heavy-tier
rst_fromfile bench dispatch. rasterx.* suites green (513 passed).
- check-binding-parity exempts gbx_rst_fromfile from the Scala arm
(Python-registered); Python binding + function-info still required.
- Docs: rst_fromfile marked lightweight-only with the credential reason
and the portable binaryFile + gbx_rst_fromcontent alternative.
Co-authored-by: Isaac
* docs(spec): register(only=[...]) selective SQL registration (light tiers)
Design for an optional `only` param on the lightweight register() functions
(pyrx/pygx/pyvx) so a session can register a subset of a tier's SQL functions.
Accepts both gbx_ and short names, raises on unknown names, preserves today's
behavior when only=None via a grouped registrar map that runs per-sub-module
availability guards only for selected groups. Heavy only= deferred.
Co-authored-by: Isaac
* docs(plan): register(only=[...]) implementation plan + readers/writers scope
Task-by-task TDD plan for the light-tier only= feature. Extends scope to the
DataSource registrar (ds.register) so readers/writers are selectable by format
name too, via a shared _register helper (normalize_name + normalize_datasource_name
+ resolve_only + run_groups). Six tasks: helper, pyrx, pygx, pyvx, ds.register, docs.
Co-authored-by: Isaac
* docs(plan): extend Task 6 to add only= note in readers/writers overview
Co-authored-by: Isaac
* feat(register): shared only= normalize/validate/run-groups helper
Adds _register.py with normalize_name, normalize_datasource_name,
resolve_only, and run_groups — the shared helper that Tasks 2-6 will
wire into pyrx/pygx/pyvx/ds register() to support only=[...] selective
SQL function registration. 10 unit tests; TDD red→green.
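
The only= behavior described in the spec and helper entries above could be sketched like this. The three function names come from the entry; their signatures, the (guard, registrars) group shape, and the example SQL names in the test are assumptions, not the actual _register.py.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

# A group pairs an availability guard with its name -> registrar mapping.
Group = Tuple[Callable[[], bool], Dict[str, Callable[[], None]]]


def normalize_name(name: str) -> str:
    """Lower-case and drop a leading gbx_ so both spellings compare equal."""
    n = name.strip().lower()
    return n[4:] if n.startswith("gbx_") else n


def resolve_only(only: Optional[Iterable[str]],
                 known: Iterable[str]) -> Optional[Set[str]]:
    """Validate `only` against known names; None means register everything."""
    if only is None:
        return None
    known_set = {normalize_name(k) for k in known}
    wanted = {normalize_name(o) for o in only}
    unknown = sorted(wanted - known_set)
    if unknown:
        raise ValueError(f"unknown function name(s): {unknown}")
    return wanted


def run_groups(groups: Dict[str, Group],
               only: Optional[Iterable[str]] = None) -> List[str]:
    """Run registrars per group; a guard runs only if its group is selected."""
    all_names = [n for _, fns in groups.values() for n in fns]
    selected = resolve_only(only, all_names)
    registered = []
    for guard, fns in groups.values():
        chosen = [n for n in fns
                  if selected is None or normalize_name(n) in selected]
        if not chosen or not guard():
            continue  # short-circuit: guard is skipped when nothing is chosen
        for n in chosen:
            fns[n]()
            registered.append(n)
    return registered
```

Validation happens before any registrar runs, so an unknown name fails the whole call instead of leaving a partially registered session.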
* feat(pyrx): register(only=[...]) selective SQL registration
Refactor pyrx functions.register() into a grouped registrar map
(_registrar_groups) covering all SQL_REGISTRY scalar/agg UDFs, the 17
UDTFs, and gbx_pmtiles_agg. Add the only= param wired to
_register.run_groups so callers can register a subset by name (short or
full SQL form, case-insensitive). only=None remains behavior-identical.
Co-authored-by: Isaac
* feat(pygx): register(only=[...]) selective SQL registration
Refactors register() into _registrar_groups() with three guarded groups
(quadbin/bng/custom) wired through _register.run_groups, mirroring the
pyrx pattern from Task 2. Guards close over _env so they are
monkeypatchable for test isolation. only=None preserves existing
behavior; only=[...] validates names (case-insensitive, short or full
SQL form) and runs only the guards whose group has a selected function.
10 quadbin + 23 bng + 7 custom = 40 entries; array-output return types
(ArrayType(LongType()), ArrayType(StringType()), ArrayType(QUADBIN_CELL_SCHEMA),
ArrayType(BNG_CHIP_SCHEMA)) preserved verbatim from the pre-refactor body.
TDD: 5 new tests in test_register_only.py (RED then GREEN); 150/150
existing pygx tests pass (14 JAR-gated deselected as expected).
Co-authored-by: Isaac
* feat(pyvx): register(only=[...]) selective SQL registration
* feat(ds): register(only=[...]) selective reader/writer registration
Adds `only=[...]` to `ds.register.register` so callers can register a
subset of the 9 light DataSources by format name (e.g. `raster_gbx` or
the bare `raster` form). Filters via `_register.resolve_only` with
`normalize_datasource_name` and preserves `_SOURCES` iteration order
for the `only=None` path.
Co-authored-by: Isaac
* docs: document register(only=[...]) for functions and readers/writers
Adds a "Registering a subset (only=)" section to execution-tiers.mdx
covering function-level only= (pyrx/pygx/pyvx), ds_register only= by
format name with/without _gbx suffix, ValueError on unknown names, and
the heavy+light mixing pattern. Extends the "Register first" admonitions
in readers/overview.mdx and writers/overview.mdx with a concise only=
example and the ValueError note.
* test(ds): parenthesize _format_ok boolean for clarity (no behavior change)
Co-authored-by: Isaac
* refactor(bench): repartition by column (Serverless-safe, no number-only)
Number-only repartition(N) is AQE-coalesced toward 1 partition on
Databricks Serverless; hash repartition by column is respected. Switch
all three repartition sites in runner.py to repartition by F.col("tile")
-- the sole column on df_all -- and update matching test files to use
the geometry column (geom_0 / x+y) so partitions stay non-empty.
Co-authored-by: Isaac
* docs(examples): repartition by column (Serverless-safe, no number-only)
Switch all 8 number-only repartition sites in readers/examples.py to
hash repartition by an existing column: F.col("source") for GDAL readers,
F.col("geom_0") for OGR-based readers (shapefile_ogr, gpkg, file_gdb_ogr,
ogr). Add F import alongside the existing expr import.
Co-authored-by: Isaac
* refactor(bench): repartition by F.rand() not the tile struct
Hashing the tile struct (which carries the raster bytes) as the repartition key would hash megabytes per row and skew the very timings the bench measures; F.rand() gives uniform AQE-respected spread. tile is the only column on df_all.
Co-authored-by: Isaac
* fix(stac): sign asset hrefs once — drop double-sign that 403'd all downloads
Search path now carries raw (unsigned) hrefs — the _items UDF opens the catalog
without the sign_inplace modifier so hrefs are stored unsigned in the search
result DataFrame. Download path signs the carried raw href once per attempt via
resolve_signer (planetary_computer.sign). This eliminates the prior pattern of
calling get_item (which requires a collection parameter on the PC STAC API and
raised on every call, causing silently-swallowed exceptions and out_file_path=None
for all assets) followed by a double-sign of the already-modifier-signed href.
Also surface swallowed errors in fetch_validate_publish via logging.warning so
a future all-false symptom leaves a reason in executor logs.
Verified green on Databricks Serverless env-v5 (Ketchikan B02 assets, total=3,
valid=3).
Co-authored-by: Isaac
* fix(stac): re-sign by stripping stale SAS token; nb02 carries href
planetary_computer.sign is a NO-OP on an already-tokened URL — it returns an
existing (possibly EXPIRED) SAS query unchanged instead of refreshing it. A
search-time-signed href stored in a cell_assets table therefore 403s at
download time (verified on real Ketchikan data: stored href -> 403 XML;
strip-then-sign -> 200 image/tiff). resolve_signer now strips any existing
query string before signing so PC always mints a fresh token (raw hrefs are
unchanged). Also: eo-series nb02 build_band_table now selects 'href' into the
download input — the carry-href download signs the carried href, so dropping it
left nothing to sign (all files invalid).
Co-authored-by: Isaac
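
The strip-then-sign fix above can be illustrated with a short sketch. strip_query and resigning are hypothetical helper names; the real resolve_signer may structure this differently. The signer argument is any href -> href callable such as planetary_computer.sign.

```python
from typing import Callable
from urllib.parse import urlsplit, urlunsplit


def strip_query(href: str) -> str:
    """Drop any query string (e.g. a stale SAS token) and fragment from a URL."""
    parts = urlsplit(href)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def resigning(signer: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a signer so it always sees the bare URL and mints a fresh token."""
    return lambda href: signer(strip_query(href))
```

A raw href has no query string, so strip_query leaves it unchanged, which matches the note that raw hrefs are unaffected.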
* fix(stac): download/repair fail loudly on missing href (no silent all-invalid)
A missing href column was silently null-filled, making every download invalid with no explanation (the eo-series nb02 trap: build_band_table dropped href). Now download() and repair() raise an actionable ValueError naming the missing column(s) and pointing to the cell_assets-derived input / build_band_table(force_rebuild=True).
Co-authored-by: Isaac
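
A sketch of the fail-loud column check described above. The helper name, the required-column tuple, and the message wording are assumptions; in practice it would be called with df.columns before any download work starts.

```python
from typing import Iterable, Sequence


def require_columns(columns: Iterable[str],
                    required: Sequence[str] = ("item_id", "asset_name", "href")) -> None:
    """Raise an actionable ValueError naming every missing input column."""
    present = set(columns)
    missing = [c for c in required if c not in present]
    if missing:
        raise ValueError(
            f"input is missing required column(s) {missing}; pass a "
            "cell_assets-derived DataFrame that still carries them"
        )
```

Raising up front replaces the earlier failure mode, where a null-filled href made every row invalid with no explanation.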
* fix(lint): resolve flake8 F821/F841 + cross-version f-string; formatting
- bench/runner.py: the F.rand() repartition (Serverless fan-out) used F without
importing it in _explain_spark_path -> F821. Add the local functions-as-F import.
- stac/client.py: drop the now-unused `sign` local in search() -> F841 (search opens
the catalog without the signing modifier; signing happens once at download).
- bench/readers.py: compute the basename outside the f-string -- a backslash in an
f-string expression is a SyntaxError before Python 3.12 and trips flake8 on older hosts.
- isort/black (container) formatting on rasterx/functions.py and the two writer tests;
add python/geobrix/test/__init__.py.
Co-authored-by: Isaac
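
A hypothetical illustration of the f-string fix in bench/readers.py; the function below is invented for the example and is not the actual bench code. The point is only that the expression containing a backslash is evaluated first and its result interpolated.

```python
def label_for(path: str) -> str:
    """Build a log label from the last component of a backslash-separated path."""
    # Writing the split inline, as in f"reading {path.split('\\')[-1]}", is a
    # SyntaxError before Python 3.12 because of the backslash inside the braces.
    base = path.split("\\")[-1]
    return f"reading {base}"
```

The hoisted form parses identically on every supported Python version, which is what keeps flake8 on older hosts quiet.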
* docs(examples): executed eo-series + xView notebooks (lightweight tier, Serverless)
Re-ran the EO Series (01-04 + config_nb) and xView walkthroughs end-to-end on
Databricks Serverless (environment v5, geobrix[light,stac]); committed with their
executed outputs. nb02 carries href into StacClient.download (the carry-href contract);
nb04 stacks bands via union+pivot then repartition(N, cellid, date) before rst_frombands
(Serverless-safe parallelism; the old spark.conf tuning is a no-op there). READMEs add a
Serverless execution-strategy section. Includes the eo-series→StacClient refactor plan.
Co-authored-by: Isaac
* docs: Serverless repartition guidance, rst_fromfile lightweight-only, env-v5
- Serverless parallelism is repartition(N, col) not spark.conf (no-op there): geojsonl
writer note + STAC/eo-series shuffle guidance corrected (number-only repartition is
AQE-coalesced).
- rst_fromfile is lightweight-tier only (Python UDF; JVM can't read UC Volumes), but
gbx_rst_fromfile SQL is registered when geobrix[light] is present; raster-functions page
leads with a Raster Reader link (it's a single-file convenience) + binaryFile alternative;
release-notes bullet added.
- README + installation: Serverless environment v5 (Python 3.12) requirement for [light].
Co-authored-by: Isaac
* feat(pyrx): robustness for real-AOI/Serverless raster + vector reads
- pyrx/core/agg.py: merge_tiles reconciles multi-CRS groups (reproject mismatched tiles
to a deterministic reference CRS before rasterio.merge) and frombands aligns bands onto
the reference grid before stacking (handles UTM-zone-boundary AOIs / uneven coverage).
- ds/raster.py: FUSE-safe whole-image read (sequential byte read + MemoryFile; a tiled/COG
GTiff over a UC Volume seeks to block offsets the FUSE mount can't serve).
- ds/vector.py: optional bbox + OGR `where` pushdown to pyogrio (read less; keeps a large
single-file vector read under the ~1 GB Serverless per-UDF memory cap).
- pyrx/core/xyz.py: import rio-tiler lazily so `import pyrx` (and non-XYZ rst_* functions)
works on Serverless env-v5, where rio-tiler 9.x's PEP-728 TypedDict trips the immutable
typing_extensions pin; only rst_tilexyz / rst_xyzpyramid need it at call time.
- pyrx/core/tessellate.py + the UDTF: skip cells that clip to empty/all-nodata (None) so
no null-raster tile row is emitted.
Co-authored-by: Isaac
* test(stac): cover StacClient.repair() filter + href guard
Adds direct unit tests for repair() (the QC test-completeness gap): one asserts it filters to the invalid rows and re-downloads them carrying item_id/asset_name/href (download stubbed, no network); one asserts the fail-loud ValueError when href/asset_name are absent (e.g. a band table). 2/2 green in Docker.
Co-authored-by: Isaac
* test(bench): fix BenchDispatch count 107->106 (rst_fromfile, #34)
The #34 refactor made rst_fromfile lightweight-only (the JVM cannot read
UC Volumes) and dropped it from the heavy bench registry BenchDispatch.cats,
taking BenchDispatch.all from 107 to 106. The "registry covers the ds-in
functions" test still asserted == 107 and was never updated in that commit,
so the heavy Scala phase failed.
Correct the assertion to 106 and the comment tally (bucket-C C1/C2: 6->5).
The product SQL count is unchanged at 107 -- gbx_rst_fromfile is still
registered via the light tier -- so binding-parity/diagram-coverage are
unaffected; only the heavy-benchmarkable set legitimately diverges.
Verified: BenchDispatchTest suite 3/3 green in geobrix-dev.
Co-authored-by: Isaac
* ci(stac): scope STAC tests to the lightweight tier (heavy skips)
The heavy Python CI phase collected test/stac and failed at collection with
"ModuleNotFoundError: No module named 'rasterio'" (test_download.py imports
rasterio; requirements-ci.txt ships none of the light-tier deps). STAC is a
lightweight-only API ([stac] extra, JAR-free) and should be tested in the
light phase, not the heavy one.
The maintained mechanism is test/conftest.py's _LIGHT_TEST_DIRS collect_ignore
(heavy env, no rasterio -> dir skipped) paired with the explicit dir list in
the light job (.github/actions/pyrx_build). STAC had been added to NEITHER, so
it was about to run in no job at all. Register it in both:
- conftest.py: add "stac" to _LIGHT_TEST_DIRS -> heavy phase skips it cleanly.
- pyrx_build/action.yml: add test/stac to the light pytest dir list -> it runs.
- requirements-pyrx-ci.{in,txt}: add requests==2.32.3 (the only missing runtime
dep -- stac/_download.py imports it at module top). pystac-client and
planetary-computer are NOT added: the suite stubs them (test/stac/
_fake_catalog.py + an addPyFile pystac_client stub + monkeypatch). Lock
regenerated with --generate-hashes; only requests + urllib3 +
charset-normalizer added (certifi/idna already present), no pin drift.
Verified: test/stac 28 passed in geobrix-dev; lint (isort/black/flake8) clean.
Co-authored-by: Isaac
* ci(stac): add tenacity to the light lock (search_one retry dep)
With test/stac now in the lightweight CI selection, the light job failed with
"ModuleNotFoundError: No module named 'tenacity'": stac/_search.py search_one
wraps its catalog call in a tenacity @retry, and test_search exercises that
path. tenacity is a [stac] extra dep that was absent from requirements-pyrx-ci
(the dev container masked this -- it has the full [stac] extra installed; the
light CI installs only the lock).
Add tenacity==9.1.4 (matches requirements-dev-container.in, within the
pyproject >=8,<10 constraint) and regenerate the hash-pinned lock. tenacity is
the sole net addition (no new transitives), no existing pin drifted.
Verified CI-faithfully: built a clean venv from ONLY requirements-pyrx-ci.txt
(--require-hashes) + the package, then ran the full light selection
(pytest test/pyrx test/ds test/pyvx test/pygx test/pmtiles_light test/stac
-m "not integration") -> 816 passed, 2 skipped, exit 0. No further missing deps.
Co-authored-by: Isaac
* test(rasterx): load tiles via heavy-native rst_fromcontent, not rst_fromfile
Issue #34 made rst_fromfile lightweight-only: rasterx.rst_fromfile (and SQL
gbx_rst_fromfile) delegate to the pyrx Python UDF, which imports pandas/rasterio
at module top. The heavy CI Python env ships neither, so every test/rasterx test
that loaded a tile via rst_fromfile failed at runtime with
"ModuleNotFoundError: No module named 'pandas'" (56 failures). The Scala tests
were migrated off rst_fromfile during #34; the Python rasterx tests were missed.
Add test/rasterx/_helpers.py with heavy-native loaders that decode a local test
raster's bytes via the Scala/GDAL gbx_rst_fromcontent (no pyrx/pandas/rasterio):
- tile_from_path(rx, f, path, driver) -- drop-in for rst_fromfile(lit(path),...)
- read_bytes(path) -- raw bytes for building a content column in multi-row frames
Migrate all ~63 sites across 13 files (44 Column-form, 5 SQL-form rewritten to
binaryFile + gbx_rst_fromcontent, 7 multi-row bytes). Keeps the heavy/light tier
separation intact -- no lightweight deps added to the heavy lock.
Verified in geobrix-dev: test/rasterx 93 passed normally AND 93 passed with
pandas+rasterio blocked at builtins.__import__ (proves no light-dep pull); lint
(black/isort/flake8) clean.
Co-authored-by: Isaac
---------
Co-authored-by: Michael Johns <user.name>
107 files changed
Lines changed: 8208 additions & 7203 deletions
File tree
- .github/actions/pyrx_build
- docs
- docs
- api
- notebooks
- readers
- writers
- scripts
- superpowers
- plans
- specs
- tests/python/readers
- notebooks
- examples
- eo-series
- xview
- tests
- python/geobrix
- src/databricks/labs/gbx
- bench
- ds
- pygx
- pyrx
- core
- pyvx
- rasterx
- stac
- test
- ds
- pmtiles_bindings
- pygx
- pyrx
- pyvx
- rasterx
- stac
- src
- main/scala/com/databricks/labs/gbx
- expressions
- rasterx
- expressions/constructor
- util
- util
- test/scala
- com/databricks/labs/gbx
- bench
- rasterx
- expressions
- constructor
- util