From 44d94a50b2e804ce5a23dee3eb168c123994eedd Mon Sep 17 00:00:00 2001 From: sharkinsspatial Date: Wed, 8 Apr 2026 18:06:12 -0400 Subject: [PATCH 1/5] Documentation updates. --- DEVELOP.md | 17 +++++++++++------ README.md | 8 +++++--- 2 files changed, 16 insertions(+), 9 deletions(-) diff --git a/DEVELOP.md b/DEVELOP.md index 887392a..d64a148 100644 --- a/DEVELOP.md +++ b/DEVELOP.md @@ -1,14 +1,19 @@ # Contributing -## Create example Zarr files - -From the root directory, run: +## Rust +From the project root, run: ```bash -uv run python scripts/generate_data.py +cargo test ``` -This will generate `data/zarr_store.zarr` from the root directory. +A suite of benchmarks are available (though the remote S3 benchmarks use data in a protected bucket and requires credentials). Benchmarks are in separate binaries and can be run via + +```bash +cargo datetime_local +cargo bbox_colunms_local +cargo bbox_local +``` ## Python bindings @@ -23,5 +28,5 @@ The `--no-project` is necessary to avoid building the Rust code (in release mode You need to add `--no-project` before any `uv run` command. For example, to run IPython: ```bash -uv run --no-project ipython +uv run pytest ``` diff --git a/README.md b/README.md index 1e2c2f7..7d62ac0 100644 --- a/README.md +++ b/README.md @@ -1,10 +1,12 @@ # zarr-datafusion-search -This is a prototype for being able to query _metadata_ about Zarr arrays using [DataFusion](https://datafusion.apache.org/), an extensible query engine written in Rust. +This is a prototype for querying STAC or CMR style _metadata_ about Zarr arrays and groups using [DataFusion](https://datafusion.apache.org/), an extensible query engine written in Rust. -## Zarr Schema +This concept was conceived by the team at [Earthmover](https://www.earthmover.io/) and is outlined in their whitepaper [Level 2 Data Collections in Zarr / Icechunk](https://docs.google.com/document/d/1tbT-B_yDGO74Tz_LstTSLJ6mw14uaswzl6v_tOyNzJg/edit?pli=1&tab=t.0#heading=h.awu8gjpaww08). 
-In particular, we assume there is a Zarr store with multiple 1-dimensional arrays in root group named `"meta"`. +## Schema + +To store this _metadata_, this project uses a convention where the Zarr store represents each metadata "field" with a 1-dimensional array in a root group named `"meta"`. Users can define arbitrary schemas where the 1-dimensional arrays each use a `dtype` that has an equivalent Arrow type in our supported [mappings](https://github.com/developmentseed/zarr-datafusion-search/issues/12). A concrete example would look like From ac5738d71debf1e552a22f1f640c649b8f9aeef4 Mon Sep 17 00:00:00 2001 From: sharkinsspatial Date: Wed, 8 Apr 2026 18:09:28 -0400 Subject: [PATCH 2/5] Include convention change note. --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 7d62ac0..61496e9 100644 --- a/README.md +++ b/README.md @@ -6,16 +6,16 @@ This concept was conceived by the team at [Earthmover](https://www.earthmover.io ## Schema -To store this _metadata_, this project uses a convention where the Zarr store represents each metadata "field" with a 1-dimensional array in a root group named `"meta"`. +To store this _metadata_, zarr-datafusion-search uses a convention where the Zarr store represents each metadata "field" with a 1-dimensional array in a root group named `"meta"`. -Users can define arbitrary schemas where the 1-dimensional arrays each use a `dtype` that has an equivalent Arrow type in our supported [mappings](https://github.com/developmentseed/zarr-datafusion-search/issues/12). A concrete example would look like +Users can define arbitrary schemas where the 1-dimensional arrays each use a `dtype` that has an equivalent Arrow type in our supported [mappings](https://github.com/developmentseed/zarr-datafusion-search/issues/12). 
A concrete example might look like - Inside a Zarr group named `"meta"` - A `datetime64[ms]` array named `"date"` with `n` timestamps named `"date"` with `n` timestamps. - A `VariableLengthUTF8` array named `"collection"` with `n` string values. - A `VariableLengthBytes` array named `"bbox"` with `n` binary values, where each value is a WKB-encoded Polygon (or MultiPolygon) with the bounding box of that Zarr record. -This data schema may change over time. +This project is under active development so these conventions may change. ## Python API From 4374973dad8f4f4ac298ae84045b9617239896f4 Mon Sep 17 00:00:00 2001 From: sharkinsspatial Date: Wed, 8 Apr 2026 18:13:04 -0400 Subject: [PATCH 3/5] Fix bench commands. --- DEVELOP.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/DEVELOP.md b/DEVELOP.md index d64a148..b769bd2 100644 --- a/DEVELOP.md +++ b/DEVELOP.md @@ -10,9 +10,9 @@ cargo test A suite of benchmarks are available (though the remote S3 benchmarks use data in a protected bucket and requires credentials). Benchmarks are in separate binaries and can be run via ```bash -cargo datetime_local -cargo bbox_colunms_local -cargo bbox_local +cargo bench --bench datetime_local +cargo bench --bench bbox_colunms_local +cargo bench --bench bbox_local ``` ## Python bindings From 74b3d050efbcd49edb1a383fb2c5548a6f790f6e Mon Sep 17 00:00:00 2001 From: sharkinsspatial Date: Thu, 9 Apr 2026 20:10:19 -0400 Subject: [PATCH 4/5] Include --no-project flag. --- DEVELOP.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/DEVELOP.md b/DEVELOP.md index b769bd2..3456422 100644 --- a/DEVELOP.md +++ b/DEVELOP.md @@ -25,8 +25,8 @@ uv run --no-project maturin develop --uv The `--no-project` is necessary to avoid building the Rust code (in release mode) an extra time before we even reach the `maturin develop` command. -You need to add `--no-project` before any `uv run` command. 
For example, to run IPython: +You need to add `--no-project` before any `uv run` command. For example ```bash -uv run pytest +uv run --no-project pytest ``` From 4f15e8002862817f4a53fe1b91a91492d34db083 Mon Sep 17 00:00:00 2001 From: sharkinsspatial Date: Thu, 9 Apr 2026 20:12:27 -0400 Subject: [PATCH 5/5] Remove whitepaper private link. --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 61496e9..0d95fb8 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@ This is a prototype for querying STAC or CMR style _metadata_ about Zarr arrays and groups using [DataFusion](https://datafusion.apache.org/), an extensible query engine written in Rust. -This concept was conceived by the team at [Earthmover](https://www.earthmover.io/) and is outlined in their whitepaper [Level 2 Data Collections in Zarr / Icechunk](https://docs.google.com/document/d/1tbT-B_yDGO74Tz_LstTSLJ6mw14uaswzl6v_tOyNzJg/edit?pli=1&tab=t.0#heading=h.awu8gjpaww08). +This concept was conceived by the team at [Earthmover](https://www.earthmover.io/) and is outlined in their whitepaper Level 2 Data Collections in Zarr / Icechunk. ## Schema
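
The README hunks above describe the `"bbox"` array as holding WKB-encoded Polygons, one bounding box per Zarr record. The following is a minimal sketch of how one such value could be produced with only the Python standard library. The helper name `bbox_to_wkb` and the sample coordinates are hypothetical and are not part of zarr-datafusion-search; the project may build these values differently (for example with a geometry library).

```python
import struct


def bbox_to_wkb(minx, miny, maxx, maxy):
    """Encode a bounding box as a little-endian WKB Polygon with one closed ring.

    Hypothetical helper for illustration only.
    """
    # Exterior ring, counter-clockwise, with the first point repeated to close it.
    ring = [
        (minx, miny),
        (maxx, miny),
        (maxx, maxy),
        (minx, maxy),
        (minx, miny),
    ]
    # Byte order flag (1 = little endian), geometry type (3 = Polygon), ring count.
    header = struct.pack("<BII", 1, 3, 1)
    # Point count for the ring, followed by each (x, y) pair as two doubles.
    body = struct.pack("<I", len(ring))
    body += b"".join(struct.pack("<dd", x, y) for x, y in ring)
    return header + body


# Sample coordinates chosen arbitrarily for the sketch.
wkb = bbox_to_wkb(-10.0, -5.0, 10.0, 5.0)
```

Each resulting `bytes` value would be one element of the variable-length bytes array; how the values are written into the Zarr store is left out here because it depends on the Zarr library and version in use.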