diff --git a/DEVELOP.md b/DEVELOP.md
index 887392a..3456422 100644
--- a/DEVELOP.md
+++ b/DEVELOP.md
@@ -1,14 +1,19 @@
 # Contributing
 
-## Create example Zarr files
-
-From the root directory, run:
+## Rust
+From the project root, run:
 
 ```bash
-uv run python scripts/generate_data.py
+cargo test
 ```
 
-This will generate `data/zarr_store.zarr` from the root directory.
+A suite of benchmarks is available (though the remote S3 benchmarks use data in a protected bucket and require credentials). Benchmarks are in separate binaries and can be run via:
+
+```bash
+cargo bench --bench datetime_local
+cargo bench --bench bbox_colunms_local
+cargo bench --bench bbox_local
+```
 
 ## Python bindings
 
@@ -20,8 +25,8 @@ uv run --no-project maturin develop --uv
 
 The `--no-project` is necessary to avoid building the Rust code (in release mode) an extra time before we even reach the `maturin develop` command.
 
-You need to add `--no-project` before any `uv run` command. For example, to run IPython:
+You need to add `--no-project` before any `uv run` command. For example:
 
 ```bash
-uv run --no-project ipython
+uv run --no-project pytest
 ```
diff --git a/README.md b/README.md
index 1e2c2f7..0d95fb8 100644
--- a/README.md
+++ b/README.md
@@ -1,19 +1,21 @@
 # zarr-datafusion-search
 
-This is a prototype for being able to query _metadata_ about Zarr arrays using [DataFusion](https://datafusion.apache.org/), an extensible query engine written in Rust.
+This is a prototype for querying STAC- or CMR-style _metadata_ about Zarr arrays and groups using [DataFusion](https://datafusion.apache.org/), an extensible query engine written in Rust.
 
-## Zarr Schema
+This concept was conceived by the team at [Earthmover](https://www.earthmover.io/) and is outlined in their whitepaper Level 2 Data Collections in Zarr / Icechunk.
 
-In particular, we assume there is a Zarr store with multiple 1-dimensional arrays in root group named `"meta"`.
+## Schema
 
-Users can define arbitrary schemas where the 1-dimensional arrays each use a `dtype` that has an equivalent Arrow type in our supported [mappings](https://github.com/developmentseed/zarr-datafusion-search/issues/12). A concrete example would look like
+To store this _metadata_, zarr-datafusion-search uses a convention where the Zarr store represents each metadata "field" with a 1-dimensional array in a root group named `"meta"`.
+
+Users can define arbitrary schemas where the 1-dimensional arrays each use a `dtype` that has an equivalent Arrow type in our supported [mappings](https://github.com/developmentseed/zarr-datafusion-search/issues/12). A concrete example might look like:
 
 - Inside a Zarr group named `"meta"`
   - A `datetime64[ms]` array named `"date"` with `n` timestamps.
   - A `VariableLengthUTF8` array named `"collection"` with `n` string values.
   - A `VariableLengthBytes` array named `"bbox"` with `n` binary values, where each value is a WKB-encoded Polygon (or MultiPolygon) with the bounding box of that Zarr record.
 
-This data schema may change over time.
+This project is under active development, so these conventions may change.
 
 ## Python API