19 changes: 12 additions & 7 deletions DEVELOP.md
@@ -1,14 +1,19 @@
# Contributing

## Create example Zarr files

From the root directory, run:
## Rust
From the project root, run:

```bash
uv run python scripts/generate_data.py
cargo test
```

This will generate `data/zarr_store.zarr` from the root directory.
A suite of benchmarks is available (though the remote S3 benchmarks use data in a protected bucket and require credentials). The benchmarks live in separate binaries and can be run via

```bash
cargo bench --bench datetime_local
cargo bench --bench bbox_colunms_local
cargo bench --bench bbox_local
```

## Python bindings

@@ -20,8 +25,8 @@ uv run --no-project maturin develop --uv

The `--no-project` flag is necessary to avoid building the Rust code (in release mode) an extra time before we even reach the `maturin develop` command.

You need to add `--no-project` before any `uv run` command. For example, to run IPython:
You need to pass `--no-project` to every `uv run` invocation, before the command being run. For example:

```bash
uv run --no-project ipython
uv run --no-project pytest
```
12 changes: 7 additions & 5 deletions README.md
@@ -1,19 +1,21 @@
# zarr-datafusion-search

This is a prototype for being able to query _metadata_ about Zarr arrays using [DataFusion](https://datafusion.apache.org/), an extensible query engine written in Rust.
This is a prototype for querying STAC- or CMR-style _metadata_ about Zarr arrays and groups using [DataFusion](https://datafusion.apache.org/), an extensible query engine written in Rust.

## Zarr Schema
This concept was conceived by the team at [Earthmover](https://www.earthmover.io/) and is outlined in their whitepaper, _Level 2 Data Collections in Zarr / Icechunk_.

In particular, we assume there is a Zarr store with multiple 1-dimensional arrays in root group named `"meta"`.
## Schema

Users can define arbitrary schemas where the 1-dimensional arrays each use a `dtype` that has an equivalent Arrow type in our supported [mappings](https://github.com/developmentseed/zarr-datafusion-search/issues/12). A concrete example would look like
To store this _metadata_, zarr-datafusion-search uses a convention where the Zarr store represents each metadata "field" with a 1-dimensional array in a root group named `"meta"`.

Users can define arbitrary schemas where the 1-dimensional arrays each use a `dtype` that has an equivalent Arrow type in our supported [mappings](https://github.com/developmentseed/zarr-datafusion-search/issues/12). A concrete example might look like

- Inside a Zarr group named `"meta"`
  - A `datetime64[ms]` array named `"date"` with `n` timestamps.
  - A `VariableLengthUTF8` array named `"collection"` with `n` string values.
  - A `VariableLengthBytes` array named `"bbox"` with `n` binary values, where each value is a WKB-encoded Polygon (or MultiPolygon) with the bounding box of that Zarr record.
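
To make the example schema above more concrete, here is a minimal sketch of how the values for each of the three arrays could be prepared in plain Python. It does not write a Zarr store and is not taken from this project's code: the helper names `bbox_to_wkb` and `to_epoch_ms`, the `meta` dictionary, and the sample values are all hypothetical. It assumes that `datetime64[ms]` values are stored as integer milliseconds since the Unix epoch and that each bounding box is encoded as a little-endian WKB Polygon with a single closed ring.

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt):
    """Hypothetical helper: whole milliseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def bbox_to_wkb(minx, miny, maxx, maxy):
    """Hypothetical helper: encode a bounding box as a WKB Polygon."""
    # One ring with five points; the first point is repeated to close it.
    ring = [
        (minx, miny),
        (maxx, miny),
        (maxx, maxy),
        (minx, maxy),
        (minx, miny),
    ]
    # Header: byte order (1 = little-endian), geometry type (3 = Polygon),
    # number of rings (1).
    wkb = struct.pack("<BII", 1, 3, 1)
    # Number of points in the ring, then each point as two float64 values.
    wkb += struct.pack("<I", len(ring))
    for x, y in ring:
        wkb += struct.pack("<dd", x, y)
    return wkb


# Illustrative values for n = 2 records; each key corresponds to one
# 1-dimensional array in the "meta" group, and all lists have length n.
meta = {
    "date": [
        to_epoch_ms(datetime(2024, 1, 1, tzinfo=timezone.utc)),
        to_epoch_ms(datetime(2024, 1, 2, tzinfo=timezone.utc)),
    ],
    "collection": ["example-collection", "example-collection"],
    "bbox": [
        bbox_to_wkb(-10.0, -5.0, 10.0, 5.0),
        bbox_to_wkb(0.0, 0.0, 1.0, 2.0),
    ],
}
```

Index `i` in each list describes the same record, which is what lets the three 1-dimensional arrays be read together as the columns of one table.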

This data schema may change over time.
This project is under active development, so these conventions may change.

## Python API
