developmentseed · sharkinsspatial · Apr 25, 2026 · Apr 24, 2026 · Apr 24, 2026 · Apr 24, 2026
diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
@@ -0,0 +1,29 @@
+name: Deploy docs
+
+on:
+  push:
+    branches:
+      - main
+  workflow_dispatch:
+
+permissions:
+  contents: write
+
+jobs:
+  deploy:
+    name: Deploy MkDocs to GitHub Pages
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - uses: astral-sh/setup-uv@v5
+        with:
+          enable-cache: true
+
+      - name: Deploy docs
+        env:
+          GIT_COMMITTER_NAME: CI
+          GIT_COMMITTER_EMAIL: ci-bot@example.com
+        run: uv run --group docs mkdocs gh-deploy --force
diff --git a/.gitignore b/.gitignore
@@ -156,6 +156,9 @@ venv.bak/
 # mkdocs documentation
 /site
 
+# Superpowers AI planning docs
+docs/superpowers/
+
 # mypy
 .mypy_cache/
 .dmypy.json

diff --git a/README.md b/README.md
@@ -4,6 +4,17 @@ This is a prototype for querying STAC or CMR style _metadata_ about Zarr arrays
 
 This concept was conceived by the team at [Earthmover](https://www.earthmover.io/) and is outlined in their whitepaper Level 2 Data Collections in Zarr / Icechunk.
 
+## Why
+
+The Earthmover whitepaper outlines several rationales for storing
+_metadata_ in a Zarr store.  The most compelling cases are
+
+- **Heterogeneous Arrays** - With the advent of Virtualizarr we are often representing chunks from source files that we don't control.  For Level 2 and Level 3 datasets like Sentinel 2 this means that virtual Zarr arrays have varying `dtypes`, `codecs` and `crs` values.
+If the source arrays are heterogeneous, they cannot be concatenated along a dimension to form a single datacube.  Because of this we need an alternative to select or discover these arrays other than the normal coordinate or dimensional slicing we use with datacubes.
+
+- **Synchornization** - Our current metadata management solutions (STAC, CMR, ODC) all use disconnected metadata stores which reference raw data assets in object storage.
+This can present problems as systems require complex, fragile orchestration to maintain consistency between metadata indexes and source data.  Using Icechunk as store can alleviate this as array data and metadata updates can be completed in a single atomic transaction.
+
 ## Schema
 
 To store this _metadata_, zarr-datafusion-search uses a convention where the Zarr store represents each metadata "field" with a 1-dimensional array in a root group named `"meta"`.
@@ -28,6 +39,12 @@ In addition, DataFusion-Python supports [_custom table providers_](https://dataf
 > you must use the same version of DataFusion-Python as the version of
 > DataFusion used to compile the custom table provider.
 
+## Installation
+```bash
+uv add zarr-datafusion-search
+```
+
+## Usage
 ```py
 from zarr_datafusion_search import ZarrTable
 from datafusion import SessionContext

diff --git a/docs/DEVELOP.md b/docs/DEVELOP.md
@@ -0,0 +1,36 @@
+# Contributing
+
+## Rust
+
+From the project root, run:
+
+```bash
+cargo test
+```
+
+A suite of benchmarks are available (though the remote S3 benchmarks use data in a
+protected bucket and require credentials). Benchmarks are in separate binaries and
+can be run via:
+
+```bash
+cargo bench --bench datetime_local
+cargo bench --bench bbox_colunms_local
+cargo bench --bench bbox_local
+```
+
+## Python bindings
+
+From the `python/` directory, run:
+
+```bash
+uv run --no-project maturin develop --uv
+```
+
+The `--no-project` flag is necessary to avoid building the Rust code (in release
+mode) an extra time before we even reach the `maturin develop` command.
+
+Prefix all `uv run` commands with `--no-project`. For example:
+
+```bash
+uv run --no-project pytest
+```
diff --git a/docs/api/zarr-table.md b/docs/api/zarr-table.md
@@ -0,0 +1,3 @@
+# ZarrTable
+
+::: zarr_datafusion_search.ZarrTable
diff --git a/docs/architecture/query-pipeline.md b/docs/architecture/query-pipeline.md
@@ -0,0 +1,48 @@
+# Query Pipeline
+
+The library uses [Apache Datafusion](https://datafusion.apache.org/) as the
+foundation of its expression parser and execution planner.  To interact with
+the custom Zarr schema described in [zarr.md](./zarr.md) we use a custom
+Datafusion
+[TableProvider](https://datafusion.apache.org/library-user-guide/custom-table-providers.html)
+
+
+The `ZarrTableProvider` uses several techniques to try and improve filter
+execution performance. There are 2 main drivers of execution cost.
+
+1. Retrieving and deserializing Zarr chunks.
+2. Evaluating expressions against chunks.
+
+#### Non-indexed columns
+In the most basic case, the TableProvider looks at the "columns" used in the
+expression predicate and retrieves each chunk for the metadata arrays of these
+"columns" to build a `RecordBatch` with a schema of just the predicate columns. 
+
+This `RecordBatch` is evaluated against the filter to find matching rows.  If it
+contains matching rows and there are additional projected "columns" in
+expression, the chunk for that column's array is loaded and the indexes for the
+matching rows are selected.
+
+These selected rows are combined and added to the `RecordBatchStream`. 
+
+![](../diagrams/chunk_scanning.png)
+
+#### Indexed columns
+While the approach described above can be highly performant for many types of
+columns and expressions, it can be slow for `dtypes` that are slow to
+deserialize (variable length types) or when the predicate evaluation is
+expensive (geodatafusion functions).
+
+In these cases we want to use additional metadata or materialized indexes to
+1. Reduce the number of chunks we retrieve and deserialize.
+2. Reduce the number of values we check against the filter predicate.
+
+Currently we support using [geo-index](https://github.com/georust/geo-index)
+R-tree indexes. In this case the "basic" pipeline is short circuited.  The
+`TableProvder` checks if the "column" in the predicate has an index.  If an
+index exists, it uses it find the "rows" that satisfy the predicate.  The
+`TableProvider` fetches only the chunks for those rows and then only decodes
+those portions of the chunks.  The filter predicate is never used against the
+raw values.
+
+![](../diagrams/index_scanning.png)
diff --git a/docs/architecture/zarr.md b/docs/architecture/zarr.md
@@ -0,0 +1,55 @@
+# Zarr Structure
+
+`zarr-datafusion-search` uses a set of 1D arrays to store metadata,
+treating them as if they were columns in a columnar storage format like
+Parquet. 
+
+#### Schema
+The library requires a special group named `meta`.  It discovers each array stored in the `meta` group and
+maps its Zarr v3 `dtype` to an Arrow type to build an
+[Arrow schema](https://arrow.apache.org/cookbook/py/schema.html). These are the
+current supported type mappings.
+
+## Supported dtype mappings
+
+| Zarr v3 dtype | Arrow type | Notes |
+|---|---|---|
+| `bool` | `Boolean` | |
+| `int8` | `Int8` | |
+| `int16` | `Int16` | |
+| `int32` | `Int32` | |
+| `int64` | `Int64` | |
+| `uint8` | `UInt8` | |
+| `uint16` | `UInt16` | |
+| `uint32` | `UInt32` | |
+| `uint64` | `UInt64` | |
+| `float16` | `Float16` | |
+| `float32` | `Float32` | |
+| `float64` | `Float64` | |
+| `bytes` | `BinaryView` | When the field name is `bbox`, mapped to WKB with EPSG:4326 CRS via the GeoArrow extension type instead |
+| `r<N>` (raw bits) | `BinaryView` | |
+| `string` | `Utf8View` | |
+| `numpy.datetime64[s]` | `Timestamp(Second, None)` | |
+| `numpy.datetime64[ms]` | `Timestamp(Millisecond, None)` | |
+| `numpy.datetime64[us]` | `Timestamp(Microsecond, None)` | |
+| `numpy.datetime64[ns]` | `Timestamp(Nanosecond, None)` | |
+| `complex64` | — | Not supported |
+| `complex128` | — | Not supported |
+
+### Chunking
+Because the `meta` arrays are combined by Datafusion into a single schema they
+need to maintain chunk alignment.  This means that during array creation time,
+the chunk size must be the same for each array.  And as data is appended, it needs to be appended to each array simultaneously.
+Because the chunks are aligned across arrays, the library can treat the combined chunks similar to a Parquet row group.
+
+This diagram demonstrates the chunk alignment
+![](../diagrams/chunk_scanning.png)
+
+### Indexes
+`zarr-datafusion-search` supports the optional use of materialized indexes to improve
+scanning performance.  We use the following convention, if a `meta`
+array is used in the filter predicate `zarr-datafusion-search` will look for a
+group called `indexes` and search for an array of the same name as the
+predicate array.  Currently the only index type supported are R-tree indexes
+generated by the [geo-index](https://github.com/georust/geo-index)
+library but we will continue to expand index support.
diff --git a/docs/assets/ds_favicon.png b/docs/assets/ds_favicon.png
diff --git a/docs/assets/logo_no_text.png b/docs/assets/logo_no_text.png
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,3 @@
		# ZarrTable

		::: zarr_datafusion_search.ZarrTable