Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 29 additions & 0 deletions .github/workflows/docs.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
name: Deploy docs

on:
push:
branches:
- main
workflow_dispatch:

permissions:
contents: write

jobs:
deploy:
name: Deploy MkDocs to GitHub Pages
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0

- uses: astral-sh/setup-uv@v5
with:
enable-cache: true

- name: Deploy docs
env:
GIT_COMMITTER_NAME: CI
GIT_COMMITTER_EMAIL: ci-bot@example.com
run: uv run --group docs mkdocs gh-deploy --force
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -156,6 +156,9 @@ venv.bak/
# mkdocs documentation
/site

# Superpowers AI planning docs
docs/superpowers/

# mypy
.mypy_cache/
.dmypy.json
Expand Down
17 changes: 17 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,17 @@ This is a prototype for querying STAC or CMR style _metadata_ about Zarr arrays

This concept was conceived by the team at [Earthmover](https://www.earthmover.io/) and is outlined in their whitepaper Level 2 Data Collections in Zarr / Icechunk.

## Why

The Earthmover whitepaper outlines several rationales for storing
_metadata_ in a Zarr store. The most compelling cases are

- **Heterogeneous Arrays** - With the advent of Virtualizarr we are often representing chunks from source files that we don't control. For Level 2 and Level 3 datasets like Sentinel 2 this means that virtual Zarr arrays have varying `dtypes`, `codecs` and `crs` values.
If the source arrays are heterogeneous, they cannot be concatenated along a dimension to form a single datacube. Because of this we need an alternative to select or discover these arrays other than the normal coordinate or dimensional slicing we use with datacubes.

- **Synchornization** - Our current metadata management solutions (STAC, CMR, ODC) all use disconnected metadata stores which reference raw data assets in object storage.
This can present problems as systems require complex, fragile orchestration to maintain consistency between metadata indexes and source data. Using Icechunk as store can alleviate this as array data and metadata updates can be completed in a single atomic transaction.

## Schema

To store this _metadata_, zarr-datafusion-search uses a convention where the Zarr store represents each metadata "field" with a 1-dimensional array in a root group named `"meta"`.
Expand All @@ -28,6 +39,12 @@ In addition, DataFusion-Python supports [_custom table providers_](https://dataf
> you must use the same version of DataFusion-Python as the version of
> DataFusion used to compile the custom table provider.

## Installation
```bash
uv add zarr-datafusion-search
```

## Usage
```py
from zarr_datafusion_search import ZarrTable
from datafusion import SessionContext
Expand Down
36 changes: 36 additions & 0 deletions docs/DEVELOP.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
# Contributing

## Rust

From the project root, run:

```bash
cargo test
```

A suite of benchmarks are available (though the remote S3 benchmarks use data in a
protected bucket and require credentials). Benchmarks are in separate binaries and
can be run via:

```bash
cargo bench --bench datetime_local
cargo bench --bench bbox_colunms_local
cargo bench --bench bbox_local
```

## Python bindings

From the `python/` directory, run:

```bash
uv run --no-project maturin develop --uv
```

The `--no-project` flag is necessary to avoid building the Rust code (in release
mode) an extra time before we even reach the `maturin develop` command.

Prefix all `uv run` commands with `--no-project`. For example:

```bash
uv run --no-project pytest
```
3 changes: 3 additions & 0 deletions docs/api/zarr-table.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
# ZarrTable

::: zarr_datafusion_search.ZarrTable
48 changes: 48 additions & 0 deletions docs/architecture/query-pipeline.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
# Query Pipeline

The library uses [Apache Datafusion](https://datafusion.apache.org/) as the
foundation of its expression parser and execution planner. To interact with
the custom Zarr schema described in [zarr.md](./zarr.md) we use a custom
Datafusion
[TableProvider](https://datafusion.apache.org/library-user-guide/custom-table-providers.html)


The `ZarrTableProvider` uses several techniques to try and improve filter
execution performance. There are 2 main drivers of execution cost.

1. Retrieving and deserializing Zarr chunks.
2. Evaluating expressions against chunks.

#### Non-indexed columns
In the most basic case, the TableProvider looks at the "columns" used in the
expression predicate and retrieves each chunk for the metadata arrays of these
"columns" to build a `RecordBatch` with a schema of just the predicate columns.

This `RecordBatch` is evaluated against the filter to find matching rows. If it
contains matching rows and there are additional projected "columns" in
expression, the chunk for that column's array is loaded and the indexes for the
matching rows are selected.

These selected rows are combined and added to the `RecordBatchStream`.

![](../diagrams/chunk_scanning.png)

#### Indexed columns
While the approach described above can be highly performant for many types of
columns and expressions, it can be slow for `dtypes` that are slow to
deserialize (variable length types) or when the predicate evaluation is
expensive (geodatafusion functions).

In these cases we want to use additional metadata or materialized indexes to
1. Reduce the number of chunks we retrieve and deserialize.
2. Reduce the number of values we check against the filter predicate.

Currently we support using [geo-index](https://github.com/georust/geo-index)
R-tree indexes. In this case the "basic" pipeline is short circuited. The
`TableProvder` checks if the "column" in the predicate has an index. If an
index exists, it uses it find the "rows" that satisfy the predicate. The
`TableProvider` fetches only the chunks for those rows and then only decodes
those portions of the chunks. The filter predicate is never used against the
raw values.

![](../diagrams/index_scanning.png)
55 changes: 55 additions & 0 deletions docs/architecture/zarr.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
# Zarr Structure

`zarr-datafusion-search` uses a set of 1D arrays to store metadata,
treating them as if they were columns in a columnar storage format like
Parquet.

#### Schema
The library requires a special group named `meta`. It discovers each array stored in the `meta` group and
maps its Zarr v3 `dtype` to an Arrow type to build an
[Arrow schema](https://arrow.apache.org/cookbook/py/schema.html). These are the
current supported type mappings.

## Supported dtype mappings

| Zarr v3 dtype | Arrow type | Notes |
|---|---|---|
| `bool` | `Boolean` | |
| `int8` | `Int8` | |
| `int16` | `Int16` | |
| `int32` | `Int32` | |
| `int64` | `Int64` | |
| `uint8` | `UInt8` | |
| `uint16` | `UInt16` | |
| `uint32` | `UInt32` | |
| `uint64` | `UInt64` | |
| `float16` | `Float16` | |
| `float32` | `Float32` | |
| `float64` | `Float64` | |
| `bytes` | `BinaryView` | When the field name is `bbox`, mapped to WKB with EPSG:4326 CRS via the GeoArrow extension type instead |
| `r<N>` (raw bits) | `BinaryView` | |
| `string` | `Utf8View` | |
| `numpy.datetime64[s]` | `Timestamp(Second, None)` | |
| `numpy.datetime64[ms]` | `Timestamp(Millisecond, None)` | |
| `numpy.datetime64[us]` | `Timestamp(Microsecond, None)` | |
| `numpy.datetime64[ns]` | `Timestamp(Nanosecond, None)` | |
| `complex64` | — | Not supported |
| `complex128` | — | Not supported |

### Chunking
Because the `meta` arrays are combined by Datafusion into a single schema they
need to maintain chunk alignment. This means that during array creation time,
the chunk size must be the same for each array. And as data is appended, it needs to be appended to each array simultaneously.
Because the chunks are aligned across arrays, the library can treat the combined chunks similar to a Parquet row group.

This diagram demonstrates the chunk alignment
![](../diagrams/chunk_scanning.png)

### Indexes
`zarr-datafusion-search` supports the optional use of materialized indexes to improve
scanning performance. We use the following convention, if a `meta`
array is used in the filter predicate `zarr-datafusion-search` will look for a
group called `indexes` and search for an array of the same name as the
predicate array. Currently the only index type supported are R-tree indexes
generated by the [geo-index](https://github.com/georust/geo-index)
library but we will continue to expand index support.
Binary file added docs/assets/ds_favicon.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/assets/logo_no_text.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Loading