2 changes: 2 additions & 0 deletions executive-summary.qmd
@@ -8,6 +8,8 @@ This report assesses the current state and future direction of virtual store tec

## Summary of recommendations and current state

The performance of legacy scientific data formats is poor in a cloud environment, yet reprocessing or copying the full NASA archive is not feasible. Virtual stores address both problems: they provide cloud-optimized access via Zarr to existing data, without duplication.

Virtual store technology is ready for production use with consistently gridded data and is actively being developed for more complex data types.

Key recommendations include:
11 changes: 7 additions & 4 deletions index.qmd
@@ -1,6 +1,6 @@
---
title: "Virtual Stores at NASA"
subtitle: "Unifying access to NASA datasets"
subtitle: "Cloud-native access to archival formats"
authors: Aimee Barciauskas, Ed Armstrong, Amy Steiker, Owen Littlejohns, Daniel Kaufman, Chris Battisto, Hailiang Zhang, Christine Smit, Jack McNelis, Luis Lopez, Siddharth Chaudhary, Joseph H. Kennedy, Kim Fairbanks
title-block-style: plain
---
@@ -10,16 +10,19 @@ This report outlines the current state of virtual stores at NASA. It's intended
**If you are a program lead,** start here with the main points about virtual stores, then visit the [Executive Summary](./executive_summary.qmd) and [Recommendations](./recommendations.qmd). From there, you will likely also want to note the [Limitations](./limitations.qmd) and [Governance](./governance.qmd) sections.
**If you are looking to understand virtual stores at NASA on a more technical level,** please visit the Technical Aspects sections. The [Resources](./resources.qmd) may also be of interest.

## Vision: NASA datasets accessible through a single entrypoint
## Vision: Unlocking cloud-native access to archival formats

<img style="height: 150px; margin: 0px auto; display: block" alt="Simple Virtual Zarr Graphic" src="./graphics/simple-virtual-zarr.svg" />

Virtual stores deliver a single entrypoint to a dataset comprised of many files. For NASA datasets this enables:
Legacy scientific data formats perform poorly in cloud environments because they were designed for local disk access rather than network-based access. Cloud-optimized formats like Zarr, COG, and cloud-optimized HDF5 are designed for network access, but reprocessing or duplicating NASA's archives into new formats is not on NASA's roadmap due to the scale of historical data and the continued need to preserve self-describing, standalone archival files. Virtual stores bridge this gap by enabling efficient cloud-based access to existing archives without duplicating the underlying data.

A key benefit is that virtual stores also deliver a **single entrypoint** to datasets composed of many files. Together, these properties enable:

* Better performance and reduced costs as less data — only the data the user needs — is sent over the internet.
* Less pre-processing to be "analysis-ready".
* Users do not have to know about the underlying data format or storage location.
* Greater interoperability through a common API for reading, writing and analyzing complex and heterogeneous NASA datasets.
* Better performance and reduced costs as less data -- only the data the user needs -- is sent over the internet.
* Virtual stores are the efficient starting point for new products, as pre-identified chunk indices simplify any rechunking or regridding.

## What are virtual stores and what do they enable?

30 changes: 26 additions & 4 deletions limitations.qmd
@@ -2,13 +2,30 @@
title: "Known Limitations"
---

## Language and ecosystem constraints
## Early structural decisions have lasting performance consequences

Icechunk is written in Rust with an API in Python. Users and data providers working in other languages (Julia, R, Java, etc.) may face limited or no support for reading and writing Icechunk stores.
Virtual data stores (VDS) depend upon the chunks of the underlying files. Files' internal chunk structure, and consequently chunk manifests, cannot be optimized simultaneously for all use cases. Chunk shape and size directly determine which access patterns are fast and which are slow. Early structural decisions will benefit some access patterns and disadvantage others. There is no universally optimal chunking, only a least-worst one for the most common use case.

## Early structural decisions have lasting performance consequences
### Chunk size

Chunks that are too small cause excessive HTTP requests and per-chunk decompression overhead. Chunks that are too large transfer more data than needed.

For a more thorough explanation, see [Datacube Guide: Tiny data chunks](https://developmentseed.org/datacube-guide/worst-practices/tiny-chunks.html) and [Datacube Guide: Massive data chunks](https://developmentseed.org/datacube-guide/worst-practices/massive-chunks.html).
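
The tradeoff can be made concrete with a back-of-the-envelope sketch. The chunk shapes, the requested window, and the `read_cost` helper below are hypothetical and chosen only for illustration; the sketch assumes uncompressed 4-byte values and one HTTP request per chunk touched.

```python
import math

def read_cost(selection, chunk_shape, itemsize=4):
    """Return (requests, bytes_transferred) for reading an index box.

    selection: list of (start, stop) index pairs, one per dimension.
    chunk_shape: chunk length along each dimension.
    Assumes uncompressed chunks and one request per chunk touched.
    """
    requests = 1
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size
        last = (stop - 1) // size
        requests *= last - first + 1
    chunk_bytes = math.prod(chunk_shape) * itemsize
    return requests, requests * chunk_bytes

# Hypothetical request: a 100 x 100 pixel window from one timestep.
window = [(0, 1), (0, 100), (0, 100)]
needed_bytes = 1 * 100 * 100 * 4

tiny = read_cost(window, chunk_shape=(1, 10, 10))       # many small requests
huge = read_cost(window, chunk_shape=(1, 3600, 7200))   # one oversized request

print(f"needed: {needed_bytes} bytes")
print(f"tiny chunks: {tiny[0]} requests, {tiny[1]} bytes")
print(f"huge chunks: {huge[0]} requests, {huge[1]} bytes")
```

Under these assumptions, the tiny chunks fetch exactly the bytes needed but spread them over 100 requests, while the single huge chunk needs one request but transfers roughly 104 MB to deliver 40 KB.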

### Chunk shape

If a set of files has a chunk shape optimized for spatial access, it cannot simultaneously be optimized for access across the time dimension (i.e., time series).

A useful analogy: pancakes vs. churros.

File chunking and chunk manifests cannot simultaneously optimize for all use cases. Further, chunk manifests depend on the chunking already inherent to the files. You cannot create a chunk manifest to access a unit smaller than chunks in the underlying files. The implication is, for example, if a set of files is optimized for spatial access it cannot simultaneously be optimized for access across the time dimension (i.e. time series).
* A pancake chunk holds the full spatial extent at one timestep. Loading a global snapshot is fast because it's all in one chunk. Time series are slow because each timestep is stored in a separate chunk.
* A churro chunk holds many timesteps for a small spatial location. Time series are fast for a spatial subset, but global views are slow.

![pancakes and churros](./graphics/pancake_vs_churro_chunking_v2.svg)
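
The analogy can also be expressed as a count of chunks touched per read. This is a minimal sketch assuming a hypothetical daily global grid with dimensions (time, lat, lon) = (365, 180, 360); the chunk shapes and the `chunks_touched` helper are illustrative and not taken from any NASA product.

```python
import math

def chunks_touched(read_shape, chunk_shape):
    """Number of chunks intersecting a read that starts at the array origin."""
    return math.prod(math.ceil(r / c) for r, c in zip(read_shape, chunk_shape))

# Hypothetical chunk layouts for a (365, 180, 360) array.
pancake = (1, 180, 360)    # full spatial extent, one timestep per chunk
churro = (365, 10, 10)     # all timesteps, small spatial tile per chunk

# Two access patterns.
snapshot = (1, 180, 360)   # global map at one timestep
series = (365, 1, 1)       # one pixel across the whole year

counts = {
    ("pancake", "snapshot"): chunks_touched(snapshot, pancake),
    ("pancake", "series"): chunks_touched(series, pancake),
    ("churro", "snapshot"): chunks_touched(snapshot, churro),
    ("churro", "series"): chunks_touched(series, churro),
}

for (layout, pattern), n in counts.items():
    print(f"{layout:8s} {pattern:9s} {n} chunks")
```

Each layout reads its preferred pattern from a single chunk and pays hundreds of chunk reads for the other one, which is why no single chunk shape serves both.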

This is a real problem in practice: many datasets store one file per timestep, which makes data collection straightforward but is not optimized for time series access.

VDSs are often built after data product decisions have already been made. What you can still control is the manifest, where you can choose which variables are represented and how they are combined. For examples, see [Virtual Stores at NASA](./nasa-applications.qmd).

## Chunk sizes must be consistent across files

@@ -41,5 +58,10 @@ If the GeoZarr community coalesce into a solution, it will need to be be implemen

Opening a virtual store backed by NASA data currently requires steps beyond standard Earthdata Login, specifically S3 credential configuration and tool-specific API calls to open the store before any data is accessed. Friction exists because virtual stores sit at the intersection of several authentication boundaries: the store itself (which may be in a public or protected S3 bucket), the source data files the store references (which typically require Earthdata Login credentials), and services (which may have their own authentication interfaces).

## Language and ecosystem constraints

Icechunk is written in Rust with an API in Python. Users and data providers working in other languages (Julia, R, Java, etc.) may face limited or no support for reading and writing Icechunk stores. Rust presents an organizational risk similar to what NASA has experienced with niche languages in other systems: supporting and extending Icechunk long-term would require NASA staff or contractors with Rust expertise, which is not yet widely available in the earth science community. Rust is seeing broader general adoption than some past niche languages, which reduces but does not eliminate this risk.


Until this is simplified to something comparable to the experience `earthaccess` provides for direct file access, credential complexity will remain a practical barrier to adoption — particularly for researchers who are not cloud-infrastructure specialists.

29 changes: 10 additions & 19 deletions recommendations.qmd
@@ -2,30 +2,17 @@
title: "Recommendations"
---

## Design files, chunks, and aggregated chunk manifests around typical use case patterns
## Design file chunk structure around typical access patterns

The most important design input for a new data product is understanding how the data will most commonly be accessed.
With virtual data stores (VDS), you cannot change the chunks of the underlying files. This matters because more chunks along a dimension means slower access across that dimension.

A useful analogy: pancakes vs. churros.

* A pancake chunk holds the full spatial extent at one timestep. Loading a global snapshot is fast because it's all in one chunk. Time series are slow because each timestep is stored in a separate chunk.
* A churro chunk holds many timesteps for a small spatial location. Time series are fast for a spatial subset, but global views are slow.

![pancakes and churros](./graphics/pancake_vs_churro_chunking_v2.svg)

This is a real problem in practice: many datasets store one file per timestep, which makes data collection straightforward but is not optimized for time series access.

Chunk shape should be a deliberate design decision, made with awareness of the tradeoffs.

VDSs are often built after data product decisions have already been made. What you can still control is the manifest, where you can make changes to what variables are represented and in what composition. For examples, see [Virtual Stores at NASA](./nasa-applications.qmd).
While the focus of this document is virtual data stores, it is worth mentioning data product design decisions, since those decisions impact VDS performance. As noted in the [Limitations section](limitations.qmd#early-structural-decisions-have-lasting-performance-consequences), virtual data stores depend on the chunk structure of the underlying files. That's why it is recommended to design files with target access patterns in mind (chunk for access, not storage).

## Adopt icechunk

[Icechunk](https://icechunk.io/) should be adopted, but with risk mitigation measures. Icechunk is a transactional storage engine for Zarr. In other words, it lets you manage Zarr stores the way you would manage data in many traditional databases. Icechunk technology supports the following operational needs of many NASA datasets:

* Safety: Changes to a store can be made safely through ACID transactions which ensure all dependent updates are either committed together or rolled back together. Corrupted data can easily be fixed by rolling back to a previous snapshot.
* Stability: Some production workflows may depend on Icechunk stores and pointing to a specific version ensures stability.
* **Incremental updating**: Icechunk is the only technology that supports safely appending new data to an existing virtual store — critical for active missions that continuously produce new granules. Without it, the alternatives are rebuilding the entire store on each update or accepting the risk of metadata falling out of sync with the data it describes.
* **Safety**: Changes to a store are made through ACID transactions, which ensure that all dependent updates (data and metadata) are committed together or rolled back together. This means a store will never be in a partially-updated state — corrupted data can be fixed by rolling back to a previous snapshot.
* **Reproducibility**: An Icechunk store can be pinned to a specific snapshot, so science workflows that depend on a particular version of the data are not broken by subsequent updates. Snapshots can be tagged for long-term reference.
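
The three properties above can be illustrated with a toy in-memory model. This is not Icechunk's API or implementation: `ToyStore`, its methods, and the `validate` callback are hypothetical names used only to show the all-or-nothing commit and snapshot-pinning behavior described in the list.

```python
class ToyStore:
    """Toy snapshot store: every commit produces a new immutable snapshot."""

    def __init__(self):
        self.snapshots = [{}]   # snapshot 0 is the empty store
        self.tags = {}

    def commit(self, updates, validate=lambda state: True):
        """Apply all updates together and return the new snapshot id.

        If validation fails, nothing is recorded (all-or-nothing).
        """
        candidate = {**self.snapshots[-1], **updates}
        if not validate(candidate):
            raise ValueError("commit rejected; store left unchanged")
        self.snapshots.append(candidate)
        return len(self.snapshots) - 1

    def read(self, snapshot_id=None):
        """Read the latest state, or a pinned snapshot if an id is given."""
        index = -1 if snapshot_id is None else snapshot_id
        return dict(self.snapshots[index])

    def tag(self, name, snapshot_id):
        """Give a snapshot a long-lived name."""
        self.tags[name] = snapshot_id


store = ToyStore()

# Incremental updating: append one granule's data and metadata together.
v1 = store.commit({"chunk/0": "granule-001", "meta/time_len": 1})
store.tag("release-1", v1)
v2 = store.commit({"chunk/1": "granule-002", "meta/time_len": 2})

# Safety: an update whose metadata disagrees with the data is rejected whole.
def consistent(state):
    n_chunks = sum(key.startswith("chunk/") for key in state)
    return state["meta/time_len"] == n_chunks

try:
    store.commit({"chunk/2": "granule-003"}, validate=consistent)
except ValueError:
    pass  # the store still holds exactly the state committed as v2

# Reproducibility: the tagged snapshot is unaffected by later commits.
pinned = store.read(store.tags["release-1"])
latest = store.read()
```

In this toy model, the rejected third commit leaves no trace in the store, and a reader pinned to `release-1` keeps seeing a single granule even after the second one is appended.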

Reference: [https://icechunk.io/en/stable/overview/](https://icechunk.io/en/stable/overview/).

@@ -39,7 +26,11 @@ More specifically, NASA should work on:

## Adopt GeoZarr standards

Adoption of [GeoZarr](https://geozarr.org/) is recommended to ensure interoperability with the developing GeoZarr ecosystem of tooling. GeoZarr is a specification for standardizing geospatial metadata in Zarr stores. Adopting these standards in VDSs is straightforward as they already implement Zarr metadata.
Adoption of [GeoZarr](https://geozarr.org/) is recommended to ensure interoperability with the developing GeoZarr ecosystem of tooling. GeoZarr is to Zarr what GeoTIFF is to TIFF: Zarr has no built-in concept of coordinate reference systems or multi-resolution data for use in tiling and overviews. GeoZarr defines conventions for these so that geospatial tools can work with Zarr stores without each tool implementing its own custom metadata interpretation. Adopting these standards in VDSs is straightforward as they already implement Zarr metadata.

Without standards, tooling must be customized per-dataset. GeoZarr enabled development of the [deck.gl-zarr](https://github.com/developmentseed/deck.gl-raster/tree/main/packages/deck.gl-zarr) library, which renders any GeoZarr-compliant dataset in the browser.

The main alternatives are CF conventions (used by NetCDF/HDF5) and STAC (for discovery-level metadata). GeoZarr draws from CF conventions but is designed specifically for Zarr, making it the natural choice for Zarr-based virtual stores.

## Leverage existing tools, services and available chunk metadata

2 changes: 1 addition & 1 deletion technical-overview.qmd
@@ -6,7 +6,7 @@ title: "Technical Overview of Virtual Stores"

![](graphics/virtualizarr_pipeline.svg)

* [**Zarr**](https://zarr.dev/) is a chunked, compressed multi-dimensional array specification. Zarr is designed for cloud-native (network-addressable chunks in object storage) access. Zarr is the proposed specification for virtual stores at NASA.
* [**Zarr**](https://zarr.dev/) is a chunked, compressed multi-dimensional array specification. Zarr is designed for cloud-native (network-addressable chunks in object storage) access. Zarr is the proposed specification for virtual stores at NASA. Unlike Cloud-Optimized GeoTIFF (COG), which is optimized for 2D raster imagery but does not generalize to multi-dimensional scientific arrays, or cloud-optimized HDF5, which improves on legacy HDF5 but still relies on HDF5 library internals and byte-range seeking within a single file, Zarr stores each chunk as a separate object in cloud storage. This design enables highly parallel reads, straightforward scaling, and native compatibility with object storage APIs without specialized client libraries.
* **Chunk manifests** are lightweight metadata structures that describe a mapping from logical data space to where that data is stored. Another term for chunk manifests is an "indirection layer". Icechunk, Kerchunk, and DMR++ are common implementations of chunk manifests.
* [**Icechunk**](https://icechunk.io/) is a transactional storage engine for Zarr arrays. Icechunk stores chunk manifests which it calls [virtual datasets](https://icechunk.io/en/stable/virtual/).
* [**Kerchunk**](https://fsspec.github.io/kerchunk/) is an early approach to creating chunk manifests (what it calls reference files) which maps virtual Zarr array coordinates to byte ranges in existing files using the JSON or Parquet file formats for persistence.
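
The indirection a chunk manifest provides can be sketched in a few lines. This is a simplified illustration, not the on-disk format of Icechunk, Kerchunk, or DMR++: the bucket, file names, offsets, and lengths are made up, and the dot-separated chunk keys are an assumed convention for the example.

```python
# Hypothetical manifest: chunk key -> (file location, byte offset, byte length).
manifest = {
    "0.0.0": ("s3://example-bucket/granule_day1.nc", 4096, 2_000_000),
    "1.0.0": ("s3://example-bucket/granule_day2.nc", 4096, 1_950_000),
}

def chunk_key(index, chunk_shape):
    """Map an array index to the key of the chunk containing it."""
    return ".".join(str(i // c) for i, c in zip(index, chunk_shape))

def range_request(index, chunk_shape):
    """Return (location, HTTP Range header value) for the chunk holding index."""
    location, offset, length = manifest[chunk_key(index, chunk_shape)]
    # HTTP byte ranges are inclusive at both ends.
    return location, f"bytes={offset}-{offset + length - 1}"

# One timestep per file, each file holding a single (1, 180, 360) chunk.
location, byte_range = range_request((1, 90, 180), chunk_shape=(1, 180, 360))
print(location, byte_range)
```

A reader resolves a logical array index to a chunk key, looks the key up in the manifest, and issues a single byte-range request against the original archival file, so the source data never has to be copied or rewritten.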