Commit 15d8e25

changes to address feedback (#52)
* changes to address feedback
* Add comment about deck.gl-zarr
* Rephrase recommendation on chunk structure
* Add benefit of virtual stores for new products
  Added a point about virtual stores simplifying rechunking or regridding.
* Enhance description of cloud-native archival access
  Revised explanation of legacy scientific data formats and their performance in cloud environments. Clarified the role of virtual stores in providing cloud-optimized access without data duplication.
* Move language and ecosystem constraints section to bottom of limitations
* Replace reprocessing with duplicating
1 parent 740c87b commit 15d8e25

5 files changed

Lines changed: 46 additions & 28 deletions

executive-summary.qmd

Lines changed: 2 additions & 0 deletions
@@ -8,6 +8,8 @@ This report assesses the current state and future direction of virtual store tec
 
 ## Summary of recommendations and current state
 
+The performance of legacy scientific data formats is poor in a cloud environment, yet reprocessing or copying the full NASA archive is not feasible. Virtual stores address both problems: they provide cloud-optimized access via Zarr to existing data, without duplication.
+
 Virtual store technology is ready for production use with consistently gridded data and is actively being developed for more complex data types.
 
 Key recommendations include:

index.qmd

Lines changed: 7 additions & 4 deletions
@@ -1,6 +1,6 @@
 ---
 title: "Virtual Stores at NASA"
-subtitle: "Unifying access to NASA datasets"
+subtitle: "Cloud-native access to archival formats"
 authors: Aimee Barciauskas, Ed Armstrong, Amy Steiker, Owen Littlejohns, Daniel Kaufman, Chris Battisto, Hailiang Zhang, Christine Smit, Jack McNelis, Luis Lopez, Siddharth Chaudhary, Joseph H. Kennedy, Kim Fairbanks
 title-block-style: plain
 ---
@@ -10,16 +10,19 @@ This report outlines the current state of virtual stores at NASA. It's intended
 **If you are a program lead,** it is suggested to start here with the main points about virtual stores. Then it is recommended to visit the [Executive Summary](./executive_summary.qmd) and [Recommendations](./recommendations.qmd). From there, you will likely want to also make note of the [Limitations](./limitations.qmd) and [Governance](./governance.qmd) sections.
 **If you are looking to understand virtual stores at NASA on a more technical level,**, please visit the Technical Aspects sections. The [Resources](./resources.qmd) may also be of interest.
 
-## Vision: NASA datasets accessible through a single entrypoint
+## Vision: Unlocking cloud-native access to archival formats
 
 <img style="height: 150px; margin: 0px auto; display: block" alt="Simple Virtual Zarr Graphic" src="./graphics/simple-virtual-zarr.svg" />
 
-Virtual stores deliver a single entrypoint to a dataset comprised of many files. For NASA datasets this enables:
+Legacy scientific collections perform poorly in cloud environments because they were designed for local disk access rather than network-based access. Cloud-optimized formats like Zarr, COG, and cloud-optimized HDF5 address network-access-optimization, but reprocessing or duplicating NASA’s archives into new formats is not on NASA's roadmap due to the scale of historical data and the continued need to preserve self-describing, standalone archival files. Virtual stores bridge this gap by enabling efficient cloud-based access to existing archives without duplicating the underlying data.
 
+A key benefit is that virtual stores also deliver a **single entrypoint** to datasets comprised of many files. Together, this enables:
+
+* Better performance and reduced costs as less data — only the data the user needs — is sent over the internet.
 * Less pre-processing to be "analysis-ready".
 * Users do not have to know about the underlying data format or storage location.
 * Greater interoperability through a common API for reading, writing and analyzing complex and heterogeneous NASA datasets.
-* Better performance and reduced costs as less data -- only the data the user needs -- is sent over the internet.
+* Virtual stores are the efficient starting point for new products, as pre-identified chunk indices simplify any rechunking or regridding.
 
 ## What are virtual stores and what do they enable?

limitations.qmd

Lines changed: 26 additions & 4 deletions
@@ -2,13 +2,30 @@
 title: "Known Limitations"
 ---
 
-## Language and ecosystem constraints
+## Early structural decisions have lasting performance consequences
 
-Icechunk is written in Rust with an API in Python. Users and data providers working in other languages (Julia, R, Java, etc.) may face limited or no support for reading and writing Icechunk stores.
+Virtual data stores (VDS) depend upon the chunks of the underlying files. Files' internal chunk structure, and consequently chunk manifests, cannot be optimized simultaneously for all use cases. Chunk shape and size directly determines which access patterns are fast and which are slow. Early structural decisions will benefit some access patterns and disadvantage others. There is no universally optimal chunking, only a least-worst one for the most common use case.
 
-## Early structural decisions have lasting performance consequences
+### Chunk size
+
+Chunks that are too small cause excessive HTTP requests and computational overhead to decompress. Chunks that are too large transfer more data than needed.
+
+For a more thorough explanation, see [Datacube Guide: Tiny data chunks](https://developmentseed.org/datacube-guide/worst-practices/tiny-chunks.html) and [Datacube Guide: Massive data chunks](https://developmentseed.org/datacube-guide/worst-practices/massive-chunks.html).
+
+### Chunk shape
+
+If a set of files has a chunk shape to optimize for spatial access it cannot simultaneously be optimized for access across the time dimension (i.e. time series).
+
+A useful analogy: pancakes vs. churros.
 
-File chunking and chunk manifests cannot simultaneously optimize for all use cases. Further, chunk manifests depend on the chunking already inherent to the files. You cannot create a chunk manifest to access a unit smaller than chunks in the underlying files. The implication is, for example, if a set of files is optimized for spatial access it cannot simultaneously be optimized for access across the time dimension (i.e. time series).
+* A pancake chunk holds the full spatial extent at one timestep. Loading a global snapshot is fast because it's all in one chunk. Time series are slow because each timestep is stored in a separate chunk.
+* A churro chunk holds many timesteps for a small spatial location. Time series are fast for a spatial subset, but global views are slow.
+
+![pancakes and churros](./graphics/pancake_vs_churro_chunking_v2.svg)
+
+This is a real problem in practice: many datasets store one file per timestep, which makes data collection straight-forward but is not optimized for time series access.
+
+VDSs are often built after data product decisions have already been made. What you can still control is the manifest, where you can make changes to what variables are represented and in what composition. For examples, see [Virtual Stores at NASA](./nasa-applications.qmd).
 
 ## Chunk sizes must be consistent across files
 
@@ -41,5 +58,10 @@ If the GeoZarr community coalesce into a solution, it will need to be be impleme
 
 Opening a virtual store backed by NASA data currently requires steps beyond standard Earthdata Login, specifically S3 credential configuration and tool-specific API calls to open the store before any data is accessed. Friction exists because virtual stores sit at the intersection of several authentication boundaries: the store itself (which may be in a public or protected S3 bucket), the source data files the store references (which typically require Earthdata Login credentials), and services (which may have their own authentication interfaces).
 
+## Language and ecosystem constraints
+
+Icechunk is written in Rust with an API in Python. Users and data providers working in other languages (Julia, R, Java, etc.) may face limited or no support for reading and writing Icechunk stores. Rust presents an organizational risk similar to what NASA has experienced with niche languages in other systems: supporting and extending Icechunk long-term would require NASA staff or contractors with Rust expertise, which is not yet widely available in the earth science community. Rust is seeing broader general adoption than some past niche languages, which reduces but does not eliminate this risk.
+
+
 Until this is simplified to something comparable to the experience `earthaccess` provides for direct file access, credential complexity will remain a practical barrier to adoption — particularly for researchers who are not cloud-infrastructure specialists.

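The pancake vs. churro tradeoff added to limitations.qmd above can be made concrete by counting how many chunks a request touches. The following is a minimal sketch, not part of the commit: the array shape, the chunk shapes, and the helper `chunks_touched` are all hypothetical, and the count assumes requests aligned to the chunk grid starting at the origin.

```python
import math


def chunks_touched(request_shape, chunk_shape):
    """Count chunks overlapped by a request that starts at the array origin.

    Hypothetical helper: assumes the request is aligned to the chunk grid.
    """
    return math.prod(math.ceil(r / c) for r, c in zip(request_shape, chunk_shape))


# Hypothetical dataset: 365 daily timesteps on a 720 x 1440 grid (time, lat, lon).
pancake = (1, 720, 1440)  # full spatial extent, one timestep per chunk
churro = (365, 10, 10)    # every timestep, small spatial footprint per chunk

global_snapshot = (1, 720, 1440)  # one timestep, whole grid
point_time_series = (365, 1, 1)   # one grid cell, every timestep

for name, chunk_shape in [("pancake", pancake), ("churro", churro)]:
    print(
        name,
        "| snapshot chunks:", chunks_touched(global_snapshot, chunk_shape),
        "| time series chunks:", chunks_touched(point_time_series, chunk_shape),
    )
```

Under these assumptions the pancake layout reads 1 chunk for the snapshot and 365 for the time series, while the churro layout reads 10368 chunks for the snapshot and 1 for the time series. Each chunk read corresponds to at least one request against object storage, which is why neither shape serves both access patterns well.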
recommendations.qmd

Lines changed: 10 additions & 19 deletions
@@ -2,30 +2,17 @@
 title: "Recommendations"
 ---
 
-## Design files, chunks, and aggregated chunk manifests around typical use case patterns
+## Design file chunk structure around typical access patterns
 
-The most important design input for a new data product is understanding how the data will most commonly be accessed.
-With virtual data stores (VDS), you cannot change the chunks of the underlying files. This matters because more chunks along a dimension means slower access across that dimension.
-
-A useful analogy: pancakes vs. churros.
-
-* A pancake chunk holds the full spatial extent at one timestep. Loading a global snapshot is fast because it's all in one chunk. Time series are slow because each timestep is stored in a separate chunk.
-* A churro chunk holds many timesteps for a small spatial location. Time series are fast for a spatial subset, but global views are slow.
-
-![pancakes and churros](./graphics/pancake_vs_churro_chunking_v2.svg)
-
-This is a real problem in practice: many datasets store one file per timestep, which makes data collection straight-forward but is not optimized for time series access.
-
-Chunk shape should be a deliberate design decision, made with awareness of the tradeoffs.
-
-VDSs are often built after data product decisions have already been made. What you can still control is the manifest, where you can make changes to what variables are represented and in what composition. For examples, see [Virtual Stores at NASA](./nasa-applications.qmd).
+While the focus of this document is virtual data stores, it is worth mentioning data product design decisions, since those decisions impact VDS performance. As noted in the [Limitations section](limitations.qmd#early-structural-decisions-have-lasting-performance-consequences), virtual data stores depend on the chunk structure of the underlying files. That's why it is recommended to design files with target access patterns in mind (chunk for access, not storage).
 
 ## Adopt icechunk
 
 [Icechunk](icechunk.io) should be adopted but with risk mitigation measures. Icechunk is a transactional storage engine for Zarr. In other words, it is a way to manage Zarr stores the same way you would with many traditional databases. Icechunk technology supports the following operational needs of many NASA datasets:
 
-* Safety: Changes to a store can be made safely through ACID transactions which ensure all dependent updates are either committed together or rolled back together. Corrupted data can easily be fixed by rolling back to a previous snapshot.
-* Stability: Some production workflows may depend on Icechunk stores and pointing to a specific version ensures stability.
+* **Incremental updating**: Icechunk is the only technology that supports safely appending new data to an existing virtual store — critical for active missions that continuously produce new granules. Without it, the alternatives are rebuilding the entire store on each update or accepting the risk of metadata falling out of sync with the data it describes.
+* **Safety**: Changes to a store are made through ACID transactions, which ensure that all dependent updates (data and metadata) are committed together or rolled back together. This means a store will never be in a partially-updated state — corrupted data can be fixed by rolling back to a previous snapshot.
+* **Reproducibility**: An Icechunk store can be pinned to a specific snapshot, so science workflows that depend on a particular version of the data are not broken by subsequent updates. Snapshots can be tagged for long-term reference.
 
 Reference: [https://icechunk.io/en/stable/overview/](https://icechunk.io/en/stable/overview/).
 
@@ -39,7 +26,11 @@ More specifically, NASA should work on:
 
 ## Adopt GeoZarr standards
 
-Adoption of [GeoZarr](https://geozarr.org/) is recommended to ensure interoperability with the developing GeoZarr ecosystem of tooling. GeoZarr is a specification for standardizing geospatial metadata in Zarr stores. Adopting these standards in VDSs is straight-forward as they already implement Zarr metadata.
+Adoption of [GeoZarr](https://geozarr.org/) is recommended to ensure interoperability with the developing GeoZarr ecosystem of tooling. GeoZarr is to Zarr what GeoTIFF is to TIFF: Zarr has no built-in concept of coordinate reference systems or multi-resolution data for use in tiling and overviews. GeoZarr defines conventions for these so that geospatial tools can work with Zarr stores without each tool implementing its own custom metadata interpretation. Adopting these standards in VDSs is straight-forward as they already implement Zarr metadata.
+
+Without standards, tooling must be customized per-dataset. GeoZarr enabled development of the [deck.gl-zarr](https://github.com/developmentseed/deck.gl-raster/tree/main/packages/deck.gl-zarr) library, which renders any GeoZarr-compliant dataset in the browser.
+
+The main alternatives are CF conventions (used by NetCDF/HDF5) and STAC (for discovery-level metadata). GeoZarr draws from CF conventions but is designed specifically for Zarr, making it the natural choice for Zarr-based virtual stores.
 
 ## Leverage existing tools, services and available chunk metadata.

technical-overview.qmd

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ title: "Technical Overview of Virtual Stores"
 
 ![](graphics/virtualizarr_pipeline.svg)
 
-* [**Zarr**](https://zarr.dev/) is a chunked, compressed multi-dimensional array specification. Zarr is designed for cloud-native (network-addressable chunks in object storage) access. Zarr is the proposed specification for virtual stores at NASA.
+* [**Zarr**](https://zarr.dev/) is a chunked, compressed multi-dimensional array specification. Zarr is designed for cloud-native (network-addressable chunks in object storage) access. Zarr is the proposed specification for virtual stores at NASA. Unlike Cloud-Optimized GeoTIFF (COG), which is optimized for 2D raster imagery but does not generalize to multi-dimensional scientific arrays, or cloud-optimized HDF5, which improves on legacy HDF5 but still relies on HDF5 library internals and byte-range seeking within a single file, Zarr stores each chunk as a separate object in cloud storage. This design enables highly parallel reads, straightforward scaling, and native compatibility with object storage APIs without specialized client libraries.
 * **Chunk manifests** are lightweight metadata structures that describe a mapping from logical data space to where that data is stored. Another term for chunk manifests is an "indirection layer". Icechunk, Kerchunk, and DMR++ are common implementations of chunk manifests.
 * [**Icechunk**](https://icechunk.io/) is a transactional storage engine for Zarr arrays. Icechunk stores chunk manifests which it calls [virtual datasets](https://icechunk.io/en/stable/virtual/).
 * [**Kerchunk**](https://fsspec.github.io/kerchunk/) is an early approach to creating chunk manifests (what it calls reference files) which maps virtual Zarr array coordinates to byte ranges in existing files using the JSON or Parquet file formats for persistence.

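The "indirection layer" idea in the technical-overview.qmd definitions above can be sketched as a lookup from a chunk index to a byte range in an existing file. This is an illustrative sketch only, not part of the commit: `ChunkRef`, `manifest`, the bucket paths, offsets, and lengths are hypothetical, and this is not the on-disk format used by Icechunk, Kerchunk, or DMR++.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical manifest entry: where one chunk's bytes live."""
    path: str    # existing archival file (made-up location)
    offset: int  # first byte of the chunk within that file
    length: int  # number of bytes in the chunk


# Hypothetical manifest for a (time, lat, lon) array with one file per timestep.
manifest = {
    (0, 0, 0): ChunkRef("s3://example-bucket/granule_day1.nc", 4096, 1_000_000),
    (1, 0, 0): ChunkRef("s3://example-bucket/granule_day2.nc", 4096, 1_000_000),
}


def chunk_index(position, chunk_shape):
    """Map an array position to the index of the chunk that contains it."""
    return tuple(p // c for p, c in zip(position, chunk_shape))


def range_header(ref):
    """Build the HTTP Range header value for a chunk (end byte is inclusive)."""
    return f"bytes={ref.offset}-{ref.offset + ref.length - 1}"


# Which file and bytes hold timestep 1, row 100, column 200?
ref = manifest[chunk_index((1, 100, 200), (1, 720, 1440))]
print(ref.path, range_header(ref))
```

In this sketch the reader never opens the whole archival file. It resolves the chunk index, then requests only that byte range, which is the behavior the definitions above attribute to chunk manifests.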