ODD PI 26.3 Objective 1: 🤖 Develop + Maintain the Virtual Zarr Ecosystem #346

@abarciauskas-bgse

Motivation

Virtual stores provide a single entry point to a dataset composed of many files. For NASA datasets this enables:

  • Less pre-processing to be “analysis-ready”.
  • Users do not have to know about the underlying data format or storage location.
  • Greater interoperability through a common API for reading, writing and analyzing complex and heterogeneous NASA datasets.
  • Better performance and reduced costs as less data – only the data the user needs – is sent over the internet.

This PI, we have a number of parallel tasks to support the ecosystem of virtual Zarr stores at NASA.

Each sub-task has its own motivation:

  • Task 1. Parse manifests back out of Icechunk: The inability to modify or inspect virtual stores reduces Icechunk adoption, despite its reliability and performance benefits relative to Kerchunk. Being able to extract manifests also mitigates the risk of depending on Icechunk and its core maintainers.
  • Task 2. VirtualiZarr Maintenance: We are core maintainers of VirtualiZarr, which is the library used to parse and write virtual stores. We reserve time to address bugs and questions as they arrive so the library is well-maintained.
  • Task 3. (Stretch) Design Virtual Data Cubes for NISAR and BIOMASS: NISAR data delivery creates the first real opportunity to test virtual data cubes at scale; a POC now shapes the long-term architecture before patterns solidify.
  • Task 4. (Stretch) Provide bearer-token HTTP support in Icechunk: This is a direct request from PO.DAAC. Many of their users still rely on HTTPS access to datasets because they cannot work from us-west-2, where the Earthdata cloud buckets are located. These users cannot use Icechunk stores until Icechunk supports bearer-token HTTP, because a token from Earthdata Login must be passed along with each request.
  • Task 5. Virtual GEOS-CF maintenance and virtualizarr-data-pipelines: Continued maintenance of the virtual GEOS-CF (v2) dataset as it continuously produces new data. Requirements exposed by this work will also be propagated down into virtualizarr-data-pipelines to improve the template for building pipelines that virtualize datasets.
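
For context on Tasks 1 and 4, here is a minimal sketch of the idea behind a virtual chunk reference and a bearer-token HTTPS read. It is not Icechunk's or VirtualiZarr's API: `ChunkRef`, `build_chunk_request`, the example URL, and the placeholder token are all hypothetical names used only for illustration. It assumes a manifest maps each chunk key to a byte range in an archival file, and that the server accepts a standard `Authorization: Bearer` header alongside an HTTP `Range` header.

```python
from dataclasses import dataclass
from urllib.request import Request


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical virtual chunk reference: a byte range inside an archival file."""
    url: str
    offset: int
    length: int


def build_chunk_request(ref: ChunkRef, token: str) -> Request:
    """Build (but do not send) an HTTPS byte-range request carrying a bearer token."""
    last_byte = ref.offset + ref.length - 1  # the end of an HTTP Range is inclusive
    headers = {
        "Range": f"bytes={ref.offset}-{last_byte}",
        "Authorization": f"Bearer {token}",
    }
    return Request(ref.url, headers=headers)


# Hypothetical manifest: chunk key -> location of the bytes for that chunk.
manifest = {
    "temperature/0.0.0": ChunkRef(
        url="https://example.com/data/granule_001.nc",  # placeholder URL
        offset=4096,
        length=1024,
    ),
}

# Only the bytes of the requested chunk would be fetched, not the whole file.
req = build_chunk_request(manifest["temperature/0.0.0"], token="PLACEHOLDER_TOKEN")
```

The request is only constructed here, never sent, so the sketch runs without network access or real credentials. "Parsing manifests back out" (Task 1) would amount to recovering a mapping like `manifest` from an existing store.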

Sub-tasks (stand-in for acceptance criteria)

  • 1. Parse manifests back out of Icechunk
  • 2. VirtualiZarr Maintenance
  • 3. Stretch: Design Virtual Data Cubes for NISAR and BIOMASS: Deliver a proof-of-concept virtual data cube for NISAR and BIOMASS; define the path to a science-ready production workflow. Provide guidance to data producers/providers such as ASF/ESA on best practices for hosting data.
  • 5. Virtual GEOS-CF maintenance & virtualizarr-data-pipelines improvement

LOE

1 FTE (for non-stretch tasks)

Dependencies

MAAP for AC 3

Related PRs
