Motivation
Virtual stores provide a single entry point to a dataset composed of many files. For NASA datasets this enables:
- Less pre-processing to be “analysis-ready”.
- Users do not have to know about the underlying data format or storage location.
- Greater interoperability through a common API for reading, writing and analyzing complex and heterogeneous NASA datasets.
- Better performance and reduced costs as less data – only the data the user needs – is sent over the internet.
This PI, we have several parallel tasks to support the ecosystem of virtual Zarr stores at NASA.
Each sub-task has its own motivation:
- Task 1. Parse manifests back out of Icechunk: The inability to modify or inspect virtual stores reduces Icechunk adoption, despite its reliability and performance benefits relative to Kerchunk. Being able to extract manifests also mitigates the risk of depending on Icechunk and its core maintainers.
- Task 2. VirtualiZarr Maintenance: We are core maintainers of VirtualiZarr, which is the library used to parse and write virtual stores. We reserve time to address bugs and questions as they arrive so the library is well-maintained.
- Task 3. (Stretch) Design Virtual Data Cubes for NISAR and BIOMASS: NISAR data delivery creates the first real opportunity to test virtual data cubes at scale; a proof of concept (POC) now can shape the long-term architecture before patterns solidify.
- Task 4. (Stretch) Provide bearer-token HTTP support in Icechunk: This is a direct request from PO.DAAC. Many of their users still rely on HTTPS access to datasets because they cannot work from us-west-2, where the Earthdata cloud buckets are located. These users cannot use Icechunk stores until Icechunk supports bearer-token HTTP, because a token from Earthdata Login must be passed along with each request.
- Task 5. Virtual GEOS-CF maintenance and virtualizarr-data-pipelines: Continued maintenance of the virtual GEOS-CF (v2) dataset as it continuously produces new data. Requirements surfaced by this work will also be fed back into virtualizarr-data-pipelines to improve its template for building pipelines that virtualize datasets.
Sub-tasks (stand-in for acceptance criteria)
LOE
1 FTE (for non-stretch tasks)
Dependencies
MAAP for AC 3
Related PRs