changes to address feedback #52

Merged
abarciauskas-bgse merged 7 commits into main from ab/address-feedback
May 29, 2026

@abarciauskas-bgse
Contributor

No description provided.

@github-actions

github-actions Bot commented May 9, 2026

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-05-29 22:50 UTC

@abarciauskas-bgse abarciauskas-bgse marked this pull request as ready for review May 11, 2026 21:51
@doug-newman-nasa

Can we add something about the ease of rechunking based on a virtual Zarr store?

@abarciauskas-bgse abarciauskas-bgse self-assigned this May 14, 2026
Added a point about virtual stores simplifying rechunking or regridding.
@abarciauskas-bgse
Contributor Author

@doug-newman-nasa yes, thank you for the reminder. I've added a line in this commit: 078b138.

Comment thread index.qmd Outdated
<img style="height: 150px; margin: 0px auto; display: block" alt="Simple Virtual Zarr Graphic" src="./graphics/simple-virtual-zarr.svg" />

Virtual stores deliver a single entrypoint to a dataset comprised of many files. For NASA datasets this enables:
The performance of legacy scientific data formats is poor in a cloud environment. Cloud-optimized formats like Zarr, COG, and cloud-optimized HDF5 address this — but reprocessing or copying the entire NASA archive into these formats is not feasible. Virtual stores bridge this gap: they provide cloud-optimized access to existing archived data without copying it.
Collaborator

> reprocessing or copying the entire NASA archive into these formats is not feasible

Why not? In an ideal world we would not need virtual stores, because data providers would write their data into the cloud in a suitable format in the first place. IMO we should be careful not to encourage a narrative that lift-and-shift is totally fine and okay because virtualization exists.

Contributor Author

It's my understanding that these archival formats exist for a reason. The inclusion of metadata in each file is a self-describing feature which makes it possible to move and use files on their own. I don't think NASA is ready to shift users away or able to relinquish this archival format requirement. But I'm curious what @doug-newman-nasa would say to that.

Collaborator

> makes it possible to move and use files on their own.

If file download is treated as an access pattern rather than the source of truth then that need can still be served from cloud-native stores.

> I don't think NASA is ready to shift users away or able to relinquish this archival format requirement.

Yeah I'm not suggesting that NASA is at all ready for this paradigm shift today, but I am suggesting that that is the real ultimate goal, and maybe we should make that explicit.

Contributor Author

ah ok I see, I think I can rephrase it with that goal in mind...

Contributor Author

@TomNicholas let me know what you think of the rewording in 9adbd0a

Collaborator

Yes, I much prefer that framing, thank you. Only nit is that in the final sentence you said that virtual stores avoid the need for "reprocessing", but I think you really mean they avoid the need for "duplicating".

Contributor Author

With "reprocessing", I was thinking of the case where data needs to be rechunked into cloud-friendly chunk structures. But I'm fine with "duplicating the underlying data" as I think it covers the bases of either reprocessing to cloud-optimized chunks in the same or a new format.

Comment thread limitations.qmd
Comment thread index.qmd
Comment thread limitations.qmd Outdated
## Language and ecosystem constraints

- Icechunk is written in Rust with an API in Python. Users and data providers working in other languages (Julia, R, Java, etc.) may face limited or no support for reading and writing Icechunk stores.
+ Icechunk is written in Rust with an API in Python. Users and data providers working in other languages (Julia, R, Java, etc.) may face limited or no support for reading and writing Icechunk stores. Rust presents an organizational risk similar to what NASA has experienced with niche languages in other systems: supporting and extending Icechunk long-term would require NASA staff or contractors with Rust expertise, which is not yet widely available in the earth science community. Rust is seeing broader general adoption than some past niche languages, which reduces but does not eliminate this risk.
Collaborator

I think this also captures Doug's feedback nicely. (Without naming a particular software system and its language choices 😉)

Is this the most important limitation to list, though? Feels like the points about chunk shape/size and other data product related considerations might want to get the top billing, and this could be moved down maybe? Or would that just bury the concern?

Contributor Author

good point, I actually think it's the least important? I hadn't thought about the order signaling level of significance of the limitation but I think the new order (moving the language limitation to the bottom) reflects the ordering of significance that I would propose.

76ac286

Comment thread recommendations.qmd
Revised explanation of legacy scientific data formats and their performance in cloud environments. Clarified the role of virtual stores in providing cloud-optimized access without data duplication.
Collaborator

@owenlittlejohns owenlittlejohns left a comment

Thanks for the updates @abarciauskas-bgse!

@abarciauskas-bgse abarciauskas-bgse merged commit 15d8e25 into main May 29, 2026
2 checks passed