Skip to content

Dmrpp migration#902

Open
Mikejmnez wants to merge 18 commits into
zarr-developers:mainfrom
Mikejmnez:dmrpp_migration
Open

Dmrpp migration#902
Mikejmnez wants to merge 18 commits into
zarr-developers:mainfrom
Mikejmnez:dmrpp_migration

Conversation

@Mikejmnez

@Mikejmnez Mikejmnez commented Mar 6, 2026

Copy link
Copy Markdown
Contributor

This DRAFT PR migrates the core of the dmrpp parser to pydap, integrating it for better support, while incorporating the latest contributions (for example #880). For example, the parser can reference external sources/path to individual chunks (no longer inheriting and assuming all chunks are within the file), parse inline (compressed) values, etc. Improvement of the dmrpp parser will continue close to the source, whilst aiming for interoperability with virtualizarr.

  • Closes pydap#417.
  • Tests added (mostly removed)
  • Tests passing (py311 and py312 environments)
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

This is still draft, since I need to make a new official release of pydap to make these and update the project.toml to reflect the pydap release. Aside from that, all parser tests have been migrated to pydap (where they pass)

summary

dmrpp parser migrated to pydap. However the classes dmrpp and dmrp still exist here to mostly parse the metadata and provide backwards compat with previous added features that remain exposed to users (e.g. earthaccess).

Inline references in dmrpp can be of 2 kinds of inline references - a) base64 and b) base64compressed. Pydap decodes these when parsing dmrpp turning them into arrays of atomic dap4 types. In Virtualizarr, all inline data is then base64 encoded and added as a chunk entry. In the future, if requested, pydap could retain all base64 dmrpp inline references as present when parsing, only decompressing those that are compressed so inline references are always base64 encoded.

@betolink

@codecov

codecov Bot commented Mar 11, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.08%. Comparing base (b693e0d) to head (480fa1d).

Files with missing lines Patch % Lines
virtualizarr/parsers/dmrpp.py 66.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #902      +/-   ##
==========================================
+ Coverage   89.97%   90.08%   +0.11%     
==========================================
  Files          36       36              
  Lines        2224     2048     -176     
==========================================
- Hits         2001     1845     -156     
+ Misses        223      203      -20     
Files with missing lines Coverage Δ
virtualizarr/parsers/dmrpp.py 50.00% <66.66%> (-35.86%) ⬇️

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@maxrjones

Copy link
Copy Markdown
Member

Hey @Mikejmnez, is there anything that I can do to support you with this effort?

For context, I'm interested in showing how the latest VirtualiZarr release with inline variable supports benefits DMR++ (since the work was partially funded by NASA), but it'd be nice to first have the migration done.

@Mikejmnez

Copy link
Copy Markdown
Contributor Author

hey @maxrjones yeah - i need to update this branch. What I am interested, and perhaps you can help me with, is the inline references. The parser I have can parse (and contains) all dmrpp inline references. I know you were working on a PR enabling inline references, but I have not looked if the PR has been merged or not.

@maxrjones

Copy link
Copy Markdown
Member

hey @maxrjones yeah - i need to update this branch. What I am interested, and perhaps you can help me with, is the inline references. The parser I have can parse (and contains) all dmrpp inline references. I know you were working on a PR enabling inline references, but I have not looked if the PR has been merged or not.

yes, we just released support for inline references. Here's an example PR that enabled it for the Kerchunk parser - #979. There's documentation about inlined variables at https://virtualizarr.readthedocs.io/en/stable/data_structures.html#inlined-chunks and https://virtualizarr.readthedocs.io/en/stable/api/developer.html#virtualizarr.manifests.ChunkManifest.from_arrays.

Let me know if you any questions or want to find a coworking time.

@Mikejmnez

Copy link
Copy Markdown
Contributor Author

yes, we just released support for inline references. Here's an example PR that enabled it for the Kerchunk parser - #979. There's documentation about inlined variables at https://virtualizarr.readthedocs.io/en/stable/data_structures.html#inlined-chunks and https://virtualizarr.readthedocs.io/en/stable/api/developer.html#virtualizarr.manifests.ChunkManifest.from_arrays.

Let me know if you any questions or want to find a coworking time.

Great! I'll take a look and reach out if I ran into something

@Mikejmnez

Copy link
Copy Markdown
Contributor Author

@maxrjones this is ready for review if you have time.

@maxrjones maxrjones left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Mikejmnez!

Why leave most of the code in https://github.com/Mikejmnez/VirtualiZarr/blob/9ca723c434ae158a8408741bbff7cbb6cbf40baf/virtualizarr/parsers/dmrpp.py#L84-L230 rather than migrating that to pydap with a thin wrapper here?

Comment thread pyproject.toml Outdated
"requests",
"aiohttp",
"s3fs",
"pydap>=3.5.9",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to include a dedicated dmr optional dependency group. This group is poorly named, but is mostly for the kerchunk parser.

Comment thread virtualizarr/parsers/dmrpp.py Outdated
from obspec_utils.protocols import ReadableStore
from obspec_utils.readers import EagerStoreReader
from obspec_utils.registry import ObjectStoreRegistry
from pydap.parsers.dmr import DMRPPParser as _DMRPPParser

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use the soft_import function to only import pydap if it's installed in the environment -

def soft_import(name: str, reason: str, strict: Optional[bool] = True):
try:
return importlib.import_module(name)
except (ImportError, ModuleNotFoundError):
if strict:
raise ImportError(
f"for {reason}, the {name} package is required. "
f"Please install it via pip or conda."
)
else:
return None

The hdf5 parser uses soft_import because h5py is also an optional dependency.

Alternatively, you could import DMRPPParser only in the TYPE_CHECKING block and the function that uses the class.

@Mikejmnez Mikejmnez May 28, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @maxrjones I finally have some time to look into this.

You could use the soft_import

Yes - the soft import makes sense - I was wondering what was going on, although I did see kerchunk import fail on ci/ci here, so I was overall confused.

Why leave most of the code

The idea was that splitting into different groups (i.e. dataset), and skip_variables is not a pydap thing (it always generates the complete representation of the available metadata). Also, I figure I'd leave the thin wrapper since there is an expectation (from earthaccess at least) to get the parser._validation_issues. The thin wrapper allows the inheritance of that expected attribute. Everthing else, is just parsing through a dictionary all variables (+ encoding for fill values in a way virtualizarr expects it, but not pydap). I am happy to make changes to this per your suggestions!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did see kerchunk import fail on ci/ci here, so I was overall confused.

Update with latest changes on main and I think that failure will go away.

@TomNicholas TomNicholas May 28, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was that splitting into different groups (i.e. dataset), and skip_variables is not a pydap thing

But it is a VirtualiZarr thing, so if you're defining a valid virtualizarr.Parser in pydap, that logic should live in pydap. I was imagining that literally the only thing left in this library would be the re-export, i.e.

from pydap.virtualizarr import DMRPPParser as DMRPPParser

@TomNicholas

Copy link
Copy Markdown
Member

I was gonna say the same thing as @maxrjones - the pydap optional dependency in this PR isn't being handled properly.

Also there was an actual bug in the kerchunk dependency handling which this PR exposed - fixed in #998.

@Mikejmnez

Copy link
Copy Markdown
Contributor Author

@maxrjones @TomNicholas This is ready for comments/suggestions. I'll still need to make a new pydap release and mark pydap with a min version.

Wrt codecov, I am sure the issue is that it did not like that I removed a bunch of tests (and migrated them into pydap).

Comment thread pyproject.toml

# dmrpp
dmrpp = [
"pydap @ git+https://github.com/pydap/pydap.git@refs/pull/697/head",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a flag to update this before merging

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - once this looks good on your end I will merge that PR, make a new pydap release (3.5.10), and declare pydap>=3.5.10 in here

@maxrjones maxrjones left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Mikejmnez! I'm glad to see the DMPPParser moving closer to the DMR++ maintainers. I have one follow-up question about where the trust boundary lands after this migration.

The DMRPPParser glue still does the readall() and the ET.parse on the VirtualiZarr side before handing the root to pydap's DMRParser. Two things there seem worth handling carefully, and since the parsing is becoming pydap's responsibility it might make sense for these to live with it too:

  • readall() pulls the whole DMR++ into memory unbounded. This can be rough on memory-limited environments, and worse if the source is transparently compressed (a small payload can inflate a lot).
  • ET.parse is the stdlib parser on untrusted input. The entity-expansion / large-token DoS protections depend on the Expat version (2.7.2+), so it'd be good to know that's accounted for.

Would it make sense for pydap to expose a fetch+parse entry point that owns these guards, so VirtualiZarr isn't holding the raw read and parse? Since the XML parsing is the part most likely to need urgent security fixes, having it sit entirely with the parser's maintainers seems healthiest.

@Mikejmnez

Copy link
Copy Markdown
Contributor Author

The DMRPPParser glue still does the readall() and the ET.parse on the VirtualiZarr side before handing the root to pydap's DMRParser. Two things there seem worth handling carefully, and since the parsing is becoming pydap's responsibility it might make sense for these to live with it too:

Yes - that is a good point. I can certainly migrate the read.readall() into pydap.

ET.parse is the stdlib parser on untrusted input. The entity-expansion / large-token DoS protections depend on the Expat version (2.7.2+), so it'd be good to know that's accounted for.

I am not well versed on this, and so I am unsure how to address this concern besides migrating the xml read/parse completely into pydap. Can you point me to a resource that'll help me address this concern? Thanks @maxrjones!

@maxrjones

Copy link
Copy Markdown
Member

ET.parse is the stdlib parser on untrusted input. The entity-expansion / large-token DoS protections depend on the Expat version (2.7.2+), so it'd be good to know that's accounted for.

I am not well versed on this, and so I am unsure how to address this concern besides migrating the xml read/parse completely into pydap. Can you point me to a resource that'll help me address this concern? Thanks @maxrjones!

I'm not super well versed either, but I would probably verify the Expat version to make sure the fix is available. The defusedxml docs are a pretty good resource here: https://github.com/tiran/defusedxml#how-to-avoid-xml-vulnerabilities

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

DMR++ references generation Reading byte ranges from archival files

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants