Dmrpp migration by Mikejmnez · Pull Request #902 · zarr-developers/VirtualiZarr

Mikejmnez · 2026-03-06T00:32:07Z

This DRAFT PR migrates the core of the dmrpp parser to pydap, where it can be better supported, while incorporating the latest contributions (for example #880). For instance, the parser can now reference external sources/paths for individual chunks (it no longer assumes that all chunks live within the file) and parse inline (compressed) values. Improvement of the dmrpp parser will continue close to the source, while aiming for interoperability with virtualizarr.

Closes pydap#417.
Tests added (mostly removed)
Tests passing (py311 and py312 environments)
Full type hint coverage
Changes are documented in docs/releases.rst
New functions/methods are listed in api.rst
New functionality has documentation

This is still a draft, since I need to make a new official pydap release that includes these changes and update the pyproject.toml to reflect that release. Aside from that, all parser tests have been migrated to pydap (where they pass).

summary

The dmrpp parser has been migrated to pydap. However, the classes dmrpp and dmrp still exist here, mostly to parse the metadata and to provide backwards compatibility with previously added features that remain exposed to users (e.g. earthaccess).

Inline references in dmrpp can be of two kinds: a) base64 and b) base64compressed. Pydap decodes these when parsing the dmrpp, turning them into arrays of atomic DAP4 types. In VirtualiZarr, all inline data is then base64 encoded and added as a chunk entry. In the future, if requested, pydap could retain all base64 dmrpp inline references as they appear when parsing, decompressing only those that are compressed, so that inline references are always base64 encoded.
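The inline-value round trip described above can be sketched with the standard library alone. This is an illustration, not pydap's or VirtualiZarr's actual implementation: the helper names are hypothetical, and it assumes that a "base64compressed" value is a zlib (deflate) stream that was then base64 encoded and that the values are little-endian float32.

```python
import base64
import struct
import zlib


def decode_inline(text: str, compressed: bool = False) -> bytes:
    """Return the raw bytes behind an inline value (hypothetical helper)."""
    raw = base64.b64decode(text)
    # Assumption: compressed inline values are plain zlib/deflate streams.
    return zlib.decompress(raw) if compressed else raw


def to_float32(raw: bytes) -> tuple[float, ...]:
    """Interpret raw bytes as little-endian float32 values."""
    count = len(raw) // 4
    return struct.unpack(f"<{count}f", raw)


def reencode_inline(raw: bytes) -> str:
    """Re-encode decoded bytes as plain base64, e.g. for an inlined chunk entry."""
    return base64.b64encode(raw).decode("ascii")


values = (1.0, 2.5, -3.0)
payload = struct.pack("<3f", *values)

plain = base64.b64encode(payload).decode("ascii")
packed = base64.b64encode(zlib.compress(payload)).decode("ascii")

# Both kinds of inline reference decode to the same bytes and values.
assert decode_inline(plain) == decode_inline(packed, compressed=True) == payload
assert to_float32(decode_inline(packed, compressed=True)) == values
```

Under these assumptions, re-encoding the decompressed bytes gives back the plain base64 form, which is the "always base64 encoded" end state mentioned above.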

@betolink

codecov · 2026-03-11T17:20:55Z

Codecov Report

❌ Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.08%. Comparing base (b693e0d) to head (480fa1d).

Files with missing lines	Patch %	Lines
virtualizarr/parsers/dmrpp.py	66.66%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #902      +/-   ##
==========================================
+ Coverage   89.97%   90.08%   +0.11%     
==========================================
  Files          36       36              
  Lines        2224     2048     -176     
==========================================
- Hits         2001     1845     -156     
+ Misses        223      203      -20

Files with missing lines	Coverage Δ
virtualizarr/parsers/dmrpp.py	`50.00% <66.66%> (-35.86%)`	⬇️

... and 1 file with indirect coverage changes

maxrjones · 2026-05-03T10:47:50Z

Hey @Mikejmnez, is there anything that I can do to support you with this effort?

For context, I'm interested in showing how the latest VirtualiZarr release with inline variable support benefits DMR++ (since the work was partially funded by NASA), but it'd be nice to first have the migration done.

Mikejmnez · 2026-05-04T15:22:49Z

hey @maxrjones yeah - I need to update this branch. What I am interested in, and perhaps you can help me with, is the inline references. The parser I have can parse (and contains) all dmrpp inline references. I know you were working on a PR enabling inline references, but I have not checked whether the PR has been merged or not.

maxrjones · 2026-05-04T15:32:33Z

> hey @maxrjones yeah - I need to update this branch. What I am interested in, and perhaps you can help me with, is the inline references. The parser I have can parse (and contains) all dmrpp inline references. I know you were working on a PR enabling inline references, but I have not checked whether the PR has been merged or not.

yes, we just released support for inline references. Here's an example PR that enabled it for the Kerchunk parser - #979. There's documentation about inlined variables at https://virtualizarr.readthedocs.io/en/stable/data_structures.html#inlined-chunks and https://virtualizarr.readthedocs.io/en/stable/api/developer.html#virtualizarr.manifests.ChunkManifest.from_arrays.

Let me know if you have any questions or want to find a coworking time.

Mikejmnez · 2026-05-04T15:41:39Z

> yes, we just released support for inline references. Here's an example PR that enabled it for the Kerchunk parser - #979. There's documentation about inlined variables at https://virtualizarr.readthedocs.io/en/stable/data_structures.html#inlined-chunks and https://virtualizarr.readthedocs.io/en/stable/api/developer.html#virtualizarr.manifests.ChunkManifest.from_arrays.
>
> Let me know if you have any questions or want to find a coworking time.

Great! I'll take a look and reach out if I run into something.

Mikejmnez · 2026-05-04T22:23:28Z

@maxrjones this is ready for review if you have time.

maxrjones

Thanks @Mikejmnez!

Why leave most of the code in https://github.com/Mikejmnez/VirtualiZarr/blob/9ca723c434ae158a8408741bbff7cbb6cbf40baf/virtualizarr/parsers/dmrpp.py#L84-L230 rather than migrating that to pydap with a thin wrapper here?

maxrjones · 2026-05-18T19:49:56Z

    "requests",
    "aiohttp",
    "s3fs",
+    "pydap>=3.5.9",


I think it would be better to include a dedicated dmr optional dependency group. This group is poorly named, but is mostly for the kerchunk parser.

maxrjones · 2026-05-18T19:58:09Z

 from obspec_utils.protocols import ReadableStore
 from obspec_utils.readers import EagerStoreReader
 from obspec_utils.registry import ObjectStoreRegistry
+from pydap.parsers.dmr import DMRPPParser as _DMRPPParser


You could use the soft_import function to only import pydap if it's installed in the environment -

VirtualiZarr/virtualizarr/utils.py

Lines 70 to 80 in abd7d0f

    def soft_import(name: str, reason: str, strict: Optional[bool] = True):
        try:
            return importlib.import_module(name)
        except (ImportError, ModuleNotFoundError):
            if strict:
                raise ImportError(
                    f"for {reason}, the {name} package is required. "
                    f"Please install it via pip or conda."
                )
            else:
                return None

The hdf5 parser uses soft_import because h5py is also an optional dependency.

Alternatively, you could import DMRPPParser only in the TYPE_CHECKING block and the function that uses the class.
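A minimal sketch of the two options mentioned above, for illustration only. The `soft_import` body mirrors the helper quoted from `virtualizarr/utils.py`, and the `pydap.parsers.dmr` import path is the one shown in the diff; `make_backend_parser` and its pass-through arguments are hypothetical and do not reflect the real constructor signature.

```python
from __future__ import annotations

import importlib
from typing import TYPE_CHECKING, Any, Optional

if TYPE_CHECKING:
    # Seen only by type checkers, so pydap is not required at import time.
    from pydap.parsers.dmr import DMRPPParser as _DMRPPParser


def soft_import(name: str, reason: str, strict: Optional[bool] = True) -> Any:
    """Import `name` if available; raise or return None depending on `strict`."""
    try:
        return importlib.import_module(name)
    except (ImportError, ModuleNotFoundError):
        if strict:
            raise ImportError(
                f"for {reason}, the {name} package is required. "
                f"Please install it via pip or conda."
            )
        else:
            return None


# Option 1: resolve the optional dependency once, without failing at import time.
pydap = soft_import("pydap", "DMR++ parsing", strict=False)
if pydap is None:
    print("pydap is not installed; DMR++ parsing is unavailable")


def make_backend_parser(*args: Any, **kwargs: Any) -> _DMRPPParser:
    """Option 2 (hypothetical helper): import inside the function that needs it."""
    # Raises ImportError only when this function is called without pydap installed.
    from pydap.parsers.dmr import DMRPPParser

    return DMRPPParser(*args, **kwargs)
```

Either way, importing the module succeeds in environments without pydap, and the failure is deferred to the point where DMR++ parsing is actually requested.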

Mikejmnez · 2026-05-28

Thanks @maxrjones, I finally have some time to look into this.

> You could use the soft_import

Yes - the soft import makes sense - I was wondering what was going on, although I did see kerchunk import fail on ci/ci here, so I was overall confused.

> Why leave most of the code

The idea was that splitting into different groups (i.e. dataset) and skip_variables are not pydap things (pydap always generates the complete representation of the available metadata). Also, I figured I'd leave the thin wrapper, since there is an expectation (from earthaccess at least) to get parser._validation_issues. The thin wrapper allows the inheritance of that expected attribute. Everything else is just parsing all variables through a dictionary (plus encoding fill values in the way virtualizarr expects, but pydap does not). I am happy to make changes to this per your suggestions!
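To make the split of responsibilities described above concrete, here is a rough sketch of what such wrapper-side glue might look like. Everything in it is hypothetical: the class name, the shape of the parsed dictionary, the `group`/`skip_variables` semantics, and the "missing dtype" check are assumptions for illustration. Only the `_validation_issues` attribute name comes from the discussion.

```python
from typing import Any, Iterable, Optional


class ThinDMRPPWrapper:
    """Hypothetical wrapper: filters variables produced by an upstream parser."""

    def __init__(
        self,
        group: Optional[str] = None,
        skip_variables: Optional[Iterable[str]] = None,
    ) -> None:
        self.group = (group or "").strip("/")
        self.skip_variables = set(skip_variables or ())
        # Kept because downstream users (e.g. earthaccess) read this attribute.
        self._validation_issues: list[str] = []

    def select(self, parsed: dict[str, dict[str, Any]]) -> dict[str, dict[str, Any]]:
        """Keep variables directly inside `group`, minus the skipped ones.

        `parsed` maps full paths such as "/grp/temp" to variable metadata.
        """
        selected = {}
        for path, meta in parsed.items():
            parent, _, name = path.strip("/").rpartition("/")
            if parent != self.group or name in self.skip_variables:
                continue
            if "dtype" not in meta:
                # Illustrative validation rule, not taken from the real parser.
                self._validation_issues.append(f"{path}: missing dtype")
            selected[name] = meta
        return selected


parsed = {
    "/lat": {"dtype": "float32"},
    "/time": {"dtype": "int64"},
    "/grp/temp": {},
}
root = ThinDMRPPWrapper(skip_variables=["time"])
print(root.select(parsed))  # {'lat': {'dtype': 'float32'}}
```

The point of the sketch is only that group selection and variable skipping operate on an already-complete metadata dictionary, so they can live on either side of the library boundary.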

TomNicholas · 2026-05-28

> did see kerchunk import fail on ci/ci here, so I was overall confused.

Update with latest changes on main and I think that failure will go away.

TomNicholas · 2026-05-28

> The idea was that splitting into different groups (i.e. dataset), and skip_variables is not a pydap thing

But it is a VirtualiZarr thing, so if you're defining a valid virtualizarr.Parser in pydap, that logic should live in pydap. I was imagining that literally the only thing left in this library would be the re-export, i.e.

    from pydap.virtualizarr import DMRPPParser as DMRPPParser

TomNicholas · 2026-05-18T20:07:11Z

I was gonna say the same thing as @maxrjones - the pydap optional dependency in this PR isn't being handled properly.

Also there was an actual bug in the kerchunk dependency handling which this PR exposed - fixed in #998.

Mikejmnez · 2026-06-15T19:13:29Z

@maxrjones @TomNicholas This is ready for comments/suggestions. I'll still need to make a new pydap release and mark pydap with a min version.

Wrt codecov, I am sure the issue is that it did not like that I removed a bunch of tests (and migrated them into pydap).

maxrjones · 2026-06-16T17:25:34Z


+# dmrpp
+dmrpp = [
+    "pydap @ git+https://github.com/pydap/pydap.git@refs/pull/697/head",


Just a flag to update this before merging

Mikejmnez · 2026-06-16

Yes - once this looks good on your end I will merge that PR, make a new pydap release (3.5.10), and declare pydap>=3.5.10 in here.

maxrjones

Thanks @Mikejmnez! I'm glad to see the DMRPPParser moving closer to the DMR++ maintainers. I have one follow-up question about where the trust boundary lands after this migration.

The DMRPPParser glue still does the readall() and the ET.parse on the VirtualiZarr side before handing the root to pydap's DMRParser. Two things there seem worth handling carefully, and since the parsing is becoming pydap's responsibility it might make sense for these to live with it too:

readall() pulls the whole DMR++ into memory unbounded. This can be rough on memory-limited environments, and worse if the source is transparently compressed (a small payload can inflate a lot).
ET.parse is the stdlib parser on untrusted input. The entity-expansion / large-token DoS protections depend on the Expat version (2.7.2+), so it'd be good to know that's accounted for.

Would it make sense for pydap to expose a fetch+parse entry point that owns these guards, so VirtualiZarr isn't holding the raw read and parse? Since the XML parsing is the part most likely to need urgent security fixes, having it sit entirely with the parser's maintainers seems healthiest.
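As an illustration of the first guard, a size-capped read could look roughly like the sketch below. The function name, the chunk size, and the choice of `ValueError` are assumptions for illustration and are not existing pydap or VirtualiZarr APIs. It assumes a file-like object with a `read(n)` method. If the stream decompresses transparently, the cap applies to the inflated bytes, which is the case of concern above.

```python
import io
from typing import BinaryIO


def read_bounded(stream: BinaryIO, max_bytes: int, chunk_size: int = 64 * 1024) -> bytes:
    """Read all of `stream`, raising ValueError once it exceeds `max_bytes`."""
    buffer = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(buffer)
        buffer.extend(chunk)
        if len(buffer) > max_bytes:
            # Stop early instead of holding an arbitrarily large document in memory.
            raise ValueError(f"DMR++ document exceeds the {max_bytes}-byte limit")


small = io.BytesIO(b"<Dataset/>")
assert read_bounded(small, max_bytes=1024) == b"<Dataset/>"
```

Reading in chunks means at most `max_bytes + chunk_size` bytes are buffered before the guard trips, regardless of how large the source turns out to be.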

Mikejmnez · 2026-06-16T20:52:28Z

> The DMRPPParser glue still does the readall() and the ET.parse on the VirtualiZarr side before handing the root to pydap's DMRParser. Two things there seem worth handling carefully, and since the parsing is becoming pydap's responsibility it might make sense for these to live with it too:

Yes - that is a good point. I can certainly migrate the read.readall() into pydap.

> ET.parse is the stdlib parser on untrusted input. The entity-expansion / large-token DoS protections depend on the Expat version (2.7.2+), so it'd be good to know that's accounted for.

I am not well versed on this, and so I am unsure how to address this concern besides migrating the xml read/parse completely into pydap. Can you point me to a resource that'll help me address this concern? Thanks @maxrjones!

maxrjones · 2026-06-17T13:10:56Z

> > ET.parse is the stdlib parser on untrusted input. The entity-expansion / large-token DoS protections depend on the Expat version (2.7.2+), so it'd be good to know that's accounted for.
>
> I am not well versed on this, and so I am unsure how to address this concern besides migrating the xml read/parse completely into pydap. Can you point me to a resource that'll help me address this concern? Thanks @maxrjones!

I'm not super well versed either, but I would probably verify the Expat version to make sure the fix is available. The defusedxml docs are a pretty good resource here: https://github.com/tiran/defusedxml#how-to-avoid-xml-vulnerabilities
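One way to verify the Expat version from Python is sketched below. The 2.7.2 threshold is the one named in the review comment above. The function names and the refuse-to-parse policy are assumptions for illustration, not an existing pydap or VirtualiZarr API. `xml.parsers.expat.version_info` reports the Expat library the interpreter is linked against, which is also the one `xml.etree.ElementTree` uses.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

MIN_EXPAT = (2, 7, 2)  # version named in the review comment above


def expat_is_recent_enough(minimum: tuple[int, int, int] = MIN_EXPAT) -> bool:
    """Compare the Expat version linked into this interpreter with `minimum`."""
    return tuple(expat.version_info) >= minimum


def parse_dmrpp(data: bytes) -> ET.Element:
    """Parse only when the linked Expat meets the minimum (hypothetical policy)."""
    if not expat_is_recent_enough():
        current = ".".join(map(str, expat.version_info))
        wanted = ".".join(map(str, MIN_EXPAT))
        raise RuntimeError(
            f"Expat {current} is older than {wanted}; refusing to parse untrusted XML"
        )
    return ET.fromstring(data)


print("linked Expat:", expat.version_info)
```

Whether to raise, warn, or fall back to a hardened parser such as defusedxml is a policy choice for the maintainers. The sketch only shows that the check itself is cheap.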

Mikejmnez had a problem deploying to test-release March 6, 2026 00:32 — with GitHub Actions Failure

Mikejmnez had a problem deploying to test-release March 6, 2026 00:33 — with GitHub Actions Failure

Mikejmnez had a problem deploying to test-release March 6, 2026 00:45 — with GitHub Actions Failure

TomNicholas added the "references generation" (Reading byte ranges from archival files) and "DMR++" labels Mar 6, 2026

Mikejmnez force-pushed the dmrpp_migration branch from 3307507 to 8aa366e Compare March 11, 2026 17:18

Mikejmnez had a problem deploying to test-release March 11, 2026 17:18 — with GitHub Actions Failure

Mikejmnez had a problem deploying to test-release March 11, 2026 17:43 — with GitHub Actions Failure

Mikejmnez force-pushed the dmrpp_migration branch from d462973 to 0365302 Compare March 31, 2026 22:34

Mikejmnez had a problem deploying to test-release March 31, 2026 22:34 — with GitHub Actions Failure

Mikejmnez force-pushed the dmrpp_migration branch from 0365302 to 8c45a7b Compare May 4, 2026 20:55

Mikejmnez had a problem deploying to test-release May 4, 2026 20:56 — with GitHub Actions Failure

Mikejmnez had a problem deploying to test-release May 4, 2026 21:52 — with GitHub Actions Failure

Mikejmnez marked this pull request as ready for review May 4, 2026 22:11

maxrjones had a problem deploying to test-release May 18, 2026 19:47 — with GitHub Actions Failure

maxrjones had a problem deploying to test-release May 18, 2026 19:48 — with GitHub Actions Failure

maxrjones reviewed May 18, 2026

TomNicholas mentioned this pull request May 18, 2026

fix: also require kerchunk for test_read_netcdf3 #998

Merged

2 tasks

maxrjones mentioned this pull request May 18, 2026

Support moving DMR++ parser into pydap NASA-IMPACT/veda-odd#380

Closed

Mikejmnez added 4 commits June 12, 2026 09:56

remove migrated tests

10c28e0

strip DMRParser class

e62650c

update DMRParser class to allow for on-the-fly dmrpps

9381d46

add pydap as optional dependency

b528b97

Mikejmnez added 13 commits June 12, 2026 09:56

comment tests as these are migrated

2536cb0

raise NotImplementError if inline values

2a652df

install pydap from unreleased PR

a3c9506

rebase

79b9a79

rebase

44f7074

fix mypy

cf63a2e

skip test on macOS too

9047593

update how inline references are handled v1

1f8f871

update how inline references are handled v2

f9162e9

add pydap to min-deps via dev - avoid min-deps ci/cd failure

9c23ccc

fix missing ,

c486c4c

add a separate isolated opt-dep entry for dmrpp

93837b9

update dmrpp migration - soft imports and code refactor

5fa3d8f

Mikejmnez force-pushed the dmrpp_migration branch from 9ca723c to 5fa3d8f Compare June 12, 2026 23:31

Mikejmnez temporarily deployed to test-release June 12, 2026 23:32 — with GitHub Actions Inactive

add dmrpp testing to py312 and py313

480fa1d

Mikejmnez temporarily deployed to test-release June 12, 2026 23:49 — with GitHub Actions Inactive

maxrjones mentioned this pull request Jun 15, 2026

Review latest changes in PR to decouple DMRParser/Virtualizarr NASA-IMPACT/veda-odd#414

Open

maxrjones reviewed Jun 16, 2026
