Dmrpp migration#902
Conversation
3307507 to
8aa366e
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #902 +/- ##
==========================================
+ Coverage 89.97% 90.08% +0.11%
==========================================
Files 36 36
Lines 2224 2048 -176
==========================================
- Hits 2001 1845 -156
+ Misses 223 203 -20
🚀 New features to boost your workflow:
|
d462973 to
0365302
Compare
|
Hey @Mikejmnez, is there anything that I can do to support you with this effort? For context, I'm interested in showing how the latest VirtualiZarr release with inline variable supports benefits DMR++ (since the work was partially funded by NASA), but it'd be nice to first have the migration done. |
|
hey @maxrjones yeah - i need to update this branch. What I am interested, and perhaps you can help me with, is the inline references. The parser I have can parse (and contains) all dmrpp inline references. I know you were working on a PR enabling inline references, but I have not looked if the PR has been merged or not. |
yes, we just released support for inline references. Here's an example PR that enabled it for the Kerchunk parser - #979. There's documentation about inlined variables at https://virtualizarr.readthedocs.io/en/stable/data_structures.html#inlined-chunks and https://virtualizarr.readthedocs.io/en/stable/api/developer.html#virtualizarr.manifests.ChunkManifest.from_arrays. Let me know if you any questions or want to find a coworking time. |
Great! I'll take a look and reach out if I ran into something |
|
@maxrjones this is ready for review if you have time. |
maxrjones
left a comment
There was a problem hiding this comment.
Thanks @Mikejmnez!
Why leave most of the code in https://github.com/Mikejmnez/VirtualiZarr/blob/9ca723c434ae158a8408741bbff7cbb6cbf40baf/virtualizarr/parsers/dmrpp.py#L84-L230 rather than migrating that to pydap with a thin wrapper here?
| "requests", | ||
| "aiohttp", | ||
| "s3fs", | ||
| "pydap>=3.5.9", |
There was a problem hiding this comment.
I think it would be better to include a dedicated dmr optional dependency group. This group is poorly named, but is mostly for the kerchunk parser.
| from obspec_utils.protocols import ReadableStore | ||
| from obspec_utils.readers import EagerStoreReader | ||
| from obspec_utils.registry import ObjectStoreRegistry | ||
| from pydap.parsers.dmr import DMRPPParser as _DMRPPParser |
There was a problem hiding this comment.
You could use the soft_import function to only import pydap if it's installed in the environment -
VirtualiZarr/virtualizarr/utils.py
Lines 70 to 80 in abd7d0f
The hdf5 parser uses soft_import because h5py is also an optional dependency.
Alternatively, you could import DMRPPParser only in the TYPE_CHECKING block and the function that uses the class.
There was a problem hiding this comment.
Thanks @maxrjones I finally have some time to look into this.
You could use the soft_import
Yes - the soft import makes sense - I was wondering what was going on, although I did see kerchunk import fail on ci/ci here, so I was overall confused.
Why leave most of the code
The idea was that splitting into different groups (i.e. dataset), and skip_variables is not a pydap thing (it always generates the complete representation of the available metadata). Also, I figure I'd leave the thin wrapper since there is an expectation (from earthaccess at least) to get the parser._validation_issues. The thin wrapper allows the inheritance of that expected attribute. Everthing else, is just parsing through a dictionary all variables (+ encoding for fill values in a way virtualizarr expects it, but not pydap). I am happy to make changes to this per your suggestions!
There was a problem hiding this comment.
did see kerchunk import fail on ci/ci here, so I was overall confused.
Update with latest changes on main and I think that failure will go away.
There was a problem hiding this comment.
The idea was that splitting into different groups (i.e. dataset), and skip_variables is not a pydap thing
But it is a VirtualiZarr thing, so if you're defining a valid virtualizarr.Parser in pydap, that logic should live in pydap. I was imagining that literally the only thing left in this library would be the re-export, i.e.
from pydap.virtualizarr import DMRPPParser as DMRPPParser
|
I was gonna say the same thing as @maxrjones - the pydap optional dependency in this PR isn't being handled properly. Also there was an actual bug in the kerchunk dependency handling which this PR exposed - fixed in #998. |
9ca723c to
5fa3d8f
Compare
|
@maxrjones @TomNicholas This is ready for comments/suggestions. I'll still need to make a new pydap release and mark pydap with a min version. Wrt codecov, I am sure the issue is that it did not like that I removed a bunch of tests (and migrated them into pydap). |
|
|
||
| # dmrpp | ||
| dmrpp = [ | ||
| "pydap @ git+https://github.com/pydap/pydap.git@refs/pull/697/head", |
There was a problem hiding this comment.
Just a flag to update this before merging
There was a problem hiding this comment.
Yes - once this looks good on your end I will merge that PR, make a new pydap release (3.5.10), and declare pydap>=3.5.10 in here
maxrjones
left a comment
There was a problem hiding this comment.
Thanks @Mikejmnez! I'm glad to see the DMPPParser moving closer to the DMR++ maintainers. I have one follow-up question about where the trust boundary lands after this migration.
The DMRPPParser glue still does the readall() and the ET.parse on the VirtualiZarr side before handing the root to pydap's DMRParser. Two things there seem worth handling carefully, and since the parsing is becoming pydap's responsibility it might make sense for these to live with it too:
readall()pulls the whole DMR++ into memory unbounded. This can be rough on memory-limited environments, and worse if the source is transparently compressed (a small payload can inflate a lot).ET.parseis the stdlib parser on untrusted input. The entity-expansion / large-token DoS protections depend on the Expat version (2.7.2+), so it'd be good to know that's accounted for.
Would it make sense for pydap to expose a fetch+parse entry point that owns these guards, so VirtualiZarr isn't holding the raw read and parse? Since the XML parsing is the part most likely to need urgent security fixes, having it sit entirely with the parser's maintainers seems healthiest.
Yes - that is a good point. I can certainly migrate the read.readall() into pydap.
I am not well versed on this, and so I am unsure how to address this concern besides migrating the xml read/parse completely into pydap. Can you point me to a resource that'll help me address this concern? Thanks @maxrjones! |
I'm not super well versed either, but I would probably verify the Expat version to make sure the fix is available. The defusedxml docs are a pretty good resource here: https://github.com/tiran/defusedxml#how-to-avoid-xml-vulnerabilities |
This DRAFT PR migrates the core of the dmrpp parser to pydap, integrating it for better support, while incorporating the latest contributions (for example #880). For example, the parser can reference external sources/path to individual chunks (no longer inheriting and assuming all chunks are within the file), parse inline (compressed) values, etc. Improvement of the dmrpp parser will continue close to the source, whilst aiming for interoperability with virtualizarr.
docs/releases.rstapi.rstThis is still draft, since I need to make a new official release of pydap to make these and update the project.toml to reflect the pydap release. Aside from that, all parser tests have been migrated to pydap (where they pass)
summary
dmrpp parser migrated to pydap. However the classes dmrpp and dmrp still exist here to mostly parse the metadata and provide backwards compat with previous added features that remain exposed to users (e.g. earthaccess).
Inline references in dmrpp can be of 2 kinds of inline references - a) base64 and b) base64compressed. Pydap decodes these when parsing dmrpp turning them into arrays of atomic dap4 types. In Virtualizarr, all inline data is then base64 encoded and added as a chunk entry. In the future, if requested, pydap could retain all base64 dmrpp inline references as present when parsing, only decompressing those that are compressed so inline references are always base64 encoded.
@betolink