Skip to content

Support JWST and Roman#25

Merged
icweaver merged 12 commits into
mainfrom
strict
Apr 4, 2026
Merged

Support JWST and Roman#25
icweaver merged 12 commits into
mainfrom
strict

Conversation

@icweaver

@icweaver icweaver commented Mar 21, 2026

Copy link
Copy Markdown
Member

Hey folks, these are a few toggles I added to try and support some of the quirks I ran into "out in the wild". I've split them into the following three commits:

  1. Don't choke on unknown tags: dc40329
    1. This just provides a fallback for STScI's many asdf tags. Could be replaced by specific converters in the future.
    2. Also adds a metadata tag and a newer ndarray-1.1.0 tag that Roman seems to use now.
    3. This feature is behind an ASDF.load(<filename.asdf>; extensions = false) kwarg by default to preserve current behavior. Similar to the extensions kwarg in the Python impl.
  2. Checksum or not to checksum: 56a8c33
    1. Apparently some Roman data products store their file's checksum based on the uncompressed file instead of compressed (ASDF.jl's default). This breaks things.
    2. Added a flag to get around this, similarly to the Python impl.
    3. 2026-04-03: Update from ASDF office hours. This seems to be a genuine bug, so we will keep this checksum validation flag in on our end. Thanks for helping us confirm this, Brett!
  3. Blocks vs Frames: 0d425a0
    1. Roman seems to use the block layout for their Lz4 compression, instead of frame. This adds support for both, and automatically handles it using magic numbers.

Usage examples:

Does this seem reasonable to folks? I'd especially appreciate feedback for the last point, which I have little experience in and relied heavily on Claude for to come up with the needed incantations.

@icweaver icweaver mentioned this pull request Mar 21, 2026
@codecov

codecov Bot commented Mar 21, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.74%. Comparing base (ff79be9) to head (ab17e30).
⚠️ Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #25      +/-   ##
==========================================
+ Coverage   99.70%   99.74%   +0.03%     
==========================================
  Files           1        1              
  Lines         343      389      +46     
==========================================
+ Hits          342      388      +46     
  Misses          1        1              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread src/ASDF.jl
data[3] == 0x4D && data[4] == 0x18)
return decode(LZ4FrameCodec(), data)
else
# If the data was originally created from Python's ASDF, then it will be in block instead of frame layout,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also produce the block layout? Should we? Can Python handle the frame layout?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reminding me, will add that in. Looks like they can, but as a plug-in https://github.com/asdf-format/asdf-compression

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in e012b21

There is a layout option for NDArrayWrapper now to flip between frame and block. Does that seem like a reasonable place to control things? I guess it doesn't really make sense for other compression schemes, so I just set the default layout as default

I don't love from a maintainability point of view that there is now a hand-rolled Lz4-specific encode and decode path to accommodate Python's asdf scheme. There's also the matter of compatibility with Lz4 frame support on the Python side, but since that's still an experimental plugin, maybe that can be a problem for future us to deal with.

I am now squarely out of my comfort zone and would gladly accept any suggestions for simplifying things, haha. Thanks again for taking a look!

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is only for lz4 then I would call it lz4_layout, with values :frame and :block. I am sure that other compression schemes will also want to have options in the future, e.g. specifying the compression level.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, just renamed here e5b6bd4

@eschnett

Copy link
Copy Markdown
Collaborator

The checksum bit sounds weird. There is a standard for checksums, and THE major player in the field gets the implementation wrong? Does this make checksums useless in practice? Well, the world is what it is...

@icweaver

icweaver commented Mar 21, 2026

Copy link
Copy Markdown
Member Author

Yea, the checksum bit was bothering me too. Here's what I am seeing for the sample Roman data I tried:

julia> using ASDF, MD5

julia> af = ASDF.load_file("docs/data/roman.asdf"; extensions = true);

julia> ndarray = af.metadata["roman"]["data"];

julia> header = ndarray.lazy_block_headers.block_headers[ndarray.source + 1]
ASDF.BlockHeader(IOStream(<file docs/data/roman.asdf>), 176347, UInt8[0xd3, 0x42, 0x4c, 0x4b], 0x0030, 0x00000000, UInt8[0x6c, 0x7a, 0x34, 0x00], 0x0000000003ffe53b, 0x0000000003ffe53b, 0x0000000003fc0100, UInt8[0xef, 0x4e, 0x63, 0x45, 0xc4, 0xd6, 0xcd, 0xa0, 0xed, 0x4d, 0x14, 0x27, 0x43, 0xa7, 0xb2, 0xbc], true)

julia> block_data_start = header.position + 6 + header.header_size
176401

julia> seek(header.io, block_data_start)
IOStream(<file docs/data/roman.asdf>)

julia> data = Array{UInt8}(undef, header.used_size);

julia> nb = readbytes!(header.io, data)
67102011

julia> data_decompressed = ASDF.decode_Lz4(data);

julia> md5(data_decompressed) == header.checksum
true

I wonder if this behavior could just be an uncompressed data streaming thing for this specific instance with Roman's data products. I don't work with AWS S3, but it seems that this a selling point for them I guess. fwiw, the JWST data I tried looks to behave as expected (it's just downloaded as a regular file). Here are worked examples for both

Doc previews:

@icweaver icweaver added this to the Release v2 milestone Mar 25, 2026
@eschnett

eschnett commented Apr 4, 2026

Copy link
Copy Markdown
Collaborator

You have some unchecked to-do items: Is this PR ready to be merged, or are you still working on it?

@icweaver icweaver merged commit 5e5154b into main Apr 4, 2026
22 checks passed
@icweaver icweaver deleted the strict branch April 4, 2026 17:38
@icweaver

icweaver commented Apr 4, 2026

Copy link
Copy Markdown
Member Author

Thanks Erik, just updated the top-level comment and merged. According to office hours yesterday, the checksum issue is a genuine bug on the Python side, so according to Brett it sounded like we are good to go over here.

Re: DKIST and newer Roman files, most things look to work in informal tests I tried, and the remaining bits are tracked over in #32

@icweaver icweaver mentioned this pull request Apr 4, 2026
13 tasks
icweaver added a commit that referenced this pull request Jun 6, 2026
Adds support for structured (nested)
[datatypes](https://www.asdf-format.org/projects/asdf-standard/en/latest/generated/stsci.edu/asdf/core/datatype-1.0.0.html),
and support for ascii and ucs (also outlined in that link), which some
newer Roman data files seem to use. This required changing how the
`datatype` field is parsed, so I thought it would be better to have this
as a separate PR instead of cramming it into #25.

To add this support, the new `AsciiDatatype` and `Ucs4Datatype` types
were introduced. These are spiritual variants of `Datatype`, but since
that one is an enum, it seemed like it would be more trouble than it's
worth to get them to all subtype a common abstract type.

This PR also includes a grip of tests using the reference files from
[asdf-standard](https://github.com/asdf-format/asdf-standard/tree/main/reference_files/1.6.0).
I just included a copy of the most recent ones (v1.6.0) in the test
directory here for simplicity. On a related note, should the asdf spec
version [printed
here](https://github.com/JuliaAstro/ASDF.jl/blob/9810e49925cb5cd0f3ad053f8c7b1bbac82027a0/src/ASDF.jl#L1199)
be updated to reflect this?

Could be a good first PR to generalize things to dynamically load
different versions directly from that repo instead. Another good one
might be to start refactoring `src/ASDF.jl` now that it is starting to
get a little long

# Try this PR

1. Install Julia: <https://julialang.org/downloads/>
1. Create a new environment:

   ```julia
   > julia
   
   julia> # Press ] to enter pkg mode
   
   pkg> activate my-env
   ```

1. Install this branch:

   ```julia
   (my-env) pkg> add ASDF#datatype
   ```

1. Load a sample asdf file:
   ```julia
   (my-env) pkg> # Press backspace to return to Julia REPL
   
   julia> using ASDF
   
   julia> load("path to your asdf file")
   ```

Sample asdf files [are available
here](https://github.com/JuliaAstro/ASDF.jl/tree/datatype/test/data/asdf-1.6.0).

Documentation preview: <https://juliaastro.org/ASDF.jl/previews/PR32/>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants