Skip to content

Full CRAM 3.1 support & performance tuning of CRAM read/write#1763

Merged
tfenne merged 3 commits into
devfrom
tf_cram_31
Apr 25, 2026
Merged

Full CRAM 3.1 support & performance tuning of CRAM read/write#1763
tfenne merged 3 commits into
devfrom
tf_cram_31

Conversation

@tfenne
Copy link
Copy Markdown
Member

@tfenne tfenne commented Mar 30, 2026

Description

Ok, so this is a monster PR. I'm not even sure it should be a single PR, but I wanted to put it up here to get some feedback on how best to proceed. I've created it as a draft for that reason. It builds upon the work by @cmnbroad in #1739 while implementing the necessary additions to write CRAM 3.1 at comparable compression ratios to htslib/samtools.

This was done with the assistance of Claude Code, though it also represents a significant investment of my own personal time, prompting, reviewing, fixing, redirecting.

The PR breaks down roughly as follows:

Many of the additions and changes are strongly inspired by work in htslib!

Currently with the normal profile selected, htsjdk emits CRAMs approximately similar in size to htslib. With some 1KG low-pass data I've been testing with the results are actually slightly smaller. However runtime to generate the CRAMs is ~2x slower than htslib/samtools. I think that gap could be shrunk further (it started out around 5-6x slower), I think there will always be a substantial gap due to a) not having easy access to SIMD instructions and b) not having a good libdeflate wrapper on the JVM.

The read path for CRAM 3.1 predates this PR, but the changes here provide substantial performance improvements. Again, testing on low-pass WGS data I'm seeing a 40-50% speedup.

Assuming this PR/changeset has legs, I will do some broader benchmarking and report on both compression and runtime performance.

@jkbonfield I wouldn't expect you to review this PR, but if you have thoughts on how to validate the results of this work beyond the broad set of CRAM unit tests in the htsjdk repo, I would be very grateful for your input.

Things to think about before submitting:

  • Make sure your changes compile and new tests pass locally.
  • Add new tests or update existing ones:
    • A bug fix should include a test that previously would have failed and passes now.
    • New features should come with new tests that exercise and validate the new functionality.
  • Extended the README / documentation, if necessary
  • Check your code style.
  • Write a clear commit title and message
    • The commit message should describe what changed and is targeted at htsjdk developers
    • Breaking changes should be mentioned in the commit message.

@jkbonfield
Copy link
Copy Markdown

jkbonfield commented Mar 31, 2026

For testing, hts-specs has a corpus of test files designed to validate CRAM decoder implementations.

https://github.com/samtools/hts-specs/tree/master/test/cram

The 3.0/README.md covers the basics of the CRAM structure, adding in feature at a time. So blank file with header, then basic one-read bits without reference based encoding, etc. This is mostly testing structures, features, encodings, and so on. By the end of it it's also testing compression codecs.

The 3.1 directory has 4 CRAM files with progressive levels of new compression codecs, basically htslib's profiles I think. They're whole file real world examples, as beyond the new codecs there are no format changes.

More useful for 3.1 is the codecs directory which have raw codec compressed byte streams for you to decode and validate. These are shrunk down copies of the data in https://github.com/jkbonfield/htscodecs-corpus/tree/master/dat, although I think that may be a little incomplete now. Basically it's designed to test things like rans, fqz, arith and name tokeniser codecs in isolation, for unit testing purposes.

This is mostly just integration testing. The nature of CRAM is that even with the same codec we could produce subtly different encoded byte streams (eg it's up to the implementation how it approximates the probabilities tables in rANS and handles rounding errors when the frequencies don't divide up perfectly, but the data stream contains the necessary lookup tables to decode so that's what matters).

So what we need for testing are:

  • Can we round-trip our own data without losing information
  • Can we round-trip with other implementations; java enc -> htscodecs dec, or htscodecs enc -> java dec. The test data here is attempting to validate that bit.

@jkbonfield
Copy link
Copy Markdown

jkbonfield commented Mar 31, 2026

Some comments regarding CRAM 4.0. This is long, but it may explain why I think CRAM 4 is now something to ignore. (Mainly speaking to you as someone looking at adding it for Noodles).

Personally, I think I made an error when I added it to htslib, but the history here is important to understand how we got to where we did.

Initially I had two aims to improve on CRAM 3.0:

  • Fix some glaring issues in the CRAM format.
  • Improve CRAM compression and if possible also speed.
  • A third bonus was to simplify where possible (eg cull the kitchen sink of CORE bit-wise encodings).

Fixing the CRAM format included things like replacing ITF8/LTF8 with a true universal variable length encoder that doesn't need to know the expected bit-size (somewhat of an anathema to proper variable length encoding). It also separated signed from unsigned values, so we no longer take 5 bytes to store -1 (or worse for LTF8). It also gained better ways of doing constant values. I also fixed some deficiencies such as how to discover whether the original data had MD/NM (by adding MD:* and NM:* tags with the "*" being auto-generated on decode, so the presence and location are known). Further improvements would be around =/X vs M in cigar, and so on. There are quite a few little improvements to make it work better as a round-trip tool. I added a CRAM v4 idea tracker on hts-specs not just because I was hoping for community ideas, but primarily as a way to date stamp the ideas I had and prevent patent squatters like MPEG from patenting obvious future improvements!

Looking at the compression side, I regret not adding in zstd, but it was a bit too new then. Zstd should totally replace Gzip (deflate) and LZMA. Deflate is just hampered by a maximum window size of 32kb, which makes deduplication - the core part of LZ - fail on long reads and similar data types. At high levels zstd approaches lzma for ratio. It's not quite as good, but close, and frankly lzma is almost never worth the CPU cost so I doubt it's used much. Similarly bzip2 should be ditched in favour of libbsc, which betters it on pretty much every aspect. Combined we'd be reducing the number of dependencies while also using more modern high performant libraries. There's even a branch of io_lib with libbsc and zstd. This was demonstrated to MPEG at a conference in their first call-for-proposals on a new genomics format. Something I now strongly regret taking part in! (Ultimately they only took the name tokeniser, but have now regretted going with their own entropy encoder as it's simply way too slow.)

The other aspect was improving ratios via custom codecs. The old rANS wins here because it's so much faster as an entropy encoder, and sometimes entropy encoding (or minimal statistical encoding) is the best you can get, so the complex LZ searching methods aren't worth it. (Before CRAM 3.0 we even had zlib being tried with Z_HUFFMAN_ONLY as an option, because it sometimes beat normal zlib while being vastly faster.) The ideas from there eventually ended up in CRAM 3.1.

One aspect of that was the notion of data transforms. This is an idea taken out of PNG, and later used by myself for ZTR. Data starts off as raw, but can have a transformation applied to it. Decoding is like peeling the layers off an onion. If it's not raw format yet, you pass it through the transform indicated in the data-stream and repeat, until you get to the original raw data. These could be delta if it's a series of integer values; size squashing - eg an array of 16bit values becoming 8-bit with an escape mechanism for out of bound data; run-length encoding; bit-packing (eg ACGT strings taking up 2bits per value instead of 8); or even bigger transformations such as BWT and MTF. These can be implemented in CRAM as layered encodings, much like we already have BYTE_ARRAY_LEN which contains two sub-encodings for the literal and length fields. RLE would be a direct comparison to that in data layout. I initially implementing PACK and RLE here.

CRAM 4 was an experimental melting pot of compression improvements along side the format changes.

However it rapidly became apparent that getting any traction on these notions was going to be slow in htsjdk, so I took the decision to attempt to produce a cut down CRAM 4, renamed to CRAM 3.1. The theory was if we only had compression codec changes and we had a high performance separate library to implement them then the code changes needed in other CRAM implementations were very minimal; just updating the list of legal codec numbers and attaching them to callouts to a library API. This isn't much different to how existing implementations may already be handling lzma, bzip2 and deflate formats. (Although I'm aware Java has some native implementations, I was incorrectly assuming the faster C implementations were being utilised mostly.)

A consequence of this is the separate onion-layer transform approach had to go, as that was an CRAM encoding change rather than a trivial codec API (ie compress-buffer and decompress-buffer functions). They got consumed into the rANS-Nx16 codec instead, although minus delta which was a mistake as it would greatly help certain tag types such as ONT signals).

It wasn't such a bad idea, but ultimately it didn't help at all as there was no willingness to simply link against an external library yet we suffered splitting CRAM updates into two new versions instead of 1.

There are still improvements that could be made with CRAM, but I wouldn't make them now I think so ultimately I think CRAM 4.0 is dead in the water. I'm considering removing it from htslib, if only to reduce the potential attack vector for supplying malicious files.

If I was to pick this up again, it'd probably be to start again from scratch and build on something like the recent OpenZL. It wouldn't have the same compression ratios as archive mode CRAM, unless we extended it, but I like the notion of a standard structure that off the shelf tooling can decode. It's like BGZF is basically just decodable with zcat, but with openzl it could include all those useful chunking strategies and data transformations. I still have plans for a BGZF2, but based on OpenZL now instead of zstd. I just got sucked into other projects and my time evaporated to work on this, but one day!

My rough prototype showed enormous gains for multi-sample VCF and long-read SAM files. It's not up to CRAM performance, but closer to CRAM than BAM for Revio and with a rewrite to use OpenZL it ought to be far closer still.

@tfenne
Copy link
Copy Markdown
Member Author

tfenne commented Mar 31, 2026

@jkbonfield Thank you so much for both your pointers on testdata in htsspecs, and the history around CRAM 4. That's a lot to digest, and I think probably I'll reach out via email re: future formats (or finally make it to a GA4GH call). But the short version is that between AI-assisted coding and Project Panama in Java 21/22+, I think it might be much easier to spin up Java wrappers around native libraries and build Java-enabled implementations of future compression formats much faster.

@tfenne tfenne force-pushed the tf_cram_31 branch 2 times, most recently from 94ec6dc to c944b6c Compare April 1, 2026 13:14
@tfenne tfenne requested review from nh13 and yfarjoun April 1, 2026 13:33
@tfenne tfenne marked this pull request as ready for review April 1, 2026 13:33
@tfenne
Copy link
Copy Markdown
Member Author

tfenne commented Apr 1, 2026

@yfarjoun and @nh13 I've marked this as ready for review. I've been back over the code with multiple AIs doing code review, and have fixed up everything they found that I agreed with.

Copy link
Copy Markdown
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs some battle-testing; it's too big for a static code review :/

public ByteBuffer decompress(final ByteBuffer input) {
// Read ISIZE from the last 4 bytes of the GZIP block to size the output
final int isizeOffset = input.limit() - 4;
final int isize = input.duplicate().order(ByteOrder.LITTLE_ENDIAN).position(isizeOffset).getInt();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue:
ISIZE is mod 2^32 so you use Integer.toUnsignedLong for the 2GB-4GB case, check if isize is zero, but no idea how to handle if the actual size was > 4GB. I think it doesn't matter for CRAM, but this is a public function...

}

private void grow(final int minCapacity) {
int newCapacity = buf.length << 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: could overflow if buf.length > 1GB

Comment thread CHANGELOG.md Outdated
### CRAM 3.1 Write Support

- Enable CRAM 3.1 writing with all spec codecs: rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE
- Add configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE) with trial compression for automatic codec selection
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:

Suggested change
- Add configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE) with trial compression for automatic codec selection
- Add configurable compression profiles (FAST, NORMAL, SMALL, ARCHIVE) with trial compression for automatic codec selection

Comment thread README.md Outdated

> **NOTE: _HTSJDK has only partial support for the latest Variant Call Format Specification. VCFv4.3 can be read but not written, VCFv4.4 can be read in lenient mode only, and there is no support for BCFv2.2._**

> **NOTE: _HTSJDK now supports both reading and writing CRAM 3.1 files. CRAM 3.1 write support includes all codecs defined in the specification (rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE), configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE), and trial compression for automatic codec selection. Files produced by htsjdk are interoperable with samtools/htslib._**
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:

Suggested change
> **NOTE: _HTSJDK now supports both reading and writing CRAM 3.1 files. CRAM 3.1 write support includes all codecs defined in the specification (rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE), configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE), and trial compression for automatic codec selection. Files produced by htsjdk are interoperable with samtools/htslib._**
> **NOTE: _HTSJDK now supports both reading and writing CRAM 3.1 files. CRAM 3.1 write support includes all codecs defined in the specification (rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE), configurable compression profiles (FAST, NORMAL, SMALL, ARCHIVE), and trial compression for automatic codec selection. Files produced by htsjdk are interoperable with samtools/htslib._**

@@ -107,7 +109,7 @@ public void writeAlignment(final SAMRecord alignment) {
*/
// TODO: retained for backward compatibility for disq in order to run GATK tests (remove before merging this branch)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: can we remove this now?

int basesRemaining = 0;
boolean firstLen = true;
int lastLen = 0;
int qualOffset = 0;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: can remove

@@ -335,84 +402,150 @@ private static void writeString(final ByteBuffer tokenStreamBuffer, final String
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick:

Suggested change
tokenStreamBuffer.put(val.getBytes(StandardCharsets.UTF_8));

this.byteOffsetOfContainer = containerByteOffset;

final ContentDigests hasher = ContentDigests.create(ContentDigests.ALL);
// htslib does not write content digest tags (BD/SD/B5/S5/B1/S1) into slice headers.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: add to breaking changes in the CHANGELOG?

import htsjdk.samtools.*;
import htsjdk.samtools.cram.structure.CRAMCompressionProfile;
import htsjdk.samtools.cram.structure.CRAMEncodingStrategy;
import htsjdk.samtools.util.CloserUtil;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion:

Suggested change
import htsjdk.samtools.util.CloserUtil;

* @param reader the reader to read from
* @return the decoded value
*/
public static long readUnsignedLTF8(final CRAMByteReader reader) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: these are super dense, ITF8.java and the rest of this code I think is more readable.

cmnbroad and others added 3 commits April 25, 2026 03:50
Initial implementation of CRAM 3.1 write support, contributed by Chris
Norman. Adds a naive write profile that writes valid CRAM 3.1 files
without yet using any of the format's new specialised codecs (FQZComp,
Name Tokeniser, RangeCoder, etc.); all data series continue to use
CRAM 3.0 codecs (rANS, GZIP). This squashes the four prior commits:

  - Enable a naive CRAM 3.1 write profile.
  - Changes based on instrumented trial runs.
  - CRAM 3.1 write tests and code cleanup.
  - Temp fix for preSorted=false default.

This commit serves as the foundation that the following commit builds
on to add the full CRAM 3.1 codec set, profile system, and performance
optimisations.

Co-authored-by: Tim Fennell <tim@fulcrumgenomics.com>
Builds out CRAM 3.1 write support on top of cnorman's initial naive
profile, taking htsjdk to feature parity with samtools/htslib for
writing CRAM 3.1 and matching samtools on cross-implementation fidelity
across a broad range of real-world data.

CRAM 3.1 codecs and supporting infrastructure
---------------------------------------------
- Full CRAM 3.1 codec set: rANS Nx16 (Order 0/1, RLE, Stripe, Pack,
  Cat), FQZComp (quality scores), Range adaptive arithmetic coder, and
  the Name Tokeniser.
- New compression profile system (FAST, NORMAL, SMALL, ARCHIVE),
  matching htslib's `--output-fmt-option fast/small/archive`. Each
  profile picks per-DataSeries compressors and (for SMALL/ARCHIVE)
  trial-compression candidate sets.
- TrialCompressor: a wrapper that tries multiple compressors per block
  and keeps the smallest output. Replaces the ad-hoc tag triple-
  compression path with a uniform, profile-driven mechanism.
- DataSeries content IDs aligned with htslib so CRAM dumps from the
  two implementations are directly comparable.

Codec performance work
----------------------
- rANS codecs reworked to use a `byte[]` API and backwards-write,
  removing per-byte stream overhead and several layers of indirection.
- GzipCodec uses Deflater/Inflater directly instead of going through
  GZIPInputStream/OutputStream, avoiding gzip framing overhead per
  block.
- Name Tokeniser encoder: hand-written tokenisation replaces regex,
  per-type flags + STRIPE + duplicate-stream elimination + all-MATCH
  fast path significantly improve both speed and ratio.
- FQZComp / Range coder / rANS encoder hot paths tightened.
- Pooled RANSNx16Decode instance in the Name Tokeniser to avoid
  reallocating per call.
- Replaced ByteArrayInputStream/OutputStream with unsynchronized
  CRAMByteReader/Writer to remove monitor overhead in tight loops.
- Cached SAMTag key metadata to eliminate per-record String allocation
  during CRAM decode.
- Fused read-base restoration, CIGAR building, and NM/MD computation
  into a single pass instead of three.
- jlibdeflate compatibility fix on the ByteBuffer path.
- Roughly 15% faster encoding overall vs the pre-optimisation state,
  with read decoding gains in line.

Correctness fixes found during cross-implementation validation
--------------------------------------------------------------
- TLEN is now computed using htslib's coordinate-extent rule
  (max(end1,end2) - min(start1,start2) + 1, signed by leftmost
  position) for CRAM-specific compute, rather than the SAM 5'-to-5'
  rule. Without this, every read with a supplementary alignment
  mismatched on TLEN through any CRAM round-trip.
- CompressionHeader serialisation now uses a growable
  ByteArrayOutputStream rather than a fixed 100 KB ByteBuffer for the
  preservation map and tag encoding map. The TD tag dictionary alone
  can exceed 100 KB for rich tag sets (PacBio/Ultima flow-space, ONT
  modified-base tags) under the archive profile, where it would
  previously throw BufferOverflowException.
- Slice flush now uses a dual record/bases threshold matching htslib's
  `seqs_per_slice` AND `bases_per_slice` (default = readsPerSlice *
  500). Without this, archive-profile encoding of long-read data
  (PacBio HiFi 15 kb, ONT 30 kb+) would buffer ~1+ GB of quality
  scores per slice and OOM the FQZComp encoder.
- Strip NM/MD on encode and regenerate them on decode (matching
  htslib's default `store_nm=0`/`store_md=0` behaviour). Implemented
  attached mate pairs to align the in-memory representation with the
  spec.
- Restored CIGAR reconstruction when SEQ is `*` (CF_UNKNOWN_BASES).
- Fixed crash when reading containers that contain no slices.
- Removed unnecessary content digest tags from CRAM slice headers
  (htslib doesn't write them either; htsjdk's verification on read
  was overly strict).

Tools
-----
- `CramConverter`: a small command-line tool for converting between
  SAM/BAM/CRAM, primarily for benchmarking and exercising profiles.
- `CramStats`: dump per-block compression statistics from a CRAM file
  for debugging and ratio analysis.

Tests and CI
------------
- New hts-specs CRAM 3.0/3.1 compliance tests covering decode, index
  query, and round-trip behaviour.
- FQZComp round-trip tests using the hts-specs quality data files.
- CRAI index query correctness tests; codec roundtrip property tests.
- Split CRAM31 fidelity tests into per-profile classes for parallel
  execution.
- Reduced memory pressure in the unit tests to eliminate intermittent
  OOM failures on CI.
- Sped up several long-running CRAM tests by caching test data,
  reducing slice-size matrices, and downsampling fixtures.
- Misc: fixed a thread-safety bug in VariantContextTestProvider that
  was producing non-deterministic test counts.

CHANGELOG / README updated for the 5.0.0 release.

Co-authored-by: Chris Norman <cnorman@broadinstitute.org>
Snakemake pipeline that round-trips a curated set of public BAM/CRAM
files through htsjdk and samtools and verifies output equivalence
across all four CRAM 3.1 compression profiles (FAST, NORMAL, SMALL,
ARCHIVE) and all four encode/decode combinations. The pipeline was
used to validate the CRAM 3.1 write implementation in the previous
commit against samtools 1.21 on real-world data spanning five
sequencing platforms (Illumina, PacBio HiFi, ONT, Element AVITI,
Ultima), seven assay types (WGS, WGBS, RNA-seq, scRNA-seq, exome,
Hi-C, amplicon), and two organisms.

Pipeline (validation/)
----------------------
- Snakefile: encode original -> CRAM with both htsjdk and samtools,
  decode each CRAM with both, then compare each decoded BAM against
  the original. 4 profiles * 4 encoder/decoder combinations = 16
  comparisons per sample.
- Two sample-sheet formats:
    samples.tsv      - local file paths (3 columns)
    test_datasets.tsv - remote URLs auto-downloaded (9 columns)
  Format is auto-detected from header. Reference URLs may be a
  comma-separated list, with optional `#contig_name` suffix to rename
  a single-sequence FASTA's header on download (e.g. Lambda phage
  chrL spike-in for bisulfite samples).
- Intermediate files (BAMs and CRAMs) are marked temp() so they are
  cleaned up as soon as their dependents finish; rule priorities push
  the DAG to drain depth-first, keeping peak disk bounded.
- Per-rule log directives, retry/escalating-memory ladder for the
  Java jobs (4 GB -> 6 GB -> 8 GB -> 10 GB), and pixi.toml for a
  reproducible toolchain.

CramComparison
--------------
A general-purpose record-by-record SAM/BAM/CRAM comparison utility
used by the pipeline as the comparator. Tolerates the universal CRAM
roundtrip normalisations:
  - CIGAR =/X collapsed to M before comparison (CRAM stores match/
    mismatch in read features, so the =/X distinction is lost on
    decode in both htsjdk and samtools).
  - mapQ ignored for unmapped reads (SAM spec leaves it undefined,
    and both implementations normalise it to 0 on decode).
  - Auto-generated MD/NM tags stripped when one side is missing them.
  - Unsigned B-array type differences tolerated (CRAM stores signed).
Exit code 0 on both match and mismatch; exit 2 only on actual error,
so Snakemake preserves the result file in either case.

Co-authored-by: Claude <noreply@anthropic.com>
@tfenne
Copy link
Copy Markdown
Member Author

tfenne commented Apr 25, 2026

Tests pass locally, and are still running into OOM issues in CI. I'm going to merge this to dev and address there if it's still an issue once some other PRs land.

@tfenne tfenne merged commit 6aa602f into dev Apr 25, 2026
2 of 3 checks passed
@tfenne
Copy link
Copy Markdown
Member Author

tfenne commented Apr 25, 2026

I should note for the record that this PR adds a metric ton of unit tests for the CRAM code, and has been battle-tested by transcoding a broad variety of real-world data from BAM to CRAM and back again with each compression profile, with samtools and htsjdk via the included validation pipeline. The validation pipeline uncovered a small number of subtle bugs that have now been fixed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants