Update CHANGELOG.md for v5.0.0 (#1776)

tfenne · web-flow · commit 79c65fdac616 · 2026-04-25T16:39:46.000-06:00
Expands the existing 5.0.0 stub (which previously covered only the CRAM 3.1 write work) into a complete entry covering everything since 4.3.0. Adds: - Lead headlines summarizing the major themes (CRAM 3.1 writing, slimmer runtime deps, faster BAM [de]compression, enforced formatting, fixed test reporting). - A prominent ⚠️ Breaking changes section calling out SRA removal, Nashorn now opt-in, the SAMRecord.toString() format change, the removed CRAM slice digest tags, and the new default CRAM version 3.1. - A new "CRAM correctness and cross-implementation fixes" section consolidating the read- and write-path fixes that improve interop with samtools/htslib (TLEN computation, CIGAR =/X comparison, CIGAR reconstruction for sequence '*', container-with-no-slices crash, archive header overflow, unmapped-read query, supplementary/secondary read-name limitation). - Performance entries beyond just the CRAM-internal optimizations: jlibdeflate integration, the BAM decoding path improvements, and a long-read-friendly bases-per-slice threshold. - A bug-fix section covering the LTF8 9-byte write fix, the SamLocusIterator offset bug, the SamPairUtil dovetail fix, and the snappy native-load UnsatisfiedLinkError catch. - A build, tooling, and dependency clean-up section: Palantir Java Format + Spotless enforcement, Maven Central portal migration, snapshot version naming, deprecation cleanup, the test-runner pass/fail-reporting fix, and the dependency clean-up (commons-logging constraint, Nashorn compileOnly, ngs-java removal). - A compatibility line noting JDK 17 / 21 / 24 spot-checks. - Expanded testing entries: the hts-specs CRAM 3.0/3.1 compliance tests, FQZComp round-trip tests, CRAI correctness tests, test-suite speedups, the CEUTrio test-data downsizing, and the JS filter test bulk-up. Source for the additions was the full git log since the 4.3.0 tag plus the unsquashed backup branch tf_cram_31_backup_20260425, which retains fine-grained commits that the merged CRAM 3.1 PR squashed away. The CRAM write-speed gains are intentionally not headlined yet -- prior htsjdk wrote CRAM 3.0 (lower compression, fewer codec passes), so "faster" without "and same/better compression" would be misleading. We'll revisit the perf bullet after benchmarking against samtools.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -12,9 +12,50 @@ early infrastructure for a plugin-based codec framework and resource bundles.
 
 ## 5.0.0
 
-Adds **CRAM 3.1 write support** to htsjdk.  This is the culmination of the read-side codec work
-in 4.2.0 and the reader wiring in 4.3.0: htsjdk can now produce CRAM 3.1 files that are
-interoperable with samtools/htslib.
+Major release.  
+
+### Headlines: 
+
+- **CRAM 3.1 write support** (the culmination of the read-side codec
+work in 4.2.0 and the reader wiring in 4.3.0 — htsjdk can now produce CRAM 3.1 files that are
+interoperable with samtools/htslib), 
+- **Major speed-ups in CRAM reading speed**, with reading efficiency only 20-30% slower than the C implementation in htslib/samtools
+- **slimmed-down runtime dependency tree** (SRA support
+removed, Nashorn moved to an opt-in dependency, several stale or misleading dependency
+declarations cleaned up)
+- **faster BAM [de]compression** via [jlibdeflate](https://github.com/fulcrumgenomics/jlibdeflate)
+- **enforced automatic code formatting** via Palantir Java Format enforced on every build
+- **fixed unit tests** to accurately report pass/fail stats when run via gradle, and massively reduce unit test runtime
+
+### ⚠️ Breaking changes
+
+Consumers should review these before upgrading.
+
+- **SRA support removed.**  All `htsjdk.samtools.sra.*` types, `SRAFileReader`, `SRAIterator`,
+  `SRAIndex`, `SamInputResource.of(SRAAccession)`, `SamReader.Type.SRA_TYPE`, and the
+  `InputResource.Type.SRA_ACCESSION` enum value have been deleted.  The
+  `gov.nih.nlm.ncbi:ngs-java` dependency (and the `samjdk.sra_libraries_download` system
+  property) are gone.  Consumers needing SRA access must use NCBI's tooling or a different
+  library (#1774).
+- **Nashorn is no longer a transitive runtime dependency.**  The `JavascriptSamRecordFilter`
+  and `JavascriptVariantFilter` classes still exist but htsjdk no longer ships
+  `org.openjdk.nashorn:nashorn-core` (or its 5 ASM transitives) on consumers' runtime
+  classpath.  Consumers who use the JavaScript filter classes must add
+  `org.openjdk.nashorn:nashorn-core:15.7` (or another JSR-223 `"js"` engine) to their own
+  runtime classpath; the no-engine error message names the artifact and prints both Gradle
+  and Maven coordinates (#1775).
+- **`SAMRecord.toString()` now returns the full SAM-format string** for the record (all 11
+  mandatory SAM fields plus tags), replacing the previous minimal summary.  The previous
+  output was usually insufficient to debug failures in `println()` calls or test-assertion
+  messages; the new output is the same line you would see in a SAM file.  Anything that
+  parses or asserts against the exact old format will need updating (#1762).
+- **CRAM slice headers no longer include the optional content digest tags** (BD/SD/B5/S5/B1/S1).
+  Matches htslib/samtools behavior.  Block-level CRC32 (required since CRAM 3.0) still
+  provides data integrity.  Technically a wire-format change but with zero known practical
+  impact, since no known tools consume these tags.
+- **Default CRAM version for writing is now 3.1** (was 3.0).  CRAM 3.0 readers will not be
+  able to read newly-produced files; pass an explicit version to the writer if you need 3.0
+  output.
 
 ### CRAM 3.1 Write Support
 
@@ -25,9 +66,25 @@ interoperable with samtools/htslib.
 - Strip NM/MD tags on CRAM encode and regenerate on decode, matching htslib behavior
 - Implement attached (same-slice) mate pair resolution
 - Align DataSeries content IDs with htslib for cross-implementation debugging
-- Remove content digest tags (BD/SD/B5/S5/B1/S1) from CRAM slice headers, matching htslib/samtools behavior. These are optional per the spec and were expensive to compute. Block-level CRC32 (required by CRAM 3.0+) provides data integrity. This is technically a breaking change but has zero practical impact since no known tools consume these tags.
-- Default CRAM version for writing is now 3.1 (was 3.0)
+- Remove content digest tags (BD/SD/B5/S5/B1/S1) from CRAM slice headers, matching htslib/samtools behavior (see Breaking changes)
+- Default CRAM version for writing is now 3.1 (was 3.0; see Breaking changes)
 - Add `CramConverter` command-line tool for testing and benchmarking CRAM write profiles
+- Add cross-implementation CRAM validation pipeline (`validation/`) for round-tripping against samtools/htslib
+- Add bases-per-slice threshold to bound slice memory when writing long reads
+- Refine `CompressionHeader` map serialization
+- Resolve a pile of in-tree `TODO`s in CRAM structure classes
+
+### CRAM correctness and cross-implementation fixes
+
+These fixes apply to both reading and writing CRAM and substantially improve interoperability with samtools/htslib.
+
+- Fix CRAM `TLEN` computation to match htslib (cross-tool comparisons of the same input now produce matching `TLEN` values)
+- Fix `CIGAR` reconstruction when the sequence is `*` (`CF_UNKNOWN_BASES`)
+- Fix `=`/`X` `CIGAR` op comparison in cross-implementation tests
+- Fix CRAM archive header overflow on large containers
+- Fix crash when reading a CRAM container with no slices
+- Fix unmapped-read query in the hts-specs compliance harness
+- Document the supplementary/secondary read-name resolution limitation in the writer
 
 ### Codec and Compression Optimizations
 
@@ -38,17 +95,48 @@ interoperable with samtools/htslib.
 
 ### Performance
 
-- Replace `ByteArrayInputStream`/`ByteArrayOutputStream` with unsynchronized `CRAMByteReader`/`CRAMByteWriter` to eliminate synchronization overhead
-- Fuse read base restoration, CIGAR building, and NM/MD computation into a single pass during decode
+- Integrate [jlibdeflate](https://github.com/fulcrumgenomics/jlibdeflate) for native libdeflate-backed DEFLATE compression and decompression. Used by default; falls back to the JDK Deflater/Inflater if the native library cannot be loaded (#1768)
+- A few targeted optimizations to the BAM decoding path yielding ~6-7% improvement in BAM read performance (#1764)
+- Optimize CRAM write performance: ~15% faster encoding via codec-level tuning and reduced per-record allocation
+- Replace `ByteArrayInputStream`/`ByteArrayOutputStream` with unsynchronized `CRAMByteReader`/`CRAMByteWriter` to eliminate synchronization overhead in CRAM
+- Fuse read base restoration, CIGAR building, and NM/MD computation into a single pass during CRAM decode
 - Cache tag key metadata to eliminate per-record `String` allocation during CRAM decode
 - Pool `RANSNx16Decode` instances in the Name Tokeniser
 - Optimize BAM nibble-to-ASCII base decoding with a bulk lookup table
 
+### Bug fixes
+
+- Fix LTF8 9-byte write bug: wrong bit shift (`>> 28` instead of `>> 24`) corrupted the high byte of large CRAM offsets (#1765)
+- Fix `SamLocusIterator` so that read position is not incorrectly offset (#1758)
+- Fix asymmetric `SamPairUtil.getPairOrientation` on dovetail pairs (#1771)
+- Catch `UnsatisfiedLinkError` when loading the snappy native library so failure to load it does not abort downstream consumers (#1753)
+
+### Build, tooling, and dependency clean-up
+
+- **Code formatting:** apply [Palantir Java Format](https://github.com/palantir/palantir-java-format) to the entire codebase and enforce it on every build via [Spotless](https://github.com/diffplug/spotless).  `compileJava` auto-formats source in place; CI separately runs `spotlessCheck` as the enforcement boundary.  See `CONTRIBUTING.md` for details, including the `.git-blame-ignore-revs` opt-in for the bulk-format commit (#1761)
+- **Maven Central publishing migrated** from the legacy OSSRH endpoint to the new [Sonatype Central Portal](https://central.sonatype.com), via the [NMCP Gradle plugin](https://github.com/GradleUp/nmcp).  Consumer-visible groupId/artifactId/version coordinates are unchanged (#1769)
+- **Snapshot versioning** now embeds the short commit hash (e.g. `5.0.0-23c681a-SNAPSHOT`) so each snapshot is a distinct, pinnable artifact rather than a moving Maven SNAPSHOT (#1772)
+- **Test runner** now correctly reports failures rather than silently skipping them when a `@DataProvider` throws (#1759)
+- **Existing API deprecations** cleaned up across `htsjdk.samtools` and `htsjdk.variant` (#1767)
+- **`commons-logging` direct declaration removed.**  htsjdk does not use commons-logging itself; the version pin is now expressed as a Gradle dependency constraint and only kicks in transitively when JEXL pulls it
+- **Nashorn moved to `compileOnly`** — see Breaking changes
+- **`gov.nih.nlm.ncbi:ngs-java` removed** — see Breaking changes (SRA support)
+
+### Compatibility
+
+- Compiled and tested against JDK 17 (CI default), 21, and 24.  CI continues to build only on 17.  htsjdk's published minimum remains Java 17 (set in 4.0.0)
+
 ### Testing and Infrastructure
 
+- Add hts-specs CRAM 3.0 / 3.1 decode-compliance tests, plus FQZComp round-trip tests using hts-specs quality data
+- Add CRAI index query correctness tests and codec round-trip property tests
 - Split CRAM 3.1 fidelity tests into per-profile classes for parallel execution
+- Speed up BCF2 and SeekableStream integration tests; cache test data in CRAM index test classes
+- Reduce `CRAMFileBAIIndexTest` from 4 to 2 slice-size variants, sampling every 200th
+- Downsample the CEUTrio test CRAM from ~654K to ~150K records (47 MB → 11 MB)
 - Reduce memory pressure in unit tests to eliminate OOM failures
 - Fix thread-safety bug in `VariantContextTestProvider` causing non-deterministic test counts
+- Bulk up the JavaScript filter test suites: replace 4 checked-in `.js` fixtures with 46 small inline-script tests covering all three constructors, return-type semantics, bindings, and error paths (#1775)
 
 ---