Commit 79c65fd

Update CHANGELOG.md for v5.0.0 (#1776)
Expands the existing 5.0.0 stub (which previously covered only the CRAM 3.1 write work) into a complete entry covering everything since 4.3.0. Adds:

- Lead headlines summarizing the major themes (CRAM 3.1 writing, slimmer runtime deps, faster BAM [de]compression, enforced formatting, fixed test reporting).
- A prominent ⚠️ Breaking changes section calling out SRA removal, Nashorn now opt-in, the SAMRecord.toString() format change, the removed CRAM slice digest tags, and the new default CRAM version 3.1.
- A new "CRAM correctness and cross-implementation fixes" section consolidating the read- and write-path fixes that improve interop with samtools/htslib (TLEN computation, CIGAR =/X comparison, CIGAR reconstruction for sequence '*', container-with-no-slices crash, archive header overflow, unmapped-read query, supplementary/secondary read-name limitation).
- Performance entries beyond just the CRAM-internal optimizations: jlibdeflate integration, the BAM decoding path improvements, and a long-read-friendly bases-per-slice threshold.
- A bug-fix section covering the LTF8 9-byte write fix, the SamLocusIterator offset bug, the SamPairUtil dovetail fix, and the snappy native-load UnsatisfiedLinkError catch.
- A build, tooling, and dependency clean-up section: Palantir Java Format + Spotless enforcement, Maven Central portal migration, snapshot version naming, deprecation cleanup, the test-runner pass/fail-reporting fix, and the dependency clean-up (commons-logging constraint, Nashorn compileOnly, ngs-java removal).
- A compatibility line noting JDK 17 / 21 / 24 spot-checks.
- Expanded testing entries: the hts-specs CRAM 3.0/3.1 compliance tests, FQZComp round-trip tests, CRAI correctness tests, test-suite speedups, the CEUTrio test-data downsizing, and the JS filter test bulk-up.

Source for the additions was the full git log since the 4.3.0 tag plus the unsquashed backup branch tf_cram_31_backup_20260425, which retains fine-grained commits that the merged CRAM 3.1 PR squashed away. The CRAM write-speed gains are intentionally not headlined yet -- prior htsjdk wrote CRAM 3.0 (lower compression, fewer codec passes), so "faster" without "and same/better compression" would be misleading. We'll revisit the perf bullet after benchmarking against samtools.
1 parent 9fae270 commit 79c65fd

1 file changed

Lines changed: 95 additions & 7 deletions

CHANGELOG.md

@@ -12,9 +12,50 @@ early infrastructure for a plugin-based codec framework and resource bundles.
1212

1313
## 5.0.0
1414

15-
Adds **CRAM 3.1 write support** to htsjdk. This is the culmination of the read-side codec work
16-
in 4.2.0 and the reader wiring in 4.3.0: htsjdk can now produce CRAM 3.1 files that are
17-
interoperable with samtools/htslib.
15+
Major release.
16+
17+
### Headlines:
18+
19+
- **CRAM 3.1 write support** (the culmination of the read-side codec
20+
work in 4.2.0 and the reader wiring in 4.3.0 — htsjdk can now produce CRAM 3.1 files that are
21+
interoperable with samtools/htslib)
22+
- **Major CRAM read speed-ups**, with reading only 20-30% slower than the C implementation in htslib/samtools
23+
- **slimmed-down runtime dependency tree** (SRA support
24+
removed, Nashorn moved to an opt-in dependency, several stale or misleading dependency
25+
declarations cleaned up)
26+
- **faster BAM [de]compression** via [jlibdeflate](https://github.com/fulcrumgenomics/jlibdeflate)
27+
- **enforced automatic code formatting** via Palantir Java Format on every build
28+
- **fixed unit tests** that now accurately report pass/fail stats when run via Gradle, with massively reduced unit test runtime
29+
30+
### ⚠️ Breaking changes
31+
32+
Consumers should review these before upgrading.
33+
34+
- **SRA support removed.** All `htsjdk.samtools.sra.*` types, `SRAFileReader`, `SRAIterator`,
35+
`SRAIndex`, `SamInputResource.of(SRAAccession)`, `SamReader.Type.SRA_TYPE`, and the
36+
`InputResource.Type.SRA_ACCESSION` enum value have been deleted. The
37+
`gov.nih.nlm.ncbi:ngs-java` dependency (and the `samjdk.sra_libraries_download` system
38+
property) are gone. Consumers needing SRA access must use NCBI's tooling or a different
39+
library (#1774).
40+
- **Nashorn is no longer a transitive runtime dependency.** The `JavascriptSamRecordFilter`
41+
and `JavascriptVariantFilter` classes still exist but htsjdk no longer ships
42+
`org.openjdk.nashorn:nashorn-core` (or its 5 ASM transitives) on consumers' runtime
43+
classpath. Consumers who use the JavaScript filter classes must add
44+
`org.openjdk.nashorn:nashorn-core:15.7` (or another JSR-223 `"js"` engine) to their own
45+
runtime classpath; the no-engine error message names the artifact and prints both Gradle
46+
and Maven coordinates (#1775).
47+
- **`SAMRecord.toString()` now returns the full SAM-format string** for the record (all 11
48+
mandatory SAM fields plus tags), replacing the previous minimal summary. The previous
49+
output was usually insufficient to debug failures in `println()` calls or test-assertion
50+
messages; the new output is the same line you would see in a SAM file. Anything that
51+
parses or asserts against the exact old format will need updating (#1762).
52+
- **CRAM slice headers no longer include the optional content digest tags** (BD/SD/B5/S5/B1/S1).
53+
Matches htslib/samtools behavior. Block-level CRC32 (required since CRAM 3.0) still
54+
provides data integrity. Technically a wire-format change but with zero known practical
55+
impact, since no known tools consume these tags.
56+
- **Default CRAM version for writing is now 3.1** (was 3.0). CRAM 3.0 readers will not be
57+
able to read newly-produced files; pass an explicit version to the writer if you need 3.0
58+
output.
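
For the Nashorn item above, consumers may want to check at startup whether any JSR-223 `"js"` engine is actually on their runtime classpath before constructing a JavaScript filter. The following is a minimal sketch using only the standard `javax.script` API; the class name `JsEngineCheck` and the method `hasJsEngine` are hypothetical and not part of htsjdk.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

// Hypothetical helper, not part of htsjdk: probes for a JSR-223 "js" engine.
class JsEngineCheck {

    /** Returns true if some JSR-223 engine is registered under the name "js". */
    static boolean hasJsEngine() {
        // getEngineByName returns null when no matching engine is discoverable.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
        return engine != null;
    }

    public static void main(String[] args) {
        if (hasJsEngine()) {
            System.out.println("A JSR-223 \"js\" engine is available.");
        } else {
            System.out.println("No \"js\" engine found; add one (for example nashorn-core) "
                    + "to the runtime classpath.");
        }
    }
}
```

On a JDK 17 or later runtime with no script engine added, `hasJsEngine()` is expected to return `false`, which is the situation the changelog entry describes.
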
1859

1960
### CRAM 3.1 Write Support
2061

@@ -25,9 +66,25 @@ interoperable with samtools/htslib.
2566
- Strip NM/MD tags on CRAM encode and regenerate on decode, matching htslib behavior
2667
- Implement attached (same-slice) mate pair resolution
2768
- Align DataSeries content IDs with htslib for cross-implementation debugging
28-
- Remove content digest tags (BD/SD/B5/S5/B1/S1) from CRAM slice headers, matching htslib/samtools behavior. These are optional per the spec and were expensive to compute. Block-level CRC32 (required by CRAM 3.0+) provides data integrity. This is technically a breaking change but has zero practical impact since no known tools consume these tags.
29-
- Default CRAM version for writing is now 3.1 (was 3.0)
69+
- Remove content digest tags (BD/SD/B5/S5/B1/S1) from CRAM slice headers, matching htslib/samtools behavior (see Breaking changes)
70+
- Default CRAM version for writing is now 3.1 (was 3.0; see Breaking changes)
3071
- Add `CramConverter` command-line tool for testing and benchmarking CRAM write profiles
72+
- Add cross-implementation CRAM validation pipeline (`validation/`) for round-tripping against samtools/htslib
73+
- Add bases-per-slice threshold to bound slice memory when writing long reads
74+
- Refine `CompressionHeader` map serialization
75+
- Resolve a pile of in-tree `TODO`s in CRAM structure classes
76+
77+
### CRAM correctness and cross-implementation fixes
78+
79+
These fixes apply to both reading and writing CRAM and substantially improve interoperability with samtools/htslib.
80+
81+
- Fix CRAM `TLEN` computation to match htslib (cross-tool comparisons of the same input now produce matching `TLEN` values)
82+
- Fix `CIGAR` reconstruction when the sequence is `*` (`CF_UNKNOWN_BASES`)
83+
- Fix `=`/`X` `CIGAR` op comparison in cross-implementation tests
84+
- Fix CRAM archive header overflow on large containers
85+
- Fix crash when reading a CRAM container with no slices
86+
- Fix unmapped-read query in the hts-specs compliance harness
87+
- Document the supplementary/secondary read-name resolution limitation in the writer
3188

3289
### Codec and Compression Optimizations
3390

@@ -38,17 +95,48 @@ interoperable with samtools/htslib.
3895

3996
### Performance
4097

41-
- Replace `ByteArrayInputStream`/`ByteArrayOutputStream` with unsynchronized `CRAMByteReader`/`CRAMByteWriter` to eliminate synchronization overhead
42-
- Fuse read base restoration, CIGAR building, and NM/MD computation into a single pass during decode
98+
- Integrate [jlibdeflate](https://github.com/fulcrumgenomics/jlibdeflate) for native libdeflate-backed DEFLATE compression and decompression. Used by default; falls back to the JDK Deflater/Inflater if the native library cannot be loaded (#1768)
99+
- A few targeted optimizations to the BAM decoding path yielding ~6-7% improvement in BAM read performance (#1764)
100+
- Optimize CRAM write performance: ~15% faster encoding via codec-level tuning and reduced per-record allocation
101+
- Replace `ByteArrayInputStream`/`ByteArrayOutputStream` with unsynchronized `CRAMByteReader`/`CRAMByteWriter` to eliminate synchronization overhead in CRAM
102+
- Fuse read base restoration, CIGAR building, and NM/MD computation into a single pass during CRAM decode
43103
- Cache tag key metadata to eliminate per-record `String` allocation during CRAM decode
44104
- Pool `RANSNx16Decode` instances in the Name Tokeniser
45105
- Optimize BAM nibble-to-ASCII base decoding with a bulk lookup table
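
To illustrate the last item, BAM packs two bases per byte as 4-bit codes indexing the string `=ACMGRSVTWYHKDBN`. The sketch below shows one way a lookup table can decode a whole packed byte at a time; it is an illustration of the general technique, not htsjdk's actual implementation, and the class name `NibbleBaseDecoder` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of table-driven BAM base decoding; not htsjdk code.
class NibbleBaseDecoder {
    // 4-bit code -> ASCII base, as defined by the SAM/BAM specification.
    private static final byte[] CODES = "=ACMGRSVTWYHKDBN".getBytes(StandardCharsets.US_ASCII);

    // For each of the 256 possible packed bytes, precompute both decoded bases.
    private static final byte[] HIGH = new byte[256];
    private static final byte[] LOW = new byte[256];

    static {
        for (int b = 0; b < 256; b++) {
            HIGH[b] = CODES[b >>> 4];   // first base lives in the high nibble
            LOW[b] = CODES[b & 0x0F];   // second base lives in the low nibble
        }
    }

    /** Decodes baseCount bases from the packed array. */
    static byte[] decode(byte[] packed, int baseCount) {
        byte[] out = new byte[baseCount];
        int pairs = baseCount / 2;
        for (int i = 0; i < pairs; i++) {
            int b = packed[i] & 0xFF;
            out[2 * i] = HIGH[b];
            out[2 * i + 1] = LOW[b];
        }
        // An odd base count leaves one base in the high nibble of the last byte.
        if ((baseCount & 1) == 1) {
            out[baseCount - 1] = HIGH[packed[pairs] & 0xFF];
        }
        return out;
    }
}
```

For example, decoding the packed bytes `0x12 0x48 0xF0` with a base count of 5 yields `ACGTN`.
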
46106

107+
### Bug fixes
108+
109+
- Fix LTF8 9-byte write bug: wrong bit shift (`>> 28` instead of `>> 24`) corrupted the high byte of large CRAM offsets (#1765)
110+
- Fix `SamLocusIterator` so that read position is not incorrectly offset (#1758)
111+
- Fix asymmetric `SamPairUtil.getPairOrientation` on dovetail pairs (#1771)
112+
- Catch `UnsatisfiedLinkError` when loading the snappy native library so failure to load it does not abort downstream consumers (#1753)
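
To make the LTF8 item concrete: in the 9-byte LTF8 form, a `0xFF` marker byte is followed by the 64-bit value in big-endian order, so the shifts run 56, 48, 40, 32, 24, 16, 8, 0. The sketch below is not the htsjdk code; the class and method names are hypothetical. It derives the shifts in a loop so that a hand-typed constant such as the `>> 28` mentioned above cannot slip in.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical sketch of the 9-byte LTF8 form only; not htsjdk code.
class Ltf8NineByte {

    /** Writes the 0xFF marker followed by the value as 8 big-endian bytes. */
    static byte[] writeNineByte(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(9);
        out.write(0xFF);
        for (int shift = 56; shift >= 0; shift -= 8) {
            out.write((int) (value >>> shift) & 0xFF);
        }
        return out.toByteArray();
    }

    /** Reads a value back from the 9-byte form produced above. */
    static long readNineByte(byte[] buf) {
        if (buf.length != 9 || (buf[0] & 0xFF) != 0xFF) {
            throw new IllegalArgumentException("not a 9-byte LTF8 value");
        }
        long value = 0;
        for (int i = 1; i <= 8; i++) {
            value = (value << 8) | (buf[i] & 0xFF);
        }
        return value;
    }
}
```

With the value `0x0123456789ABCDEFL`, the byte produced by the 24-bit shift is `0x89`; a 28-bit shift at that position would produce `0x78` instead, and the value would no longer round-trip.
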
113+
114+
### Build, tooling, and dependency clean-up
115+
116+
- **Code formatting:** apply [Palantir Java Format](https://github.com/palantir/palantir-java-format) to the entire codebase and enforce it on every build via [Spotless](https://github.com/diffplug/spotless). `compileJava` auto-formats source in place; CI separately runs `spotlessCheck` as the enforcement boundary. See `CONTRIBUTING.md` for details, including the `.git-blame-ignore-revs` opt-in for the bulk-format commit (#1761)
117+
- **Maven Central publishing migrated** from the legacy OSSRH endpoint to the new [Sonatype Central Portal](https://central.sonatype.com), via the [NMCP Gradle plugin](https://github.com/GradleUp/nmcp). Consumer-visible groupId/artifactId/version coordinates are unchanged (#1769)
118+
- **Snapshot versioning** now embeds the short commit hash (e.g. `5.0.0-23c681a-SNAPSHOT`) so each snapshot is a distinct, pinnable artifact rather than a moving Maven SNAPSHOT (#1772)
119+
- **Test runner** now correctly reports failures rather than silently skipping them when a `@DataProvider` throws (#1759)
120+
- **Existing API deprecations** cleaned up across `htsjdk.samtools` and `htsjdk.variant` (#1767)
121+
- **`commons-logging` direct declaration removed.** htsjdk does not use commons-logging itself; the version pin is now expressed as a Gradle dependency constraint and only kicks in transitively when JEXL pulls it
122+
- **Nashorn moved to `compileOnly`** — see Breaking changes
123+
- **`gov.nih.nlm.ncbi:ngs-java` removed** — see Breaking changes (SRA support)
124+
125+
### Compatibility
126+
127+
- Compiled and tested against JDK 17 (CI default), 21, and 24. CI continues to build only on 17. htsjdk's published minimum remains Java 17 (set in 4.0.0)
128+
47129
### Testing and Infrastructure
48130

131+
- Add hts-specs CRAM 3.0 / 3.1 decode-compliance tests, plus FQZComp round-trip tests using hts-specs quality data
132+
- Add CRAI index query correctness tests and codec round-trip property tests
49133
- Split CRAM 3.1 fidelity tests into per-profile classes for parallel execution
134+
- Speed up BCF2 and SeekableStream integration tests; cache test data in CRAM index test classes
135+
- Reduce `CRAMFileBAIIndexTest` from 4 to 2 slice-size variants, sampling every 200th
136+
- Downsample the CEUTrio test CRAM from ~654K to ~150K records (47 MB → 11 MB)
50137
- Reduce memory pressure in unit tests to eliminate OOM failures
51138
- Fix thread-safety bug in `VariantContextTestProvider` causing non-deterministic test counts
139+
- Bulk up the JavaScript filter test suites: replace 4 checked-in `.js` fixtures with 46 small inline-script tests covering all three constructors, return-type semantics, bindings, and error paths (#1775)
52140

53141
---
54142
