Merged
17 commits
bdd6791
Fix SamLocusIterator so that read position is not incorrectly offset …
tfenne Apr 11, 2026
6a16622
A few targeted optimizations to the BAM decoding path that yield 6-7%…
tfenne Apr 11, 2026
2a5564a
Fix existing deprecations in code and build. (#1767)
tfenne Apr 11, 2026
99cb4b6
Integrate jlibdeflate for faster DEFLATE compression and decompressio…
tfenne Apr 11, 2026
ceffa6f
Update maven central publishing to the new central portal and update …
tfenne Apr 11, 2026
b06e6e4
fix: catch UnsatisfiedLinkError when loading snappy native library (#…
nh13 Apr 11, 2026
9b47987
CRAM 3.1: enable an initial naive write profile
cmnbroad Apr 25, 2026
07230f5
CRAM 3.1 write support: full codecs, profiles, optimisation, and tests
tfenne Apr 25, 2026
e7f34b0
Add cross-implementation CRAM validation pipeline
tfenne Apr 25, 2026
23c681a
Changed SAMRecord.toString() to emit the SAM format string with all f…
tfenne Apr 25, 2026
62c308d
Fix snapshot version naming and trim build hygiene (#1772)
tfenne Apr 25, 2026
4b75dd5
Apply Palantir Java Format to entire codebase
tfenne Apr 25, 2026
41e497a
Add Spotless + Palantir Java Format and auto-format on every build (#…
tfenne Apr 25, 2026
49e50e3
Excise SRA support from htsjdk (#1774)
tfenne Apr 25, 2026
9fae270
Trim runtime deps: make Nashorn optional, bulk up JS filter tests (#1…
tfenne Apr 25, 2026
4f25db8
Update CHANGELOG.md for v5.0.0 (#1776)
tfenne Apr 25, 2026
fb57878
Fix javadoc warnings across the codebase. (#1778)
tfenne Apr 29, 2026
14 changes: 14 additions & 0 deletions .git-blame-ignore-revs
@@ -0,0 +1,14 @@
# Commits listed here are skipped by `git blame` so that mechanical, whole-tree
# reformats do not obscure the author who actually wrote each line.
#
# GitHub honors this file automatically in the web blame view. For `git blame`
# on the command line, opt in once per clone:
#
# git config blame.ignoreRevsFile .git-blame-ignore-revs
#
# When adding a new entry, include a one-line comment above the SHA explaining
# what the commit did and why it should be skipped. Only mechanical reformats
# belong here -- never use this to hide substantive changes.

# Apply Palantir Java Format to entire codebase (#1761)
4b75dd524198dea5b789fd383f99ce974510fb1d
23 changes: 23 additions & 0 deletions .github/workflows/tests.yml
@@ -21,6 +21,8 @@ jobs:
    name: Java ${{ matrix.Java }} build and test
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0 # full history + tags so palantir/git-version sees the latest release tag
      - name: Set up java ${{ matrix.Java }}
        uses: actions/setup-java@v3
        with:
@@ -49,6 +51,8 @@ jobs:
    name: Tests that require external APIs
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0 # full history + tags so palantir/git-version sees the latest release tag
      - name: Set up java 17
        uses: actions/setup-java@v3
        with:
@@ -67,11 +71,30 @@ jobs:
        with:
          name: test-results-external-apis
          path: build/reports/tests
  formatCheck:
    runs-on: ubuntu-latest
    name: Java Format Check
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0 # full history + tags so palantir/git-version sees the latest release tag
      - name: Set up java 17
        uses: actions/setup-java@v3
        with:
          java-version: '17'
          distribution: 'adopt'
          cache: gradle
      - name: Grant execute permission for gradlew
        run: chmod +x gradlew
      - name: Verify formatting
        run: ./gradlew spotlessCheck
  spotBugs:
    runs-on: ubuntu-latest
    name: SpotBugs
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0 # full history + tags so palantir/git-version sees the latest release tag
      - name: Set up java 17
        uses: actions/setup-java@v3
        with:
1 change: 0 additions & 1 deletion .gitignore
@@ -1,7 +1,6 @@
htsjdk.iws
.command_tmp
atlassian-ide-plugin.xml
/htsjdk.version.properties
/test-output/
.DS_Store

56 changes: 0 additions & 56 deletions .travis.yml

This file was deleted.

126 changes: 126 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,132 @@ early infrastructure for a plugin-based codec framework and resource bundles.

---

## 5.0.0

Major release.

### Headlines

- **CRAM 3.1 write support** (the culmination of the read-side codec work in 4.2.0 and the reader wiring in 4.3.0 — htsjdk can now produce CRAM 3.1 files that are interoperable with samtools/htslib).
- **CRAM 3.1 is now the default write version** (previously 3.0). On the same input, files written with the new default `NORMAL` profile are roughly 36% smaller and encode 18-20% faster than what htsjdk 4.3 produced with its `FAST` (3.0) default.
- **Major speed-ups across the BAM and CRAM read/write paths** vs htsjdk 4.3.0. Measured on AWS m8gd / m8id (single thread, 32.7M-read input), the headline wins are: BAM write 50-58% faster, CRAM encode (FAST) 41-47% faster, CRAM read 42-46% faster, BAM read 30-31% faster.
- **`jlibdeflate` is now the default DEFLATE engine** ([jlibdeflate](https://github.com/fulcrumgenomics/jlibdeflate) wrapping native libdeflate); falls back to the JDK `Deflater`/`Inflater` if the native library cannot be loaded.
- **Slimmed-down runtime dependency tree** (SRA support removed, Nashorn moved to an opt-in dependency, several stale or misleading dependency declarations cleaned up).
- **Enforced automatic code formatting** via Palantir Java Format on every build.
- **Unit test improvements**: pass/fail stats now reported correctly when run via Gradle, and total suite runtime massively reduced (now 2-3 minutes).

### ⚠️ Breaking changes

Consumers should review these before upgrading.

- **SRA support removed.** All `htsjdk.samtools.sra.*` types, `SRAFileReader`, `SRAIterator`,
`SRAIndex`, `SamInputResource.of(SRAAccession)`, `SamReader.Type.SRA_TYPE`, and the
`InputResource.Type.SRA_ACCESSION` enum value have been deleted. The
`gov.nih.nlm.ncbi:ngs-java` dependency (and the `samjdk.sra_libraries_download` system
property) are gone. Consumers needing SRA access must use NCBI's tooling or a different
library (#1774).
- **Nashorn is no longer a transitive runtime dependency.** The `JavascriptSamRecordFilter`
and `JavascriptVariantFilter` classes still exist but htsjdk no longer ships
`org.openjdk.nashorn:nashorn-core` (or its 5 ASM transitives) on consumers' runtime
classpath. Consumers who use the JavaScript filter classes must add
`org.openjdk.nashorn:nashorn-core:15.7` (or another JSR-223 `"js"` engine) to their own
runtime classpath; the no-engine error message names the artifact and prints both Gradle
and Maven coordinates (#1775).
- **`SAMRecord.toString()` now returns the full SAM-format string** for the record (all 11
mandatory SAM fields plus tags), replacing the previous minimal summary. The previous
output was usually insufficient to debug failures in `println()` calls or test-assertion
messages; the new output is the same line you would see in a SAM file. Anything that
parses or asserts against the exact old format will need updating (#1762).
- **CRAM slice headers no longer include the optional content digest tags** (BD/SD/B5/S5/B1/S1).
Matches htslib/samtools behavior. Block-level CRC32 (required since CRAM 3.0) still
provides data integrity. Technically a wire-format change but with zero known practical
impact, since no known tools consume these tags.
- **Default CRAM version for writing is now 3.1** (was 3.0). CRAM 3.0 readers will not be
able to read newly-produced files; pass an explicit version to the writer if you need 3.0
output.
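
For consumers affected by the Nashorn change above, a quick way to see whether a JSR-223 `"js"` engine is present on the runtime classpath is to ask the JDK's `ScriptEngineManager` directly. The sketch below is illustrative only: the class name `JsEngineCheckSketch` and the message text are hypothetical and are not htsjdk's own error handling; only the `javax.script` calls are standard JDK API.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;

/** Hypothetical helper, not part of htsjdk: reports whether a JSR-223 "js" engine is available. */
final class JsEngineCheckSketch {

    /** Returns a human-readable description of the lookup result. */
    static String describe() {
        // getEngineByName returns null when no registered engine answers to the given name.
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("js");
        if (engine == null) {
            return "No JSR-223 \"js\" engine found; add org.openjdk.nashorn:nashorn-core "
                    + "(or another JavaScript engine) to the runtime classpath";
        }
        return "Found JavaScript engine: " + engine.getFactory().getEngineName();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

On a plain JDK 17 classpath without Nashorn this would typically take the "No JSR-223" branch, which is the situation the breaking change describes.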

### CRAM 3.1 Write Support

- Enable CRAM 3.1 writing with all spec codecs: rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE
- Add configurable compression profiles (FAST, NORMAL, SMALL, ARCHIVE) with trial compression for automatic codec selection
- Implement `TrialCompressor` to replace ad-hoc triple-compression for tags and align trial candidates with htslib
- Add `GzipCodec` for direct Deflater/Inflater GZIP compression, wired into CRAM as a codec option
- Strip NM/MD tags on CRAM encode and regenerate on decode, matching htslib behavior
- Implement attached (same-slice) mate pair resolution
- Align DataSeries content IDs with htslib for cross-implementation debugging
- Remove content digest tags (BD/SD/B5/S5/B1/S1) from CRAM slice headers, matching htslib/samtools behavior (see Breaking changes)
- Default CRAM version for writing is now 3.1 (was 3.0; see Breaking changes)
- Add `CramConverter` command-line tool for testing and benchmarking CRAM write profiles
- Add cross-implementation CRAM validation pipeline (`validation/`) for round-tripping against samtools/htslib
- Add bases-per-slice threshold to bound slice memory when writing long reads
- Refine `CompressionHeader` map serialization
- Resolve a pile of in-tree `TODO`s in CRAM structure classes
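
To make the idea of trial compression concrete, here is a minimal sketch under stated assumptions: it compresses the same block with a handful of candidate settings and keeps whichever output is smallest. It uses only the JDK's `Deflater`/`Inflater` with different levels and strategies as stand-in candidates; the real CRAM 3.1 candidates (rANS Nx16, Range coder, FQZComp, and so on) and htsjdk's `TrialCompressor` API are not reproduced here, and `TrialCompressionSketch` is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical illustration of trial compression; not htsjdk's TrialCompressor. */
final class TrialCompressionSketch {

    /** Compresses input with one DEFLATE level/strategy pair. */
    static byte[] deflate(byte[] input, int level, int strategy) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setStrategy(strategy);
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            while (!deflater.finished()) {
                int n = deflater.deflate(buffer);
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib resources
        }
    }

    /** Tries every candidate and returns the smallest compressed form. */
    static byte[] smallest(byte[] input) {
        int[][] candidates = {
            {Deflater.BEST_SPEED, Deflater.DEFAULT_STRATEGY},
            {Deflater.BEST_COMPRESSION, Deflater.DEFAULT_STRATEGY},
            {Deflater.BEST_COMPRESSION, Deflater.FILTERED},
            {Deflater.BEST_COMPRESSION, Deflater.HUFFMAN_ONLY},
        };
        byte[] best = null;
        for (int[] candidate : candidates) {
            byte[] trial = deflate(input, candidate[0], candidate[1]);
            if (best == null || trial.length < best.length) {
                best = trial;
            }
        }
        return best;
    }

    /** Decompresses a zlib stream whose uncompressed length is known. */
    static byte[] inflate(byte[] compressed, int originalLength) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[originalLength];
            int offset = 0;
            while (offset < originalLength && !inflater.finished()) {
                int n = inflater.inflate(out, offset, originalLength - offset);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated or unsupported stream; stop rather than spin
                }
                offset += n;
            }
            return out;
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt DEFLATE stream", e);
        } finally {
            inflater.end();
        }
    }
}
```

The trade-off is the usual one: each extra candidate costs another compression pass, which is presumably why the faster profiles would try fewer of them than the smaller ones.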

### CRAM correctness and cross-implementation fixes

These fixes apply to both reading and writing CRAM and substantially improve interoperability with samtools/htslib.

- Fix CRAM `TLEN` computation to match htslib (cross-tool comparisons of the same input now produce matching `TLEN` values)
- Fix `CIGAR` reconstruction when the sequence is `*` (`CF_UNKNOWN_BASES`)
- Fix `=`/`X` `CIGAR` op comparison in cross-implementation tests
- Fix CRAM archive header overflow on large containers
- Fix crash when reading a CRAM container with no slices
- Fix unmapped-read query in the hts-specs compliance harness
- Document the supplementary/secondary read-name resolution limitation in the writer

### Codec and Compression Optimizations

- Refactor and optimize all rANS codecs: byte-array API, backwards-write encoding, and general simplifications
- Optimize Name Tokeniser encoder: replace regex with hand-written parser; add per-type flags, STRIPE support, stream deduplication, and all-MATCH elimination
- Optimize FQZComp, Range coder, and rANS encoder hot paths
- Tune NORMAL profile codec assignments based on empirical compression testing

### Performance

- Integrate [jlibdeflate](https://github.com/fulcrumgenomics/jlibdeflate) for native libdeflate-backed DEFLATE compression and decompression. Used by default; falls back to the JDK Deflater/Inflater if the native library cannot be loaded (#1768)
- A few targeted optimizations to the BAM decoding path yielding ~6-7% improvement in BAM read performance (#1764)
- Replace `ByteArrayInputStream`/`ByteArrayOutputStream` with unsynchronized `CRAMByteReader`/`CRAMByteWriter` to eliminate synchronization overhead in CRAM
- Fuse read base restoration, CIGAR building, and NM/MD computation into a single pass during CRAM decode
- Cache tag key metadata to eliminate per-record `String` allocation during CRAM decode
- Pool `RANSNx16Decode` instances in the Name Tokeniser
- Optimize BAM nibble-to-ASCII base decoding with a bulk lookup table
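
The last item can be illustrated with a small sketch. BAM packs two bases per byte as 4-bit codes indexing into `=ACMGRSVTWYHKDBN`, high nibble first. A bulk lookup table precomputes both ASCII bases for each of the 256 possible byte values, so decoding needs one table index per packed byte instead of two shift-and-mask lookups. This is an assumed shape for such a table, not htsjdk's actual code, and `NibbleBaseDecoderSketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of table-driven BAM base decoding; not htsjdk's implementation. */
final class NibbleBaseDecoderSketch {

    /** The 16 base codes defined by the BAM format, indexed by 4-bit value. */
    private static final byte[] CODES = "=ACMGRSVTWYHKDBN".getBytes(StandardCharsets.US_ASCII);

    /** For each packed byte value b, PAIRS[2b] and PAIRS[2b + 1] hold its two ASCII bases. */
    private static final byte[] PAIRS = new byte[512];

    static {
        for (int b = 0; b < 256; b++) {
            PAIRS[2 * b] = CODES[b >>> 4];       // high nibble is the first base
            PAIRS[2 * b + 1] = CODES[b & 0x0F];  // low nibble is the second base
        }
    }

    /** Decodes baseCount bases from a packed array holding at least (baseCount + 1) / 2 bytes. */
    static byte[] decode(byte[] packed, int baseCount) {
        byte[] out = new byte[baseCount];
        int fullBytes = baseCount / 2;
        for (int i = 0; i < fullBytes; i++) {
            int index = (packed[i] & 0xFF) * 2;
            out[2 * i] = PAIRS[index];
            out[2 * i + 1] = PAIRS[index + 1];
        }
        if ((baseCount & 1) == 1) {
            // An odd base count leaves one base in the high nibble of the final byte.
            out[baseCount - 1] = PAIRS[(packed[fullBytes] & 0xFF) * 2];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] packed = {0x12, 0x48, (byte) 0xF0};
        System.out.println(new String(decode(packed, 5), StandardCharsets.US_ASCII));
    }
}
```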

### Bug fixes

- Fix LTF8 9-byte write bug: wrong bit shift (`>> 28` instead of `>> 24`) corrupted the high byte of large CRAM offsets (#1765)
- Fix `SamLocusIterator` so that read position is not incorrectly offset (#1758)
- Fix asymmetric `SamPairUtil.getPairOrientation` on dovetail pairs (#1771)
- Catch `UnsatisfiedLinkError` when loading the snappy native library so failure to load it does not abort downstream consumers (#1753)
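
To see why a single wrong shift in the LTF8 fix is enough to corrupt a value, consider this sketch of a 9-byte encoding: a `0xFF` marker followed by the 64-bit value, most significant byte first. The layout follows the general LTF8 9-byte form; which payload byte the htsjdk bug affected is an assumption here (the sketch treats it as the byte that should carry bits 24-31), and `Ltf8NineByteSketch` is a hypothetical class, not htsjdk's `LTF8` code.

```java
/** Hypothetical illustration of a 9-byte LTF8-style encoding; not htsjdk's LTF8 class. */
final class Ltf8NineByteSketch {

    /**
     * Writes a 0xFF marker plus eight payload bytes, big-endian.
     * shiftForBits24To31 should be 24; passing 28 reproduces the kind of bug described above.
     */
    static byte[] encode(long value, int shiftForBits24To31) {
        int[] shifts = {56, 48, 40, 32, shiftForBits24To31, 16, 8, 0};
        byte[] out = new byte[9];
        out[0] = (byte) 0xFF;
        for (int i = 0; i < shifts.length; i++) {
            out[i + 1] = (byte) (value >> shifts[i]); // the cast keeps the low 8 bits
        }
        return out;
    }

    /** Reads the eight payload bytes back into a long. */
    static long decode(byte[] encoded) {
        long value = 0;
        for (int i = 1; i <= 8; i++) {
            value = (value << 8) | (encoded[i] & 0xFFL);
        }
        return value;
    }

    public static void main(String[] args) {
        long offset = 0x123456789ABCDEF0L;
        System.out.println(Long.toHexString(decode(encode(offset, 24)))); // correct shift
        System.out.println(Long.toHexString(decode(encode(offset, 28)))); // buggy shift
    }
}
```

With the wrong shift, one payload byte straddles two neighbouring bytes of the original value, so the decoded number differs from the input whenever bits 24-35 are not all identical nibbles. Small values, whose affected bits are all zero, round-trip unchanged, which is consistent with the bug only showing up on large offsets.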

### Build, tooling, and dependency clean-up

- **Code formatting:** apply [Palantir Java Format](https://github.com/palantir/palantir-java-format) to the entire codebase and enforce it on every build via [Spotless](https://github.com/diffplug/spotless). `compileJava` auto-formats source in place; CI separately runs `spotlessCheck` as the enforcement boundary. See `CONTRIBUTING.md` for details, including the `.git-blame-ignore-revs` opt-in for the bulk-format commit (#1761)
- **Maven Central publishing migrated** from the legacy OSSRH endpoint to the new [Sonatype Central Portal](https://central.sonatype.com), via the [NMCP Gradle plugin](https://github.com/GradleUp/nmcp). Consumer-visible groupId/artifactId/version coordinates are unchanged (#1769)
- **Snapshot versioning** now embeds the short commit hash (e.g. `5.0.0-23c681a-SNAPSHOT`) so each snapshot is a distinct, pinnable artifact rather than a moving Maven SNAPSHOT (#1772)
- **Test runner** now correctly reports failures rather than silently skipping them when a `@DataProvider` throws (#1759)
- **Existing API deprecations** cleaned up across `htsjdk.samtools` and `htsjdk.variant` (#1767)
- **`commons-logging` direct declaration removed.** htsjdk does not use commons-logging itself; the version pin is now expressed as a Gradle dependency constraint and only kicks in transitively when JEXL pulls it
- **Nashorn moved to `compileOnly`** — see Breaking changes
- **`gov.nih.nlm.ncbi:ngs-java` removed** — see Breaking changes (SRA support)

### Compatibility

- Compiled and tested against JDK 17 (CI default), 21, and 24. CI continues to build only on 17. htsjdk's published minimum remains Java 17 (set in 4.0.0)

### Testing and Infrastructure

- Add hts-specs CRAM 3.0 / 3.1 decode-compliance tests, plus FQZComp round-trip tests using hts-specs quality data
- Add CRAI index query correctness tests and codec round-trip property tests
- Split CRAM 3.1 fidelity tests into per-profile classes for parallel execution
- Speed up BCF2 and SeekableStream integration tests; cache test data in CRAM index test classes
- Reduce `CRAMFileBAIIndexTest` from 4 to 2 slice-size variants, sampling every 200th
- Downsample the CEUTrio test CRAM from ~654K to ~150K records (47 MB → 11 MB)
- Reduce memory pressure in unit tests to eliminate OOM failures
- Fix thread-safety bug in `VariantContextTestProvider` causing non-deterministic test counts
- Bulk up the JavaScript filter test suites: replace 4 checked-in `.js` fixtures with 46 small inline-script tests covering all three constructors, return-type semantics, bindings, and error paths (#1775)

---

## 4.3.0 (2025-05-09)

Completes CRAM 3.1 read support by wiring the codec implementations (added in 4.2.0) into