Skip to content

fix(bam): subtract Phred+33 offset from QUAL before writing to BAM#38

Open
pinin4fjords wants to merge 2 commits into
scverse:mainfrom
pinin4fjords:fix/bam-qual-phred-offset
Open

fix(bam): subtract Phred+33 offset from QUAL before writing to BAM#38
pinin4fjords wants to merge 2 commits into
scverse:mainfrom
pinin4fjords:fix/bam-qual-phred-offset

Conversation

@pinin4fjords
Copy link
Copy Markdown

Summary

The BAM binary QUAL field stores raw Phred values (0-93) per SAM spec §4.2.3. rustar was passing FASTQ ASCII bytes (Phred+33 encoded) straight to noodles::sam::QualityScores::from, which expects raw Phred. Result: every QUAL byte in a rustar BAM was +33 above the spec value; samtools view re-added 33 for the SAM text rendering, yielding biologically impossible quality strings.

samtools stats average_quality was reading 68.3 vs STAR's 35.3 on the nf-core/rnaseq test profile (consistent +33 across every sample); error_rate was 0 (compounding the NM/nM bug in #29).

Fix

Subtract 33 from each FASTQ ASCII byte at every BAM-write call site before constructing the QualityScores. Uses saturating_sub so malformed FASTQ bytes < 33 clamp to 0 rather than underflowing.

Three call sites in src/io/sam.rs (SE genome, PE genome, transcriptome) plus the doc comment on EncodedRead.quality in src/io/fastq.rs updated to flag the offset.

Test plan

  • New unit test asserts a FASTQ ASCII quality b"III" produces BAM Phred bytes [40, 40, 40]
  • Existing unit tests pass
  • cargo build
  • cargo clippy --all-targets -- -D warnings
  • cargo fmt --check

After this fix, samtools stats <rustar.bam> | grep 'average quality' returns the same value as STAR on the same FASTQ inputs.

Fixes #34

The BAM binary QUAL field stores raw Phred values (0-93) per SAM spec
§4.2.3. rustar was passing FASTQ ASCII bytes (Phred+33 encoded, e.g.
'B' = ASCII 66 = Phred 33) straight to noodles' QualityScores::from,
which expects raw Phred values. Every QUAL byte in a rustar BAM was
therefore +33 above what the spec mandates; samtools view rendered
the SAM text with another +33 on top, producing biologically
impossible quality strings like 'cccgg...'.

Subtract 33 from each FASTQ ASCII byte at every BAM-write call site
before constructing the QualityScores. Use saturating_sub so a
malformed FASTQ byte < 33 clamps to 0 rather than underflowing.

samtools stats average_quality on the nf-core/rnaseq test profile now
reads 35.3 (matching STAR) rather than the previous 68.3.

Fixes scverse#34
@pinin4fjords
Copy link
Copy Markdown
Author

Verified end-to-end on macOS/aarch64 against the rebuilt fix branch.

samtools view <bam> first-record QUAL string + samtools stats:

QUAL: B<BBB<FF<FBF////<<FF/BFFB/<<F<F/BFFF</F<FFFF/<FF<</B<BFF/<FF
SN  average quality:  35.5

Pre-fix on the same input the same first record showed cccccggggcccg... (every byte +33 above STAR's) and average quality: 68.5. After the fix the QUAL string is in the realistic Illumina range (ASCII 47-70 = Phred 14-37) and matches STAR's mean (35.5) on the same FASTQ.

The samtools stats mismatches: 0 / error rate: 0 lines are still present — those are a downstream symptom of #29 (NM/nM tag swap), separate from this fix. With the NM tag right (or once #29 lands) those will also populate.

LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BAM QUAL field is offset by +33 (Phred+33 ASCII bytes written instead of raw Phred values)

1 participant