Skip to content

BAM QUAL field is offset by +33 (Phred+33 ASCII bytes written instead of raw Phred values) #34

@pinin4fjords

Description

@pinin4fjords

Summary

rustar-aligner's BAM writer stores the raw FASTQ ASCII quality bytes (Phred+33 encoded) in the BAM's binary QUAL field, where the SAM spec mandates raw Phred values (0-93, no offset). Every QUAL byte in a rustar BAM is therefore +33 above the SAM-spec value. samtools view then adds another +33 when rendering as SAM text, so what should be BBBBBFFFFBBBF... (Phred 33-37, typical Illumina) comes out as cccccggggcccg... (would imply Phred 66-70, biologically impossible on any real sequencer).

Visible downstream symptoms in MultiQC and samtools stats:

Metric (samtools stats) STAR rustar Delta
average quality 35.3-35.5 68.3-68.5 +33 (consistent across 5 samples)
error rate ~0.0091 0.0 (companion symptom of #29 NM/nM, separate bug)

average_quality doubling was the smoking-gun signal that flagged this as something distinct from the NM tag bug.

SAM spec reference

From SAM v1 spec §1.4, QUAL field:

ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format). A base quality is the phred-scaled base error probability which equals -10 log10 Pr{base is wrong}.

The SAM text format ASCII-encodes with the +33 offset; the BAM binary format stores the raw Phred values without the offset (SAM spec §4.2.3):

qual: phred base quality (a sequence of 0xFF if absent) ... ranges from 0 to 93.

So a FASTQ char B (ASCII 66) represents Phred 33. The corresponding BAM byte should be 0x21 (33), and samtools view re-adds 33 to display it as B.

rustar appears to write 0x42 (66) directly into the BAM binary, so samtools view displays 66+33 = 99 = 'c', and samtools stats reads the byte as Phred 66 -> doubles the average.

Root cause in rustar

src/io/fastq.rs:55-60 documents the in-memory EncodedRead.quality as the raw FASTQ ASCII bytes:

/// A read from a FASTQ file with encoded bases
#[derive(Debug, Clone)]
pub struct EncodedRead {
    pub name: String,
    pub sequence: Vec<u8>,
    /// Quality scores (raw FASTQ quality values)
    pub quality: Vec<u8>,
}

src/io/sam.rs:632-637 passes those raw FASTQ bytes straight into noodles::sam's QualityScores::from:

let mut qual = read_qual.to_vec();
qual.reverse();
*record.quality_scores_mut() = QualityScores::from(qual);
// (and the non-reverse branch at L639)
*record.quality_scores_mut() = QualityScores::from(read_qual.to_vec());

QualityScores::from(Vec<u8>) in noodles is a thin wrapper that does not apply the Phred+33 offset — it expects the bytes to already be raw Phred values (matching the BAM spec). Net result: FASTQ ASCII bytes go straight to the BAM binary QUAL field, +33 above what they should be.

Reproducer (paired STAR + rustar)

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-qual && cd /tmp/rustar-mre-qual

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

# Pick a read present in both BAMs and compare QUAL strings
READ=SRR6357072.7414694
echo "=== Same read $READ, STAR QUAL ==="
docker run --rm -v $PWD:/w $STAR bash -c \
    "samtools view /w/STAR.Aligned.out.bam | awk -v n=$READ '\$1==n {print \$11; exit}'"
echo "=== Same read $READ, rustar QUAL ==="
docker run --rm -v $PWD:/w $STAR bash -c \
    "samtools view /w/RUS./Aligned.out.bam | awk -v n=$READ '\$1==n {print \$11; exit}'"

echo
echo "=== samtools stats: average_quality ==="
echo -n "rustar: "; docker run --rm -v $PWD:/w $STAR samtools stats /w/RUS./Aligned.out.bam | grep '^SN.*average quality:' | awk '{print $NF}'
echo -n "STAR  : "; docker run --rm -v $PWD:/w $STAR samtools stats /w/STAR.Aligned.out.bam | grep '^SN.*average quality:' | awk '{print $NF}'

Observed (verified on commit 5f8ad08 + STAR 2.7.11b)

Same read SRR6357072.7414694 in both BAMs:

=== STAR QUAL ===
BBBBBFFFFBBBFFFBBFFFFFFBBBF<<FBBFBFFF<<7FFFF/<FB7BF/</<FBFFFFFFFF</7FBFFBFBFF<F////FFFB/FFFFFFFFFFFB/

=== rustar QUAL ===
cccccggggcccgggccggggggcccg]]gccgcggg]]XggggP]gcXcgP]P]gcgggggggg]PXgcggcgcgg]gPPPPgggcPgggggggggggcP

Every byte in rustar's string is exactly 33 above the corresponding STAR byte:

STAR char (Phred) rustar char (samtools view shows) ASCII diff
B (33) c (66) +33
F (37) g (70) +33
< (27) ] (60) +33
7 (22) X (55) +33
/ (14) P (47) +33

samtools stats average quality across the five nf-core/rnaseq test profile samples:

Sample STAR rustar
WT_REP1 35.4 68.4
WT_REP2 35.5 68.5
RAP1_IAA_30M_REP1 35.4 68.4
RAP1_UNINDUCED_REP1 35.3 68.3
RAP1_UNINDUCED_REP2 35.3 68.3

Consistent +33 across every sample, every record.

Suggested fix

In src/io/sam.rs:632-640 (and the analogous transcriptome-path code in build_transcriptome_records around L611-L640), subtract the Phred+33 offset before handing the buffer to noodles:

// FASTQ-ASCII -> raw Phred for BAM binary QUAL
let qual_raw: Vec<u8> = read_qual.iter().map(|&b| b.saturating_sub(33)).collect();
*record.quality_scores_mut() = QualityScores::from(qual_raw);

(saturating_sub because a malformed FASTQ byte < 33 should clamp to 0, not underflow.)

Same conversion needed wherever EncodedRead.quality is plumbed into a noodles RecordBuf. Three call sites in src/io/sam.rs need updating (genome-PE, genome-SE, transcriptome).

A worthwhile companion change: rename the field doc to be explicit, e.g. quality: Vec<u8> doc comment to "Quality scores (FASTQ ASCII bytes, Phred+33 encoded — subtract 33 before writing to BAM)". The current comment "raw FASTQ quality values" is ambiguous — the value is "raw FASTQ", but FASTQ's raw values are already +33 over Phred.

Test plan

  1. Run rustar on any paired-end FASTQ.
  2. Assert samtools stats <bam> average quality is in the realistic Illumina range (25-42), not 60+.
  3. Assert per-byte deltas vs STAR are zero on a record present in both BAMs.
  4. Add a small unit test in src/io/sam.rs that feeds known FASTQ ASCII quality (e.g. b"III" = Phred 40,40,40) and asserts the resulting RecordBuf quality bytes are [40, 40, 40].

Why this matters

  • samtools stats is consumed by every modern QC pipeline. With QUAL bytes +33 too high, average quality doubles, error rate is wrong (compounding the #29 NM/nM zero-error symptom), and every quality histogram is shifted.
  • MultiQC displays Samtools Stats numbers prominently; the "average quality 68" entry is biologically impossible and will be flagged by any operator familiar with Illumina output.
  • Quality-aware variant callers (bcftools, GATK, DeepVariant) treat QUAL bytes as Phred values directly. A +33 offset elevates every base score past the calibration range these callers expect; some will silently emit wrong likelihoods, others will clip at 93 and lose information.
  • Quality-aware trimmers that operate on the BAM (rather than re-trimming from FASTQ) will see all bases as "high quality" and do nothing.

This and #29 NM/nM together make samtools stats and any quality-aware downstream tool unsafe to use on rustar BAMs until both are fixed.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions