Summary
rustar-aligner's BAM writer stores the raw FASTQ ASCII quality bytes (Phred+33 encoded) in the BAM's binary QUAL field, where the SAM spec mandates raw Phred values (0-93, no offset). Every QUAL byte in a rustar BAM is therefore +33 above the SAM-spec value. samtools view then adds another +33 when rendering as SAM text, so what should be BBBBBFFFFBBBF... (Phred 33-37, typical Illumina) comes out as cccccggggcccg... (would imply Phred 66-70, biologically impossible on any real sequencer).
Visible downstream symptoms in MultiQC and samtools stats:
| Metric (samtools stats) |
STAR |
rustar |
Delta |
average quality |
35.3-35.5 |
68.3-68.5 |
+33 (consistent across 5 samples) |
error rate |
~0.0091 |
0.0 |
(companion symptom of #29 NM/nM, separate bug) |
average_quality doubling was the smoking-gun signal that flagged this as something distinct from the NM tag bug.
SAM spec reference
From SAM v1 spec §1.4, QUAL field:
ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format). A base quality is the phred-scaled base error probability which equals -10 log10 Pr{base is wrong}.
The SAM text format ASCII-encodes with the +33 offset; the BAM binary format stores the raw Phred values without the offset (SAM spec §4.2.3):
qual: phred base quality (a sequence of 0xFF if absent) ... ranges from 0 to 93.
So a FASTQ char B (ASCII 66) represents Phred 33. The corresponding BAM byte should be 0x21 (33), and samtools view re-adds 33 to display it as B.
rustar appears to write 0x42 (66) directly into the BAM binary, so samtools view displays 66+33 = 99 = 'c', and samtools stats reads the byte as Phred 66 -> doubles the average.
Root cause in rustar
src/io/fastq.rs:55-60 documents the in-memory EncodedRead.quality as the raw FASTQ ASCII bytes:
/// A read from a FASTQ file with encoded bases
#[derive(Debug, Clone)]
pub struct EncodedRead {
pub name: String,
pub sequence: Vec<u8>,
/// Quality scores (raw FASTQ quality values)
pub quality: Vec<u8>,
}
src/io/sam.rs:632-637 passes those raw FASTQ bytes straight into noodles::sam's QualityScores::from:
let mut qual = read_qual.to_vec();
qual.reverse();
*record.quality_scores_mut() = QualityScores::from(qual);
// (and the non-reverse branch at L639)
*record.quality_scores_mut() = QualityScores::from(read_qual.to_vec());
QualityScores::from(Vec<u8>) in noodles is a thin wrapper that does not apply the Phred+33 offset — it expects the bytes to already be raw Phred values (matching the BAM spec). Net result: FASTQ ASCII bytes go straight to the BAM binary QUAL field, +33 above what they should be.
Reproducer (paired STAR + rustar)
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-qual && cd /tmp/rustar-mre-qual
BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz
RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4
mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
--genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
--genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
--runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
--outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
--outSAMattrRGline ID:WT_REP2 SM:WT_REP2)
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
--genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
--genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.
# Pick a read present in both BAMs and compare QUAL strings
READ=SRR6357072.7414694
echo "=== Same read $READ, STAR QUAL ==="
docker run --rm -v $PWD:/w $STAR bash -c \
"samtools view /w/STAR.Aligned.out.bam | awk -v n=$READ '\$1==n {print \$11; exit}'"
echo "=== Same read $READ, rustar QUAL ==="
docker run --rm -v $PWD:/w $STAR bash -c \
"samtools view /w/RUS./Aligned.out.bam | awk -v n=$READ '\$1==n {print \$11; exit}'"
echo
echo "=== samtools stats: average_quality ==="
echo -n "rustar: "; docker run --rm -v $PWD:/w $STAR samtools stats /w/RUS./Aligned.out.bam | grep '^SN.*average quality:' | awk '{print $NF}'
echo -n "STAR : "; docker run --rm -v $PWD:/w $STAR samtools stats /w/STAR.Aligned.out.bam | grep '^SN.*average quality:' | awk '{print $NF}'
Observed (verified on commit 5f8ad08 + STAR 2.7.11b)
Same read SRR6357072.7414694 in both BAMs:
=== STAR QUAL ===
BBBBBFFFFBBBFFFBBFFFFFFBBBF<<FBBFBFFF<<7FFFF/<FB7BF/</<FBFFFFFFFF</7FBFFBFBFF<F////FFFB/FFFFFFFFFFFB/
=== rustar QUAL ===
cccccggggcccgggccggggggcccg]]gccgcggg]]XggggP]gcXcgP]P]gcgggggggg]PXgcggcgcgg]gPPPPgggcPgggggggggggcP
Every byte in rustar's string is exactly 33 above the corresponding STAR byte:
| STAR char (Phred) |
rustar char (samtools view shows) |
ASCII diff |
B (33) |
c (66) |
+33 |
F (37) |
g (70) |
+33 |
< (27) |
] (60) |
+33 |
7 (22) |
X (55) |
+33 |
/ (14) |
P (47) |
+33 |
samtools stats average quality across the five nf-core/rnaseq test profile samples:
| Sample |
STAR |
rustar |
| WT_REP1 |
35.4 |
68.4 |
| WT_REP2 |
35.5 |
68.5 |
| RAP1_IAA_30M_REP1 |
35.4 |
68.4 |
| RAP1_UNINDUCED_REP1 |
35.3 |
68.3 |
| RAP1_UNINDUCED_REP2 |
35.3 |
68.3 |
Consistent +33 across every sample, every record.
Suggested fix
In src/io/sam.rs:632-640 (and the analogous transcriptome-path code in build_transcriptome_records around L611-L640), subtract the Phred+33 offset before handing the buffer to noodles:
// FASTQ-ASCII -> raw Phred for BAM binary QUAL
let qual_raw: Vec<u8> = read_qual.iter().map(|&b| b.saturating_sub(33)).collect();
*record.quality_scores_mut() = QualityScores::from(qual_raw);
(saturating_sub because a malformed FASTQ byte < 33 should clamp to 0, not underflow.)
Same conversion needed wherever EncodedRead.quality is plumbed into a noodles RecordBuf. Three call sites in src/io/sam.rs need updating (genome-PE, genome-SE, transcriptome).
A worthwhile companion change: rename the field doc to be explicit, e.g. quality: Vec<u8> doc comment to "Quality scores (FASTQ ASCII bytes, Phred+33 encoded — subtract 33 before writing to BAM)". The current comment "raw FASTQ quality values" is ambiguous — the value is "raw FASTQ", but FASTQ's raw values are already +33 over Phred.
Test plan
- Run rustar on any paired-end FASTQ.
- Assert
samtools stats <bam> average quality is in the realistic Illumina range (25-42), not 60+.
- Assert per-byte deltas vs STAR are zero on a record present in both BAMs.
- Add a small unit test in
src/io/sam.rs that feeds known FASTQ ASCII quality (e.g. b"III" = Phred 40,40,40) and asserts the resulting RecordBuf quality bytes are [40, 40, 40].
Why this matters
samtools stats is consumed by every modern QC pipeline. With QUAL bytes +33 too high, average quality doubles, error rate is wrong (compounding the #29 NM/nM zero-error symptom), and every quality histogram is shifted.
- MultiQC displays Samtools Stats numbers prominently; the "average quality 68" entry is biologically impossible and will be flagged by any operator familiar with Illumina output.
- Quality-aware variant callers (bcftools, GATK, DeepVariant) treat QUAL bytes as Phred values directly. A +33 offset elevates every base score past the calibration range these callers expect; some will silently emit wrong likelihoods, others will clip at 93 and lose information.
- Quality-aware trimmers that operate on the BAM (rather than re-trimming from FASTQ) will see all bases as "high quality" and do nothing.
This and #29 NM/nM together make samtools stats and any quality-aware downstream tool unsafe to use on rustar BAMs until both are fixed.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.
Summary
rustar-aligner's BAM writer stores the raw FASTQ ASCII quality bytes (Phred+33 encoded) in the BAM's binary QUAL field, where the SAM spec mandates raw Phred values (0-93, no offset). Every QUAL byte in a rustar BAM is therefore +33 above the SAM-spec value.samtools viewthen adds another +33 when rendering as SAM text, so what should beBBBBBFFFFBBBF...(Phred 33-37, typical Illumina) comes out ascccccggggcccg...(would imply Phred 66-70, biologically impossible on any real sequencer).Visible downstream symptoms in MultiQC and
samtools stats:average qualityerror rateaverage_qualitydoubling was the smoking-gun signal that flagged this as something distinct from the NM tag bug.SAM spec reference
From SAM v1 spec §1.4, QUAL field:
The SAM text format ASCII-encodes with the
+33offset; the BAM binary format stores the raw Phred values without the offset (SAM spec §4.2.3):So a FASTQ char
B(ASCII 66) represents Phred 33. The corresponding BAM byte should be0x21(33), andsamtools viewre-adds 33 to display it asB.rustar appears to write
0x42(66) directly into the BAM binary, sosamtools viewdisplays66+33 = 99 = 'c', andsamtools statsreads the byte as Phred 66 -> doubles the average.Root cause in rustar
src/io/fastq.rs:55-60documents the in-memoryEncodedRead.qualityas the raw FASTQ ASCII bytes:src/io/sam.rs:632-637passes those raw FASTQ bytes straight intonoodles::sam'sQualityScores::from:QualityScores::from(Vec<u8>)in noodles is a thin wrapper that does not apply the Phred+33 offset — it expects the bytes to already be raw Phred values (matching the BAM spec). Net result: FASTQ ASCII bytes go straight to the BAM binary QUAL field, +33 above what they should be.Reproducer (paired STAR + rustar)
Observed (verified on commit
5f8ad08+ STAR 2.7.11b)Same read
SRR6357072.7414694in both BAMs:Every byte in rustar's string is exactly 33 above the corresponding STAR byte:
B(33)c(66)F(37)g(70)<(27)](60)7(22)X(55)/(14)P(47)samtools stats average qualityacross the fivenf-core/rnaseqtest profile samples:Consistent +33 across every sample, every record.
Suggested fix
In
src/io/sam.rs:632-640(and the analogous transcriptome-path code inbuild_transcriptome_recordsaround L611-L640), subtract the Phred+33 offset before handing the buffer to noodles:(
saturating_subbecause a malformed FASTQ byte < 33 should clamp to 0, not underflow.)Same conversion needed wherever
EncodedRead.qualityis plumbed into a noodlesRecordBuf. Three call sites insrc/io/sam.rsneed updating (genome-PE, genome-SE, transcriptome).A worthwhile companion change: rename the field doc to be explicit, e.g.
quality: Vec<u8>doc comment to "Quality scores (FASTQ ASCII bytes, Phred+33 encoded — subtract 33 before writing to BAM)". The current comment "raw FASTQ quality values" is ambiguous — the value is "raw FASTQ", but FASTQ's raw values are already +33 over Phred.Test plan
samtools stats <bam>average qualityis in the realistic Illumina range (25-42), not 60+.src/io/sam.rsthat feeds known FASTQ ASCII quality (e.g.b"III"= Phred 40,40,40) and asserts the resultingRecordBufquality bytes are[40, 40, 40].Why this matters
samtools statsis consumed by every modern QC pipeline. With QUAL bytes +33 too high,average qualitydoubles,error rateis wrong (compounding the #29 NM/nM zero-error symptom), and every quality histogram is shifted.This and #29 NM/nM together make
samtools statsand any quality-aware downstream tool unsafe to use on rustar BAMs until both are fixed.Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise:
author:pinin4fjordsor grep fornf-core/rnaseq#1855.