BAM `QUAL` field is offset by +33 (Phred+33 ASCII bytes written instead of raw Phred values)

## Summary

`rustar-aligner`'s BAM writer stores the **raw FASTQ ASCII quality bytes (Phred+33 encoded)** in the BAM's binary QUAL field, where the SAM spec mandates **raw Phred values (0-93, no offset)**. Every QUAL byte in a rustar BAM is therefore +33 above the SAM-spec value. `samtools view` then adds another +33 when rendering as SAM text, so what should be `BBBBBFFFFBBBF...` (Phred 33-37, typical Illumina) comes out as `cccccggggcccg...` (would imply Phred 66-70, biologically impossible on any real sequencer).

Visible downstream symptoms in MultiQC and `samtools stats`:

| Metric (samtools stats)          | STAR    | rustar  | Delta            |
|----------------------------------|--------:|--------:|------------------|
| `average quality`                | 35.3-35.5 | 68.3-68.5 | **+33 (consistent across 5 samples)** |
| `error rate`                     | ~0.0091 | 0.0     | (companion symptom of [#29 NM/nM](https://github.com/scverse/rustar-aligner/issues/29), separate bug) |

`average_quality` doubling was the smoking-gun signal that flagged this as something distinct from the NM tag bug.

## SAM spec reference

From [SAM v1 spec §1.4](https://samtools.github.io/hts-specs/SAMv1.pdf), QUAL field:

> ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format). A base quality is the phred-scaled base error probability which equals -10 log10 Pr{base is wrong}.

The SAM text format ASCII-encodes with the `+33` offset; **the BAM binary format stores the raw Phred values without the offset** ([SAM spec §4.2.3](https://samtools.github.io/hts-specs/SAMv1.pdf)):

> qual: phred base quality (a sequence of 0xFF if absent) ... ranges from 0 to 93.

So a FASTQ char `B` (ASCII 66) represents Phred 33. The corresponding BAM byte should be `0x21` (33), and `samtools view` re-adds 33 to display it as `B`.

rustar appears to write `0x42` (66) directly into the BAM binary, so `samtools view` displays `66+33 = 99 = 'c'`, and `samtools stats` reads the byte as Phred 66 -> doubles the average.

## Root cause in rustar

[`src/io/fastq.rs:55-60`](https://github.com/scverse/rustar-aligner/blob/main/src/io/fastq.rs) documents the in-memory `EncodedRead.quality` as the raw FASTQ ASCII bytes:

```rust
/// A read from a FASTQ file with encoded bases
#[derive(Debug, Clone)]
pub struct EncodedRead {
    pub name: String,
    pub sequence: Vec<u8>,
    /// Quality scores (raw FASTQ quality values)
    pub quality: Vec<u8>,
}
```

[`src/io/sam.rs:632-637`](https://github.com/scverse/rustar-aligner/blob/main/src/io/sam.rs) passes those raw FASTQ bytes straight into `noodles::sam`'s `QualityScores::from`:

```rust
let mut qual = read_qual.to_vec();
qual.reverse();
*record.quality_scores_mut() = QualityScores::from(qual);
// (and the non-reverse branch at L639)
*record.quality_scores_mut() = QualityScores::from(read_qual.to_vec());
```

`QualityScores::from(Vec<u8>)` in noodles is a thin wrapper that does **not** apply the Phred+33 offset — it expects the bytes to already be raw Phred values (matching the BAM spec). Net result: FASTQ ASCII bytes go straight to the BAM binary QUAL field, +33 above what they should be.

## Reproducer (paired STAR + rustar)

```bash
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-qual && cd /tmp/rustar-mre-qual

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

# Pick a read present in both BAMs and compare QUAL strings
READ=SRR6357072.7414694
echo "=== Same read $READ, STAR QUAL ==="
docker run --rm -v $PWD:/w $STAR bash -c \
    "samtools view /w/STAR.Aligned.out.bam | awk -v n=$READ '\$1==n {print \$11; exit}'"
echo "=== Same read $READ, rustar QUAL ==="
docker run --rm -v $PWD:/w $STAR bash -c \
    "samtools view /w/RUS./Aligned.out.bam | awk -v n=$READ '\$1==n {print \$11; exit}'"

echo
echo "=== samtools stats: average_quality ==="
echo -n "rustar: "; docker run --rm -v $PWD:/w $STAR samtools stats /w/RUS./Aligned.out.bam | grep '^SN.*average quality:' | awk '{print $NF}'
echo -n "STAR  : "; docker run --rm -v $PWD:/w $STAR samtools stats /w/STAR.Aligned.out.bam | grep '^SN.*average quality:' | awk '{print $NF}'
```

## Observed (verified on commit `5f8ad08` + STAR 2.7.11b)

Same read `SRR6357072.7414694` in both BAMs:

```
=== STAR QUAL ===
BBBBBFFFFBBBFFFBBFFFFFFBBBF<<FBBFBFFF<<7FFFF/<FB7BF/</<FBFFFFFFFF</7FBFFBFBFF<F////FFFB/FFFFFFFFFFFB/

=== rustar QUAL ===
cccccggggcccgggccggggggcccg]]gccgcggg]]XggggP]gcXcgP]P]gcgggggggg]PXgcggcgcgg]gPPPPgggcPgggggggggggcP
```

Every byte in rustar's string is exactly 33 above the corresponding STAR byte:

| STAR char (Phred) | rustar char (samtools view shows) | ASCII diff |
|-------------------|-----------------------------------|-----------:|
| `B` (33)          | `c` (66)                          | +33        |
| `F` (37)          | `g` (70)                          | +33        |
| `<` (27)          | `]` (60)                          | +33        |
| `7` (22)          | `X` (55)                          | +33        |
| `/` (14)          | `P` (47)                          | +33        |

`samtools stats average quality` across the five `nf-core/rnaseq` test profile samples:

| Sample              | STAR    | rustar  |
|---------------------|--------:|--------:|
| WT_REP1             | 35.4    | 68.4    |
| WT_REP2             | 35.5    | 68.5    |
| RAP1_IAA_30M_REP1   | 35.4    | 68.4    |
| RAP1_UNINDUCED_REP1 | 35.3    | 68.3    |
| RAP1_UNINDUCED_REP2 | 35.3    | 68.3    |

Consistent +33 across every sample, every record.

## Suggested fix

In [`src/io/sam.rs:632-640`](https://github.com/scverse/rustar-aligner/blob/main/src/io/sam.rs) (and the analogous transcriptome-path code in `build_transcriptome_records` around L611-L640), subtract the Phred+33 offset before handing the buffer to noodles:

```rust
// FASTQ-ASCII -> raw Phred for BAM binary QUAL
let qual_raw: Vec<u8> = read_qual.iter().map(|&b| b.saturating_sub(33)).collect();
*record.quality_scores_mut() = QualityScores::from(qual_raw);
```

(`saturating_sub` because a malformed FASTQ byte < 33 should clamp to 0, not underflow.)

Same conversion needed wherever `EncodedRead.quality` is plumbed into a noodles `RecordBuf`. Three call sites in `src/io/sam.rs` need updating (genome-PE, genome-SE, transcriptome).

A worthwhile companion change: rename the field doc to be explicit, e.g. `quality: Vec<u8>` doc comment to "Quality scores (FASTQ ASCII bytes, Phred+33 encoded — subtract 33 before writing to BAM)". The current comment "raw FASTQ quality values" is ambiguous — the value is "raw FASTQ", but FASTQ's raw values are already +33 over Phred.

## Test plan

1. Run rustar on any paired-end FASTQ.
2. Assert `samtools stats <bam>` `average quality` is in the realistic Illumina range (25-42), not 60+.
3. Assert per-byte deltas vs STAR are zero on a record present in both BAMs.
4. Add a small unit test in `src/io/sam.rs` that feeds known FASTQ ASCII quality (e.g. `b"III"` = Phred 40,40,40) and asserts the resulting `RecordBuf` quality bytes are `[40, 40, 40]`.

## Why this matters

- **`samtools stats`** is consumed by every modern QC pipeline. With QUAL bytes +33 too high, `average quality` doubles, `error rate` is wrong (compounding the [#29 NM/nM](https://github.com/scverse/rustar-aligner/issues/29) zero-error symptom), and every quality histogram is shifted.
- **MultiQC** displays Samtools Stats numbers prominently; the "average quality 68" entry is biologically impossible and will be flagged by any operator familiar with Illumina output.
- **Quality-aware variant callers** (bcftools, GATK, DeepVariant) treat QUAL bytes as Phred values directly. A +33 offset elevates every base score past the calibration range these callers expect; some will silently emit wrong likelihoods, others will clip at 93 and lose information.
- **Quality-aware trimmers** that operate on the BAM (rather than re-trimming from FASTQ) will see all bases as "high quality" and do nothing.

This and [#29 NM/nM](https://github.com/scverse/rustar-aligner/issues/29) together make `samtools stats` and any quality-aware downstream tool unsafe to use on rustar BAMs until both are fixed.

---

<sub>Filed during **nf-core/rnaseq integration testing** ([nf-core/rnaseq#1855](https://github.com/nf-core/rnaseq/pull/1855)). All sibling issues from this exercise: [`author:pinin4fjords`](https://github.com/scverse/rustar-aligner/issues?q=is%3Aissue+author%3Apinin4fjords) or grep for `nf-core/rnaseq#1855`.</sub>


STAR char (Phred)	rustar char (samtools view shows)	ASCII diff
`B` (33)	`c` (66)	+33
`F` (37)	`g` (70)	+33
`<` (27)	`]` (60)	+33
`7` (22)	`X` (55)	+33
`/` (14)	`P` (47)	+33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BAM `QUAL` field is offset by +33 (Phred+33 ASCII bytes written instead of raw Phred values) #34

Summary

SAM spec reference

Root cause in rustar

Reproducer (paired STAR + rustar)

Observed (verified on commit `5f8ad08` + STAR 2.7.11b)

Suggested fix

Test plan

Why this matters

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Metric (samtools stats)	STAR	rustar	Delta
`average quality`	35.3-35.5	68.3-68.5	+33 (consistent across 5 samples)
`error rate`	~0.0091	0.0	(companion symptom of #29 NM/nM, separate bug)

Sample	STAR	rustar
WT_REP1	35.4	68.4
WT_REP2	35.5	68.5
RAP1_IAA_30M_REP1	35.4	68.4
RAP1_UNINDUCED_REP1	35.3	68.3
RAP1_UNINDUCED_REP2	35.3	68.3

BAM QUAL field is offset by +33 (Phred+33 ASCII bytes written instead of raw Phred values) #34

Description

Summary

SAM spec reference

Root cause in rustar

Reproducer (paired STAR + rustar)

Observed (verified on commit 5f8ad08 + STAR 2.7.11b)

Suggested fix

Test plan

Why this matters

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

BAM `QUAL` field is offset by +33 (Phred+33 ASCII bytes written instead of raw Phred values) #34

Observed (verified on commit `5f8ad08` + STAR 2.7.11b)