Skip to content

--quantMode TranscriptomeSAM paired-end BAM omits mate fields, breaking Salmon TPMs #22

@pinin4fjords

Description

@pinin4fjords

Summary

For paired-end input, rustar-aligner --quantMode TranscriptomeSAM writes Aligned.toTranscriptome.out.bam records with SEGMENTED + FIRST_SEGMENT / LAST_SEGMENT set but not PROPERLY_SEGMENTED (0x2), and with RNEXT=*, PNEXT=0, TLEN=0 on every record. Both mates project to the same transcript at correct positions; they just aren't recorded as mates of each other.

Downstream Salmon's alignment-mode quant cannot infer a fragment-length distribution from these records and falls back to its default prior (mean=250, sd=25), inflating TPMs of short transcripts. NumReads agreement with STAR stays high (gene Pearson ~0.9998); gene TPM Pearson drops to ~0.985 on identical FASTQ.

STAR reference behaviour

STAR's genome-space paired record builder sets PROPERLY_SEGMENTED, MATE_REVERSE_COMPLEMENTED, the mate ref / position fields, and the signed TLEN — see source/SAMfunctions.cpp and the BAM-attribute path in source/ReadAlign_outputTranscriptSAM.cpp. The transcriptome BAM emission uses the same per-pair bookkeeping by construction — once a pair projects to a shared transcript, the mate flags and TLEN are stamped.

Reproducer

A single bash script that runs both aligners on the same inputs and side-by-side compares the transcriptome BAM:

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-22 && cd /tmp/rustar-mre-22

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2
        --quantMode TranscriptomeSAM --quantTranscriptomeSAMoutput BanSingleEnd)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

# --- Verify: mate-field absence in rustar's transcriptome BAM ---
RT=RUS./Aligned.toTranscriptome.out.bam
ST=STAR.Aligned.toTranscriptome.out.bam
for label in rustar STAR; do
    if [ $label = rustar ]; then BAM=$RT; else BAM=$ST; fi
    total=$(docker run --rm -v $PWD:/w $STAR samtools view -c /w/$BAM)
    proper=$(docker run --rm -v $PWD:/w $STAR samtools view -c -f 2 /w/$BAM)
    echo "$label: $proper / $total properly-paired"
done

echo
echo "First two records (NAME FLAG RNAME POS RNEXT PNEXT TLEN), rustar then STAR:"
docker run --rm -v $PWD:/w $STAR samtools view /w/$RT | awk 'BEGIN{OFS="\t"} {print "rustar", $1, $2, $3, $4, $7, $8, $9}' | head -2
docker run --rm -v $PWD:/w $STAR samtools view /w/$ST | awk 'BEGIN{OFS="\t"} {print "STAR  ", $1, $2, $3, $4, $7, $8, $9}' | head -2

Observed (verified on commit 5f8ad08 + STAR 2.7.11b, fresh paired run)

rustar: 0 / 77300 properly-paired
STAR:   78886 / 78886 properly-paired

First record SRR6357072.1900203:

Field STAR (r1 / r2) rustar (r1 / r2)
FLAG 163 / 83 81 / 129
RNAME YAR010C YAR010C
POS 1045 / 1103 1103 / 1045
RNEXT = / = * / *
PNEXT 1103 / 1045 0 / 0
TLEN 159 / -159 0 / 0

Root cause

src/lib.rs::build_transcriptome_records_pe, the post-processing at the end of the function (lines ~762-768):

use noodles::sam::alignment::record::Flags;
for r in rec1s.iter_mut() {
    *r.flags_mut() |= Flags::SEGMENTED | Flags::FIRST_SEGMENT;
}
for r in rec2s.iter_mut() {
    *r.flags_mut() |= Flags::SEGMENTED | Flags::LAST_SEGMENT;
}

This is the only pair-aware fix-up applied after the two SamWriter::build_transcriptome_records calls (each of which builds its list as if it were a single-end alignment list). Never set on transcriptome records:

  • Flags::PROPERLY_SEGMENTED (0x2)
  • Flags::MATE_REVERSE_COMPLEMENTED (0x20)
  • record.mate_reference_sequence_id_mut() -> RNEXT stays *
  • record.mate_alignment_start_mut() -> PNEXT stays 0
  • record.template_length_mut() -> TLEN stays 0

The underlying SamWriter::build_transcriptome_records (src/io/sam.rs:566-660) only sets REVERSE_COMPLEMENTED and SECONDARY — it has no concept of "mate".

The genome-space paired path in src/io/sam.rs::build_paired_mate_record (lines 1163-1260) does set all of the above. The logic exists; it just isn't wired into the transcriptome path.

Suggested fix

In build_transcriptome_records_pe, after the existing flag-stamping loops and before the interleave, walk rec1s.iter_mut().zip(rec2s.iter_mut()) alongside p1s.iter().zip(p2s.iter()) and set on both mates:

  • Flags::PROPERLY_SEGMENTED (both mates are guaranteed to share a transcript by construction).
  • MATE_REVERSE_COMPLEMENTED iff the other mate's REVERSE_COMPLEMENTED flag is set.
  • mate_reference_sequence_id_mut(Some(other.chr_idx)).
  • mate_alignment_start_mut(Some(other.alignment_start)).
  • template_length_mut(signed_insert_size) — STAR convention: magnitude max(end1, end2) - min(start1, start2) + 1, leftmost mate +TLEN, rightmost -TLEN.

Alternatively, factor the mate-bookkeeping out of build_paired_mate_record into a helper used by both paths.

Why this matters

Any paired-aware downstream tool (Salmon alignment-mode, RSEM, custom fragment-size QC, tximeta wrappers) either fails or silently degrades. Salmon silently degrades — users get plausible-looking TPMs that are systematically biased toward short transcripts.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions