Summary
For paired-end input, rustar-aligner --quantMode TranscriptomeSAM writes Aligned.toTranscriptome.out.bam records with SEGMENTED + FIRST_SEGMENT / LAST_SEGMENT set but not PROPERLY_SEGMENTED (0x2), and with RNEXT=*, PNEXT=0, TLEN=0 on every record. Both mates project to the same transcript at correct positions; they just aren't recorded as mates of each other.
Downstream Salmon's alignment-mode quant cannot infer a fragment-length distribution from these records and falls back to its default prior (mean=250, sd=25), inflating TPMs of short transcripts. NumReads agreement with STAR stays high (gene Pearson ~0.9998); gene TPM Pearson drops to ~0.985 on identical FASTQ.
STAR reference behaviour
STAR's genome-space paired record builder sets PROPERLY_SEGMENTED, MATE_REVERSE_COMPLEMENTED, the mate ref / position fields, and the signed TLEN — see source/SAMfunctions.cpp and the BAM-attribute path in source/ReadAlign_outputTranscriptSAM.cpp. The transcriptome BAM emission uses the same per-pair bookkeeping by construction — once a pair projects to a shared transcript, the mate flags and TLEN are stamped.
Reproducer
A single bash script that runs both aligners on the same inputs and side-by-side compares the transcriptome BAM:
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-22 && cd /tmp/rustar-mre-22
BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz
RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4
mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
--genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
--genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
--runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
--outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
--outSAMattrRGline ID:WT_REP2 SM:WT_REP2
--quantMode TranscriptomeSAM --quantTranscriptomeSAMoutput BanSingleEnd)
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
--genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
--genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.
# --- Verify: mate-field absence in rustar's transcriptome BAM ---
RT=RUS./Aligned.toTranscriptome.out.bam
ST=STAR.Aligned.toTranscriptome.out.bam
for label in rustar STAR; do
if [ $label = rustar ]; then BAM=$RT; else BAM=$ST; fi
total=$(docker run --rm -v $PWD:/w $STAR samtools view -c /w/$BAM)
proper=$(docker run --rm -v $PWD:/w $STAR samtools view -c -f 2 /w/$BAM)
echo "$label: $proper / $total properly-paired"
done
echo
echo "First two records (NAME FLAG RNAME POS RNEXT PNEXT TLEN), rustar then STAR:"
docker run --rm -v $PWD:/w $STAR samtools view /w/$RT | awk 'BEGIN{OFS="\t"} {print "rustar", $1, $2, $3, $4, $7, $8, $9}' | head -2
docker run --rm -v $PWD:/w $STAR samtools view /w/$ST | awk 'BEGIN{OFS="\t"} {print "STAR ", $1, $2, $3, $4, $7, $8, $9}' | head -2
Observed (verified on commit 5f8ad08 + STAR 2.7.11b, fresh paired run)
rustar: 0 / 77300 properly-paired
STAR: 78886 / 78886 properly-paired
First record SRR6357072.1900203:
| Field |
STAR (r1 / r2) |
rustar (r1 / r2) |
| FLAG |
163 / 83 |
81 / 129 |
| RNAME |
YAR010C |
YAR010C |
| POS |
1045 / 1103 |
1103 / 1045 |
| RNEXT |
= / = |
* / * |
| PNEXT |
1103 / 1045 |
0 / 0 |
| TLEN |
159 / -159 |
0 / 0 |
Root cause
src/lib.rs::build_transcriptome_records_pe, the post-processing at the end of the function (lines ~762-768):
use noodles::sam::alignment::record::Flags;
for r in rec1s.iter_mut() {
*r.flags_mut() |= Flags::SEGMENTED | Flags::FIRST_SEGMENT;
}
for r in rec2s.iter_mut() {
*r.flags_mut() |= Flags::SEGMENTED | Flags::LAST_SEGMENT;
}
This is the only pair-aware fix-up applied after the two SamWriter::build_transcriptome_records calls (each of which builds its list as if it were a single-end alignment list). Never set on transcriptome records:
Flags::PROPERLY_SEGMENTED (0x2)
Flags::MATE_REVERSE_COMPLEMENTED (0x20)
record.mate_reference_sequence_id_mut() -> RNEXT stays *
record.mate_alignment_start_mut() -> PNEXT stays 0
record.template_length_mut() -> TLEN stays 0
The underlying SamWriter::build_transcriptome_records (src/io/sam.rs:566-660) only sets REVERSE_COMPLEMENTED and SECONDARY — it has no concept of "mate".
The genome-space paired path in src/io/sam.rs::build_paired_mate_record (lines 1163-1260) does set all of the above. The logic exists; it just isn't wired into the transcriptome path.
Suggested fix
In build_transcriptome_records_pe, after the existing flag-stamping loops and before the interleave, walk rec1s.iter_mut().zip(rec2s.iter_mut()) alongside p1s.iter().zip(p2s.iter()) and set on both mates:
Flags::PROPERLY_SEGMENTED (both mates are guaranteed to share a transcript by construction).
MATE_REVERSE_COMPLEMENTED iff the other mate's REVERSE_COMPLEMENTED flag is set.
mate_reference_sequence_id_mut(Some(other.chr_idx)).
mate_alignment_start_mut(Some(other.alignment_start)).
template_length_mut(signed_insert_size) — STAR convention: magnitude max(end1, end2) - min(start1, start2) + 1, leftmost mate +TLEN, rightmost -TLEN.
Alternatively, factor the mate-bookkeeping out of build_paired_mate_record into a helper used by both paths.
Why this matters
Any paired-aware downstream tool (Salmon alignment-mode, RSEM, custom fragment-size QC, tximeta wrappers) either fails or silently degrades. Salmon silently degrades — users get plausible-looking TPMs that are systematically biased toward short transcripts.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.
Summary
For paired-end input,
rustar-aligner --quantMode TranscriptomeSAMwritesAligned.toTranscriptome.out.bamrecords withSEGMENTED+FIRST_SEGMENT/LAST_SEGMENTset but notPROPERLY_SEGMENTED(0x2), and withRNEXT=*,PNEXT=0,TLEN=0on every record. Both mates project to the same transcript at correct positions; they just aren't recorded as mates of each other.Downstream Salmon's alignment-mode quant cannot infer a fragment-length distribution from these records and falls back to its default prior (mean=250, sd=25), inflating TPMs of short transcripts.
NumReadsagreement with STAR stays high (gene Pearson ~0.9998); geneTPMPearson drops to ~0.985 on identical FASTQ.STAR reference behaviour
STAR's genome-space paired record builder sets
PROPERLY_SEGMENTED,MATE_REVERSE_COMPLEMENTED, the mate ref / position fields, and the signed TLEN — seesource/SAMfunctions.cppand the BAM-attribute path insource/ReadAlign_outputTranscriptSAM.cpp. The transcriptome BAM emission uses the same per-pair bookkeeping by construction — once a pair projects to a shared transcript, the mate flags and TLEN are stamped.Reproducer
A single bash script that runs both aligners on the same inputs and side-by-side compares the transcriptome BAM:
Observed (verified on commit
5f8ad08+ STAR 2.7.11b, fresh paired run)First record
SRR6357072.1900203:YAR010CYAR010C=/=*/*Root cause
src/lib.rs::build_transcriptome_records_pe, the post-processing at the end of the function (lines ~762-768):This is the only pair-aware fix-up applied after the two
SamWriter::build_transcriptome_recordscalls (each of which builds its list as if it were a single-end alignment list). Never set on transcriptome records:Flags::PROPERLY_SEGMENTED(0x2)Flags::MATE_REVERSE_COMPLEMENTED(0x20)record.mate_reference_sequence_id_mut()-> RNEXT stays*record.mate_alignment_start_mut()-> PNEXT stays 0record.template_length_mut()-> TLEN stays 0The underlying
SamWriter::build_transcriptome_records(src/io/sam.rs:566-660) only setsREVERSE_COMPLEMENTEDandSECONDARY— it has no concept of "mate".The genome-space paired path in
src/io/sam.rs::build_paired_mate_record(lines 1163-1260) does set all of the above. The logic exists; it just isn't wired into the transcriptome path.Suggested fix
In
build_transcriptome_records_pe, after the existing flag-stamping loops and before the interleave, walkrec1s.iter_mut().zip(rec2s.iter_mut())alongsidep1s.iter().zip(p2s.iter())and set on both mates:Flags::PROPERLY_SEGMENTED(both mates are guaranteed to share a transcript by construction).MATE_REVERSE_COMPLEMENTEDiff the other mate'sREVERSE_COMPLEMENTEDflag is set.mate_reference_sequence_id_mut(Some(other.chr_idx)).mate_alignment_start_mut(Some(other.alignment_start)).template_length_mut(signed_insert_size)— STAR convention: magnitudemax(end1, end2) - min(start1, start2) + 1, leftmost mate+TLEN, rightmost-TLEN.Alternatively, factor the mate-bookkeeping out of
build_paired_mate_recordinto a helper used by both paths.Why this matters
Any paired-aware downstream tool (Salmon alignment-mode, RSEM, custom fragment-size QC, tximeta wrappers) either fails or silently degrades. Salmon silently degrades — users get plausible-looking TPMs that are systematically biased toward short transcripts.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise:
author:pinin4fjordsor grep fornf-core/rnaseq#1855.