`--quantMode TranscriptomeSAM` paired-end BAM omits mate fields, breaking Salmon TPMs

## Summary

For paired-end input, `rustar-aligner --quantMode TranscriptomeSAM` writes `Aligned.toTranscriptome.out.bam` records with `SEGMENTED` + `FIRST_SEGMENT` / `LAST_SEGMENT` set but **not** `PROPERLY_SEGMENTED` (0x2), and with `RNEXT=*`, `PNEXT=0`, `TLEN=0` on every record. Both mates project to the same transcript at correct positions; they just aren't recorded as mates of each other.

Downstream Salmon's alignment-mode quant cannot infer a fragment-length distribution from these records and falls back to its default prior (mean=250, sd=25), inflating TPMs of short transcripts. `NumReads` agreement with STAR stays high (gene Pearson ~0.9998); gene `TPM` Pearson drops to ~0.985 on identical FASTQ.

## STAR reference behaviour

STAR's genome-space paired record builder sets `PROPERLY_SEGMENTED`, `MATE_REVERSE_COMPLEMENTED`, the mate ref / position fields, and the signed TLEN — see [`source/SAMfunctions.cpp`](https://github.com/alexdobin/STAR/blob/master/source/SAMfunctions.cpp) and the BAM-attribute path in [`source/ReadAlign_outputTranscriptSAM.cpp`](https://github.com/alexdobin/STAR/blob/master/source/ReadAlign_outputTranscriptSAM.cpp). The transcriptome BAM emission uses the same per-pair bookkeeping by construction — once a pair projects to a shared transcript, the mate flags and TLEN are stamped.

## Reproducer

A single bash script that runs both aligners on the same inputs and side-by-side compares the transcriptome BAM:

```bash
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-22 && cd /tmp/rustar-mre-22

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2
        --quantMode TranscriptomeSAM --quantTranscriptomeSAMoutput BanSingleEnd)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

# --- Verify: mate-field absence in rustar's transcriptome BAM ---
RT=RUS./Aligned.toTranscriptome.out.bam
ST=STAR.Aligned.toTranscriptome.out.bam
for label in rustar STAR; do
    if [ $label = rustar ]; then BAM=$RT; else BAM=$ST; fi
    total=$(docker run --rm -v $PWD:/w $STAR samtools view -c /w/$BAM)
    proper=$(docker run --rm -v $PWD:/w $STAR samtools view -c -f 2 /w/$BAM)
    echo "$label: $proper / $total properly-paired"
done

echo
echo "First two records (NAME FLAG RNAME POS RNEXT PNEXT TLEN), rustar then STAR:"
docker run --rm -v $PWD:/w $STAR samtools view /w/$RT | awk 'BEGIN{OFS="\t"} {print "rustar", $1, $2, $3, $4, $7, $8, $9}' | head -2
docker run --rm -v $PWD:/w $STAR samtools view /w/$ST | awk 'BEGIN{OFS="\t"} {print "STAR  ", $1, $2, $3, $4, $7, $8, $9}' | head -2
```

## Observed (verified on commit `5f8ad08` + STAR 2.7.11b, fresh paired run)

```
rustar: 0 / 77300 properly-paired
STAR:   78886 / 78886 properly-paired
```

First record `SRR6357072.1900203`:

| Field   | STAR (r1 / r2) | rustar (r1 / r2) |
|---------|----------------|------------------|
| FLAG    | 163 / 83       | 81 / 129         |
| RNAME   | `YAR010C`      | `YAR010C`        |
| POS     | 1045 / 1103    | 1103 / 1045      |
| RNEXT   | `=` / `=`      | **`*`** / **`*`** |
| PNEXT   | 1103 / 1045    | **0** / **0**    |
| TLEN    | 159 / -159     | **0** / **0**    |

## Root cause

[`src/lib.rs::build_transcriptome_records_pe`](https://github.com/scverse/rustar-aligner/blob/main/src/lib.rs), the post-processing at the end of the function (lines ~762-768):

```rust
use noodles::sam::alignment::record::Flags;
for r in rec1s.iter_mut() {
    *r.flags_mut() |= Flags::SEGMENTED | Flags::FIRST_SEGMENT;
}
for r in rec2s.iter_mut() {
    *r.flags_mut() |= Flags::SEGMENTED | Flags::LAST_SEGMENT;
}
```

This is the only pair-aware fix-up applied after the two `SamWriter::build_transcriptome_records` calls (each of which builds its list as if it were a single-end alignment list). Never set on transcriptome records:

- `Flags::PROPERLY_SEGMENTED` (0x2)
- `Flags::MATE_REVERSE_COMPLEMENTED` (0x20)
- `record.mate_reference_sequence_id_mut()` -> RNEXT stays `*`
- `record.mate_alignment_start_mut()`       -> PNEXT stays 0
- `record.template_length_mut()`            -> TLEN stays 0

The underlying `SamWriter::build_transcriptome_records` (`src/io/sam.rs:566-660`) only sets `REVERSE_COMPLEMENTED` and `SECONDARY` — it has no concept of "mate".

The genome-space paired path in `src/io/sam.rs::build_paired_mate_record` (lines 1163-1260) does set all of the above. The logic exists; it just isn't wired into the transcriptome path.

## Suggested fix

In `build_transcriptome_records_pe`, after the existing flag-stamping loops and before the interleave, walk `rec1s.iter_mut().zip(rec2s.iter_mut())` alongside `p1s.iter().zip(p2s.iter())` and set on both mates:

- `Flags::PROPERLY_SEGMENTED` (both mates are guaranteed to share a transcript by construction).
- `MATE_REVERSE_COMPLEMENTED` iff the other mate's `REVERSE_COMPLEMENTED` flag is set.
- `mate_reference_sequence_id_mut(Some(other.chr_idx))`.
- `mate_alignment_start_mut(Some(other.alignment_start))`.
- `template_length_mut(signed_insert_size)` — STAR convention: magnitude `max(end1, end2) - min(start1, start2) + 1`, leftmost mate `+TLEN`, rightmost `-TLEN`.

Alternatively, factor the mate-bookkeeping out of `build_paired_mate_record` into a helper used by both paths.

## Why this matters

Any paired-aware downstream tool (Salmon alignment-mode, RSEM, custom fragment-size QC, tximeta wrappers) either fails or silently degrades. Salmon silently degrades — users get plausible-looking TPMs that are systematically biased toward short transcripts.


---

<sub>Filed during **nf-core/rnaseq integration testing** ([nf-core/rnaseq#1855](https://github.com/nf-core/rnaseq/pull/1855)). All sibling issues from this exercise: [`author:pinin4fjords`](https://github.com/scverse/rustar-aligner/issues?q=is%3Aissue+author%3Apinin4fjords) or grep for `nf-core/rnaseq#1855`.</sub>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`--quantMode TranscriptomeSAM` paired-end BAM omits mate fields, breaking Salmon TPMs #22

Summary

STAR reference behaviour

Reproducer

Observed (verified on commit `5f8ad08` + STAR 2.7.11b, fresh paired run)

Root cause

Suggested fix

Why this matters

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Field	STAR (r1 / r2)	rustar (r1 / r2)
FLAG	163 / 83	81 / 129
RNAME	`YAR010C`	`YAR010C`
POS	1045 / 1103	1103 / 1045
RNEXT	`=` / `=`	*`` / ``*
PNEXT	1103 / 1045	0 / 0
TLEN	159 / -159	0 / 0

--quantMode TranscriptomeSAM paired-end BAM omits mate fields, breaking Salmon TPMs #22

Description

Summary

STAR reference behaviour

Reproducer

Observed (verified on commit 5f8ad08 + STAR 2.7.11b, fresh paired run)

Root cause

Suggested fix

Why this matters

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`--quantMode TranscriptomeSAM` paired-end BAM omits mate fields, breaking Salmon TPMs #22

Observed (verified on commit `5f8ad08` + STAR 2.7.11b, fresh paired run)