Summary
The transcriptome BAM emitted by rustar-aligner --quantMode TranscriptomeSAM includes the @RG header line correctly, but no per-record RG:Z: tags — even when --outSAMattrRGline is supplied and the genome-space Aligned.out.bam has RG:Z: on every record.
| BAM |
@RG header |
Records |
Records with RG:Z: |
RUS./Aligned.out.bam (rustar, genome) |
yes |
88 592 |
88 592 |
RUS./Aligned.toTranscriptome.out.bam (rustar, transcript) |
yes |
77 300 |
0 |
STAR.Aligned.out.bam (STAR, genome) |
yes |
88 072 |
88 072 |
STAR.Aligned.toTranscriptome.out.bam (STAR, transcript) |
yes |
78 886 |
78 886 |
Breaks any tool that splits records in the transcriptome BAM by RG:Z: (multi-sample bundled BAMs, custom QC keyed on read group, downstream re-merge flows).
STAR reference behaviour
STAR maintains two independent attribute order lists, outSAMattrOrder (genome BAM) and outSAMattrOrderQuant (transcriptome BAM), and the RG-attribute promotion adds the tag to both.
source/Parameters_samAttributes.cpp:201-205:
if (outSAMattrRGline[0]!="-" && !outSAMattrPresent.RG) {
outSAMattrOrder.push_back(ATTR_RG);
outSAMattrOrderQuant.push_back(ATTR_RG); // <-- same RG goes to both lists
outSAMattrPresent.RG=true;
inOut->logMain << "WARNING --outSAMattrRG defines a read group, therefore STAR will output RG attribute" <<endl;
};
And explicit RG handling at source/Parameters_samAttributes.cpp:96-99 pushes to both outSAMattrOrder and outSAMattrOrderQuant. The transcriptome SAM/BAM header is written in source/samHeaders.cpp:14-17 with the same @RG line. STAR's invariant: if @RG is in either header, the per-record RG:Z: tag is in every record of that BAM.
Reproducer
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-32 && cd /tmp/rustar-mre-32
BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz
RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4
mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
--genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
--genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
--runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
--outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
--quantMode TranscriptomeSAM --quantTranscriptomeSAMoutput BanSingleEnd
--outSAMattrRGline ID:WT_REP2 SM:WT_REP2)
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
--genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
--genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.
for label in rustar STAR; do
if [ $label = rustar ]; then G=RUS./Aligned.out.bam; T=RUS./Aligned.toTranscriptome.out.bam
else G=STAR.Aligned.out.bam; T=STAR.Aligned.toTranscriptome.out.bam; fi
g_hdr=$(docker run --rm -v $PWD:/w $STAR samtools view -H /w/$G | grep -c '^@RG')
g_rec=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$G | grep -c 'RG:Z:')
t_hdr=$(docker run --rm -v $PWD:/w $STAR samtools view -H /w/$T | grep -c '^@RG')
t_rec=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$T | grep -c 'RG:Z:')
t_total=$(docker run --rm -v $PWD:/w $STAR samtools view -c /w/$T)
printf "%-6s genome: @RG=%d, RG:Z:=%d | transcriptome: @RG=%d, RG:Z:=%d / %d\n" \
$label $g_hdr $g_rec $t_hdr $t_rec $t_total
done
Observed (verified, fresh paired run)
rustar genome: @RG=1, RG:Z:=88592 | transcriptome: @RG=1, RG:Z:=0 / 77300
STAR genome: @RG=1, RG:Z:=88072 | transcriptome: @RG=1, RG:Z:=78886 / 78886
Both aligners write the @RG header into the transcriptome BAM, but only STAR writes the per-record tag.
Suggested fix
In src/lib.rs::build_transcriptome_records_pe (and its SE counterpart), after building each record, stamp the RG tag from the user-supplied --outSAMattrRGline. The equivalent of STAR's Parameters_samAttributes.cpp:201-205 — adding ATTR_RG to the transcriptome attribute-order list whenever it's added to the genome list.
Alternatively, lift the RG-tag-write logic out of the genome-record builder into a shared helper that both paths call. Same general direction as the suggested fix for #22 (extracting paired-record bookkeeping into a shared helper).
Test plan
Once the fix lands:
- Run rustar with
--outSAMattrRGline ID:foo SM:bar and --quantMode TranscriptomeSAM.
- Assert
samtools view <transcriptome.bam> | grep -c 'RG:Z:foo' equals the total transcriptome record count.
- Cross-check the value matches the ID in the
@RG header line.
Severity
Low. Salmon's alignment-mode quant ignores RG so nf-core/rnaseq isn't directly broken by this. But this combines with #22 transcriptome BAM mate-fields and other transcriptome-side gaps as the trio of "transcriptome BAM is a less complete sibling of the genome BAM". Worth tightening so the transcriptome BAM is byte-symmetric with the genome BAM for shared fields.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.
Summary
The transcriptome BAM emitted by
rustar-aligner --quantMode TranscriptomeSAMincludes the@RGheader line correctly, but no per-recordRG:Z:tags — even when--outSAMattrRGlineis supplied and the genome-spaceAligned.out.bamhasRG:Z:on every record.@RGheaderRG:Z:RUS./Aligned.out.bam(rustar, genome)RUS./Aligned.toTranscriptome.out.bam(rustar, transcript)STAR.Aligned.out.bam(STAR, genome)STAR.Aligned.toTranscriptome.out.bam(STAR, transcript)Breaks any tool that splits records in the transcriptome BAM by
RG:Z:(multi-sample bundled BAMs, custom QC keyed on read group, downstream re-merge flows).STAR reference behaviour
STAR maintains two independent attribute order lists,
outSAMattrOrder(genome BAM) andoutSAMattrOrderQuant(transcriptome BAM), and the RG-attribute promotion adds the tag to both.source/Parameters_samAttributes.cpp:201-205:And explicit RG handling at
source/Parameters_samAttributes.cpp:96-99pushes to bothoutSAMattrOrderandoutSAMattrOrderQuant. The transcriptome SAM/BAM header is written insource/samHeaders.cpp:14-17with the same@RGline. STAR's invariant: if@RGis in either header, the per-recordRG:Z:tag is in every record of that BAM.Reproducer
Observed (verified, fresh paired run)
Both aligners write the
@RGheader into the transcriptome BAM, but only STAR writes the per-record tag.Suggested fix
In
src/lib.rs::build_transcriptome_records_pe(and its SE counterpart), after building each record, stamp the RG tag from the user-supplied--outSAMattrRGline. The equivalent of STAR'sParameters_samAttributes.cpp:201-205— addingATTR_RGto the transcriptome attribute-order list whenever it's added to the genome list.Alternatively, lift the RG-tag-write logic out of the genome-record builder into a shared helper that both paths call. Same general direction as the suggested fix for #22 (extracting paired-record bookkeeping into a shared helper).
Test plan
Once the fix lands:
--outSAMattrRGline ID:foo SM:barand--quantMode TranscriptomeSAM.samtools view <transcriptome.bam> | grep -c 'RG:Z:foo'equals the total transcriptome record count.@RGheader line.Severity
Low. Salmon's alignment-mode quant ignores RG so nf-core/rnaseq isn't directly broken by this. But this combines with #22 transcriptome BAM mate-fields and other transcriptome-side gaps as the trio of "transcriptome BAM is a less complete sibling of the genome BAM". Worth tightening so the transcriptome BAM is byte-symmetric with the genome BAM for shared fields.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise:
author:pinin4fjordsor grep fornf-core/rnaseq#1855.