Skip to content

Transcriptome BAM (--quantMode TranscriptomeSAM) omits per-record RG:Z: tag #32

@pinin4fjords

Description

@pinin4fjords

Summary

The transcriptome BAM emitted by rustar-aligner --quantMode TranscriptomeSAM includes the @RG header line correctly, but no per-record RG:Z: tags — even when --outSAMattrRGline is supplied and the genome-space Aligned.out.bam has RG:Z: on every record.

BAM @RG header Records Records with RG:Z:
RUS./Aligned.out.bam (rustar, genome) yes 88 592 88 592
RUS./Aligned.toTranscriptome.out.bam (rustar, transcript) yes 77 300 0
STAR.Aligned.out.bam (STAR, genome) yes 88 072 88 072
STAR.Aligned.toTranscriptome.out.bam (STAR, transcript) yes 78 886 78 886

Breaks any tool that splits records in the transcriptome BAM by RG:Z: (multi-sample bundled BAMs, custom QC keyed on read group, downstream re-merge flows).

STAR reference behaviour

STAR maintains two independent attribute order lists, outSAMattrOrder (genome BAM) and outSAMattrOrderQuant (transcriptome BAM), and the RG-attribute promotion adds the tag to both.

source/Parameters_samAttributes.cpp:201-205:

if (outSAMattrRGline[0]!="-" && !outSAMattrPresent.RG) {
    outSAMattrOrder.push_back(ATTR_RG);
    outSAMattrOrderQuant.push_back(ATTR_RG);   // <-- same RG goes to both lists
    outSAMattrPresent.RG=true;
    inOut->logMain << "WARNING --outSAMattrRG defines a read group, therefore STAR will output RG attribute" <<endl;
};

And explicit RG handling at source/Parameters_samAttributes.cpp:96-99 pushes to both outSAMattrOrder and outSAMattrOrderQuant. The transcriptome SAM/BAM header is written in source/samHeaders.cpp:14-17 with the same @RG line. STAR's invariant: if @RG is in either header, the per-record RG:Z: tag is in every record of that BAM.

Reproducer

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-32 && cd /tmp/rustar-mre-32

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --quantMode TranscriptomeSAM --quantTranscriptomeSAMoutput BanSingleEnd
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

for label in rustar STAR; do
    if [ $label = rustar ]; then G=RUS./Aligned.out.bam; T=RUS./Aligned.toTranscriptome.out.bam
    else                         G=STAR.Aligned.out.bam; T=STAR.Aligned.toTranscriptome.out.bam; fi
    g_hdr=$(docker run --rm -v $PWD:/w $STAR samtools view -H /w/$G | grep -c '^@RG')
    g_rec=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$G | grep -c 'RG:Z:')
    t_hdr=$(docker run --rm -v $PWD:/w $STAR samtools view -H /w/$T | grep -c '^@RG')
    t_rec=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$T | grep -c 'RG:Z:')
    t_total=$(docker run --rm -v $PWD:/w $STAR samtools view -c /w/$T)
    printf "%-6s genome: @RG=%d, RG:Z:=%d | transcriptome: @RG=%d, RG:Z:=%d / %d\n" \
        $label $g_hdr $g_rec $t_hdr $t_rec $t_total
done

Observed (verified, fresh paired run)

rustar genome: @RG=1, RG:Z:=88592 | transcriptome: @RG=1, RG:Z:=0     / 77300
STAR   genome: @RG=1, RG:Z:=88072 | transcriptome: @RG=1, RG:Z:=78886 / 78886

Both aligners write the @RG header into the transcriptome BAM, but only STAR writes the per-record tag.

Suggested fix

In src/lib.rs::build_transcriptome_records_pe (and its SE counterpart), after building each record, stamp the RG tag from the user-supplied --outSAMattrRGline. The equivalent of STAR's Parameters_samAttributes.cpp:201-205 — adding ATTR_RG to the transcriptome attribute-order list whenever it's added to the genome list.

Alternatively, lift the RG-tag-write logic out of the genome-record builder into a shared helper that both paths call. Same general direction as the suggested fix for #22 (extracting paired-record bookkeeping into a shared helper).

Test plan

Once the fix lands:

  1. Run rustar with --outSAMattrRGline ID:foo SM:bar and --quantMode TranscriptomeSAM.
  2. Assert samtools view <transcriptome.bam> | grep -c 'RG:Z:foo' equals the total transcriptome record count.
  3. Cross-check the value matches the ID in the @RG header line.

Severity

Low. Salmon's alignment-mode quant ignores RG so nf-core/rnaseq isn't directly broken by this. But this combines with #22 transcriptome BAM mate-fields and other transcriptome-side gaps as the trio of "transcriptome BAM is a less complete sibling of the genome BAM". Worth tightening so the transcriptome BAM is byte-symmetric with the genome BAM for shared fields.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions