Skip to content

--outSAMstrandField intronMotif does not produce any XS:A: tags #30

@pinin4fjords

Description

@pinin4fjords

Summary

--outSAMstrandField intronMotif is supposed to trigger emission of the XS:A:+ / XS:A:- tag on spliced reads, indicating the inferred strand from the intron motif. rustar-aligner does not emit XS:A: on any record, even with --outSAMstrandField intronMotif set and spliced reads in the output.

Breaks every downstream tool that needs strand information for stranded transcript assembly or library-protocol detection:

  • StringTie / Cufflinks need XS to assemble strand-aware transcripts.
  • rseqc's infer_experiment.py reads XS to detect library stranding.
  • HTSeq-count / featureCounts with -s rely on XS for ambiguity resolution.

STAR reference behaviour

STAR has two equivalent paths into XS emission:

  1. User passes XS in --outSAMattributes -> STAR auto-enables --outSAMstrandField intronMotif, Parameters_samAttributes.cpp:172-179:

    } else if (vAttr1.at(ii)== "XS") {
        outSAMattrOrder.push_back(ATTR_XS);
        outSAMattrPresent.XS=true;
        if (outSAMstrandField.type!=1) {
            inOut->logMain << "WARNING --outSAMattributes contains XS, therefore STAR will use --outSAMstrandField intronMotif" <<endl;
            outSAMstrandField.type=1;
        };
    };
  2. User passes --outSAMstrandField intronMotif -> STAR auto-adds ATTR_XS to the output order, Parameters_samAttributes.cpp:213-216:

    if (outSAMstrandField.type==1 && !outSAMattrPresent.XS) {
        outSAMattrOrder.push_back(ATTR_XS);
        inOut->logMain << "WARNING --outSAMstrandField=intronMotif, therefore STAR will output XS attribute" <<endl;
    };

Actual emission at ReadAlign_outputTranscriptSAM.cpp:298-302, driven by the per-read intron-motif inference earlier in the same function:

case ATTR_XS:
    if (...) *outStream<<"\tXS:A:+";
    else if (...) *outStream<<"\tXS:A:-";

+ if the canonical GT/AG donor/acceptor reads as forward-strand, - if reverse, no tag if the read is unspliced or the splice motif is non-canonical.

Reproducer

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-30 && cd /tmp/rustar-mre-30

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMstrandField intronMotif
        --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

RBAM=RUS./Aligned.out.bam
SBAM=STAR.Aligned.out.bam

for label in rustar STAR; do
    if [ $label = rustar ]; then BAM=$RBAM; else BAM=$SBAM; fi
    xs=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$BAM | grep -c 'XS:A:')
    splices=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$BAM | awk '$6 ~ /N/' | wc -l)
    echo "$label: XS:A: tags = $xs, spliced records (CIGAR N) = $splices"
done

Observed (verified, fresh paired run)

rustar: XS:A: tags = 0, spliced records (CIGAR N) = 358
STAR  : XS:A: tags = 1738, spliced records (CIGAR N) = ~715

Despite 358 spliced records and --outSAMstrandField intronMotif set, zero XS:A: tags are emitted by rustar. STAR on the same data emits 1 738 tags. The same is true when XS is explicitly added to --outSAMattributes: still zero tags emitted, where STAR auto-promotes the flag.

Suggested fix

In rustar's attribute order builder, add the two coupling paths from Parameters_samAttributes.cpp:

  1. If XS is in --outSAMattributes, force-enable outSAMstrandField = intronMotif and add ATTR_XS to the writer order.
  2. If outSAMstrandField == intronMotif and XS isn't already in the order list, add ATTR_XS to the writer order.

Then implement the per-record XS write in the genome-BAM record builder: walk the spliced CIGAR, look up donor/acceptor dinucleotides in the genome, and stamp + or - based on the motif (canonical GT/AG -> +, canonical CT/AC -> -, non-canonical -> omit).

rustar already computes junction_motifs (see src/align/transcript.rs::Transcript::junction_motifs and src/align/score::SpliceMotif), so the underlying motif-detection infrastructure is in place. This should be a write-time change, not an algorithmic one.

Test plan

  1. Run rustar with --outSAMstrandField intronMotif on PE data with annotated splices.
  2. Assert samtools view ... | grep -c 'XS:A:' > 0.
  3. Assert XS strand matches the genome-relative strand of the GT/AG donor on a spot-check read.
  4. (Bonus) cross-check against STAR's XS calls on the same reads — expect agreement on at least 95 % of spliced records.

Why this matters

infer_experiment.py runs on every nf-core/rnaseq sample to detect library stranding. With zero XS tags it can't infer anything, and the MultiQC report's RSeQC section reads "stranding could not be inferred" for every rustar-aligned sample. Stringtie's stranded assembly mode also fails silently (assembles all reads as if unstranded). Downstream impact is wider than NM/nM (#29) because XS is consumed during analysis, not just QC.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions