Summary
--outSAMstrandField intronMotif is supposed to trigger emission of the XS:A:+ / XS:A:- tag on spliced reads, indicating the inferred strand from the intron motif. rustar-aligner does not emit XS:A: on any record, even with --outSAMstrandField intronMotif set and spliced reads in the output.
Breaks every downstream tool that needs strand information for stranded transcript assembly or library-protocol detection:
- StringTie / Cufflinks need
XS to assemble strand-aware transcripts.
- rseqc's
infer_experiment.py reads XS to detect library stranding.
- HTSeq-count / featureCounts with
-s rely on XS for ambiguity resolution.
STAR reference behaviour
STAR has two equivalent paths into XS emission:
-
User passes XS in --outSAMattributes -> STAR auto-enables --outSAMstrandField intronMotif, Parameters_samAttributes.cpp:172-179:
} else if (vAttr1.at(ii)== "XS") {
outSAMattrOrder.push_back(ATTR_XS);
outSAMattrPresent.XS=true;
if (outSAMstrandField.type!=1) {
inOut->logMain << "WARNING --outSAMattributes contains XS, therefore STAR will use --outSAMstrandField intronMotif" <<endl;
outSAMstrandField.type=1;
};
};
-
User passes --outSAMstrandField intronMotif -> STAR auto-adds ATTR_XS to the output order, Parameters_samAttributes.cpp:213-216:
if (outSAMstrandField.type==1 && !outSAMattrPresent.XS) {
outSAMattrOrder.push_back(ATTR_XS);
inOut->logMain << "WARNING --outSAMstrandField=intronMotif, therefore STAR will output XS attribute" <<endl;
};
Actual emission at ReadAlign_outputTranscriptSAM.cpp:298-302, driven by the per-read intron-motif inference earlier in the same function:
case ATTR_XS:
if (...) *outStream<<"\tXS:A:+";
else if (...) *outStream<<"\tXS:A:-";
+ if the canonical GT/AG donor/acceptor reads as forward-strand, - if reverse, no tag if the read is unspliced or the splice motif is non-canonical.
Reproducer
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-30 && cd /tmp/rustar-mre-30
BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz
RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4
mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
--genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
--genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
--runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
--outSAMtype BAM Unsorted --outSAMstrandField intronMotif
--outSAMattributes NH HI AS NM MD
--outSAMattrRGline ID:WT_REP2 SM:WT_REP2)
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
--genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
--genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.
RBAM=RUS./Aligned.out.bam
SBAM=STAR.Aligned.out.bam
for label in rustar STAR; do
if [ $label = rustar ]; then BAM=$RBAM; else BAM=$SBAM; fi
xs=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$BAM | grep -c 'XS:A:')
splices=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$BAM | awk '$6 ~ /N/' | wc -l)
echo "$label: XS:A: tags = $xs, spliced records (CIGAR N) = $splices"
done
Observed (verified, fresh paired run)
rustar: XS:A: tags = 0, spliced records (CIGAR N) = 358
STAR : XS:A: tags = 1738, spliced records (CIGAR N) = ~715
Despite 358 spliced records and --outSAMstrandField intronMotif set, zero XS:A: tags are emitted by rustar. STAR on the same data emits 1 738 tags. The same is true when XS is explicitly added to --outSAMattributes: still zero tags emitted, where STAR auto-promotes the flag.
Suggested fix
In rustar's attribute order builder, add the two coupling paths from Parameters_samAttributes.cpp:
- If
XS is in --outSAMattributes, force-enable outSAMstrandField = intronMotif and add ATTR_XS to the writer order.
- If
outSAMstrandField == intronMotif and XS isn't already in the order list, add ATTR_XS to the writer order.
Then implement the per-record XS write in the genome-BAM record builder: walk the spliced CIGAR, look up donor/acceptor dinucleotides in the genome, and stamp + or - based on the motif (canonical GT/AG -> +, canonical CT/AC -> -, non-canonical -> omit).
rustar already computes junction_motifs (see src/align/transcript.rs::Transcript::junction_motifs and src/align/score::SpliceMotif), so the underlying motif-detection infrastructure is in place. This should be a write-time change, not an algorithmic one.
Test plan
- Run rustar with
--outSAMstrandField intronMotif on PE data with annotated splices.
- Assert
samtools view ... | grep -c 'XS:A:' > 0.
- Assert XS strand matches the genome-relative strand of the GT/AG donor on a spot-check read.
- (Bonus) cross-check against STAR's XS calls on the same reads — expect agreement on at least 95 % of spliced records.
Why this matters
infer_experiment.py runs on every nf-core/rnaseq sample to detect library stranding. With zero XS tags it can't infer anything, and the MultiQC report's RSeQC section reads "stranding could not be inferred" for every rustar-aligned sample. Stringtie's stranded assembly mode also fails silently (assembles all reads as if unstranded). Downstream impact is wider than NM/nM (#29) because XS is consumed during analysis, not just QC.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.
Summary
--outSAMstrandField intronMotifis supposed to trigger emission of theXS:A:+/XS:A:-tag on spliced reads, indicating the inferred strand from the intron motif.rustar-alignerdoes not emitXS:A:on any record, even with--outSAMstrandField intronMotifset and spliced reads in the output.Breaks every downstream tool that needs strand information for stranded transcript assembly or library-protocol detection:
XSto assemble strand-aware transcripts.infer_experiment.pyreadsXSto detect library stranding.-srely onXSfor ambiguity resolution.STAR reference behaviour
STAR has two equivalent paths into
XSemission:User passes
XSin--outSAMattributes-> STAR auto-enables--outSAMstrandField intronMotif,Parameters_samAttributes.cpp:172-179:User passes
--outSAMstrandField intronMotif-> STAR auto-addsATTR_XSto the output order,Parameters_samAttributes.cpp:213-216:Actual emission at
ReadAlign_outputTranscriptSAM.cpp:298-302, driven by the per-read intron-motif inference earlier in the same function:+if the canonical GT/AG donor/acceptor reads as forward-strand,-if reverse, no tag if the read is unspliced or the splice motif is non-canonical.Reproducer
Observed (verified, fresh paired run)
Despite 358 spliced records and
--outSAMstrandField intronMotifset, zeroXS:A:tags are emitted by rustar. STAR on the same data emits 1 738 tags. The same is true whenXSis explicitly added to--outSAMattributes: still zero tags emitted, where STAR auto-promotes the flag.Suggested fix
In rustar's attribute order builder, add the two coupling paths from
Parameters_samAttributes.cpp:XSis in--outSAMattributes, force-enableoutSAMstrandField = intronMotifand addATTR_XSto the writer order.outSAMstrandField == intronMotifandXSisn't already in the order list, addATTR_XSto the writer order.Then implement the per-record XS write in the genome-BAM record builder: walk the spliced CIGAR, look up donor/acceptor dinucleotides in the genome, and stamp
+or-based on the motif (canonical GT/AG ->+, canonical CT/AC ->-, non-canonical -> omit).rustar already computes
junction_motifs(seesrc/align/transcript.rs::Transcript::junction_motifsandsrc/align/score::SpliceMotif), so the underlying motif-detection infrastructure is in place. This should be a write-time change, not an algorithmic one.Test plan
--outSAMstrandField intronMotifon PE data with annotated splices.samtools view ... | grep -c 'XS:A:' > 0.Why this matters
infer_experiment.pyruns on every nf-core/rnaseq sample to detect library stranding. With zeroXStags it can't infer anything, and the MultiQC report's RSeQC section reads "stranding could not be inferred" for every rustar-aligned sample. Stringtie's stranded assembly mode also fails silently (assembles all reads as if unstranded). Downstream impact is wider than NM/nM (#29) because XS is consumed during analysis, not just QC.Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise:
author:pinin4fjordsor grep fornf-core/rnaseq#1855.