Summary
After #27 lands (the SpliceJunctionDb coord-space mismatch fix in PR #45) and Log.final.out correctly reports Annotated (sjdb) > 0, the total splice count stays at roughly half STAR's on the same data, and the non-canonical splice rate stays elevated:
Log.final.out field |
rustar (after #27) |
STAR 2.7.11b |
Number of splices: Total |
366 |
801 |
Number of splices: Annotated (sjdb) |
95 |
620 |
Number of splices: GT/AG |
266 |
686 |
Number of splices: Non-canonical |
93 |
108 |
(Earlier numbers in this issue body cited STAR's total as 720; a fresh paired run on the same inputs gives 801. Adjusted.)
So the annotation lookup is correct (the sjdbScore +3 bonus now fires when read-derived junctions match the GTF), but pass-1 alignment isn't using sjdb-derived junctions as alignment candidates — i.e. rustar is only discovering junctions from the reads themselves, where STAR pre-loads the GTF junctions into the SA-extended genome at index time and seeds pass-1 against them. Read alignments that would extend across a GTF junction with a [align_sjdb_overhang_min=3, align_sj_overhang_min=5) overhang fall through to the stricter unannotated gate and either map as unspliced (CIGAR collapse to 101M) or fall to the Reads unmapped: too short bucket.
This is downstream of #27 but not the same bug. #27 was about the lookup at scoring time; this one is about candidate generation at seeding time.
STAR reference behaviour
STAR's pass-1 uses the sjdb-extended genome (the Gsj buffer appended to the genome and indexed in the SA during genomeGenerate — see source/sjdbPrepare.cpp and source/sjdbBuildIndex.cpp). Seeds that land in the Gsj region get translated back to the corresponding genome donor/acceptor pair and the splice is recorded as a candidate junction without the read having to bridge the gap by itself.
rustar's src/junction/sjdb_insert.rs does port the sjdb-into-Gsj logic at genomeGenerate time. The startup log confirms Junction database loaded: 5 annotated junctions and Wrote sjdb files: 5 junctions on this test profile. So the index machinery is there. The gap must be in pass-1 alignment using those Gsj-region hits to produce splice candidates that propagate into the seed -> cluster -> stitch pipeline.
Reproducer
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-pass1-sjdb && cd /tmp/rustar-mre-pass1-sjdb
BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz
RUSTAR=ghcr.io/scverse/rustar-aligner:dev # or a build with #27 applied
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4
mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
--genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
--genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
--runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
--outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
--outSAMattrRGline ID:WT_REP2 SM:WT_REP2)
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
--genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
--genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.
echo "=== rustar splices ==="
grep -E 'splices' RUS./Log.final.out
echo
echo "=== STAR splices ==="
grep -E 'splices' STAR.Log.final.out
Note: this reproducer assumes #27's coord-space fix is applied (currently in PR #45). On unpatched main, Number of splices: Annotated (sjdb) = 0 from #27 is the louder symptom, but the total / non-canonical gap is the same regardless.
Observed (verified after applying #27)
rustar:
Number of splices: Total | 366
Number of splices: Annotated (sjdb) | 95
Number of splices: GT/AG | 266
Number of splices: Non-canonical | 93
STAR:
Number of splices: Total | 720
Number of splices: Annotated (sjdb) | 620
Number of splices: GT/AG | 688
Number of splices: Non-canonical | 25
The gap is roughly half the GT/AG splices missing and 3.7x the non-canonical rate. Both are consistent with pass-1 not seeding from sjdb: reads that should extend across a GTF junction either fail to find the splice and emit 101M (CIGAR collapse, no junction recorded) or stitch a non-canonical junction with a worse score.
Per-read CIGAR consequences
Comparing genome BAMs on reads present in both with identical position + orientation, 340 records have different CIGAR on WT_REP2. The dominant pattern (~70 %) is: STAR splices a read across an annotated junction, rustar emits a straight M/S CIGAR. Example fragment SRR6357072.32572100:
- STAR:
96M14674N5M (crosses a GTF-annotated yeast intron). NH=2.
- rustar:
101M at the same exonic start. NH=1.
The unspliced rustar alignment "fits" — the read has a high-identity match to the exonic genome region — but it's the wrong alignment for a read that should have crossed the intron.
Where to look
src/junction/sjdb_insert.rs correctly extends the genome with the Gsj buffer (one segment per annotated junction, flanked by sjdbOverhang bases). At alignment time, seeds landing in the Gsj region need to:
- Be translated back to the genome donor/acceptor pair (the offset arithmetic that maps
Gsj position to (chr, intron_start, intron_end)).
- Generate a candidate splice for the seed -> cluster -> stitch pipeline, so the stitching code considers crossing the annotated junction as one of the alignment options.
Pseudocode for STAR's logic (in source/seedSearch1.cpp / source/aligner.cpp):
for each SA hit:
if hit_pos >= n_genome_real: # landed in Gsj region
(donor, acceptor) = decode_gsj_hit(hit_pos)
emit splice_candidate(donor, acceptor) # alongside the regular seed
rustar appears to do the regular seed/cluster path but not the Gsj-hit decoding. Worth grep'ing for Gsj, n_genome, and n_genome_real in src/align/ to see whether SA hits in the Gsj region are filtered out (instead of decoded to splice candidates).
A targeted unit test would build a 2-exon toy transcript with a known intron, align a read that overhangs the splice by 3 bp on each side (between align_sjdb_overhang_min and align_sj_overhang_min), and assert the resulting CIGAR contains an N op of the intron length. Today's behaviour: emits the straight M CIGAR.
Severity
Medium. Mapping-rate impact in nf-core/rnaseq -profile test,docker is 0.2 pp ( 50 of 49 551 reads per sample shift from unique mapped to unmapped: too short). The per-read CIGAR impact is bigger — 234 of 340 same-position-different-CIGAR reads on WT_REP2 are this pattern. On larger genomes with richer annotation the percentage of reads spanning GTF junctions will be much higher than yeast's 1.5 %, so the effective impact scales up.
After #27 lands and this issue is fixed, the test profile's gene-level Salmon NumReads Pearson should rise from 0.9998 to closer to 1.0 (the residual being RNG / tie-breaking only), and the per-CIGAR diff count should drop materially.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.
Summary
After #27 lands (the
SpliceJunctionDbcoord-space mismatch fix in PR #45) andLog.final.outcorrectly reportsAnnotated (sjdb) > 0, the total splice count stays at roughly half STAR's on the same data, and the non-canonical splice rate stays elevated:Log.final.outfieldNumber of splices: TotalNumber of splices: Annotated (sjdb)Number of splices: GT/AGNumber of splices: Non-canonical(Earlier numbers in this issue body cited STAR's total as 720; a fresh paired run on the same inputs gives 801. Adjusted.)
So the annotation lookup is correct (the sjdbScore +3 bonus now fires when read-derived junctions match the GTF), but pass-1 alignment isn't using sjdb-derived junctions as alignment candidates — i.e. rustar is only discovering junctions from the reads themselves, where STAR pre-loads the GTF junctions into the SA-extended genome at index time and seeds pass-1 against them. Read alignments that would extend across a GTF junction with a
[align_sjdb_overhang_min=3, align_sj_overhang_min=5)overhang fall through to the stricter unannotated gate and either map as unspliced (CIGAR collapse to101M) or fall to theReads unmapped: too shortbucket.This is downstream of #27 but not the same bug. #27 was about the lookup at scoring time; this one is about candidate generation at seeding time.
STAR reference behaviour
STAR's pass-1 uses the sjdb-extended genome (the
Gsjbuffer appended to the genome and indexed in the SA duringgenomeGenerate— seesource/sjdbPrepare.cppandsource/sjdbBuildIndex.cpp). Seeds that land in theGsjregion get translated back to the corresponding genome donor/acceptor pair and the splice is recorded as a candidate junction without the read having to bridge the gap by itself.rustar's
src/junction/sjdb_insert.rsdoes port the sjdb-into-Gsj logic at genomeGenerate time. The startup log confirmsJunction database loaded: 5 annotated junctionsandWrote sjdb files: 5 junctionson this test profile. So the index machinery is there. The gap must be in pass-1 alignment using those Gsj-region hits to produce splice candidates that propagate into the seed -> cluster -> stitch pipeline.Reproducer
Note: this reproducer assumes #27's coord-space fix is applied (currently in PR #45). On unpatched
main,Number of splices: Annotated (sjdb) = 0from #27 is the louder symptom, but the total / non-canonical gap is the same regardless.Observed (verified after applying #27)
The gap is roughly half the GT/AG splices missing and 3.7x the non-canonical rate. Both are consistent with pass-1 not seeding from sjdb: reads that should extend across a GTF junction either fail to find the splice and emit
101M(CIGAR collapse, no junction recorded) or stitch a non-canonical junction with a worse score.Per-read CIGAR consequences
Comparing genome BAMs on reads present in both with identical position + orientation, 340 records have different CIGAR on
WT_REP2. The dominant pattern (~70 %) is: STAR splices a read across an annotated junction, rustar emits a straightM/SCIGAR. Example fragmentSRR6357072.32572100:96M14674N5M(crosses a GTF-annotated yeast intron). NH=2.101Mat the same exonic start. NH=1.The unspliced rustar alignment "fits" — the read has a high-identity match to the exonic genome region — but it's the wrong alignment for a read that should have crossed the intron.
Where to look
src/junction/sjdb_insert.rscorrectly extends the genome with theGsjbuffer (one segment per annotated junction, flanked bysjdbOverhangbases). At alignment time, seeds landing in theGsjregion need to:Gsjposition to(chr, intron_start, intron_end)).Pseudocode for STAR's logic (in
source/seedSearch1.cpp/source/aligner.cpp):rustar appears to do the regular seed/cluster path but not the
Gsj-hit decoding. Worth grep'ing forGsj,n_genome, andn_genome_realinsrc/align/to see whether SA hits in the Gsj region are filtered out (instead of decoded to splice candidates).A targeted unit test would build a 2-exon toy transcript with a known intron, align a read that overhangs the splice by 3 bp on each side (between
align_sjdb_overhang_minandalign_sj_overhang_min), and assert the resulting CIGAR contains anNop of the intron length. Today's behaviour: emits the straightMCIGAR.Severity
Medium. Mapping-rate impact in
nf-core/rnaseq -profile test,dockeris0.2 pp (50 of 49 551 reads per sample shift fromunique mappedtounmapped: too short). The per-read CIGAR impact is bigger — 234 of 340 same-position-different-CIGAR reads onWT_REP2are this pattern. On larger genomes with richer annotation the percentage of reads spanning GTF junctions will be much higher than yeast's 1.5 %, so the effective impact scales up.After #27 lands and this issue is fixed, the test profile's gene-level Salmon
NumReadsPearson should rise from 0.9998 to closer to 1.0 (the residual being RNG / tie-breaking only), and the per-CIGAR diff count should drop materially.Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise:
author:pinin4fjordsor grep fornf-core/rnaseq#1855.