Summary
Log.final.out always reports Number of splices: Annotated (sjdb) = 0 even when --sjdbGTFfile is provided. The total splice count is ~half STAR's and the non-canonical splice rate roughly triples — pointing at a functional bug, not a reporting bug. Root cause now localised (see below): the GTF-derived SpliceJunctionDb is keyed in chromosome-local 1-based coordinates but consulted in genome-absolute coordinates at two independent call sites that disagree with each other and with the DB.
The smoking gun: on my own rustar WT_REP2 run, the same BAM produces 2 rows with annotated=1 in SJ.out.tab but Annotated (sjdb) = 0 in Log.final.out. Two internal counters reading the same DB disagree because they query it in different (broken) coordinate spaces.
STAR reference behaviour
STAR's two-pass alignment uses the GTF-derived sjdb as seed candidates from pass 1. Read-derived splice events are credited as "Annotated" when they match an entry in the sjdb. The sjdb is built from GTF exon boundaries in source/SjdbClass.cpp / source/sjdbPrepare.cpp (1-based chromosome-local intron coords) and queried in the same space at alignment time.
Root cause in rustar
The SpliceJunctionDb is keyed in chromosome-local 1-based intron coordinates. From src/junction/gtf.rs:207-209:
// Intron coordinates (1-based, STAR convention)
let intron_start = exon1.end + 1;
let intron_end = exon2.start - 1;
exon1.end is the GTF column-5 value (1-based, chr-local per GTF spec). SpliceJunctionDb::from_raw_junctions stores those tuples verbatim — DB keys are (chr_idx, intron_start_local_1based, intron_end_local_1based, strand).
Two independent call sites consult this DB, neither in the right coordinate space:
Stitch-time (drives sjdb_score bonus + align_sjdb_overhang_min gate)
src/align/stitch.rs:1305-1314:
let is_annotated = junction_db.is_some_and(|db| {
let junc_donor_sa = (donor_sa as i64 + jr_shift as i64) as u64;
let donor_fwd =
index.sa_pos_to_forward(junc_donor_sa, cluster.is_reverse, del as usize);
let acceptor_fwd = donor_fwd + del as u64;
db.is_annotated(cluster.chr_idx, donor_fwd, acceptor_fwd, 0)
|| db.is_annotated(cluster.chr_idx, donor_fwd, acceptor_fwd, 1)
|| db.is_annotated(cluster.chr_idx, donor_fwd, acceptor_fwd, 2)
});
donor_fwd comes from sa_pos_to_forward, which returns a position in the padded genome buffer — genome-absolute 0-based. On chr_start[0] = 0 (the only chromosome with annotated junctions in our yeast test data), the value is 1 less than the chr-local 1-based key the DB stores. So every stitch-time lookup misses by exactly 1 on chr 0, and by chr_start[N] on higher chromosomes.
This is the load-bearing miss: the sjdb_score bonus is never applied, and the stricter align_sj_overhang_min (5) gate is used instead of align_sjdb_overhang_min (3) for every splice. Pass 1 drops every GT/AG splice whose overhang falls in the [3, 5) window that STAR would accept as annotated — explaining the ~50 % total-splice deficit and the 3× non-canonical rate.
Stats-time (drives annotated column in SJ.out.tab + Log.final.out counter)
src/lib.rs:1860-1894:
let intron_start = genome_pos + 1; // 1-based, first intronic base
let intron_end = genome_pos + intron_len as u64;
// ...
let annotated = index.junction_db.is_annotated(
transcript.chr_idx, intron_start, intron_end, strand,
);
genome_pos is "WITHOUT n_genome offset" (per the docstring at stitch.rs:21) — i.e. genome-absolute 1-based-equivalent. On chr 0, this matches the DB's chr-local 1-based keys and the lookup succeeds. On chr 1+ it misses by chr_start[N].
So stats-time finds chr-0 annotations correctly, stitch-time misses everything.
Smoking gun: the two paths disagree within a single run
Just verified on commit 5f8ad08 with WT_REP2:
SJ.out.tab annotated column distribution (stats-time lookup):
14 rows with annotated=0
2 rows with annotated=1 # chrI introns 87388-87500 and 142254-142619 — match the GTF
Log.final.out (stitch-time lookup):
Number of splices: Annotated (sjdb) | 0 # <-- inconsistent with SJ.out.tab above
The 2 vs 0 split inside rustar's own outputs is the giveaway: both counters consult the same DB; they disagree because they query it in different (broken) coordinate spaces. Stitch-time is off by 1 even on chr 0; stats-time is off by chr_start[N] only on chr 1+. On chr-0-only yeast test data the stats-time lookup gets 2 of 16 junctions right; stitch-time gets 0.
Reproducer (paired STAR + rustar)
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-27 && cd /tmp/rustar-mre-27
BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz
RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4
mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
--genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
--genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
--runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
--outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
--outSAMattrRGline ID:WT_REP2 SM:WT_REP2)
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
--genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
--genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.
echo "=== rustar Log.final.out (stitch-time counter) ==="
grep -E 'Annotated|splices' RUS./Log.final.out
echo
echo "=== rustar SJ.out.tab annotated col (stats-time counter) ==="
awk '{print $6}' RUS./SJ.out.tab | sort | uniq -c
echo
echo "=== STAR Log.final.out, same data ==="
grep -E 'Annotated|splices' STAR.Log.final.out
echo
echo "=== STAR SJ.out.tab annotated col ==="
awk '{print $6}' STAR.SJ.out.tab | sort | uniq -c
Observed (verified, fresh paired run)
=== rustar Log.final.out ===
Number of splices: Total | 366
Number of splices: Annotated (sjdb) | 0 # <-- stitch-time, all misses
Number of splices: GT/AG | 266
Number of splices: Non-canonical | 93 # 3× STAR's rate
=== rustar SJ.out.tab annotated col ===
14 0
2 1 # <-- stats-time found 2 of 16
=== STAR Log.final.out ===
Number of splices: Total | 720
Number of splices: Annotated (sjdb) | 620
Number of splices: Non-canonical | 25
=== STAR SJ.out.tab annotated col ===
~5 0
~10 1
Suggested fix
One of (option 1 preferred):
Option 1 (preferred): normalise the DB to genome-absolute 0-based
Convert the keys at SpliceJunctionDb construction time to match the genome-absolute 0-based convention that the rest of the file already uses (prepared_junctions in index/mod.rs:92-105 and SpliceJunctionStats already store genome-absolute keys; the chr-local conversion happens at sj_output.rs:262-264 only at write time).
Then change the stats-time call site at src/lib.rs:1860-1894 to use genome_pos (0-based) and genome_pos + intron_len - 1 instead of genome_pos + 1 and genome_pos + intron_len. Leave stitch-time at src/align/stitch.rs:1305-1314 unchanged.
Single source of truth, contained in SpliceJunctionDb.
Option 2: convert at both call sites
Subtract chr_start[chr_idx] and add 1 at the stitch-time site:
let chr_off = index.genome.chr_start[cluster.chr_idx];
let donor_local_1b = donor_fwd - chr_off + 1;
let acceptor_local_1b = acceptor_fwd - chr_off + 1;
db.is_annotated(cluster.chr_idx, donor_local_1b, acceptor_local_1b, ...)
And subtract chr_start[chr_idx] from intron_start/intron_end at the stats-time site (it already has the +1 offset). Two-site change; the "what coord space does the DB store" contract remains undocumented.
Test plan
A focused unit test on a 10 kb single-chromosome toy genome with one 2-exon transcript:
- Build a
SpliceJunctionDb from a 4-line GTF defining the intron at known coordinates.
- Call
db.is_annotated(chr_idx=0, donor, acceptor, strand) with both the chr-local 1-based and the genome-absolute 0-based coordinates of the intron.
- Assert the correct one returns
true (currently a coin flip per convention; the bug is that neither call site uses the convention the DB actually stores).
- Run a paired-end FASTQ over the toy reference with
--sjdbGTFfile --twopassMode Basic and assert:
Number of splices: Annotated (sjdb) > 0 in Log.final.out.
Number of splices: Total matches STAR within a small tolerance.
- The annotated junction appears with
annotated=1 in SJ.out.tab.
Per-read CIGAR consequences
Comparing genome BAMs on records present in both with identical position + orientation, 340 records have different CIGAR. Dominant pattern (~70 %): STAR splices, rustar emits a straight match. Example fragment SRR6357072.32572100:
- STAR:
96M14674N5M crossing one of the GTF-annotated yeast introns. NH=2.
- rustar:
101M at the same exonic start. NH=1.
So the missing sjdb seeding changes the per-read alignment, not just the log header.
Severity
Medium. Mapping-rate hit in our nf-core/rnaseq test profile is ~0.2 pp; per-sample TPM effects roll up through #22 transcriptome mate fields. The functional impact (50 % of GT/AG splices dropped from pass 1, 3× non-canonical rate, AS scores systematically lower for splice-containing alignments per #33) is what's worth fixing — Annotated (sjdb) = 0 is the user-visible signal but the bigger consequence is the dropped splices and the sjdb_score bonus never landing.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.
Summary
Log.final.outalways reportsNumber of splices: Annotated (sjdb) = 0even when--sjdbGTFfileis provided. The total splice count is ~half STAR's and the non-canonical splice rate roughly triples — pointing at a functional bug, not a reporting bug. Root cause now localised (see below): the GTF-derivedSpliceJunctionDbis keyed in chromosome-local 1-based coordinates but consulted in genome-absolute coordinates at two independent call sites that disagree with each other and with the DB.The smoking gun: on my own rustar
WT_REP2run, the same BAM produces 2 rows withannotated=1inSJ.out.tabbutAnnotated (sjdb) = 0inLog.final.out. Two internal counters reading the same DB disagree because they query it in different (broken) coordinate spaces.STAR reference behaviour
STAR's two-pass alignment uses the GTF-derived sjdb as seed candidates from pass 1. Read-derived splice events are credited as "Annotated" when they match an entry in the sjdb. The sjdb is built from GTF exon boundaries in
source/SjdbClass.cpp/source/sjdbPrepare.cpp(1-based chromosome-local intron coords) and queried in the same space at alignment time.Root cause in rustar
The
SpliceJunctionDbis keyed in chromosome-local 1-based intron coordinates. Fromsrc/junction/gtf.rs:207-209:exon1.endis the GTF column-5 value (1-based, chr-local per GTF spec).SpliceJunctionDb::from_raw_junctionsstores those tuples verbatim — DB keys are (chr_idx, intron_start_local_1based, intron_end_local_1based, strand).Two independent call sites consult this DB, neither in the right coordinate space:
Stitch-time (drives
sjdb_scorebonus +align_sjdb_overhang_mingate)src/align/stitch.rs:1305-1314:donor_fwdcomes fromsa_pos_to_forward, which returns a position in the padded genome buffer — genome-absolute 0-based. Onchr_start[0] = 0(the only chromosome with annotated junctions in our yeast test data), the value is 1 less than the chr-local 1-based key the DB stores. So every stitch-time lookup misses by exactly 1 on chr 0, and bychr_start[N]on higher chromosomes.This is the load-bearing miss: the
sjdb_scorebonus is never applied, and the stricteralign_sj_overhang_min(5) gate is used instead ofalign_sjdb_overhang_min(3) for every splice. Pass 1 drops every GT/AG splice whose overhang falls in the[3, 5)window that STAR would accept as annotated — explaining the ~50 % total-splice deficit and the 3× non-canonical rate.Stats-time (drives
annotatedcolumn inSJ.out.tab+Log.final.outcounter)src/lib.rs:1860-1894:genome_posis "WITHOUT n_genome offset" (per the docstring atstitch.rs:21) — i.e. genome-absolute 1-based-equivalent. On chr 0, this matches the DB's chr-local 1-based keys and the lookup succeeds. On chr 1+ it misses bychr_start[N].So stats-time finds chr-0 annotations correctly, stitch-time misses everything.
Smoking gun: the two paths disagree within a single run
Just verified on commit
5f8ad08withWT_REP2:The 2 vs 0 split inside rustar's own outputs is the giveaway: both counters consult the same DB; they disagree because they query it in different (broken) coordinate spaces. Stitch-time is off by 1 even on chr 0; stats-time is off by
chr_start[N]only on chr 1+. On chr-0-only yeast test data the stats-time lookup gets 2 of 16 junctions right; stitch-time gets 0.Reproducer (paired STAR + rustar)
Observed (verified, fresh paired run)
Suggested fix
One of (option 1 preferred):
Option 1 (preferred): normalise the DB to genome-absolute 0-based
Convert the keys at
SpliceJunctionDbconstruction time to match the genome-absolute 0-based convention that the rest of the file already uses (prepared_junctionsinindex/mod.rs:92-105andSpliceJunctionStatsalready store genome-absolute keys; the chr-local conversion happens atsj_output.rs:262-264only at write time).Then change the stats-time call site at
src/lib.rs:1860-1894to usegenome_pos(0-based) andgenome_pos + intron_len - 1instead ofgenome_pos + 1andgenome_pos + intron_len. Leave stitch-time atsrc/align/stitch.rs:1305-1314unchanged.Single source of truth, contained in
SpliceJunctionDb.Option 2: convert at both call sites
Subtract
chr_start[chr_idx]and add 1 at the stitch-time site:And subtract
chr_start[chr_idx]fromintron_start/intron_endat the stats-time site (it already has the+1offset). Two-site change; the "what coord space does the DB store" contract remains undocumented.Test plan
A focused unit test on a 10 kb single-chromosome toy genome with one 2-exon transcript:
SpliceJunctionDbfrom a 4-line GTF defining the intron at known coordinates.db.is_annotated(chr_idx=0, donor, acceptor, strand)with both the chr-local 1-based and the genome-absolute 0-based coordinates of the intron.true(currently a coin flip per convention; the bug is that neither call site uses the convention the DB actually stores).--sjdbGTFfile --twopassMode Basicand assert:Number of splices: Annotated (sjdb) > 0inLog.final.out.Number of splices: Totalmatches STAR within a small tolerance.annotated=1inSJ.out.tab.Per-read CIGAR consequences
Comparing genome BAMs on records present in both with identical position + orientation, 340 records have different CIGAR. Dominant pattern (~70 %): STAR splices, rustar emits a straight match. Example fragment
SRR6357072.32572100:96M14674N5Mcrossing one of the GTF-annotated yeast introns. NH=2.101Mat the same exonic start. NH=1.So the missing sjdb seeding changes the per-read alignment, not just the log header.
Severity
Medium. Mapping-rate hit in our
nf-core/rnaseqtest profile is ~0.2 pp; per-sample TPM effects roll up through #22 transcriptome mate fields. The functional impact (50 % of GT/AG splices dropped from pass 1, 3× non-canonical rate, AS scores systematically lower for splice-containing alignments per #33) is what's worth fixing —Annotated (sjdb) = 0is the user-visible signal but the bigger consequence is the dropped splices and thesjdb_scorebonus never landing.Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise:
author:pinin4fjordsor grep fornf-core/rnaseq#1855.