Skip to content

Log.final.out always reports Number of splices: Annotated (sjdb) = 0 despite --sjdbGTFfile #27

@pinin4fjords

Description

@pinin4fjords

Summary

Log.final.out always reports Number of splices: Annotated (sjdb) = 0 even when --sjdbGTFfile is provided. The total splice count is ~half STAR's and the non-canonical splice rate roughly triples — pointing at a functional bug, not a reporting bug. Root cause now localised (see below): the GTF-derived SpliceJunctionDb is keyed in chromosome-local 1-based coordinates but consulted in genome-absolute coordinates at two independent call sites that disagree with each other and with the DB.

The smoking gun: on my own rustar WT_REP2 run, the same BAM produces 2 rows with annotated=1 in SJ.out.tab but Annotated (sjdb) = 0 in Log.final.out. Two internal counters reading the same DB disagree because they query it in different (broken) coordinate spaces.

STAR reference behaviour

STAR's two-pass alignment uses the GTF-derived sjdb as seed candidates from pass 1. Read-derived splice events are credited as "Annotated" when they match an entry in the sjdb. The sjdb is built from GTF exon boundaries in source/SjdbClass.cpp / source/sjdbPrepare.cpp (1-based chromosome-local intron coords) and queried in the same space at alignment time.

Root cause in rustar

The SpliceJunctionDb is keyed in chromosome-local 1-based intron coordinates. From src/junction/gtf.rs:207-209:

// Intron coordinates (1-based, STAR convention)
let intron_start = exon1.end + 1;
let intron_end = exon2.start - 1;

exon1.end is the GTF column-5 value (1-based, chr-local per GTF spec). SpliceJunctionDb::from_raw_junctions stores those tuples verbatim — DB keys are (chr_idx, intron_start_local_1based, intron_end_local_1based, strand).

Two independent call sites consult this DB, neither in the right coordinate space:

Stitch-time (drives sjdb_score bonus + align_sjdb_overhang_min gate)

src/align/stitch.rs:1305-1314:

let is_annotated = junction_db.is_some_and(|db| {
    let junc_donor_sa = (donor_sa as i64 + jr_shift as i64) as u64;
    let donor_fwd =
        index.sa_pos_to_forward(junc_donor_sa, cluster.is_reverse, del as usize);
    let acceptor_fwd = donor_fwd + del as u64;
    db.is_annotated(cluster.chr_idx, donor_fwd, acceptor_fwd, 0)
        || db.is_annotated(cluster.chr_idx, donor_fwd, acceptor_fwd, 1)
        || db.is_annotated(cluster.chr_idx, donor_fwd, acceptor_fwd, 2)
});

donor_fwd comes from sa_pos_to_forward, which returns a position in the padded genome buffer — genome-absolute 0-based. On chr_start[0] = 0 (the only chromosome with annotated junctions in our yeast test data), the value is 1 less than the chr-local 1-based key the DB stores. So every stitch-time lookup misses by exactly 1 on chr 0, and by chr_start[N] on higher chromosomes.

This is the load-bearing miss: the sjdb_score bonus is never applied, and the stricter align_sj_overhang_min (5) gate is used instead of align_sjdb_overhang_min (3) for every splice. Pass 1 drops every GT/AG splice whose overhang falls in the [3, 5) window that STAR would accept as annotated — explaining the ~50 % total-splice deficit and the 3× non-canonical rate.

Stats-time (drives annotated column in SJ.out.tab + Log.final.out counter)

src/lib.rs:1860-1894:

let intron_start = genome_pos + 1; // 1-based, first intronic base
let intron_end = genome_pos + intron_len as u64;
// ...
let annotated = index.junction_db.is_annotated(
    transcript.chr_idx, intron_start, intron_end, strand,
);

genome_pos is "WITHOUT n_genome offset" (per the docstring at stitch.rs:21) — i.e. genome-absolute 1-based-equivalent. On chr 0, this matches the DB's chr-local 1-based keys and the lookup succeeds. On chr 1+ it misses by chr_start[N].

So stats-time finds chr-0 annotations correctly, stitch-time misses everything.

Smoking gun: the two paths disagree within a single run

Just verified on commit 5f8ad08 with WT_REP2:

SJ.out.tab annotated column distribution (stats-time lookup):
   14 rows with annotated=0
    2 rows with annotated=1                # chrI introns 87388-87500 and 142254-142619 — match the GTF

Log.final.out (stitch-time lookup):
   Number of splices: Annotated (sjdb) | 0  # <-- inconsistent with SJ.out.tab above

The 2 vs 0 split inside rustar's own outputs is the giveaway: both counters consult the same DB; they disagree because they query it in different (broken) coordinate spaces. Stitch-time is off by 1 even on chr 0; stats-time is off by chr_start[N] only on chr 1+. On chr-0-only yeast test data the stats-time lookup gets 2 of 16 junctions right; stitch-time gets 0.

Reproducer (paired STAR + rustar)

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-27 && cd /tmp/rustar-mre-27

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

echo "=== rustar Log.final.out (stitch-time counter) ==="
grep -E 'Annotated|splices' RUS./Log.final.out
echo
echo "=== rustar SJ.out.tab annotated col (stats-time counter) ==="
awk '{print $6}' RUS./SJ.out.tab | sort | uniq -c
echo
echo "=== STAR Log.final.out, same data ==="
grep -E 'Annotated|splices' STAR.Log.final.out
echo
echo "=== STAR SJ.out.tab annotated col ==="
awk '{print $6}' STAR.SJ.out.tab | sort | uniq -c

Observed (verified, fresh paired run)

=== rustar Log.final.out ===
Number of splices: Total            | 366
Number of splices: Annotated (sjdb) | 0          # <-- stitch-time, all misses
Number of splices: GT/AG            | 266
Number of splices: Non-canonical    | 93         # 3× STAR's rate

=== rustar SJ.out.tab annotated col ===
     14 0
      2 1                                         # <-- stats-time found 2 of 16

=== STAR Log.final.out ===
Number of splices: Total            | 720
Number of splices: Annotated (sjdb) | 620
Number of splices: Non-canonical    | 25

=== STAR SJ.out.tab annotated col ===
     ~5 0
     ~10 1

Suggested fix

One of (option 1 preferred):

Option 1 (preferred): normalise the DB to genome-absolute 0-based

Convert the keys at SpliceJunctionDb construction time to match the genome-absolute 0-based convention that the rest of the file already uses (prepared_junctions in index/mod.rs:92-105 and SpliceJunctionStats already store genome-absolute keys; the chr-local conversion happens at sj_output.rs:262-264 only at write time).

Then change the stats-time call site at src/lib.rs:1860-1894 to use genome_pos (0-based) and genome_pos + intron_len - 1 instead of genome_pos + 1 and genome_pos + intron_len. Leave stitch-time at src/align/stitch.rs:1305-1314 unchanged.

Single source of truth, contained in SpliceJunctionDb.

Option 2: convert at both call sites

Subtract chr_start[chr_idx] and add 1 at the stitch-time site:

let chr_off = index.genome.chr_start[cluster.chr_idx];
let donor_local_1b = donor_fwd - chr_off + 1;
let acceptor_local_1b = acceptor_fwd - chr_off + 1;
db.is_annotated(cluster.chr_idx, donor_local_1b, acceptor_local_1b, ...)

And subtract chr_start[chr_idx] from intron_start/intron_end at the stats-time site (it already has the +1 offset). Two-site change; the "what coord space does the DB store" contract remains undocumented.

Test plan

A focused unit test on a 10 kb single-chromosome toy genome with one 2-exon transcript:

  1. Build a SpliceJunctionDb from a 4-line GTF defining the intron at known coordinates.
  2. Call db.is_annotated(chr_idx=0, donor, acceptor, strand) with both the chr-local 1-based and the genome-absolute 0-based coordinates of the intron.
  3. Assert the correct one returns true (currently a coin flip per convention; the bug is that neither call site uses the convention the DB actually stores).
  4. Run a paired-end FASTQ over the toy reference with --sjdbGTFfile --twopassMode Basic and assert:
    • Number of splices: Annotated (sjdb) > 0 in Log.final.out.
    • Number of splices: Total matches STAR within a small tolerance.
    • The annotated junction appears with annotated=1 in SJ.out.tab.

Per-read CIGAR consequences

Comparing genome BAMs on records present in both with identical position + orientation, 340 records have different CIGAR. Dominant pattern (~70 %): STAR splices, rustar emits a straight match. Example fragment SRR6357072.32572100:

  • STAR: 96M14674N5M crossing one of the GTF-annotated yeast introns. NH=2.
  • rustar: 101M at the same exonic start. NH=1.

So the missing sjdb seeding changes the per-read alignment, not just the log header.

Severity

Medium. Mapping-rate hit in our nf-core/rnaseq test profile is ~0.2 pp; per-sample TPM effects roll up through #22 transcriptome mate fields. The functional impact (50 % of GT/AG splices dropped from pass 1, 3× non-canonical rate, AS scores systematically lower for splice-containing alignments per #33) is what's worth fixing — Annotated (sjdb) = 0 is the user-visible signal but the bigger consequence is the dropped splices and the sjdb_score bonus never landing.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions