`Log.final.out` always reports `Number of splices: Annotated (sjdb) = 0` despite `--sjdbGTFfile`

## Summary

`Log.final.out` always reports `Number of splices: Annotated (sjdb) = 0` even when `--sjdbGTFfile` is provided. The total splice count is ~half STAR's and the non-canonical splice rate roughly triples — pointing at a functional bug, not a reporting bug. **Root cause now localised** (see below): the GTF-derived `SpliceJunctionDb` is keyed in **chromosome-local 1-based** coordinates but consulted in **genome-absolute** coordinates at two independent call sites that disagree with each other and with the DB.

The smoking gun: on my own rustar `WT_REP2` run, **the same BAM produces 2 rows with `annotated=1` in `SJ.out.tab` but `Annotated (sjdb) = 0` in `Log.final.out`**. Two internal counters reading the same DB disagree because they query it in different (broken) coordinate spaces.

## STAR reference behaviour

STAR's two-pass alignment uses the GTF-derived sjdb as seed candidates from pass 1. Read-derived splice events are credited as "Annotated" when they match an entry in the sjdb. The sjdb is built from GTF exon boundaries in `source/SjdbClass.cpp` / `source/sjdbPrepare.cpp` (1-based chromosome-local intron coords) and queried in the same space at alignment time.

## Root cause in rustar

The `SpliceJunctionDb` is keyed in **chromosome-local 1-based** intron coordinates. From [`src/junction/gtf.rs:207-209`](https://github.com/scverse/rustar-aligner/blob/main/src/junction/gtf.rs#L207-L209):

```rust
// Intron coordinates (1-based, STAR convention)
let intron_start = exon1.end + 1;
let intron_end = exon2.start - 1;
```

`exon1.end` is the GTF column-5 value (1-based, chr-local per GTF spec). [`SpliceJunctionDb::from_raw_junctions`](https://github.com/scverse/rustar-aligner/blob/main/src/index/io.rs#L45-L56) stores those tuples verbatim — DB keys are **(chr_idx, intron_start_local_1based, intron_end_local_1based, strand)**.

Two independent call sites consult this DB, **neither in the right coordinate space**:

### Stitch-time (drives `sjdb_score` bonus + `align_sjdb_overhang_min` gate)

[`src/align/stitch.rs:1305-1314`](https://github.com/scverse/rustar-aligner/blob/main/src/align/stitch.rs#L1305-L1314):

```rust
let is_annotated = junction_db.is_some_and(|db| {
    let junc_donor_sa = (donor_sa as i64 + jr_shift as i64) as u64;
    let donor_fwd =
        index.sa_pos_to_forward(junc_donor_sa, cluster.is_reverse, del as usize);
    let acceptor_fwd = donor_fwd + del as u64;
    db.is_annotated(cluster.chr_idx, donor_fwd, acceptor_fwd, 0)
        || db.is_annotated(cluster.chr_idx, donor_fwd, acceptor_fwd, 1)
        || db.is_annotated(cluster.chr_idx, donor_fwd, acceptor_fwd, 2)
});
```

`donor_fwd` comes from `sa_pos_to_forward`, which returns a position in the padded genome buffer — **genome-absolute 0-based**. On `chr_start[0] = 0` (the only chromosome with annotated junctions in our yeast test data), the value is 1 less than the chr-local 1-based key the DB stores. So every stitch-time lookup misses by exactly 1 on chr 0, and by `chr_start[N]` on higher chromosomes.

This is the load-bearing miss: the `sjdb_score` bonus is never applied, and the stricter `align_sj_overhang_min` (5) gate is used instead of `align_sjdb_overhang_min` (3) for every splice. Pass 1 drops every GT/AG splice whose overhang falls in the `[3, 5)` window that STAR would accept as annotated — explaining the ~50 % total-splice deficit and the 3× non-canonical rate.

### Stats-time (drives `annotated` column in `SJ.out.tab` + `Log.final.out` counter)

[`src/lib.rs:1860-1894`](https://github.com/scverse/rustar-aligner/blob/main/src/lib.rs#L1860-L1894):

```rust
let intron_start = genome_pos + 1; // 1-based, first intronic base
let intron_end = genome_pos + intron_len as u64;
// ...
let annotated = index.junction_db.is_annotated(
    transcript.chr_idx, intron_start, intron_end, strand,
);
```

`genome_pos` is "WITHOUT n_genome offset" (per the docstring at [`stitch.rs:21`](https://github.com/scverse/rustar-aligner/blob/main/src/align/stitch.rs#L21)) — i.e. **genome-absolute 1-based-equivalent**. On chr 0, this matches the DB's chr-local 1-based keys and the lookup succeeds. On chr 1+ it misses by `chr_start[N]`.

So stats-time finds chr-0 annotations correctly, stitch-time misses everything.

### Smoking gun: the two paths disagree within a single run

Just verified on commit `5f8ad08` with `WT_REP2`:

```
SJ.out.tab annotated column distribution (stats-time lookup):
   14 rows with annotated=0
    2 rows with annotated=1                # chrI introns 87388-87500 and 142254-142619 — match the GTF

Log.final.out (stitch-time lookup):
   Number of splices: Annotated (sjdb) | 0  # <-- inconsistent with SJ.out.tab above
```

The 2 vs 0 split inside rustar's own outputs is the giveaway: both counters consult the same DB; they disagree because they query it in different (broken) coordinate spaces. Stitch-time is off by 1 even on chr 0; stats-time is off by `chr_start[N]` only on chr 1+. On chr-0-only yeast test data the stats-time lookup gets 2 of 16 junctions right; stitch-time gets 0.

## Reproducer (paired STAR + rustar)

```bash
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-27 && cd /tmp/rustar-mre-27

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

echo "=== rustar Log.final.out (stitch-time counter) ==="
grep -E 'Annotated|splices' RUS./Log.final.out
echo
echo "=== rustar SJ.out.tab annotated col (stats-time counter) ==="
awk '{print $6}' RUS./SJ.out.tab | sort | uniq -c
echo
echo "=== STAR Log.final.out, same data ==="
grep -E 'Annotated|splices' STAR.Log.final.out
echo
echo "=== STAR SJ.out.tab annotated col ==="
awk '{print $6}' STAR.SJ.out.tab | sort | uniq -c
```

## Observed (verified, fresh paired run)

```
=== rustar Log.final.out ===
Number of splices: Total            | 366
Number of splices: Annotated (sjdb) | 0          # <-- stitch-time, all misses
Number of splices: GT/AG            | 266
Number of splices: Non-canonical    | 93         # 3× STAR's rate

=== rustar SJ.out.tab annotated col ===
     14 0
      2 1                                         # <-- stats-time found 2 of 16

=== STAR Log.final.out ===
Number of splices: Total            | 720
Number of splices: Annotated (sjdb) | 620
Number of splices: Non-canonical    | 25

=== STAR SJ.out.tab annotated col ===
     ~5 0
     ~10 1
```

## Suggested fix

One of (option 1 preferred):

### Option 1 (preferred): normalise the DB to **genome-absolute 0-based**

Convert the keys at `SpliceJunctionDb` construction time to match the genome-absolute 0-based convention that the rest of the file already uses (`prepared_junctions` in [`index/mod.rs:92-105`](https://github.com/scverse/rustar-aligner/blob/main/src/index/mod.rs#L92-L105) and `SpliceJunctionStats` already store genome-absolute keys; the chr-local conversion happens at `sj_output.rs:262-264` only at write time).

Then change the stats-time call site at [`src/lib.rs:1860-1894`](https://github.com/scverse/rustar-aligner/blob/main/src/lib.rs#L1860-L1894) to use `genome_pos` (0-based) and `genome_pos + intron_len - 1` instead of `genome_pos + 1` and `genome_pos + intron_len`. Leave stitch-time at [`src/align/stitch.rs:1305-1314`](https://github.com/scverse/rustar-aligner/blob/main/src/align/stitch.rs#L1305-L1314) unchanged.

Single source of truth, contained in `SpliceJunctionDb`.

### Option 2: convert at both call sites

Subtract `chr_start[chr_idx]` and add 1 at the stitch-time site:

```rust
let chr_off = index.genome.chr_start[cluster.chr_idx];
let donor_local_1b = donor_fwd - chr_off + 1;
let acceptor_local_1b = acceptor_fwd - chr_off + 1;
db.is_annotated(cluster.chr_idx, donor_local_1b, acceptor_local_1b, ...)
```

And subtract `chr_start[chr_idx]` from `intron_start`/`intron_end` at the stats-time site (it already has the `+1` offset). Two-site change; the "what coord space does the DB store" contract remains undocumented.

## Test plan

A focused unit test on a 10 kb single-chromosome toy genome with one 2-exon transcript:

1. Build a `SpliceJunctionDb` from a 4-line GTF defining the intron at known coordinates.
2. Call `db.is_annotated(chr_idx=0, donor, acceptor, strand)` with both the chr-local 1-based and the genome-absolute 0-based coordinates of the intron.
3. Assert the correct one returns `true` (currently a coin flip per convention; the bug is that *neither* call site uses the convention the DB actually stores).
4. Run a paired-end FASTQ over the toy reference with `--sjdbGTFfile --twopassMode Basic` and assert:
   - `Number of splices: Annotated (sjdb) > 0` in `Log.final.out`.
   - `Number of splices: Total` matches STAR within a small tolerance.
   - The annotated junction appears with `annotated=1` in `SJ.out.tab`.

## Per-read CIGAR consequences

Comparing genome BAMs on records present in both with **identical position + orientation**, 340 records have different CIGAR. Dominant pattern (~70 %): STAR splices, rustar emits a straight match. Example fragment `SRR6357072.32572100`:

- STAR: `96M14674N5M` crossing one of the GTF-annotated yeast introns. NH=2.
- rustar: `101M` at the same exonic start. NH=1.

So the missing sjdb seeding changes the per-read alignment, not just the log header.

## Severity

Medium. Mapping-rate hit in our `nf-core/rnaseq` test profile is ~0.2 pp; per-sample TPM effects roll up through [#22 transcriptome mate fields](https://github.com/scverse/rustar-aligner/issues/22). The functional impact (50 % of GT/AG splices dropped from pass 1, 3× non-canonical rate, AS scores systematically lower for splice-containing alignments per [#33](https://github.com/scverse/rustar-aligner/issues/33)) is what's worth fixing — `Annotated (sjdb) = 0` is the user-visible signal but the bigger consequence is the dropped splices and the `sjdb_score` bonus never landing.

---

<sub>Filed during **nf-core/rnaseq integration testing** ([nf-core/rnaseq#1855](https://github.com/nf-core/rnaseq/pull/1855)). All sibling issues from this exercise: [`author:pinin4fjords`](https://github.com/scverse/rustar-aligner/issues?q=is%3Aissue+author%3Apinin4fjords) or grep for `nf-core/rnaseq#1855`.</sub>


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Log.final.out` always reports `Number of splices: Annotated (sjdb) = 0` despite `--sjdbGTFfile` #27

Summary

STAR reference behaviour

Root cause in rustar

Stitch-time (drives `sjdb_score` bonus + `align_sjdb_overhang_min` gate)

Stats-time (drives `annotated` column in `SJ.out.tab` + `Log.final.out` counter)

Smoking gun: the two paths disagree within a single run

Reproducer (paired STAR + rustar)

Observed (verified, fresh paired run)

Suggested fix

Option 1 (preferred): normalise the DB to genome-absolute 0-based

Option 2: convert at both call sites

Test plan

Per-read CIGAR consequences

Severity

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Log.final.out always reports Number of splices: Annotated (sjdb) = 0 despite --sjdbGTFfile #27

Description

Summary

STAR reference behaviour

Root cause in rustar

Stitch-time (drives sjdb_score bonus + align_sjdb_overhang_min gate)

Stats-time (drives annotated column in SJ.out.tab + Log.final.out counter)

Smoking gun: the two paths disagree within a single run

Reproducer (paired STAR + rustar)

Observed (verified, fresh paired run)

Suggested fix

Option 1 (preferred): normalise the DB to genome-absolute 0-based

Option 2: convert at both call sites

Test plan

Per-read CIGAR consequences

Severity

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`Log.final.out` always reports `Number of splices: Annotated (sjdb) = 0` despite `--sjdbGTFfile` #27

Stitch-time (drives `sjdb_score` bonus + `align_sjdb_overhang_min` gate)

Stats-time (drives `annotated` column in `SJ.out.tab` + `Log.final.out` counter)