pass-1 alignment doesn't seed candidates from `--sjdbGTFfile`; ~50% of GT/AG splices missed even after #27

## Summary

After [#27](https://github.com/scverse/rustar-aligner/issues/27) lands (the `SpliceJunctionDb` coord-space mismatch fix in [PR #45](https://github.com/scverse/rustar-aligner/pull/45)) and `Log.final.out` correctly reports `Annotated (sjdb) > 0`, the total splice count stays at roughly half STAR's on the same data, and the non-canonical splice rate stays elevated:

| `Log.final.out` field | rustar (after #27) | STAR 2.7.11b |
|---|---:|---:|
| `Number of splices: Total`              | 366 | 801 |
| `Number of splices: Annotated (sjdb)`   | 95  | 620 |
| `Number of splices: GT/AG`              | 266 | 686 |
| `Number of splices: Non-canonical`      | 93  | 108 |

(Earlier numbers in this issue body cited STAR's total as 720; a fresh paired run on the same inputs gives 801. Adjusted.)

So the annotation **lookup** is correct (the sjdbScore +3 bonus now fires when read-derived junctions match the GTF), but pass-1 alignment isn't using sjdb-derived junctions as **alignment candidates** — i.e. rustar is only discovering junctions from the reads themselves, where STAR pre-loads the GTF junctions into the SA-extended genome at index time and seeds pass-1 against them. Read alignments that would extend across a GTF junction with a `[align_sjdb_overhang_min=3, align_sj_overhang_min=5)` overhang fall through to the stricter unannotated gate and either map as unspliced (CIGAR collapse to `101M`) or fall to the `Reads unmapped: too short` bucket.

This is downstream of #27 but **not the same bug**. #27 was about the lookup at scoring time; this one is about candidate generation at seeding time.

## STAR reference behaviour

STAR's pass-1 uses the sjdb-extended genome (the `Gsj` buffer appended to the genome and indexed in the SA during `genomeGenerate` — see [`source/sjdbPrepare.cpp`](https://github.com/alexdobin/STAR/blob/master/source/sjdbPrepare.cpp) and [`source/sjdbBuildIndex.cpp`](https://github.com/alexdobin/STAR/blob/master/source/sjdbBuildIndex.cpp)). Seeds that land in the `Gsj` region get translated back to the corresponding genome donor/acceptor pair and the splice is recorded as a candidate junction without the read having to bridge the gap by itself.

rustar's `src/junction/sjdb_insert.rs` does port the sjdb-into-Gsj logic at genomeGenerate time. The startup log confirms `Junction database loaded: 5 annotated junctions` and `Wrote sjdb files: 5 junctions` on this test profile. So the index machinery is there. The gap must be in pass-1 alignment using those Gsj-region hits to produce splice candidates that propagate into the seed -> cluster -> stitch pipeline.

## Reproducer

```bash
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-pass1-sjdb && cd /tmp/rustar-mre-pass1-sjdb

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev   # or a build with #27 applied
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

echo "=== rustar splices ==="
grep -E 'splices' RUS./Log.final.out
echo
echo "=== STAR splices ==="
grep -E 'splices' STAR.Log.final.out
```

Note: this reproducer assumes #27's coord-space fix is applied (currently in [PR #45](https://github.com/scverse/rustar-aligner/pull/45)). On unpatched `main`, `Number of splices: Annotated (sjdb) = 0` from #27 is the louder symptom, but the total / non-canonical gap is the same regardless.

## Observed (verified after applying #27)

```
rustar:
  Number of splices: Total            | 366
  Number of splices: Annotated (sjdb) | 95
  Number of splices: GT/AG            | 266
  Number of splices: Non-canonical    | 93

STAR:
  Number of splices: Total            | 720
  Number of splices: Annotated (sjdb) | 620
  Number of splices: GT/AG            | 688
  Number of splices: Non-canonical    | 25
```

The gap is roughly **half the GT/AG splices missing** and **3.7x the non-canonical rate**. Both are consistent with pass-1 not seeding from sjdb: reads that should extend across a GTF junction either fail to find the splice and emit `101M` (CIGAR collapse, no junction recorded) or stitch a non-canonical junction with a worse score.

## Per-read CIGAR consequences

Comparing genome BAMs on reads present in both with **identical position + orientation**, 340 records have different CIGAR on `WT_REP2`. The dominant pattern (~70 %) is: STAR splices a read across an annotated junction, rustar emits a straight `M`/`S` CIGAR. Example fragment `SRR6357072.32572100`:

- STAR: `96M14674N5M` (crosses a GTF-annotated yeast intron). NH=2.
- rustar: `101M` at the same exonic start. NH=1.

The unspliced rustar alignment "fits" — the read has a high-identity match to the exonic genome region — but it's the wrong alignment for a read that should have crossed the intron.

## Where to look

`src/junction/sjdb_insert.rs` correctly extends the genome with the `Gsj` buffer (one segment per annotated junction, flanked by `sjdbOverhang` bases). At alignment time, seeds landing in the `Gsj` region need to:

1. Be translated back to the genome donor/acceptor pair (the offset arithmetic that maps `Gsj` position to `(chr, intron_start, intron_end)`).
2. Generate a candidate splice for the seed -> cluster -> stitch pipeline, so the stitching code considers crossing the annotated junction as one of the alignment options.

Pseudocode for STAR's logic (in `source/seedSearch1.cpp` / `source/aligner.cpp`):

```
for each SA hit:
    if hit_pos >= n_genome_real:    # landed in Gsj region
        (donor, acceptor) = decode_gsj_hit(hit_pos)
        emit splice_candidate(donor, acceptor)  # alongside the regular seed
```

rustar appears to do the regular seed/cluster path but not the `Gsj`-hit decoding. Worth grep'ing for `Gsj`, `n_genome`, and `n_genome_real` in `src/align/` to see whether SA hits in the Gsj region are filtered out (instead of decoded to splice candidates).

A targeted unit test would build a 2-exon toy transcript with a known intron, align a read that overhangs the splice by 3 bp on each side (between `align_sjdb_overhang_min` and `align_sj_overhang_min`), and assert the resulting CIGAR contains an `N` op of the intron length. Today's behaviour: emits the straight `M` CIGAR.

## Severity

Medium. Mapping-rate impact in `nf-core/rnaseq -profile test,docker` is ~0.2 pp (~ 50 of 49 551 reads per sample shift from `unique mapped` to `unmapped: too short`). The per-read CIGAR impact is bigger — 234 of 340 same-position-different-CIGAR reads on `WT_REP2` are this pattern. On larger genomes with richer annotation the percentage of reads spanning GTF junctions will be much higher than yeast's 1.5 %, so the effective impact scales up.

After [#27](https://github.com/scverse/rustar-aligner/issues/27) lands and this issue is fixed, the test profile's gene-level Salmon `NumReads` Pearson should rise from 0.9998 to closer to 1.0 (the residual being RNG / tie-breaking only), and the per-CIGAR diff count should drop materially.

---

<sub>Filed during **nf-core/rnaseq integration testing** ([nf-core/rnaseq#1855](https://github.com/nf-core/rnaseq/pull/1855)). All sibling issues from this exercise: [`author:pinin4fjords`](https://github.com/scverse/rustar-aligner/issues?q=is%3Aissue+author%3Apinin4fjords) or grep for `nf-core/rnaseq#1855`.</sub>



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pass-1 alignment doesn't seed candidates from `--sjdbGTFfile`; ~50% of GT/AG splices missed even after #27 #47

Summary

STAR reference behaviour

Reproducer

Observed (verified after applying #27)

Per-read CIGAR consequences

Where to look

Severity

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

`Log.final.out` field	rustar (after #27)	STAR 2.7.11b
`Number of splices: Total`	366	801
`Number of splices: Annotated (sjdb)`	95	620
`Number of splices: GT/AG`	266	686
`Number of splices: Non-canonical`	93	108

pass-1 alignment doesn't seed candidates from --sjdbGTFfile; ~50% of GT/AG splices missed even after #27 #47

Description

Summary

STAR reference behaviour

Reproducer

Observed (verified after applying #27)

Per-read CIGAR consequences

Where to look

Severity

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

pass-1 alignment doesn't seed candidates from `--sjdbGTFfile`; ~50% of GT/AG splices missed even after #27 #47