Skip to content

pass-1 alignment doesn't seed candidates from --sjdbGTFfile; ~50% of GT/AG splices missed even after #27 #47

@pinin4fjords

Description

@pinin4fjords

Summary

After #27 lands (the SpliceJunctionDb coord-space mismatch fix in PR #45) and Log.final.out correctly reports Annotated (sjdb) > 0, the total splice count stays at roughly half STAR's on the same data, and the non-canonical splice rate stays elevated:

Log.final.out field rustar (after #27) STAR 2.7.11b
Number of splices: Total 366 801
Number of splices: Annotated (sjdb) 95 620
Number of splices: GT/AG 266 686
Number of splices: Non-canonical 93 108

(Earlier numbers in this issue body cited STAR's total as 720; a fresh paired run on the same inputs gives 801. Adjusted.)

So the annotation lookup is correct (the sjdbScore +3 bonus now fires when read-derived junctions match the GTF), but pass-1 alignment isn't using sjdb-derived junctions as alignment candidates — i.e. rustar is only discovering junctions from the reads themselves, where STAR pre-loads the GTF junctions into the SA-extended genome at index time and seeds pass-1 against them. Read alignments that would extend across a GTF junction with a [align_sjdb_overhang_min=3, align_sj_overhang_min=5) overhang fall through to the stricter unannotated gate and either map as unspliced (CIGAR collapse to 101M) or fall to the Reads unmapped: too short bucket.

This is downstream of #27 but not the same bug. #27 was about the lookup at scoring time; this one is about candidate generation at seeding time.

STAR reference behaviour

STAR's pass-1 uses the sjdb-extended genome (the Gsj buffer appended to the genome and indexed in the SA during genomeGenerate — see source/sjdbPrepare.cpp and source/sjdbBuildIndex.cpp). Seeds that land in the Gsj region get translated back to the corresponding genome donor/acceptor pair and the splice is recorded as a candidate junction without the read having to bridge the gap by itself.

rustar's src/junction/sjdb_insert.rs does port the sjdb-into-Gsj logic at genomeGenerate time. The startup log confirms Junction database loaded: 5 annotated junctions and Wrote sjdb files: 5 junctions on this test profile. So the index machinery is there. The gap must be in pass-1 alignment using those Gsj-region hits to produce splice candidates that propagate into the seed -> cluster -> stitch pipeline.

Reproducer

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-pass1-sjdb && cd /tmp/rustar-mre-pass1-sjdb

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev   # or a build with #27 applied
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

echo "=== rustar splices ==="
grep -E 'splices' RUS./Log.final.out
echo
echo "=== STAR splices ==="
grep -E 'splices' STAR.Log.final.out

Note: this reproducer assumes #27's coord-space fix is applied (currently in PR #45). On unpatched main, Number of splices: Annotated (sjdb) = 0 from #27 is the louder symptom, but the total / non-canonical gap is the same regardless.

Observed (verified after applying #27)

rustar:
  Number of splices: Total            | 366
  Number of splices: Annotated (sjdb) | 95
  Number of splices: GT/AG            | 266
  Number of splices: Non-canonical    | 93

STAR:
  Number of splices: Total            | 720
  Number of splices: Annotated (sjdb) | 620
  Number of splices: GT/AG            | 688
  Number of splices: Non-canonical    | 25

The gap is roughly half the GT/AG splices missing and 3.7x the non-canonical rate. Both are consistent with pass-1 not seeding from sjdb: reads that should extend across a GTF junction either fail to find the splice and emit 101M (CIGAR collapse, no junction recorded) or stitch a non-canonical junction with a worse score.

Per-read CIGAR consequences

Comparing genome BAMs on reads present in both with identical position + orientation, 340 records have different CIGAR on WT_REP2. The dominant pattern (~70 %) is: STAR splices a read across an annotated junction, rustar emits a straight M/S CIGAR. Example fragment SRR6357072.32572100:

  • STAR: 96M14674N5M (crosses a GTF-annotated yeast intron). NH=2.
  • rustar: 101M at the same exonic start. NH=1.

The unspliced rustar alignment "fits" — the read has a high-identity match to the exonic genome region — but it's the wrong alignment for a read that should have crossed the intron.

Where to look

src/junction/sjdb_insert.rs correctly extends the genome with the Gsj buffer (one segment per annotated junction, flanked by sjdbOverhang bases). At alignment time, seeds landing in the Gsj region need to:

  1. Be translated back to the genome donor/acceptor pair (the offset arithmetic that maps Gsj position to (chr, intron_start, intron_end)).
  2. Generate a candidate splice for the seed -> cluster -> stitch pipeline, so the stitching code considers crossing the annotated junction as one of the alignment options.

Pseudocode for STAR's logic (in source/seedSearch1.cpp / source/aligner.cpp):

for each SA hit:
    if hit_pos >= n_genome_real:    # landed in Gsj region
        (donor, acceptor) = decode_gsj_hit(hit_pos)
        emit splice_candidate(donor, acceptor)  # alongside the regular seed

rustar appears to do the regular seed/cluster path but not the Gsj-hit decoding. Worth grep'ing for Gsj, n_genome, and n_genome_real in src/align/ to see whether SA hits in the Gsj region are filtered out (instead of decoded to splice candidates).

A targeted unit test would build a 2-exon toy transcript with a known intron, align a read that overhangs the splice by 3 bp on each side (between align_sjdb_overhang_min and align_sj_overhang_min), and assert the resulting CIGAR contains an N op of the intron length. Today's behaviour: emits the straight M CIGAR.

Severity

Medium. Mapping-rate impact in nf-core/rnaseq -profile test,docker is 0.2 pp ( 50 of 49 551 reads per sample shift from unique mapped to unmapped: too short). The per-read CIGAR impact is bigger — 234 of 340 same-position-different-CIGAR reads on WT_REP2 are this pattern. On larger genomes with richer annotation the percentage of reads spanning GTF junctions will be much higher than yeast's 1.5 %, so the effective impact scales up.

After #27 lands and this issue is fixed, the test profile's gene-level Salmon NumReads Pearson should rise from 0.9998 to closer to 1.0 (the residual being RNG / tie-breaking only), and the per-CIGAR diff count should drop materially.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions