Skip to content

fix(sjdb): key SpliceJunctionDb in genome-absolute 0-based coordinates#45

Open
pinin4fjords wants to merge 2 commits into
scverse:mainfrom
pinin4fjords:fix/sjdb-coord-space
Open

fix(sjdb): key SpliceJunctionDb in genome-absolute 0-based coordinates#45
pinin4fjords wants to merge 2 commits into
scverse:mainfrom
pinin4fjords:fix/sjdb-coord-space

Conversation

@pinin4fjords
Copy link
Copy Markdown

Summary

SpliceJunctionDb was keyed in chromosome-local 1-based coordinates (extracted straight from GTF column 5) but consulted in two different coordinate spaces at runtime:

Call site Convention On chr 0 (chr_start=0) On chr N+
stitch-time (src/align/stitch.rs:1305-1314) genome-absolute 0-based off by 1 -> miss off by chr_start[N] -> miss
stats-time (src/lib.rs:1860-1894) genome-absolute 1-based matches -> hit off by chr_start[N] -> miss

Neither matched the DB. Smoking gun: on the same WT_REP2 BAM, SJ.out.tab had 2 rows annotated=1 (stats-time accidentally hits chr-0) while Log.final.out reported Number of splices: Annotated (sjdb) | 0 (stitch-time always misses). The stitch-time miss is the load-bearing one - it drops sjdb_score and pulls in the stricter align_sj_overhang_min gate, costing ~50 % of GT/AG splices on the nf-core/rnaseq test profile.

Fix

Normalise the DB to genome-absolute 0-based at construction (matching the existing convention used by prepared_junctions in src/index/mod.rs and by SpliceJunctionStats). Single source of truth contained in the GTF extraction step.

  • extract_junctions_configured in src/junction/gtf.rs now adds chr_start[chr_idx] and subtracts 1 before returning each (intron_start, intron_end). SpliceJunctionDb::from_raw_junctions consumes those tuples verbatim.
  • Stats-time call sites at src/lib.rs:1860 and src/lib.rs:796 now record (genome_pos, genome_pos + intron_len - 1) instead of (genome_pos + 1, genome_pos + intron_len). genome_pos is the 0-based first intronic base, which is what detect_splice_motif already expects.
  • Stitch-time call site at src/align/stitch.rs is unchanged - it already passes genome-absolute 0-based.
  • SJ.out.tab writer in src/junction/sj_output.rs now adds + 1 after subtracting chr_start[chr_idx] so the chr-local 1-based output is unchanged.
  • prepared_junctions construction in src/index/mod.rs is simplified: the chr-local-to-absolute conversion that used to live there now happens upstream in the GTF extractor.

Test plan

  • New unit test test_db_keyed_in_genome_absolute_zero_based_multi_chr builds a SpliceJunctionDb on a 2-chromosome toy genome (chr_start[1] = 1000) and asserts is_annotated succeeds for the chr-1 junction at its genome-absolute 0-based key. The pre-fix chr-local key and the pre-fix stitch-time off-by-one key must both miss
  • Existing tests that encoded the chr-local 1-based assumption have been updated to genome-absolute 0-based equivalents (those were testing the bug)
  • cargo build
  • cargo clippy --lib -- -D warnings
  • cargo fmt --check
  • cargo test (384 lib tests + integration suites pass)

After this fix, Number of splices: Annotated (sjdb) in Log.final.out matches the count of annotated=1 rows in SJ.out.tab on the same BAM, and the per-sample Annotated count is in the same order as STAR's on equivalent inputs.

Fixes #27

The DB was keyed in chromosome-local 1-based coords (straight from the
GTF) but consulted in genome-absolute 0-based at stitch time, so every
sjdb lookup during alignment missed. Annotated splice events never got
the sjdb_score bonus and the stricter overhang gate fired in their
place, dropping ~50 % of GT/AG splices on the test profile.

The stats-time site queried in genome-absolute 1-based-equivalent, so
it accidentally matched on chr 0 (chr_start=0) but missed on every
other chromosome -- producing inconsistent answers between SJ.out.tab
and Log.final.out on the same BAM.

Normalise the DB to genome-absolute 0-based at construction, matching
the convention prepared_junctions and SpliceJunctionStats already use.
Update the stats-time call site to query in the new space. Stitch-time
needs no change.

Fixes scverse#27

Co-Authored-By: Claude <noreply@anthropic.com>
@pinin4fjords
Copy link
Copy Markdown
Author

Verification on macOS/aarch64 against the rebuilt fix branch: the smoking-gun behaviour from #27 is still present, so this does not yet close the issue.

Rebuilt the index with the patched binary and reran the PE yeast reproducer (SRR6357072_{1,2}.fastq, --twopassMode Basic --sjdbGTFfile genes.gtf, --runRNGseed 0):

$ grep -E 'splices|sjdb' V27./Log.final.out
                       Number of splices: Total |	366
            Number of splices: Annotated (sjdb) |	0
                       Number of splices: GT/AG |	266
                       Number of splices: GC/AG |	2
                       Number of splices: AT/AC |	5
               Number of splices: Non-canonical |	93

$ awk '{print $6}' V27./SJ.out.tab | sort | uniq -c
  14 0
   2 1

Log.final.out Annotated (sjdb) = 0, SJ.out.tab has 2 rows with annotated=1, Total = 366. These are the same numbers as pre-fix.

Why it still fails

The stitch-time call site at src/align/stitch.rs:1305-1314 queries:

let donor_fwd =
    index.sa_pos_to_forward(junc_donor_sa, cluster.is_reverse, del as usize);
let acceptor_fwd = donor_fwd + del as u64;
db.is_annotated(cluster.chr_idx, donor_fwd, acceptor_fwd, ...)

donor_sa = last_exon.genome_end is exclusive (= first intron base), so donor_fwd is genome-abs 0-based first intronic base — fine, matches the new DB convention. But acceptor_fwd = donor_fwd + del lands on the first base AFTER the intron (the first exonic base of the downstream exon), not the last intronic base. del = genome_gap - read_gap = intron length, so the correct expression is donor_fwd + del - 1.

The DB is keyed with intron_end = last intron base (0-based), per the new convention (PreparedJunction.end_pos, extract_junctions_configured now returns chr_off + exon2.start - 2). So stitch-time queries are off-by-one on the end and always miss.

Confirmed by experiment

Locally patching just acceptor_fwd = donor_fwd + del as u64 - 1 on top of this branch flips Log.final.out Annotated from 0 to 95 with the same Total. So the PR's two construction-site and stats-time changes are correct (SJ.out.tab annotated col already agrees with the DB), but the description's claim that "stitch-time is unchanged — it already passes genome-absolute 0-based" is true only for donor_fwd; acceptor_fwd needs a -1.

(Side note: the experimental patch lifts Annotated to 95 but Total stays at 366 — recovering the ~50% missing GT/AG splices the description mentions probably needs the same fix to feed the sjdb_score bonus through to the gating decision, which I didn't dig into further. Possibly another lookup site, or the bonus alone isn't enough to flip the overhang gate. Worth a closer look once the off-by-one is in.)

The previous commit normalised the DB to genome-absolute 0-based and
updated the stats-time query. Stitch-time was left unchanged on the
assumption that it already passed genome-absolute 0-based — half right.
donor_fwd was correct (first intron base, 0-based) but acceptor_fwd =
donor_fwd + del landed on the first base AFTER the intron, while the
DB keys store the last intron base. Lookup missed every annotated
junction.

Subtract 1 from the acceptor to land on the last intron base. After
this, Log.final.out Annotated (sjdb) goes from 0 to a non-zero count
that's consistent with the annotated=1 row count in SJ.out.tab on
the same BAM.

Co-Authored-By: Claude <noreply@anthropic.com>
@pinin4fjords
Copy link
Copy Markdown
Author

Followed up: the verification agent caught a real second bug in this PR's first revision. After the genome-absolute 0-based normalisation, stitch-time was still passing the wrong acceptor coordinate — donor_fwd + del lands on the first base after the intron, while the DB keys store the last intron base. So the lookup kept missing.

Subtracted 1 from acceptor_fwd (commit just pushed).

Verified end-to-end on macOS/aarch64 (PE yeast + --twopassMode Basic --sjdbGTFfile genes.gtf):

=== rustar Log.final.out splice section ===
Number of splices: Total            | 366
Number of splices: Annotated (sjdb) | 95         <- was 0 pre-fix
Number of splices: GT/AG            | 266
Number of splices: GC/AG            | 2
Number of splices: AT/AC            | 5
Number of splices: Non-canonical    | 93

=== rustar SJ.out.tab annotated col distribution ===
    14  annotated=0
     2  annotated=1

Both counters are now consistent (the pre-fix Log.final.out = 0 vs SJ.out.tab = 2 annotated smoking gun is gone — the lookup is now genuinely hitting on annotated junctions and the +3 sjdb_score bonus fires).

Remaining gap (separate from this PR): total splice count is still 366 vs STAR's ~720 on the same data, with non-canonical at 93 vs STAR's ~25. So pass-1 is still not using the sjdb-derived junctions as alignment candidates to recover the ~50 % of GT/AG splices STAR finds — that's a deeper algorithmic difference (pass-1 candidate generation, not just scoring). Worth tracking as a follow-up; the coord-space mismatch this PR fixes was a precondition for any downstream fix to work, so closing this one is meaningful even though the headline total is still gappy.

LGTM on the scope of the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Log.final.out always reports Number of splices: Annotated (sjdb) = 0 despite --sjdbGTFfile

1 participant