fix(sjdb): key SpliceJunctionDb in genome-absolute 0-based coordinates#45
fix(sjdb): key SpliceJunctionDb in genome-absolute 0-based coordinates#45pinin4fjords wants to merge 2 commits into
SpliceJunctionDb in genome-absolute 0-based coordinates#45Conversation
The DB was keyed in chromosome-local 1-based coords (straight from the GTF) but consulted in genome-absolute 0-based at stitch time, so every sjdb lookup during alignment missed. Annotated splice events never got the sjdb_score bonus and the stricter overhang gate fired in their place, dropping ~50 % of GT/AG splices on the test profile. The stats-time site queried in genome-absolute 1-based-equivalent, so it accidentally matched on chr 0 (chr_start=0) but missed on every other chromosome -- producing inconsistent answers between SJ.out.tab and Log.final.out on the same BAM. Normalise the DB to genome-absolute 0-based at construction, matching the convention prepared_junctions and SpliceJunctionStats already use. Update the stats-time call site to query in the new space. Stitch-time needs no change. Fixes scverse#27 Co-Authored-By: Claude <noreply@anthropic.com>
|
Verification on macOS/aarch64 against the rebuilt fix branch: the smoking-gun behaviour from #27 is still present, so this does not yet close the issue. Rebuilt the index with the patched binary and reran the PE yeast reproducer (
Why it still failsThe stitch-time call site at let donor_fwd =
index.sa_pos_to_forward(junc_donor_sa, cluster.is_reverse, del as usize);
let acceptor_fwd = donor_fwd + del as u64;
db.is_annotated(cluster.chr_idx, donor_fwd, acceptor_fwd, ...)
The DB is keyed with Confirmed by experimentLocally patching just (Side note: the experimental patch lifts Annotated to 95 but Total stays at 366 — recovering the ~50% missing GT/AG splices the description mentions probably needs the same fix to feed the |
The previous commit normalised the DB to genome-absolute 0-based and updated the stats-time query. Stitch-time was left unchanged on the assumption that it already passed genome-absolute 0-based — half right. donor_fwd was correct (first intron base, 0-based) but acceptor_fwd = donor_fwd + del landed on the first base AFTER the intron, while the DB keys store the last intron base. Lookup missed every annotated junction. Subtract 1 from the acceptor to land on the last intron base. After this, Log.final.out Annotated (sjdb) goes from 0 to a non-zero count that's consistent with the annotated=1 row count in SJ.out.tab on the same BAM. Co-Authored-By: Claude <noreply@anthropic.com>
|
Followed up: the verification agent caught a real second bug in this PR's first revision. After the genome-absolute 0-based normalisation, stitch-time was still passing the wrong acceptor coordinate — Subtracted 1 from Verified end-to-end on macOS/aarch64 (PE yeast + Both counters are now consistent (the pre-fix Remaining gap (separate from this PR): total splice count is still 366 vs STAR's ~720 on the same data, with non-canonical at 93 vs STAR's ~25. So pass-1 is still not using the sjdb-derived junctions as alignment candidates to recover the ~50 % of GT/AG splices STAR finds — that's a deeper algorithmic difference (pass-1 candidate generation, not just scoring). Worth tracking as a follow-up; the coord-space mismatch this PR fixes was a precondition for any downstream fix to work, so closing this one is meaningful even though the headline total is still gappy. LGTM on the scope of the PR. |
Summary
SpliceJunctionDbwas keyed in chromosome-local 1-based coordinates (extracted straight from GTF column 5) but consulted in two different coordinate spaces at runtime:src/align/stitch.rs:1305-1314)chr_start[N]-> misssrc/lib.rs:1860-1894)chr_start[N]-> missNeither matched the DB. Smoking gun: on the same WT_REP2 BAM,
SJ.out.tabhad 2 rowsannotated=1(stats-time accidentally hits chr-0) whileLog.final.outreportedNumber of splices: Annotated (sjdb) | 0(stitch-time always misses). The stitch-time miss is the load-bearing one - it dropssjdb_scoreand pulls in the stricteralign_sj_overhang_mingate, costing ~50 % of GT/AG splices on the nf-core/rnaseq test profile.Fix
Normalise the DB to genome-absolute 0-based at construction (matching the existing convention used by
prepared_junctionsinsrc/index/mod.rsand bySpliceJunctionStats). Single source of truth contained in the GTF extraction step.extract_junctions_configuredinsrc/junction/gtf.rsnow addschr_start[chr_idx]and subtracts 1 before returning each(intron_start, intron_end).SpliceJunctionDb::from_raw_junctionsconsumes those tuples verbatim.src/lib.rs:1860andsrc/lib.rs:796now record(genome_pos, genome_pos + intron_len - 1)instead of(genome_pos + 1, genome_pos + intron_len).genome_posis the 0-based first intronic base, which is whatdetect_splice_motifalready expects.src/align/stitch.rsis unchanged - it already passes genome-absolute 0-based.SJ.out.tabwriter insrc/junction/sj_output.rsnow adds+ 1after subtractingchr_start[chr_idx]so the chr-local 1-based output is unchanged.prepared_junctionsconstruction insrc/index/mod.rsis simplified: the chr-local-to-absolute conversion that used to live there now happens upstream in the GTF extractor.Test plan
test_db_keyed_in_genome_absolute_zero_based_multi_chrbuilds aSpliceJunctionDbon a 2-chromosome toy genome (chr_start[1] = 1000) and assertsis_annotatedsucceeds for the chr-1 junction at its genome-absolute 0-based key. The pre-fix chr-local key and the pre-fix stitch-time off-by-one key must both misscargo buildcargo clippy --lib -- -D warningscargo fmt --checkcargo test(384 lib tests + integration suites pass)After this fix,
Number of splices: Annotated (sjdb)inLog.final.outmatches the count ofannotated=1rows inSJ.out.tabon the same BAM, and the per-sample Annotated count is in the same order as STAR's on equivalent inputs.Fixes #27