fix(transcriptome-bam): populate mate fields on paired-end records#41
Open
pinin4fjords wants to merge 1 commit into
Open
fix(transcriptome-bam): populate mate fields on paired-end records#41pinin4fjords wants to merge 1 commit into
pinin4fjords wants to merge 1 commit into
Conversation
build_transcriptome_records_pe was stamping SEGMENTED + FIRST/LAST_SEGMENT but leaving PROPERLY_SEGMENTED, MATE_REVERSE_COMPLEMENTED, RNEXT, PNEXT, and TLEN unset on every transcriptome record. Salmon's alignment-mode fragment-length inference reads those mate fields; with them missing it falls back to its 250+/-25 prior, distorting paired-end TPMs. Walk the projected (rec1, rec2) pairs after the existing flag-stamping loops and mirror the mate-bookkeeping the genome-space build_paired_mate_record already does: PROPERLY_SEGMENTED + MATE_REVERSE_COMPLEMENTED + mate ref id + mate position + signed TLEN (leftmost +, rightmost -). Fixes scverse#22
Author
|
Verified end-to-end on macOS/aarch64 against the rebuilt fix branch with the issue's exact reproducer (PE yeast + First record pair
Pre-fix the same invocation produced |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Paired-end records in
Aligned.toTranscriptome.out.bamwere missing every mate field: noPROPERLY_SEGMENTED(0x2) flag, noMATE_REVERSE_COMPLEMENTED(0x20),RNEXT=*,PNEXT=0,TLEN=0. Salmon's alignment-mode fragment-length inference reads these fields; with them missing it falls back to its default prior (mean=250, sd=25) andEffectiveLengthfor short transcripts gets over-shrunk, inflating their TPMs.samtools flagstatpre-fix: 0 / 77 300 properly-paired records. STAR on the same data: 78 886 / 78 886.Fix
In
build_transcriptome_records_pe, after the existing flag-stamping loops and before the interleave, walk the projected(rec1, rec2)pairs alongside their(p1, p2)transcripts and stamp:PROPERLY_SEGMENTEDon both (concordance is guaranteed at this point - both mates project to the same transcript by construction).MATE_REVERSE_COMPLEMENTEDmirroring the other mate'sREVERSE_COMPLEMENTED.mate_reference_sequence_idandmate_alignment_startfrom the other projection.template_lengthwith STAR's sign convention: magnitudemax(end1,end2) - min(start1,start2) + 1, leftmost mate positive, rightmost negative.Mirrors the genome-space
build_paired_mate_record(which already does all of this) so the transcriptome path is now byte-symmetric for shared fields. The TLEN-signing logic is the samecalculate_insert_sizehelper the genome path uses (lifted topub(crate)); the mate-bookkeeping itself is factored into a newapply_pe_transcriptome_mate_fieldshelper insrc/io/sam.rsso it can be unit-tested in isolation.Test plan
PROPERLY_SEGMENTED, populated mate ref id, non-zero signed TLEN (forward and reverse cluster cases)cargo buildcargo clippy --lib -- -D warningscargo fmt --checkAfter the fix,
samtools view -c -f 2 <transcriptome.bam>returns the count of primary paired records (not 0).Fixes #22