fix(transcriptome-bam): write per-record RG:Z: tag matching @RG header#39

Open
pinin4fjords wants to merge 2 commits into scverse:main from pinin4fjords:fix/transcriptome-rg-tag

@pinin4fjords
Summary

When --outSAMattrRGline ID:... SM:... is supplied, rustar writes the @RG header to both the genome and transcriptome BAMs. It also writes RG:Z:<id> per record on the genome BAM - but not on the transcriptome BAM. STAR writes RG:Z: on every transcriptome record (78886/78886 in the issue's test sample), while rustar wrote 0.

Any tool that splits transcriptome BAM records by read group (multi-sample bundled BAMs, custom QC keyed on RG) silently sees no RG.

Fix

Add the per-record RG:Z: stamp to the transcriptome record builder (paired-end and single-end paths), mirroring the genome record builder. The approach matches STAR's outSAMattrOrderQuant.push_back(ATTR_RG) at Parameters_samAttributes.cpp:201-205.

SamWriter::build_transcriptome_records already receives &Parameters, and the existing maybe_insert_rg_tag helper (already used by every genome-BAM record builder) is reused unchanged. Both the SE and PE call sites in lib.rs go through this single builder, so the fix lands on both paths without touching them.
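The gating described above can be sketched roughly as follows. This is a minimal illustration, not rustar's actual code: `Record`, `rg_id`, `stamp_rg`, and `build_transcriptome_records_sketch` are hypothetical stand-ins, and the real `maybe_insert_rg_tag` helper and `Parameters` type may have different signatures. It assumes the read-group ID is taken from the `ID:` field of the `--outSAMattrRGline` value.

```rust
// Hypothetical stand-ins; rustar's real types and helper signatures may differ.

/// Minimal alignment record: a read name plus a list of (tag, value) pairs.
#[derive(Debug, Default)]
pub struct Record {
    pub name: String,
    pub tags: Vec<(String, String)>,
}

/// Extract the read-group ID from an `--outSAMattrRGline` value such as
/// "ID:foo SM:bar". Returns None when the option is unset or has no ID field.
pub fn rg_id(rg_line: Option<&str>) -> Option<&str> {
    rg_line?
        .split_whitespace()
        .find_map(|field| field.strip_prefix("ID:"))
        .filter(|id| !id.is_empty())
}

/// Stamp RG:Z:<id> on a record only when a read group was configured.
pub fn stamp_rg(record: &mut Record, rg_line: Option<&str>) {
    if let Some(id) = rg_id(rg_line) {
        record.tags.push(("RG".to_string(), id.to_string()));
    }
}

/// Build records through one shared path, so single-end and paired-end
/// callers get identical tagging without any change at the call sites.
pub fn build_transcriptome_records_sketch(
    names: &[&str],
    rg_line: Option<&str>,
) -> Vec<Record> {
    names
        .iter()
        .map(|name| {
            let mut record = Record {
                name: name.to_string(),
                tags: Vec::new(),
            };
            stamp_rg(&mut record, rg_line);
            record
        })
        .collect()
}
```

Calling `build_transcriptome_records_sketch(&["r1", "r2"], Some("ID:foo SM:bar"))` tags every record with `RG` = `foo`, while passing `None` leaves the tag list empty, which is the same gating the two new unit tests in the test plan check.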

Test plan

  • New unit test asserts every transcriptome record carries RG:Z:<id> when --outSAMattrRGline ID:foo ... is supplied
  • New unit test asserts no RG:Z: tag is emitted when --outSAMattrRGline is unset (mirrors genome-BAM gating)
  • Existing lib tests pass (385 passed, 0 failed)
  • cargo build
  • cargo fmt --check
  • cargo clippy --lib -- -D warnings clean; cargo clippy --all-targets has 46 pre-existing errors on main unrelated to this change (deprecated assert_cmd::Command::cargo_bin, modulo_one in integration tests, unused import in src/chimeric/output.rs). Picked up separately by pa/lint-all-targets.

Fixes #32

When --outSAMattrRGline is supplied, rustar writes the @RG header to
both the genome and transcriptome BAMs, but only stamps the per-record
RG:Z: tag on the genome BAM. The transcriptome BAM was emitting
0 RG:Z: records vs STAR's 1-per-record output.

Add the RG tag stamp to the transcriptome record builder (paired-end
and single-end paths) so every record carries RG:Z:<id> matching the
@RG header, byte-symmetric with STAR.

Fixes scverse#32
@pinin4fjords
Author

Verified end-to-end on macOS/aarch64 against the rebuilt fix branch.

Same PE yeast input + --outSAMattrRGline ID:WT_REP2 SM:WT_REP2, transcriptome BAM (Aligned.toTranscriptome.out.bam):

total transcriptome records: 77300
records with RG:Z:         : 77300
first record RG tag        : RG:Z:WT_REP2

Pre-fix, the same invocation produced 0 / 77300 records with RG. After the fix, every record carries RG:Z:WT_REP2, matching the @RG header. LGTM.
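A check like the one above could be reproduced with a small counter over SAM text. The sketch below is illustrative only and is not the script used for the numbers in this comment: `RgCounts`, `rg_value`, and `count_rg` are hypothetical names. It assumes the BAM has already been decoded to SAM lines, and relies on the SAM layout of 11 mandatory tab-separated columns followed by optional fields.

```rust
/// Counts of alignment records seen and of records carrying an RG:Z: field.
#[derive(Debug, PartialEq, Eq)]
pub struct RgCounts {
    pub total: usize,
    pub with_rg: usize,
}

/// Return the RG:Z: value of one SAM alignment line, if present.
/// Optional fields start after the 11 mandatory tab-separated columns.
pub fn rg_value(sam_line: &str) -> Option<&str> {
    sam_line
        .split('\t')
        .skip(11)
        .find_map(|field| field.strip_prefix("RG:Z:"))
}

/// Tally alignment lines, skipping blank lines and header lines
/// (header lines start with '@', which a read name cannot).
pub fn count_rg<'a>(lines: impl IntoIterator<Item = &'a str>) -> RgCounts {
    let mut counts = RgCounts { total: 0, with_rg: 0 };
    for line in lines {
        if line.is_empty() || line.starts_with('@') {
            continue;
        }
        counts.total += 1;
        if rg_value(line).is_some() {
            counts.with_rg += 1;
        }
    }
    counts
}
```

Fed the text output of `samtools view -h Aligned.toTranscriptome.out.bam`, a fixed build should report `with_rg` equal to `total`, whereas the pre-fix build would report `with_rg` of 0.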

Development

Successfully merging this pull request may close these issues.

Transcriptome BAM (--quantMode TranscriptomeSAM) omits per-record RG:Z: tag
