feat(bam): populate @PG header with PN/VN/CL fields #40
Open
pinin4fjords wants to merge 2 commits into
Per SAM spec §1.3, the @PG line conventionally carries PN (program name), VN (version), and CL (command line) alongside ID. rustar was emitting only ID:rustar-aligner, leaving downstream provenance tools (MultiQC's program-version table, dx-toolkit lineage tracking) with a blank entry.

Expand the header writer to emit:

@PG\tID:rustar-aligner\tPN:rustar-aligner\tVN:<cargo pkg version>\tCL:<args>

The full command line is captured in main() before clap parses it, then threaded into Parameters via a new #[arg(skip)] field so it reaches the SAM header builder. Version comes from CARGO_PKG_VERSION at compile time.

This matches STAR's @PG format and gives downstream tools the provenance they need.

Fixes scverse#33 (the @PG header gap; AS divergence is a separate item).
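As a rough illustration of the capture step described above, the sketch below joins the raw arguments into a single string and stores it on a parameters struct. The `Parameters` struct here is a hypothetical stand-in with only the one field, and `join_command_line` and `with_command_line` are illustrative names, not functions from the PR. Joining with single spaces is also an assumption about how the `CL` value is built.

```rust
/// Hypothetical stand-in for the aligner's `Parameters` struct; only the
/// field relevant to this change is shown. In the real struct the field
/// would carry `#[arg(skip)]` so that clap does not treat it as a CLI option.
#[derive(Debug, Default)]
struct Parameters {
    command_line: Option<String>,
}

/// Join raw arguments into one space-separated string for the `CL` field.
/// Assumption: no shell quoting is reapplied, so an argument that contains
/// spaces will not round-trip exactly.
fn join_command_line<I>(args: I) -> String
where
    I: IntoIterator<Item = String>,
{
    args.into_iter().collect::<Vec<_>>().join(" ")
}

/// Attach the captured command line to parameters that were already parsed.
fn with_command_line(mut params: Parameters, raw: String) -> Parameters {
    params.command_line = Some(raw);
    params
}
```

In `main()`, this would be called as `join_command_line(std::env::args())` before argument parsing, with the result passed to `with_command_line` once the parsed parameters exist.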
Verified end-to-end on macOS/aarch64 against the rebuilt fix branch.
Pre-fix, the same line was just @PG\tID:rustar-aligner.
Summary
The BAM `@PG` line in rustar was content-free: just `@PG\tID:rustar-aligner`, with no `PN`, `VN`, or `CL` fields. Downstream provenance tools (MultiQC program-version table, dx-toolkit lineage tracking, internal QC dashboards) end up with a blank entry where STAR shows a fully populated record.

This PR expands the `@PG` line to match STAR's format and the SAM spec §1.3 conventions:

`@PG\tID:rustar-aligner\tPN:rustar-aligner\tVN:<cargo pkg version>\tCL:<args>`

The full command line is captured via `std::env::args()` in `main.rs` before clap consumes it, then threaded into `Parameters` via a new `#[arg(skip)]` field (`command_line: Option<String>`) so it reaches `build_sam_header_from_refs` in `src/io/sam.rs`. Version comes from `env!("CARGO_PKG_VERSION")` at compile time. When `command_line` is `None` (e.g. library use, internal callers), `CL` falls back to `"rustar-aligner"` so the field is still non-empty.

Scope
This addresses only Gap 1 of #33 (the `@PG` content). Gap 2 (`AS` value divergence on ~2.4% of identical-CIGAR records) is left to a separate investigation since its root cause appears to overlap with #27 (sjdb seeding).

Test plan
- `test_build_sam_header_pg_line_populated` asserts the `@PG` line carries `PN:rustar-aligner`, `VN:<cargo pkg version>`, and `CL:<expected command line>`
- `test_build_sam_header_pg_line_default_cl_when_unset` covers the `command_line = None` fallback so `CL` is still non-empty
- `cargo build`
- `cargo clippy --lib -- -D warnings` clean
- `cargo fmt --check` clean

Refs #33 (Gap 1 only)
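For reviewers who want the two tested behaviours in one place, here is a minimal sketch of the `@PG` line formatting with the `CL` fallback. `build_pg_line` is a hypothetical helper, not the PR's `build_sam_header_from_refs`. The version is passed in as a parameter so the snippet compiles outside Cargo, whereas the PR reads it from `env!("CARGO_PKG_VERSION")`.

```rust
/// Hypothetical helper: format a single `@PG` header line (no trailing newline).
/// `version` stands in for `env!("CARGO_PKG_VERSION")` in the real code.
fn build_pg_line(version: &str, command_line: Option<&str>) -> String {
    const PROGRAM: &str = "rustar-aligner";
    // Fall back to the program name so the CL field is never empty,
    // e.g. for library use where no command line was captured.
    let cl = command_line.unwrap_or(PROGRAM);
    format!(
        "@PG\tID:{p}\tPN:{p}\tVN:{v}\tCL:{c}",
        p = PROGRAM,
        v = version,
        c = cl
    )
}
```

The two unit tests in the test plan correspond to calling this with `Some(...)` and with `None`. A real implementation may also want to decide how to handle tab characters inside the command line, since SAM header fields are tab-delimited.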