feat(bam): populate @PG header with PN/VN/CL fields#40

Open
pinin4fjords wants to merge 2 commits into scverse:main from pinin4fjords:fix/pg-header-content

@pinin4fjords

Summary

The BAM @PG line in rustar was content-free: just @PG\tID:rustar-aligner, with no PN, VN, or CL fields. Downstream provenance tools (MultiQC program-version table, dx-toolkit lineage tracking, internal QC dashboards) end up with a blank entry where STAR shows a fully populated record.

This PR expands the @PG line to match STAR's format and the SAM spec §1.3 conventions:

@PG\tID:rustar-aligner\tPN:rustar-aligner\tVN:<cargo pkg version>\tCL:<command line>

The full command line is captured via std::env::args() in main.rs before clap consumes it, then threaded into Parameters via a new #[arg(skip)] field (command_line: Option<String>) so it reaches build_sam_header_from_refs in src/io/sam.rs. Version comes from env!("CARGO_PKG_VERSION") at compile time. When command_line is None (e.g. library use, internal callers), CL falls back to "rustar-aligner" so the field is still non-empty.
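As a rough illustration of the flow described above, here is a minimal, self-contained sketch. It is not the code in this PR: `build_pg_line` is a hypothetical stand-in for the relevant part of build_sam_header_from_refs, and it takes the version and command line as plain arguments instead of reading them from Parameters. It also uses `option_env!` rather than `env!` so that it compiles outside Cargo; that substitution is an assumption made for the sketch only.

```rust
use std::env;

/// Hypothetical helper (not the PR's actual function): formats one @PG line.
/// `command_line` mirrors the new `command_line: Option<String>` field;
/// `None` falls back to the bare program name so CL is never empty.
fn build_pg_line(version: &str, command_line: Option<&str>) -> String {
    let cl = command_line.unwrap_or("rustar-aligner");
    format!(
        "@PG\tID:rustar-aligner\tPN:rustar-aligner\tVN:{}\tCL:{}",
        version, cl
    )
}

fn main() {
    // Capture the raw argv as a single space-joined string.
    let command_line = env::args().collect::<Vec<String>>().join(" ");

    // The PR reads the version with env!("CARGO_PKG_VERSION"); option_env!
    // is used here only so the sketch also builds with plain rustc.
    let version = option_env!("CARGO_PKG_VERSION").unwrap_or("unknown");

    println!("{}", build_pg_line(version, Some(&command_line)));

    // Library-style call with no captured command line.
    println!("{}", build_pg_line(version, None));
}
```

Joining argv with spaces does not preserve the original shell quoting, so an argument that itself contains spaces will not round-trip exactly. For a provenance field that is usually acceptable.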

Scope

This addresses only Gap 1 of #33 (the @PG content). Gap 2 (AS value divergence on ~2.4% of identical-CIGAR records) is left to a separate investigation since its root cause appears to overlap with #27 (sjdb seeding).

Test plan

  • New unit test test_build_sam_header_pg_line_populated asserts the @PG line carries PN:rustar-aligner, VN:<cargo pkg version>, and CL:<expected command line>
  • New unit test test_build_sam_header_pg_line_default_cl_when_unset covers the command_line = None fallback so CL is still non-empty
  • All 385 existing lib tests still pass (no test asserted the prior content-free format)
  • cargo build
  • cargo clippy --lib -- -D warnings clean
  • cargo fmt --check clean
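For reviewers who want to see what the field-level assertions in the first two bullets amount to, one possible shape is sketched below. `parse_header_fields` is a hypothetical helper, not something in this PR, and the sample command line is made up for illustration; the real tests may simply use substring checks on the header string.

```rust
use std::collections::HashMap;

/// Hypothetical test helper: split one SAM header line into TAG -> value pairs.
/// Splits each field at the first ':' only, so values may contain colons.
fn parse_header_fields(line: &str) -> HashMap<&str, &str> {
    line.split('\t')
        .skip(1) // skip the record type, e.g. "@PG"
        .filter_map(|field| field.split_once(':'))
        .collect()
}

fn main() {
    // Illustrative header line; the CL value here is an invented example.
    let line =
        "@PG\tID:rustar-aligner\tPN:rustar-aligner\tVN:0.1.0\tCL:rustar-aligner --genomeDir idx";
    let fields = parse_header_fields(line);

    assert_eq!(fields.get("ID"), Some(&"rustar-aligner"));
    assert_eq!(fields.get("PN"), Some(&"rustar-aligner"));
    assert_eq!(fields.get("VN"), Some(&"0.1.0"));
    assert_eq!(fields.get("CL"), Some(&"rustar-aligner --genomeDir idx"));
}
```

Checking parsed fields rather than the whole string keeps the assertions independent of field order, which the SAM spec does not fix beyond the leading record type.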

Refs #33 (Gap 1 only)

Per SAM spec §1.3, the @PG line conventionally carries PN (program name),
VN (version), and CL (command line) alongside ID. rustar was emitting
only ID:rustar-aligner, leaving downstream provenance tools (MultiQC's
program-version table, dx-toolkit lineage tracking) with a blank entry.

Expand the header writer to emit:
  @PG\tID:rustar-aligner\tPN:rustar-aligner\tVN:<cargo pkg version>\tCL:<args>

The full command line is captured in main() before clap parses it, then
threaded into Parameters via a new #[arg(skip)] field so it reaches the SAM
header builder. Version comes from CARGO_PKG_VERSION at compile time.

This matches STAR's @PG format and gives downstream tools the provenance
they need.

Fixes scverse#33 (the @PG header gap; AS divergence is a separate item).
@pinin4fjords
Author

Verified end-to-end on macOS/aarch64 against the rebuilt fix branch.

samtools view -H <bam> @PG line:

@PG  ID:rustar-aligner  PN:rustar-aligner  VN:0.1.0  CL:/tmp/rustar-fix-33/target/debug/rustar-aligner --genomeDir idx-rustar --readFilesIn ... --outFileNamePrefix VP.

Before the fix, the same line was just @PG\tID:rustar-aligner (no PN, no VN, no CL). After the fix, all four fields are populated: PN and VN from the cargo package, CL from the captured std::env::args(). This matches STAR's format. LGTM.
