Skip to content

@PG BAM header line is content-free (no PN/VN/CL) #33

@pinin4fjords

Description

@pinin4fjords

Summary

rustar's BAM @PG line is content-free — just @PG\tID:rustar-aligner, no PN, VN, or CL fields. STAR emits a fully populated PG record including the full command line. Provenance tools (MultiQC program-version table, dx-toolkit lineage tracking, internal QC dashboards) end up with a blank entry.

STAR reference behaviour

source/samHeaders.cpp:61-63:

samHeaderStream << "@PG\tID:STAR\tPN:STAR\tVN:" << STAR_VERSION
                << "\tCL:" << P.commandLineFull << "\n";

SAM spec §1.3 defines the @PG fields:

  • ID — required, program record identifier.
  • PN — program name.
  • VN — program version.
  • CL — command line.

PN, VN, CL are optional in the spec but conventionally populated by every modern aligner (bwa, bwa-mem2, STAR, HISAT2, minimap2, bowtie2).

Suggested fix

In the rustar header writer (src/io/sam.rs::SamWriter::write_header), expand the @PG line to:

@PG\tID:rustar-aligner\tPN:rustar-aligner\tVN:{version}\tCL:{full command line}

version from env!("CARGO_PKG_VERSION"). Command line from std::env::args().collect::<Vec<_>>().join(" ") captured in main() before clap parsing.

Reproducer

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-33 && cd /tmp/rustar-mre-33

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

echo "=== rustar @PG ==="; docker run --rm -v $PWD:/w $STAR samtools view -H /w/RUS./Aligned.out.bam | grep '^@PG' | head -1
echo "=== STAR @PG ===" ; docker run --rm -v $PWD:/w $STAR samtools view -H /w/STAR.Aligned.out.bam | grep '^@PG' | head -1

Observed:

=== rustar @PG ===
@PG	ID:rustar-aligner

=== STAR @PG ===
@PG	ID:STAR	PN:STAR	VN:2.7.11b	CL:/opt/conda/bin/STAR-avx2  --runThreadN 4  ... --twopassMode Basic

Severity

Low. Doesn't break specific downstream tools but blanks the provenance entry for any tool that reads @PG.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions