Skip to content

--outSAMattributes NM emits nM:i: instead of NM:i: (SAM-spec NM tag missing) #29

@pinin4fjords

Description

@pinin4fjords

Summary

When --outSAMattributes ... NM ... is requested, rustar-aligner emits nM:i: tags instead of NM:i: tags. The two are different SAM tags with different semantics:

  • NM (SAM spec): edit distance to the reference = substitutions + inserted bases + deleted bases.
  • nM (STAR-specific): number of mismatches per (paired) alignment — substitutions only, indels not counted.

rustar appears to compute the STAR-internal nM and emit it under that name even when the user asked for NM. Net result: requests for NM are silently honoured with a different tag holding a different value, and any downstream tool that reads NM:i: (samtools stats, picard CollectAlignmentSummaryMetrics, MultiQC's BAM-stats parsers, mappability filters keyed on edit distance) gets nothing.

STAR reference behaviour

STAR treats NM and nM as independent tags with independent code paths.

source/Parameters_samAttributes.cpp:69-77:

} else if (vAttr1.at(ii)== "NM") {
    outSAMattrOrder.push_back(ATTR_NM);
    outSAMattrPresent.NM=true;
} else if (vAttr1.at(ii)== "nM") {
    outSAMattrOrder.push_back(ATTR_nM);
    outSAMattrPresent.nM=true;

source/ReadAlign_outputTranscriptSAM.cpp:242-271 computes the SAM-spec edit distance into tagNM, counting mismatches + deletion bases + insertion bases separately; line 305-306 writes \tNM:i: + tagNM. STAR-the-reference accepts NM, nM, or both, and emits each from its own computation.

Reproducer

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-29 && cd /tmp/rustar-mre-29

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

RBAM=RUS./Aligned.out.bam
SBAM=STAR.Aligned.out.bam

echo "=== Tag inventory ==="
for label in rustar STAR; do
    if [ $label = rustar ]; then BAM=$RBAM; else BAM=$SBAM; fi
    nm=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$BAM | grep -c 'NM:i:')
    nm_lower=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$BAM | grep -c 'nM:i:')
    echo "$label NM:i: $nm | nM:i: $nm_lower"
done

echo
echo "=== Spot-check on records with I/D in CIGAR (CIGAR \t tag) ==="
echo "--- rustar (nM) ---"
docker run --rm -v $PWD:/w $STAR samtools view /w/$RBAM \
  | awk '$6 ~ /[ID]/ {for(i=12;i<=NF;i++) if($i ~ /^nM:/){t=$i;break}; print $6"\t"t}' | head -5
echo "--- STAR (NM) ---"
docker run --rm -v $PWD:/w $STAR samtools view /w/$SBAM \
  | awk '$6 ~ /[ID]/ {for(i=12;i<=NF;i++) if($i ~ /^NM:/){t=$i;break}; print $6"\t"t}' | head -5

Observed (verified, fresh paired run)

rustar NM:i: 0     | nM:i: 88592
STAR   NM:i: 88072 | nM:i: 0

Same first records with indels in CIGAR:

--- rustar (nM) ---
80M2D21M    nM:i:0
42M1D58M    nM:i:0
18M1I82M    nM:i:0
77M1D23M    nM:i:1
32M1D69M    nM:i:1

--- STAR (NM) ---
80M2D21M    NM:i:2
42M1D58M    NM:i:1
18M1I82M    NM:i:1
77M1D23M    NM:i:2
32M1D69M    NM:i:2

Every indel-containing record has rustar's nM:i:N less than STAR's NM:i:M by the count of inserted+deleted bases — consistent with rustar's nM being substitutions-only (matching STAR's documented nM semantics). The bug isn't the value of nM; the bug is that NM was requested and not delivered.

Suggested fix

  1. In rustar's --outSAMattributes parser, treat NM and nM as distinct tokens, the way STAR does at Parameters_samAttributes.cpp:69-77. Currently the NM token appears to route to the nM writer or be silently treated as nM.
  2. Implement the SAM-spec edit-distance computation for NM (substitutions + inserted bases + deleted bases), separately from rustar's existing nM mismatches-only counter. STAR's algorithm at ReadAlign_outputTranscriptSAM.cpp:242-271 walks the CIGAR and adds mismatches + deletion-gap lengths (excluding intron N skips) + insertion bases.

Both tags can be supported simultaneously (STAR's default --outSAMattributes Standard is NH HI AS nM, but NH HI AS NM MD is also valid; users routinely request NM for SAM-spec compliance).

Test plan

  1. Run rustar with --outSAMattributes NH HI AS NM MD on any paired-end data with a few indel-containing reads.
  2. Assert NM:i: records > 0 and nM:i: records == 0.
  3. Assert NM:i:N value equals (mismatches via MD:Z: decode) + (sum of I and D op lengths in CIGAR).

Why this matters

samtools stats, picard CollectAlignmentSummaryMetrics, mosdepth --fast-mode, BWA-style mappability filters, MultiQC's bam-stats and picard-metrics modules all consume NM:i:. Today they get an empty result on rustar BAMs, leading to silently-blank stats blocks in QC reports.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions