Summary
When --outSAMattributes ... NM ... is requested, rustar-aligner emits nM:i: tags instead of NM:i: tags. The two are different SAM tags with different semantics:
NM (SAM spec): edit distance to the reference = substitutions + inserted bases + deleted bases.
nM (STAR-specific): number of mismatches per (paired) alignment — substitutions only, indels not counted.
rustar appears to compute the STAR-internal nM and emit it under that name even when the user asked for NM. Net result: requests for NM are silently honoured with a different tag holding a different value, and any downstream tool that reads NM:i: (samtools stats, picard CollectAlignmentSummaryMetrics, MultiQC's BAM-stats parsers, mappability filters keyed on edit distance) gets nothing.
STAR reference behaviour
STAR treats NM and nM as independent tags with independent code paths.
source/Parameters_samAttributes.cpp:69-77:
} else if (vAttr1.at(ii)== "NM") {
outSAMattrOrder.push_back(ATTR_NM);
outSAMattrPresent.NM=true;
} else if (vAttr1.at(ii)== "nM") {
outSAMattrOrder.push_back(ATTR_nM);
outSAMattrPresent.nM=true;
source/ReadAlign_outputTranscriptSAM.cpp:242-271 computes the SAM-spec edit distance into tagNM, counting mismatches + deletion bases + insertion bases separately; line 305-306 writes \tNM:i: + tagNM. STAR-the-reference accepts NM, nM, or both, and emits each from its own computation.
Reproducer
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-29 && cd /tmp/rustar-mre-29
BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz
RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4
mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
--genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
--genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
--runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
--outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
--outSAMattrRGline ID:WT_REP2 SM:WT_REP2)
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
--genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
--genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.
RBAM=RUS./Aligned.out.bam
SBAM=STAR.Aligned.out.bam
echo "=== Tag inventory ==="
for label in rustar STAR; do
if [ $label = rustar ]; then BAM=$RBAM; else BAM=$SBAM; fi
nm=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$BAM | grep -c 'NM:i:')
nm_lower=$(docker run --rm -v $PWD:/w $STAR samtools view /w/$BAM | grep -c 'nM:i:')
echo "$label NM:i: $nm | nM:i: $nm_lower"
done
echo
echo "=== Spot-check on records with I/D in CIGAR (CIGAR \t tag) ==="
echo "--- rustar (nM) ---"
docker run --rm -v $PWD:/w $STAR samtools view /w/$RBAM \
| awk '$6 ~ /[ID]/ {for(i=12;i<=NF;i++) if($i ~ /^nM:/){t=$i;break}; print $6"\t"t}' | head -5
echo "--- STAR (NM) ---"
docker run --rm -v $PWD:/w $STAR samtools view /w/$SBAM \
| awk '$6 ~ /[ID]/ {for(i=12;i<=NF;i++) if($i ~ /^NM:/){t=$i;break}; print $6"\t"t}' | head -5
Observed (verified, fresh paired run)
rustar NM:i: 0 | nM:i: 88592
STAR NM:i: 88072 | nM:i: 0
Same first records with indels in CIGAR:
--- rustar (nM) ---
80M2D21M nM:i:0
42M1D58M nM:i:0
18M1I82M nM:i:0
77M1D23M nM:i:1
32M1D69M nM:i:1
--- STAR (NM) ---
80M2D21M NM:i:2
42M1D58M NM:i:1
18M1I82M NM:i:1
77M1D23M NM:i:2
32M1D69M NM:i:2
Every indel-containing record has rustar's nM:i:N less than STAR's NM:i:M by the count of inserted+deleted bases — consistent with rustar's nM being substitutions-only (matching STAR's documented nM semantics). The bug isn't the value of nM; the bug is that NM was requested and not delivered.
Suggested fix
- In rustar's
--outSAMattributes parser, treat NM and nM as distinct tokens, the way STAR does at Parameters_samAttributes.cpp:69-77. Currently the NM token appears to route to the nM writer or be silently treated as nM.
- Implement the SAM-spec edit-distance computation for
NM (substitutions + inserted bases + deleted bases), separately from rustar's existing nM mismatches-only counter. STAR's algorithm at ReadAlign_outputTranscriptSAM.cpp:242-271 walks the CIGAR and adds mismatches + deletion-gap lengths (excluding intron N skips) + insertion bases.
Both tags can be supported simultaneously (STAR's default --outSAMattributes Standard is NH HI AS nM, but NH HI AS NM MD is also valid; users routinely request NM for SAM-spec compliance).
Test plan
- Run rustar with
--outSAMattributes NH HI AS NM MD on any paired-end data with a few indel-containing reads.
- Assert
NM:i: records > 0 and nM:i: records == 0.
- Assert
NM:i:N value equals (mismatches via MD:Z: decode) + (sum of I and D op lengths in CIGAR).
Why this matters
samtools stats, picard CollectAlignmentSummaryMetrics, mosdepth --fast-mode, BWA-style mappability filters, MultiQC's bam-stats and picard-metrics modules all consume NM:i:. Today they get an empty result on rustar BAMs, leading to silently-blank stats blocks in QC reports.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.
Summary
When
--outSAMattributes ... NM ...is requested,rustar-aligneremitsnM:i:tags instead ofNM:i:tags. The two are different SAM tags with different semantics:NM(SAM spec): edit distance to the reference = substitutions + inserted bases + deleted bases.nM(STAR-specific): number of mismatches per (paired) alignment — substitutions only, indels not counted.rustar appears to compute the STAR-internal
nMand emit it under that name even when the user asked forNM. Net result: requests forNMare silently honoured with a different tag holding a different value, and any downstream tool that readsNM:i:(samtools stats,picard CollectAlignmentSummaryMetrics, MultiQC's BAM-stats parsers, mappability filters keyed on edit distance) gets nothing.STAR reference behaviour
STAR treats
NMandnMas independent tags with independent code paths.source/Parameters_samAttributes.cpp:69-77:source/ReadAlign_outputTranscriptSAM.cpp:242-271computes the SAM-spec edit distance intotagNM, counting mismatches + deletion bases + insertion bases separately; line 305-306 writes\tNM:i:+tagNM. STAR-the-reference acceptsNM,nM, or both, and emits each from its own computation.Reproducer
Observed (verified, fresh paired run)
Same first records with indels in CIGAR:
Every indel-containing record has rustar's
nM:i:Nless than STAR'sNM:i:Mby the count of inserted+deleted bases — consistent with rustar'snMbeing substitutions-only (matching STAR's documentednMsemantics). The bug isn't the value ofnM; the bug is thatNMwas requested and not delivered.Suggested fix
--outSAMattributesparser, treatNMandnMas distinct tokens, the way STAR does atParameters_samAttributes.cpp:69-77. Currently theNMtoken appears to route to thenMwriter or be silently treated asnM.NM(substitutions + inserted bases + deleted bases), separately from rustar's existingnMmismatches-only counter. STAR's algorithm atReadAlign_outputTranscriptSAM.cpp:242-271walks the CIGAR and adds mismatches + deletion-gap lengths (excluding intronNskips) + insertion bases.Both tags can be supported simultaneously (STAR's default
--outSAMattributes StandardisNH HI AS nM, butNH HI AS NM MDis also valid; users routinely requestNMfor SAM-spec compliance).Test plan
--outSAMattributes NH HI AS NM MDon any paired-end data with a few indel-containing reads.NM:i:records > 0 andnM:i:records == 0.NM:i:Nvalue equals (mismatches viaMD:Z:decode) + (sum ofIandDop lengths in CIGAR).Why this matters
samtools stats,picard CollectAlignmentSummaryMetrics,mosdepth --fast-mode, BWA-style mappability filters, MultiQC's bam-stats and picard-metrics modules all consumeNM:i:. Today they get an empty result on rustar BAMs, leading to silently-blank stats blocks in QC reports.Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise:
author:pinin4fjordsor grep fornf-core/rnaseq#1855.