Summary
With identical --outFilterMultimapNmax 20 on the same paired-end inputs, rustar emits multi-mappers with NH values up to 20, while STAR caps at NH=6 on the same data. The differing tail isn't fully accounted for by tie-breaking — rustar is keeping additional, lower-scoring secondary alignments that STAR filters out.
Could be intentional, but the visible pattern (a long tail of NH=7..20 records that STAR doesn't produce) looks like a missing or differently-configured score-range filter.
STAR reference behaviour
STAR applies two caps on multi-mapper count, not one. From source/Parameters.cpp:141-142:
parArray.push_back(new ParameterInfoScalar <uint> (-1, -1, "outFilterMultimapNmax", &outFilterMultimapNmax));
parArray.push_back(new ParameterInfoScalar <intScore> (-1, -1, "outFilterMultimapScoreRange", &outFilterMultimapScoreRange));
--outFilterMultimapNmax (default 10): hard ceiling on number of loci a read can map to.
--outFilterMultimapScoreRange (default 1): max score gap between best and a kept multi-mapper. Alignments with score < best_score - scoreRange are dropped before the Nmax cap is applied.
So with --outFilterMultimapNmax 20 and the default --outFilterMultimapScoreRange 1, STAR keeps only loci within 1 score unit of the best, then caps at 20. The empirical NH<=6 tail is a consequence of the score-range filter trimming much earlier than the count filter.
rustar's --help output doesn't list a --outFilterMultimapScoreRange option, consistent with the filter not being implemented yet.
Reproducer
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-31 && cd /tmp/rustar-mre-31
BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz
RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4
mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
--genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
--genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
--sjdbOverhang 100 --genomeSAindexNbases 7
COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
--runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
--outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
--outFilterMultimapNmax 20
--outSAMattrRGline ID:WT_REP2 SM:WT_REP2)
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
--genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
--genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.
for label in rustar STAR; do
if [ $label = rustar ]; then BAM=RUS./Aligned.out.bam; else BAM=STAR.Aligned.out.bam; fi
echo "=== $label NH distribution ==="
docker run --rm -v $PWD:/w $STAR samtools view /w/$BAM \
| grep -oE 'NH:i:[0-9]+' | sort | uniq -c | sort -k2 -n
echo
done
Observed (verified, fresh paired run)
=== rustar NH distribution ===
16 NH:i:8
24 NH:i:12
32 NH:i:16
36 NH:i:9
40 NH:i:20
42 NH:i:7
70 NH:i:5
96 NH:i:6
176 NH:i:4
1540 NH:i:2
3180 NH:i:3
83340 NH:i:1
=== STAR NH distribution ===
10 NH:i:5
48 NH:i:4
72 NH:i:6
1448 NH:i:2
2982 NH:i:3
83512 NH:i:1
STAR caps at NH=6 (the highest NH value emitted). rustar's tail extends to NH=20 — same as --outFilterMultimapNmax, suggesting only the Nmax filter is being applied, not a score-range filter.
Open questions for the maintainers
- Is the absence of a
--outFilterMultimapScoreRange analogue intentional in v0.1.x, or planned but not yet implemented?
- If implemented but with different defaults (e.g. wider score range), what's the rationale? Documenting this in the README's "Accuracy Comparison vs STAR" section would help users understand why their multi-mapper counts differ.
- If unimplemented, would you accept a PR that adds the flag + the per-cluster score-range filter, defaulting to 1 to match STAR?
Why this matters
- Multi-mapper-aware quant: tools like Salmon EM and RSEM weight multi-mappers by the locus count emitted in the BAM. Extra noisy hits at NH=10-20 dilute the EM and shift transcript-level assignments.
- MAPQ schemes: STAR's MAPQ
{0, 1, 3, 255} is keyed on NH (NH=1 -> 255, NH=2 -> 3, NH=3 -> 1, NH>=4 -> 0). Reads that should be MAPQ=255 in STAR (only one within-score-range hit) become MAPQ=0 in rustar when the score-range filter is absent.
- Reproducibility claim: the README states "Multi-mapped 7.4 % | 7.4 %" agreement with STAR on a 10 k single-end benchmark. That parity probably breaks on paired-end with a longer NH tail; worth re-running the comparison on paired-end.
Severity
Behavioural / possibly missing-feature rather than fix-now bug. Filing as a clarifying issue: if --outFilterMultimapScoreRange (or an equivalent filter) is on the roadmap, this can be closed with a "tracked in #X" comment; if not, the discussion above is the spec for adding it.
Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.
Summary
With identical
--outFilterMultimapNmax 20on the same paired-end inputs, rustar emits multi-mappers with NH values up to 20, while STAR caps at NH=6 on the same data. The differing tail isn't fully accounted for by tie-breaking — rustar is keeping additional, lower-scoring secondary alignments that STAR filters out.Could be intentional, but the visible pattern (a long tail of NH=7..20 records that STAR doesn't produce) looks like a missing or differently-configured score-range filter.
STAR reference behaviour
STAR applies two caps on multi-mapper count, not one. From
source/Parameters.cpp:141-142:--outFilterMultimapNmax(default 10): hard ceiling on number of loci a read can map to.--outFilterMultimapScoreRange(default 1): max score gap between best and a kept multi-mapper. Alignments with score < best_score - scoreRange are dropped before the Nmax cap is applied.So with
--outFilterMultimapNmax 20and the default--outFilterMultimapScoreRange 1, STAR keeps only loci within 1 score unit of the best, then caps at 20. The empirical NH<=6 tail is a consequence of the score-range filter trimming much earlier than the count filter.rustar's
--helpoutput doesn't list a--outFilterMultimapScoreRangeoption, consistent with the filter not being implemented yet.Reproducer
Observed (verified, fresh paired run)
STAR caps at NH=6 (the highest NH value emitted). rustar's tail extends to NH=20 — same as
--outFilterMultimapNmax, suggesting only the Nmax filter is being applied, not a score-range filter.Open questions for the maintainers
--outFilterMultimapScoreRangeanalogue intentional in v0.1.x, or planned but not yet implemented?Why this matters
{0, 1, 3, 255}is keyed on NH (NH=1 -> 255, NH=2 -> 3, NH=3 -> 1, NH>=4 -> 0). Reads that should be MAPQ=255 in STAR (only one within-score-range hit) become MAPQ=0 in rustar when the score-range filter is absent.Severity
Behavioural / possibly missing-feature rather than fix-now bug. Filing as a clarifying issue: if
--outFilterMultimapScoreRange(or an equivalent filter) is on the roadmap, this can be closed with a "tracked in #X" comment; if not, the discussion above is the spec for adding it.Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise:
author:pinin4fjordsor grep fornf-core/rnaseq#1855.