Skip to content

Multi-mapper NH cap differs from STAR (NH up to 20 vs STAR's 7); possibly missing --outFilterMultimapScoreRange #31

@pinin4fjords

Description

@pinin4fjords

Summary

With identical --outFilterMultimapNmax 20 on the same paired-end inputs, rustar emits multi-mappers with NH values up to 20, while STAR caps at NH=6 on the same data. The differing tail isn't fully accounted for by tie-breaking — rustar is keeping additional, lower-scoring secondary alignments that STAR filters out.

Could be intentional, but the visible pattern (a long tail of NH=7..20 records that STAR doesn't produce) looks like a missing or differently-configured score-range filter.

STAR reference behaviour

STAR applies two caps on multi-mapper count, not one. From source/Parameters.cpp:141-142:

parArray.push_back(new ParameterInfoScalar <uint>     (-1, -1, "outFilterMultimapNmax",      &outFilterMultimapNmax));
parArray.push_back(new ParameterInfoScalar <intScore> (-1, -1, "outFilterMultimapScoreRange", &outFilterMultimapScoreRange));
  • --outFilterMultimapNmax (default 10): hard ceiling on number of loci a read can map to.
  • --outFilterMultimapScoreRange (default 1): max score gap between best and a kept multi-mapper. Alignments with score < best_score - scoreRange are dropped before the Nmax cap is applied.

So with --outFilterMultimapNmax 20 and the default --outFilterMultimapScoreRange 1, STAR keeps only loci within 1 score unit of the best, then caps at 20. The empirical NH<=6 tail is a consequence of the score-range filter trimming much earlier than the count filter.

rustar's --help output doesn't list a --outFilterMultimapScoreRange option, consistent with the filter not being implemented yet.

Reproducer

#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-31 && cd /tmp/rustar-mre-31

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outFilterMultimapNmax 20
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

for label in rustar STAR; do
    if [ $label = rustar ]; then BAM=RUS./Aligned.out.bam; else BAM=STAR.Aligned.out.bam; fi
    echo "=== $label NH distribution ==="
    docker run --rm -v $PWD:/w $STAR samtools view /w/$BAM \
        | grep -oE 'NH:i:[0-9]+' | sort | uniq -c | sort -k2 -n
    echo
done

Observed (verified, fresh paired run)

=== rustar NH distribution ===
     16 NH:i:8
     24 NH:i:12
     32 NH:i:16
     36 NH:i:9
     40 NH:i:20
     42 NH:i:7
     70 NH:i:5
     96 NH:i:6
    176 NH:i:4
   1540 NH:i:2
   3180 NH:i:3
  83340 NH:i:1

=== STAR NH distribution ===
     10 NH:i:5
     48 NH:i:4
     72 NH:i:6
   1448 NH:i:2
   2982 NH:i:3
  83512 NH:i:1

STAR caps at NH=6 (the highest NH value emitted). rustar's tail extends to NH=20 — same as --outFilterMultimapNmax, suggesting only the Nmax filter is being applied, not a score-range filter.

Open questions for the maintainers

  1. Is the absence of a --outFilterMultimapScoreRange analogue intentional in v0.1.x, or planned but not yet implemented?
  2. If implemented but with different defaults (e.g. wider score range), what's the rationale? Documenting this in the README's "Accuracy Comparison vs STAR" section would help users understand why their multi-mapper counts differ.
  3. If unimplemented, would you accept a PR that adds the flag + the per-cluster score-range filter, defaulting to 1 to match STAR?

Why this matters

  • Multi-mapper-aware quant: tools like Salmon EM and RSEM weight multi-mappers by the locus count emitted in the BAM. Extra noisy hits at NH=10-20 dilute the EM and shift transcript-level assignments.
  • MAPQ schemes: STAR's MAPQ {0, 1, 3, 255} is keyed on NH (NH=1 -> 255, NH=2 -> 3, NH=3 -> 1, NH>=4 -> 0). Reads that should be MAPQ=255 in STAR (only one within-score-range hit) become MAPQ=0 in rustar when the score-range filter is absent.
  • Reproducibility claim: the README states "Multi-mapped 7.4 % | 7.4 %" agreement with STAR on a 10 k single-end benchmark. That parity probably breaks on paired-end with a longer NH tail; worth re-running the comparison on paired-end.

Severity

Behavioural / possibly missing-feature rather than fix-now bug. Filing as a clarifying issue: if --outFilterMultimapScoreRange (or an equivalent filter) is on the roadmap, this can be closed with a "tracked in #X" comment; if not, the discussion above is the spec for adding it.


Filed during nf-core/rnaseq integration testing (nf-core/rnaseq#1855). All sibling issues from this exercise: author:pinin4fjords or grep for nf-core/rnaseq#1855.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions