Multi-mapper NH cap differs from STAR (NH up to 20 vs STAR's 7); possibly missing `--outFilterMultimapScoreRange`

## Summary

With identical `--outFilterMultimapNmax 20` on the same paired-end inputs, rustar emits multi-mappers with NH values up to 20, while STAR caps at NH=6 on the same data. The differing tail isn't fully accounted for by tie-breaking — rustar is keeping additional, lower-scoring secondary alignments that STAR filters out.

Could be intentional, but the visible pattern (a long tail of NH=7..20 records that STAR doesn't produce) looks like a missing or differently-configured score-range filter.

## STAR reference behaviour

STAR applies **two** caps on multi-mapper count, not one. From [`source/Parameters.cpp:141-142`](https://github.com/alexdobin/STAR/blob/master/source/Parameters.cpp#L141-L142):

```cpp
parArray.push_back(new ParameterInfoScalar <uint>     (-1, -1, "outFilterMultimapNmax",      &outFilterMultimapNmax));
parArray.push_back(new ParameterInfoScalar <intScore> (-1, -1, "outFilterMultimapScoreRange", &outFilterMultimapScoreRange));
```

- `--outFilterMultimapNmax` (default 10): hard ceiling on number of loci a read can map to.
- `--outFilterMultimapScoreRange` (default **1**): max score gap between best and a kept multi-mapper. Alignments with score < best_score - scoreRange are dropped before the Nmax cap is applied.

So with `--outFilterMultimapNmax 20` and the default `--outFilterMultimapScoreRange 1`, STAR keeps only loci within 1 score unit of the best, then caps at 20. The empirical NH<=6 tail is a consequence of the score-range filter trimming much earlier than the count filter.

rustar's `--help` output doesn't list a `--outFilterMultimapScoreRange` option, consistent with the filter not being implemented yet.

## Reproducer

```bash
#!/usr/bin/env bash
set -euo pipefail
mkdir -p /tmp/rustar-mre-31 && cd /tmp/rustar-mre-31

BASE=https://raw.githubusercontent.com/nf-core/test-datasets/626c8fab639062eade4b10747e919341cbf9b41a
curl -fsLO $BASE/reference/genome.fasta
curl -fsL  $BASE/reference/genes_with_empty_tid.gtf.gz | gunzip -c > genes.gtf
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_1.fastq.gz
curl -fsLO $BASE/testdata/GSE110004/SRR6357072_2.fastq.gz

RUSTAR=ghcr.io/scverse/rustar-aligner:dev
STAR=community.wave.seqera.io/library/htslib_samtools_star_gawk:ae438e9a604351a4

mkdir -p idx-rustar idx-star
docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner --runMode genomeGenerate \
    --genomeDir idx-rustar --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7
docker run --rm -v $PWD:/w -w /w $STAR STAR --runMode genomeGenerate \
    --genomeDir idx-star --genomeFastaFiles genome.fasta --sjdbGTFfile genes.gtf \
    --sjdbOverhang 100 --genomeSAindexNbases 7

COMMON=(--readFilesIn SRR6357072_1.fastq.gz SRR6357072_2.fastq.gz --readFilesCommand zcat
        --runThreadN 4 --sjdbGTFfile genes.gtf --twopassMode Basic --runRNGseed 0
        --outSAMtype BAM Unsorted --outSAMattributes NH HI AS NM MD
        --outFilterMultimapNmax 20
        --outSAMattrRGline ID:WT_REP2 SM:WT_REP2)

docker run --rm -v $PWD:/w -w /w $RUSTAR rustar-aligner \
    --genomeDir idx-rustar "${COMMON[@]}" --outFileNamePrefix RUS.
docker run --rm -v $PWD:/w -w /w $STAR STAR \
    --genomeDir idx-star "${COMMON[@]}" --outFileNamePrefix STAR.

for label in rustar STAR; do
    if [ $label = rustar ]; then BAM=RUS./Aligned.out.bam; else BAM=STAR.Aligned.out.bam; fi
    echo "=== $label NH distribution ==="
    docker run --rm -v $PWD:/w $STAR samtools view /w/$BAM \
        | grep -oE 'NH:i:[0-9]+' | sort | uniq -c | sort -k2 -n
    echo
done
```

## Observed (verified, fresh paired run)

```
=== rustar NH distribution ===
     16 NH:i:8
     24 NH:i:12
     32 NH:i:16
     36 NH:i:9
     40 NH:i:20
     42 NH:i:7
     70 NH:i:5
     96 NH:i:6
    176 NH:i:4
   1540 NH:i:2
   3180 NH:i:3
  83340 NH:i:1

=== STAR NH distribution ===
     10 NH:i:5
     48 NH:i:4
     72 NH:i:6
   1448 NH:i:2
   2982 NH:i:3
  83512 NH:i:1
```

STAR caps at NH=6 (the highest NH value emitted). rustar's tail extends to NH=20 — same as `--outFilterMultimapNmax`, suggesting only the Nmax filter is being applied, not a score-range filter.

## Open questions for the maintainers

1. Is the absence of a `--outFilterMultimapScoreRange` analogue intentional in v0.1.x, or planned but not yet implemented?
2. If implemented but with different defaults (e.g. wider score range), what's the rationale? Documenting this in the README's "Accuracy Comparison vs STAR" section would help users understand why their multi-mapper counts differ.
3. If unimplemented, would you accept a PR that adds the flag + the per-cluster score-range filter, defaulting to 1 to match STAR?

## Why this matters

- **Multi-mapper-aware quant**: tools like Salmon EM and RSEM weight multi-mappers by the locus count emitted in the BAM. Extra noisy hits at NH=10-20 dilute the EM and shift transcript-level assignments.
- **MAPQ schemes**: STAR's MAPQ `{0, 1, 3, 255}` is keyed on NH (NH=1 -> 255, NH=2 -> 3, NH=3 -> 1, NH>=4 -> 0). Reads that should be MAPQ=255 in STAR (only one within-score-range hit) become MAPQ=0 in rustar when the score-range filter is absent.
- **Reproducibility claim**: the README states "Multi-mapped 7.4 % | 7.4 %" agreement with STAR on a 10 k single-end benchmark. That parity probably breaks on paired-end with a longer NH tail; worth re-running the comparison on paired-end.

## Severity

Behavioural / possibly missing-feature rather than fix-now bug. Filing as a clarifying issue: if `--outFilterMultimapScoreRange` (or an equivalent filter) is on the roadmap, this can be closed with a "tracked in #X" comment; if not, the discussion above is the spec for adding it.


---

<sub>Filed during **nf-core/rnaseq integration testing** ([nf-core/rnaseq#1855](https://github.com/nf-core/rnaseq/pull/1855)). All sibling issues from this exercise: [`author:pinin4fjords`](https://github.com/scverse/rustar-aligner/issues?q=is%3Aissue+author%3Apinin4fjords) or grep for `nf-core/rnaseq#1855`.</sub>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Multi-mapper NH cap differs from STAR (NH up to 20 vs STAR's 7); possibly missing `--outFilterMultimapScoreRange` #31

Summary

STAR reference behaviour

Reproducer

Observed (verified, fresh paired run)

Open questions for the maintainers

Why this matters

Severity

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Multi-mapper NH cap differs from STAR (NH up to 20 vs STAR's 7); possibly missing --outFilterMultimapScoreRange #31

Description

Summary

STAR reference behaviour

Reproducer

Observed (verified, fresh paired run)

Open questions for the maintainers

Why this matters

Severity

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Multi-mapper NH cap differs from STAR (NH up to 20 vs STAR's 7); possibly missing `--outFilterMultimapScoreRange` #31