feat(rustar): wire rustar-aligner as opt-in STAR drop-in #1855

Draft
pinin4fjords wants to merge 10 commits into dev from rustar-aligner

@pinin4fjords
Member

@pinin4fjords pinin4fjords commented May 12, 2026

Summary

Add an opt-in --use_rustar_star toggle that swaps STAR alignment and index generation for rustar-aligner, a Rust port of STAR with the same --camelCase CLI and the same on-disk index format. Wiring follows the existing --use_sentieon_star / --use_parabricks_star pattern - the user-facing aligner choice stays star_salmon / star_rsem, only the engine swaps.

New / changed:

  • modules/local/rustar_align/{align,genomegenerate}/main.nf - thin clones of the upstream STAR modules, container swapped to ghcr.io/scverse/rustar-aligner:dev.
  • subworkflows/local/align_star/main.nf - adds a RUSTAR_ALIGN branch alongside Sentieon and Parabricks.
  • subworkflows/local/prepare_genome_indices/main.nf - builds the index with rustar when use_rustar_star && fasta_provided && !star_index. Pre-built STAR indices are reused unchanged because rustar reads the STAR format.
  • conf/modules/align_star.config - the existing withName: selector now also matches RUSTAR_ALIGN, so the same args block fires.
  • tests/rustar_default.nf.test (+ snapshot) - four cases mirroring parabricks_default.nf.test, no tag "gpu".
  • nf-test.config - SKIP_RUSTAR env-var gate so CI shards can opt out if the image isn't pullable.
  • bin/compare_aligner_runs.py - stdlib-only harness for back-to-back aligner runs (% mapped delta, Salmon TPM / NumReads correlations, trace timings).
  • docs/usage.md - experimental section after the Parabricks block.
  • docs/rustar_differences.md - rolling capture of every divergence observed during verification.
  • docs/rustar_investigation_wt_rep2.md - deep-dive on the headline TPM divergence, now filed upstream as scverse/rustar-aligner#22.
  • docs/rustar_bam_comparison.md - categorical BAM-level comparison surfacing further rustar v0.1.0 issues, all now filed upstream.
  • docs/rustar_quant_and_multiqc.md - per-transcript Salmon quant.sf and per-module MultiQC comparison; surfaced the BAM QUAL-offset bug as the one new file-able item beyond what the BAM-level catalog already found.
  • docs/rustar_two_pass_and_determinism.md - root-cause analysis of the Annotated (sjdb) = 0 symptom (coordinate-space mismatch in SpliceJunctionDb lookups; full diagnosis folded into scverse/rustar-aligner#27) and a determinism check (same --runRNGseed 0 reruns are content-identical, only record-emission order varies — not a bug).
  • docs/rustar_cli_compat.md - CLI flag compatibility matrix (100 STAR flags probed: 53 accepted-honoured, 40 rejected, 7 accepted-broken). Surfaced one new file-able bug (scverse/rustar-aligner#35); the others map onto already-filed issues.

VM verification (test profile, docker)

| Metric | STAR baseline | rustar | Δ |
| --- | --- | --- | --- |
| Pipeline wall time | 3 m 49 s | 2 m 40 s | -30 % |
| *_ALIGN median (per task) | 68.0 s | 33.8 s | -50 % |
| *_ALIGN peak RSS (per task) | 0.92 GB | 0.12 GB | -87 % |
| Mapping rate Δ vs STAR | - | within ±0.21 pp on all samples | - |
| gene_counts Pearson | - | ≥ 0.9998 on every sample | - |
| gene_tpm Pearson | - | 0.985-0.999 (see investigation) | - |

nf-test test tests/rustar_default.nf.test --profile +test,docker: 4 / 4 passing in 594 s against the committed snapshot.

Known rustar v0.1.0 divergences

These fall into two categories: things we already worked around in this PR (so the test pipeline runs green), and things we observed but didn't mask (so they're visible to anyone enabling --use_rustar_star). Every item below is filed upstream with a paired STAR + rustar MRE; the full rolling list is at docs/rustar_differences.md and the BAM-level catalog at docs/rustar_bam_comparison.md.

Module-level workarounds in this PR

  1. --limitGenomeGenerateRAM rejected at startup (scverse/rustar-aligner#25). Dropped from RUSTAR_GENOMEGENERATE.
  2. --outFileNamePrefix SAMPLE. treated as a directory, with bare-named outputs inside it — scverse/rustar-aligner#26. STAR uses it as a filename prefix. The module flattens the directory back to STAR-style filenames so downstream emit globs still match.
  3. Log.out / Log.progress.out not emitted; SJ.pass1.out.tab at top level instead of inside <prefix>_STARpass1/ (scverse/rustar-aligner#28). Marked optional in the module.
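
The flattening step in item 2 can be sketched as below. This is a minimal illustration under stated assumptions, not the module's actual script: the helper name `flatten_prefix_dir` is hypothetical, and it assumes rustar has written bare-named outputs into a directory named after the prefix, on the same filesystem as the work directory.

```python
from pathlib import Path


def flatten_prefix_dir(workdir: Path, prefix: str) -> list[Path]:
    """Move <workdir>/<prefix>/<name> to <workdir>/<prefix><name>.

    Hypothetical helper: turns directory-style output back into
    STAR-style prefixed filenames so existing output globs still match.
    """
    outdir = workdir / prefix
    moved: list[Path] = []
    if not outdir.is_dir():
        # Nothing to flatten; the prefix was already used as a filename prefix.
        return moved
    for entry in sorted(outdir.iterdir()):
        target = workdir / f"{prefix}{entry.name}"
        entry.rename(target)  # works for files and sub-directories alike
        moved.append(target)
    outdir.rmdir()  # the directory is empty once every entry has moved
    return moved
```

With a prefix of `SAMPLE.`, a file written as `SAMPLE./Log.final.out` ends up as `SAMPLE.Log.final.out`, which is the name STAR itself would have produced.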

Open upstream bugs surfaced during VM verification

  1. Paired-end Aligned.toTranscriptome.out.bam missing RNEXT / PNEXT / TLEN and the proper-pair flag (high) — scverse/rustar-aligner#22. Salmon falls back to its default fragment-length prior (mean=250, sd=25); distorts paired-end TPMs (gene_tpm Pearson 0.985 on WT_REP2). Full analysis in docs/rustar_investigation_wt_rep2.md.
  2. NM tag silently swapped to nM (high) — scverse/rustar-aligner#29. --outSAMattributes NM produces nM:i: (substitutions only) instead of NM:i: (SAM-spec edit distance = mismatches + indel bases). Breaks any tool reading NM:i: (samtools stats, Picard, MultiQC).
  3. XS tag never emitted despite --outSAMstrandField intronMotif (high) — scverse/rustar-aligner#30. Breaks StringTie, Cufflinks, rseqc's infer_experiment.py.
  4. GTF junctions not credited at align time (medium) — scverse/rustar-aligner#27. Number of splices: Annotated (sjdb) = 0 in every Log.final.out even with --sjdbGTFfile. ~50 % of splices dropped; ~70 % of CIGAR diffs are reads where STAR splices and rustar emits straight 101M. Root cause now localised: the SpliceJunctionDb is keyed in chr-local 1-based coordinates but consulted in genome-absolute coordinates at two independent call sites that disagree with each other (smoking gun: rustar's own SJ.out.tab reports 2 annotated rows while Log.final.out reports 0 — see docs/rustar_two_pass_and_determinism.md).
  5. More secondary alignments than STAR (medium, possibly intentional) — scverse/rustar-aligner#31. NH tail extends to 20 vs STAR's 6 on the same --outFilterMultimapNmax 20. Looks like the equivalent of STAR's --outFilterMultimapScoreRange isn't applied; filed as a clarifying question.
  6. Transcriptome BAM missing per-record RG:Z: (low) — scverse/rustar-aligner#32. @RG header is present, every record is bare.
  7. @PG header content-free + AS value divergence on ~2.4 % of identical-CIGAR records (low) - scverse/rustar-aligner#33. PG line is just ID:rustar-aligner (no PN/VN/CL). AS deltas span ±1-14 with mixed direction (modal rustar +1 over STAR); the negative tail may partly resolve when scverse/rustar-aligner#27 lands.
  8. BAM QUAL bytes are +33 above SAM-spec values (high) - scverse/rustar-aligner#34. The writer at src/io/sam.rs:633 passes raw FASTQ ASCII (Phred+33) into the BAM binary QUAL field where the spec mandates raw Phred. samtools stats average_quality reads as 68.3 vs STAR's 35.3 (+33 on every base across every sample); error_rate reads as 0 (compounding scverse/rustar-aligner#29). Surfaced via the MultiQC report comparison in docs/rustar_quant_and_multiqc.md.
  9. --chimSegmentMin > 0 + --twopassMode Basic aborts the run (medium) — scverse/rustar-aligner#35. Path-builder for chimeric output relies on <prefix>/ directory existing, but in two-pass mode the chim writer fires before the dir-creation step. Affects any pipeline that adds --chimSegmentMin via extra_star_align_args together with --twopassMode Basic (the STAR-fusion / arriba flag pattern). Workaround: --outFileNamePrefix dir/ with the parent pre-created.
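
Item 8 comes down to a single conversion step. The sketch below is illustrative Python, not rustar's Rust code; `fastq_qual_to_bam_bytes` is a hypothetical helper name and the quality string is made up. FASTQ stores each quality as the ASCII character for Phred+33, whereas the BAM binary QUAL field stores the raw Phred value, so skipping the subtraction inflates every base by exactly 33.

```python
PHRED_OFFSET = 33  # FASTQ (Sanger encoding) stores quality as chr(phred + 33)


def fastq_qual_to_bam_bytes(qual_ascii: str) -> bytes:
    """Convert a FASTQ quality string to raw Phred bytes for the BAM QUAL field."""
    return bytes(ord(c) - PHRED_OFFSET for c in qual_ascii)


def mean_quality(raw: bytes) -> float:
    """Average of the stored quality bytes, as a stats tool would read them."""
    return sum(raw) / len(raw)


qual = "IIIIFFFF"  # made-up example: 'I' is Phred 40, 'F' is Phred 37
correct = fastq_qual_to_bam_bytes(qual)
buggy = qual.encode("ascii")  # ASCII bytes written without removing the offset

print(mean_quality(correct))  # 38.5
print(mean_quality(buggy))    # 71.5
```

The 33-point gap between the two means has the same shape as the 68.3 vs 35.3 average_quality difference reported above.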

None of these block the test pipeline today (it runs green) but several would affect production pipelines as soon as they hit a tool that actually reads those tags or fields. Every issue has a self-contained dual-aligner shell MRE in the body; copy-paste and a maintainer can reproduce in ~2 min on any machine with Docker.

All upstream issues from this integration can be browsed via is:issue author:pinin4fjords on scverse/rustar-aligner or by searching the nf-core/rnaseq#1855 body tag.

Status

Out of draft once CI passes. The PR adds an opt-in flag, so existing users see no change in behaviour. Promoting rustar from experimental to recommended is a follow-up and depends on the upstream BAM fix landing and a full-size run.

Test plan

  • Module wiring compiles (nextflow config -profile test clean).
  • Subworkflow tests adjusted for the new positional take: arg in ALIGN_STAR and PREPARE_GENOME_INDICES.
  • STAR vs rustar back-to-back on nf-dev-rnaseq, -profile test,docker, comparison harness output captured above.
  • nf-test test tests/rustar_default.nf.test passes 4/4 against the committed snapshot.
  • CI green across the existing matrix.

🤖 Generated with Claude Code

pinin4fjords and others added 5 commits May 12, 2026 14:55
Add local RUSTAR_ALIGN and RUSTAR_GENOMEGENERATE modules using
ghcr.io/scverse/rustar-aligner. Both reuse STAR's CLI and on-disk index
format, so the dispatch in align_star and prepare_genome_indices just
gets one more conditional. The new --use_rustar_star toggle mirrors the
existing --use_sentieon_star / --use_parabricks_star pattern.

Tests, the cross-aligner comparison harness, and on-VM verification
land in follow-up commits on this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ess, docs [skip ci]

- tests/rustar_default.nf.test: four cases (with/without markdups, plus
  stub variants) mirroring tests/parabricks_default.nf.test. Snapshot
  will be generated on the VM.
- nf-test.config: SKIP_RUSTAR env var lets CI shards opt out cleanly
  while the rustar container is still in flux.
- bin/compare_aligner_runs.py: stdlib-only harness that compares two
  pipeline outdirs - per-sample % uniquely mapped, Salmon merged TPM
  / counts Pearson+Spearman per sample, and trace wall-time / peak-RSS
  per process. JSON + Markdown output. Pass criteria: TPM Pearson
  >= 0.999 and |delta % mapped| <= 0.5 pp.
- docs/usage.md + CHANGELOG.md: experimental rustar section after the
  Parabricks block; one-line changelog entry.
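
The pass criteria described for the harness can be sketched as follows. This is an assumed reconstruction from the commit message, not the contents of bin/compare_aligner_runs.py; the function names, argument layout and any sample values are hypothetical.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation using only the standard library."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant vector
    return sxy / math.sqrt(sxx * syy)


def passes_criteria(
    tpm_a: list[float],
    tpm_b: list[float],
    pct_mapped_a: float,
    pct_mapped_b: float,
    min_pearson: float = 0.999,
    max_delta_pp: float = 0.5,
) -> bool:
    """True when TPM Pearson >= 0.999 and |delta % mapped| <= 0.5 pp."""
    r = pearson(tpm_a, tpm_b)
    delta_pp = abs(pct_mapped_a - pct_mapped_b)
    return r >= min_pearson and delta_pp <= max_delta_pp
```

A NaN correlation compares false against the threshold, so a sample with constant TPMs fails the check rather than passing silently.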

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
While running the test profile back-to-back STAR vs rustar on the VM,
three rustar v0.1.0 deltas needed taming in the local modules:

- RUSTAR_GENOMEGENERATE: drop --limitGenomeGenerateRAM (unsupported by
  rustar at startup).
- RUSTAR_ALIGN: rustar treats a trailing-dot --outFileNamePrefix as a
  directory and writes bare-named files inside; flatten back to STAR-
  style prefixed filenames so the existing emit globs still match.
- RUSTAR_ALIGN: rustar only writes Log.final.out, not Log.out or
  Log.progress.out; mark those optional so the task doesn't fail.

Harness side: read_tsv_matrix now skips an optional gene_name column,
and trace parsing reads the modern Nextflow "name" column instead of
the legacy "process" one.

docs/rustar_differences.md captures every divergence observed so far
(module workarounds, output-file layout, mapping-rate and TPM/counts
correlations) so the PR has something concrete to review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation [skip ci]

Snapshot tests/rustar_default.nf.test.snap captured from the
nf-dev-rnaseq VM with -profile test,docker. Four cases (with /
without markdups, real + stub).

Investigation doc on the WT_REP2 TPM divergence (Pearson 0.985 vs
STAR's matched run): rustar v0.1.0's Aligned.toTranscriptome.out.bam
omits mate-pair fields (RNEXT/PNEXT/TLEN) and the proper-pair flag
on paired-end records, so Salmon falls back to its default
fragment-length prior. EffectiveLength shrinks disproportionately
for short transcripts, distorting TPM while leaving NumReads
near-identical. The hit reproduces on every paired-end sample;
WT_REP2 just had the most visible magnitude. Root cause located
at rustar src/lib.rs:762-768 and src/io/sam.rs:566-660 (no
template_length_mut / mate_alignment_start_mut calls). Suggested
upstream issue body included.

docs/rustar_differences.md updated to point at the investigation
and reframe the divergence as a paired-end issue (not a sample-
specific anomaly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Snapshot generated and verified on the nf-dev-rnaseq VM
(test,docker profile, 4/4 tests passing in 594s). Time to let
the full CI matrix have a go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link

Copy Markdown

github-actions Bot commented May 12, 2026

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 855456a

+| ✅ 215 tests passed       |+
#| ❔  19 tests were ignored |#
#| ❔   1 tests had warnings |#
!| ❗   7 tests had warnings |!
Details

❗ Test warnings:

  • pipeline_todos - TODO string in awsfulltest.yml: You can customise AWS full pipeline tests as required
  • pipeline_todos - TODO string in nextflow.config: Specify any additional parameters here
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in base.config: Check the defaults for all processes

❔ Tests ignored:

  • files_exist - File is ignored: conf/modules.config
  • files_exist - File is ignored: conf/containers_conda_lock_files_amd64.config
  • files_exist - File is ignored: conf/containers_conda_lock_files_arm64.config
  • files_exist - File is ignored: conf/containers_docker_amd64.config
  • files_exist - File is ignored: conf/containers_docker_arm64.config
  • files_exist - File is ignored: conf/containers_singularity_https_amd64.config
  • files_exist - File is ignored: conf/containers_singularity_https_arm64.config
  • files_exist - File is ignored: conf/containers_singularity_oras_amd64.config
  • files_exist - File is ignored: conf/containers_singularity_oras_arm64.config
  • nextflow_config - Config default ignored: params.ribo_database_manifest
  • nf_test_content - nf_test_content
  • files_unchanged - File ignored due to lint config: assets/email_template.html
  • files_unchanged - File ignored due to lint config: assets/email_template.txt
  • files_unchanged - File ignored due to lint config: assets/nf-core-rnaseq_logo_light.png
  • files_unchanged - File ignored due to lint config: docs/images/nf-core-rnaseq_logo_dark.png
  • files_unchanged - File ignored due to lint config: .gitignore or .prettierignore
  • actions_nf_test - actions_nf_test
  • modules_config - modules_config
  • container_configs - container_configs

❔ Tests fixed:

✅ Tests passed:

Run details

  • nf-core/tools version 4.0.2
  • Run at 2026-05-12 16:58:23

pinin4fjords and others added 2 commits May 12, 2026 16:02
The paired-end transcriptome-BAM mate-fields finding has been filed
upstream. Add the issue link to docs/rustar_differences.md and the
top-of-doc status line in docs/rustar_investigation_wt_rep2.md so
readers land on the upstream tracker rather than scanning the
ready-to-paste section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s [skip ci]

Adds docs/rustar_bam_comparison.md - a category-by-category STAR vs
rustar BAM comparison on the test profile (WT_REP2 + RAP1_UNINDUCED_REP1
fully characterised, the other three samples spot-checked).

Net-new findings on top of the already-filed scverse/rustar-aligner#22
(paired-end transcriptome BAM mate fields):

- NM -> nM tag rename with semantics change (high): rustar's nM only
  counts substitutions; STAR's NM is the SAM-spec edit distance. 2%
  of identical-CIGAR records disagree.
- XS strand tag never emitted (high): breaks StringTie / Cufflinks /
  infer_experiment.py despite --outSAMstrandField intronMotif.
- GTF junctions not seeded into pass 1 (medium): ~50% of splices
  dropped, Annotated (sjdb) = 0 in Log.final.out, manifests as
  per-read CIGAR collapse spliced -> 101M.
- ~17% more secondary alignments than STAR (medium): NH tail extends
  to 20 vs STAR's 7 on identical input, possibly intentional but
  undocumented.
- @PG header content-free (low): no PN/VN/CL.
- Transcriptome BAM missing per-record RG:Z: (low).
- AS tag drifts +/-2-5 on identical CIGAR (low), against rustar's
  README byte-equivalence promise.
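
The NM versus nM distinction in the first bullet can be illustrated with a short sketch. It assumes the definitions given in this PR (nM counts substitutions only, NM is substitutions plus inserted and deleted bases); the helper names and the example read are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_bases(cigar: str) -> int:
    """Total inserted plus deleted bases in a CIGAR string."""
    return sum(int(length) for length, op in CIGAR_OP.findall(cigar) if op in "ID")


def sam_spec_nm(cigar: str, substitutions: int) -> int:
    """NM as edit distance: substitutions plus indel bases."""
    return substitutions + indel_bases(cigar)


cigar = "50M2D48M1I2M"  # made-up 101-base read with a 2 bp deletion and a 1 bp insertion
substitutions = 1

print(substitutions)                      # substitutions-only count: 1
print(sam_spec_nm(cigar, substitutions))  # edit distance: 4
```

Under these definitions the two values only differ on reads whose CIGAR contains I or D operations, which is why a tool that expects NM:i: under-reports edit distance on indel-containing reads.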

rustar_differences.md is reorganised so the "Tracked upstream" section
now lists all eight upstream-bound issues with severity tags rather
than just the one filed bug. Helper scripts and VM-side artefacts are
referenced from the BAM-comparison doc, not committed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pinin4fjords and others added 3 commits May 12, 2026 17:54
conf/modules/prepare_genome.config's withName: selector for the
prokaryotic-specific --sjdbGTFfeatureExon CDS flag listed
STAR_GENOMEGENERATE and PARABRICKS_STARGENOMEGENERATE but not
RUSTAR_GENOMEGENERATE. So under --prokaryotic the flag was silently
dropped from rustar's index build and rustar built an index from
the GFF's exon features (zero rows in a CDS-only annotation),
leaving Aligned.toTranscriptome.out.bam header-only and crashing
SALMON_QUANT.

Adding RUSTAR_GENOMEGENERATE to the selector restores parity with
STAR: rustar produces 13 @SQ + 8 082 records, byte-equivalent to
STAR on the same inputs.

The earlier docs/rustar_mode_smoke_tests.md diagnosis claimed
rustar's transcriptome projection ignored --sjdbGTFfeatureExon; on
follow-up verification by another session that turned out to be
wrong - rustar honours the flag fine when it's plumbed through;
our pipeline wasn't plumbing it through. Doc reclassified from
"upstream BUG (high)" to "pipeline-integration gap, fixed in this
PR". Also tightened the publishDir selector so a future
--save_reference + rustar prokaryotic run publishes its index the
same way STAR does.

Eukaryotic test profile is unaffected: params.prokaryotic = false
means the args block resolves to '' for all three matching
processes, so the committed nf-test snapshot is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ip ci]

Add the docs produced by the parallel investigation agents:

- docs/rustar_two_pass_and_determinism.md - two-pass / sjdb root
  cause hunt (located the is_annotated() coord-space bug at
  src/align/stitch.rs:1306-1314, with the SJ.out.tab vs
  Log.final.out two-counter inconsistency as the smoking gun;
  filed as scverse/rustar-aligner#27) plus determinism check
  (deterministic except for record order; downstream-irrelevant
  because the pipeline name-sorts).

- docs/rustar_cli_compat.md - STAR-vs-rustar CLI flag matrix
  across 100 flags (53 honoured / 3 different-default / 40
  rejected / 4 advertised-but-broken). Key observation: rustar
  uses clap so there is no silent-ignore class at the CLI layer;
  the dangerous category is "advertised-but-broken". Surfaced
  scverse/rustar-aligner#35 (chim path-builder bug that crashes
  the run when --outFileNamePrefix doesn't end in '/').

- docs/rustar_quant_and_multiqc.md - per-transcript Salmon
  quant.sf check on the SE samples (clean: EffectiveLength matches
  STAR to 6 dp, no PE mate-fields analogue on SE) and a MultiQC
  module-by-module misread inventory (six user-visible misleading
  numbers under --use_rustar_star today; all projections of
  already-filed BAM bugs). Includes the average_quality = 68 vs
  35 finding that turned out to be scverse/rustar-aligner#34
  (BAM QUAL +33 offset) once the verification session diagnosed
  the root cause.
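
The coordinate-space mismatch described in the first bullet can be illustrated generically. The sketch below is not rustar's code: the offsets, junctions and function names are invented, and it assumes a genome-absolute position is simply the chromosome's offset plus its chr-local position.

```python
# Hypothetical layout: offset of each chromosome in the concatenated genome.
CHROM_OFFSETS = {"chrA": 0, "chrB": 10_000}

# Annotated junctions keyed by (chromosome, start, end) in chr-local coordinates.
SJDB = {("chrA", 1_000, 1_500), ("chrB", 2_000, 2_700)}


def to_absolute(chrom: str, local_pos: int) -> int:
    """Map a chr-local position to a genome-absolute one."""
    return CHROM_OFFSETS[chrom] + local_pos


def is_annotated_mismatched(chrom: str, abs_start: int, abs_end: int) -> bool:
    """Faulty pattern: absolute coordinates looked up against chr-local keys."""
    return (chrom, abs_start, abs_end) in SJDB


def is_annotated(chrom: str, abs_start: int, abs_end: int) -> bool:
    """Consistent pattern: convert back to chr-local before the lookup."""
    offset = CHROM_OFFSETS[chrom]
    return (chrom, abs_start - offset, abs_end - offset) in SJDB


start, end = to_absolute("chrB", 2_000), to_absolute("chrB", 2_700)
print(is_annotated_mismatched("chrB", start, end))  # False: junction looks unannotated
print(is_annotated("chrB", start, end))             # True
```

In this toy model a lookup in the wrong coordinate space misses every junction on a chromosome with a non-zero offset, which is the kind of silent miss that would leave an annotated-junction counter at zero.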

Also: docs/rustar_differences.md's "Tracked upstream" section
rewritten as a table cross-referencing each filed upstream issue
to the doc that produced the evidence, and a "Fixed in this PR
(was originally suspected upstream)" note for the prokaryotic
selector gap.

[skip ci] - eukaryotic-path verification CI is already in
flight on 855456a.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… [skip ci]

docs/rustar_noise_floor.md - frames the existing STAR-vs-rustar
gene_tpm Pearson range (0.985-0.9997) against the actual RNG noise
envelope. STAR seed-to-seed worst is 0.9999999994; rustar
seed-to-seed worst is 0.9999999996. STAR-vs-rustar deltas are
7-9 orders of magnitude outside the noise envelope, so every bug
already filed upstream is real signal, not RNG drift. STAR with
the same seed is alignment-bit-identical at the record-content
level; published .markdup.sorted.bam bytes differ only because of
MarkDuplicates' input-order sensitivity (per-sample dup count is
constant across STAR reruns). MarkDuplicates dup-bit agreement
across STAR vs rustar is 98.7-99.8 % per sample - propagation of
the upstream BAM divergence, not amplification.

docs/rustar_singularity_and_chim_workaround.md - verifies (a)
rustar runs IDENTICALLY through Singularity/Apptainer 1.5.0 via
nf-core's singularity_pull_docker_container path (0.00 pp mapped
delta, gene_tpm Pearson >= 0.99999999968, gene_counts 1.0) and
(b) the proposed `--outFileNamePrefix dir/` workaround for
scverse/rustar-aligner#35 works end-to-end at the rustar CLI but
is not compatible with the pipeline's STAR-style prefix
convention; agent recommends waiting for upstream #35 rather
than diverging the rustar module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>