feat(rustar): wire rustar-aligner as opt-in STAR drop-in #1855
Draft
pinin4fjords wants to merge 10 commits into
Add local RUSTAR_ALIGN and RUSTAR_GENOMEGENERATE modules using ghcr.io/scverse/rustar-aligner. Both reuse STAR's CLI and on-disk index format, so the dispatch in align_star and prepare_genome_indices just gets one more conditional. The new --use_rustar_star toggle mirrors the existing --use_sentieon_star / --use_parabricks_star pattern. Tests, the cross-aligner comparison harness, and on-VM verification land in follow-up commits on this branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
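The "one more conditional" for index generation can be sketched as below. This is a minimal Python illustration rather than the pipeline's Nextflow code: the function name `select_index_builder` is hypothetical, the rustar condition follows the PR summary (`use_rustar_star && fasta_provided && !star_index`), and applying the same FASTA / pre-built-index guard to plain STAR is an assumption.

```python
from typing import Optional


def select_index_builder(
    use_rustar_star: bool,
    fasta_provided: bool,
    star_index: Optional[str],
) -> Optional[str]:
    """Return the index-building process that would run, or None.

    Hypothetical helper: a pre-built STAR index is reused unchanged for
    either engine, because rustar reads STAR's on-disk index format.
    """
    if star_index:
        # A pre-built index was supplied, so nothing needs to be built.
        return None
    if not fasta_provided:
        # No FASTA means no index can be generated (assumed guard).
        return None
    # Same CLI, same index format: only the engine is swapped.
    return "RUSTAR_GENOMEGENERATE" if use_rustar_star else "STAR_GENOMEGENERATE"
```

For example, `select_index_builder(True, True, None)` selects the rustar builder, while passing any pre-built index path returns `None` regardless of the toggle.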
…ess, docs [skip ci]

- tests/rustar_default.nf.test: four cases (with/without markdups, plus stub variants) mirroring tests/parabricks_default.nf.test. Snapshot will be generated on the VM.
- nf-test.config: SKIP_RUSTAR env var lets CI shards opt out cleanly while the rustar container is still in flux.
- bin/compare_aligner_runs.py: stdlib-only harness that compares two pipeline outdirs: per-sample % uniquely mapped, Salmon merged TPM / counts Pearson+Spearman per sample, and trace wall-time / peak-RSS per process. JSON + Markdown output. Pass criteria: TPM Pearson >= 0.999 and |delta % mapped| <= 0.5 pp.
- docs/usage.md + CHANGELOG.md: experimental rustar section after the Parabricks block; one-line changelog entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
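The pass criteria stated for the harness can be illustrated with a short stdlib-only sketch. This is not the code in `bin/compare_aligner_runs.py`; the function names `pearson` and `passes` and the argument layout are assumptions made for illustration. Only the two thresholds come from the commit message.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient using only the standard library."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def passes(
    tpm_a: Sequence[float],
    tpm_b: Sequence[float],
    pct_mapped_a: float,
    pct_mapped_b: float,
    min_pearson: float = 0.999,   # threshold from the commit message
    max_delta_pp: float = 0.5,    # threshold from the commit message
) -> bool:
    """Apply both criteria: TPM Pearson and |delta % mapped| in percentage points."""
    correlated = pearson(tpm_a, tpm_b) >= min_pearson
    mapped_close = abs(pct_mapped_a - pct_mapped_b) <= max_delta_pp
    return correlated and mapped_close
```

Both conditions must hold for a sample to pass, so a near-perfect TPM correlation does not excuse a mapping-rate shift larger than 0.5 percentage points.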
While running the test profile back-to-back STAR vs rustar on the VM, three rustar v0.1.0 deltas needed taming in the local modules:

- RUSTAR_GENOMEGENERATE: drop --limitGenomeGenerateRAM (unsupported by rustar at startup).
- RUSTAR_ALIGN: rustar treats a trailing-dot --outFileNamePrefix as a directory and writes bare-named files inside; flatten back to STAR-style prefixed filenames so the existing emit globs still match.
- RUSTAR_ALIGN: rustar only writes Log.final.out, not Log.out or Log.progress.out; mark those optional so the task doesn't fail.

Harness side: read_tsv_matrix now skips an optional gene_name column, and trace parsing reads the modern Nextflow "name" column instead of the legacy "process" one.

docs/rustar_differences.md captures every divergence observed so far (module workarounds, output-file layout, mapping-rate and TPM/counts correlations) so the PR has something concrete to review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
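The flattening step described for RUSTAR_ALIGN can be sketched as follows. The module itself presumably does this in its shell script block; this Python version, including the helper name `flatten_prefix_dir`, is a hypothetical illustration of the same idea: files written inside a `SAMPLE.` directory are moved up one level and given the prefix back.

```python
from pathlib import Path


def flatten_prefix_dir(workdir: Path, prefix: str) -> list[Path]:
    """Rename workdir/<prefix>/<name> to workdir/<prefix><name>.

    Hypothetical helper: `prefix` is the trailing-dot value passed to
    --outFileNamePrefix (e.g. "SAMPLE."), which rustar v0.1.0 is reported
    to treat as a directory rather than as a filename prefix.
    """
    out_dir = workdir / prefix
    moved: list[Path] = []
    if not out_dir.is_dir():
        # Nothing to flatten: the outputs already use STAR-style names.
        return moved
    for child in sorted(out_dir.iterdir()):
        if child.is_file():
            target = workdir / f"{prefix}{child.name}"
            child.rename(target)
            moved.append(target)
    # Remove the directory only if it is now empty; subdirectories are left alone.
    if not any(out_dir.iterdir()):
        out_dir.rmdir()
    return moved
```

After the call, `SAMPLE./Log.final.out` becomes `SAMPLE.Log.final.out`, which is the name the existing emit globs expect.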
…ation [skip ci]

Snapshot tests/rustar_default.nf.test.snap captured from the nf-dev-rnaseq VM with -profile test,docker. Four cases (with / without markdups, real + stub).

Investigation doc on the WT_REP2 TPM divergence (Pearson 0.985 vs STAR's matched run): rustar v0.1.0's Aligned.toTranscriptome.out.bam omits mate-pair fields (RNEXT/PNEXT/TLEN) and the proper-pair flag on paired-end records, so Salmon falls back to its default fragment-length prior. EffectiveLength shrinks disproportionately for short transcripts, distorting TPM while leaving NumReads near-identical. The hit reproduces on every paired-end sample; WT_REP2 just had the most visible magnitude. Root cause located at rustar src/lib.rs:762-768 and src/io/sam.rs:566-660 (no template_length_mut / mate_alignment_start_mut calls). Suggested upstream issue body included.

docs/rustar_differences.md updated to point at the investigation and reframe the divergence as a paired-end issue (not a sample-specific anomaly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
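Why a wrong fragment-length distribution hits short transcripts hardest can be shown with a simplified model. The sketch below uses a common approximation (transcript length minus the mean of the fragment-length distribution truncated at the transcript length, plus one); it is not Salmon's actual implementation. The comparison mean of 180 is a made-up value for illustration, while mean 250 / sd 25 is the default prior quoted in the PR description.

```python
import math


def effective_length(length: int, mean: float, sd: float) -> float:
    """Approximate effective length under a discretised Gaussian fragment model.

    Simplified model (an assumption, not Salmon's code):
        eff = L - E[f | 1 <= f <= L] + 1
    """
    if length < 1:
        raise ValueError("transcript length must be at least 1")
    lengths = range(1, length + 1)
    weights = [math.exp(-0.5 * ((f - mean) / sd) ** 2) for f in lengths]
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed: the transcript is far shorter than any
        # plausible fragment, so fall back to the minimum of one position.
        return 1.0
    truncated_mean = sum(f * w for f, w in zip(lengths, weights)) / total
    return length - truncated_mean + 1.0


# Ratio of effective lengths: default prior (250) vs a hypothetical 180 bp mean.
short_ratio = effective_length(300, 250, 25) / effective_length(300, 180, 25)
long_ratio = effective_length(3000, 250, 25) / effective_length(3000, 180, 25)
```

Under this model the 300 bp transcript loses a much larger share of its effective length than the 3000 bp one when the prior mean is too high, which matches the direction of the distortion the investigation describes.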
Snapshot generated and verified on the nf-dev-rnaseq VM (test,docker profile, 4/4 tests passing in 594s). Time to let the full CI matrix have a go. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The paired-end transcriptome-BAM mate-fields finding has been filed upstream. Add the issue link to docs/rustar_differences.md and the top-of-doc status line in docs/rustar_investigation_wt_rep2.md so readers land on the upstream tracker rather than scanning the ready-to-paste section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s [skip ci]

Adds docs/rustar_bam_comparison.md - a category-by-category STAR vs rustar BAM comparison on the test profile (WT_REP2 + RAP1_UNINDUCED_REP1 fully characterised, the other three samples spot-checked). Net-new findings on top of the already-filed scverse/rustar-aligner#22 (paired-end transcriptome BAM mate fields):

- NM -> nM tag rename with semantics change (high): rustar's nM only counts substitutions; STAR's NM is the SAM-spec edit distance. 2% of identical-CIGAR records disagree.
- XS strand tag never emitted (high): breaks StringTie / Cufflinks / infer_experiment.py despite --outSAMstrandField intronMotif.
- GTF junctions not seeded into pass 1 (medium): ~50% of splices dropped, Annotated (sjdb) = 0 in Log.final.out, manifests as per-read CIGAR collapse spliced -> 101M.
- ~17% more secondary alignments than STAR (medium): NH tail extends to 20 vs STAR's 7 on identical input, possibly intentional but undocumented.
- @PG header content-free (low): no PN/VN/CL.
- Transcriptome BAM missing per-record RG:Z: (low).
- AS tag drifts +/-2-5 on identical CIGAR (low), against rustar's README byte-equivalence promise.

rustar_differences.md is reorganised so the "Tracked upstream" section now lists all eight upstream-bound issues with severity tags rather than just the one filed bug. Helper scripts and VM-side artefacts are referenced from the BAM-comparison doc, not committed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
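The semantic gap between the two tags can be made concrete with a small sketch. It assumes the mismatch (substitution) count is already known, for example from an `nM`-style tag, and derives the SAM-spec edit distance by adding the inserted and deleted bases from the CIGAR string. The function names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_bases(cigar: str) -> int:
    """Total number of inserted plus deleted bases in a CIGAR string."""
    return sum(int(n) for n, op in _CIGAR_OP.findall(cigar) if op in "ID")


def sam_spec_nm(cigar: str, mismatches: int) -> int:
    """Edit distance as described in the PR: mismatches + indel bases.

    Introns (N) and clips (S/H) do not contribute.
    """
    return mismatches + indel_bases(cigar)
```

For a read with CIGAR `50M2I49M` and one substitution, a substitutions-only tag reports 1 while the edit distance is 3, so the two values only agree on reads without indels.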
conf/modules/prepare_genome.config's withName: selector for the prokaryotic-specific --sjdbGTFfeatureExon CDS flag listed STAR_GENOMEGENERATE and PARABRICKS_STARGENOMEGENERATE but not RUSTAR_GENOMEGENERATE. So under --prokaryotic the flag was silently dropped from rustar's index build and rustar built an index from the GFF's exon features (zero rows in a CDS-only annotation), leaving Aligned.toTranscriptome.out.bam header-only and crashing SALMON_QUANT.

Adding RUSTAR_GENOMEGENERATE to the selector restores parity with STAR: rustar produces 13 @SQ + 8,082 records, byte-equivalent to STAR on the same inputs.

The earlier docs/rustar_mode_smoke_tests.md diagnosis claimed rustar's transcriptome projection ignored --sjdbGTFfeatureExon; follow-up verification by another session showed that to be wrong. rustar honours the flag when it is plumbed through; our pipeline wasn't plumbing it through. Doc reclassified from "upstream BUG (high)" to "pipeline-integration gap, fixed in this PR".

Also tightened the publishDir selector so a future --save_reference + rustar prokaryotic run publishes its index the same way STAR does.

Eukaryotic test profile is unaffected: params.prokaryotic = false means the args block resolves to '' for all three matching processes, so the committed nf-test snapshot is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
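The selector gap is easy to reproduce in miniature. The sketch below models a `withName:` selector as a regular expression matched against the whole process name, which is an assumption about the matching semantics; the selector strings and the helper `ext_args` are simplified, hypothetical stand-ins for the real config.

```python
import re
from typing import Optional

# Simplified stand-ins for the selector before and after the fix.
OLD_SELECTOR = r"STAR_GENOMEGENERATE|PARABRICKS_STARGENOMEGENERATE"
NEW_SELECTOR = OLD_SELECTOR + r"|RUSTAR_GENOMEGENERATE"


def ext_args(process_name: str, selector: str, prokaryotic: bool) -> Optional[str]:
    """Return the args the config block would set, or None if it never fires."""
    # Full-name matching matters here: "RUSTAR_GENOMEGENERATE" contains
    # "STAR_GENOMEGENERATE" as a substring, so a substring search would
    # have hidden the bug.
    if re.fullmatch(selector, process_name) is None:
        return None
    return "--sjdbGTFfeatureExon CDS" if prokaryotic else ""
```

With the old selector, `ext_args("RUSTAR_GENOMEGENERATE", OLD_SELECTOR, True)` returns `None`, meaning the flag is dropped without any error. With the new selector it returns the flag under `--prokaryotic` and an empty string otherwise, which is why the eukaryotic snapshot does not change.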
…ip ci]

Add the docs produced by the parallel investigation agents:

- docs/rustar_two_pass_and_determinism.md - two-pass / sjdb root cause hunt (located the is_annotated() coord-space bug at src/align/stitch.rs:1306-1314, with the SJ.out.tab vs Log.final.out two-counter inconsistency as the smoking gun; filed as scverse/rustar-aligner#27) plus determinism check (deterministic except for record order; downstream-irrelevant because the pipeline name-sorts).
- docs/rustar_cli_compat.md - STAR-vs-rustar CLI flag matrix across 100 flags (53 honoured / 3 different-default / 40 rejected / 4 advertised-but-broken). Key observation: rustar uses clap so there is no silent-ignore class at the CLI layer; the dangerous category is "advertised-but-broken". Surfaced scverse/rustar-aligner#35 (chim path-builder bug that crashes the run when --outFileNamePrefix doesn't end in '/').
- docs/rustar_quant_and_multiqc.md - per-transcript Salmon quant.sf check on the SE samples (clean: EffectiveLength matches STAR to 6 dp, no PE mate-fields analogue on SE) and a MultiQC module-by-module misread inventory (six user-visible misleading numbers under --use_rustar_star today; all projections of already-filed BAM bugs). Includes the average_quality = 68 vs 35 finding that turned out to be scverse/rustar-aligner#34 (BAM QUAL +33 offset) once the verification session diagnosed the root cause.

Also: docs/rustar_differences.md's "Tracked upstream" section rewritten as a table cross-referencing each filed upstream issue to the doc that produced the evidence, and a "Fixed in this PR (was originally suspected upstream)" note for the prokaryotic selector gap.

[skip ci] - eukaryotic-path verification CI is already in flight on 855456a.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
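The class of bug described for `is_annotated()` (a table keyed in one coordinate space and queried in another) can be illustrated with a toy example. This is not rustar's code: the chromosome layout, the junction, and all function names below are hypothetical, and the only point is that a lookup without coordinate conversion misses entries that are present.

```python
from bisect import bisect_right

# Hypothetical genome layout: (name, absolute 0-based offset of first base).
CHROMS = [("chr1", 0), ("chr2", 1000)]
_STARTS = [start for _, start in CHROMS]


def absolute_to_local(abs_pos: int) -> tuple[str, int]:
    """Convert a genome-absolute 0-based position to (chrom, 1-based local)."""
    if abs_pos < 0:
        raise ValueError("absolute position must be non-negative")
    idx = bisect_right(_STARTS, abs_pos) - 1
    name, start = CHROMS[idx]
    return name, abs_pos - start + 1


def is_annotated_buggy(db: set, chrom: str, abs_start: int, abs_end: int) -> bool:
    """Queries a chr-local table with absolute coordinates: always misses here."""
    return (chrom, abs_start, abs_end) in db


def is_annotated_fixed(db: set, abs_start: int, abs_end: int) -> bool:
    """Converts to the table's coordinate space before the lookup."""
    chrom_s, local_start = absolute_to_local(abs_start)
    chrom_e, local_end = absolute_to_local(abs_end)
    return chrom_s == chrom_e and (chrom_s, local_start, local_end) in db


# Junction database keyed in chr-local 1-based coordinates.
junction_db = {("chr2", 101, 200)}
```

The junction `chr2:101-200` sits at absolute positions 1100 to 1199 in this layout, so the unconverted lookup reports it as unannotated while the converted one finds it. A counter fed by the first kind of lookup would read 0 even though the junctions exist.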
… [skip ci]

docs/rustar_noise_floor.md - frames the existing STAR-vs-rustar gene_tpm Pearson range (0.985-0.9997) against the actual RNG noise envelope. STAR seed-to-seed worst is 0.9999999994; rustar seed-to-seed worst is 0.9999999996. STAR-vs-rustar deltas are 7-9 orders of magnitude outside the noise envelope, so every bug already filed upstream is real signal, not RNG drift. STAR with the same seed is alignment-bit-identical at the record-content level; published .markdup.sorted.bam bytes differ only because of MarkDuplicates' input-order sensitivity (per-sample dup count is constant across STAR reruns). MarkDuplicates dup-bit agreement across STAR vs rustar is 98.7-99.8 % per sample - propagation of the upstream BAM divergence, not amplification.

docs/rustar_singularity_and_chim_workaround.md - verifies (a) rustar runs identically through Singularity/Apptainer 1.5.0 via nf-core's singularity_pull_docker_container path (0.00 pp mapped delta, gene_tpm Pearson >= 0.99999999968, gene_counts 1.0) and (b) the proposed `--outFileNamePrefix dir/` workaround for scverse/rustar-aligner#35 works end-to-end at the rustar CLI but is not compatible with the pipeline's STAR-style prefix convention; agent recommends waiting for upstream #35 rather than diverging the rustar module.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Add an opt-in `--use_rustar_star` toggle that swaps STAR alignment and index generation for `rustar-aligner`, a Rust port of STAR with the same `--camelCase` CLI and the same on-disk index format. Wiring follows the existing `--use_sentieon_star` / `--use_parabricks_star` pattern - the user-facing aligner choice stays `star_salmon` / `star_rsem`, only the engine swaps.

New / changed:

- `modules/local/rustar_align/{align,genomegenerate}/main.nf` - thin clones of the upstream STAR modules, container swapped to `ghcr.io/scverse/rustar-aligner:dev`.
- `subworkflows/local/align_star/main.nf` - adds a `RUSTAR_ALIGN` branch alongside Sentieon and Parabricks.
- `subworkflows/local/prepare_genome_indices/main.nf` - builds the index with rustar when `use_rustar_star && fasta_provided && !star_index`. Pre-built STAR indices are reused unchanged because rustar reads the STAR format.
- `conf/modules/align_star.config` - the existing `withName:` selector now also matches `RUSTAR_ALIGN`, so the same args block fires.
- `tests/rustar_default.nf.test` (+ snapshot) - four cases mirroring `parabricks_default.nf.test`, no `tag "gpu"`.
- `nf-test.config` - `SKIP_RUSTAR` env-var gate so CI shards can opt out if the image isn't pullable.
- `bin/compare_aligner_runs.py` - stdlib-only harness for back-to-back aligner runs (% mapped delta, Salmon TPM / NumReads correlations, trace timings).
- `docs/usage.md` - experimental section after the Parabricks block.
- `docs/rustar_differences.md` - rolling capture of every divergence observed during verification.
- `docs/rustar_investigation_wt_rep2.md` - deep-dive on the headline TPM divergence, now filed upstream as scverse/rustar-aligner#22.
- `docs/rustar_bam_comparison.md` - categorical BAM-level comparison surfacing further rustar v0.1.0 issues, all now filed upstream.
- `docs/rustar_quant_and_multiqc.md` - per-transcript Salmon `quant.sf` and per-module MultiQC comparison; surfaced the BAM QUAL-offset bug as the one new file-able item beyond what the BAM-level catalog already found.
- `docs/rustar_two_pass_and_determinism.md` - root-cause analysis of the `Annotated (sjdb) = 0` symptom (coordinate-space mismatch in `SpliceJunctionDb` lookups; full diagnosis folded into scverse/rustar-aligner#27) and a determinism check (same `--runRNGseed 0` reruns are content-identical, only record-emission order varies; not a bug).
- `docs/rustar_cli_compat.md` - CLI flag compatibility matrix (100 STAR flags probed: 53 accepted-honoured, 40 rejected, 7 accepted-broken). Surfaced one new file-able bug (scverse/rustar-aligner#35); the others map onto already-filed issues.

VM verification (test profile, docker)

(Comparison table; recoverable rows: `*_ALIGN` median (per task), `*_ALIGN` peak RSS (per task).)

`nf-test test tests/rustar_default.nf.test --profile +test,docker`: 4 / 4 passing in 594 s against the committed snapshot.

Known rustar v0.1.0 divergences
Two categories: things we already worked around in this PR (so the test pipeline runs green), and things we observed but didn't mask (so they're visible to anyone enabling `--use_rustar_star`). Every item below is filed upstream with a paired STAR + rustar MRE; the full rolling list is at `docs/rustar_differences.md` and the BAM-level catalog at `docs/rustar_bam_comparison.md`.

Module-level workarounds in this PR

- `--limitGenomeGenerateRAM` rejected at startup - scverse/rustar-aligner#25. Dropped from `RUSTAR_GENOMEGENERATE`.
- `--outFileNamePrefix SAMPLE.` treated as a directory, with bare-named outputs inside it - scverse/rustar-aligner#26. STAR uses it as a filename prefix. The module flattens the directory back to STAR-style filenames so downstream emit globs still match.
- `Log.out` / `Log.progress.out` not emitted; `SJ.pass1.out.tab` at top level instead of inside `<prefix>_STARpass1/` - scverse/rustar-aligner#28. Marked optional in the module.

Open upstream bugs surfaced during VM verification
- `Aligned.toTranscriptome.out.bam` missing `RNEXT` / `PNEXT` / `TLEN` and the proper-pair flag (high) - scverse/rustar-aligner#22. Salmon falls back to its default fragment-length prior (mean=250, sd=25); distorts paired-end TPMs (gene_tpm Pearson 0.985 on WT_REP2). Full analysis in `docs/rustar_investigation_wt_rep2.md`.
- `NM` tag silently swapped to `nM` (high) - scverse/rustar-aligner#29. `--outSAMattributes NM` produces `nM:i:` (substitutions only) instead of `NM:i:` (SAM-spec edit distance = mismatches + indel bases). Breaks any tool reading `NM:i:` (samtools stats, Picard, MultiQC).
- `XS` tag never emitted despite `--outSAMstrandField intronMotif` (high) - scverse/rustar-aligner#30. Breaks StringTie, Cufflinks, rseqc's `infer_experiment.py`.
- `Number of splices: Annotated (sjdb) = 0` in every `Log.final.out` even with `--sjdbGTFfile`. ~50 % of splices dropped; ~70 % of CIGAR diffs are reads where STAR splices and rustar emits straight `101M`. Root cause now localised: the `SpliceJunctionDb` is keyed in chr-local 1-based coordinates but consulted in genome-absolute coordinates at two independent call sites that disagree with each other (smoking gun: rustar's own `SJ.out.tab` reports 2 annotated rows while `Log.final.out` reports 0; see `docs/rustar_two_pass_and_determinism.md`).
- `--outFilterMultimapNmax 20`. Looks like the equivalent of STAR's `--outFilterMultimapScoreRange` isn't applied; filed as a clarifying question.
- `RG:Z:` (low) - scverse/rustar-aligner#32. `@RG` header is present, every record is bare.
- `@PG` header content-free + `AS` value divergence on ~2.4 % of identical-CIGAR records (low) - scverse/rustar-aligner#33. PG line is just `ID:rustar-aligner` (no `PN` / `VN` / `CL`). AS deltas span ±1-14 with mixed direction (modal rustar +1 over STAR); the negative tail may partly resolve when #27 lands.
- `QUAL` bytes are +33 above SAM-spec values (high) - scverse/rustar-aligner#34. The writer at `src/io/sam.rs:633` passes raw FASTQ ASCII (Phred+33) into the BAM binary QUAL field where the spec mandates raw Phred. `samtools stats` `average_quality` reads as 68.3 vs STAR's 35.3 (+33 on every base across every sample); `error_rate` reads as 0 (compounding #29). Surfaced via the MultiQC report comparison in `docs/rustar_quant_and_multiqc.md`.
- `--chimSegmentMin > 0` + `--twopassMode Basic` aborts the run (medium) - scverse/rustar-aligner#35. Path-builder for chimeric output relies on the `<prefix>/` directory existing, but in two-pass mode the chim writer fires before the dir-creation step. Affects any pipeline that adds `--chimSegmentMin` via `extra_star_align_args` together with `--twopassMode Basic` (the STAR-fusion / arriba flag pattern). Workaround: `--outFileNamePrefix dir/` with the parent pre-created.

None of these block the test pipeline today (it runs green), but several would affect production pipelines as soon as they hit a tool that actually reads those tags or fields. Every issue has a self-contained dual-aligner shell MRE in the body; a maintainer can copy-paste it and reproduce in ~2 min on any machine with Docker.
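The QUAL offset reported as scverse/rustar-aligner#34 comes down to one subtraction. The sketch below is a minimal Python illustration, not rustar's Rust writer: `fastq_to_phred` is a hypothetical helper that converts FASTQ quality characters (Phred+33 ASCII) to the raw Phred values a BAM record stores, and the "buggy" path simply skips the conversion.

```python
def fastq_to_phred(qual_ascii: str, offset: int = 33) -> list[int]:
    """Convert FASTQ quality characters to raw Phred scores."""
    scores = []
    for ch in qual_ascii:
        q = ord(ch) - offset
        if q < 0:
            raise ValueError(f"character {ch!r} is below the Phred+{offset} range")
        scores.append(q)
    return scores


def mean(values: list[int]) -> float:
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


qual = "II5"                                  # 'I' = Phred 40, '5' = Phred 20
correct = fastq_to_phred(qual)                # raw Phred, as BAM expects
uncorrected = [ord(ch) for ch in qual]        # raw ASCII written unchanged
shift = mean(uncorrected) - mean(correct)     # constant offset on the average
```

Because the offset is added to every base, the average quality shifts by exactly 33, which is the same gap as the reported 68.3 vs 35.3.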
All upstream issues from this integration can be browsed via `is:issue author:pinin4fjords` on scverse/rustar-aligner or by searching the `nf-core/rnaseq#1855` body tag.

Status
Out of draft once CI passes. The PR adds an opt-in flag, so existing users see no change in behaviour. Promoting rustar from experimental to recommended is a follow-up and depends on the upstream BAM fix landing and a full-size run.
Test plan
- `nextflow config -profile test` clean).
- `take:` arg in `ALIGN_STAR` and `PREPARE_GENOME_INDICES`.
- `nf-dev-rnaseq`, `-profile test,docker`, comparison harness output captured above.
- `nf-test test tests/rustar_default.nf.test` passes 4/4 against the committed snapshot.

🤖 Generated with Claude Code