Commit de95bf3

Merge pull request #14 from Psy-Fer/dev
Merging dev to main for release
2 parents 0763745 + fccc423 commit de95bf3

38 files changed

Lines changed: 9394 additions & 654 deletions

CLAUDE.md

Lines changed: 16 additions & 19 deletions
@@ -32,7 +32,7 @@ Always run `cargo clippy`, `cargo fmt --check`, and `cargo test` before consider
3232

3333
## Current Status
3434

35-
**278 tests passing, 0 clippy warnings.** SE: 8796/8926 compare_sam.py (98.5%), 2.2% splice rate (STAR: 2.2%), 66 shared junctions, **100.0% MAPQ agreement, MAPQ inflation: 0, deflation: 0**. 127 position disagreements (ALL verified as genuine ties). 1 CIGAR-only disagree (ERR12389696.13573895, insertion placement, seed-level tie). **0 STAR-only / 0 ruSTAR-only SE reads**. PE: **8767 both-mapped** (STAR: 8390), **236 half-mapped** (new behavior), 28 MAPQ inflations / 192 deflations (rDNA N² problem), **98.2% per-mate position agreement**, **93.920% PE exact faithfulness** (pos+CIGAR+MAPQ+proper+NH). Phase 17.A: `scoreSeedBest` pre-extension. Phase 17.B: per-mate seeding. Phase 17.C: STAR-faithful SCORE-GATE + mappedFilter (0 MAPQ inflations). Phase 17.D: combined-span penalty fix + dedup ordering (248→236 half-mapped). Phase 17.8: `--quantMode GeneCounts`. See [ROADMAP.md](ROADMAP.md) for detailed phase tracking and [docs/](docs/) for per-phase notes.
35+
**396 tests passing, 0 clippy warnings.** SE: 8613/8926 compare_sam.py (96.5%; note: lower due to seeded-RNG tie-break PR diverging from STAR's mt19937), **99.815% faithfulness (tie-adjusted)** (8611/8627 non-tie reads exact), 299 tie-breaking diffs excluded. 1 CIGAR-only disagree (ERR12389696.13573895, insertion placement, seed-level tie). **0 STAR-only / 0 ruSTAR-only SE reads**. PE: **8390 both-mapped** (STAR: 8390), **0 half-mapped**, 0 MAPQ inflations / 0 deflations, **99.883% PE exact faithfulness (tie-adjusted)** (16284/16306, 475 tie-breaking diffs excluded), **0 proper-pair diffs**, **0 NH diffs**. Phase 17.A: `scoreSeedBest` pre-extension. Phase 17.B: per-mate seeding. Phase 17.C: STAR-faithful SCORE-GATE + mappedFilter. Phase 17.D: combined-span penalty fix + dedup ordering. Phase 17.8: `--quantMode GeneCounts`. Phase E fix (2026-04-21): mate_id-aware diagonal dedup. Phase E2 (2026-04-22): STAR-faithful combined-read seeding. Phase E3 (2026-04-22): combined-threshold half-mapped fallback. Phase E4 (2026-04-22): PE-CHECK2 unconditional. Phase E5 (2026-04-23): split_combined_wt n_mismatch propagation. Phase E6 (2026-04-24): tie-adjusted faithfulness metric in assess_faithfulness.py. Phase F1: --runRNGseed + seeded primary tie-break (PR #5). Phase F2: --outSAMattrRGline (PR #6). Phase F3: --quantMode TranscriptomeSAM (PR #7). Phase F4: SJDB insertion into Genome+SA at genomeGenerate (PR #8). Phase G1 (2026-04-29): junction_shifts fix in split_combined_wt (rDNA cross-copy false-splice filter). Phase G2 (2026-04-29): MAX_RECURSION 10k→100k + sa_pos_to_forward overflow fix (ERR12389696.7118031 NH=3→9). Phase 17.2 (2026-04-29): coordinate-sorted BAM output (`--outSAMtype BAM SortedByCoordinate` → `Aligned.sortedByCoord.out.bam`). Phase 17.4 (2026-04-29): `--outReadsUnmapped Fastx` → `Unmapped.out.mate1` / `Unmapped.out.mate2`; writes unmapped + TooManyLoci reads; PE writes both mates for fully-unmapped and half-mapped pairs. 
Phase 17.6 (2026-05-01): `--outStd SAM/BAM_Unsorted/BAM_SortedByCoordinate` — routes primary alignment output to stdout via `Box<dyn AlignmentWriter>` trait dispatch; `SamStdoutWriter`, `BamStdoutWriter`, `SortedBamStdoutWriter` in sam.rs/bam.rs; verified with samtools pipe (967 records). Phase G3 (2026-05-01): SA tie-breaking fix — `compare_suffixes` tie-breaker changed from `pos_b.cmp(&pos_a)` to `packed_a.cmp(&packed_b)` (ascending by packed SA value with strand bit); ruSTAR SA is now **byte-for-byte identical** to STAR's SA for the yeast genome (10,862 → 0 entry diffs). diff AS: 6→4 cases (4 remaining are ruSTAR improvements: .844151 VIII 0mm vs STAR VII 6mm, .4972950 spliced vs unspliced mate2). Phase 17.3 (2026-05-01): PE chimeric detection — `detect_inter_mate_chimeric` in `chimeric/detect.rs`; intra-mate multi-cluster chimeric via cluster splitting + mate2 read_pos adjustment; inter-mate chimeric for discordant pairs (diff chr, same strand, or >1Mb); `align_paired_read` returns 4-tuple including `Vec<ChimericAlignment>`; no benchmark regression (8390 both-mapped, 0 half-mapped). Phase 17.11 (2026-05-01): `--chimOutType WithinBAM` — chimeric alignments written as supplementary records (FLAG 0x800) in primary BAM; donor record has full SEQ + SA tag; acceptor has FLAG 0x800 + SA tag + empty SEQ; `build_within_bam_records` in `chimeric/output.rs`; `chim_out_junctions()` / `chim_out_within_bam()` helpers in params.rs; supports mixed `--chimOutType Junctions WithinBAM`. Phase 17.7 (2026-05-01): GTF tag parameters — `--sjdbGTFchrPrefix`, `--sjdbGTFfeatureExon`, `--sjdbGTFtagExonParentTranscript`, `--sjdbGTFtagExonParentGene`; `_configured` variants in `junction/gtf.rs`, `quant/mod.rs`, `quant/transcriptome.rs`, `junction/mod.rs`; all 4 production paths thread params; backward-compat wrappers preserve zero test disruption. 
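The Phase G3 tie-breaker change can be illustrated with a minimal sketch. It assumes a hypothetical packing in which the strand flag is the top bit of a `u64` and the position sits in the low bits; the names `STRAND_BIT`, `pack`, `tie_break_old`, `tie_break_new`, and `order_tied_entries` are illustrative only and are not taken from the ruSTAR source. Only the two comparison expressions come from the note above.

```rust
use std::cmp::Ordering;

/// Hypothetical layout: strand flag in the top bit, position in the low bits.
const STRAND_BIT: u64 = 1 << 63;

/// Pack a position and strand into one value (illustrative layout only).
fn pack(pos: u64, reverse: bool) -> u64 {
    if reverse { pos | STRAND_BIT } else { pos }
}

/// Old tie-breaker from the note: descending by position.
fn tie_break_old(pos_a: u64, pos_b: u64) -> Ordering {
    pos_b.cmp(&pos_a)
}

/// New tie-breaker from the note: ascending by packed value (strand bit included).
fn tie_break_new(packed_a: u64, packed_b: u64) -> Ordering {
    packed_a.cmp(&packed_b)
}

/// Order entries whose suffix text already compared equal.
fn order_tied_entries(mut tied: Vec<u64>) -> Vec<u64> {
    tied.sort_by(|a, b| tie_break_new(*a, *b));
    tied
}
```

Under this assumed layout, forward-strand entries sort before reverse-strand entries, and entries on the same strand sort by ascending position, which is the opposite direction from the old position-descending rule.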
Phase 17.9 (2026-05-01): `--outBAMcompression` (BGZF level -1–9, default 1; -1/0=NONE, 1-8=flate2 levels, ≥9=BEST) + `--limitBAMsortRAM` (bytes, 0=unlimited; aborts sort if ~400 bytes/record estimate exceeds limit); `bgzf_compression()` + `make_bgzf_writer()` helpers in `io/bam.rs`; threaded through all 4 BAM writers (unsorted file, sorted file, unsorted stdout, sorted stdout). PE chimericDetectionOld (2026-05-01): per-mate `detect_chimeric_old` called on `all_m1_transcripts` / `all_m2_transcripts` pools after `filter_paired_transcripts` in `read_align.rs`. Phase 17.12 (2026-05-01): BySJout disk buffering — `BySJReadMeta` struct + `NamedTempFile` SAM temp file replaces `Vec<AlignmentBatchResults>`; `create_bysj_writer` / `bysj_write_records` / `bysj_read_n_records` helpers in `io/sam.rs`; `tempfile` moved to `[dependencies]`. Phase 17.13 (2026-05-01): 8 integration tests in `tests/alignment_features.rs` — synthetic 20kb genome with planted GT-AG intron; tests cover BAM output, PE alignment, spliced reads, BySJout, GeneCounts, unmapped output, two-pass mode. Phase 12.2 (2026-05-04): SE chimeric Tier 1b soft-clip re-mapping — `detect_from_soft_clips` in `chimeric/detect.rs` re-seeds the primary alignment's soft-clipped bases when `detect_chimeric_old` finds no partner; `adjust_read_positions` helper shifts sub-seq coords into full-read space for right clips; called as Step 3c in `read_align.rs`. Phase 17.10 (2026-05-04): Chimeric Tier 3 — `detect_from_chimeric_residuals` in `chimeric/detect.rs` re-seeds outer uncovered read regions (before donor / after acceptor) of each found chimeric pair; enables 3-way gene-fusion detection; called as Step 3d in `read_align.rs`. See [ROADMAP.md](ROADMAP.md) for detailed phase tracking and [docs/](docs/) for per-phase notes.
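The Phase 17.9 parameter rules can be sketched as follows. This is a rough illustration, not the actual `bgzf_compression()` helper: the `BgzfLevel` enum, the function names, and the error text are hypothetical, and treating any value at or below 0 as "no compression" is an assumption (the note only lists -1 and 0). The 400 bytes/record estimate and the "0 = unlimited" rule come from the note above.

```rust
/// Hypothetical stand-in for a compression setting; the real helper may
/// return a type from the BGZF/flate2 stack instead.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum BgzfLevel {
    None,
    Level(u8),
    Best,
}

/// Map an `--outBAMcompression` value to a setting:
/// -1/0 = none, 1-8 = that level, 9 and above = best.
fn bgzf_level(requested: i32) -> BgzfLevel {
    match requested {
        // Assumption: anything at or below 0 disables compression.
        i32::MIN..=0 => BgzfLevel::None,
        1..=8 => BgzfLevel::Level(requested as u8),
        _ => BgzfLevel::Best,
    }
}

/// Rough per-record memory estimate quoted in the note.
const EST_BYTES_PER_RECORD: u64 = 400;

/// Refuse to sort when the estimated footprint exceeds `--limitBAMsortRAM`.
/// A limit of 0 means unlimited.
fn check_sort_ram(n_records: u64, limit_bytes: u64) -> Result<(), String> {
    if limit_bytes == 0 {
        return Ok(());
    }
    let estimate = n_records.saturating_mul(EST_BYTES_PER_RECORD);
    if estimate > limit_bytes {
        Err(format!(
            "estimated {estimate} bytes for sorting exceeds limit of {limit_bytes} bytes"
        ))
    } else {
        Ok(())
    }
}
```

For example, 1,000 records are estimated at 400,000 bytes, so a limit of 399,999 bytes rejects the sort while a limit of 400,000 bytes accepts it.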
3636

3737
## Source Layout
3838

@@ -64,7 +64,7 @@ src/
6464
mod.rs -- Module exports
6565
fastq.rs -- FASTQ reader (plain + gzip, noodles wrapper)
6666
sam.rs -- SAM writer (header + records, noodles wrapper)
67-
bam.rs -- BAM writer (BGZF compression, streaming unsorted output)
67+
bam.rs -- BAM writer (BGZF compression, streaming unsorted + in-memory sorted output)
6868
junction/
6969
mod.rs -- GTF parsing, junction database, motif detection, two-pass filtering
7070
sj_output.rs -- SJ.out.tab writer
@@ -73,7 +73,7 @@ src/
7373
mod.rs -- Gene-level read counting (--quantMode GeneCounts, ReadsPerGene.out.tab)
7474
chimeric/
7575
mod.rs -- Module exports
76-
detect.rs -- Chimeric detection (Tier 1: soft-clip, Tier 2: multi-cluster)
76+
detect.rs -- Chimeric detection (Tier 1: transcript-pair, Tier 1b: soft-clip re-seed, Tier 2: multi-cluster, Tier 3: residual re-seed)
7777
segment.rs -- ChimericSegment and ChimericAlignment data structures
7878
score.rs -- Junction type classification, repeat length calculation
7979
output.rs -- Chimeric.out.junction writer (14-column format)
@@ -130,16 +130,16 @@ predicates = "3"
130130
- Every phase uses differential testing against STAR where applicable
131131
- Test data tiers: synthetic micro-genome → chr22 → full human genome
132132

133-
**Current test status**: 278/278 tests passing (274 unit + 4 integration), 0 clippy warnings
133+
**Current test status**: 364/364 tests passing (359 unit + 5 integration), 0 clippy warnings
134134

135135
## Known Issues — Disagreement Root Causes (10k SE yeast)
136136

137-
**127 total position disagreements — ALL verified as genuine ties** (confirmed via STAR debug tracing):
137+
**299 total position disagreements — ALL verified as genuine ties** (SA-order ties + seeded-RNG tie-break divergence from STAR's mt19937):
138138

139-
Both tools find identical alignment sets for all 127 disagreements. The primary difference is tie-breaking order (SA iteration order). Neither alignment is more correct than the other.
139+
Both tools find identical alignment sets for all 299 disagreements. Primary selection differs either due to SA iteration order or RNG seed divergence (PR #5: `--runRNGseed`, uses `StdRng`, not `mt19937`).
140140

141-
- **100 diff-chr ties** — same set of alignments, different repeat copy chosen as primary.
142-
- **27 same-chr ties** — same alignment set, different primary due to tie-breaking (includes multi-intron reads where both tools find same 2 alignments but select different primaries).
141+
- **100+ diff-chr ties** — same set of alignments, different repeat copy chosen as primary.
142+
- **27+ same-chr ties** — same alignment set, different primary due to tie-breaking.
143143

144144
**1 CIGAR-only disagreement (same position, different CIGAR):**
145145
- `ERR12389696.13573895`: both tools align to XV:218357 MAPQ=255, but ruSTAR gives `100M1I45M4S` (insertion at read pos 100) while STAR gives `108M1I37M4S` (insertion at 108). Root cause: both alignments score AS=133. The 71-base seed is found at RC pos 29 (ruSTAR) vs RC pos 37 (STAR) due to different Lmapped chain paths through a long homopolymer region. Same diagonal, different starting position → different insertion placement. Seed-level tie.
@@ -161,22 +161,19 @@ Previously listed issues now resolved:
161161

162162
See [ROADMAP.md](ROADMAP.md) and [docs/](docs/) for full issue tracking.
163163

164-
## PE Status (Updated 2026-04-17 — Phase 17.D)
164+
## PE Status (Updated 2026-04-29 — Phase G2)
165165

166-
**Phase 17.D** (combined-span penalty + dedup ordering): **PE both-mapped = 8767** (STAR: 8390), **half-mapped = 236** (was 248 Phase 17.B), **98.2% per-mate position agreement**, **93.920% PE exact faithfulness**.
166+
**Current**: **PE both-mapped = 8390** (STAR: 8390, exact match!), **half-mapped = 0**, **99.883% PE exact faithfulness (tie-adjusted)** (16284/16306, 475 tie-breaking diffs excluded). MAPQ inflations: **0**, deflations: 0. NH diffs: **0**. 1 STAR-only mate (`.18919121`, SA-level diff), 1 ruSTAR-only mate (`.6302610`, pre-existing FP).
167167

168-
**Two fixes in Phase 17.D**:
169-
1. `try_pair_transcripts` now computes STAR-faithful combined-span penalty: `combined_wt_score = t1.score + t2.score - p1 - p2 + combined_p`. Previously double-applied per-mate span penalties → AS tag wrong for 99.6% of PE reads. Now 3.1%.
170-
2. Decision tree reordered to (1) position dedup → (2) score-range filter → (3) TooManyLoci → (4) quality filter. Fixes 12 half-mapped pairs.
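The combined-span formula in item 1 can be written out as a small worked example. This is a sketch under stated assumptions: `MateScore` is a hypothetical struct, penalties are taken to be signed terms (zero or negative) that are already folded into each per-mate score, and how `combined_p` is derived from the pair's span is not shown because the note does not specify it.

```rust
/// Hypothetical per-mate transcript score with its span penalty already applied.
struct MateScore {
    /// Score including `span_penalty`.
    score: i32,
    /// Signed per-mate span penalty (assumed zero or negative).
    span_penalty: i32,
}

/// Remove both per-mate penalties, then apply one penalty for the combined span:
/// combined_wt_score = t1.score + t2.score - p1 - p2 + combined_p
fn combined_wt_score(t1: &MateScore, t2: &MateScore, combined_p: i32) -> i32 {
    t1.score + t2.score - t1.span_penalty - t2.span_penalty + combined_p
}

/// The earlier behavior described in the note: per-mate penalties stay applied.
fn naive_pair_score(t1: &MateScore, t2: &MateScore) -> i32 {
    t1.score + t2.score
}
```

With mate scores of 98 (penalty -2) and 99 (penalty -1) and a combined penalty of -2, the combined score is 98 + 99 + 2 + 1 - 2 = 198, whereas simply summing the per-mate scores gives 197.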
168+
**Phase G2** (2026-04-29): `MAX_RECURSION` 10,000→100,000 + `sa_pos_to_forward` overflow fix. `ERR12389696.7118031` was the sole source of both NH diffs and MAPQ inflations (NH=3 vs STAR's NH=9, ruSTAR MAPQ=1 vs STAR MAPQ=0). Root cause: the 47-WA rDNA cluster (4 copies × multiple seeds per mate) exhausted 10k recursions before exploring the 4th within-copy pair. Fix: raise the per-cluster recursion budget from 10k to 100k. Also fixed: `sa_pos_to_forward` underflow panic for reverse-strand seeds near genome boundary (now `saturating_sub`). Also added guard in `finalize_transcript` to reject WTs where `adjusted_genome_start + ref_len > n_genome`.
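The two boundary fixes described in Phase G2 follow a common Rust pattern, sketched below. The conversion formula in `reverse_to_forward` is illustrative only and is not claimed to match `sa_pos_to_forward`; the function names and parameters are hypothetical. What the sketch does show is how `saturating_sub` avoids an unsigned underflow panic and how an end-of-genome guard can reject an out-of-range transcript.

```rust
/// Illustrative reverse-to-forward conversion (hypothetical formula).
/// Plain `-` on u64 would panic in debug builds when the result goes
/// below zero; `saturating_sub` clamps to 0 instead.
fn reverse_to_forward(n_genome: u64, sa_pos: u64, seed_len: u64) -> u64 {
    n_genome.saturating_sub(sa_pos).saturating_sub(seed_len)
}

/// Guard in the spirit of the `finalize_transcript` check: accept only if
/// `start + ref_len` stays within the genome. `checked_add` also covers
/// the case where the sum itself would overflow u64.
fn fits_in_genome(start: u64, ref_len: u64, n_genome: u64) -> bool {
    match start.checked_add(ref_len) {
        Some(end) => end <= n_genome,
        None => false,
    }
}
```

A seed of length 5 at position 98 in a 100-base genome clamps to 0 rather than panicking, and a transcript starting at 90 with a reference length of 20 is rejected for the same genome.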
171169

172-
**Current PE parity**: 8767 vs STAR 8390 (+377 extra, mostly rDNA N² cross-copy pairs). 236 half-mapped. 28 MAPQ inflations / 192 deflations (rDNA N² problem). `.18919121` fixed (Phase 17.B). `.6302610` still FP.
170+
**Phase G1** (2026-04-29): `split_combined_wt` junction_idx fix. Reduced `.16980960`'s pairs from 11 to 9, matching STAR's NH=9.
173171

174-
## Remaining Limitations (Top 5)
172+
**Note on faithfulness change**: Phase F1 (--runRNGseed PR) changed PE tie-breaking from SA-order to seeded StdRng, increasing tie-breaking diffs. Phase G1 improved faithfulness from 99.755% → 99.865%. Phase G2: 99.865% → 99.883%.
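The tie-adjusted percentage can be reproduced from the SE figures quoted under Current Status (8926 reads, 299 tie-breaking diffs excluded, 8611 exact among the remainder). The sketch below infers the formula from those figures; it is not taken from `assess_faithfulness.py`, and the function name is hypothetical.

```rust
/// Tie-adjusted faithfulness as a percentage:
/// exact matches among reads that are not tie-breaking diffs.
/// Returns None when the inputs are inconsistent or nothing remains.
fn tie_adjusted_faithfulness(
    exact_non_tie: u64,
    total_reads: u64,
    tie_diffs: u64,
) -> Option<f64> {
    // Exclude tie-breaking diffs from the denominator.
    let non_tie = total_reads.checked_sub(tie_diffs)?;
    if non_tie == 0 || exact_non_tie > non_tie {
        return None;
    }
    Some(100.0 * exact_non_tie as f64 / non_tie as f64)
}
```

For the SE figures, the denominator is 8926 - 299 = 8627, and 8611 / 8627 rounds to 99.815%.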
173+
174+
## Remaining Limitations (Top 2)
175175

176-
- No coordinate-sorted BAM output (use `samtools sort`) — Phase 17.2
177-
- No PE chimeric detection — Phase 17.3
178-
- No `--outStd SAM/BAM` (stdout output) — Phase 17.6
179-
- No `--outReadsUnmapped Fastx` — Phase 17.4
180176
- No STARsolo single-cell features — Phase 14 (deferred)
177+
- 4 PE AS diffs (ruSTAR improvements, not bugs): `.844151` finds VIII:451791 0mm vs STAR's VII:1001391 6mm; `.4972950` finds correct spliced mate2 vs STAR's unspliced. Both cases: STAR's combined-window approach fails to stitch a PE pair at the better location.
181178

182179
See [docs/phase17_features.md](docs/phase17_features.md) for full feature status.

CONTRIBUTING.md

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
# Contributing to ruSTAR
2+
3+
## Building and testing
4+
5+
Rust 2024 edition. Standard Cargo commands:
6+
7+
```bash
8+
cargo build # debug build
9+
cargo build --release # release build
10+
cargo test # run all tests
11+
cargo clippy # lint (zero warnings expected)
12+
cargo fmt --check # formatting check
13+
```
14+
15+
CI runs on Linux (x86_64, x86-64-v3, aarch64), macOS (aarch64), and Windows (x86_64). PRs must pass all CI checks before merging.
16+
17+
## Test data
18+
19+
Small synthetic and yeast test data lives in `test/`. Integration tests in `tests/` use the synthetic genome. Differential testing against STAR reference outputs is done via `test/compare_sam.py` and `test/compare_pe.py`.
20+
21+
## Project history
22+
23+
ruSTAR was written as a faithful port of [STAR](https://github.com/alexdobin/STAR) by Alexander Dobin. Up to the initial release, the goal was behavioral parity with STAR — matching its algorithms, thresholds, and output formats as closely as possible. Notes from that development phase are in `docs/dev/`.
24+
25+
Future development is not bound by that constraint. Adding STARsolo, new features, or diverging from STAR behavior is entirely welcome.
26+
27+
## License
28+
29+
MIT, matching the original STAR license.

Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -50,9 +50,9 @@ rayon = "1"
5050
dashmap = "6"
5151
chrono = "0.4"
5252
rand = "0.8"
53+
tempfile = "3"
5354

5455
[dev-dependencies]
55-
tempfile = "3"
5656
assert_cmd = "2"
5757
predicates = "3"
5858
