Develop by joshfactorial · Pull Request #286 · ncsa/NEAT

joshfactorial · 2026-05-19T01:13:08Z

Performance enhancements.

Many changes designed to increase performance for NEAT, especially on large chromosomes. We saw a 9x decrease in time and 3x decrease in RAM usage. Reduced dependency on Biopython. Eliminated shuffling, is it is too memory intensive, and an existing tool (seqkit) does this as well as can be done (recommendation added to README). Once that was out of the way, it enabled streaming the fastq and bam files instead of storing all the data in memory. Also reduced RNG calls. Will fail regression tests, because the RNG pattern changed, but still deterministic with a given seed.

Two hot-path fixes identified via cProfile on a single-worker ecoli run: 1. error_models.py: in get_sequencing_errors, replace `rng.choice(list(range(len(quality_scores))))` with `rng.integers(len)`. The two calls return identical values from the same seed (same underlying integers() call), but choice() validates via np.prod() and the list(range()) allocation per call adds ~150x amplification. Eliminated 28M np.prod calls on a 185k-read run. Bit-reproducible: same seed yields same output. 2. read.py: in make_cigar, fast-path the no-indel case. NEAT knows whether any insertion or deletion was introduced (by error or by mutation) without needing to align — return f"{n}M" directly when self.errors and self.mutations contain none. Falls back to PairwiseAligner only for the ~1.6% of reads that actually have indels. Validated on ecoli 10x 4-thread (same configs as comparison/configs): ecoli_se: 14:55.7 -> 6:12.0 (2.41x speedup, -58% wall) ecoli_pe: 14:45.5 -> 6:08.2 (2.40x speedup, -58% wall) Single-thread 2x SE: 8:45.6 -> 3:08.4 (2.79x, -64%). Memory unchanged. NEAT now beats NEAT 2.1 single-threaded on SE (was previously losing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two further hot-path fixes, building on the prior ~2.4x speedup: 3. error_models.py:get_sequencing_errors — replace the per-iteration scalar rejection-sampling loop (rng.integers + rng.random per attempt) with a single batched draw of n indices and n uniform deviates, evaluate the accept mask vectorwise via 10**(-q/10), and take the first num_errors accepted positions. Eliminates the 150-iteration Python loop per read and the per-read {quality: rate} dict (replaced with one numpy expression). Caveat: batched draws consume the PRNG stream in a different order than the original interleaved draws, so seeded runs are not bit-identical to the prior implementation. Distribution and downstream statistics are preserved. 4. read.py:make_cigar — replace the indel-case fallback (PairwiseAligner on every indel read) with a direct walker that builds the CIGAR from the stored error metadata in O(read length). Handles the forward-read + sequencing-error-indel case (~half of all indel reads on default settings). Reverse reads with indels and mutation-indel cases still fall back to the pairwise aligner — the existing reverse-read truncation semantics and mutation-then-error coordinate shifts don't fit a simple position walker. make_cigar is now split into make_cigar/_cigar_from_error_indels/ _cigar_via_alignment for clarity. Validated on ecoli 10x 4-thread (same configs as comparison/configs): ecoli_se: 14:55.7 -> 4:27.4 (3.35x speedup, -70% wall, cumulative) ecoli_pe: 14:45.5 -> 4:36.6 (3.20x speedup, -69% wall, cumulative) NEAT now uses less total CPU than NEAT 2.1 on PE workloads (799s vs 1209s). Memory unchanged (~543 MB peak RSS per worker). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cover_dataset called rng.shuffle on its fragment list at the end. The fragment positions are already drawn at random by the upstream sampling — the shuffle just re-randomized an already-random order, and forced two downstream sort passes (per-worker and final) to undo it before the BAM could be written coordinate-sorted. The final sort allocated 1 GB of buffer per memray. Removing the shuffle alone is intentionally a no-op for now: the per-worker pysam.sort still runs in the next commit, and reads are still emitted in fragment-iteration order. Subsequent commits move the BAM sort to the write-side and drop both sort passes. FASTQ output order is documented in README ("Output ordering"); seqkit shuffle is the recommended path for users who need a strict uniform shuffle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…sort Replace the per-worker pysam.sort pass (which used samtools' default 768 MB sort buffer per thread) with an in-Python sort of the flattened read list by Read.position before writing to the worker's BAM. For paired-end runs this correctly interleaves mate records by coordinate, which a fragment-level sort would not — read1 and read2 in a pair sit at different reference positions and need to be placed independently in the sorted output. Each worker now produces a coordinate-sorted BAM by construction. The final stitch step still calls pysam.sort -m 1G on the merged result; the next commit removes that, since pysam.merge on coord-sorted inputs already produces coord-sorted output. Also removes the now-unused pysam and os imports from single_runner. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per-worker BAMs are coordinate-sorted by single_runner. pysam.merge does a k-way heap merge of coordinate-sorted inputs and produces coordinate-sorted output, so the trailing pysam.sort pass was provably redundant. It was also the largest single allocation in the run (1.07 GB sort buffer with -m 1G, per memray) and dominated peak memory at stitch time. Also passes -@ threads to the final pysam.merge for parallel decompression of inputs. The .bai index continues to be produced by runner.py post-stitch. Validated single-thread ecoli 2x SE: Peak RSS: 423 MB -> 384 MB BAM: coordinate-sorted (header SO:coordinate, 0 ordering violations on 185667 records). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The quality-score generator looped ~150 times per read calling rng.normal with scalar (mean, scale) arguments, plus a Python list comprehension to compute the index map. Replace both with batched numpy: one rng.normal call returning all draws with element-wise (mean, scale) parameters, one np.clip + np.rint pass for the [1, 42] bound. Drops a 28M-iteration Python loop on a 185k-read single-thread run. Same statistical distribution as before, but the PRNG stream is consumed in a different order than the prior scalar-loop implementation, so seeded outputs are not bit-identical to the previous version. (Same caveat as the get_sequencing_errors vectorization in commit 6759026.) Validated on ecoli 10x 4-thread: ecoli_se: 5:21 -> 2:50 (1.57x speedup over v5 sort-plan version) ecoli_pe: 5:33 -> 2:48 (1.65x speedup) Memory unchanged (~232 MB peak RSS). Combined with the prior fixes, NEAT now achieves: ecoli_se 14:56 -> 2:50 (5.27x faster than baseline, 2.4x less memory) ecoli_pe 14:46 -> 2:48 (5.27x faster than baseline, 2.4x less memory) NEAT vs NEAT 2.1 single-thread SE: 4.40x faster, 38% less CPU. NEAT vs NEAT 2.1 single-thread PE: 7.22x faster, 62% less CPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

generate_reads no longer accumulates a list of (Read, Read|None) pairs for the caller to write later. Instead it streams reads directly to the FASTQ and BAM output handles on OutputFileWriter as they are generated: - Sort fragments by read1 start position before iterating, so read1 records arrive at the BAM in monotonic coordinate order. - In paired-end mode, push each read2 onto a heapq-keyed min-heap by position. Before writing each new read1, pop and write any buffered read2 whose position is below the new read1's position. After the main loop, flush the heap in order. The buffer is bounded by roughly (fragment_length / read_length) entries — single-digit reads typically — so per-worker memory stays constant in chunk size and coverage instead of scaling linearly with them. - The function returns nothing now; single_runner just calls it and lets OutputFileWriter.flush_and_close_files close the BAM at end of chunk. Drops the BAM-write loop and the reads_to_write accumulation from single_runner — they were the only consumers. Validated on ecoli 10x 4-thread: ecoli_se wall: 2:49.97 -> 2:08.38 (1.32x faster, CPU 272% -> 317%) ecoli_pe wall: 2:47.84 -> 2:14.01 (1.25x faster, CPU 274% -> 294%) Single-thread peak RSS: 385 MB -> 177 MB (54% reduction; the per-chunk read buffer was the dominant per-worker memory cost). BAM coordinate-sorted on both SE and PE, 0 sort violations on 928k records. Cumulative vs original baseline: ecoli_se 4-thread 10x: 14:55.7 -> 2:08.4 (6.98x speedup, 2.46x less RSS) ecoli_pe 4-thread 10x: 14:45.5 -> 2:14.0 (6.60x speedup, 2.46x less RSS) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously merge_bam called pysam.merge on overlapping per-chunk BAMs, which required BGZF decompression and a k-way heap merge of all records. At supercomputer scale on large genomes that step becomes the dominant cost (scales with output size, not core count) and would dwarf the parallel read-generation phase. Two coordinated changes make a raw byte-level concatenation correct: 1. cover_dataset now takes responsibility_length, the number of bases at the start of the chunk's reference window that this chunk owns for read1 placement. r1.position is clipped to [0, responsibility_length); the trailing overlap region serves only as reference context for boundary- spanning reads. number_reads is scaled by responsibility_length rather than span_length, eliminating the ~0.06% over-coverage that used to appear in chunk-overlap regions. The runner computes responsibility_length per chunk as the distance to the next chunk's start (or the full remaining length for the final chunk). 2. For PE, cover_dataset additionally caps fragment-end e at responsibility_length + read_len so that read2.position cannot extend past responsibility_length. Without this clamp the few PE fragments with long fragment_length near a chunk boundary produced read2 records that landed in the next chunk's range, breaking cat-stitched sort order (27 violations on 928k records in initial testing). With the clamp, per-worker BAMs concatenate into a strictly coordinate-sorted whole. stitch_outputs.merge_bam now calls pysam.cat instead of pysam.merge. cat does a raw BGZF concatenation (no decompression / re-encode) and is bounded by raw disk I/O, not BGZF rate. The per-chunk batching loop for FD-limit management is no longer needed. Validated on ecoli 10x 4-thread: ecoli_se: 2:08.4 -> 1:52.7 (1.14x faster, 223 MB -> 175 MB RSS) ecoli_pe: 2:14.0 -> 2:00.5 (1.11x faster, 223 MB -> 175 MB RSS, 0 sort violations on both) BAM is coordinate-sorted, .bai index produced as before by runner.py. Record counts match expected (0.06% fewer than pre-fix because chunk-overlap duplicates are gone — actual correctness improvement). The relative speedup of cat over merge grows with output size: at ecoli scale the BAM is 115 MB and stitch was already a modest fraction; for a human 30x BAM (~100 GB), the same change extrapolates to dropping stitch from hours to a few minutes — bounded by parallel filesystem bandwidth instead of BGZF re-encode rate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two changes that matter most at large-genome / supercomputer scale, where defaults sized for small bacterial runs fall apart. 1. Auto-tune parallel_block_size Changed the default from a hardcoded 500_000 bp to 0, which is now a sentinel meaning "auto-tune". The runner computes the chunk size from total genome length and thread count, targeting 8 chunks per thread: chunk_size = total_genome / (threads * 8) clamped to [100 kb, 50 Mb] For ecoli (14 Mb, 4 threads) this gives ~435 kb chunks — essentially unchanged from the old hardcoded 500 kb. For human-scale runs (3 Gb, 64 threads) it gives ~5.9 Mb chunks (≈520 total) instead of the old 500 kb (≈6,000 total): fewer files to merge at stitch time, fewer process-startup overheads, better load balance across workers. Users with `parallel_block_size: <positive int>` in their YAML get their explicit value as before. Genome length is read from the FASTA index (.fai) via pysam.FastaFile — fast even on large genomes. 2. Parallelize stitch_outputs.main The four stitch steps (fq1 concat, fq2 concat, vcf merge, bam cat) now run concurrently in a ThreadPoolExecutor. All four spend most of their time in C-level I/O that releases the GIL (zlib for gzip, BGZF for pysam.cat). Total stitch wall ≈ max(steps) instead of sum(steps). On the local ecoli benchmark the parallel stitch shows ~flat wall (the single-disk I/O contention makes the steps effectively serial), but at supercomputer scale on a parallel filesystem the overlap becomes real. No regression observed. True byte-level streaming of pysam.cat output as workers complete remains future work — would require manual BGZF block manipulation to strip per-chunk BAM headers, and the win is modest at single-node scale (saves ~one chunk's worth of cat time at the end). The cat-stitch from the prior commit already eliminates the dominant scaling concern. Validated on ecoli 10x 4-thread: ecoli_se: 1:52.7 -> 1:56.9 (within noise; SE has less to parallelize) ecoli_pe: 2:00.5 -> 1:44.3 (1.16x faster, CPU% 310 -> 353) Auto-tuned chunk size: 435,154 bp for ecoli's 13.9 Mb genome. BAM: 0 sort violations on both SE and PE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…angeLog README: Added a "Multi-node deployment on HPC clusters" subsection with a SLURM array-job example for whole-genome simulation. NEAT itself runs single-node (multiprocessing.Pool); for cross-node scaling, users dispatch one task per chromosome and concatenate outputs. The cat-based stitch introduced in this release means the final whole-genome concatenation is just `samtools cat` of per-chromosome BAMs — no sort needed. Also updated the parallelizing-simulation section to mention auto-tuned chunk sizes (the default behavior now). ChangeLog: Wrote the v4.4.3 entry summarizing the session's work, with a before/after benchmark table on ecoli 10x. Headline numbers: ecoli SE wall: 14:55 -> 1:57 (7.6x) ecoli PE wall: 14:46 -> 1:44 (8.5x) Peak RSS: 549 MB -> 175 MB (3.1x less) Peak heap: 1.27 GB -> 0.32 GB (4x less) pysam.sort: gone PairwiseAligner: 99% gone vs NEAT 2.1 single-threaded: SE wall: 12:28 -> 1:57 (6.4x), 53% less total CPU PE wall: 20:12 -> 1:44 (11.6x), 70% less total CPU Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…te-level FASTQ concat Three small but additive improvements to the read-generation and stitch paths, all behaviour-preserving: 1. Drop two `deepcopy(self.reference_segment)` calls in read.py. Biopython `Seq` is immutable; `apply_mutations`/`apply_errors` build their own `MutableSeq` working copies. The deepcopies just paid for a fresh allocation per read that nothing relied on. `convert_masking` and `finalize_read_and_write` now alias the masked reference directly. 2. Rewrite `apply_errors` as a single ascending-position pass. The previous implementation iterated errors in descending order doing one `np.concatenate` and one `MutableSeq` slice/concat per error — quadratic in errors-per-read. The new pass walks ascending, accumulating unchanged-range chunks plus error alternates into two lists (sequence + quality scores), then joins/concats once at the end. Quality-array semantics are preserved exactly (including the prior convention that an insertion anchor's original quality is replaced by `alt_len - 1` low-quality scores). Linear in total read length regardless of error count. 3. Byte-level FASTQ concat in stitch. The previous `concat` decompressed each per-chunk FASTQ via `gzip.open(rt)` and re-compressed into the destination gzip writer. Concatenated gzip streams are themselves valid gzip files (gzip spec §2.2), so we skip zlib entirely and just copy raw bytes. The destination's gzip handle is closed first so the byte-level write can truncate and overwrite cleanly. Validated on ecoli 10x 4-thread under machine-busy conditions: Polish off: ecoli_se 3:05.5 Polish on: ecoli_se 2:28.6 (~20% faster on a busy machine; comparable to v10's quiet-machine result of 1:56.9) c_elegans 10x 4-thread (scale-test, 97 Mb genome, 35 auto-tuned chunks): 19:16 wall, 304 MB peak RSS, 6,685,764 records, 0 sort violations. Stitch step: 5.3 s parallel. Confirms the cat-stitch architecture scales from ecoli's 14 Mb up an order of magnitude with no surprises. Adds a c_elegans config to comparison/configs/ for future regression runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The v4.4.3 entry now includes: - c_elegans 10x SE scale-test (97 Mb genome, ~7× ecoli): 19:16 wall, 304 MB peak RSS, 6.7M records, 0 sort violations, 5.3 s stitch. - Polish details: one-pass apply_errors (eliminates quadratic per-error rebuild), deepcopy removals in convert_masking and finalize_read_and_write, byte-level FASTQ concat at stitch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…READMEtable Re-ran ecoli SE+PE 4-thread 10x on a quiet machine after all the polish work landed. The ChangeLog table now reflects the actual v4.4.3 final numbers: ecoli_se wall: 14:55 -> 1:35 (9.4x faster, was 7.6x with pre-polish numbers) ecoli_pe wall: 14:46 -> 1:35 (9.4x faster, was 8.5x) ecoli_se CPU: 3,227 s -> 331 s (9.7x less) ecoli_pe CPU: 3,168 s -> 338 s (9.4x less) vs NEAT 2.1 single-threaded baseline: SE: 7.9x faster, 56% less CPU PE: 12.8x faster, 72% less CPU Also added a note at the top of README "Estimated runtimes" pointing readers to ChangeLog v4.4.3 for current numbers — the v4.4.0 runtimes in that section are now ~5-10x off from current performance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three sets of tests were exercising contracts that the v4.4.3 work changed: 1. tests/test_read_simulator/test_generate_reads.py (9 tests) generate_reads no longer returns a list of (Read, Read|None) pairs — it streams Read objects directly to the output_file_writer as part of the bounded-memory rework. Added a small `_CollectingOFW` stand-in that captures write_bam_record calls; the affected tests now enable produce_bam=True, pass a CollectingOFW, and assert on ofw.bam_records. For paired-end tests, the pair-structure assertions were replaced with forward/reverse-count assertions since pairs are no longer kept together in memory (mates are interleaved by position via a heap buffer). 2. tests/test_read_simulator/test_options.py (1 test) test_default_values expected parallel_block_size == 500000. The v4.4.3 default is 0 — a sentinel that triggers auto-tuning from genome length and thread count in runner.py. Test updated to assert the sentinel. 3. tests/test_read_simulator/test_stitch_outputs.py (11 tests) - concat() now takes a destination Path (not a gzip handle) and writes raw bytes — concatenated gzip streams are valid gzip per the spec. The concat tests now pass a path and read it back via gzip.open. - merge_bam now calls pysam.cat (not pysam.merge + pysam.sort). The four old merge/sort/temp-file/chunk-batching tests were replaced with three cat-based equivalents (call_count, output path, full-input ordering). - The main() tests that asserted against StringIO buffers were updated to read back from the actual destination paths, since byte-level concat writes through the path, not through the handle. All 599 tests pass locally (`pytest tests/`); should unblock the GitHub Actions failures on develop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- pyproject.toml: 4.4.2 -> 4.4.3 for the release. - ChangeLog.md: behaviour -> behavior (consistency with the rest of the file). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

joshfactorial and others added 16 commits May 18, 2026 00:05

Bump version to 4.4.3 and fix ChangeLog spelling

2f56140

- pyproject.toml: 4.4.2 -> 4.4.3 for the release. - ChangeLog.md: behaviour -> behavior (consistency with the rest of the file). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge branch 'main' into develop

34a2e7e

joshfactorial merged commit 799a67e into main May 19, 2026
2 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Develop#286

Develop#286
joshfactorial merged 16 commits into
mainfrom
develop

joshfactorial commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

joshfactorial commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant