Skip to content

Develop#286

Merged
joshfactorial merged 16 commits into
mainfrom
develop
May 19, 2026
Merged

Develop#286
joshfactorial merged 16 commits into
mainfrom
develop

Conversation

@joshfactorial
Copy link
Copy Markdown
Collaborator

Performance enhancements.

Many changes designed to increase performance for NEAT, especially on large chromosomes. We saw a 9x decrease in time and 3x decrease in RAM usage. Reduced dependency on Biopython. Eliminated shuffling, is it is too memory intensive, and an existing tool (seqkit) does this as well as can be done (recommendation added to README). Once that was out of the way, it enabled streaming the fastq and bam files instead of storing all the data in memory. Also reduced RNG calls. Will fail regression tests, because the RNG pattern changed, but still deterministic with a given seed.

joshfactorial and others added 16 commits May 18, 2026 00:05
Two hot-path fixes identified via cProfile on a single-worker ecoli run:

1. error_models.py: in get_sequencing_errors, replace
   `rng.choice(list(range(len(quality_scores))))` with `rng.integers(len)`.
   The two calls return identical values from the same seed (same underlying
   integers() call), but choice() validates via np.prod() and the list(range())
   allocation per call adds ~150x amplification. Eliminated 28M np.prod calls
   on a 185k-read run. Bit-reproducible: same seed yields same output.

2. read.py: in make_cigar, fast-path the no-indel case. NEAT knows whether
   any insertion or deletion was introduced (by error or by mutation) without
   needing to align — return f"{n}M" directly when self.errors and
   self.mutations contain none. Falls back to PairwiseAligner only for the
   ~1.6% of reads that actually have indels.

Validated on ecoli 10x 4-thread (same configs as comparison/configs):
  ecoli_se: 14:55.7 -> 6:12.0  (2.41x speedup, -58% wall)
  ecoli_pe: 14:45.5 -> 6:08.2  (2.40x speedup, -58% wall)
Single-thread 2x SE: 8:45.6 -> 3:08.4 (2.79x, -64%).
Memory unchanged. NEAT now beats NEAT 2.1 single-threaded on SE
(was previously losing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two further hot-path fixes, building on the prior ~2.4x speedup:

3. error_models.py:get_sequencing_errors — replace the per-iteration scalar
   rejection-sampling loop (rng.integers + rng.random per attempt) with a single
   batched draw of n indices and n uniform deviates, evaluate the accept mask
   vectorwise via 10**(-q/10), and take the first num_errors accepted positions.
   Eliminates the 150-iteration Python loop per read and the per-read
   {quality: rate} dict (replaced with one numpy expression). Caveat: batched
   draws consume the PRNG stream in a different order than the original
   interleaved draws, so seeded runs are not bit-identical to the prior
   implementation. Distribution and downstream statistics are preserved.

4. read.py:make_cigar — replace the indel-case fallback (PairwiseAligner on
   every indel read) with a direct walker that builds the CIGAR from the
   stored error metadata in O(read length). Handles the forward-read +
   sequencing-error-indel case (~half of all indel reads on default settings).
   Reverse reads with indels and mutation-indel cases still fall back to the
   pairwise aligner — the existing reverse-read truncation semantics and
   mutation-then-error coordinate shifts don't fit a simple position walker.
   make_cigar is now split into make_cigar/_cigar_from_error_indels/
   _cigar_via_alignment for clarity.

Validated on ecoli 10x 4-thread (same configs as comparison/configs):
  ecoli_se: 14:55.7 -> 4:27.4  (3.35x speedup, -70% wall, cumulative)
  ecoli_pe: 14:45.5 -> 4:36.6  (3.20x speedup, -69% wall, cumulative)
NEAT now uses less total CPU than NEAT 2.1 on PE workloads (799s vs 1209s).
Memory unchanged (~543 MB peak RSS per worker).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cover_dataset called rng.shuffle on its fragment list at the end. The fragment
positions are already drawn at random by the upstream sampling — the shuffle
just re-randomized an already-random order, and forced two downstream sort
passes (per-worker and final) to undo it before the BAM could be written
coordinate-sorted. The final sort allocated 1 GB of buffer per memray.

Removing the shuffle alone is intentionally a no-op for now: the per-worker
pysam.sort still runs in the next commit, and reads are still emitted in
fragment-iteration order. Subsequent commits move the BAM sort to the
write-side and drop both sort passes.

FASTQ output order is documented in README ("Output ordering"); seqkit shuffle
is the recommended path for users who need a strict uniform shuffle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sort

Replace the per-worker pysam.sort pass (which used samtools' default 768 MB
sort buffer per thread) with an in-Python sort of the flattened read list by
Read.position before writing to the worker's BAM. For paired-end runs this
correctly interleaves mate records by coordinate, which a fragment-level sort
would not — read1 and read2 in a pair sit at different reference positions
and need to be placed independently in the sorted output.

Each worker now produces a coordinate-sorted BAM by construction. The final
stitch step still calls pysam.sort -m 1G on the merged result; the next
commit removes that, since pysam.merge on coord-sorted inputs already
produces coord-sorted output.

Also removes the now-unused pysam and os imports from single_runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-worker BAMs are coordinate-sorted by single_runner. pysam.merge does a
k-way heap merge of coordinate-sorted inputs and produces coordinate-sorted
output, so the trailing pysam.sort pass was provably redundant. It was also
the largest single allocation in the run (1.07 GB sort buffer with -m 1G,
per memray) and dominated peak memory at stitch time.

Also passes -@ threads to the final pysam.merge for parallel decompression
of inputs. The .bai index continues to be produced by runner.py post-stitch.

Validated single-thread ecoli 2x SE:
  Peak RSS: 423 MB -> 384 MB
  BAM: coordinate-sorted (header SO:coordinate, 0 ordering violations on
  185667 records).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The quality-score generator looped ~150 times per read calling rng.normal with
scalar (mean, scale) arguments, plus a Python list comprehension to compute
the index map. Replace both with batched numpy: one rng.normal call returning
all draws with element-wise (mean, scale) parameters, one np.clip + np.rint
pass for the [1, 42] bound. Drops a 28M-iteration Python loop on a 185k-read
single-thread run.

Same statistical distribution as before, but the PRNG stream is consumed in
a different order than the prior scalar-loop implementation, so seeded
outputs are not bit-identical to the previous version. (Same caveat as the
get_sequencing_errors vectorization in commit 6759026.)

Validated on ecoli 10x 4-thread:
  ecoli_se: 5:21 -> 2:50 (1.57x speedup over v5 sort-plan version)
  ecoli_pe: 5:33 -> 2:48 (1.65x speedup)
Memory unchanged (~232 MB peak RSS).

Combined with the prior fixes, NEAT now achieves:
  ecoli_se 14:56 -> 2:50 (5.27x faster than baseline, 2.4x less memory)
  ecoli_pe 14:46 -> 2:48 (5.27x faster than baseline, 2.4x less memory)
NEAT vs NEAT 2.1 single-thread SE: 4.40x faster, 38% less CPU.
NEAT vs NEAT 2.1 single-thread PE: 7.22x faster, 62% less CPU.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
generate_reads no longer accumulates a list of (Read, Read|None) pairs for the
caller to write later. Instead it streams reads directly to the FASTQ and BAM
output handles on OutputFileWriter as they are generated:

  - Sort fragments by read1 start position before iterating, so read1 records
    arrive at the BAM in monotonic coordinate order.
  - In paired-end mode, push each read2 onto a heapq-keyed min-heap by
    position. Before writing each new read1, pop and write any buffered read2
    whose position is below the new read1's position. After the main loop,
    flush the heap in order. The buffer is bounded by roughly
    (fragment_length / read_length) entries — single-digit reads typically —
    so per-worker memory stays constant in chunk size and coverage instead
    of scaling linearly with them.
  - The function returns nothing now; single_runner just calls it and lets
    OutputFileWriter.flush_and_close_files close the BAM at end of chunk.

Drops the BAM-write loop and the reads_to_write accumulation from
single_runner — they were the only consumers.

Validated on ecoli 10x 4-thread:
  ecoli_se wall: 2:49.97 -> 2:08.38  (1.32x faster, CPU 272% -> 317%)
  ecoli_pe wall: 2:47.84 -> 2:14.01  (1.25x faster, CPU 274% -> 294%)
Single-thread peak RSS: 385 MB -> 177 MB (54% reduction; the per-chunk read
buffer was the dominant per-worker memory cost).
BAM coordinate-sorted on both SE and PE, 0 sort violations on 928k records.

Cumulative vs original baseline:
  ecoli_se 4-thread 10x: 14:55.7 -> 2:08.4 (6.98x speedup, 2.46x less RSS)
  ecoli_pe 4-thread 10x: 14:45.5 -> 2:14.0 (6.60x speedup, 2.46x less RSS)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously merge_bam called pysam.merge on overlapping per-chunk BAMs, which
required BGZF decompression and a k-way heap merge of all records. At
supercomputer scale on large genomes that step becomes the dominant cost
(scales with output size, not core count) and would dwarf the parallel
read-generation phase.

Two coordinated changes make a raw byte-level concatenation correct:

1. cover_dataset now takes responsibility_length, the number of bases at the
   start of the chunk's reference window that this chunk owns for read1
   placement. r1.position is clipped to [0, responsibility_length); the
   trailing overlap region serves only as reference context for boundary-
   spanning reads. number_reads is scaled by responsibility_length rather
   than span_length, eliminating the ~0.06% over-coverage that used to
   appear in chunk-overlap regions. The runner computes
   responsibility_length per chunk as the distance to the next chunk's
   start (or the full remaining length for the final chunk).

2. For PE, cover_dataset additionally caps fragment-end e at
   responsibility_length + read_len so that read2.position cannot extend
   past responsibility_length. Without this clamp the few PE fragments with
   long fragment_length near a chunk boundary produced read2 records that
   landed in the next chunk's range, breaking cat-stitched sort order (27
   violations on 928k records in initial testing). With the clamp,
   per-worker BAMs concatenate into a strictly coordinate-sorted whole.

stitch_outputs.merge_bam now calls pysam.cat instead of pysam.merge. cat
does a raw BGZF concatenation (no decompression / re-encode) and is bounded
by raw disk I/O, not BGZF rate. The per-chunk batching loop for FD-limit
management is no longer needed.

Validated on ecoli 10x 4-thread:
  ecoli_se: 2:08.4 -> 1:52.7  (1.14x faster, 223 MB -> 175 MB RSS)
  ecoli_pe: 2:14.0 -> 2:00.5  (1.11x faster, 223 MB -> 175 MB RSS, 0 sort
                              violations on both)
BAM is coordinate-sorted, .bai index produced as before by runner.py.
Record counts match expected (0.06% fewer than pre-fix because chunk-overlap
duplicates are gone — actual correctness improvement).

The relative speedup of cat over merge grows with output size: at ecoli
scale the BAM is 115 MB and stitch was already a modest fraction; for a
human 30x BAM (~100 GB), the same change extrapolates to dropping stitch
from hours to a few minutes — bounded by parallel filesystem bandwidth
instead of BGZF re-encode rate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two changes that matter most at large-genome / supercomputer scale, where
defaults sized for small bacterial runs fall apart.

1. Auto-tune parallel_block_size

   Changed the default from a hardcoded 500_000 bp to 0, which is now a
   sentinel meaning "auto-tune". The runner computes the chunk size from
   total genome length and thread count, targeting 8 chunks per thread:

     chunk_size = total_genome / (threads * 8)
              clamped to [100 kb, 50 Mb]

   For ecoli (14 Mb, 4 threads) this gives ~435 kb chunks — essentially
   unchanged from the old hardcoded 500 kb. For human-scale runs (3 Gb,
   64 threads) it gives ~5.9 Mb chunks (≈520 total) instead of the old
   500 kb (≈6,000 total): fewer files to merge at stitch time, fewer
   process-startup overheads, better load balance across workers.

   Users with `parallel_block_size: <positive int>` in their YAML get their
   explicit value as before. Genome length is read from the FASTA index
   (.fai) via pysam.FastaFile — fast even on large genomes.

2. Parallelize stitch_outputs.main

   The four stitch steps (fq1 concat, fq2 concat, vcf merge, bam cat) now
   run concurrently in a ThreadPoolExecutor. All four spend most of their
   time in C-level I/O that releases the GIL (zlib for gzip, BGZF for
   pysam.cat). Total stitch wall ≈ max(steps) instead of sum(steps).

   On the local ecoli benchmark the parallel stitch shows ~flat wall (the
   single-disk I/O contention makes the steps effectively serial), but at
   supercomputer scale on a parallel filesystem the overlap becomes real.
   No regression observed.

True byte-level streaming of pysam.cat output as workers complete remains
future work — would require manual BGZF block manipulation to strip
per-chunk BAM headers, and the win is modest at single-node scale (saves
~one chunk's worth of cat time at the end). The cat-stitch from the prior
commit already eliminates the dominant scaling concern.

Validated on ecoli 10x 4-thread:
  ecoli_se: 1:52.7 -> 1:56.9 (within noise; SE has less to parallelize)
  ecoli_pe: 2:00.5 -> 1:44.3 (1.16x faster, CPU% 310 -> 353)
  Auto-tuned chunk size: 435,154 bp for ecoli's 13.9 Mb genome.
  BAM: 0 sort violations on both SE and PE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…angeLog

README: Added a "Multi-node deployment on HPC clusters" subsection with a
SLURM array-job example for whole-genome simulation. NEAT itself runs
single-node (multiprocessing.Pool); for cross-node scaling, users dispatch
one task per chromosome and concatenate outputs. The cat-based stitch
introduced in this release means the final whole-genome concatenation is
just `samtools cat` of per-chromosome BAMs — no sort needed.

Also updated the parallelizing-simulation section to mention auto-tuned
chunk sizes (the default behavior now).

ChangeLog: Wrote the v4.4.3 entry summarizing the session's work, with a
before/after benchmark table on ecoli 10x. Headline numbers:

  ecoli SE wall:    14:55 -> 1:57   (7.6x)
  ecoli PE wall:    14:46 -> 1:44   (8.5x)
  Peak RSS:         549 MB -> 175 MB (3.1x less)
  Peak heap:        1.27 GB -> 0.32 GB (4x less)
  pysam.sort:       gone
  PairwiseAligner:  99% gone

vs NEAT 2.1 single-threaded:
  SE wall: 12:28 -> 1:57 (6.4x), 53% less total CPU
  PE wall: 20:12 -> 1:44 (11.6x), 70% less total CPU

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…te-level FASTQ concat

Three small but additive improvements to the read-generation and stitch paths,
all behaviour-preserving:

1. Drop two `deepcopy(self.reference_segment)` calls in read.py.
   Biopython `Seq` is immutable; `apply_mutations`/`apply_errors` build their own
   `MutableSeq` working copies. The deepcopies just paid for a fresh allocation
   per read that nothing relied on. `convert_masking` and `finalize_read_and_write`
   now alias the masked reference directly.

2. Rewrite `apply_errors` as a single ascending-position pass.
   The previous implementation iterated errors in descending order doing one
   `np.concatenate` and one `MutableSeq` slice/concat per error — quadratic in
   errors-per-read. The new pass walks ascending, accumulating unchanged-range
   chunks plus error alternates into two lists (sequence + quality scores),
   then joins/concats once at the end. Quality-array semantics are preserved
   exactly (including the prior convention that an insertion anchor's original
   quality is replaced by `alt_len - 1` low-quality scores). Linear in total
   read length regardless of error count.

3. Byte-level FASTQ concat in stitch.
   The previous `concat` decompressed each per-chunk FASTQ via `gzip.open(rt)`
   and re-compressed into the destination gzip writer. Concatenated gzip
   streams are themselves valid gzip files (gzip spec §2.2), so we skip zlib
   entirely and just copy raw bytes. The destination's gzip handle is closed
   first so the byte-level write can truncate and overwrite cleanly.

Validated on ecoli 10x 4-thread under machine-busy conditions:
  Polish off:  ecoli_se 3:05.5
  Polish on:   ecoli_se 2:28.6  (~20% faster on a busy machine; comparable to
                                 v10's quiet-machine result of 1:56.9)

c_elegans 10x 4-thread (scale-test, 97 Mb genome, 35 auto-tuned chunks):
  19:16 wall, 304 MB peak RSS, 6,685,764 records, 0 sort violations.
  Stitch step: 5.3 s parallel. Confirms the cat-stitch architecture scales
  from ecoli's 14 Mb up an order of magnitude with no surprises.

Adds a c_elegans config to comparison/configs/ for future regression runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The v4.4.3 entry now includes:
- c_elegans 10x SE scale-test (97 Mb genome, ~7× ecoli): 19:16 wall,
  304 MB peak RSS, 6.7M records, 0 sort violations, 5.3 s stitch.
- Polish details: one-pass apply_errors (eliminates quadratic per-error
  rebuild), deepcopy removals in convert_masking and finalize_read_and_write,
  byte-level FASTQ concat at stitch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…READMEtable

Re-ran ecoli SE+PE 4-thread 10x on a quiet machine after all the polish work
landed. The ChangeLog table now reflects the actual v4.4.3 final numbers:

  ecoli_se wall: 14:55 -> 1:35  (9.4x faster, was 7.6x with pre-polish numbers)
  ecoli_pe wall: 14:46 -> 1:35  (9.4x faster, was 8.5x)
  ecoli_se CPU:  3,227 s -> 331 s  (9.7x less)
  ecoli_pe CPU:  3,168 s -> 338 s  (9.4x less)

vs NEAT 2.1 single-threaded baseline:
  SE: 7.9x faster, 56% less CPU
  PE: 12.8x faster, 72% less CPU

Also added a note at the top of README "Estimated runtimes" pointing readers
to ChangeLog v4.4.3 for current numbers — the v4.4.0 runtimes in that
section are now ~5-10x off from current performance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three sets of tests were exercising contracts that the v4.4.3 work changed:

1. tests/test_read_simulator/test_generate_reads.py (9 tests)
   generate_reads no longer returns a list of (Read, Read|None) pairs — it
   streams Read objects directly to the output_file_writer as part of the
   bounded-memory rework. Added a small `_CollectingOFW` stand-in that
   captures write_bam_record calls; the affected tests now enable
   produce_bam=True, pass a CollectingOFW, and assert on ofw.bam_records.
   For paired-end tests, the pair-structure assertions were replaced with
   forward/reverse-count assertions since pairs are no longer kept together
   in memory (mates are interleaved by position via a heap buffer).

2. tests/test_read_simulator/test_options.py (1 test)
   test_default_values expected parallel_block_size == 500000. The v4.4.3
   default is 0 — a sentinel that triggers auto-tuning from genome length
   and thread count in runner.py. Test updated to assert the sentinel.

3. tests/test_read_simulator/test_stitch_outputs.py (11 tests)
   - concat() now takes a destination Path (not a gzip handle) and writes
     raw bytes — concatenated gzip streams are valid gzip per the spec. The
     concat tests now pass a path and read it back via gzip.open.
   - merge_bam now calls pysam.cat (not pysam.merge + pysam.sort). The four
     old merge/sort/temp-file/chunk-batching tests were replaced with three
     cat-based equivalents (call_count, output path, full-input ordering).
   - The main() tests that asserted against StringIO buffers were updated
     to read back from the actual destination paths, since byte-level
     concat writes through the path, not through the handle.

All 599 tests pass locally (`pytest tests/`); should unblock the GitHub
Actions failures on develop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- pyproject.toml: 4.4.2 -> 4.4.3 for the release.
- ChangeLog.md: behaviour -> behavior (consistency with the rest of the file).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@joshfactorial joshfactorial merged commit 799a67e into main May 19, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant