ncsa · joshfactorial · May 19, 2026 · May 18, 2026 · May 18, 2026 · May 18, 2026
diff --git a/ChangeLog.md b/ChangeLog.md
@@ -1,3 +1,100 @@
+# NEAT v4.4.3
+Major performance and memory overhaul focused on making NEAT viable for large
+genomes on supercomputing hardware. No user-visible API changes (other than
+chunk size now auto-tuning by default).
+
+**Benchmark (ecoli 10× coverage, 4 threads, identical configs):**
+
+| Metric                  | v4.4.2     | v4.4.3     | Improvement     |
+|-------------------------|------------|------------|-----------------|
+| ecoli SE wall time      | 14:55      | 1:35       | 9.4× faster     |
+| ecoli PE wall time      | 14:46      | 1:35       | 9.4× faster     |
+| ecoli SE total CPU      | 3,227 s    | 331 s      | 9.7× less       |
+| ecoli PE total CPU      | 3,168 s    | 338 s      | 9.4× less       |
+| Peak resident memory    | 549 MB     | 175 MB     | 3.1× less       |
+| Peak heap (memray)      | 1.27 GB    | 0.32 GB    | 4× less         |
+| Per-worker memory       | O(N×cov)   | O(1)       | bounded         |
+| `pysam.sort` calls      | 2          | 0          | gone            |
+| BAM correctness         | 0.06% dups | strict     | fixed           |
+
+**Versus NEAT 2.1 (single-threaded baseline):**
+- SE: 12:28 → 1:35 (7.9× faster, 56% less CPU)
+- PE: 20:12 → 1:35 (12.8× faster, 72% less CPU)
+
+**Scale-test (c_elegans 10× coverage, 4 threads, 100 Mb genome — ~7× the
+ecoli reference):**
+
+| Metric                  | Value                                                   |
+|-------------------------|---------------------------------------------------------|
+| Wall time               | 19:16                                                   |
+| Total CPU               | 4,085 s                                                 |
+| Peak resident memory    | 304 MB                                                  |
+| BAM records             | 6,685,764                                               |
+| BAM sort violations     | 0                                                       |
+| Stitch step (parallel)  | 5.3 s                                                   |
+| Auto-tuned chunk size   | 3.1 Mb (35 chunks)                                      |
+
+Scaling behavior vs ecoli is ~linear in genome size as expected. The stitch
+step is bounded by raw disk I/O via `pysam.cat`, so it stays at single-digit
+seconds even as the BAM grows. Per-worker peak RSS is 304 MB ÷ 4 ≈ 76 MB,
+which is the reference segment + models — independent of chunk size and
+coverage.
+
+**What changed in the hot path:**
+- Vectorized error sampling in `get_sequencing_errors` — replaced a ~150-iteration
+  per-read Python loop with batched numpy. Eliminated 28M `np.prod` calls per
+  185k-read run.
+- Vectorized `get_quality_scores` — replaced per-base scalar `rng.normal` with
+  one batched call.
+- Replaced per-read `PairwiseAligner.align()` in `make_cigar` with a direct
+  walker that builds the CIGAR from known error/mutation positions in O(L).
+  99% of reads now skip alignment entirely.
+- Rewrote `apply_errors` as a single ascending-position pass. The previous
+  implementation did one `np.concatenate` and one `MutableSeq` slice/concat
+  per error — quadratic in errors-per-read. The new pass is linear regardless
+  of error count.
+- Removed redundant `deepcopy(self.reference_segment)` calls in
+  `convert_masking` and `finalize_read_and_write`. Biopython `Seq` is
+  immutable; the downstream operations make their own working copies.
+
+**What changed in the I/O path:**
+- Removed both `pysam.sort` calls. Per-worker BAMs are emitted coordinate-sorted
+  by construction; `pysam.merge` of sorted inputs already produces sorted output.
+  The final sort allocated a 1 GB buffer that dominated peak memory.
+- Replaced `pysam.merge` with `pysam.cat` for the final stitch. cat does a raw
+  BGZF concatenation (no decompression / re-encode), bounded by raw disk I/O
+  instead of BGZF rate. At human-30× scale this is the difference between a
+  multi-hour stitch and a multi-minute one.
+- Each chunk now owns a non-overlapping reference range for read1 placement
+  (`responsibility_length`), enabling the cat-based stitch and eliminating
+  ~0.06% over-coverage in chunk-overlap regions.
+- Streamed FASTQ and BAM records directly to output during read generation.
+  Workers no longer accumulate `reads_to_write` — per-worker memory is now
+  bounded by reference segment + models, not by chunk size × coverage.
+- Stitch steps (FASTQ concat, VCF dedup, BAM cat) now run concurrently in
+  threads. On a single-disk system the wall is bounded by the BAM cat alone;
+  on parallel filesystems the overlap is more pronounced.
+- FASTQ stitch is now byte-level: per-chunk gzip streams are concatenated
+  without decompression / re-encode (concatenated gzip streams form a valid
+  gzip file per the spec).
+
+**Defaults and ergonomics:**
+- `parallel_block_size` now auto-tunes from genome length and thread count
+  (target: ~8 chunks per thread). For small bacterial genomes this matches the
+  old hardcoded 500 kb; for human-scale genomes it produces ~6 Mb chunks
+  instead of ~500 kb, dramatically reducing stitch overhead. Specify the option
+  explicitly to override.
+- FASTQ output is no longer shuffled; reads come out in the natural sampling
+  order. Pipe through `seqkit shuffle` if you need a uniform shuffle (documented
+  in README).
+- Added a "Multi-node deployment on HPC clusters" section to the README
+  showing a SLURM array-job pattern for whole-genome simulation across nodes.
+
+**Caveats:**
+- Several of the vectorization fixes change how the PRNG stream is consumed.
+  Same seed will produce statistically equivalent reads, but not bit-identical
+  to v4.4.2. Re-baseline any regression tests that compared exact output.
+
 # NEAT v4.4.2
 - Added GC bias modeling to generate reads and a function to create a GC bias model from real data.
 - Added improvements and efficiency upgrades to generate-reads.

diff --git a/README.md b/README.md
@@ -251,8 +251,25 @@ Features:
 - Output a BAM file with the 'golden' set of aligned reads. These indicate where each read originated and how it should be aligned with the reference
 - Create paired tumour/normal datasets using characteristics learned from real tumour data
 
+### Output ordering
+
+The BAM file NEAT produces is coordinate-sorted (by construction at write time — no separate sort pass is run, which used to allocate ~1 GB of sort buffer at stitch time).
+
+The FASTQ files are written in the natural fragment-sampling order. This is *roughly* random — fragments are drawn from batched random positions — but is not a strict uniform shuffle. If your downstream tooling assumes a real-sequencer-style shuffle, pipe the FASTQ through [`seqkit shuffle`](https://bioinf.shenwei.me/seqkit/usage/#shuffle):
+
+```sh
+# Single-end
+seqkit shuffle reads.fastq.gz -o reads.shuffled.fastq.gz
+
+# Paired-end — use a shared seed so the two files stay mate-aligned
+seqkit shuffle -s 42 reads_r1.fastq.gz -o reads_r1.shuffled.fastq.gz
+seqkit shuffle -s 42 reads_r2.fastq.gz -o reads_r2.shuffled.fastq.gz
+```
+
 ### Estimated runtimes
 
+> **Note:** The tables below are from the original NEAT 4.4 (v4.4.0) benchmark. NEAT v4.4.3 is roughly **9× faster on multi-threaded ecoli** and uses **~3× less memory** thanks to the performance work landed in that release. The relative shape of the tables (size scaling, contig vs. size mode tradeoffs) remains accurate, but absolute runtimes should be divided by ~5–10× for v4.4.3+. See ChangeLog v4.4.3 for a detailed before/after table on ecoli and a c_elegans scale-test.
+
 To give users a sense of how long `neat read-simulator` runs may take, we benchmarked NEAT 4.4 on several reference genomes. All runs were paired-end, with read length of 150 bp, coverage of 10, fragment mean of 300 bp, and fragment standard deviation of 50 bp. Runtimes are reported as the average across three unique runs (`Avg. time (ms)`) and the corresponding runtime in minutes. Cells marked with N/A indicate that NEAT was not able to run to completion.
 
 Benchmarks were run on a System76 Meerkat with a 13th Gen Intel Core i3-1315U (8 logical cores, up to 4.50 GHz) and 16 GiB RAM, using a 512 GB SSD and Ubuntu 24.04.3 LTS (Linux kernel 6.14). Actual runtimes will vary depending on your hardware.
@@ -347,7 +364,7 @@ neat read-simulator                 \
 
 ```
 ### Parallelizing simulation
-In this case, you would want to split the contig into blocks, rather than reading by contig. Even in single-threaded mode, this is likely to be faster. The default block size of 500,000 yields results quickly on a variety of datasets and can be easily modified to meet your requirements.
+Split the contig into blocks rather than reading by contig. Even in single-threaded mode this is likely to be faster. The chunk size auto-tunes from total genome length and thread count, targeting roughly 8 chunks per thread for good load balancing — on small bacterial genomes you get ~500 kb chunks (similar to NEAT's old hardcoded default), on human-scale genomes you get ~6 Mb chunks (a few hundred total instead of thousands). Specify `parallel_block_size` explicitly if you want to override.
 
 Also, we demonstrate the situation where you do not want any logs produced:
 
@@ -361,7 +378,8 @@ fragment_mean: 350
 fragment_st_dev: 50
 threads: 12
 parallel_mode: size
-parallel_block_size: 500000
+# parallel_block_size omitted: auto-tuned from genome length and thread count.
+# Set explicitly (e.g. parallel_block_size: 1000000) only if you have a reason.
 ```
 Then run with the command:
 ```
@@ -371,6 +389,61 @@ neat read-simulator                 \
         -o /home/me/simulated_reads/
 ```
 
+### Multi-node deployment on HPC clusters
+
+NEAT runs on a single node using Python's `multiprocessing`. To use multiple nodes on a supercomputer, dispatch one NEAT job per contig (or contig group) as a job-array element and concatenate the outputs afterwards. Each array task gets its own node and uses all available cores on it. NEAT itself doesn't need to know about the cluster.
+
+SLURM example with one task per human chromosome (24 tasks):
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=neat-array
+#SBATCH --array=1-24
+#SBATCH --cpus-per-task=64
+#SBATCH --mem=64G
+#SBATCH --time=04:00:00
+
+CHROM=$(awk "NR==$SLURM_ARRAY_TASK_ID" chroms.txt)   # e.g. chr1, chr2, ...
+samtools faidx hg38.fa "$CHROM" > "ref_${CHROM}.fa"   # one chromosome per task
+
+cat > "config_${CHROM}.yml" <<YAML
+reference: ref_${CHROM}.fa
+read_len: 150
+coverage: 30
+paired_ended: true
+fragment_mean: 400
+fragment_st_dev: 50
+produce_bam: true
+produce_fastq: true
+produce_vcf: true
+threads: ${SLURM_CPUS_PER_TASK}
+parallel_mode: size
+YAML
+
+neat read-simulator -c "config_${CHROM}.yml" \
+                    -o "out/${CHROM}/" \
+                    -p "${CHROM}"
+```
+
+After the array completes, concatenate the per-chromosome outputs:
+
+```bash
+# BAMs are already coordinate-sorted within each chromosome; cat in chromosome order
+# produces a globally-sorted whole-genome BAM with no further sort needed.
+samtools cat -o all.bam out/chr1/chr1_golden.bam out/chr2/chr2_golden.bam ... out/chrY/chrY_golden.bam
+samtools index all.bam
+
+# FASTQs: gzip-concat preserves a valid gzip stream
+cat out/*/chr*_r1.fastq.gz > all_r1.fastq.gz
+cat out/*/chr*_r2.fastq.gz > all_r2.fastq.gz
+
+# VCFs: use bcftools concat for proper merging
+bcftools concat -o all.vcf.gz -Oz out/*/chr*_golden.vcf.gz
+bcftools index all.vcf.gz
+```
+
+This gives you whole-genome simulation in roughly `single-chromosome-time / nodes` wall time. For a 30× human genome with 24 nodes (each 64 cores), a per-chromosome run takes ~10 min and the array finishes in ~10 min wall, plus a few minutes for the final concat.
+
 ### Insert specific variants
 Simulate a whole genome dataset with only the variants in the provided VCF file using `-v` and setting mutation rate to 0 with `-M`.
 

diff --git a/neat/models/error_models.py b/neat/models/error_models.py
@@ -81,32 +81,27 @@ def get_quality_scores(
         :return: An array of quality scores.
         """
         if self.uniform_quality_score:
-            return np.array([self.uniform_quality_score] * length)
-        else:
-            if length == model_read_length:
-                quality_index_map = np.arange(model_read_length)
-            else:
-                # This is basically a way to evenly spread the distribution across the number of bases in the read
-                quality_index_map = np.array(
-                    [max([0, model_read_length * n // length]) for n in range(length)]
-                )
-
-            temp_qual_array = []
-            for i in quality_index_map:
-                score = rng.normal(
-                    self.quality_score_probabilities[i][0],
-                    scale=self.quality_score_probabilities[i][1]
-                )
-                # make sure score is in range and an int
-                score = round(score)
-                if score > 42:
-                    score = 42
-                if score < 1:
-                    score = 1
+            return np.full(length, self.uniform_quality_score, dtype=int)
 
-                temp_qual_array.append(score)
+        # Map each position in the read onto a row of the (model_read_length, 2) score
+        # parameters table. When read length matches the model exactly this is just an
+        # identity; otherwise we evenly spread the model distribution across the read.
+        if length == model_read_length:
+            quality_index_map = np.arange(model_read_length, dtype=np.int64)
+        else:
+            quality_index_map = np.maximum(
+                0, (model_read_length * np.arange(length, dtype=np.int64)) // length
+            )
 
-            return np.array(temp_qual_array)
+        # Batched per-position normal draws — one rng.normal call returning `length` draws
+        # with element-wise (mean, scale) parameters. Replaces a ~150-iteration Python
+        # loop calling rng.normal scalar-per-base. Same statistical distribution but the
+        # PRNG stream is consumed in a different order, so seeded outputs are not
+        # bit-identical to the prior scalar-loop implementation.
+        means = self.quality_score_probabilities[quality_index_map, 0]
+        scales = self.quality_score_probabilities[quality_index_map, 1]
+        scores = rng.normal(means, scales)
+        return np.clip(np.rint(scores).astype(int), 1, 42)
 
 
 class MarkovQualityModel:
@@ -182,30 +177,34 @@ def get_sequencing_errors(
         :return: Modified sequence and associated quality scores
         """
 
-        error_indexes = []
         introduced_errors = []
-        # pre-compute the error rate for each quality score. This is the inverse of the phred score equation
-        quality_score_error_rate: dict[int, float] = {x: 10. ** (-x / 10) for x in quality_scores}
 
         # The use case here would be someone running a simulation where they want no sequencing errors.
         # No need to run any loops in this case.
         if self.average_error == 0:
             return introduced_errors
-        else:
-            i = len(quality_scores)
-            while len(error_indexes) < num_errors and i > 0:
-                index = rng.choice(list(range(len(quality_scores))))
-                if rng.random() < quality_score_error_rate[quality_scores[index]]:
-                    error_indexes.append(index)
-                i -= 1
-            # Fallback: if quality scores are too high to naturally reach num_errors,
-            # force errors at positions with at-or-below-median quality scores.
-            # Using <= so that uniform quality arrays (all scores equal) always make progress.
+
+        n = len(quality_scores)
+        # Batched rejection sampling: draw n candidate indices and n uniform deviates in two numpy
+        # calls, then accept the first num_errors candidates where the deviate is below the
+        # quality-derived error rate. Equivalent in distribution to the per-iteration scalar loop
+        # but ~150x cheaper in Python overhead. Statistical caveat: this changes the order in
+        # which the underlying PRNG stream is consumed, so seeded runs are not bit-identical to
+        # the prior interleaved-draw implementation.
+        candidate_indices = rng.integers(n, size=n)
+        candidate_randoms = rng.random(size=n)
+        rates_at_candidates = 10.0 ** (-quality_scores[candidate_indices].astype(float) / 10.0)
+        accepted = candidate_indices[candidate_randoms < rates_at_candidates]
+        error_indexes = accepted[:num_errors].tolist()
+
+        if len(error_indexes) < num_errors:
+            # Fallback: if quality scores are too high to naturally reach num_errors, force errors
+            # at positions with at-or-below-median quality scores. Using <= so that uniform
+            # quality arrays (all scores equal) always make progress.
             median_score = median(quality_scores)
-            while len(error_indexes) < num_errors:
-                index = rng.integers(len(quality_scores))
-                if quality_scores[index] <= median_score:
-                    error_indexes.append(index)
+            eligible = np.flatnonzero(quality_scores <= median_score)
+            needed = num_errors - len(error_indexes)
+            error_indexes.extend(rng.choice(eligible, size=needed, replace=True).tolist())
 
         total_indel_length = 0
         # To prevent deletion collisions