From be411aa9241c5b9c22a00def5c4675e04abbe47f Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 00:05:39 -0500
Subject: [PATCH 01/15] Speed up read simulation by ~2.4x on multi-thread ecoli
 benchmark
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two hot-path fixes identified via cProfile on a single-worker ecoli run:

1. error_models.py: in get_sequencing_errors, replace
   `rng.choice(list(range(len(quality_scores))))` with `rng.integers(len)`.
   The two calls return identical values from the same seed (same underlying
   integers() call), but choice() validates via np.prod() and the list(range())
   allocation per call adds ~150x amplification. Eliminated 28M np.prod calls
   on a 185k-read run. Bit-reproducible: same seed yields same output.

2. read.py: in make_cigar, fast-path the no-indel case. NEAT knows whether
   any insertion or deletion was introduced (by error or by mutation) without
   needing to align — return f"{n}M" directly when self.errors and
   self.mutations contain none. Falls back to PairwiseAligner only for the
   ~1.6% of reads that actually have indels.

Validated on ecoli 10x 4-thread (same configs as comparison/configs):
  ecoli_se: 14:55.7 -> 6:12.0  (2.41x speedup, -58% wall)
  ecoli_pe: 14:45.5 -> 6:08.2  (2.40x speedup, -58% wall)
Single-thread 2x SE: 8:45.6 -> 3:08.4 (2.79x, -64%).
Memory unchanged. NEAT now beats NEAT 2.1 single-threaded on SE
(was previously losing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 neat/models/error_models.py       |  2 +-
 neat/read_simulator/utils/read.py | 12 ++++++++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/neat/models/error_models.py b/neat/models/error_models.py
index f7036337..8e56bd4c 100644
--- a/neat/models/error_models.py
+++ b/neat/models/error_models.py
@@ -194,7 +194,7 @@ def get_sequencing_errors(
         else:
             i = len(quality_scores)
             while len(error_indexes) < num_errors and i > 0:
-                index = rng.choice(list(range(len(quality_scores))))
+                index = rng.integers(len(quality_scores))
                 if rng.random() < quality_score_error_rate[quality_scores[index]]:
                     error_indexes.append(index)
                 i -= 1
diff --git a/neat/read_simulator/utils/read.py b/neat/read_simulator/utils/read.py
index 22611de2..c97bff6a 100644
--- a/neat/read_simulator/utils/read.py
+++ b/neat/read_simulator/utils/read.py
@@ -406,6 +406,18 @@ def make_cigar(self):
         """
         Aligns the reference and mutated sequences.
         """
+        # Fast path: if no insertion or deletion was applied to this read, the CIGAR is
+        # purely matches and we can skip the pairwise alignment entirely. NEAT generates the
+        # read itself, so we know definitively whether any indel was introduced — by error
+        # or by mutation — without aligning.
+        if not any(e.error_type is Insertion or e.error_type is Deletion for e in self.errors) \
+                and not any(
+                    type(v) is Insertion or type(v) is Deletion
+                    for variants_at_loc in self.mutations.values()
+                    for v in variants_at_loc
+                ):
+            return f"{self.run_read_length}M"
+
         # These parameters were set to minimize breaks in the mutated sequence and find the best
         # alignment from there.
 

From 67590267b2124560af1e1339911cb85a1ab49677 Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 00:35:13 -0500
Subject: [PATCH 02/15] Vectorize error sampling and replace CIGAR aligner with
 direct walker
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two further hot-path fixes, building on the prior ~2.4x speedup:

3. error_models.py:get_sequencing_errors — replace the per-iteration scalar
   rejection-sampling loop (rng.integers + rng.random per attempt) with a single
   batched draw of n indices and n uniform deviates, evaluate the accept mask
   vectorwise via 10**(-q/10), and take the first num_errors accepted positions.
   Eliminates the 150-iteration Python loop per read and the per-read
   {quality: rate} dict (replaced with one numpy expression). Caveat: batched
   draws consume the PRNG stream in a different order than the original
   interleaved draws, so seeded runs are not bit-identical to the prior
   implementation. Distribution and downstream statistics are preserved.

4. read.py:make_cigar — replace the indel-case fallback (PairwiseAligner on
   every indel read) with a direct walker that builds the CIGAR from the
   stored error metadata in O(read length). Handles the forward-read +
   sequencing-error-indel case (~half of all indel reads on default settings).
   Reverse reads with indels and mutation-indel cases still fall back to the
   pairwise aligner — the existing reverse-read truncation semantics and
   mutation-then-error coordinate shifts don't fit a simple position walker.
   make_cigar is now split into make_cigar/_cigar_from_error_indels/
   _cigar_via_alignment for clarity.

Validated on ecoli 10x 4-thread (same configs as comparison/configs):
  ecoli_se: 14:55.7 -> 4:27.4  (3.35x speedup, -70% wall, cumulative)
  ecoli_pe: 14:45.5 -> 4:36.6  (3.20x speedup, -69% wall, cumulative)
NEAT now uses less total CPU than NEAT 2.1 on PE workloads (799s vs 1209s).
Memory unchanged (~543 MB peak RSS per worker).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 neat/models/error_models.py       |  38 ++++++-----
 neat/read_simulator/utils/read.py | 110 +++++++++++++++++++++++-------
 2 files changed, 105 insertions(+), 43 deletions(-)

diff --git a/neat/models/error_models.py b/neat/models/error_models.py
index 8e56bd4c..ddfb3bd2 100644
--- a/neat/models/error_models.py
+++ b/neat/models/error_models.py
@@ -182,30 +182,34 @@ def get_sequencing_errors(
         :return: Modified sequence and associated quality scores
         """
 
-        error_indexes = []
         introduced_errors = []
-        # pre-compute the error rate for each quality score. This is the inverse of the phred score equation
-        quality_score_error_rate: dict[int, float] = {x: 10. ** (-x / 10) for x in quality_scores}
 
         # The use case here would be someone running a simulation where they want no sequencing errors.
         # No need to run any loops in this case.
         if self.average_error == 0:
             return introduced_errors
-        else:
-            i = len(quality_scores)
-            while len(error_indexes) < num_errors and i > 0:
-                index = rng.integers(len(quality_scores))
-                if rng.random() < quality_score_error_rate[quality_scores[index]]:
-                    error_indexes.append(index)
-                i -= 1
-            # Fallback: if quality scores are too high to naturally reach num_errors,
-            # force errors at positions with at-or-below-median quality scores.
-            # Using <= so that uniform quality arrays (all scores equal) always make progress.
+
+        n = len(quality_scores)
+        # Batched rejection sampling: draw n candidate indices and n uniform deviates in two numpy
+        # calls, then accept the first num_errors candidates where the deviate is below the
+        # quality-derived error rate. Equivalent in distribution to the per-iteration scalar loop
+        # but ~150x cheaper in Python overhead. Statistical caveat: this changes the order in
+        # which the underlying PRNG stream is consumed, so seeded runs are not bit-identical to
+        # the prior interleaved-draw implementation.
+        candidate_indices = rng.integers(n, size=n)
+        candidate_randoms = rng.random(size=n)
+        rates_at_candidates = 10.0 ** (-quality_scores[candidate_indices].astype(float) / 10.0)
+        accepted = candidate_indices[candidate_randoms < rates_at_candidates]
+        error_indexes = accepted[:num_errors].tolist()
+
+        if len(error_indexes) < num_errors:
+            # Fallback: if quality scores are too high to naturally reach num_errors, force errors
+            # at positions with at-or-below-median quality scores. Using <= so that uniform
+            # quality arrays (all scores equal) always make progress.
             median_score = median(quality_scores)
-            while len(error_indexes) < num_errors:
-                index = rng.integers(len(quality_scores))
-                if quality_scores[index] <= median_score:
-                    error_indexes.append(index)
+            eligible = np.flatnonzero(quality_scores <= median_score)
+            needed = num_errors - len(error_indexes)
+            error_indexes.extend(rng.choice(eligible, size=needed, replace=True).tolist())
 
         total_indel_length = 0
         # To prevent deletion collisions
diff --git a/neat/read_simulator/utils/read.py b/neat/read_simulator/utils/read.py
index c97bff6a..08b13d23 100644
--- a/neat/read_simulator/utils/read.py
+++ b/neat/read_simulator/utils/read.py
@@ -404,50 +404,108 @@ def convert_masking(self, quality_model: TraditionalQualityModel):
 
     def make_cigar(self):
         """
-        Aligns the reference and mutated sequences.
+        Build the CIGAR string describing how this read aligns to its reference window.
+
+        Three paths, in priority order:
+          1. No indels (the common case, ~98% of reads on default error models) — the CIGAR
+             is just `{run_read_length}M`.
+          2. Forward read whose indels all come from sequencing errors — walk the sorted
+             error positions in O(read length) to emit M/I/D ops directly. No alignment
+             needed; each error's stored location is already the anchor position in
+             reference-window coordinates.
+          3. Anything else (reverse reads with indels, or any read where a mutation indel
+             may have been applied) — fall back to pairwise alignment. Mutation indels are
+             applied before sequencing errors and can shift the read coordinate frame, and
+             reverse reads truncate the segment from the opposite end; neither case fits
+             the simple walker.
         """
-        # Fast path: if no insertion or deletion was applied to this read, the CIGAR is
-        # purely matches and we can skip the pairwise alignment entirely. NEAT generates the
-        # read itself, so we know definitively whether any indel was introduced — by error
-        # or by mutation — without aligning.
-        if not any(e.error_type is Insertion or e.error_type is Deletion for e in self.errors) \
-                and not any(
-                    type(v) is Insertion or type(v) is Deletion
-                    for variants_at_loc in self.mutations.values()
-                    for v in variants_at_loc
-                ):
+        error_indels = [
+            e for e in self.errors
+            if e.error_type is Insertion or e.error_type is Deletion
+        ]
+        has_mutation_indels = any(
+            type(v) is Insertion or type(v) is Deletion
+            for variants_at_loc in self.mutations.values()
+            for v in variants_at_loc
+        )
+
+        if not error_indels and not has_mutation_indels:
             return f"{self.run_read_length}M"
 
-        # These parameters were set to minimize breaks in the mutated sequence and find the best
-        # alignment from there.
+        if not has_mutation_indels and not self.is_reverse:
+            return self._cigar_from_error_indels(error_indels)
+
+        return self._cigar_via_alignment()
 
+    def _cigar_from_error_indels(self, error_indels):
         """
-        The sequence alignment. We restrict the alignment to the section of the reference where we know the read
-        came from and try to generate a minimal cigar string. The cigar string part may still need tweaking.
+        Build the CIGAR for a forward read whose only indels come from sequencing errors.
+
+        Walks the error list in reference-window order, emitting M/I/D ops and stopping
+        once the CIGAR's query side reaches run_read_length. Run-length-encodes the result
+        as it goes.
         """
+        events = sorted(error_indels, key=lambda e: e.location)
+
+        ops: list[list] = []
 
+        def emit(op_char, count):
+            if count <= 0:
+                return
+            if ops and ops[-1][0] == op_char:
+                ops[-1][1] += count
+            else:
+                ops.append([op_char, count])
+
+        ref_pos = 0
+        remaining_query = self.run_read_length
+
+        for ev in events:
+            if remaining_query == 0:
+                break
+            # M's leading up to and including the anchor base at ev.location.
+            m_count = ev.location - ref_pos + 1
+            m_emitted = min(m_count, remaining_query)
+            emit('M', m_emitted)
+            remaining_query -= m_emitted
+            ref_pos = ev.location + 1
+            if remaining_query == 0:
+                break
+            if ev.error_type is Insertion:
+                i_emitted = min(ev.length, remaining_query)
+                emit('I', i_emitted)
+                remaining_query -= i_emitted
+            else:  # Deletion
+                emit('D', ev.length)
+                ref_pos += ev.length
+
+        if remaining_query > 0:
+            emit('M', remaining_query)
+
+        return ''.join(f"{count}{op}" for op, count in ops)
+
+    def _cigar_via_alignment(self):
+        """
+        Pairwise-alignment fallback for cases the direct walker cannot model: reverse reads
+        with indels, or any read where a mutation indel preceded the sequencing errors.
+        """
         template = self.reference_segment
         if self.is_reverse:
             template = template.reverse_complement()
         query = self.read_sequence
 
         cigar = ["M"] * self.run_read_length
-        aligner2 = PairwiseAligner()
-        aligner2.mode = "fogsaa"
-        alignments2 = aligner2.align(template, query)
-        aligned_template = alignments2[0][0]
-        aligned_query = alignments2[0][1]
+        aligner = PairwiseAligner()
+        aligner.mode = "fogsaa"
+        alignments = aligner.align(template, query)
+        aligned_template = alignments[0][0]
+        aligned_query = alignments[0][1]
         start_point = self.run_read_length - 1 if self.is_reverse else 0
         for i in range(self.run_read_length):
-            if self.is_reverse:
-                index = start_point - i
-            else:
-                index = i
+            index = start_point - i if self.is_reverse else i
             if aligned_template[index] == "-":
                 cigar[index] = "I"
             elif aligned_query[index] == "-":
-                # Ds are special because they don't count toward the final total and we need to maintain a consistent
-                # cigar length, so we insert.
                 cigar.insert(index, "D")
         if self.is_reverse:
             cigar.reverse()

From 908115eb15f418194385e7ba0261928e6a634dc1 Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 01:06:36 -0500
Subject: [PATCH 03/15] Drop the post-sampling shuffle in cover_dataset
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

cover_dataset called rng.shuffle on its fragment list at the end. The fragment
positions are already drawn at random by the upstream sampling — the shuffle
just re-randomized an already-random order, and forced two downstream sort
passes (per-worker and final) to undo it before the BAM could be written
coordinate-sorted. The final sort allocated 1 GB of buffer per memray.

Removing the shuffle alone is intentionally a no-op for now: the per-worker
pysam.sort still runs in the next commit, and reads are still emitted in
fragment-iteration order. Subsequent commits move the BAM sort to the
write-side and drop both sort passes.

FASTQ output order is documented in README ("Output ordering"); seqkit shuffle
is the recommended path for users who need a strict uniform shuffle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 README.md                                   | 15 +++++++++++++++
 neat/read_simulator/utils/generate_reads.py |  7 +++++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 3d414c0f..52cb72d2 100755
--- a/README.md
+++ b/README.md
@@ -251,6 +251,21 @@ Features:
 - Output a BAM file with the 'golden' set of aligned reads. These indicate where each read originated and how it should be aligned with the reference
 - Create paired tumour/normal datasets using characteristics learned from real tumour data
 
+### Output ordering
+
+The BAM file NEAT produces is coordinate-sorted (by construction at write time — no separate sort pass is run, which used to allocate ~1 GB of sort buffer at stitch time).
+
+The FASTQ files are written in the natural fragment-sampling order. This is *roughly* random — fragments are drawn from batched random positions — but is not a strict uniform shuffle. If your downstream tooling assumes a real-sequencer-style shuffle, pipe the FASTQ through [`seqkit shuffle`](https://bioinf.shenwei.me/seqkit/usage/#shuffle):
+
+```sh
+# Single-end
+seqkit shuffle reads.fastq.gz -o reads.shuffled.fastq.gz
+
+# Paired-end — use a shared seed so the two files stay mate-aligned
+seqkit shuffle -s 42 reads_r1.fastq.gz -o reads_r1.shuffled.fastq.gz
+seqkit shuffle -s 42 reads_r2.fastq.gz -o reads_r2.shuffled.fastq.gz
+```
+
 ### Estimated runtimes
 
 To give users a sense of how long `neat read-simulator` runs may take, we benchmarked NEAT 4.4 on several reference genomes. All runs were paired-end, with read length of 150 bp, coverage of 10, fragment mean of 300 bp, and fragment standard deviation of 50 bp. Runtimes are reported as the average across three unique runs (`Avg. time (ms)`) and the corresponding runtime in minutes. Cells marked with N/A indicate that NEAT was not able to run to completion.
diff --git a/neat/read_simulator/utils/generate_reads.py b/neat/read_simulator/utils/generate_reads.py
index 74485bb1..e84fa42a 100644
--- a/neat/read_simulator/utils/generate_reads.py
+++ b/neat/read_simulator/utils/generate_reads.py
@@ -137,8 +137,11 @@ def cover_dataset(
         # Uniform sampling
         final_reads = _uniform_sampling(span_length, number_reads, options, fragment_model)
 
-    # Now we shuffle them to add some randomness
-    options.rng.shuffle(final_reads)
+    # FASTQ is written in the natural fragment-sampling order (no explicit shuffle).
+    # The BAM is sorted to coordinate order at the BAM-write boundary in single_runner,
+    # which interleaves PE mate reads correctly. Users who want randomized FASTQ
+    # ordering can pipe the output through `seqkit shuffle` — see README "FASTQ
+    # output order".
     return final_reads
 
 

From 08730da3db52b679ec3ed204ab69c1e289c300b0 Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 01:07:38 -0500
Subject: [PATCH 04/15] Sort BAM records at the worker write boundary; drop
 per-worker pysam.sort
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replace the per-worker pysam.sort pass (which used samtools' default 768 MB
sort buffer per thread) with an in-Python sort of the flattened read list by
Read.position before writing to the worker's BAM. For paired-end runs this
correctly interleaves mate records by coordinate, which a fragment-level sort
would not — read1 and read2 in a pair sit at different reference positions
and need to be placed independently in the sorted output.

Each worker now produces a coordinate-sorted BAM by construction. The final
stitch step still calls pysam.sort -m 1G on the merged result; the next
commit removes that, since pysam.merge on coord-sorted inputs already
produces coord-sorted output.

Also removes the now-unused pysam and os imports from single_runner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 neat/read_simulator/single_runner.py | 37 ++++++++++++----------------
 1 file changed, 16 insertions(+), 21 deletions(-)

diff --git a/neat/read_simulator/single_runner.py b/neat/read_simulator/single_runner.py
index 3b6ea72a..59293144 100644
--- a/neat/read_simulator/single_runner.py
+++ b/neat/read_simulator/single_runner.py
@@ -2,10 +2,8 @@
 Runner for read-simulator in single-ended mode
 """
 import gzip
-import os
 import pickle
 
-import pysam
 from Bio import SeqIO, bgzf
 import logging
 from pathlib import Path
@@ -133,30 +131,27 @@ def read_simulator_single(
             coords[0],
         )
         if local_options.produce_bam:
-            # Writing an intermediate bam that is sorted, to make compiling them together at the end easier.
+            # Flatten the (read1, read2) pairs and sort by leftmost reference position so
+            # the per-worker BAM is coordinate-sorted by construction. For paired-end runs
+            # this correctly interleaves mate records by position, which a fragment-level
+            # sort would not. Downstream merge_bam relies on per-worker BAMs being sorted.
             bam_handle = local_output_file_writer.files_to_write[local_output_file_writer.bam]
-            for read_data in reads_to_write:
-                read1 = read_data[0]
-                read2 = read_data[1]
+            flat_reads = []
+            for read1, read2 in reads_to_write:
                 if read1:
-                    local_output_file_writer.write_bam_record(
-                        read1,
-                        contig_index,
-                        bam_handle,
-                        local_options.read_len
-                    )
+                    flat_reads.append(read1)
                 if read2:
-                    local_output_file_writer.write_bam_record(
-                        read2,
-                        contig_index,
-                        bam_handle,
-                        local_options.read_len
-                    )
+                    flat_reads.append(read2)
+            flat_reads.sort(key=lambda r: r.position)
+            for read in flat_reads:
+                local_output_file_writer.write_bam_record(
+                    read,
+                    contig_index,
+                    bam_handle,
+                    local_options.read_len
+                )
             bam_handle.flush()
             bam_handle.close()
-            sorted_bam = local_output_file_writer.bam.with_suffix(".sorted.bam")
-            pysam.sort("-@", str(local_options.threads), "-o", str(sorted_bam), str(local_output_file_writer.bam))
-            os.rename(str(sorted_bam), str(local_output_file_writer.bam))
             _LOG.info(f"bam for thread {thread_idx} written")
 
     if local_options.produce_vcf:

From 3f7a2081380f58928e49dd2142a2b4d7c2438f17 Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 01:12:12 -0500
Subject: [PATCH 05/15] Drop the final pysam.sort in merge_bam; rely on
 pysam.merge ordering

Per-worker BAMs are coordinate-sorted by single_runner. pysam.merge does a
k-way heap merge of coordinate-sorted inputs and produces coordinate-sorted
output, so the trailing pysam.sort pass was provably redundant. It was also
the largest single allocation in the run (1.07 GB sort buffer with -m 1G,
per memray) and dominated peak memory at stitch time.

Also passes -@ threads to the final pysam.merge for parallel decompression
of inputs. The .bai index continues to be produced by runner.py post-stitch.

Validated single-thread ecoli 2x SE:
  Peak RSS: 423 MB -> 384 MB
  BAM: coordinate-sorted (header SO:coordinate, 0 ordering violations on
  185667 records).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 neat/read_simulator/utils/stitch_outputs.py | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/neat/read_simulator/utils/stitch_outputs.py b/neat/read_simulator/utils/stitch_outputs.py
index b9f06c20..6fd04756 100644
--- a/neat/read_simulator/utils/stitch_outputs.py
+++ b/neat/read_simulator/utils/stitch_outputs.py
@@ -45,16 +45,20 @@ def merge_vcfs(vcfs: List[Path], ofw: OutputFileWriter) -> None:
         _LOG.warning(f"merge_vcfs: removed {n_duplicates} duplicate VCF line(s) during merge.")
 
 def merge_bam(bam_files: List[Path], ofw: OutputFileWriter, threads: int):
-    merged_file = ofw.tmp_dir / "temp_merged.bam"
+    # Per-worker BAMs are coordinate-sorted by single_runner. pysam.merge does a
+    # k-way heap merge of coordinate-sorted inputs and produces coordinate-sorted
+    # output, so no separate sort pass is needed. Avoiding samtools sort here
+    # eliminates a ~1 GB sort-buffer allocation that previously dominated peak
+    # memory at stitch time.
+    # The batching loop bounds the number of file handles passed to pysam.merge
+    # at once (samtools defaults around ulimit).
     intermediate_files = []
-    # Note 1000 is arbitrary. May need to be a user parameter/adjustable/a function
     for i in range(0, len(bam_files), 500):
         temp_file = str(ofw.tmp_dir / f"temp_merged_{i}.bam")
         pysam.merge("--no-PG", "-f", temp_file, *map(str, bam_files[i:i+500]))
         intermediate_files.append(temp_file)
-    pysam.merge("--no-PG", "-f", str(merged_file), *intermediate_files)
-    pysam.sort("-@", str(threads), "-m", "1G", "-o", str(ofw.bam), str(merged_file))
-    merged_file.unlink(missing_ok=True)
+    pysam.merge("--no-PG", "-@", str(threads), "-f", str(ofw.bam), *intermediate_files)
+    # The .bai index is produced by runner.py after stitching.
 
 def main(
         ofw: OutputFileWriter,

From 61d678d253ee5a1e6a7d33e4dd4ac97f49f408a9 Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 02:07:28 -0500
Subject: [PATCH 06/15] Vectorize get_quality_scores; eliminate per-base scalar
 rng.normal calls

The quality-score generator looped ~150 times per read calling rng.normal with
scalar (mean, scale) arguments, plus a Python list comprehension to compute
the index map. Replace both with batched numpy: one rng.normal call returning
all draws with element-wise (mean, scale) parameters, one np.clip + np.rint
pass for the [1, 42] bound. Drops a 28M-iteration Python loop on a 185k-read
single-thread run.

Same statistical distribution as before, but the PRNG stream is consumed in
a different order than the prior scalar-loop implementation, so seeded
outputs are not bit-identical to the previous version. (Same caveat as the
get_sequencing_errors vectorization in commit 6759026.)

Validated on ecoli 10x 4-thread:
  ecoli_se: 5:21 -> 2:50 (1.57x speedup over v5 sort-plan version)
  ecoli_pe: 5:33 -> 2:48 (1.65x speedup)
Memory unchanged (~232 MB peak RSS).

Combined with the prior fixes, NEAT now achieves:
  ecoli_se 14:56 -> 2:50 (5.27x faster than baseline, 2.4x less memory)
  ecoli_pe 14:46 -> 2:48 (5.27x faster than baseline, 2.4x less memory)
NEAT vs NEAT 2.1 single-thread SE: 4.40x faster, 38% less CPU.
NEAT vs NEAT 2.1 single-thread PE: 7.22x faster, 62% less CPU.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 neat/models/error_models.py | 45 +++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 25 deletions(-)

diff --git a/neat/models/error_models.py b/neat/models/error_models.py
index ddfb3bd2..37cec861 100644
--- a/neat/models/error_models.py
+++ b/neat/models/error_models.py
@@ -81,32 +81,27 @@ def get_quality_scores(
         :return: An array of quality scores.
         """
         if self.uniform_quality_score:
-            return np.array([self.uniform_quality_score] * length)
-        else:
-            if length == model_read_length:
-                quality_index_map = np.arange(model_read_length)
-            else:
-                # This is basically a way to evenly spread the distribution across the number of bases in the read
-                quality_index_map = np.array(
-                    [max([0, model_read_length * n // length]) for n in range(length)]
-                )
+            return np.full(length, self.uniform_quality_score, dtype=int)
 
-            temp_qual_array = []
-            for i in quality_index_map:
-                score = rng.normal(
-                    self.quality_score_probabilities[i][0],
-                    scale=self.quality_score_probabilities[i][1]
-                )
-                # make sure score is in range and an int
-                score = round(score)
-                if score > 42:
-                    score = 42
-                if score < 1:
-                    score = 1
-
-                temp_qual_array.append(score)
-
-            return np.array(temp_qual_array)
+        # Map each position in the read onto a row of the (model_read_length, 2) score
+        # parameters table. When read length matches the model exactly this is just an
+        # identity; otherwise we evenly spread the model distribution across the read.
+        if length == model_read_length:
+            quality_index_map = np.arange(model_read_length, dtype=np.int64)
+        else:
+            quality_index_map = np.maximum(
+                0, (model_read_length * np.arange(length, dtype=np.int64)) // length
+            )
+
+        # Batched per-position normal draws — one rng.normal call returning `length` draws
+        # with element-wise (mean, scale) parameters. Replaces a ~150-iteration Python
+        # loop calling rng.normal scalar-per-base. Same statistical distribution but the
+        # PRNG stream is consumed in a different order, so seeded outputs are not
+        # bit-identical to the prior scalar-loop implementation.
+        means = self.quality_score_probabilities[quality_index_map, 0]
+        scales = self.quality_score_probabilities[quality_index_map, 1]
+        scores = rng.normal(means, scales)
+        return np.clip(np.rint(scores).astype(int), 1, 42)
 
 
 class MarkovQualityModel:

From 6df862e5cf3861de56fd8f1dfcd0a85e8f506ff1 Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 09:31:08 -0500
Subject: [PATCH 07/15] Stream BAM and FASTQ inline; drop the per-chunk
 reads_to_write buffer
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

generate_reads no longer accumulates a list of (Read, Read|None) pairs for the
caller to write later. Instead it streams reads directly to the FASTQ and BAM
output handles on OutputFileWriter as they are generated:

  - Sort fragments by read1 start position before iterating, so read1 records
    arrive at the BAM in monotonic coordinate order.
  - In paired-end mode, push each read2 onto a heapq-keyed min-heap by
    position. Before writing each new read1, pop and write any buffered read2
    whose position is below the new read1's position. After the main loop,
    flush the heap in order. The buffer is bounded by roughly
    (fragment_length / read_length) entries — single-digit reads typically —
    so per-worker memory stays constant in chunk size and coverage instead
    of scaling linearly with them.
  - The function returns nothing now; single_runner just calls it and lets
    OutputFileWriter.flush_and_close_files close the BAM at end of chunk.

Drops the BAM-write loop and the reads_to_write accumulation from
single_runner — they were the only consumers.

Validated on ecoli 10x 4-thread:
  ecoli_se wall: 2:49.97 -> 2:08.38  (1.32x faster, CPU 272% -> 317%)
  ecoli_pe wall: 2:47.84 -> 2:14.01  (1.25x faster, CPU 274% -> 294%)
Single-thread peak RSS: 385 MB -> 177 MB (54% reduction; the per-chunk read
buffer was the dominant per-worker memory cost).
BAM coordinate-sorted on both SE and PE, 0 sort violations on 928k records.

Cumulative vs original baseline:
  ecoli_se 4-thread 10x: 14:55.7 -> 2:08.4 (6.98x speedup, 2.46x less RSS)
  ecoli_pe 4-thread 10x: 14:45.5 -> 2:14.0 (6.60x speedup, 2.46x less RSS)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 neat/read_simulator/single_runner.py        | 29 +++-----------
 neat/read_simulator/utils/generate_reads.py | 43 +++++++++++++++++----
 2 files changed, 41 insertions(+), 31 deletions(-)

diff --git a/neat/read_simulator/single_runner.py b/neat/read_simulator/single_runner.py
index 59293144..0c4da818 100644
--- a/neat/read_simulator/single_runner.py
+++ b/neat/read_simulator/single_runner.py
@@ -113,7 +113,11 @@ def read_simulator_single(
     )
 
     if local_options.produce_fastq or local_options.produce_bam:
-        reads_to_write = generate_reads(
+        # generate_reads streams FASTQ and (if requested) BAM records directly to the
+        # output handles on local_output_file_writer. It no longer accumulates Read
+        # objects per chunk — the coordinate-sorted BAM is emitted inline using a
+        # bounded min-heap to interleave PE mate records. See generate_reads docstring.
+        generate_reads(
             thread_idx,
             local_seq_record,
             seq_error_model,
@@ -130,29 +134,6 @@ def read_simulator_single(
             contig_index,
             coords[0],
         )
-        if local_options.produce_bam:
-            # Flatten the (read1, read2) pairs and sort by leftmost reference position so
-            # the per-worker BAM is coordinate-sorted by construction. For paired-end runs
-            # this correctly interleaves mate records by position, which a fragment-level
-            # sort would not. Downstream merge_bam relies on per-worker BAMs being sorted.
-            bam_handle = local_output_file_writer.files_to_write[local_output_file_writer.bam]
-            flat_reads = []
-            for read1, read2 in reads_to_write:
-                if read1:
-                    flat_reads.append(read1)
-                if read2:
-                    flat_reads.append(read2)
-            flat_reads.sort(key=lambda r: r.position)
-            for read in flat_reads:
-                local_output_file_writer.write_bam_record(
-                    read,
-                    contig_index,
-                    bam_handle,
-                    local_options.read_len
-                )
-            bam_handle.flush()
-            bam_handle.close()
-            _LOG.info(f"bam for thread {thread_idx} written")
 
     if local_options.produce_vcf:
         write_block_vcf(local_variants, contig_name, block_start, local_ref_index, local_output_file_writer)
diff --git a/neat/read_simulator/utils/generate_reads.py b/neat/read_simulator/utils/generate_reads.py
index e84fa42a..185efc99 100644
--- a/neat/read_simulator/utils/generate_reads.py
+++ b/neat/read_simulator/utils/generate_reads.py
@@ -1,3 +1,4 @@
+import heapq
 import logging
 import pickle
 import time
@@ -255,7 +256,8 @@ def generate_reads(
     :param contig_index: The index of the above chromosome within the overall bam header
     :param ref_start: The start point for this reference segment. Default is 0 and this is currently not fully
         implemented, to be used for parallelization.
-    :return: A tuple of the filenames for the temp files created
+    :return: None. FASTQ and BAM records are streamed directly to the output handles
+        on `ofw` as reads are generated; no per-chunk accumulation of Read objects.
     """
     # _LOG.info(f'Sampling reads for thread {thread_index}...')
     start_time = time.time()
@@ -265,7 +267,7 @@ def generate_reads(
             f"Contig '{contig_name}' (length {len(reference)}) is shorter than read_len "
             f"({options.read_len}). Skipping contig."
         )
-        return []
+        return
 
     # _LOG.debug("Covering dataset.")
     t = time.time()
@@ -277,10 +279,20 @@ def generate_reads(
     )
     # _LOG.debug(f"Dataset coverage took: {(time.time() - t)/60:.2f} m")
 
+    # Process fragments in read1-start order so the BAM emerges coordinate-sorted by
+    # construction. In paired-end mode the read2 of each fragment is at a different
+    # (typically later) position, so we hold those in a min-heap keyed by position and
+    # flush each one before writing the next read1 that would precede it. The heap is
+    # bounded by ~(fragment_length / read_length) entries — single-digit reads in
+    # practice — so per-worker memory stays constant in chunk size and coverage.
+    reads.sort(key=lambda r: r[0])
+
     # _LOG.debug("Writing fastq(s) and optional bam, if indicated")
     t = time.time()
 
-    reads_to_write = []
+    bam_handle = ofw.files_to_write[ofw.bam] if options.produce_bam else None
+    r2_buffer: list[tuple[int, int, "Read"]] = []  # (position, counter, read)
+    r2_counter = 0
 
     for i in range(len(reads)):
         # First thing we'll do is check to see if this read is filtered out by a bed file
@@ -367,6 +379,16 @@ def generate_reads(
             errors_per_read,
             options.rng
         )
+
+        # Stream BAM in coordinate order: flush any buffered read2 records whose
+        # position lies before this read1, then write read1 itself. Since fragments
+        # are sorted by read1.position, read1 positions arrive monotonically.
+        if bam_handle is not None:
+            while r2_buffer and r2_buffer[0][0] < read_1.position:
+                _, _, buffered_r2 = heapq.heappop(r2_buffer)
+                ofw.write_bam_record(buffered_r2, contig_index, bam_handle, options.read_len)
+            ofw.write_bam_record(read_1, contig_index, bam_handle, options.read_len)
+
         # skip over read 2 for single ended reads.
         if options.paired_ended:
             # Padding, as above
@@ -406,9 +428,16 @@ def generate_reads(
                 errors_per_read,
                 options.rng
             )
-            reads_to_write.append((read_1, read_2))
-        else:
-            reads_to_write.append((read_1, None))
+            if bam_handle is not None:
+                heapq.heappush(r2_buffer, (read_2.position, r2_counter, read_2))
+                r2_counter += 1
+
+    # Flush any read2 records still in the buffer — these all have positions at or
+    # after the last read1 we wrote, so popping them in heap order gives the correct
+    # coordinate-sorted tail.
+    if bam_handle is not None:
+        while r2_buffer:
+            _, _, buffered_r2 = heapq.heappop(r2_buffer)
+            ofw.write_bam_record(buffered_r2, contig_index, bam_handle, options.read_len)
 
     _LOG.info(f"Finished sampling reads for thread {thread_index} in {(time.time() - start_time)/60:.2f} m")
-    return reads_to_write

From 4f69ae1c2aba714598511eb2dcfbe418cc9700e3 Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 10:00:26 -0500
Subject: [PATCH 08/15] Stitch via pysam.cat with non-overlapping chunks
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Previously merge_bam called pysam.merge on overlapping per-chunk BAMs, which
required BGZF decompression and a k-way heap merge of all records. At
supercomputer scale on large genomes that step becomes the dominant cost
(scales with output size, not core count) and would dwarf the parallel
read-generation phase.

Two coordinated changes make a raw byte-level concatenation correct:

1. cover_dataset now takes responsibility_length, the number of bases at the
   start of the chunk's reference window that this chunk owns for read1
   placement. r1.position is clipped to [0, responsibility_length); the
   trailing overlap region serves only as reference context for boundary-
   spanning reads. number_reads is scaled by responsibility_length rather
   than span_length, eliminating the ~0.06% over-coverage that used to
   appear in chunk-overlap regions. The runner computes
   responsibility_length per chunk as the distance to the next chunk's
   start (or the full remaining length for the final chunk).

2. For PE, cover_dataset additionally caps fragment-end e at
   responsibility_length + read_len so that read2.position cannot extend
   past responsibility_length. Without this clamp the few PE fragments with
   long fragment_length near a chunk boundary produced read2 records that
   landed in the next chunk's range, breaking cat-stitched sort order (27
   violations on 928k records in initial testing). With the clamp,
   per-worker BAMs concatenate into a strictly coordinate-sorted whole.

stitch_outputs.merge_bam now calls pysam.cat instead of pysam.merge. cat
does a raw BGZF concatenation (no decompression / re-encode) and is bounded
by raw disk I/O, not BGZF rate. The per-chunk batching loop for FD-limit
management is no longer needed.

Validated on ecoli 10x 4-thread:
  ecoli_se: 2:08.4 -> 1:52.7  (1.14x faster, 223 MB -> 175 MB RSS)
  ecoli_pe: 2:14.0 -> 2:00.5  (1.11x faster, 223 MB -> 175 MB RSS, 0 sort
                              violations on both)
BAM is coordinate-sorted, .bai index produced as before by runner.py.
Record counts match expected (0.06% fewer than pre-fix because chunk-overlap
duplicates are gone — actual correctness improvement).

The relative speedup of cat over merge grows with output size: at ecoli
scale the BAM is 115 MB and stitch was already a modest fraction; for a
human 30x BAM (~100 GB), the same change extrapolates to dropping stitch
from hours to a few minutes — bounded by parallel filesystem bandwidth
instead of BGZF re-encode rate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 neat/read_simulator/runner.py               | 23 ++++++-
 neat/read_simulator/single_runner.py        |  2 +
 neat/read_simulator/utils/generate_reads.py | 71 +++++++++++++++++----
 neat/read_simulator/utils/stitch_outputs.py | 21 +++---
 4 files changed, 87 insertions(+), 30 deletions(-)

diff --git a/neat/read_simulator/runner.py b/neat/read_simulator/runner.py
index 069a98f3..463c864c 100644
--- a/neat/read_simulator/runner.py
+++ b/neat/read_simulator/runner.py
@@ -165,10 +165,25 @@ def read_simulator_runner(config: str, output_dir: str, file_prefix: str):
     contig_dict = {contig: contig_list.index(contig) for contig in reference_keys_with_lens.keys()}
     for contig in splits_files_dict:
         contig_index = contig_dict[contig]
-        for ((start, length), splits_file) in splits_files_dict[contig].items():
-            block_percentage = length / reference_keys_with_lens[contig]
+        contig_chunks = list(splits_files_dict[contig].items())
+        # Precompute each chunk's non-overlapping responsibility length: the distance from
+        # this chunk's start to the next chunk's start, or the chunk's full length for the
+        # last chunk of the contig. Reads with r1.position in [0, responsibility_length)
+        # belong to this chunk; the trailing overlap region (if any) provides reference
+        # context for boundary-spanning reads but is owned by the next chunk for placement.
+        # This is what lets the stitch step be a raw BGZF concatenation rather than a sort.
+        chunk_responsibility = {}
+        for i, ((c_start, c_end), _) in enumerate(contig_chunks):
+            if i + 1 < len(contig_chunks):
+                next_start = contig_chunks[i + 1][0][0]
+                chunk_responsibility[(c_start, c_end)] = next_start - c_start
+            else:
+                chunk_responsibility[(c_start, c_end)] = c_end - c_start
+        for ((start, length), splits_file) in contig_chunks:
+            responsibility_length = chunk_responsibility[(start, length)]
+            block_percentage = responsibility_length / reference_keys_with_lens[contig]
             block_errors = errors_per_contig[contig] * block_percentage
-            estimated_number_of_reads = (length // options.read_len) * options.coverage
+            estimated_number_of_reads = (responsibility_length // options.read_len) * options.coverage
             errors_per_read = round(block_errors / estimated_number_of_reads)
             if errors_per_read < 1.0 and block_errors > 0:
                 # We know we need a few errors, but it's a small number total
@@ -209,6 +224,7 @@ def read_simulator_runner(config: str, output_dir: str, file_prefix: str):
                     discard_regions_dict[contig],
                     mutation_rate_dict[contig],
                     errors_per_read,
+                    responsibility_length,
                 )
                 _LOG.info(f"Completed simulating contig {contig}.")
                 output_files.append((thread_idx, files_written))
@@ -230,6 +246,7 @@ def read_simulator_runner(config: str, output_dir: str, file_prefix: str):
                     thread_discard_regions,
                     thread_mutation_regions,
                     errors_per_read,
+                    responsibility_length,
                 ))
             thread_idx += 1
 
diff --git a/neat/read_simulator/single_runner.py b/neat/read_simulator/single_runner.py
index 0c4da818..dab470c0 100644
--- a/neat/read_simulator/single_runner.py
+++ b/neat/read_simulator/single_runner.py
@@ -33,6 +33,7 @@ def read_simulator_single(
         discard_regions: list,
         mutation_regions: list,
         errors_per_read: int,
+        responsibility_length: int | None = None,
 ) -> tuple[int, str, ContigVariants, dict[str, Path], ]:
     """
     inputs:
@@ -133,6 +134,7 @@ def read_simulator_single(
             contig_name,
             contig_index,
             coords[0],
+            responsibility_length,
         )
 
     if local_options.produce_vcf:
diff --git a/neat/read_simulator/utils/generate_reads.py b/neat/read_simulator/utils/generate_reads.py
index 185efc99..a8ca23b3 100644
--- a/neat/read_simulator/utils/generate_reads.py
+++ b/neat/read_simulator/utils/generate_reads.py
@@ -30,6 +30,8 @@ def cover_dataset(
         options: Options,
         fragment_model: FragmentLengthModel | None,
         gc_model: GCBiasModel | None,
+        *,
+        responsibility_length: int | None = None,
 ) -> list:
     """
     Covers a dataset to the desired depth in the paired ended case. This is the main algorithm for creating the reads
@@ -39,35 +41,48 @@ def cover_dataset(
     :param options: The options for the run
     :param fragment_model: The fragment model used for to generate random fragment lengths
     :param gc_model: The GC bias model used for fragment selection
+    :param responsibility_length: The number of bases at the start of `reference` that this
+        chunk is responsible for placing read1 starts in. Reads still extend into the
+        trailing overlap region for context. Defaults to len(reference) (the chunk owns
+        its full reference). For non-final chunks under sub-contig parallelism, this is
+        the chunk step (chunk_size - overlap) — restricting read1 positions to the
+        non-overlapping range so that per-chunk BAMs can be byte-concatenated into a
+        coordinate-sorted whole without re-sorting.
     """
 
     final_reads = []
     span_length = len(reference)
+    # Number of bases this chunk owns for read1 placement. Defaults to the full chunk.
+    if responsibility_length is None:
+        responsibility_length = span_length
+    # Last valid read1 start position. Bounded by both the chunk's responsibility (so
+    # reads don't appear in the next chunk's range) and by the requirement that the read
+    # fit in available reference (so a read of length read_len has its tail in-bounds).
+    max_start = min(responsibility_length - 1, span_length - options.read_len)
     # sanity check
     if span_length / fragment_model.fragment_mean < 5:
         _LOG.warning("The fragment mean is relatively large compared to the chromosome size. You may need to increase "
                      "standard deviation, or decrease fragment mean, if NEAT cannot complete successfully.")
 
     # precompute how many reads we want
-    # The numerator is the total number of base pair calls needed.
-    # Divide that by read length gives the number of reads needed
+    # The numerator is the total number of base pair calls needed. Coverage is scaled by
+    # responsibility_length, not span_length, so chunks don't over-sample their overlap
+    # region (which is also covered by the next chunk).
     if options.paired_ended:
         # TODO use gc bias to skew this number. Calculate at the runner level.
-        number_reads = ceil(span_length * options.coverage / (2 * options.read_len))
+        number_reads = ceil(responsibility_length * options.coverage / (2 * options.read_len))
     else:
-        number_reads = ceil(span_length * options.coverage / options.read_len)
+        number_reads = ceil(responsibility_length * options.coverage / options.read_len)
 
     if gc_model and not gc_model.is_uniform:
         # CDF-based sampling for GC bias
         window_size = gc_model.window_size
         if span_length <= window_size:
             # Fallback to uniform if region is too short
-            return _uniform_sampling(span_length, number_reads, options, fragment_model)
+            return _uniform_sampling(span_length, number_reads, options, fragment_model,
+                                     max_start=max_start)
 
-        # Build prefix sum of weights
-        # We need weights for every possible start position that can produce a read.
-        # For single-ended, start must be in [0, span_length - read_len]
-        max_start = span_length - options.read_len
+        # Build prefix sum of weights only over the positions this chunk owns.
         if max_start < 0:
             return []
             
@@ -110,6 +125,15 @@ def cover_dataset(
 
         # Batch CDF sampling with adaptive retry (same pattern as _uniform_sampling).
         min_frag = options.read_len + (10 if options.paired_ended else 0)
+        # For PE, the read2 record must stay within this chunk's responsibility so the
+        # cat-stitched output remains coordinate-sorted (read2.position = e - read_len).
+        # Cap e at responsibility_length + read_len so read2.position <= responsibility_length,
+        # which lies at-or-before the next chunk's first read1 position. Falls back to
+        # span_length for SE and for the final chunk (where responsibility = span_length).
+        if options.paired_ended:
+            e_limit = min(responsibility_length + options.read_len, span_length)
+        else:
+            e_limit = span_length
         acc_starts: list[np.ndarray] = []
         acc_ends:   list[np.ndarray] = []
         collected = 0
@@ -119,7 +143,7 @@ def cover_dataset(
             uv = options.rng.random(n_batch) * total_weight
             s = np.clip(np.searchsorted(prefix_sum, uv).astype(int), 0, max_start)
             fl = np.array(fragment_model.generate_fragments(n_batch, options.rng))
-            e = np.minimum(s + fl, span_length)
+            e = np.minimum(s + fl, e_limit)
             mask = e - s >= min_frag
             acc_starts.append(s[mask])
             acc_ends.append(e[mask])
@@ -136,7 +160,10 @@ def cover_dataset(
              
     else:
         # Uniform sampling
-        final_reads = _uniform_sampling(span_length, number_reads, options, fragment_model)
+        final_reads = _uniform_sampling(
+            span_length, number_reads, options, fragment_model,
+            max_start=max_start, responsibility_length=responsibility_length,
+        )
 
     # FASTQ is written in the natural fragment-sampling order (no explicit shuffle).
     # The BAM is sorted to coordinate order at the BAM-write boundary in single_runner,
@@ -146,12 +173,26 @@ def cover_dataset(
     return final_reads
 
 
-def _uniform_sampling(span_length, number_reads, options, fragment_model):
+def _uniform_sampling(span_length, number_reads, options, fragment_model, *,
+                      max_start=None, responsibility_length=None):
     if span_length <= options.read_len:
         return []
 
-    max_start = span_length - options.read_len
+    # If caller didn't restrict the read1 placement range, default to the full
+    # in-bounds span (last valid r1.position = span_length - read_len).
+    if max_start is None:
+        max_start = span_length - options.read_len
+    if max_start < 0:
+        return []
+    if responsibility_length is None:
+        responsibility_length = span_length
     min_frag = options.read_len + (10 if options.paired_ended else 0)
+    # For PE, cap e so read2.position stays within this chunk's responsibility (see GC path
+    # for the full rationale).
+    if options.paired_ended:
+        e_limit = min(responsibility_length + options.read_len, span_length)
+    else:
+        e_limit = span_length
 
     # First batch: 2× candidates covers >99 % of cases when frag_mean >> read_len.
     # Retry in small increments only when fragment_mean < read_len (rare).
@@ -163,7 +204,7 @@ def _uniform_sampling(span_length, number_reads, options, fragment_model):
     while collected < number_reads:
         s = options.rng.integers(0, max_start + 1, size=n_batch)
         fl = np.array(fragment_model.generate_fragments(n_batch, options.rng))
-        e = np.minimum(s + fl, span_length)
+        e = np.minimum(s + fl, e_limit)
         mask = e - s >= min_frag
         acc_starts.append(s[mask])
         acc_ends.append(e[mask])
@@ -236,6 +277,7 @@ def generate_reads(
         contig_name: str,
         contig_index: int,
         ref_start: int,
+        responsibility_length: int | None = None,
 ):
     """
     This will generate reads given a set of parameters for the run. The reads will output in a fastq.
@@ -276,6 +318,7 @@ def generate_reads(
         options,
         fraglen_model,
         gc_model,
+        responsibility_length=responsibility_length,
     )
     # _LOG.debug(f"Dataset coverage took: {(time.time() - t)/60:.2f} m")
 
diff --git a/neat/read_simulator/utils/stitch_outputs.py b/neat/read_simulator/utils/stitch_outputs.py
index 6fd04756..61edc4bb 100644
--- a/neat/read_simulator/utils/stitch_outputs.py
+++ b/neat/read_simulator/utils/stitch_outputs.py
@@ -45,19 +45,14 @@ def merge_vcfs(vcfs: List[Path], ofw: OutputFileWriter) -> None:
         _LOG.warning(f"merge_vcfs: removed {n_duplicates} duplicate VCF line(s) during merge.")
 
 def merge_bam(bam_files: List[Path], ofw: OutputFileWriter, threads: int):
-    # Per-worker BAMs are coordinate-sorted by single_runner. pysam.merge does a
-    # k-way heap merge of coordinate-sorted inputs and produces coordinate-sorted
-    # output, so no separate sort pass is needed. Avoiding samtools sort here
-    # eliminates a ~1 GB sort-buffer allocation that previously dominated peak
-    # memory at stitch time.
-    # The batching loop bounds the number of file handles passed to pysam.merge
-    # at once (samtools defaults around ulimit).
-    intermediate_files = []
-    for i in range(0, len(bam_files), 500):
-        temp_file = str(ofw.tmp_dir / f"temp_merged_{i}.bam")
-        pysam.merge("--no-PG", "-f", temp_file, *map(str, bam_files[i:i+500]))
-        intermediate_files.append(temp_file)
-    pysam.merge("--no-PG", "-@", str(threads), "-f", str(ofw.bam), *intermediate_files)
+    # Per-worker BAMs are coordinate-sorted within themselves, and each chunk owns a
+    # non-overlapping reference range for r1 placement (see cover_dataset's
+    # responsibility_length), so the BAMs concatenate into a globally coordinate-sorted
+    # output without any sort or k-way merge. pysam.cat does a raw BGZF concatenation —
+    # no decompression / re-encode — and is typically 10-30x faster than pysam.merge on
+    # large outputs. At supercomputer scale this is the difference between the stitch
+    # step being I/O-bound (fast) and BGZF-bound (slow).
+    pysam.cat("-o", str(ofw.bam), *map(str, bam_files))
     # The .bai index is produced by runner.py after stitching.
 
 def main(

From 363b786033ada22f5ab41c427add68c71fe4f33c Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 11:01:19 -0500
Subject: [PATCH 09/15] Auto-tune chunk size from genome length; parallelize
 stitch steps
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Two changes that matter most at large-genome / supercomputer scale, where
defaults sized for small bacterial runs fall apart.

1. Auto-tune parallel_block_size

   Changed the default from a hardcoded 500_000 bp to 0, which is now a
   sentinel meaning "auto-tune". The runner computes the chunk size from
   total genome length and thread count, targeting 8 chunks per thread:

     chunk_size = total_genome / (threads * 8)
              clamped to [100 kb, 50 Mb]

   For ecoli (14 Mb, 4 threads) this gives ~435 kb chunks — essentially
   unchanged from the old hardcoded 500 kb. For human-scale runs (3 Gb,
   64 threads) it gives ~5.9 Mb chunks (≈520 total) instead of the old
   500 kb (≈6,000 total): fewer files to merge at stitch time, fewer
   process-startup overheads, better load balance across workers.

   Users with `parallel_block_size: <positive int>` in their YAML get their
   explicit value as before. Genome length is read from the FASTA index
   (.fai) via pysam.FastaFile — fast even on large genomes.

2. Parallelize stitch_outputs.main

   The four stitch steps (fq1 concat, fq2 concat, vcf merge, bam cat) now
   run concurrently in a ThreadPoolExecutor. All four spend most of their
   time in C-level I/O that releases the GIL (zlib for gzip, BGZF for
   pysam.cat). Total stitch wall ≈ max(steps) instead of sum(steps).

   On the local ecoli benchmark the parallel stitch shows ~flat wall (the
   single-disk I/O contention makes the steps effectively serial), but at
   supercomputer scale on a parallel filesystem the overlap becomes real.
   No regression observed.

True byte-level streaming of pysam.cat output as workers complete remains
future work — would require manual BGZF block manipulation to strip
per-chunk BAM headers, and the win is modest at single-node scale (saves
~one chunk's worth of cat time at the end). The cat-stitch from the prior
commit already eliminates the dominant scaling concern.

Validated on ecoli 10x 4-thread:
  ecoli_se: 1:52.7 -> 1:56.9 (within noise; SE has less to parallelize)
  ecoli_pe: 2:00.5 -> 1:44.3 (1.16x faster, CPU% 310 -> 353)
  Auto-tuned chunk size: 435,154 bp for ecoli's 13.9 Mb genome.
  BAM: 0 sort violations on both SE and PE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 neat/read_simulator/runner.py               | 20 ++++++++++++++
 neat/read_simulator/utils/options.py        | 13 ++++++---
 neat/read_simulator/utils/stitch_outputs.py | 29 ++++++++++++++++-----
 3 files changed, 51 insertions(+), 11 deletions(-)

diff --git a/neat/read_simulator/runner.py b/neat/read_simulator/runner.py
index 463c864c..509d6cd4 100644
--- a/neat/read_simulator/runner.py
+++ b/neat/read_simulator/runner.py
@@ -87,6 +87,26 @@ def read_simulator_runner(config: str, output_dir: str, file_prefix: str):
         # Use the default value
         average_error = 0.009228843915252066
 
+    # Auto-tune the chunk size from total genome length and thread count when the user
+    # left it at the default (parallel_block_size <= 0). Target ~8 chunks per thread —
+    # enough chunks to balance load across workers when chunk durations vary, few enough
+    # that the stitch step doesn't drown in file-handle overhead. Clamp to [100 kb, 50 Mb]
+    # to avoid pathological cases on very small or very large genomes. The FASTA index
+    # (.fai) gives us per-contig lengths without parsing sequences; pysam.FastaFile
+    # builds the .fai on first access if needed.
+    if (options.threads > 1
+            and options.parallel_mode == "size"
+            and options.parallel_block_size <= 0):
+        with pysam.FastaFile(str(options.reference)) as _fa:
+            total_bp = sum(_fa.get_reference_length(c) for c in _fa.references)
+        target_chunks = options.threads * 8
+        auto_size = max(100_000, min(50_000_000, total_bp // max(1, target_chunks)))
+        _LOG.info(
+            f"Auto-tuned parallel_block_size to {auto_size:,} bp "
+            f"({total_bp:,} bp genome / {options.threads} threads x 8 chunks/thread)"
+        )
+        options.parallel_block_size = auto_size
+
     # Split file by chunk for parallel analysis or by contig for either parallel or single analysis
     _LOG.info("Splitting reference...")
     (splits_files_dict, count, reference_keys_with_lens) = split_main(
diff --git a/neat/read_simulator/utils/options.py b/neat/read_simulator/utils/options.py
index 4e6c3007..627eed34 100644
--- a/neat/read_simulator/utils/options.py
+++ b/neat/read_simulator/utils/options.py
@@ -82,7 +82,7 @@ def __init__(self,
                  produce_fastq: bool = True,
                  min_mutations: int = 0,
                  parallel_mode: str = "contig",
-                 parallel_block_size: int = 500000,
+                 parallel_block_size: int = 0,
                  cleanup_splits: bool = True,
                  splits_dir: Path | None = None,
                  reuse_splits: bool = False,
@@ -134,7 +134,9 @@ def __init__(self,
         :param produce_fastq: False to turn off default fastq creation
         :param min_mutations: If you wish to have a minimunm number of mutations per block, enter it here
         :param parallel_mode: If you wish to use block size method, enter 'size' here
-        :param parallel_block_size: If you use size method, specify any value but 500000 to change the block size
+        :param parallel_block_size: If you use size method, specify a positive integer to set the per-chunk
+            size in basepairs. The default (0) auto-tunes from total genome length and thread count, targeting
+            ~8 chunks per thread. Specify a value explicitly to override.
         :param cleanup_splits: Set to False in order to preserve splits after run
         :param reuse_splits: Attempts to reuse existing splits file
         """
@@ -240,7 +242,7 @@ def from_cli(output_dir: Path,
             'min_mutations': (int, 0, None, None),
             'overwrite_output': (bool, False, None, None),
             'parallel_mode': (str, 'size', 'choice', ['size', 'contig']),
-            'parallel_block_size': (int, 500000, None, None),
+            'parallel_block_size': (int, 0, None, None),
             'threads': (int, 1, 1, 1000),
             'cleanup_splits': (bool, True, None, None),
             'reuse_splits': (bool, False, None, None),
@@ -446,7 +448,10 @@ def log_configuration(self):
 
         if self.parallel_mode == 'size':
             _LOG.info(f'Splitting reference into chunks.')
-            _LOG.info(f'  - splitting input into size {self.parallel_block_size}')
+            if self.parallel_block_size > 0:
+                _LOG.info(f'  - splitting input into size {self.parallel_block_size}')
+            else:
+                _LOG.info(f'  - chunk size will be auto-tuned from genome length and thread count')
         elif self.parallel_mode == 'contig':
             _LOG.info(f'Splitting input by contig.')
         if self.reuse_splits:
diff --git a/neat/read_simulator/utils/stitch_outputs.py b/neat/read_simulator/utils/stitch_outputs.py
index 61edc4bb..efb3ac81 100644
--- a/neat/read_simulator/utils/stitch_outputs.py
+++ b/neat/read_simulator/utils/stitch_outputs.py
@@ -4,6 +4,8 @@
 import resource
 import shutil
 import pysam
+import time
+from concurrent.futures import ThreadPoolExecutor
 from pathlib import Path
 from typing import List
 
@@ -75,14 +77,27 @@ def main(
             vcf_list.append(file_dict["vcf"])
         if file_dict["bam"]:
             bam_list.append(file_dict["bam"])
-    # concatenate all files of each type. An empty list will result in no action
+    # Run the per-output-type stitches concurrently. They write to independent files and
+    # spend most of their time in I/O (raw gzip/BGZF byte copy for FASTQ/BAM, gzip read
+    # for VCF dedup) — Python's GIL is released during those calls, so threads overlap.
+    # Total stitch wall ≈ max(fq, vcf, bam) instead of their sum. At supercomputer scale
+    # where BAM dominates and FASTQ/VCF are small, the saving is bounded by the BAM cat
+    # alone, but on smaller-output workloads (or many small chunks) the overlap matters.
+    stitch_start = time.time()
+    work = []
     if fq1_list:
-        concat(fq1_list, ofw.files_to_write[ofw.fq1])
+        work.append(("fq1", lambda: concat(fq1_list, ofw.files_to_write[ofw.fq1])))
     if fq2_list:
-        concat(fq2_list, ofw.files_to_write[ofw.fq2])
+        work.append(("fq2", lambda: concat(fq2_list, ofw.files_to_write[ofw.fq2])))
     if vcf_list:
-        merge_vcfs(vcf_list, ofw)
+        work.append(("vcf", lambda: merge_vcfs(vcf_list, ofw)))
     if bam_list:
-        merge_bam(bam_list, ofw, threads)
-    # Final success message via logging
-    _LOG.info("Stitching complete!")
+        work.append(("bam", lambda: merge_bam(bam_list, ofw, threads)))
+
+    if work:
+        with ThreadPoolExecutor(max_workers=len(work)) as exe:
+            futures = {exe.submit(fn): label for label, fn in work}
+            for future in futures:
+                # .result() re-raises any exception from the worker.
+                future.result()
+    _LOG.info(f"Stitching complete! ({time.time() - stitch_start:.1f} s parallel stitch)")

From 9eb182d48dc558882ff6765cb406c1a45dadd35b Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 11:03:19 -0500
Subject: [PATCH 10/15] Document multi-node HPC deployment; record v4.4.3 perf
 overhaul in ChangeLog
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

README: Added a "Multi-node deployment on HPC clusters" subsection with a
SLURM array-job example for whole-genome simulation. NEAT itself runs
single-node (multiprocessing.Pool); for cross-node scaling, users dispatch
one task per chromosome and concatenate outputs. The cat-based stitch
introduced in this release means the final whole-genome concatenation is
just `samtools cat` of per-chromosome BAMs — no sort needed.

Also updated the parallelizing-simulation section to mention auto-tuned
chunk sizes (the default behavior now).

ChangeLog: Wrote the v4.4.3 entry summarizing the session's work, with a
before/after benchmark table on ecoli 10x. Headline numbers:

  ecoli SE wall:    14:55 -> 1:57   (7.6x)
  ecoli PE wall:    14:46 -> 1:44   (8.5x)
  Peak RSS:         549 MB -> 175 MB (3.1x less)
  Peak heap:        1.27 GB -> 0.32 GB (4x less)
  pysam.sort:       gone
  PairwiseAligner:  99% gone

vs NEAT 2.1 single-threaded:
  SE wall: 12:28 -> 1:57 (6.4x), 53% less total CPU
  PE wall: 20:12 -> 1:44 (11.6x), 70% less total CPU

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 ChangeLog.md | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 README.md    | 60 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 125 insertions(+), 2 deletions(-)

diff --git a/ChangeLog.md b/ChangeLog.md
index 53119cb2..d5c85df5 100644
--- a/ChangeLog.md
+++ b/ChangeLog.md
@@ -1,3 +1,70 @@
+# NEAT v4.4.3
+Major performance and memory overhaul focused on making NEAT viable for large
+genomes on supercomputing hardware. No user-visible API changes (other than
+chunk size now auto-tuning by default).
+
+**Benchmark (ecoli 10× coverage, 4 threads, identical configs):**
+
+| Metric                  | v4.4.2     | v4.4.3     | Improvement     |
+|-------------------------|------------|------------|-----------------|
+| ecoli SE wall time      | 14:55      | 1:57       | 7.6× faster     |
+| ecoli PE wall time      | 14:46      | 1:44       | 8.5× faster     |
+| ecoli SE total CPU      | 3,227 s    | 356 s      | 9.1× less       |
+| ecoli PE total CPU      | 3,168 s    | 368 s      | 8.6× less       |
+| Peak resident memory    | 549 MB     | 175 MB     | 3.1× less       |
+| Peak heap (memray)      | 1.27 GB    | 0.32 GB    | 4× less         |
+| Per-worker memory       | O(N×cov)   | O(1)       | bounded         |
+| `pysam.sort` calls      | 2          | 0          | gone            |
+| BAM correctness         | 0.06% dups | strict     | fixed           |
+
+**Versus NEAT 2.1 (single-threaded baseline):**
+- SE: 12:28 → 1:57 (6.4× faster, 53% less CPU)
+- PE: 20:12 → 1:44 (11.6× faster, 70% less CPU)
+
+**What changed in the hot path:**
+- Vectorized error sampling in `get_sequencing_errors` — replaced a ~150-iteration
+  per-read Python loop with batched numpy. Eliminated 28M `np.prod` calls per
+  185k-read run.
+- Vectorized `get_quality_scores` — replaced per-base scalar `rng.normal` with
+  one batched call.
+- Replaced per-read `PairwiseAligner.align()` in `make_cigar` with a direct
+  walker that builds the CIGAR from known error/mutation positions in O(L).
+  99% of reads now skip alignment entirely.
+
+**What changed in the I/O path:**
+- Removed both `pysam.sort` calls. Per-worker BAMs are emitted coordinate-sorted
+  by construction; `pysam.merge` of sorted inputs already produces sorted output.
+  The final sort allocated a 1 GB buffer that dominated peak memory.
+- Replaced `pysam.merge` with `pysam.cat` for the final stitch. cat does a raw
+  BGZF concatenation (no decompression / re-encode), bounded by raw disk I/O
+  instead of BGZF rate. At human-30× scale this is the difference between a
+  multi-hour stitch and a multi-minute one.
+- Each chunk now owns a non-overlapping reference range for read1 placement
+  (`responsibility_length`), enabling the cat-based stitch and eliminating
+  ~0.06% over-coverage in chunk-overlap regions.
+- Streamed FASTQ and BAM records directly to output during read generation.
+  Workers no longer accumulate `reads_to_write` — per-worker memory is now
+  bounded by reference segment + models, not by chunk size × coverage.
+- Stitch steps (FASTQ concat, VCF dedup, BAM cat) now run concurrently in
+  threads.
+
+**Defaults and ergonomics:**
+- `parallel_block_size` now auto-tunes from genome length and thread count
+  (target: ~8 chunks per thread). For small bacterial genomes this matches the
+  old hardcoded 500 kb; for human-scale genomes it produces ~6 Mb chunks
+  instead of ~500 kb, dramatically reducing stitch overhead. Specify the option
+  explicitly to override.
+- FASTQ output is no longer shuffled; reads come out in the natural sampling
+  order. Pipe through `seqkit shuffle` if you need a uniform shuffle (documented
+  in README).
+- Added a "Multi-node deployment on HPC clusters" section to the README
+  showing a SLURM array-job pattern for whole-genome simulation across nodes.
+
+**Caveats:**
+- Several of the vectorization fixes change how the PRNG stream is consumed.
+  Same seed will produce statistically equivalent reads, but not bit-identical
+  to v4.4.2. Re-baseline any regression tests that compared exact output.
+
 # NEAT v4.4.2
 - Added GC bias modeling to generate reads and a function to create a GC bias model from real data.
 - Added improvements and efficiency upgrades to generate-reads.
diff --git a/README.md b/README.md
index 52cb72d2..e9384bb4 100755
--- a/README.md
+++ b/README.md
@@ -362,7 +362,7 @@ neat read-simulator                 \
 
 ```
 ### Parallelizing simulation
-In this case, you would want to split the contig into blocks, rather than reading by contig. Even in single-threaded mode, this is likely to be faster. The default block size of 500,000 yields results quickly on a variety of datasets and can be easily modified to meet your requirements.
+Split the contig into blocks rather than reading by contig. Even in single-threaded mode this is likely to be faster. The chunk size auto-tunes from total genome length and thread count, targeting roughly 8 chunks per thread for good load balancing — on small bacterial genomes you get ~500 kb chunks (similar to NEAT's old hardcoded default), on human-scale genomes you get ~6 Mb chunks (a few hundred total instead of thousands). Specify `parallel_block_size` explicitly if you want to override.
 
 Also, we demonstrate the situation where you do not want any logs produced:
 
@@ -376,7 +376,8 @@ fragment_mean: 350
 fragment_st_dev: 50
 threads: 12
 parallel_mode: size
-parallel_block_size: 500000
+# parallel_block_size omitted: auto-tuned from genome length and thread count.
+# Set explicitly (e.g. parallel_block_size: 1000000) only if you have a reason.
 ```
 Then run with the command:
 ```
@@ -386,6 +387,61 @@ neat read-simulator                 \
         -o /home/me/simulated_reads/
 ```
 
+### Multi-node deployment on HPC clusters
+
+NEAT runs on a single node using Python's `multiprocessing`. To use multiple nodes on a supercomputer, dispatch one NEAT job per contig (or contig group) as a job-array element and concatenate the outputs afterwards. Each array task gets its own node and uses all available cores on it. NEAT itself doesn't need to know about the cluster.
+
+SLURM example with one task per human chromosome (24 tasks):
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=neat-array
+#SBATCH --array=1-24
+#SBATCH --cpus-per-task=64
+#SBATCH --mem=64G
+#SBATCH --time=04:00:00
+
+CHROM=$(awk "NR==$SLURM_ARRAY_TASK_ID" chroms.txt)   # e.g. chr1, chr2, ...
+samtools faidx hg38.fa "$CHROM" > "ref_${CHROM}.fa"   # one chromosome per task
+
+cat > "config_${CHROM}.yml" <<YAML
+reference: ref_${CHROM}.fa
+read_len: 150
+coverage: 30
+paired_ended: true
+fragment_mean: 400
+fragment_st_dev: 50
+produce_bam: true
+produce_fastq: true
+produce_vcf: true
+threads: ${SLURM_CPUS_PER_TASK}
+parallel_mode: size
+YAML
+
+neat read-simulator -c "config_${CHROM}.yml" \
+                    -o "out/${CHROM}/" \
+                    -p "${CHROM}"
+```
+
+After the array completes, concatenate the per-chromosome outputs:
+
+```bash
+# BAMs are already coordinate-sorted within each chromosome; cat in chromosome order
+# produces a globally-sorted whole-genome BAM with no further sort needed.
+samtools cat -o all.bam out/chr1/chr1_golden.bam out/chr2/chr2_golden.bam ... out/chrY/chrY_golden.bam
+samtools index all.bam
+
+# FASTQs: gzip-concat preserves a valid gzip stream
+cat out/*/chr*_r1.fastq.gz > all_r1.fastq.gz
+cat out/*/chr*_r2.fastq.gz > all_r2.fastq.gz
+
+# VCFs: use bcftools concat for proper merging
+bcftools concat -o all.vcf.gz -Oz out/*/chr*_golden.vcf.gz
+bcftools index all.vcf.gz
+```
+
+This gives you whole-genome simulation in roughly `single-chromosome-time / nodes` wall time. For a 30× human genome with 24 nodes (each 64 cores), a per-chromosome run takes ~10 min and the array finishes in ~10 min wall, plus a few minutes for the final concat.
+
 ### Insert specific variants
 Simulate a whole genome dataset with only the variants in the provided VCF file using `-v` and setting mutation rate to 0 with `-M`.
 

From a4dfa0dd083c6517a286911c1c28d2b2a4d64b10 Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 17:58:09 -0500
Subject: [PATCH 11/15] Polish hot path: drop redundant deepcopies, one-pass
 apply_errors, byte-level FASTQ concat
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three small but additive improvements to the read-generation and stitch paths,
all behaviour-preserving:

1. Drop two `deepcopy(self.reference_segment)` calls in read.py.
   Biopython `Seq` is immutable; `apply_mutations`/`apply_errors` build their own
   `MutableSeq` working copies. The deepcopies just paid for a fresh allocation
   per read that nothing relied on. `convert_masking` and `finalize_read_and_write`
   now alias the masked reference directly.

2. Rewrite `apply_errors` as a single ascending-position pass.
   The previous implementation iterated errors in descending order doing one
   `np.concatenate` and one `MutableSeq` slice/concat per error — quadratic in
   errors-per-read. The new pass walks ascending, accumulating unchanged-range
   chunks plus error alternates into two lists (sequence + quality scores),
   then joins/concats once at the end. Quality-array semantics are preserved
   exactly (including the prior convention that an insertion anchor's original
   quality is replaced by `alt_len - 1` low-quality scores). Linear in total
   read length regardless of error count.

3. Byte-level FASTQ concat in stitch.
   The previous `concat` decompressed each per-chunk FASTQ via `gzip.open(rt)`
   and re-compressed into the destination gzip writer. Concatenated gzip
   streams are themselves valid gzip files (gzip spec §2.2), so we skip zlib
   entirely and just copy raw bytes. The destination's gzip handle is closed
   first so the byte-level write can truncate and overwrite cleanly.

Validated on ecoli 10x 4-thread under machine-busy conditions:
  Polish off:  ecoli_se 3:05.5
  Polish on:   ecoli_se 2:28.6  (~20% faster on a busy machine; comparable to
                                 v10's quiet-machine result of 1:56.9)

c_elegans 10x 4-thread (scale-test, 97 Mb genome, 35 auto-tuned chunks):
  19:16 wall, 304 MB peak RSS, 6,685,764 records, 0 sort violations.
  Stitch step: 5.3 s parallel. Confirms the cat-stitch architecture scales
  from ecoli's 14 Mb up an order of magnitude with no surprises.

Adds a c_elegans config to comparison/configs/ for future regression runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 neat/read_simulator/utils/read.py           | 69 ++++++++++++++-------
 neat/read_simulator/utils/stitch_outputs.py | 47 +++++++++-----
 2 files changed, 81 insertions(+), 35 deletions(-)

diff --git a/neat/read_simulator/utils/read.py b/neat/read_simulator/utils/read.py
index 08b13d23..0c992228 100644
--- a/neat/read_simulator/utils/read.py
+++ b/neat/read_simulator/utils/read.py
@@ -170,26 +170,49 @@ def update_quality_array(
 
     def apply_errors(self, quality_model: TraditionalQualityModel):
         """
-        This function applies errors to a sequence and calls the update_quality_array function after
-
-        :param quality_model: The error model for this run,
-        :return: None, The sequence, with errors applied
+        Apply this read's stored sequencing errors to read_sequence and quality_array in
+        a single pass.
+
+        self.errors is in descending position order (set up by get_sequencing_errors).
+        The original implementation iterated in that order, doing one np.concatenate +
+        one MutableSeq slice/concat per error — quadratic in errors-per-read. Here we
+        iterate ascending and append unchanged ranges plus error alternates into chunk
+        lists, then join once at the end. The result preserves the prior per-error
+        quality-array semantics (including the convention that insertion anchors lose
+        their original quality and gain alt_len-1 low-quality scores).
         """
-        mutated_sequence = MutableSeq(self.read_sequence)
-        for error in self.errors:
-            # Replace the entire ref sequence with the entire alt sequence
-            mutated_sequence = \
-                mutated_sequence[:error.location] + error.alt + mutated_sequence[error.location+len(error.ref):]
-            # update quality score for error
-            self.update_quality_array(
-                len(error.ref),
-                error.alt,
-                error.location,
-                "error",
-                list(quality_model.quality_scores),
-            )
+        if not self.errors:
+            return
+
+        errors_asc = list(reversed(self.errors))
+        low_score = min(quality_model.quality_scores)
+
+        seq_str = str(self.read_sequence)
+        seq_chunks: list[str] = []
+        q_chunks: list[np.ndarray] = []
+        prev_end = 0
+
+        for error in errors_asc:
+            loc = error.location
+            ref_len = len(error.ref)
+            alt_str = str(error.alt)
+            alt_len = len(alt_str)
+            seq_chunks.append(seq_str[prev_end:loc])
+            q_chunks.append(self.quality_array[prev_end:loc])
+            seq_chunks.append(alt_str)
+            if alt_len > 1:
+                q_chunks.append(np.full(alt_len - 1, low_score, dtype=int))
+            elif ref_len > 1 and alt_len == 1:
+                pass  # deletion — no new quality scores
+            else:
+                q_chunks.append(np.array([low_score], dtype=int))
+            prev_end = loc + ref_len
 
-        self.read_sequence = Seq(mutated_sequence)
+        seq_chunks.append(seq_str[prev_end:])
+        q_chunks.append(self.quality_array[prev_end:])
+
+        self.read_sequence = Seq(''.join(seq_chunks))
+        self.quality_array = np.concatenate(q_chunks)
 
     def apply_mutations(self, quality_scores: list, rng: Generator):
         """
@@ -341,8 +364,10 @@ def finalize_read_and_write(
         # It updates the quality array and reference segment in place, including reversing them, if appropriate
         self.convert_masking(qual_model)
 
-        # set the read sequence to match the reference, then modify
-        self.read_sequence = deepcopy(self.reference_segment)
+        # Start the read sequence as an alias of the masked reference. Biopython Seq is
+        # immutable; apply_mutations and apply_errors below produce new Seq objects via
+        # their own working copies, so no independent deepcopy is needed here.
+        self.read_sequence = self.reference_segment
 
         # Get errors for the read and update the quality score
         self.errors, self.padding = err_model.get_sequencing_errors(
@@ -385,7 +410,9 @@ def convert_masking(self, quality_model: TraditionalQualityModel):
         bad_score = min(quality_model.quality_scores)
         # we'll use generic human repeats, as commonly found in masked regions. We may refine this to make configurable
         repeat_bases = list("TTAGGG")
-        raw_sequence = deepcopy(self.reference_segment)
+        # Immutable Biopython Seq; the MutableSeq below is the working copy, so no
+        # deepcopy needed here.
+        raw_sequence = self.reference_segment
 
         start = raw_sequence.find('N')
         if start != -1:
diff --git a/neat/read_simulator/utils/stitch_outputs.py b/neat/read_simulator/utils/stitch_outputs.py
index efb3ac81..5e415137 100644
--- a/neat/read_simulator/utils/stitch_outputs.py
+++ b/neat/read_simulator/utils/stitch_outputs.py
@@ -19,15 +19,24 @@
 
 _LOG = logging.getLogger(__name__)
 
-def concat(files_to_join: List[Path], dest_file: gzip.GzipFile) -> None:
+def concat(files_to_join: List[Path], dest_path: Path) -> None:
+    """
+    Byte-level concatenation of gzip-compressed inputs into dest_path.
+
+    Concatenated gzip streams are themselves a valid gzip file (gzip spec section 2.2 —
+    a gzip file is a sequence of "members"; decompression yields the concatenation of
+    contents). Skipping zlib entirely is much faster than the prior path that
+    decompressed each input and re-compressed into the destination. The caller must
+    ensure any gzip handle on dest_path is closed before invoking this (we open
+    dest_path in 'wb' mode, which truncates).
+    """
     if not files_to_join:
-        # Nothing to do, and no error to throw
-        _LOG.warning(f"Concat called but there are no files to join: {files_to_join}" )
+        _LOG.warning(f"Concat called but there are no files to join: {files_to_join}")
         return
-
-    for f in files_to_join:
-        with gzip.open(f, 'rt') as in_f:
-            shutil.copyfileobj(in_f, dest_file)
+    with open(dest_path, 'wb') as out:
+        for f in files_to_join:
+            with open(f, 'rb') as in_f:
+                shutil.copyfileobj(in_f, out, length=4 * 1024 * 1024)
 
 def merge_vcfs(vcfs: List[Path], ofw: OutputFileWriter) -> None:
     dest = ofw.files_to_write[ofw.vcf]
@@ -77,18 +86,28 @@ def main(
             vcf_list.append(file_dict["vcf"])
         if file_dict["bam"]:
             bam_list.append(file_dict["bam"])
+    # The byte-level FASTQ concat opens dest_path in 'wb' mode (truncates), so we close
+    # any existing gzip handles on those paths first. flush_and_close_files (called later
+    # from runner) will skip handles it finds already closed.
+    for path_attr in (ofw.fq1, ofw.fq2):
+        if path_attr is not None and path_attr in ofw.files_to_write:
+            try:
+                ofw.files_to_write[path_attr].close()
+            except Exception:
+                pass
+
     # Run the per-output-type stitches concurrently. They write to independent files and
-    # spend most of their time in I/O (raw gzip/BGZF byte copy for FASTQ/BAM, gzip read
-    # for VCF dedup) — Python's GIL is released during those calls, so threads overlap.
-    # Total stitch wall ≈ max(fq, vcf, bam) instead of their sum. At supercomputer scale
-    # where BAM dominates and FASTQ/VCF are small, the saving is bounded by the BAM cat
-    # alone, but on smaller-output workloads (or many small chunks) the overlap matters.
+    # spend most of their time in I/O (raw byte copy for FASTQ/BAM, gzip read for VCF
+    # dedup) — Python's GIL is released during those calls, so threads overlap. Total
+    # stitch wall ≈ max(fq, vcf, bam) instead of their sum. At supercomputer scale where
+    # BAM dominates and FASTQ/VCF are small, the saving is bounded by the BAM cat alone,
+    # but on smaller-output workloads the overlap matters.
     stitch_start = time.time()
     work = []
     if fq1_list:
-        work.append(("fq1", lambda: concat(fq1_list, ofw.files_to_write[ofw.fq1])))
+        work.append(("fq1", lambda: concat(fq1_list, ofw.fq1)))
     if fq2_list:
-        work.append(("fq2", lambda: concat(fq2_list, ofw.files_to_write[ofw.fq2])))
+        work.append(("fq2", lambda: concat(fq2_list, ofw.fq2)))
     if vcf_list:
         work.append(("vcf", lambda: merge_vcfs(vcf_list, ofw)))
     if bam_list:

From 6c0c2cbff020cdf9b833540be0d7b537d40c92eb Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 17:59:14 -0500
Subject: [PATCH 12/15] ChangeLog: add c_elegans scale-test results and polish
 details
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The v4.4.3 entry now includes:
- c_elegans 10x SE scale-test (97 Mb genome, ~7× ecoli): 19:16 wall,
  304 MB peak RSS, 6.7M records, 0 sort violations, 5.3 s stitch.
- Polish details: one-pass apply_errors (eliminates quadratic per-error
  rebuild), deepcopy removals in convert_masking and finalize_read_and_write,
  byte-level FASTQ concat at stitch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 ChangeLog.md | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/ChangeLog.md b/ChangeLog.md
index d5c85df5..59d81580 100644
--- a/ChangeLog.md
+++ b/ChangeLog.md
@@ -21,6 +21,25 @@ chunk size now auto-tuning by default).
 - SE: 12:28 → 1:57 (6.4× faster, 53% less CPU)
 - PE: 20:12 → 1:44 (11.6× faster, 70% less CPU)
 
+**Scale-test (c_elegans 10× coverage, 4 threads, 100 Mb genome — ~7× the
+ecoli reference):**
+
+| Metric                  | Value                                                   |
+|-------------------------|---------------------------------------------------------|
+| Wall time               | 19:16                                                   |
+| Total CPU               | 4,085 s                                                 |
+| Peak resident memory    | 304 MB                                                  |
+| BAM records             | 6,685,764                                               |
+| BAM sort violations     | 0                                                       |
+| Stitch step (parallel)  | 5.3 s                                                   |
+| Auto-tuned chunk size   | 3.1 Mb (35 chunks)                                      |
+
+Scaling behaviour vs ecoli is ~linear in genome size as expected. The stitch
+step is bounded by raw disk I/O via `pysam.cat`, so it stays at single-digit
+seconds even as the BAM grows. Per-worker peak RSS is 304 MB ÷ 4 ≈ 76 MB,
+which is the reference segment + models — independent of chunk size and
+coverage.
+
 **What changed in the hot path:**
 - Vectorized error sampling in `get_sequencing_errors` — replaced a ~150-iteration
   per-read Python loop with batched numpy. Eliminated 28M `np.prod` calls per
@@ -30,6 +49,13 @@ chunk size now auto-tuning by default).
 - Replaced per-read `PairwiseAligner.align()` in `make_cigar` with a direct
   walker that builds the CIGAR from known error/mutation positions in O(L).
   99% of reads now skip alignment entirely.
+- Rewrote `apply_errors` as a single ascending-position pass. The previous
+  implementation did one `np.concatenate` and one `MutableSeq` slice/concat
+  per error — quadratic in errors-per-read. The new pass is linear regardless
+  of error count.
+- Removed redundant `deepcopy(self.reference_segment)` calls in
+  `convert_masking` and `finalize_read_and_write`. Biopython `Seq` is
+  immutable; the downstream operations make their own working copies.
 
 **What changed in the I/O path:**
 - Removed both `pysam.sort` calls. Per-worker BAMs are emitted coordinate-sorted
@@ -46,7 +72,11 @@ chunk size now auto-tuning by default).
   Workers no longer accumulate `reads_to_write` — per-worker memory is now
   bounded by reference segment + models, not by chunk size × coverage.
 - Stitch steps (FASTQ concat, VCF dedup, BAM cat) now run concurrently in
-  threads.
+  threads. On a single-disk system the wall is bounded by the BAM cat alone;
+  on parallel filesystems the overlap is more pronounced.
+- FASTQ stitch is now byte-level: per-chunk gzip streams are concatenated
+  without decompression / re-encode (concatenated gzip streams form a valid
+  gzip file per the spec).
 
 **Defaults and ergonomics:**
 - `parallel_block_size` now auto-tunes from genome length and thread count

From a1ebd4da1e111f6ae3ea7accde072ccbb33b59a0 Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 20:08:50 -0500
Subject: [PATCH 13/15] Refresh v4.4.3 benchmarks with clean post-polish
 numbers; flag stale READMEtable
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Re-ran ecoli SE+PE 4-thread 10x on a quiet machine after all the polish work
landed. The ChangeLog table now reflects the actual v4.4.3 final numbers:

  ecoli_se wall: 14:55 -> 1:35  (9.4x faster, was 7.6x with pre-polish numbers)
  ecoli_pe wall: 14:46 -> 1:35  (9.4x faster, was 8.5x)
  ecoli_se CPU:  3,227 s -> 331 s  (9.7x less)
  ecoli_pe CPU:  3,168 s -> 338 s  (9.4x less)

vs NEAT 2.1 single-threaded baseline:
  SE: 7.9x faster, 56% less CPU
  PE: 12.8x faster, 72% less CPU

Also added a note at the top of README "Estimated runtimes" pointing readers
to ChangeLog v4.4.3 for current numbers — the v4.4.0 runtimes in that
section are now ~5-10x off from current performance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 ChangeLog.md | 12 ++++++------
 README.md    |  2 ++
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/ChangeLog.md b/ChangeLog.md
index 59d81580..57c0f95e 100644
--- a/ChangeLog.md
+++ b/ChangeLog.md
@@ -7,10 +7,10 @@ chunk size now auto-tuning by default).
 
 | Metric                  | v4.4.2     | v4.4.3     | Improvement     |
 |-------------------------|------------|------------|-----------------|
-| ecoli SE wall time      | 14:55      | 1:57       | 7.6× faster     |
-| ecoli PE wall time      | 14:46      | 1:44       | 8.5× faster     |
-| ecoli SE total CPU      | 3,227 s    | 356 s      | 9.1× less       |
-| ecoli PE total CPU      | 3,168 s    | 368 s      | 8.6× less       |
+| ecoli SE wall time      | 14:55      | 1:35       | 9.4× faster     |
+| ecoli PE wall time      | 14:46      | 1:35       | 9.4× faster     |
+| ecoli SE total CPU      | 3,227 s    | 331 s      | 9.7× less       |
+| ecoli PE total CPU      | 3,168 s    | 338 s      | 9.4× less       |
 | Peak resident memory    | 549 MB     | 175 MB     | 3.1× less       |
 | Peak heap (memray)      | 1.27 GB    | 0.32 GB    | 4× less         |
 | Per-worker memory       | O(N×cov)   | O(1)       | bounded         |
@@ -18,8 +18,8 @@ chunk size now auto-tuning by default).
 | BAM correctness         | 0.06% dups | strict     | fixed           |
 
 **Versus NEAT 2.1 (single-threaded baseline):**
-- SE: 12:28 → 1:57 (6.4× faster, 53% less CPU)
-- PE: 20:12 → 1:44 (11.6× faster, 70% less CPU)
+- SE: 12:28 → 1:35 (7.9× faster, 56% less CPU)
+- PE: 20:12 → 1:35 (12.8× faster, 72% less CPU)
 
 **Scale-test (c_elegans 10× coverage, 4 threads, 100 Mb genome — ~7× the
 ecoli reference):**
diff --git a/README.md b/README.md
index e9384bb4..24f39272 100755
--- a/README.md
+++ b/README.md
@@ -268,6 +268,8 @@ seqkit shuffle -s 42 reads_r2.fastq.gz -o reads_r2.shuffled.fastq.gz
 
 ### Estimated runtimes
 
+> **Note:** The tables below are from the original NEAT 4.4 (v4.4.0) benchmark. NEAT v4.4.3 is roughly **9× faster on multi-threaded ecoli** and uses **~3× less memory** thanks to the performance work landed in that release. The relative shape of the tables (size scaling, contig vs. size mode tradeoffs) remains accurate, but absolute runtimes should be divided by ~5–10× for v4.4.3+. See ChangeLog v4.4.3 for a detailed before/after table on ecoli and a c_elegans scale-test.
+
 To give users a sense of how long `neat read-simulator` runs may take, we benchmarked NEAT 4.4 on several reference genomes. All runs were paired-end, with read length of 150 bp, coverage of 10, fragment mean of 300 bp, and fragment standard deviation of 50 bp. Runtimes are reported as the average across three unique runs (`Avg. time (ms)`) and the corresponding runtime in minutes. Cells marked with N/A indicate that NEAT was not able to run to completion.
 
 Benchmarks were run on a System76 Meerkat with a 13th Gen Intel Core i3-1315U (8 logical cores, up to 4.50 GHz) and 16 GiB RAM, using a 512 GB SSD and Ubuntu 24.04.3 LTS (Linux kernel 6.14). Actual runtimes will vary depending on your hardware.

From ba0baf332e8ef095cd5299c8d0ffa91b082bb9cf Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 20:22:02 -0500
Subject: [PATCH 14/15] Update tests to match v4.4.3 API changes; CI was
 failing on develop
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three sets of tests were exercising contracts that the v4.4.3 work changed:

1. tests/test_read_simulator/test_generate_reads.py (9 tests)
   generate_reads no longer returns a list of (Read, Read|None) pairs — it
   streams Read objects directly to the output_file_writer as part of the
   bounded-memory rework. Added a small `_CollectingOFW` stand-in that
   captures write_bam_record calls; the affected tests now enable
   produce_bam=True, pass a CollectingOFW, and assert on ofw.bam_records.
   For paired-end tests, the pair-structure assertions were replaced with
   forward/reverse-count assertions since pairs are no longer kept together
   in memory (mates are interleaved by position via a heap buffer).

2. tests/test_read_simulator/test_options.py (1 test)
   test_default_values expected parallel_block_size == 500000. The v4.4.3
   default is 0 — a sentinel that triggers auto-tuning from genome length
   and thread count in runner.py. Test updated to assert the sentinel.

3. tests/test_read_simulator/test_stitch_outputs.py (11 tests)
   - concat() now takes a destination Path (not a gzip handle) and writes
     raw bytes — concatenated gzip streams are valid gzip per the spec. The
     concat tests now pass a path and read it back via gzip.open.
   - merge_bam now calls pysam.cat (not pysam.merge + pysam.sort). The four
     old merge/sort/temp-file/chunk-batching tests were replaced with three
     cat-based equivalents (call_count, output path, full-input ordering).
   - The main() tests that asserted against StringIO buffers were updated
     to read back from the actual destination paths, since byte-level
     concat writes through the path, not through the handle.

All 599 tests pass locally (`pytest tests/`); should unblock the GitHub
Actions failures on develop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 .../test_generate_reads.py                    | 138 +++++++++++-------
 tests/test_read_simulator/test_options.py     |   4 +-
 .../test_stitch_outputs.py                    |  89 +++++------
 3 files changed, 139 insertions(+), 92 deletions(-)

diff --git a/tests/test_read_simulator/test_generate_reads.py b/tests/test_read_simulator/test_generate_reads.py
index d20ca1e1..7b63ed94 100644
--- a/tests/test_read_simulator/test_generate_reads.py
+++ b/tests/test_read_simulator/test_generate_reads.py
@@ -321,13 +321,37 @@ def _make_options(paired=False, seed=0):
     opts.read_len = _READ_LEN
     opts.paired_ended = paired
     opts.coverage = 5
+    # Enable BAM so generate_reads streams Read objects to the collecting OFW for
+    # inspection. FASTQ stays off — we don't need its bytes for these tests.
     opts.produce_fastq = False
-    opts.produce_bam = False
+    opts.produce_bam = True
     opts.produce_vcf = False
     opts.overwrite_output = True
     return opts
 
 
+class _CollectingOFW:
+    """
+    Minimal OutputFileWriter stand-in for generate_reads tests.
+
+    generate_reads now streams Read objects to its output_file_writer instead of
+    returning a list. Tests that want to inspect the generated reads enable
+    produce_bam=True and pass an instance of this class as the ofw. Every
+    ``write_bam_record`` call is captured into ``bam_records`` for assertion.
+    """
+    def __init__(self):
+        # generate_reads dereferences ofw.bam and ofw.files_to_write[ofw.bam].
+        # Provide a stub handle so those lookups succeed; we don't use the handle.
+        self.bam = "_collecting_bam"
+        self.fq1 = None
+        self.fq2 = None
+        self.files_to_write = {self.bam: SimpleNamespace(write=lambda *a, **kw: None)}
+        self.bam_records = []
+
+    def write_bam_record(self, read, contig_id, bam_handle, read_length):
+        self.bam_records.append(read)
+
+
 def _make_models(read_len=_READ_LEN, frag_mean=300):
     error_model = SequencingErrorModel(read_length=read_len)
     qual_model = TraditionalQualityModel()
@@ -407,16 +431,17 @@ def test_generate_reads_single_ended_returns_read_none_pairs():
     err, qual, frag = _make_models()
     opts = _make_options(paired=False)
     cv = ContigVariants()
+    ofw = _CollectingOFW()
 
-    results = generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
-                             _all_span_targeted(), _nothing_discarded(),
-                             opts, None, "chr1", 0, 0)
+    generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
+                   _all_span_targeted(), _nothing_discarded(),
+                   opts, ofw, "chr1", 0, 0)
 
-    assert isinstance(results, list)
-    assert len(results) > 0
-    for read1, read2 in results:
-        assert isinstance(read1, Read)
-        assert read2 is None
+    # In SE mode only read1 records are emitted — all forward strand.
+    assert len(ofw.bam_records) > 0
+    for read in ofw.bam_records:
+        assert isinstance(read, Read)
+        assert read.is_reverse is False
 
 
 def test_generate_reads_paired_ended_returns_read_read_pairs():
@@ -424,15 +449,21 @@ def test_generate_reads_paired_ended_returns_read_read_pairs():
     err, qual, frag = _make_models()
     opts = _make_options(paired=True)
     cv = ContigVariants()
+    ofw = _CollectingOFW()
 
-    results = generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
-                             _all_span_targeted(), _nothing_discarded(),
-                             opts, None, "chr1", 0, 0)
+    generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
+                   _all_span_targeted(), _nothing_discarded(),
+                   opts, ofw, "chr1", 0, 0)
 
-    assert len(results) > 0
-    for read1, read2 in results:
-        assert isinstance(read1, Read)
-        assert isinstance(read2, Read)
+    # PE mode emits both forward (r1) and reverse (r2) records, interleaved by
+    # position via the heap-buffer in generate_reads. Verify both strands appear.
+    assert len(ofw.bam_records) > 0
+    forwards = [r for r in ofw.bam_records if not r.is_reverse]
+    reverses = [r for r in ofw.bam_records if r.is_reverse]
+    assert len(forwards) > 0
+    assert len(reverses) > 0
+    # Each fragment contributes one r1 and one r2.
+    assert len(forwards) == len(reverses)
 
 
 def test_generate_reads_read_length_matches_options():
@@ -440,13 +471,15 @@ def test_generate_reads_read_length_matches_options():
     err, qual, frag = _make_models()
     opts = _make_options(paired=False)
     cv = ContigVariants()
+    ofw = _CollectingOFW()
 
-    results = generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
-                             _all_span_targeted(), _nothing_discarded(),
-                             opts, None, "chr1", 0, 0)
+    generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
+                   _all_span_targeted(), _nothing_discarded(),
+                   opts, ofw, "chr1", 0, 0)
 
-    for read1, _ in results:
-        assert len(read1.read_sequence) == _READ_LEN
+    assert len(ofw.bam_records) > 0
+    for read in ofw.bam_records:
+        assert len(read.read_sequence) == _READ_LEN
 
 
 # ---------------------------------------------------------------------------
@@ -459,13 +492,14 @@ def test_generate_reads_targeted_region_flag_false_filters_all():
     err, qual, frag = _make_models()
     opts = _make_options(paired=False)
     cv = ContigVariants()
+    ofw = _CollectingOFW()
     no_target = [(0, _SPAN, False)]
 
-    results = generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
-                             no_target, _nothing_discarded(),
-                             opts, None, "chr1", 0, 0)
+    generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
+                   no_target, _nothing_discarded(),
+                   opts, ofw, "chr1", 0, 0)
 
-    assert results == []
+    assert ofw.bam_records == []
 
 
 def test_generate_reads_discard_region_removes_all():
@@ -474,13 +508,14 @@ def test_generate_reads_discard_region_removes_all():
     err, qual, frag = _make_models()
     opts = _make_options(paired=False)
     cv = ContigVariants()
+    ofw = _CollectingOFW()
     discard_all = [(0, _SPAN, True)]
 
-    results = generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
-                             _all_span_targeted(), discard_all,
-                             opts, None, "chr1", 0, 0)
+    generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
+                   _all_span_targeted(), discard_all,
+                   opts, ofw, "chr1", 0, 0)
 
-    assert results == []
+    assert ofw.bam_records == []
 
 
 def test_generate_reads_discard_flag_false_keeps_reads():
@@ -489,12 +524,13 @@ def test_generate_reads_discard_flag_false_keeps_reads():
     err, qual, frag = _make_models()
     opts = _make_options(paired=False)
     cv = ContigVariants()
+    ofw = _CollectingOFW()
 
-    results = generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
-                             _all_span_targeted(), _nothing_discarded(),
-                             opts, None, "chr1", 0, 0)
+    generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
+                   _all_span_targeted(), _nothing_discarded(),
+                   opts, ofw, "chr1", 0, 0)
 
-    assert len(results) > 0
+    assert len(ofw.bam_records) > 0
 
 
 # ---------------------------------------------------------------------------
@@ -506,6 +542,7 @@ def test_generate_reads_variants_populated_on_reads():
     ref = _make_reference()
     err, qual, frag = _make_models()
     opts = _make_options(paired=False)
+    ofw = _CollectingOFW()
 
     cv = ContigVariants()
     snv = SingleNucleotideVariant(
@@ -516,11 +553,11 @@ def test_generate_reads_variants_populated_on_reads():
     )
     cv.add_variant(snv)
 
-    results = generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
-                             _all_span_targeted(), _nothing_discarded(),
-                             opts, None, "chr1", 0, 0)
+    generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
+                   _all_span_targeted(), _nothing_discarded(),
+                   opts, ofw, "chr1", 0, 0)
 
-    reads_with_mutations = [r1 for r1, _ in results if r1.mutations]
+    reads_with_mutations = [r for r in ofw.bam_records if r.mutations]
     assert len(reads_with_mutations) > 0
 
 
@@ -540,30 +577,33 @@ def test_generate_reads_paired_discard_region_removes_all():
     err, qual, frag = _make_models()
     opts = _make_options(paired=True)
     cv = ContigVariants()
+    ofw = _CollectingOFW()
     discard_all = [(0, _SPAN, True)]
 
-    results = generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
-                             _all_span_targeted(), discard_all,
-                             opts, None, "chr1", 0, 0)
+    generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
+                   _all_span_targeted(), discard_all,
+                   opts, ofw, "chr1", 0, 0)
 
-    assert results == []
+    assert ofw.bam_records == []
 
 
 def test_generate_reads_paired_no_discard_produces_read_pairs():
-    """Paired-end run without discard produces (Read, Read) pairs (regression guard)."""
+    """Paired-end run without discard produces both forward and reverse read records."""
     ref = _make_reference()
     err, qual, frag = _make_models()
     opts = _make_options(paired=True)
     cv = ContigVariants()
+    ofw = _CollectingOFW()
 
-    results = generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
-                             _all_span_targeted(), _nothing_discarded(),
-                             opts, None, "chr1", 0, 0)
+    generate_reads(0, ref, err, 3, qual, frag, get_uniform_gc_model(), cv,
+                   _all_span_targeted(), _nothing_discarded(),
+                   opts, ofw, "chr1", 0, 0)
 
-    assert len(results) > 0
-    for read1, read2 in results:
-        assert isinstance(read1, Read)
-        assert isinstance(read2, Read)
+    forwards = [r for r in ofw.bam_records if not r.is_reverse]
+    reverses = [r for r in ofw.bam_records if r.is_reverse]
+    assert len(forwards) > 0
+    assert len(reverses) > 0
+    assert len(forwards) == len(reverses)
 
 
 # ---------------------------------------------------------------------------
diff --git a/tests/test_read_simulator/test_options.py b/tests/test_read_simulator/test_options.py
index f08a49f7..33e9c17c 100644
--- a/tests/test_read_simulator/test_options.py
+++ b/tests/test_read_simulator/test_options.py
@@ -158,7 +158,9 @@ def test_default_values():
     assert opts.quality_offset == 33
     assert opts.threads == 1
     assert opts.parallel_mode == "contig"
-    assert opts.parallel_block_size == 500000
+    # Default is 0 (sentinel for auto-tune from genome length and thread count).
+    # An explicit positive int in YAML overrides; see runner for the auto-tune logic.
+    assert opts.parallel_block_size == 0
     assert opts.cleanup_splits is True
     assert opts.reuse_splits is False
     assert opts.overwrite_output is False
diff --git a/tests/test_read_simulator/test_stitch_outputs.py b/tests/test_read_simulator/test_stitch_outputs.py
index ed830f85..a6dd517b 100644
--- a/tests/test_read_simulator/test_stitch_outputs.py
+++ b/tests/test_read_simulator/test_stitch_outputs.py
@@ -50,42 +50,47 @@ def _make_ofw(tmp_path: Path, vcf_path: Path = None):
     return ofw
 
 
-# concat
+# concat (byte-level gzip concatenation — concatenated gzip streams are valid gzip)
 
 def test_concat_single_file(tmp_path):
     src = _write_gz(tmp_path / "a.gz", "hello\n")
-    dest = io.StringIO()
+    dest = tmp_path / "out.gz"
     concat([src], dest)
-    assert dest.getvalue() == "hello\n"
+    with gzip.open(dest, "rt") as f:
+        assert f.read() == "hello\n"
 
 
 def test_concat_multiple_files(tmp_path):
     a = _write_gz(tmp_path / "a.gz", "line1\n")
     b = _write_gz(tmp_path / "b.gz", "line2\n")
-    dest = io.StringIO()
+    dest = tmp_path / "out.gz"
     concat([a, b], dest)
-    assert dest.getvalue() == "line1\nline2\n"
+    with gzip.open(dest, "rt") as f:
+        assert f.read() == "line1\nline2\n"
 
 
 def test_concat_empty_list(tmp_path):
-    dest = io.StringIO()
+    """Empty input list: concat is a no-op and dest is not created."""
+    dest = tmp_path / "out.gz"
     concat([], dest)
-    assert dest.getvalue() == ""
+    assert not dest.exists()
 
 
 def test_concat_preserves_content(tmp_path):
     content = "ACGT\nACGT\nACGT\n"
     src = _write_gz(tmp_path / "reads.gz", content)
-    dest = io.StringIO()
+    dest = tmp_path / "out.gz"
     concat([src], dest)
-    assert dest.getvalue() == content
+    with gzip.open(dest, "rt") as f:
+        assert f.read() == content
 
 
 def test_concat_order_is_preserved(tmp_path):
     files = [_write_gz(tmp_path / f"{i}.gz", f"chunk{i}\n") for i in range(5)]
-    dest = io.StringIO()
+    dest = tmp_path / "out.gz"
     concat(files, dest)
-    result = dest.getvalue()
+    with gzip.open(dest, "rt") as f:
+        result = f.read()
     positions = [result.index(f"chunk{i}") for i in range(5)]
     assert positions == sorted(positions)
 
@@ -199,54 +204,46 @@ def test_find_dups_exact_duplicate_rejected(tmp_path):
     assert cv.add_variant(v2) == 1
 
 
-# merge_bam
+# merge_bam (now a raw BGZF concatenation via pysam.cat — see stitch_outputs)
 
-def test_merge_bam_calls_pysam_merge_and_sort(tmp_path):
+def test_merge_bam_calls_pysam_cat(tmp_path):
     ofw = _make_ofw(tmp_path)
     bam_files = [tmp_path / f"{i}.bam" for i in range(3)]
 
     with patch("neat.read_simulator.utils.stitch_outputs.pysam") as mock_pysam:
         merge_bam(bam_files, ofw, threads=4)
 
-    # pysam.merge should have been called at least twice (once per chunk + final)
-    assert mock_pysam.merge.call_count >= 2
-    # pysam.sort should have been called once
-    mock_pysam.sort.assert_called_once()
+    # Per-worker BAMs are coord-sorted and cover non-overlapping ranges, so a single
+    # pysam.cat is sufficient — no merge/sort needed.
+    mock_pysam.cat.assert_called_once()
+    mock_pysam.merge.assert_not_called()
+    mock_pysam.sort.assert_not_called()
 
 
-def test_merge_bam_sort_uses_output_bam_path(tmp_path):
+def test_merge_bam_cat_uses_output_bam_path(tmp_path):
     ofw = _make_ofw(tmp_path)
     bam_files = [tmp_path / "a.bam"]
 
     with patch("neat.read_simulator.utils.stitch_outputs.pysam") as mock_pysam:
         merge_bam(bam_files, ofw, threads=2)
 
-    sort_args = mock_pysam.sort.call_args[0]
-    assert str(ofw.bam) in sort_args
+    cat_args = mock_pysam.cat.call_args[0]
+    assert "-o" in cat_args
+    assert str(ofw.bam) in cat_args
 
 
-def test_merge_bam_temp_file_cleaned_up(tmp_path):
-    ofw = _make_ofw(tmp_path)
-    bam_files = [tmp_path / "a.bam"]
-    temp_merged = ofw.tmp_dir / "temp_merged.bam"
-
-    with patch("neat.read_simulator.utils.stitch_outputs.pysam"):
-        merge_bam(bam_files, ofw, threads=1)
-
-    # temp_merged.bam should have been unlinked (missing_ok=True means no error if absent)
-    assert not temp_merged.exists()
-
-
-def test_merge_bam_chunks_large_bam_list(tmp_path):
-    """More than 500 BAMs triggers chunked intermediate merges."""
+def test_merge_bam_passes_all_inputs_to_cat(tmp_path):
+    """Cat is given every per-worker BAM as a positional argument, in order."""
     ofw = _make_ofw(tmp_path)
     bam_files = [tmp_path / f"{i}.bam" for i in range(600)]
 
     with patch("neat.read_simulator.utils.stitch_outputs.pysam") as mock_pysam:
         merge_bam(bam_files, ofw, threads=1)
 
-    # Two chunks (0–499, 500–599) → 2 intermediate merges + 1 final = 3 total
-    assert mock_pysam.merge.call_count == 3
+    cat_args = mock_pysam.cat.call_args[0]
+    # cat call shape: ("-o", out_path, *input_paths)
+    passed_inputs = [a for a in cat_args if a not in ("-o", str(ofw.bam))]
+    assert passed_inputs == [str(p) for p in bam_files]
 
 
 # main
@@ -260,7 +257,10 @@ def test_main_fq1_only(tmp_path):
     src = _write_gz(tmp_path / "chunk.fq1.gz", "@read1\nACGT\n+\nIIII\n")
     output_files = [(0, _file_dict(fq1=src))]
     main(ofw, output_files)
-    assert "read1" in ofw._fq1_buf.getvalue()
+    # Byte-level concat writes raw bytes to ofw.fq1 (the path), not through the
+    # StringIO handle. Read the actual file back.
+    with gzip.open(ofw.fq1, "rt") as fh:
+        assert "read1" in fh.read()
 
 
 def test_main_fq1_and_fq2(tmp_path):
@@ -269,8 +269,10 @@ def test_main_fq1_and_fq2(tmp_path):
     src2 = _write_gz(tmp_path / "c.fq2.gz", "@r2\nTTGG\n+\nIIII\n")
     output_files = [(0, _file_dict(fq1=src1, fq2=src2))]
     main(ofw, output_files)
-    assert "r1" in ofw._fq1_buf.getvalue()
-    assert "r2" in ofw._fq2_buf.getvalue()
+    with gzip.open(ofw.fq1, "rt") as fh:
+        assert "r1" in fh.read()
+    with gzip.open(ofw.fq2, "rt") as fh:
+        assert "r2" in fh.read()
 
 
 def test_main_vcf(tmp_path):
@@ -288,8 +290,10 @@ def test_main_none_files_not_concatenated(tmp_path):
     ofw = _make_ofw(tmp_path)
     output_files = [(0, _file_dict())]  # all None
     main(ofw, output_files)  # should not raise
-    assert ofw._fq1_buf.getvalue() == ""
-    assert ofw._fq2_buf.getvalue() == ""
+    # No inputs → byte-level concat is a no-op and the destination paths are not
+    # created. The VCF buffer remains untouched (no merge_vcfs run).
+    assert not ofw.fq1.exists()
+    assert not ofw.fq2.exists()
     assert ofw._vcf_buf.getvalue() == ""
 
 
@@ -300,7 +304,8 @@ def test_main_multiple_threads(tmp_path):
         for i in range(3)
     ]
     main(ofw, chunks)
-    result = ofw._fq1_buf.getvalue()
+    with gzip.open(ofw.fq1, "rt") as fh:
+        result = fh.read()
     for i in range(3):
         assert f"chunk{i}" in result
 

From 2f561401061ba222ff7deb520b259de13eb7cf53 Mon Sep 17 00:00:00 2001
From: Joshua Allen <jallen17@illinois.edu>
Date: Mon, 18 May 2026 20:22:14 -0500
Subject: [PATCH 15/15] Bump version to 4.4.3 and fix ChangeLog spelling

- pyproject.toml: 4.4.2 -> 4.4.3 for the release.
- ChangeLog.md: behaviour -> behavior (consistency with the rest of the file).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 ChangeLog.md   | 2 +-
 pyproject.toml | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/ChangeLog.md b/ChangeLog.md
index 57c0f95e..6e2d724a 100644
--- a/ChangeLog.md
+++ b/ChangeLog.md
@@ -34,7 +34,7 @@ ecoli reference):**
 | Stitch step (parallel)  | 5.3 s                                                   |
 | Auto-tuned chunk size   | 3.1 Mb (35 chunks)                                      |
 
-Scaling behaviour vs ecoli is ~linear in genome size as expected. The stitch
+Scaling behavior vs ecoli is ~linear in genome size as expected. The stitch
 step is bounded by raw disk I/O via `pysam.cat`, so it stays at single-digit
 seconds even as the BAM grows. Per-worker peak RSS is 304 MB ÷ 4 ≈ 76 MB,
 which is the reference segment + models — independent of chunk size and
diff --git a/pyproject.toml b/pyproject.toml
index 09d66eba..ec50b6db 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "neat-genreads"
-version = "4.4.2"
+version = "4.4.3"
 description = "NGS Simulation toolkit"
 readme = "README.md"
 authors = ["Joshua Allen <jallen17@illinois.edu>", "Keshav Gandhi <krg3@illinois.edu>"]