Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
16 commits
Select commit Hold shift + click to select a range
be411aa
Speed up read simulation by ~2.4x on multi-thread ecoli benchmark
joshfactorial May 18, 2026
6759026
Vectorize error sampling and replace CIGAR aligner with direct walker
joshfactorial May 18, 2026
908115e
Drop the post-sampling shuffle in cover_dataset
joshfactorial May 18, 2026
08730da
Sort BAM records at the worker write boundary; drop per-worker pysam.…
joshfactorial May 18, 2026
3f7a208
Drop the final pysam.sort in merge_bam; rely on pysam.merge ordering
joshfactorial May 18, 2026
61d678d
Vectorize get_quality_scores; eliminate per-base scalar rng.normal calls
joshfactorial May 18, 2026
6df862e
Stream BAM and FASTQ inline; drop the per-chunk reads_to_write buffer
joshfactorial May 18, 2026
4f69ae1
Stitch via pysam.cat with non-overlapping chunks
joshfactorial May 18, 2026
363b786
Auto-tune chunk size from genome length; parallelize stitch steps
joshfactorial May 18, 2026
9eb182d
Document multi-node HPC deployment; record v4.4.3 perf overhaul in Ch…
joshfactorial May 18, 2026
a4dfa0d
Polish hot path: drop redundant deepcopies, one-pass apply_errors, by…
joshfactorial May 18, 2026
6c0c2cb
ChangeLog: add c_elegans scale-test results and polish details
joshfactorial May 18, 2026
a1ebd4d
Refresh v4.4.3 benchmarks with clean post-polish numbers; flag stale …
joshfactorial May 19, 2026
ba0baf3
Update tests to match v4.4.3 API changes; CI was failing on develop
joshfactorial May 19, 2026
2f56140
Bump version to 4.4.3 and fix ChangeLog spelling
joshfactorial May 19, 2026
34a2e7e
Merge branch 'main' into develop
joshfactorial May 19, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
97 changes: 97 additions & 0 deletions ChangeLog.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,100 @@
# NEAT v4.4.3
Major performance and memory overhaul focused on making NEAT viable for large
genomes on supercomputing hardware. No user-visible API changes (other than
chunk size now auto-tuning by default).

**Benchmark (ecoli 10× coverage, 4 threads, identical configs):**

| Metric | v4.4.2 | v4.4.3 | Improvement |
|-------------------------|------------|------------|-----------------|
| ecoli SE wall time | 14:55 | 1:35 | 9.4× faster |
| ecoli PE wall time | 14:46 | 1:35 | 9.4× faster |
| ecoli SE total CPU | 3,227 s | 331 s | 9.7× less |
| ecoli PE total CPU | 3,168 s | 338 s | 9.4× less |
| Peak resident memory | 549 MB | 175 MB | 3.1× less |
| Peak heap (memray) | 1.27 GB | 0.32 GB | 4× less |
| Per-worker memory | O(N×cov) | O(1) | bounded |
| `pysam.sort` calls | 2 | 0 | gone |
| BAM correctness | 0.06% dups | strict | fixed |

**Versus NEAT 2.1 (single-threaded baseline):**
- SE: 12:28 → 1:35 (7.9× faster, 56% less CPU)
- PE: 20:12 → 1:35 (12.8× faster, 72% less CPU)

**Scale-test (c_elegans 10× coverage, 4 threads, 100 Mb genome — ~7× the
ecoli reference):**

| Metric | Value |
|-------------------------|---------------------------------------------------------|
| Wall time | 19:16 |
| Total CPU | 4,085 s |
| Peak resident memory | 304 MB |
| BAM records | 6,685,764 |
| BAM sort violations | 0 |
| Stitch step (parallel) | 5.3 s |
| Auto-tuned chunk size | 3.1 Mb (35 chunks) |

Scaling behavior vs ecoli is ~linear in genome size as expected. The stitch
step is bounded by raw disk I/O via `pysam.cat`, so it stays at single-digit
seconds even as the BAM grows. Per-worker peak RSS is 304 MB ÷ 4 ≈ 76 MB,
which is the reference segment + models — independent of chunk size and
coverage.

**What changed in the hot path:**
- Vectorized error sampling in `get_sequencing_errors` — replaced a ~150-iteration
per-read Python loop with batched numpy. Eliminated 28M `np.prod` calls per
185k-read run.
- Vectorized `get_quality_scores` — replaced per-base scalar `rng.normal` with
one batched call.
- Replaced per-read `PairwiseAligner.align()` in `make_cigar` with a direct
walker that builds the CIGAR from known error/mutation positions in O(L).
99% of reads now skip alignment entirely.
- Rewrote `apply_errors` as a single ascending-position pass. The previous
implementation did one `np.concatenate` and one `MutableSeq` slice/concat
per error — quadratic in errors-per-read. The new pass is linear regardless
of error count.
- Removed redundant `deepcopy(self.reference_segment)` calls in
`convert_masking` and `finalize_read_and_write`. Biopython `Seq` is
immutable; the downstream operations make their own working copies.

**What changed in the I/O path:**
- Removed both `pysam.sort` calls. Per-worker BAMs are emitted coordinate-sorted
by construction; `pysam.merge` of sorted inputs already produces sorted output.
The final sort allocated a 1 GB buffer that dominated peak memory.
- Replaced `pysam.merge` with `pysam.cat` for the final stitch. cat does a raw
BGZF concatenation (no decompression / re-encode), bounded by raw disk I/O
instead of BGZF rate. At human-30× scale this is the difference between a
multi-hour stitch and a multi-minute one.
- Each chunk now owns a non-overlapping reference range for read1 placement
(`responsibility_length`), enabling the cat-based stitch and eliminating
~0.06% over-coverage in chunk-overlap regions.
- Streamed FASTQ and BAM records directly to output during read generation.
Workers no longer accumulate `reads_to_write` — per-worker memory is now
bounded by reference segment + models, not by chunk size × coverage.
- Stitch steps (FASTQ concat, VCF dedup, BAM cat) now run concurrently in
threads. On a single-disk system the wall is bounded by the BAM cat alone;
on parallel filesystems the overlap is more pronounced.
- FASTQ stitch is now byte-level: per-chunk gzip streams are concatenated
without decompression / re-encode (concatenated gzip streams form a valid
gzip file per the spec).

**Defaults and ergonomics:**
- `parallel_block_size` now auto-tunes from genome length and thread count
(target: ~8 chunks per thread). For small bacterial genomes this matches the
old hardcoded 500 kb; for human-scale genomes it produces ~6 Mb chunks
instead of ~500 kb, dramatically reducing stitch overhead. Specify the option
explicitly to override.
- FASTQ output is no longer shuffled; reads come out in the natural sampling
order. Pipe through `seqkit shuffle` if you need a uniform shuffle (documented
in README).
- Added a "Multi-node deployment on HPC clusters" section to the README
showing a SLURM array-job pattern for whole-genome simulation across nodes.

**Caveats:**
- Several of the vectorization fixes change how the PRNG stream is consumed.
Same seed will produce statistically equivalent reads, but not bit-identical
to v4.4.2. Re-baseline any regression tests that compared exact output.

# NEAT v4.4.2
- Added GC bias modeling to generate reads and a function to create a GC bias model from real data.
- Added improvements and efficiency upgrades to generate-reads.
Expand Down
77 changes: 75 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -251,8 +251,25 @@ Features:
- Output a BAM file with the 'golden' set of aligned reads. These indicate where each read originated and how it should be aligned with the reference
- Create paired tumour/normal datasets using characteristics learned from real tumour data

### Output ordering

The BAM file NEAT produces is coordinate-sorted (by construction at write time — no separate sort pass is run, which used to allocate ~1 GB of sort buffer at stitch time).

The FASTQ files are written in the natural fragment-sampling order. This is *roughly* random — fragments are drawn from batched random positions — but is not a strict uniform shuffle. If your downstream tooling assumes a real-sequencer-style shuffle, pipe the FASTQ through [`seqkit shuffle`](https://bioinf.shenwei.me/seqkit/usage/#shuffle):

```sh
# Single-end
seqkit shuffle reads.fastq.gz -o reads.shuffled.fastq.gz

# Paired-end — use a shared seed so the two files stay mate-aligned
seqkit shuffle -s 42 reads_r1.fastq.gz -o reads_r1.shuffled.fastq.gz
seqkit shuffle -s 42 reads_r2.fastq.gz -o reads_r2.shuffled.fastq.gz
```

### Estimated runtimes

> **Note:** The tables below are from the original NEAT 4.4 (v4.4.0) benchmark. NEAT v4.4.3 is roughly **9× faster on multi-threaded ecoli** and uses **~3× less memory** thanks to the performance work landed in that release. The relative shape of the tables (size scaling, contig vs. size mode tradeoffs) remains accurate, but absolute runtimes should be divided by ~5–10× for v4.4.3+. See ChangeLog v4.4.3 for a detailed before/after table on ecoli and a c_elegans scale-test.

To give users a sense of how long `neat read-simulator` runs may take, we benchmarked NEAT 4.4 on several reference genomes. All runs were paired-end, with read length of 150 bp, coverage of 10, fragment mean of 300 bp, and fragment standard deviation of 50 bp. Runtimes are reported as the average across three unique runs (`Avg. time (ms)`) and the corresponding runtime in minutes. Cells marked with N/A indicate that NEAT was not able to run to completion.

Benchmarks were run on a System76 Meerkat with a 13th Gen Intel Core i3-1315U (8 logical cores, up to 4.50 GHz) and 16 GiB RAM, using a 512 GB SSD and Ubuntu 24.04.3 LTS (Linux kernel 6.14). Actual runtimes will vary depending on your hardware.
Expand Down Expand Up @@ -347,7 +364,7 @@ neat read-simulator \

```
### Parallelizing simulation
In this case, you would want to split the contig into blocks, rather than reading by contig. Even in single-threaded mode, this is likely to be faster. The default block size of 500,000 yields results quickly on a variety of datasets and can be easily modified to meet your requirements.
Split the contig into blocks rather than reading by contig. Even in single-threaded mode this is likely to be faster. The chunk size auto-tunes from total genome length and thread count, targeting roughly 8 chunks per thread for good load balancing — on small bacterial genomes you get ~500 kb chunks (similar to NEAT's old hardcoded default), on human-scale genomes you get ~6 Mb chunks (a few hundred total instead of thousands). Specify `parallel_block_size` explicitly if you want to override.

Also, we demonstrate the situation where you do not want any logs produced:

Expand All @@ -361,7 +378,8 @@ fragment_mean: 350
fragment_st_dev: 50
threads: 12
parallel_mode: size
parallel_block_size: 500000
# parallel_block_size omitted: auto-tuned from genome length and thread count.
# Set explicitly (e.g. parallel_block_size: 1000000) only if you have a reason.
```
Then run with the command:
```
Expand All @@ -371,6 +389,61 @@ neat read-simulator \
-o /home/me/simulated_reads/
```

### Multi-node deployment on HPC clusters

NEAT runs on a single node using Python's `multiprocessing`. To use multiple nodes on a supercomputer, dispatch one NEAT job per contig (or contig group) as a job-array element and concatenate the outputs afterwards. Each array task gets its own node and uses all available cores on it. NEAT itself doesn't need to know about the cluster.

SLURM example with one task per human chromosome (24 tasks):

```bash
#!/bin/bash
#SBATCH --job-name=neat-array
#SBATCH --array=1-24
#SBATCH --cpus-per-task=64
#SBATCH --mem=64G
#SBATCH --time=04:00:00

CHROM=$(awk "NR==$SLURM_ARRAY_TASK_ID" chroms.txt) # e.g. chr1, chr2, ...
samtools faidx hg38.fa "$CHROM" > "ref_${CHROM}.fa" # one chromosome per task

cat > "config_${CHROM}.yml" <<YAML
reference: ref_${CHROM}.fa
read_len: 150
coverage: 30
paired_ended: true
fragment_mean: 400
fragment_st_dev: 50
produce_bam: true
produce_fastq: true
produce_vcf: true
threads: ${SLURM_CPUS_PER_TASK}
parallel_mode: size
YAML

neat read-simulator -c "config_${CHROM}.yml" \
-o "out/${CHROM}/" \
-p "${CHROM}"
```

After the array completes, concatenate the per-chromosome outputs:

```bash
# BAMs are already coordinate-sorted within each chromosome; cat in chromosome order
# produces a globally-sorted whole-genome BAM with no further sort needed.
samtools cat -o all.bam out/chr1/chr1_golden.bam out/chr2/chr2_golden.bam ... out/chrY/chrY_golden.bam
samtools index all.bam

# FASTQs: gzip-concat preserves a valid gzip stream
cat out/*/chr*_r1.fastq.gz > all_r1.fastq.gz
cat out/*/chr*_r2.fastq.gz > all_r2.fastq.gz

# VCFs: use bcftools concat for proper merging
bcftools concat -o all.vcf.gz -Oz out/*/chr*_golden.vcf.gz
bcftools index all.vcf.gz
```

This gives you whole-genome simulation in roughly `single-chromosome-time / nodes` wall time. For a 30× human genome with 24 nodes (each 64 cores), a per-chromosome run takes ~10 min and the array finishes in ~10 min wall, plus a few minutes for the final concat.

### Insert specific variants
Simulate a whole genome dataset with only the variants in the provided VCF file using `-v` and setting mutation rate to 0 with `-M`.

Expand Down
81 changes: 40 additions & 41 deletions neat/models/error_models.py
Original file line number Diff line number Diff line change
Expand Up @@ -81,32 +81,27 @@ def get_quality_scores(
:return: An array of quality scores.
"""
if self.uniform_quality_score:
return np.array([self.uniform_quality_score] * length)
else:
if length == model_read_length:
quality_index_map = np.arange(model_read_length)
else:
# This is basically a way to evenly spread the distribution across the number of bases in the read
quality_index_map = np.array(
[max([0, model_read_length * n // length]) for n in range(length)]
)

temp_qual_array = []
for i in quality_index_map:
score = rng.normal(
self.quality_score_probabilities[i][0],
scale=self.quality_score_probabilities[i][1]
)
# make sure score is in range and an int
score = round(score)
if score > 42:
score = 42
if score < 1:
score = 1
return np.full(length, self.uniform_quality_score, dtype=int)

temp_qual_array.append(score)
# Map each position in the read onto a row of the (model_read_length, 2) score
# parameters table. When read length matches the model exactly this is just an
# identity; otherwise we evenly spread the model distribution across the read.
if length == model_read_length:
quality_index_map = np.arange(model_read_length, dtype=np.int64)
else:
quality_index_map = np.maximum(
0, (model_read_length * np.arange(length, dtype=np.int64)) // length
)

return np.array(temp_qual_array)
# Batched per-position normal draws — one rng.normal call returning `length` draws
# with element-wise (mean, scale) parameters. Replaces a ~150-iteration Python
# loop calling rng.normal scalar-per-base. Same statistical distribution but the
# PRNG stream is consumed in a different order, so seeded outputs are not
# bit-identical to the prior scalar-loop implementation.
means = self.quality_score_probabilities[quality_index_map, 0]
scales = self.quality_score_probabilities[quality_index_map, 1]
scores = rng.normal(means, scales)
return np.clip(np.rint(scores).astype(int), 1, 42)


class MarkovQualityModel:
Expand Down Expand Up @@ -182,30 +177,34 @@ def get_sequencing_errors(
:return: Modified sequence and associated quality scores
"""

error_indexes = []
introduced_errors = []
# pre-compute the error rate for each quality score. This is the inverse of the phred score equation
quality_score_error_rate: dict[int, float] = {x: 10. ** (-x / 10) for x in quality_scores}

# The use case here would be someone running a simulation where they want no sequencing errors.
# No need to run any loops in this case.
if self.average_error == 0:
return introduced_errors
else:
i = len(quality_scores)
while len(error_indexes) < num_errors and i > 0:
index = rng.choice(list(range(len(quality_scores))))
if rng.random() < quality_score_error_rate[quality_scores[index]]:
error_indexes.append(index)
i -= 1
# Fallback: if quality scores are too high to naturally reach num_errors,
# force errors at positions with at-or-below-median quality scores.
# Using <= so that uniform quality arrays (all scores equal) always make progress.

n = len(quality_scores)
# Batched rejection sampling: draw n candidate indices and n uniform deviates in two numpy
# calls, then accept the first num_errors candidates where the deviate is below the
# quality-derived error rate. Equivalent in distribution to the per-iteration scalar loop
# but ~150x cheaper in Python overhead. Statistical caveat: this changes the order in
# which the underlying PRNG stream is consumed, so seeded runs are not bit-identical to
# the prior interleaved-draw implementation.
candidate_indices = rng.integers(n, size=n)
candidate_randoms = rng.random(size=n)
rates_at_candidates = 10.0 ** (-quality_scores[candidate_indices].astype(float) / 10.0)
accepted = candidate_indices[candidate_randoms < rates_at_candidates]
error_indexes = accepted[:num_errors].tolist()

if len(error_indexes) < num_errors:
# Fallback: if quality scores are too high to naturally reach num_errors, force errors
# at positions with at-or-below-median quality scores. Using <= so that uniform
# quality arrays (all scores equal) always make progress.
median_score = median(quality_scores)
while len(error_indexes) < num_errors:
index = rng.integers(len(quality_scores))
if quality_scores[index] <= median_score:
error_indexes.append(index)
eligible = np.flatnonzero(quality_scores <= median_score)
needed = num_errors - len(error_indexes)
error_indexes.extend(rng.choice(eligible, size=needed, replace=True).tolist())

total_indel_length = 0
# To prevent deletion collisions
Expand Down
Loading
Loading